Converting TEXT columns to JSONB in PostgreSQL

Recently I had a text field being used to store JSON data from my web application. I hadn’t yet discovered the amazingness of the PostgreSQL JSONB datatype.

JSONB is an amazing datatype because Postgres understands how to query its structure, and the keys and values can be indexed.

So if you have a text field you want to convert to JSONB (provided the existing data is already properly formed JSON), you can change the datatype and convert the existing data with one simple SQL command:

ALTER TABLE tbl_name ALTER COLUMN col_name TYPE jsonb USING col_name::json;

You may be inclined to convert the text using the to_json function, like ‘USING to_json(col_name)’ – BUT DON’T DO THAT! What you get is a single JSON string wrapping your existing JSON text, effectively a double-encoding bug. The cast shown above is the correct approach.
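
Once the column is JSONB you can query inside the structure and index the keys. A minimal sketch, using the same tbl_name and col_name as above and a hypothetical key called status:

-- Pull a key out of the document, filtering with a containment match
SELECT col_name->>'status' AS status
FROM tbl_name
WHERE col_name @> '{"status": "active"}';

-- A GIN index makes containment queries like the one above fast
CREATE INDEX idx_tbl_name_col_name ON tbl_name USING gin (col_name);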

How to Clone or Duplicate a PostgreSQL Database

Sometimes you may find yourself needing to duplicate a Postgres database exactly, complete with schema and data. Sometimes I need to do this because I want to try out some ideas on an existing database without the hassle of having to back up and restore, or write rollbacks for the changes I want to make.

Luckily, it’s super easy to do this. First ensure that there are no active connections to the source database, then open the SQL terminal of your choice and execute:

CREATE DATABASE newdb WITH TEMPLATE originaldb OWNER dbuser;

This will create a new database by using the source database as a template.

If you get the message “ERROR: Database being accessed by other users”, don’t worry; it just means that there are still open database connections, and these will need to be closed before it will work.
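
If you need to kick those connections yourself, you can terminate them from another session first. A sketch, assuming the source database is called originaldb (on very old Postgres versions the pid column was named procpid):

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'originaldb'
  AND pid <> pg_backend_pid();

Once that returns, the CREATE DATABASE command above should go through.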

SerializationException Running Npgsql Commands on .Net Entity Framework

Recently, while working on a project for a client, I was writing a .Net application using Entity Framework backed onto a PostgreSQL database. All of a sudden I got a System.Runtime.Serialization.SerializationException when trying to run my app, and the stack trace (weirdly) didn’t help.

It turns out that while I had added the Npgsql dependency to the project, that wasn’t enough. Entity Framework, it seems, needs access to the Npgsql.dll assembly information. When Entity Framework tries to use the library, it won’t find it (despite it being in the project) unless it is registered in the Global Assembly Cache (GAC).

The Global Assembly Cache (GAC) is a folder in the Windows directory that stores .NET assemblies specifically designated to be shared by all applications running on the system; registering an assembly in the GAC is what makes it shareable between applications on the machine.

The best way to resolve this is to use the Npgsql installer that matches the version in your project, from https://github.com/npgsql/npgsql/releases. Those setups take care of registering Npgsql in the GAC and set up the machine.config file to include the Npgsql DB provider factory.
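
For reference, the provider factory entry the installer adds to machine.config looks roughly like the snippet below; the Version and PublicKeyToken shown are placeholders and must match the Npgsql.dll you actually installed, so treat this as a sketch rather than something to paste verbatim:

<system.data>
  <DbProviderFactories>
    <!-- invariant "Npgsql" is the name Entity Framework uses to look the factory up;
         replace Version and PublicKeyToken with the values from your installed assembly -->
    <add name="Npgsql Data Provider"
         invariant="Npgsql"
         description=".Net Data Provider for PostgreSQL"
         type="Npgsql.NpgsqlFactory, Npgsql, Version=x.x.x.x, Culture=neutral, PublicKeyToken=..." />
  </DbProviderFactories>
</system.data>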

After I did this, everything worked as it should, and all was right with the world.

PostgreSQL Performance Optimization

Recently, I’ve been dealing with databases at work which have millions, if not BILLIONS, of records, so as you can imagine, having Postgres running smoothly and as quickly as possible is of the utmost importance. Obviously faster, better, bigger hardware will make the database faster, but there are often other steps you can take to get PostgreSQL working a bit smarter and a bit harder. The first place to start with Postgres optimization is the Postgres configuration. Compiled from a number of sources, the list below provides a guide (use at your own discretion) to some of the primary settings relating to resource use. Even small tweaks can have a big impact on server performance.

max_connections. This option sets the maximum number of database back end processes to have at any one time. Use this feature to ensure that you do not launch so many back ends that you begin swapping to disk and kill the performance of all the children. Depending on your application it may be better to deny the connection entirely rather than degrade the performance of all of the other children.

shared_buffers. Editing this option is the simplest way to improve the performance of your database server. shared_buffers defines a block of memory that PostgreSQL uses to hold requests that are awaiting attention from the kernel buffer and CPU. The default value is quite low for most modern hardware and any real-world workload, and needs to be increased. However, unlike databases like Oracle, more is not always better: there is a threshold above which increasing this value can hurt performance.

PLEASE NOTE: PostgreSQL relies heavily on the OS to cache data files and hence does not duplicate that file-caching effort. The shared_buffers parameter assumes the OS is going to cache a lot of files, and hence it is generally set very low compared with the system RAM. Even for a dataset in excess of 20GB, a setting of 128MB may be too much if you have only 1GB of RAM and an aggressive-at-caching OS like Linux.

Note that on Windows (and on PostgreSQL versions before 8.1), large values for shared_buffers aren’t as effective, and you may find better results keeping it relatively low (at most around 50,000, possibly less) and using the OS cache more instead.

It’s likely you will have to increase the amount of shared memory your operating system allows you to allocate at once in order to set shared_buffers to a large value. If you set it above what’s supported, you’ll get a message like this:

IpcMemoryCreate: shmget(key=5432001, size=415776768, 03600) failed: Invalid argument

This error usually means that PostgreSQL's request for a shared memory
segment exceeded your kernel's SHMMAX parameter. You can either
reduce the request size or reconfigure the kernel with larger SHMMAX.
To reduce the request size (currently 415776768 bytes), reduce
PostgreSQL's shared_buffers parameter (currently 50000) and/or
its max_connections parameter (currently 12).

effective_cache_size. This value tells PostgreSQL’s optimizer how much memory is available for caching data and helps it decide whether or not to use an index; the larger the value, the more likely an index will be used. effective_cache_size should be set to an estimate of how much memory is available for disk caching by the operating system, after taking into account what’s used by the OS itself, dedicated PostgreSQL memory, and other applications. This is a guideline for how much memory you expect to be available in the OS buffer cache, not an allocation! The value is used only by the PostgreSQL query planner to figure out whether the plans it’s considering would be expected to fit in RAM. If it’s set too low, indexes may not be used for executing queries the way you’d expect.

Setting effective_cache_size to 1/2 of total memory would be a normal conservative setting, and 3/4 of memory is a more aggressive but still reasonable amount. You might find a better estimate by looking at your operating system’s statistics: on UNIX-like systems, add the free+cached numbers from free or top; on Windows, see the “System Cache” size in the Windows Task Manager’s Performance tab. Changing this setting does not require restarting the database (a reload/HUP is enough).

work_mem. This option controls the amount of memory used in sort operations and hash tables. While you may need to increase it if you do a lot of sorting in your application, care needs to be taken: this isn’t a system-wide parameter, but a per-operation one. So if a complex query has several sort operations in it, it will use multiple work_mem units of memory, and multiple backends could be doing this at once. A value that is too large can easily push your database server into swapping. This option was previously called sort_mem in older versions of PostgreSQL.
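
Rather than raising it server-wide, it is often safer to bump it just for the session that actually needs the big sort. A minimal sketch (the value here is arbitrary):

SET work_mem = '64MB';   -- only affects the current session
-- ... run the sort-heavy query here ...
RESET work_mem;          -- back to the configured default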

max_fsm_pages. This option helps control the free space map. When something is deleted from a table it isn’t removed from the disk immediately; it is simply marked as “free” in the free space map, and the space can then be reused by any new INSERTs on the table. If your setup has a high rate of DELETEs and INSERTs it may be necessary to increase this value to avoid table bloat. It sets the maximum number of disk pages for which free space will be tracked in the shared free-space map. Adjusting it upward makes vacuum runs faster and eliminates or reduces the need to VACUUM FULL or REINDEX. It should be slightly more than the total number of data pages that will be touched by updates and deletes between vacuums. It requires little memory (around 6 bytes per slot), so be generous when sizing it. When running VACUUM with the VERBOSE option, the DB engine advises you about the proper size.

fsync. This option determines whether all your WAL pages are fsync()’ed to disk before a transaction is committed. Having this on is safer, but can reduce write performance. If fsync is not enabled there is a chance of unrecoverable data corruption, so turn it off at your own risk.

commit_delay and commit_siblings. These options are used in concert to improve performance by writing out multiple committing transactions at once. If at least commit_siblings backends are active at the instant your transaction is committing, the server waits commit_delay microseconds so that it can try to commit several transactions together.

random_page_cost. Sets the estimated cost of fetching a page non-sequentially. Lower it to encourage the optimizer to choose index scans over sequential table scans.
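
Purely as an illustration of where these settings live, a postgresql.conf excerpt for a dedicated server with a few gigabytes of RAM might look something like the lines below. The numbers are examples only and depend entirely on your hardware, workload and PostgreSQL version (older versions express shared_buffers as a count of 8kB pages rather than a size), so do not copy them blindly:

max_connections = 100
shared_buffers = 256MB
effective_cache_size = 2GB
work_mem = 16MB
random_page_cost = 3.0
fsync = on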

Note that many of these options consume shared memory and it will probably be necessary to increase the amount of shared memory allowed on your system to get the most out of these options.

If you are after a more comprehensive guide to Postgres tuning and performance, the PostgreSQL wiki has a great page on the subject.

The other place that often gets overlooked for performance enhancement is the actual database queries themselves. I must admit my own ignorance of Postgres’ EXPLAIN ANALYSE keywords, which, placed before any SQL statement, return a very comprehensive trace of the query through the database engine, including specific timings, index use and so on. This can be a big eye opener for tables, sorts or indexes which may be being used incorrectly, or are just slow. Here is an example of EXPLAIN ANALYSE on a SQL statement on a very large database:

EXPLAIN ANALYSE SELECT items.etag, subscriptions.subscription_data
FROM items, subscriptions WHERE items.item_id = subscriptions.item_id;

returns the entire query plan, like:

"Hash Join  (cost=1.29..22.38 rows=50 width=64) (actual time=0.055..0.084 rows=21 loops=1)"
"  Hash Cond: (subscriptions.item_id = items.item_id)"
"  ->  Seq Scan on subscriptions  (cost=0.00..17.70 rows=770 width=36) (actual time=0.010..0.012 rows=21 loops=1)"
"  ->  Hash  (cost=1.13..1.13 rows=13 width=36) (actual time=0.027..0.027 rows=13 loops=1)"
"        ->  Seq Scan on items  (cost=0.00..1.13 rows=13 width=36) (actual time=0.008..0.014 rows=13 loops=1)"
"Total runtime: 0.154 ms"

Extracting Specific Date-time Components in Postgres

When programming, it’s sometimes useful to extract various time components from a date-time field in SQL queries. For example, you might want to filter the date field by hour, or year (or both). Fortunately, Postgres has an easy way to pull this data out within the query itself:

SELECT EXTRACT(hour FROM a_datetime_field) FROM a_table;

If the field held the value ‘4/5/2009 13:09:00’, then the above query would return “13” in the select.
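
The same trick works in a WHERE clause, so you can filter on individual components (using the same made-up table and column names as above):

SELECT *
FROM a_table
WHERE EXTRACT(year FROM a_datetime_field) = 2009
  AND EXTRACT(hour FROM a_datetime_field) = 13;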

In a stranger but practical example, I was moving a database from one server to another and, for some unknown reason, all the dates in a table were fudged so that instead of the year being 2009 it was 142009, and the seconds were also stuffed up. The result was that any queries I ran against the table threw ADO.NET exceptions because the dates couldn’t be parsed properly. I needed to run a query like this one to set things right again:

update atable
  set datetimeadded = cast( '2009-' || EXTRACT(month FROM datetimeadded) || '-' ||
    EXTRACT(day FROM datetimeadded) || ' ' || EXTRACT(hour FROM datetimeadded) || ':' ||
    EXTRACT(minute FROM datetimeadded) as timestamp),
  datetimeupdated =  cast('2009-' || EXTRACT(month FROM datetimeupdated) || '-' ||
    EXTRACT(day FROM datetimeupdated) || ' ' || EXTRACT(hour FROM datetimeupdated) || ':' ||
    EXTRACT(minute FROM datetimeupdated) as timestamp);

The casts are needed to get the update to cooperate, but basically this recreates the date time piece-by-piece based on the original date-time value.

How to Back Up PostgreSQL Databases with the pg_dump Command

Recently I had to do a lot of PostgreSQL database administration, as I needed to move several databases onto a production server. PostgreSQL is one of the most robust open source database servers available and, for my money, faster and generally better than MySQL. Like the MySQL database server, it provides utilities for creating backups.

Back up a database using the pg_dump command. pg_dump is a utility for backing up a PostgreSQL database; it dumps only one database at a time.

$ pg_dump dbname | gzip -c > dbname.dump.gz

Another option is to use the pg_dumpall command. As the name suggests, it dumps (backs up) every database, and preserves cluster-wide data such as users and groups. You can use it as follows:

$ pg_dumpall | gzip -c > all.dump.gz
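
Restoring either dump later is just a matter of feeding the uncompressed SQL back into psql. A sketch (the database names are placeholders, and the single-database target must already exist):

$ gunzip -c dbname.dump.gz | psql dbname
$ gunzip -c all.dump.gz | psql postgres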


Connecting to Postgres Through an SSH Connection

Production level servers can be tricky, mostly because of the inherent, increased security needed to protect online (and exposed) assets. Our Postgres server is one such asset.

But opening the necessary ports to allow access to the database server (for maintenance) can expose the server to possible attack.  Fortunately, if your Postgres server is on a UNIX or Linux based machine with SSH installed, you can forward the Postgres port through the SSH tunnel.

ssh -L 1234:localhost:5432 username@server.dns.addressorip.com

In PGAdmin, simply connect to localhost:1234, and you will connect to the production server through the SSH connection without the need to open a port on the external firewall.
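
The same tunnel works for psql if you prefer the command line (the user and database names here are placeholders):

$ psql -h localhost -p 1234 -U username dbname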

Moving the Default PostgreSQL Data Directory (Windows)

PostgreSQL for Windows installs the PGDATA directory by default into “C:\<Program Files>\PostgreSQL\some version\data”. While this is usually fine for development and even some production environments, it can be somewhat limiting given the typically large amount of disk activity on (what is normally) the system volume.

This step-by-step post aims to explain the fastest way to change the default PGDATA directory to another location.

Step 1:

Close all applications that are currently connected to your database, then go to Windows Services Management and stop the PostgreSQL service. Naturally, the fastest way to get to Windows Services Management is by right-clicking on My Computer -> Manage, and then clicking on the Services node in the Computer Management tree.

Once the Postgres service is stopped, leave the Computer Management Console open (we’ll need it again in a second).

Step 2:

Copy the “C:\<Program Files>\PostgreSQL\some version\data” directory to the new location.  In my own attempts, moves do not seem to work, but the redundant disk usage should be pretty minimal and at least you will have an old backup in the default location if something goes wrong.
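If you prefer doing the copy from a command prompt, robocopy (built in from Vista onwards) can bring the permissions across at the same time. The paths below are examples only; adjust them to your install and target drive, and run from an elevated prompt:

robocopy "C:\Program Files\PostgreSQL\8.4\data" "D:\PostgresData" /E /COPYALL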

Step 3 (Very Important):

Right-click on the new “data” directory, click Properties, and select the Security tab. Give the local postgres user (or, for advanced users, whichever account Postgres runs under) full permission on that directory and all child objects (sometimes you need to reset the child permission inheritance). Click OK on the Properties window.

Step 4:

Open regedit.exe by clicking on the Start button and typing “regedit” in either the search box or the Start -> Run box (depending on whether you’re using Vista or XP), and navigate to the “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\pgsql-some version” branch. Double click on “ImagePath” and change the directory after the “-D” option to your new location. If the path to your new location contains spaces, enclose it in quotes.
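
For illustration only, the ImagePath value typically looks something like the line below; the exact executable path, service name and flags vary by PostgreSQL version and installer, and only the argument after -D needs to change:

"C:\Program Files\PostgreSQL\8.4\bin\pg_ctl.exe" runservice -N "pgsql-8.4" -D "D:\PostgresData"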

Step 5:

Close regedit, go back to Computer Management and restart the Postgres service, then close Computer Management.

After completing these steps, and assuming something hasn’t gone wrong, you should now have Postgres running from the new location.

(Reference: http://wiki.postgresql.org/wiki/Change_the_default_PGDATA_directory_on_Windows)