PostgreSQL CDC: WAL-based Setup with Gluesync

Source data from PostgreSQL

Prerequisites

To have Gluesync working on your PostgreSQL instance you will need to have:

Valid user credentials with permissions to read tables, schema and replication slots from the source database.
WAL2Json plugin installed on your PostgreSQL instance.
Role with replication privileges.
WAL level set to logical.
Tables with either REPLICA IDENTITY FULL or REPLICA IDENTITY DEFAULT set (see Turning on replication to your tables).

To create a valid user for Gluesync on your PostgreSQL database with the minimum required privileges, you can run/adapt the following query:

CREATE USER gluesync WITH REPLICATION LOGIN PASSWORD 'youdecide';
GRANT USAGE ON SCHEMA public TO gluesync;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO gluesync;

These permissions allow the user to connect for replication and read data, which is sufficient for most CDC operations. The user does not require SUPERUSER privileges.

Given user gluesync present in the example above is not a mandatory username for Gluesync, you can define whatever user name you’d like.

Setup via Web UI

Hostname / IP Address: DNS record or IP Address of your server;
Port: Optional, defaults to 5432;
Database name: Name of your target database;
Username: Username with read from source tables;
Password: Password belonging to the given username.

Custom properties

Max polling interval: (optional, defaults to 2) Is used by Gluesync to wait for new messages incoming on the replication slot if no new message comes. The default value is 2 seconds.

Specific configuration

This agent has no specific configuration properties.

Setup via Rest APIs

Here following an example of calling the CoreHub’s Rest API via curl to setup the connection for this Agent.

Connect the agent

curl --location --request PUT 'http://core-hub-ip-address:1717/pipelines/{pipelineId}/agents/{agentId}/config/credentials' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
        "hostCredentials": {
        "connectionName": "myAgentNickName",
        "host": "host-address",
        "port": 5432,
        "databaseName": "db_name",
        "username": "",
        "password": "",
        "maxConnectionsCount": 100,
        "enableTls": true,
        "certificatePath": "/myPath/cert.pem"
      }
}'

Setup steps

Enabling CDC on the source database

First of all, to enable CDC you will need to set the wal_level param to logical. To learn more about what wal, also known as write ahead logs, is you can refer to PostgreSQL’s official documentation by following this link.

After having configured the proper wal_level the other step will be to set the number of desired replication slots that are allowed to run on your PostgreSQL environment. From the docs, max_replication_slots, is the maximum number of simultaneously defined replication slots available for Gluesync to work correctly with CDC.

the minimum replications slots number should be equal to the number of entities in the Gluesync config.json. The value that is already set in PostgreSQL by default is 10.

The last step will be changing max_wal_senders which is the maximum number of simultaneously running WAL sender processes available for Gluesync to work correctly with PostgreSQL CDC. The minimum wal senders number should be equal to the number of replication slots set, by the way, we suggest setting this param slightly higher than the number of replication slots so that the database can better handle orphan connections. The value that is already set in PostgreSQL by default is 10.

There are two ways to achieve that and we have documented the needed steps in the next 2 chapters.

Via SQL commands

To set the wal_level param to logical via SQL command you have to issue the following statement:

alter SYSTEM set wal_level = logical;

Then, after having successfully applied that you can issue the following statement to set the number of desired replication slots:

alter SYSTEM set max_replication_slots = <number_of_desired_replication_slots>;

If this parameter is changed in the future setting a smaller value to the already active replication slots will cause the database not to start!

Lastly, it’s time to change the max_wal_senders value, to do that just hit the following:

alter SYSTEM set max_wal_senders = <number_of_desired_wal_senders>;

Via editing postgresql.conf file

To set the wal_level param to logical through editing the postgresql.conf file you are required to go inside your PostgreSQL directory, with a user that has read/write rights, look for the file named postgresql.conf and open it.

There inside you’ll find a param called wal_level where its given value needs to be changed to logical, if the line is commented please recommend it to let the configuration file parser understand your change.

You can learn more about modifying the postgresql.conf file by following this link;

After having applied that first change please set also max_replication_slots and max_wal_senders params in the same way.

If this parameter is changed in the future setting a smaller value to the already active replication slots will cause the database not to start!

Allowing connections to replica streams

After having applied the above-mentioned changes to your PostgreSQL configuration we need to allow Gluesync to connect over your instance’s replica streams. Gluesync requires connecting with replication privileges to a replication stream. You will need to modify the values in your pg_hba.conf, if not already set, like this:

local   replication   all                   trust
host    replication   all   127.0.0.1/32    md5
host    replication   all   ::1/128         md5

This will allow replication connections from localhost, by a user with the replication privilege.

The IP Address of the first host replication can differ based on your server configuration and how Gluesync will connect to your server. In the example case, Gluesync is running on the same machine of PostgreSQL database which is usually not recommended for a production environment.

Installing Wal2Json

Gluesync requires Wal2Json to work correctly with Logical Replication. Wal2Jon is a plugin for proving logical decoding as output steam to Gluesync that requires it to receive changed tables data in a readable format. This plugin can be installed as a last step, right after the CDC has been enabled.

Depending on your OS and PostgreSQL version you have to install the plugin by issuing one of these commands:

In Red Hat/CentOS:

sudo (if needed) yum install wal2jsonXX

In Debian/Ubuntu:

sudo (if needed) apt-get install postgresql-XX-wal2json

Where XX is one of the supported versions of PostgreSQL: 16, 15, 14, 13, 12, 11 or 10. Depending on your PostgreSQL version you have to replace XX chars with the correct version.

To learn more about Wal2Json you can refer to Wal2Json’s official documentation by following this link.

Turning on replication to your tables

To enable replication to your tables you have to issue the following statement:

ALTER TABLE your_table_name REPLICA IDENTITY DEFAULT;

OR you can run the following statement to enable replication to your table and include the old values of the rows that have been changed:

ALTER TABLE your_table_name REPLICA IDENTITY FULL;

The first option is the default one and it will enable replication to your table, the second option will enable replication to your table and it will also include the old values of the rows that have been changed (see Working with Before & After images).

repeat the same operation for every table you want to replicate as only SUPERUSER or Table owner can perform this operation.

Working with Before & After images

Before & after images are a feature that allows Gluesync to track the changes that have occurred in your database and comparing them with their previous values. This enables Gluesync to apply only the changes that have occurred, saving bandwidth and processing time.

In PostgreSQL this feature can be enabled by following the instructions here below:

ALTER TABLE your_table_name REPLICA IDENTITY FULL;

After having enabled the feature Gluesync will automatically start using it.

Troubleshooting

Verifying replication slots: If you want to verify that the replication slots are correctly set up you can run the following query:

-- Create a replication slot
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'wal2json');

-- Get changes from a replication slot
SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);

Verifying wal2json plugin: If you want to verify that the wal2json plugin is correctly installed you can run the following query:

find /usr -name "wal2json.so" 2>/dev/null

Counting replication slots count: If you are in need to perform a check about how many replication slots are currently in use within your instance you can run the following query:

select * from pg_replication_slots;