Apache HBase CDC with Gluesync: Setup & Configuration

Source data from HBase

Prerequisites

To use Gluesync with your HBase instance you need:

valid user credentials with permission to read and write to the source database

Setup via Web UI

Hostname / IP Address: DNS SRV record of your HBase cluster or IP address of one of the nodes (the other nodes are then discovered automatically);
Port: Optional, defaults to 60000;
Username: Username of an account with read and write access to the database;
Password: Password belonging to the given username.

Custom properties

columnInfoSeparator: (optional, defaults to :) The user-defined separator placed between the column family and the column name;
maxRecoveryRetry: (optional, defaults to 3) The maximum number of retries Gluesync attempts before closing the connection to the ZooKeeper services on the HBase side.
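
As an illustration of what columnInfoSeparator controls, here is a minimal Python sketch of splitting a column identifier into its family and column parts. The function name split_column is hypothetical and is not part of Gluesync; splitting on the first occurrence of the separator is an assumption made for this example.

```python
def split_column(column_id: str, separator: str = ":") -> tuple[str, str]:
    """Split 'family<separator>column' into (family, column).

    Hypothetical helper for illustration only, not a Gluesync API.
    Assumption: the first occurrence of the separator marks the boundary.
    """
    family, found, column = column_id.partition(separator)
    if not found or not family or not column:
        raise ValueError(
            f"{column_id!r} is not in the form family{separator}column"
        )
    return family, column


# Default separator, as in the customers example of this page.
print(split_column("c:name"))
# A custom separator, as it could be set through columnInfoSeparator.
print(split_column("c|name", separator="|"))
```

With a custom separator, the column identifiers used in the source entity mapping would need to use the same character.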

Setup via REST APIs

Documentation available soon.

Clustering

When moving large datasets, and especially when a big data store like HBase is involved, Gluesync needs to be deployed as a cluster.

Clustering Gluesync means that multiple nodes consume either the same table or multiple tables simultaneously, depending on your specific configuration and the data density per table. This significantly increases throughput compared to a single-instance configuration.

Clustering configuration is currently available through our professional services; it will soon be generally available (GA) and manageable by end users.

Source entities in HBase

Source entities in HBase differ slightly from those of the other supported datastores: data in HBase is organized in columns grouped by families, and a queried column is identified by its family and column name joined by a user-defined separator.

The example below shows a customers table with a column family c and : as the separator.

{
  ...
  "sourceEntities": {
      "customers": {
        "mapping": {
          "c:name": "name",
          "c:surname": "surname",
          "c:address": "address",
          "c:gender": "gender",
          "c:phone": "phone",
          "c:email": "email"
        }
      },
    ...
  }
  ...
}

The example above uses the mapping feature available in Gluesync to tell the engine that, when it reads the column c:name (family c, column name) from the customers table, it has to emit that field under the key name when building the JSON output.
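
The effect of such a mapping can be sketched in Python as follows. This is only an illustration of the renaming step, not Gluesync's implementation: the function apply_mapping and the sample row values are hypothetical, and dropping columns that have no mapping entry is an assumption made for this example.

```python
import json

# Mapping copied from the customers example above.
MAPPING = {
    "c:name": "name",
    "c:surname": "surname",
    "c:address": "address",
    "c:gender": "gender",
    "c:phone": "phone",
    "c:email": "email",
}


def apply_mapping(row: dict, mapping: dict) -> dict:
    """Rename 'family:column' keys to their mapped JSON keys.

    Hypothetical helper. Assumption: columns missing from the row are
    skipped and columns without a mapping entry are left out.
    """
    return {
        target: row[source]
        for source, target in mapping.items()
        if source in row
    }


# Sample row with made-up values, keyed by family:column.
row = {"c:name": "Ada", "c:surname": "Lovelace", "c:email": "ada@example.com"}
print(json.dumps(apply_mapping(row, MAPPING)))
```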

Data type inference

Apache HBase stores data in binary format. This means that the engine can only guess the precise data type and length of a field (especially when it comes to dates, floating point numbers, …). To improve performance and avoid unwanted behavior when converting from binary to the destination data type, we require the user to fill in this part of the configuration to tell the engine how to handle the incoming data for a specific field.

Supported data types:

Data type
`STRING`
`INT`
`FLOAT`
`DOUBLE`
`LONG`
`BOOLEAN`
`LOCAL_DATE`
`LOCAL_DATE_TIME`
`LOCAL_TIME`
`OFFSET_DATE_TIME`
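
To illustrate why the type has to be declared rather than guessed, the Python sketch below decodes the same four bytes as two different types. It assumes big-endian, fixed-width encodings, which is a common layout for values written by HBase client code; the actual layout depends on how your data was written. The decode function and the DECODERS table are hypothetical and are not Gluesync code. The date and time types are left out because their binary layout depends entirely on the writer.

```python
import struct

# Hypothetical decoders keyed by the type names listed above.
# Assumption: numeric values are stored big-endian with a fixed width.
DECODERS = {
    "STRING": lambda raw: raw.decode("utf-8"),
    "INT": lambda raw: struct.unpack(">i", raw)[0],      # 4 bytes
    "LONG": lambda raw: struct.unpack(">q", raw)[0],     # 8 bytes
    "FLOAT": lambda raw: struct.unpack(">f", raw)[0],    # 4 bytes
    "DOUBLE": lambda raw: struct.unpack(">d", raw)[0],   # 8 bytes
    "BOOLEAN": lambda raw: raw[0] != 0,                  # 1 byte
}


def decode(raw: bytes, data_type: str):
    """Decode a raw cell value according to the declared data type."""
    if data_type not in DECODERS:
        raise ValueError(f"no decoder sketched for {data_type}")
    return DECODERS[data_type](raw)


# The same four bytes are a valid INT and a valid FLOAT.
raw = bytes([0x41, 0x20, 0x00, 0x00])
print(decode(raw, "INT"))    # 1092616192
print(decode(raw, "FLOAT"))  # 10.0
```

Nothing in the bytes themselves says which reading is the intended one, which is why the type is supplied in the configuration.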

Got Kerberos or other authentication services not covered here?

We’re working on improving the documentation, and all the possible combinations of authentication providers commonly used with HBase will be covered soon.