Snapshot: data refresh

One of Gluesync’s core features is the initial load of the data already present in the tables of your source database. We call this task Snapshot.

How does it work

Snapshot is a core part of Gluesync’s Data ingestion engine. It goes through each of the declared entities and runs an initial data load, using multi-threading and parallel execution to make the job as fast as possible.

Because the Snapshot feature is a component of the Data ingestion engine, it also knows how you’d like the data to be structured in terms of data model: it applies the same transformations to the data you’re moving, sharing the data modeling functionalities used by the CDC engine.

The steps below describe the tasks that run under the hood when you ask Gluesync to perform a Snapshot job.

The process runs all declared entities in parallel threads to increase throughput and speed up the whole snapshot process.
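The parallel, per-entity execution described above can be pictured with a minimal Python sketch. It is not Gluesync's implementation: the snapshot_entity function, the max_workers value and the second entity name (MYSCHEMA.CLOSED_SEA) are hypothetical and only illustrate the idea of one task per declared entity.

```python
from concurrent.futures import ThreadPoolExecutor


def snapshot_entity(entity: str) -> str:
    # Hypothetical placeholder for the per-entity load
    # (count rows, fetch chunks, write them to the target).
    return f"{entity}: done"


def run_snapshot(entities: list[str], max_workers: int = 4) -> list[str]:
    # Submit one task per declared entity; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(snapshot_entity, entities))


results = run_snapshot(["MYSCHEMA.OPEN_SEA", "MYSCHEMA.CLOSED_SEA"])
```

Because each entity is handled by its own task, a slow table does not block the others from making progress.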

How to enable

By default, Gluesync performs a full snapshot before starting CDC replication. The copySourceEntitiesAtStartup parameter in the Gluesync configuration file enables or disables this behavior. It defaults to true.
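As an illustration of the default-to-true behavior, the sketch below reads the flag from a parsed configuration. It assumes a JSON-formatted configuration with the key at the top level, which may not match where the parameter sits in your file; only the key name copySourceEntitiesAtStartup comes from the text above, and snapshot_enabled is a hypothetical helper.

```python
import json


def snapshot_enabled(config: dict) -> bool:
    # A missing key means the snapshot runs, mirroring the documented default of true.
    return bool(config.get("copySourceEntitiesAtStartup", True))


# Hypothetical configuration fragments, assumed here to be JSON.
explicit_off = snapshot_enabled(json.loads('{"copySourceEntitiesAtStartup": false}'))
default_on = snapshot_enabled(json.loads("{}"))
```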

Step 1 - Rows count

TableMigrationComponentSetup INFO - Begin migrating (with full table query) of 1234 rows for entity MYSCHEMA.OPEN_SEA

This tells you how many rows have been discovered in your source table that match the criteria of the declared data modeling task, if any. Without such criteria, the count is the same as the number of rows returned by a SELECT * statement.
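The counting step can be sketched as a COUNT query with an optional filter. The example below uses an in-memory SQLite database purely for illustration; the count_rows helper, the OPEN_SEA columns and the filter expression are hypothetical, and the actual query Gluesync issues is not shown in this document.

```python
import sqlite3
from typing import Optional


def count_rows(conn: sqlite3.Connection, table: str, where: Optional[str] = None) -> int:
    # Table name and filter are assumed to come from trusted configuration, not user input.
    query = f"SELECT COUNT(*) FROM {table}"
    if where:
        query += f" WHERE {where}"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OPEN_SEA (ID INTEGER PRIMARY KEY, NAME TEXT)")
conn.executemany("INSERT INTO OPEN_SEA (NAME) VALUES (?)", [("a",), ("b",), ("c",)])

total = count_rows(conn, "OPEN_SEA")             # no criteria: every row counts
filtered = count_rows(conn, "OPEN_SEA", "ID > 1")  # criteria narrow the count
```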

Step 2 - Begin migrating X amount of data

Right after the output above, Gluesync starts fetching a fixed number of rows (or documents) from each source entity. You can tune this number X by setting the maxMigrationItemsCountPerIteration key to a numeric value higher than 1. This lets Gluesync size the window of each fetch operation to match the desired number of rows/documents.

Data is then taken from the source and moved to the target in chunks, each holding at most the number of items declared in maxMigrationItemsCountPerIteration, which defaults to 1000.
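A small sketch of how a fetch window could be derived from the row count may help here. It assumes each window holds at most maxMigrationItemsCountPerIteration items and starts from row 0; the chunk_windows function and its (offset, size) output are hypothetical, not Gluesync internals.

```python
from typing import Iterator, Tuple


def chunk_windows(total_rows: int, max_items: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) windows that cover total_rows, each at most max_items long."""
    if max_items <= 1:
        # The text above asks for a numeric value higher than 1.
        raise ValueError("max_items must be higher than 1")
    offset = 0
    while offset < total_rows:
        size = min(max_items, total_rows - offset)
        yield offset, size
        offset += size


# 1234 rows with a window of 500 items: two full chunks and a shorter last one.
windows = list(chunk_windows(1234, 500))
```

A larger window means fewer round trips to the source, while a smaller one keeps each individual query lighter.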

TableMigrationComponentSetup INFO - Start migration from row 0 for MYSCHEMA.OPEN_SEA

As soon as a migration task is scheduled, Gluesync stores an entry in its checkpoint table at the source database level: it performs an INSERT into the GLUESYNC_MIGRATION_CHECKPOINT table.

The maxMigrationItemsCountPerIteration parameter is useful either to speed up the snapshot process or to limit the load placed on the source database.

Step 3 - Migration advancement

To help you monitor and keep track of the migration process, Gluesync logs the progress made for each fetched chunk stored for the respective entity.

Each chunk, once confirmed as successfully stored on the target, produces a log entry like the example below, plus an UPDATE command on the entity’s row in the GLUESYNC_MIGRATION_CHECKPOINT table.

TableMigrationComponentSetup INFO - Completed 50.00% in 2s for MYSCHEMA.OPEN_SEA
TableMigrationComponentSetup DEBUG - Saved checkpoint 567 for entity MYSCHEMA.OPEN_SEA

As the example shows, the log reports both the percentage (%) of records completed and the number of seconds it took to do so.
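If you want to track progress programmatically, a progress line in the format shown above can be parsed with a regular expression. This is a sketch based only on the sample lines in this section: it assumes the elapsed time is always printed as whole seconds followed by "s", and parse_progress is a hypothetical helper rather than part of Gluesync.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:
# TableMigrationComponentSetup INFO - Completed 50.00% in 2s for MYSCHEMA.OPEN_SEA
PROGRESS = re.compile(
    r"INFO - Completed (?P<percent>\d+(?:\.\d+)?)% in (?P<seconds>\d+)s for (?P<entity>\S+)"
)


def parse_progress(line: str) -> Optional[Tuple[str, float, int]]:
    # Return (entity, percent, seconds), or None when the line is not a progress entry.
    match = PROGRESS.search(line)
    if match is None:
        return None
    return match["entity"], float(match["percent"]), int(match["seconds"])


progress = parse_progress(
    "TableMigrationComponentSetup INFO - Completed 50.00% in 2s for MYSCHEMA.OPEN_SEA"
)
not_progress = parse_progress(
    "TableMigrationComponentSetup DEBUG - Saved checkpoint 567 for entity MYSCHEMA.OPEN_SEA"
)
```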

Step 4 - Completion

When a snapshot task for an entity comes to an end, Gluesync acknowledges its completion via a console log output and then updates its status in two mandatory steps:

it updates the GLUESYNC_MIGRATION_CHECKPOINT table entry belonging to the migrated entity by advancing the total row counter to the final count reached;
it marks the GLUESYNC_STATE_PRESERVATION table entry belonging to the migrated entity by setting the entry to 0. This triggers the CDC engine to start working on that entity and enables any subsequent CDC activity.
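The two completion steps above can be sketched against an in-memory SQLite database. Only the two table names come from this document: the column names (ENTITY, ROW_COUNT, STATE_VALUE), the initial values and the complete_snapshot helper are hypothetical, so treat this as an illustration of the sequence rather than the actual schema.

```python
import sqlite3

# Hypothetical schemas; the real Gluesync tables may look different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GLUESYNC_MIGRATION_CHECKPOINT (ENTITY TEXT PRIMARY KEY, ROW_COUNT INTEGER)"
)
conn.execute(
    "CREATE TABLE GLUESYNC_STATE_PRESERVATION (ENTITY TEXT PRIMARY KEY, STATE_VALUE INTEGER)"
)
conn.execute("INSERT INTO GLUESYNC_MIGRATION_CHECKPOINT VALUES ('MYSCHEMA.OPEN_SEA', 567)")
conn.execute("INSERT INTO GLUESYNC_STATE_PRESERVATION VALUES ('MYSCHEMA.OPEN_SEA', NULL)")


def complete_snapshot(conn: sqlite3.Connection, entity: str, total_rows: int) -> None:
    # Commit both updates together; the context manager rolls back on error.
    with conn:
        # Step 1: advance the checkpoint counter to the final row count.
        conn.execute(
            "UPDATE GLUESYNC_MIGRATION_CHECKPOINT SET ROW_COUNT = ? WHERE ENTITY = ?",
            (total_rows, entity),
        )
        # Step 2: set the state entry to 0 so CDC can start on this entity.
        conn.execute(
            "UPDATE GLUESYNC_STATE_PRESERVATION SET STATE_VALUE = 0 WHERE ENTITY = ?",
            (entity,),
        )


complete_snapshot(conn, "MYSCHEMA.OPEN_SEA", 1234)
```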

The console output of a completed task will look like this:

TableMigrationComponentSetup INFO - Completed 100.00% in 4s for MYSCHEMA.OPEN_SEA

In this example, Gluesync reports that it has reached the end of the snapshot task for the given entity with Completed 100.00%, along with the total time the whole snapshot task took for that entity.

Administration

Over time, you may want to perform a full data refresh again or add new tables to your configuration. A full guide to administering this feature can be found in Administering snapshot tasks.