Static Azure Cosmos DB CDC with Gluesync: Setup & Configuration

Source data from Azure Cosmos DB

Prerequisites

To use Gluesync with your Azure Cosmos DB instance as a source connector, you will need:

  • a valid URL and an Azure Cosmos DB service key;

  • network connectivity: the Gluesync Azure Cosmos DB agent connects to the provided URL on port 443 over HTTPS.

Setup via Web UI

  • Hostname / IP Address: Hostname of your Azure Cosmos DB service;

  • Port: Optional, defaults to 443;

  • Database name: Name of your source database.

Custom host credentials

These can also be set via the REST API by specifying the customHostCredentials parameter.

  • Azure Key: The key of your Azure Cosmos DB service;

  • Container: The name of the container that works as a source;

  • Lease: The name of the lease container created for the Change Feed Processor.

Specific configuration

This agent has no specific configuration properties.

Setup via Rest APIs

The following example calls the CoreHub’s REST API via curl to set up the connection for this agent.

Connect the agent

curl --location --request PUT 'http://core-hub-ip-address:1717/pipelines/{pipelineId}/agents/{agentId}/config/credentials' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "hostCredentials": {
        "connectionName": "myAgentNickName",
        "host": "host-address",
        "password": "your azure secret key",
        "port": 443
    },
    "customHostCredentials": {
        "lease": "lease_container_name"
    }
}'
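For scripted setups, the same request can be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names (build_credentials_payload, build_request) and the placeholder values are illustrative assumptions, not part of Gluesync.

```python
import json
import urllib.request


def build_credentials_payload(host, azure_key, lease, port=443,
                              connection_name="myAgentNickName"):
    """Build the JSON body used in the curl example (hypothetical helper)."""
    return {
        "hostCredentials": {
            "connectionName": connection_name,
            "host": host,
            "password": azure_key,  # the Azure Cosmos DB service key
            "port": port,
        },
        "customHostCredentials": {
            "lease": lease,  # name of the lease container
        },
    }


def build_request(core_hub_url, pipeline_id, agent_id, token, payload):
    """Prepare (but do not send) the PUT request to the Core Hub."""
    url = (f"{core_hub_url}/pipelines/{pipeline_id}"
           f"/agents/{agent_id}/config/credentials")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="PUT",
    )


if __name__ == "__main__":
    payload = build_credentials_payload("host-address", "your azure secret key",
                                        "lease_container_name")
    req = build_request("http://core-hub-ip-address:1717", "PIPELINE_ID",
                        "AGENT_ID", "YOUR_TOKEN", payload)
    # Sending is left commented out so the sketch can be inspected safely:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
    print(req.get_method(), req.full_url)
```

Keeping payload construction separate from sending makes it easy to review the body before it reaches the Core Hub.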

Setup

Setting up Gluesync’s CosmosDB agent

First, set up the lease container in the Azure Cosmos DB database. Cosmos DB uses the lease container to store the state of the Change Feed Processor. A lease container can be created either manually or dynamically; Gluesync does not currently create it dynamically, so the following example shows how to create it manually with the Azure CLI.

az cosmosdb sql container create -g development --account-name <account_name> --database-name <db_name> --name <lease_name> --partition-key-path /id

The lease container also coordinates the processing of the change feed across multiple workers. It can be stored in the same account as the monitored container or in a separate account.

The monitored container, in contrast, holds the data from which the change feed is generated. Every insert and update on the monitored container is reflected in that container's change feed.

For now, only soft (logical) delete operations are supported; see "soft-delete" in the Delete Event section below. Microsoft offers a public preview of this functionality for Cosmos DB, but it is not intended for production use, so Gluesync does not currently support it. Once the feature is launched and made publicly available to every customer, we will add support for it in an updated version of our Azure Cosmos DB Gluesync agent.

Lease container partition key

When setting up the lease container, you are also asked to specify its partition key. By default the partition key is id. In Cosmos DB the path is written /id; the leading / is omitted when configuring Gluesync and added automatically by the engine.

If you need a different partition key, override the default by explicitly setting it under the Advanced settings tab of your CosmosDB agent connection settings.

The screenshot below shows an example:

Setting a custom lease container partition key
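The relationship between the two notations (id in Gluesync, /id in Cosmos DB) can be sketched as a pair of small helpers. Both function names are hypothetical and only illustrate the convention described above; they are not Gluesync's actual implementation.

```python
def to_cosmos_partition_key_path(key: str) -> str:
    """Return the Cosmos DB path form ("/id") for a key such as "id"."""
    key = key.strip()
    if not key or key == "/":
        raise ValueError("partition key must not be empty")
    # Add the leading slash only when it is missing.
    return key if key.startswith("/") else "/" + key


def to_gluesync_partition_key(path: str) -> str:
    """Return the form without the leading slash, as entered in the agent settings."""
    return to_cosmos_partition_key_path(path)[1:]
```

For example, a lease container created with --partition-key-path /tenantId would be entered as tenantId in the Advanced settings tab.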

Delete Event

To simulate a delete operation on the monitored container, Gluesync relies on a "soft-delete" flag within the document instead of an actual deletion. The operation is therefore an update that sets a field, whose name is customizable, to true.

Soft-deleted documents can then be removed from the monitored container by an Azure Function triggered by Cosmos DB, or by any other supported mechanism such as the Cosmos DB TTL. A TTL can be applied at the container or at the item level, and Cosmos DB automatically removes the items once the defined period has elapsed.

We suggest setting the TTL to 5 days. Bear in mind that even with a TTL applied, Gluesync still needs the soft-delete flag in order to delete the items efficiently from the target.
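As a minimal sketch of the soft-delete update, the helper below flags a document and attaches an item-level TTL of 5 days. The function name mark_soft_deleted and the default field name deleted are assumptions for illustration; ttl is the Cosmos DB item property for time to live, expressed in seconds.

```python
import copy

SECONDS_PER_DAY = 24 * 60 * 60


def mark_soft_deleted(doc: dict, delete_field: str = "deleted",
                      ttl_days: int = 5) -> dict:
    """Return a copy of doc flagged as deleted, with an item-level TTL in seconds."""
    updated = copy.deepcopy(doc)      # leave the caller's document untouched
    updated[delete_field] = True      # the soft-delete flag read by Gluesync
    updated["ttl"] = ttl_days * SECONDS_PER_DAY
    return updated
```

The returned document would then be written back to the monitored container as a regular update (for instance with upsert_item from the azure-cosmos SDK). Note that Cosmos DB only honors an item-level ttl when time to live is enabled on the container.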

An example of an Azure Function that deletes the documents is provided below:

from azure.cosmos import CosmosClient
from typing import Any
import azure.functions as func
import logging
import os

DBNAME = "demo"
CONTAINER = "public"
LEASE_CONTAINER = "lease"
CONNECTION = "molo17_DOCUMENTDB"
DELETE_FIELD = "deleted"

app = func.FunctionApp()
client = None

@app.cosmos_db_trigger(arg_name="azcosmosdb", container_name=CONTAINER,
                       lease_container_name=LEASE_CONTAINER, lease_database_name=DBNAME,
                       database_name=DBNAME, connection=CONNECTION, create_lease_container_if_not_exists=False)
def cosmosdb_trigger(azcosmosdb: func.DocumentList):
    logging.info("start processing document changes..")
    for doc in azcosmosdb:
        # Only documents carrying the soft-delete flag are physically removed.
        if doc.get(DELETE_FIELD) is True:
            # .get() avoids a KeyError when an optional field is missing.
            logging.info(f"deleting document pkey: {doc.get('id')} ID: {doc.get('ID')} type: {doc.get('type')} and scope: {doc.get('scope')}")
            delete_document(doc)

def delete_document(doc: Any):
    try:
        cosmosdb_client = get_cosmosdb_client()
        db = cosmosdb_client.get_database_client(DBNAME)
        container = db.get_container_client(CONTAINER)
        # Assumes the monitored container is partitioned on /id.
        container.delete_item(doc["id"], partition_key=doc["id"])
    except Exception as e:
        logging.error(f"Error while deleting document: {e}")

def get_cosmosdb_client() -> CosmosClient:
    global client
    if client is not None:
        return client

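    # The app setting named by CONNECTION holds the Cosmos DB connection string.
    conn_str = os.environ.get(CONNECTION, "")
    if not conn_str:
        raise Exception(f"invalid connection string, check {CONNECTION}")

    # Let the SDK parse the string instead of slicing it at fixed offsets.
    client = CosmosClient.from_connection_string(conn_str)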
    return client

To learn more about the Azure Cosmos DB change feed processor, refer to the Microsoft Learn documentation.

For further details, please refer to the official Azure Cosmos DB documentation at this link: learn.microsoft.com/en-us/azure/cosmos-db/introduction/