a robust and scalable Change Data Capture (CDC) system for synchronizing data between two PostgreSQL databases in near real-time.

mintunitish/repgine

Repgine

Repgine is a Change Data Capture (CDC) system that synchronizes data between two PostgreSQL databases in near real time. It is designed to be configurable and easy to deploy, which suits use cases such as data replication, ETL pipelines, and database migrations.

Features

  • Near Real-Time Data Sync: Captures changes on the source and applies them to the target as they occur.
  • Declarative Configuration: Define your sync jobs using a simple YAML file.
  • Schema-Aware: Automatically detects your database schema to simplify configuration.
  • Extensible Architecture: The system is composed of a CLI, a Control Plane, and an Agent, which can be scaled and customized independently.
  • LLM-Powered: Leverages Large Language Models for advanced features like conversational configuration edits and automatic schema drift handling.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

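Prerequisites

  • Python 3 with pip (for the CLI)
  • Node.js with npm (for the control plane)
  • Go (for the agent and its tests)
  • Access to a source and a target PostgreSQL database
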
Installation

  1. Clone the repository:

    git clone git@github.com:mintunitish/repgine.git
    cd repgine
  2. Install Python dependencies:

    pip install -r cli/requirements.txt
  3. Install Node.js dependencies:

    npm install --prefix control/api
    npm install --prefix control/ui
  4. Set environment variables: the init command reads the connection string of your source PostgreSQL database from the SOURCE_DSN environment variable.

    export SOURCE_DSN="postgresql://user:password@host:port/dbname"
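
Before running init, it can help to confirm that SOURCE_DSN is set and well formed. The following is a minimal sketch using only the Python standard library; the helper name read_source_dsn is hypothetical and is not part of the Repgine CLI.

```python
import os
from urllib.parse import urlsplit


def read_source_dsn(env=os.environ):
    """Return the parsed SOURCE_DSN, or raise if it is missing or malformed.

    Hypothetical helper for illustration; the real CLI may parse the DSN
    differently.
    """
    dsn = env.get("SOURCE_DSN")
    if not dsn:
        raise RuntimeError("SOURCE_DSN is not set")
    parts = urlsplit(dsn)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        # parts.port raises ValueError when the port is not numeric,
        # for example when the literal placeholder "port" is left in place.
        "port": parts.port or 5432,
        "user": parts.username,
        "dbname": parts.path.lstrip("/"),
    }
```

The password is deliberately left out of the returned dictionary so that the result is safe to log.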

Usage

The primary way to interact with the system is through the repgine CLI, invoked as python cli/main.py from the repository root.

1. Initialize Configuration

Generate a sync.yaml file by introspecting your source database.

python cli/main.py init

This will create a sync.yaml file in your project root, pre-populated with the tables and columns from your source database.
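
As a rough illustration of what introspection could produce, the sketch below groups (schema, table, column) rows into a structure shaped like the sync.yaml shown in the next step, then renders it as text. It assumes the rows come from a query against information_schema.columns; how init actually reads the schema is not documented here, and build_config and render_yaml are hypothetical names.

```python
def build_config(rows, source_dsn):
    """Group (schema, table, column) rows into a sync.yaml-shaped dict.

    Assumption: rows come from something like
      SELECT table_schema, table_name, column_name
      FROM information_schema.columns
      ORDER BY table_schema, table_name, ordinal_position
    """
    tables = {}
    for schema, table, column in rows:
        tables.setdefault(f"{schema}.{table}", []).append(column)
    return {
        "version": "1.0",
        "source_dsn": source_dsn,
        "tables": [
            {"table_name": name, "columns": cols} for name, cols in tables.items()
        ],
    }


def render_yaml(config):
    """Render the dict above as YAML text (handles only this fixed shape)."""
    lines = [
        f'version: "{config["version"]}"',
        f'source_dsn: "{config["source_dsn"]}"',
        "tables:",
    ]
    for table in config["tables"]:
        lines.append(f'  - table_name: "{table["table_name"]}"')
        lines.append("    columns:")
        lines.extend(f'      - "{col}"' for col in table["columns"])
    return "\n".join(lines) + "\n"
```

Tables appear in the order in which their first column is seen, so ordering the query keeps the generated file stable between runs.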

2. Review and Customize sync.yaml

Open the generated sync.yaml and customize it to your needs. You can select which tables to sync, rename columns, or apply transformations.

version: "1.0"
source_dsn: "postgresql://user:pass@host:port/sourcedb"
target_dsn: "postgresql://user:pass@host:port/targetdb"
tables:
  - table_name: "public.users"
    columns:
      - "id"
      - "name"
      - "email"
  - table_name: "public.orders"
    # This table will be excluded
    sync: false

3. Validate the Configuration

Check your sync.yaml file for correctness against the schema.

python cli/main.py validate sync.yaml
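
The kind of check a validator might perform can be sketched as follows: every configured table and column must exist in the source schema. This is an assumption about what "correctness against the schema" means, not a description of the actual validate command. The function name and the choice to skip tables marked sync: false are both hypothetical.

```python
def validate_config(config, schema):
    """Return a list of human-readable problems; an empty list means valid.

    `schema` maps qualified table names to the set of columns that exist.
    """
    problems = []
    for entry in config.get("tables", []):
        name = entry.get("table_name")
        if not name:
            problems.append("table entry without table_name")
            continue
        if entry.get("sync", True) is False:
            continue  # assumption: excluded tables are not checked
        if name not in schema:
            problems.append(f"{name}: table not found in source schema")
            continue
        for col in entry.get("columns", []):
            if col not in schema[name]:
                problems.append(f"{name}: unknown column {col!r}")
    return problems
```

Returning all problems at once, rather than stopping at the first, lets a user fix the whole file in one pass.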

4. Deploy the Sync Agent

Start the control plane and deploy the agent to begin the synchronization process.

# Start the control plane API
node control/api/index.js &

# Deploy the sync job
python cli/main.py deploy

5. Monitor Status

Check the status and metrics of your running sync job.

python cli/main.py status

Architecture

The system is composed of three core components that work together to provide a seamless data synchronization experience.

+-----------------+      +-----------------+      +---------------+
|       CLI       |----->|  Control Plane  |<-----|     Agent     |
|   (main.py)     |      |   (API & UI)    |      | (reader, etc) |
+-----------------+      +-----------------+      +---------------+
  • CLI (cli): A Python-based command-line interface for initializing configurations, validating schemas, and deploying sync jobs.
  • Control Plane (control): A Node.js application that provides a central API for managing and monitoring sync jobs. It also includes a minimal UI for observability.
  • Agent (agent): A high-performance Go agent that performs the heavy lifting. It connects to the source database, reads the Write-Ahead Log (WAL) using pgoutput, transforms the data according to the sync.yaml configuration, and applies it to the target database.
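
Reading the WAL through pgoutput relies on PostgreSQL logical decoding, which requires wal_level = logical on the source, a publication covering the tables, and a logical replication slot. The sketch below generates the corresponding SQL. Whether Repgine creates these objects itself is not stated here, and the names repgine_pub and repgine_slot are hypothetical.

```python
import re

# Replication slot names may contain only lowercase letters, digits, and underscores.
_SLOT_NAME = re.compile(r"[a-z0-9_]+")


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified(table):
    """Quote each part of a schema-qualified name such as public.users."""
    return ".".join(quote_ident(part) for part in table.split("."))


def replication_setup_sql(tables, publication="repgine_pub", slot="repgine_slot"):
    """Return the SQL statements that prepare a source for pgoutput streaming."""
    if not tables:
        raise ValueError("at least one table is required")
    if not _SLOT_NAME.fullmatch(slot):
        raise ValueError(f"invalid slot name: {slot!r}")
    table_list = ", ".join(qualified(t) for t in tables)
    return [
        f"CREATE PUBLICATION {quote_ident(publication)} FOR TABLE {table_list};",
        f"SELECT pg_create_logical_replication_slot('{slot}', 'pgoutput');",
    ]
```

Both statements would be run once against the source database by a role with replication privileges.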

Streamer Component

The Streamer is a key part of the agent responsible for applying changes from the source database to the target database. It reads WALMessage events (INSERT, UPDATE, DELETE) and constructs the appropriate SQL statements to execute against the target.

Configuration

The Streamer is configured via the sync.yaml file. This file defines which tables to watch and how to handle them. The Streamer specifically uses the tables and pk fields to construct its SQL queries.

Here is an example of a sync.yaml configuration:

tables:
  users:
    name: users
    pk: id
  products:
    name: products
    pk: product_id

In this example, the Streamer applies changes to the users and products tables, using id and product_id, respectively, as the primary key in the WHERE clause of UPDATE and DELETE statements.
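
To make the role of the pk field concrete, here is a minimal sketch of how a change event could be turned into a parameterized SQL statement. The Streamer itself is part of the Go agent; this Python version only illustrates the idea, and build_statement and its arguments are hypothetical.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_statement(op, table, pk, row):
    """Return (sql, params) for one change event, using $n placeholders.

    `row` holds the new column values for INSERT and UPDATE, and at least
    the primary key value for DELETE.
    """
    tbl = quote_ident(table)
    if op == "INSERT":
        cols = list(row)
        names = ", ".join(quote_ident(c) for c in cols)
        marks = ", ".join(f"${i}" for i in range(1, len(cols) + 1))
        return f"INSERT INTO {tbl} ({names}) VALUES ({marks})", [row[c] for c in cols]
    if op == "UPDATE":
        cols = [c for c in row if c != pk]
        if not cols:
            raise ValueError("UPDATE needs at least one non-key column")
        sets = ", ".join(f"{quote_ident(c)} = ${i}" for i, c in enumerate(cols, 1))
        sql = f"UPDATE {tbl} SET {sets} WHERE {quote_ident(pk)} = ${len(cols) + 1}"
        return sql, [row[c] for c in cols] + [row[pk]]
    if op == "DELETE":
        return f"DELETE FROM {tbl} WHERE {quote_ident(pk)} = $1", [row[pk]]
    raise ValueError(f"unsupported operation: {op}")
```

For instance, build_statement("DELETE", "products", "product_id", {"product_id": 3}) yields a DELETE statement filtered on product_id with the parameter list [3]. This sketch does not cover an UPDATE that changes the primary key itself, which would need the old key value from the event.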

Testing the Streamer

The Streamer includes both unit and integration tests.

  • Unit Tests: These tests use mocking to verify the SQL generation logic without requiring a live database.
    go test ./agent/stream/...
  • Integration Tests: These tests require live source and target database connections and verify the end-to-end data flow. To run them, set the SOURCE_DB_DSN and TARGET_DB_DSN environment variables:
    export SOURCE_DB_DSN="postgresql://user:pass@host:port/sourcedb"
    export TARGET_DB_DSN="postgresql://user:pass@host:port/targetdb"
    go test -v -run TestStreamer_Integration ./agent/stream

Development

To set up a development environment, ensure you have the prerequisites installed, then follow the installation steps.

Running Tests

The project includes unit and integration tests for each module.

  • CLI Tests:
    pytest tests/cli/
  • Control Plane API Tests:
    npm test --prefix control/api
  • Agent Tests:
    go test ./agent/...

Building with Docker

A Dockerfile is provided to build and run the agent in a containerized environment.

docker build -t repgine-agent .
docker run -e SOURCE_DSN="..." -e TARGET_DSN="..." repgine-agent

Contributing

Contributions are welcome! Please feel free to submit a pull request.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Repgine © 2025 by Nitish Kumar is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/
