Syncing Warehouse Data to ClickHouse with Airbyte
By Trevor Paulsen
In the last post, I laid out 10 design principles for building a modern journey analytics solution. In this post, we're going to focus on two of them:
- Principle 1: The warehouse is the source of truth - Manage your data from a warehouse, and have your journey analytics sync directly from there.
- Principle 2: Don't build data sync - Data syncing is a solved problem. Don't reinvent it.
But let's first address the obvious question here - if the warehouse is the source of truth, why not just query the warehouse directly for journey analytics? Why sync data into a separate system at all?
To be honest, you can do journey analytics in a warehouse - Snowflake, BigQuery, and Databricks all support the building blocks (window functions, sessionization, sequential pattern matching). But at scale, layering these operations together can become slow and expensive in warehouses designed for general-purpose queries.
Journey analytics is inherently exploratory - users need to build custom funnels, test hypotheses, and slice by arbitrary segments on the fly - and if queries take minutes instead of seconds, people stop exploring. These workloads have to support hundreds of users across product, marketing, support, and beyond, and they need an engine purpose-built for the job - that's why we use ClickHouse.
So... that brings us to the practical question: how do you actually sync data from a warehouse into ClickHouse, reliably, incrementally, and without building a massive ETL pipeline from scratch?
The answer, for us, is Airbyte.
(Just FYI: I'm not affiliated with Airbyte or ClickHouse and I'm not getting anything from either company - I'm just a fan of what they've built!)
Why Not Build It Ourselves?
It's tempting to think data syncing is simple. Read rows from a source, write them to a destination - how hard could it be? But as soon as you start thinking about it seriously, the complexity mushrooms:
- Schema discovery - What tables and columns exist in the source? What types are they?
- Incremental detection - Which rows are new or updated since the last sync?
- Deduplication - If a row gets updated multiple times between syncs, how do you avoid duplicates?
- Delete handling - How do you know when a row has been deleted from the source?
- Error recovery - What happens when a sync fails halfway through?
- Scheduling - How do you run syncs on a cadence without building your own scheduler?
- Connector maintenance - PostgreSQL, BigQuery, Snowflake, MySQL, S3 - each source has its own quirks, auth mechanisms, and pagination strategies.
Each of these is a real engineering challenge, and together they represent an entire product category. Airbyte accelerates us here: it has a large catalog of connectors maintained by an open-source community and a commercial team. Building even a fraction of that ourselves would be a massive distraction from our actual product: journey analytics.
So instead of building the data sync, we'll build an API layer on top of Airbyte that simplifies the experience for our UI and abstracts away the complexity.
A Quick Primer on Airbyte Concepts
Before we dive into sync modes and our API layer, it helps to understand the core building blocks Airbyte gives you. If you want the full picture, the Airbyte core concepts docs and the API reference are great resources. But here's the quick version:
- Sources - A configured connection to wherever your data lives (PostgreSQL, Snowflake, BigQuery, S3, etc.). You provide credentials and connection details, and Airbyte handles the rest.
- Destinations - Where the data lands. In our case, this is always ClickHouse.
- Streams - The individual tables or data collections available from a source. When you connect a PostgreSQL database, each table shows up as a stream you can choose to sync.
- Connections - The pipeline that ties a source to a destination. A connection defines which streams to sync, how to sync them (the sync mode), and when to sync them (schedule).
- Jobs - The actual execution of a sync. Each time a connection runs - whether on a schedule or triggered manually - it creates a job you can monitor for status, record counts, and errors.
How These Map to Our Solution
We don't want to expose all of these concepts directly to our users. Instead, let's simplify them into two main ideas:
- Source connectors - These map directly to Airbyte sources. A user connects their warehouse by providing credentials, and we create and manage the Airbyte source behind the scenes.
- Datasets - This is our abstraction over Airbyte connections. A dataset represents a single stream being synced from a source into ClickHouse with a specific sync mode and schedule. All the connection configuration, stream selection, and job orchestration is handled by our API layer - the user just picks a table, chooses how they want it synced, and we take care of the rest.
This mapping keeps the UI clean while still giving us the full power of Airbyte's sync engine underneath.
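To make the mapping concrete, here's a rough sketch of the two abstractions as TypeScript types. All field names here are my own invention for illustration - they don't mirror Airbyte's actual API shapes.

```typescript
// Illustrative types for our two user-facing abstractions.
// Field names are hypothetical, not Airbyte's real API shapes.
type SyncMode = "full_refresh" | "append_only" | "mirror";

interface SourceConnector {
  id: string; // maps 1:1 to an Airbyte source
  kind: "postgres" | "bigquery" | "snowflake" | "s3";
  name: string;
}

interface Dataset {
  id: string; // maps 1:1 to an Airbyte connection
  sourceId: string; // the SourceConnector it syncs from
  streamName: string; // the single stream this dataset syncs
  syncMode: SyncMode;
  schedule:
    | { scheduleType: "manual" }
    | { scheduleType: "cron"; cronExpression: string };
}
```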
The Three Sync Modes
If you deal in data, you know it comes in all shapes and sizes. Some tables are small and rarely change. Others are massive append-only event streams. Others still are living, evolving records that get updated and deleted over time. There's no single sync strategy that handles all of these well - so we need to offer a few options.
When you set up a dataset, you choose one of three sync modes. Each maps to an underlying Airbyte sync strategy, but I prefer simplifying the naming to make things more intuitive:
1. Full Refresh (Replace Everything)
Airbyte mode: full_refresh_overwrite
This is the simplest, least intelligent approach. Every time a sync runs, it pulls the entire table from the source and overwrites whatever was in ClickHouse before.
What it's useful for:
- Smaller reference/lookup tables (product catalogs, country codes, etc.)
- Data where you don't have a reliable "last updated" timestamp
- Tables small enough that re-syncing everything is cheap and fast
Tradeoffs:
- ✅ Dead simple - no cursor fields, no primary keys, no CDC logic
- ✅ Guaranteed consistency - you always have an exact copy of the source
- ❌ Terrible for large tables - re-syncing tons of rows every time is slow and expensive
2. Append Only (Only New Rows)
Airbyte mode: incremental_append
Append-only sync tracks a cursor field (like a timestamp or incrementing version number) and only pulls rows that are new or updated since the last sync. New data gets appended to the destination table - it never updates or removes existing rows.
Required fields:
- Cursor field (updatedAt) - A column that increases monotonically, like updated_at or event_timestamp. Airbyte uses this to know where it left off.
When to use it:
- Event streams and logs where rows won't change historically
- Pre-aggregated data that won't be revised historically (daily SEO clickthroughs, aggregated ad impressions, etc.)
Tradeoffs:
- ✅ Efficient - only syncs what's new
- ❌ No updates to historical records - once a row is copied, it's there to stay
- ❌ No delete detection - if a row is deleted from the source, the old version stays in ClickHouse
- ❌ Not a good fit if you need to modify or delete historical data (e.g. GDPR right-to-erasure requests) - those changes won't propagate
Append-only is a natural fit for some types of event data that won't change historically. That said, if you need to handle GDPR or CCPA deletion requests, you'll want to consider mirror mode instead, since append-only won't propagate those changes.
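Conceptually, the cursor logic behind append-only sync looks something like this sketch (my simplification, not Airbyte's actual implementation): filter to rows past the saved cursor, append them, and persist the new high-water mark.

```typescript
interface EventRow {
  event_id: string;
  updated_at: string; // ISO-8601 timestamp used as the cursor field
}

// One incremental pass: pick rows newer than the saved cursor and
// compute the new high-water mark to persist for the next sync.
function incrementalAppend(
  rows: EventRow[],
  lastCursor: string | null
): { toAppend: EventRow[]; newCursor: string | null } {
  const toAppend = rows.filter(
    (r) => lastCursor === null || r.updated_at > lastCursor
  );
  const newCursor = toAppend.reduce(
    (max, r) => (max === null || r.updated_at > max ? r.updated_at : max),
    lastCursor
  );
  return { toAppend, newCursor };
}
```

Note that ISO-8601 timestamps sort lexically, which is why plain string comparison works here.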
3. Mirror (Full CDC with Deduplication)
Airbyte mode: incremental_deduped_history
Mirror mode is the most sophisticated option and the one I'd recommend as the default. Like append-only, it uses a cursor field to detect changes incrementally. But it also uses a row ID to deduplicate rows, so you always have the latest version of each record. It effectively maintains a mirror of your source table in ClickHouse that stays in sync over time.
Required fields:
- Cursor field (updatedAt) - Similar to append-only, used to detect new and changed rows
- Primary key (rowId) - The unique identifier for each row, used for deduplication
- Delete flag (isDeleted) - Optional. A boolean column in the source that indicates whether a row should be deleted
When to use it:
- Dimension tables that get updated over time (user profiles, account info, product details)
- Event data that may be modified or deleted historically (e.g. for GDPR compliance or data corrections)
- Data where rows can be updated or deleted
Tradeoffs:
- ✅ Always up-to-date - deduplication ensures you see the latest version of each row
- ✅ Handles updates and soft deletes gracefully
- ✅ Efficient - still incremental, only pulls what changed
- ⚠️ Requires a reliable primary key and cursor field in the source
- ⚠️ Deletions are a bit more involved - you need to soft-delete before you hard-delete (more on this below)
A note on deletes: If you hard-delete a row from the source, the sync has no way to know it was ever there. So deletes need to be a two-step process: soft-delete first (set an is_deleted flag), let it propagate to ClickHouse, then hard-delete from the source later. When using this mode of data syncing, you'll have to make sure your warehouse deletion workflows account for this.
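Put together, the dedup-plus-soft-delete semantics of mirror mode can be sketched like this - a conceptual model in TypeScript, not how Airbyte or ClickHouse actually implement it:

```typescript
interface ChangeRow {
  rowId: string; // primary key used for deduplication
  updatedAt: string; // cursor field
  isDeleted?: boolean; // optional soft-delete flag
}

// Collapse a batch of change rows into the latest version of each
// record (by cursor value), then drop records whose latest version
// is soft-deleted.
function mirror(rows: ChangeRow[]): ChangeRow[] {
  const latest = new Map<string, ChangeRow>();
  for (const row of rows) {
    const prev = latest.get(row.rowId);
    if (!prev || row.updatedAt > prev.updatedAt) latest.set(row.rowId, row);
  }
  return [...latest.values()].filter((r) => !r.isDeleted);
}
```

The same idea is why the two-step delete works: the soft-deleted version is just one more change row, so it flows through the sync before the source row disappears.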
The API Layer: Simplifying Airbyte for the UI
Airbyte is powerful, but its API is designed for every data syncing case under the sun. It has concepts like incremental_deduped_history, nested stream configurations, workspace management, OAuth token flows, and job orchestration. These are all well-designed, but they're more complexity than our product UI needs to expose.
From a user's perspective, the workflow we want to support is pretty straightforward:
- Connect to your warehouse - Provide credentials for your Postgres, BigQuery, Snowflake, etc.
- Pick the datasets you want to sync - Browse the available tables and select the ones you need.
- Configure each dataset - Choose a sync mode, map the required fields, and tell us what type of data it is (event, lookup, aggregate, or identity map) - all in one step.
- Set a sync schedule - Choose manual, or set a cron schedule.
That's it. To support this workflow (and support proper authentication), we'll use a Firebase Cloud Functions API layer that sits between our React frontend and Airbyte's API. Here's what it does under the hood:
Simplified Sync Mode Names
Instead of asking users to understand Airbyte's native sync modes, we map them to three intuitive labels:
| Our name | Airbyte's name |
|---|---|
| full_refresh | full_refresh_overwrite |
| append_only | incremental_append |
| mirror | incremental_deduped_history |
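In the API layer, this is little more than a lookup table. A sketch, with hypothetical names:

```typescript
// Map our user-facing sync mode names to Airbyte's native ones.
const SYNC_MODE_MAP = {
  full_refresh: "full_refresh_overwrite",
  append_only: "incremental_append",
  mirror: "incremental_deduped_history",
} as const;

type OurSyncMode = keyof typeof SYNC_MODE_MAP;

function toAirbyteSyncMode(mode: OurSyncMode): string {
  return SYNC_MODE_MAP[mode];
}
```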
Unified Dataset Configuration
Airbyte connections require nested stream configurations with arrays of cursor fields and primary keys. On top of that, we need our own dataset type metadata - because journey analytics needs to know not just how to sync the data, but what the data represents. Is it an event stream? A lookup table? An identity map?
Each dataset type has its own required fields:
| Dataset type | Required fields |
|---|---|
| Event | timestamp, primaryUserId |
| Lookup | primaryKey |
| Aggregate | timestamp |
| Identity Map | primaryUserId, cascadingIds |
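Validation against this table is straightforward. A sketch - the identity_map key and the function names are my own:

```typescript
// Required field names per dataset type, per the table above.
const REQUIRED_FIELDS: Record<string, string[]> = {
  event: ["timestamp", "primaryUserId"],
  lookup: ["primaryKey"],
  aggregate: ["timestamp"],
  identity_map: ["primaryUserId", "cascadingIds"],
};

// Return the names of any required fields the user hasn't mapped yet.
function missingFields(
  datasetType: string,
  provided: Record<string, string>
): string[] {
  const required = REQUIRED_FIELDS[datasetType];
  if (!required) throw new Error(`Unknown dataset type: ${datasetType}`);
  return required.filter((f) => !(f in provided));
}
```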
Our API combines all of this - sync mode, sync fields, dataset type, and type-specific fields - into a single flat configuration object:
```json
{
  "name": "User Events",
  "sourceId": "src_abc123",
  "syncMode": "mirror",
  "rowVersion": "updated_at",
  "rowId": "event_id",
  "streamSelection": {
    "streamName": "events",
    "namespace": "production"
  },
  "datasetConfiguration": {
    "datasetType": "event",
    "requiredFields": {
      "timestamp": "event_timestamp",
      "primaryUserId": "user_id"
    }
  },
  "schedule": {
    "scheduleType": "cron",
    "cronExpression": "0 */6 * * *"
  }
}
```

Behind the scenes, our createDataset function splits this apart and does the right thing with each piece - building the Airbyte stream config, mapping the sync mode, setting cursor fields and primary keys, configuring the namespace to route data to the correct ClickHouse database, and storing the dataset type metadata for the analytics engine to use later.
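Here's a simplified sketch of that split. The helper and the Airbyte-side field names are invented for illustration - the real connection payload is nested more deeply.

```typescript
interface DatasetConfig {
  name: string;
  sourceId: string;
  syncMode: "full_refresh" | "append_only" | "mirror";
  rowVersion?: string;
  rowId?: string;
  streamSelection: { streamName: string; namespace?: string };
  datasetConfiguration: {
    datasetType: string;
    requiredFields: Record<string, string>;
  };
  schedule:
    | { scheduleType: "manual" }
    | { scheduleType: "cron"; cronExpression: string };
}

// Split the flat config into (1) an Airbyte-shaped connection payload
// and (2) our own metadata record. The Airbyte side is simplified here.
function splitDatasetConfig(cfg: DatasetConfig) {
  const syncModeMap = {
    full_refresh: "full_refresh_overwrite",
    append_only: "incremental_append",
    mirror: "incremental_deduped_history",
  } as const;
  const airbyteConnection = {
    sourceId: cfg.sourceId,
    streamName: cfg.streamSelection.streamName,
    namespace: cfg.streamSelection.namespace,
    syncMode: syncModeMap[cfg.syncMode],
    // Cursor only applies to incremental modes; primary key only to mirror.
    cursorField: cfg.syncMode === "full_refresh" ? undefined : cfg.rowVersion,
    primaryKey: cfg.syncMode === "mirror" ? cfg.rowId : undefined,
    schedule: cfg.schedule,
  };
  const metadata = {
    name: cfg.name,
    datasetType: cfg.datasetConfiguration.datasetType,
    requiredFields: cfg.datasetConfiguration.requiredFields,
  };
  return { airbyteConnection, metadata };
}
```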
The Endpoints
Here's the full set of API endpoints Claude helped me build to support this workflow. If you're building something similar, this is roughly the surface area you'd need:
Source management:
| Endpoint | What it does |
|---|---|
| createSource | Provisions a new source in Airbyte with the user's warehouse credentials |
| testSource | Validates connectivity and optionally runs schema discovery to verify the connection works |
| getSourceStreams | Discovers available tables/streams from a source so the user can pick which ones to sync |
| listSources | Returns all configured sources for a tenant |
| getSource | Returns details for a specific source |
| deleteSource | Removes a source from Airbyte and cleans up metadata |
Dataset management:
| Endpoint | What it does |
|---|---|
| createDataset | Creates an Airbyte connection with the right sync mode, stream config, and schedule - then stores dataset type metadata |
| updateDataset | Updates name, description, schedule, or status - syncs changes to both Airbyte and our metadata store |
| deleteDataset | Three-step teardown: deletes the Airbyte connection, drops the ClickHouse table, and removes metadata |
| syncDataset | Manually triggers a sync job |
| resetDataset | Clears and re-syncs a dataset from scratch |
| listDatasets | Returns all datasets, optionally with live sync status from Airbyte |
| getDataset | Returns full dataset details with live job status (last sync time, records synced, etc.) |
| getDatasetSyncHistory | Returns the history of sync jobs for a dataset |
That's 14 endpoints total. Most of them are thin wrappers around Airbyte's API with some added validation and metadata management on top.
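As an example of how thin these wrappers are, here's a sketch of the request syncDataset might forward to Airbyte. The endpoint shape (POST /v1/jobs with a jobType of "sync") follows my reading of Airbyte's public API - verify against their reference docs before relying on it.

```typescript
// Build the HTTP request syncDataset would forward to Airbyte.
// Pure function so it's easy to test without a network call;
// the URL and body shape are assumptions about Airbyte's public API.
function buildSyncJobRequest(apiBase: string, connectionId: string) {
  return {
    method: "POST" as const,
    url: `${apiBase}/v1/jobs`,
    body: { jobType: "sync", connectionId },
  };
}
```

The actual Cloud Function would wrap this with auth checks, look up the Airbyte connection ID from the dataset's metadata, and forward the response back to the UI.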
This API layer might sound like a lot of work to create, but this is exactly the kind of thing Claude Code excels at. Give it a clear spec of what each endpoint should do - inputs, outputs, validation rules, and how it maps to the underlying Airbyte API - and it can build out the Cloud Functions surprisingly fast. In fact, you could use this very blog post as context to get started! The sync mode mappings, dataset type requirements, and field validation rules we just walked through are exactly the kind of structured specification that AI coding tools thrive on.
Tying It All Together
Let's recap the full picture. We started with a practical problem - getting data from a data warehouse into ClickHouse without building a massive ETL pipeline - and arrived at a layered architecture:
- Airbyte handles the hard parts: connector maintenance, incremental detection, schema discovery, error recovery, and scheduling.
- The API layer simplifies the experience: translating intuitive sync mode names, combining sync and dataset configuration into a single step, and routing data to our ClickHouse database.
- The user just connects their warehouse, picks their tables, configures each dataset, and sets a schedule. That's it.
The key takeaway is that our API layer is relatively thin - validation, Airbyte API translation, and metadata storage. The heavy lifting is all Airbyte. That lets us focus our energy on the actual journey analytics product rather than the data plumbing.
And here's what this looks like once everything is tied together:
*Setting up a data source and syncing datasets in Trevorwithdata*
What's Next
With data flowing from a warehouse into ClickHouse, the next challenge is making that data queryable in a journey analytics context. That means building out the query and processing layers that understand events, sessions, users, and journeys - and can translate high-level analytical questions into fast ClickHouse queries.
That's what we'll dig into in the next post. Stay tuned! 🚀
Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.