How to Build a Unified Events Table for DIY Journey Analytics
By Trevor Paulsen
This is part of a series where we're building a DIY journey analytics platform from scratch. If you're just joining, check out the earlier posts to catch up!
Demo: Building a unified events table with Data Groups
Fair warning: this is a "data plumbing" post. We're talking about table schemas, materialized views, and sort orders - not the most glamorous topics. But the decisions we make at this layer have an outsized impact on everything downstream. Get the plumbing wrong, and you'll feel it every time you try to run a query, build a session, or stitch an identity. Get it right, and everything else just... works. So let's dive in.
In the last post, we set up Airbyte to sync warehouse data into ClickHouse. If you followed along, you now have individual ClickHouse tables for each dataset you configured - web events in one table, mobile events in another, maybe CRM records in a third.
But here's the thing: journey analytics is all about analyzing behavior across those data sources. A customer browses your website, opens your mobile app, calls your support line, and then makes a transaction at a physical location. To see that full journey, you can't be querying four separate tables with manual UNIONs - you need one unified view of everything that happened.
That's the problem we're solving in this post. I call this layer a "Data Group" for lack of a better name - it's the step that takes your individual synced datasets and combines them into a single, queryable events table (with associated lookups etc.) that the analytics engine can work with.
The Problem: Same Events, Different Shapes
Let's say you have two event datasets:
Web events table (web_events):
| event_id | cookie_id | event_timestamp | page_url | referrer | revenue |
|---|---|---|---|---|---|
| ev_001 | abc123 | 2026-01-15 10:30 | /products/shoes | google.com | NULL |
| ev_002 | abc123 | 2026-01-15 10:35 | /checkout | NULL | 89.99 |
Mobile events table (mobile_events):
| event_id | device_id | event_timestamp | screen_name | app_version | revenue |
|---|---|---|---|---|---|
| ev_101 | xyz789 | 2026-01-15 11:00 | HomeScreen | 2.1.0 | NULL |
| ev_102 | xyz789 | 2026-01-15 11:05 | ProductDetail | 2.1.0 | 45.00 |
These tables have overlapping fields (event_id, event_timestamp, revenue) but also fields unique to each source (page_url vs. screen_name, cookie_id vs. device_id). A user asking "what's the total revenue across web and mobile?" shouldn't have to think about which table to query or how to UNION them manually.
So, we need a solution that can:
- Discover the schema of each source table automatically
- Merge those schemas into one unified structure
- Route incoming data from each source into the unified table in real-time
- Backfill historical data that was already synced before the Data Group was created
Once we've done this, we'll have a rich set of data that we can then use as a foundation for our future analytics.
Step 1: Schema Inference and Merging
When you create a Data Group, the first thing the system needs to do is inspect the source tables. ClickHouse makes this straightforward - you can query system.columns to get the full schema of any table:
SELECT
name,
type
FROM system.columns
WHERE database = 'my_clickhouse_db'
AND table = 'web_events'

The system runs this for each source event dataset we want to include in our data group, then merges the results into one unified schema. The rules are pretty straightforward:
- Fields that exist in all sources keep their type (or get promoted to a wider type if there's a mismatch - for example, if one source has Int32 and another has Float64, the unified type becomes Float64)
- Fields that only exist in some sources become nullable in the unified schema (since rows from other sources won't have values for those fields)
- Core fields are standardized - every Data Group gets a consistent set of core fields: event_id, device_id, person_id, event_timestamp, event_source, and is_identified
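The merge rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual implementation - the function names and the (simplified) type-widening table are assumptions:

```python
# Illustrative sketch of the schema-merge rules: shared fields keep or
# widen their type; fields missing from any source become Nullable.
# The widening table is deliberately simplified.

WIDEN = {
    frozenset({"Int32", "Float64"}): "Float64",
    frozenset({"Int32", "Int64"}): "Int64",
    frozenset({"Float32", "Float64"}): "Float64",
}

def merge_schemas(schemas):
    """Merge per-source {field: type} dicts into one unified schema."""
    all_fields = set().union(*(s.keys() for s in schemas))
    unified = {}
    for field in sorted(all_fields):
        types = {s[field] for s in schemas if field in s}
        if len(types) == 1:
            t = types.pop()
        else:
            # Widen on mismatch; fall back to String if no rule matches
            t = WIDEN.get(frozenset(types), "String")
        in_all = all(field in s for s in schemas)
        unified[field] = t if in_all else f"Nullable({t})"
    return unified

web = {"event_id": "String", "event_timestamp": "DateTime64(3)",
       "page_url": "String", "revenue": "Float64"}
mobile = {"event_id": "String", "event_timestamp": "DateTime64(3)",
          "screen_name": "String", "revenue": "Float64"}

print(merge_schemas([web, mobile]))
```

Running this against our example schemas yields page_url and screen_name as Nullable(String) while the shared fields keep their types - exactly the unified schema shown below.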
For our example, the unified events schema would look something like this:
| Field | Type | Source |
|---|---|---|
| event_id | String | Both |
| device_id | String | Both (mapped from cookie_id / device_id) |
| person_id | String | Both (more on this in a future identity stitching post) |
| event_timestamp | DateTime64(3) | Both |
| event_source | String | System-generated |
| is_identified | UInt8 | System-generated |
| page_url | Nullable(String) | Web only |
| referrer | Nullable(String) | Web only |
| screen_name | Nullable(String) | Mobile only |
| app_version | Nullable(String) | Mobile only |
| revenue | Nullable(Float64) | Both |
Note the two system-generated fields: event_source records which dataset each row came from so you can filter or break down results by source, and is_identified flags whether the row has a real person identifier or arrived without one (more on why this matters when we get to the materialized views). The device_id / person_id mapping is where field configuration comes in - you tell the system which field in each source represents the user identifier, and the Data Group normalizes it into a standard column.
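To make the field-configuration idea concrete, here's a hypothetical config shape and the expression it would generate. The config structure and helper name are illustrative assumptions, not the actual implementation:

```python
# Hypothetical Data Group configuration: each source declares which of
# its columns is the user identifier, and the system generates the
# normalizing SELECT expression from it.

data_group_config = {
    "name": "Web + Mobile Events",
    "sources": {
        "web_events": {"device_id_field": "cookie_id"},
        "mobile_events": {"device_id_field": "device_id"},
    },
}

def device_id_expr(source: str) -> str:
    """Build the SELECT expression that maps a source's identifier
    column to the standard device_id field."""
    field = data_group_config["sources"][source]["device_id_field"]
    return f"coalesce({field}, '') AS device_id"

print(device_id_expr("web_events"))   # coalesce(cookie_id, '') AS device_id
print(device_id_expr("mobile_events"))  # coalesce(device_id, '') AS device_id
```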
Automatic JSON Detection
One thing I added to the schema inference that's pretty handy: detecting JSON columns automatically. It samples rows from each source table and uses ClickHouse's isValidJSON() function to check whether any String columns actually contain valid JSON. If the sampled values pass the check (and they often do - marketing parameters, product details, geo data all tend to arrive as JSON blobs), the system auto-discovers the nested fields with distinctJSONPathsAndTypes() and makes them available as individual queryable fields. So instead of needing to write JSON_VALUE(marketing, '$.utm_source'), users just see marketing.utm_source as a normal field in the schema.
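The detection pass looks something like the following sketch. In ClickHouse the real work is done by isValidJSON() and distinctJSONPathsAndTypes(); here Python's json.loads stands in for the validity check and a simple key walk stands in for path discovery, so treat this as an approximation of the idea rather than the implementation:

```python
import json

# Sketch of JSON auto-detection: sample a column's values, check how
# many parse as JSON objects, and collect the top-level paths seen.

def looks_like_json_column(sample_values, threshold=0.9):
    """True if most sampled non-empty values parse as JSON objects."""
    parsed = 0
    checked = 0
    for v in sample_values:
        if not v:
            continue
        checked += 1
        try:
            if isinstance(json.loads(v), dict):
                parsed += 1
        except (ValueError, TypeError):
            pass
    return checked > 0 and parsed / checked >= threshold

def discover_paths(sample_values):
    """Collect top-level JSON keys seen across the samples."""
    paths = set()
    for v in sample_values:
        try:
            obj = json.loads(v)
        except (ValueError, TypeError):
            continue
        if isinstance(obj, dict):
            paths.update(obj.keys())
    return sorted(paths)

samples = ['{"utm_source": "google", "utm_medium": "cpc"}',
           '{"utm_source": "newsletter"}']
print(looks_like_json_column(samples))  # True
print(discover_paths(samples))          # ['utm_medium', 'utm_source']
```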
Step 2: Building the Union Table
With the unified schema figured out, the system creates the actual ClickHouse table that will hold all the combined data. We can't just query the Airbyte source tables directly - they have whatever sort order and partitioning the sync tool gave them, fields might need type coercion, user identifiers need normalizing, and anonymous (cookieless) traffic needs special handling.
Said another way, we need a tightly prepped table where all of that is cleaned up and the physical layout is optimized for the kinds of queries journey analytics actually runs. This union table also becomes the foundation for identity stitching - we'll cover that in a future post, but the short version is that resolving anonymous visitors to known people greatly benefits from a single, unified table to stitch against.
So, here's what the table creation looks like (simplified for our example):
CREATE TABLE unified_data_group (
event_id String,
device_id String,
person_id String,
event_timestamp DateTime64(3),
event_source String,
page_url Nullable(String),
referrer Nullable(String),
screen_name Nullable(String),
app_version Nullable(String),
revenue Nullable(Float64),
is_identified UInt8 DEFAULT 1,
inserted_at DateTime64(3) DEFAULT now64(3),
updated_at DateTime64(3) DEFAULT now64(3),
is_deleted UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (person_id, toDate(event_timestamp), event_id)

There's a lot packed into that statement, so let's break down the three key choices:
ReplacingMergeTree(updated_at, is_deleted) - This is the ClickHouse magic that lets our unified table inherit changes from the upstream datasets. As Airbyte syncs updates and deletes from the source warehouse, ReplacingMergeTree uses updated_at to keep only the latest version of each row and is_deleted to drop rows that were removed from the source. This makes sure our data group stays in perfect sync with our source warehouse.
ORDER BY (person_id, toDate(event_timestamp), event_id) - This is arguably the most important line. In ClickHouse, ORDER BY doesn't just affect query results - it controls how data is physically sorted on disk. We're sorting by person first, then date, then event ID. This means all events for the same person on the same day are stored right next to each other on disk. Why does this matter so much? Because journey analytics queries almost always process events per-person in time order (sessionization, attribution, pathing, funnel analysis). Having the data pre-sorted this way means ClickHouse can read it sequentially without any reshuffling. This one design choice is the single biggest factor in downstream query performance, and it'll pay off big time when we get to the query layer (which we'll discuss in a future post).
PARTITION BY toYYYYMM(event_timestamp) - This creates a separate data partition for each month of data. When a query includes a date range (and pretty much every analytics query does), ClickHouse can skip entire months of data that fall outside the range. This is called partition pruning, and it can mean the difference between scanning 12 months of data and scanning just the 2 months you actually care about.
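The arithmetic behind partition pruning is easy to see with a quick sketch (the helper names here are illustrative, not ClickHouse internals):

```python
from datetime import date

# Sketch of how toYYYYMM partitioning enables pruning: a query's date
# range maps to a small set of partition keys, and every partition
# outside that set can be skipped without reading it.

def partitions_for_range(start: date, end: date):
    """Partition keys (YYYYMM) a [start, end] query would need to scan."""
    keys = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(y * 100 + m)
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return keys

# A two-month query touches only 2 partitions out of a year's 12:
print(partitions_for_range(date(2026, 1, 10), date(2026, 2, 20)))  # [202601, 202602]
```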
Together, these three choices - the engine, the sort order, and the partitioning - turn what could be a slow, brute-force table scan into a fast, targeted read. This kind of control over physical data layout is what ClickHouse gives you that general-purpose warehouses don't, and it's a key reason we chose it. Journey analytics needs super fast, interactive queries - users dragging and dropping dimensions, applying segments, and expecting sub-second responses. That kind of experience requires the data to be physically optimized for the exact access patterns you're going to run. These foundational choices make the complex query-time processing and transformations we'll cover in a future post actually viable at scale.
Step 3: Materialized Views (The Real-Time Bridge)
Now that we have our unified table, how do we ensure that future data actually gets into it as rows are added or changed? This is where ClickHouse materialized views come in.
If you haven't worked with ClickHouse materialized views before, here's the simple version: a materialized view is like a standing query that runs automatically every time new data is inserted into a source table. It transforms the incoming rows and writes the results into a target table. Think of it like a permanent INSERT INTO ... SELECT that fires whenever new data arrives - no scheduling, no batch jobs, just instant data routing.
For Data Groups, we create one materialized view per source dataset, and all of them write to the same union table:
web_events    ──▶ [MV: web]    ──▶ unified_data_group
mobile_events ──▶ [MV: mobile] ──▶ unified_data_group
Here's what the materialized view for web events looks like:
CREATE MATERIALIZED VIEW mv_web_to_unified_data_group
TO unified_data_group
AS
SELECT
-- Use the source event_id if it exists, otherwise generate a unique UUID
-- so rows never collide during ReplacingMergeTree deduplication
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
-- Map cookie_id to the system field device_id
coalesce(cookie_id, '') AS device_id,
-- Default person_id to device_id; fall back to event_id for anonymous events
-- Also handles the literal string 'null' that some sources send
-- (this is where identity stitching will plug in later)
coalesce(nullIf(nullIf(cookie_id, 'null'), ''), event_id) AS person_id,
-- Flag whether this row has a real identifier or is anonymous
-- (same nullIf checks as person_id to strip out blank and 'null' junk values)
if(nullIf(nullIf(cookie_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
-- Parse timestamp flexibly; fall back to now() if malformed or missing
-- (OrNull variant prevents bad timestamps from blocking the entire MV)
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
-- Adding the dataset source it came from
'web_events' AS event_source,
-- Adding the data unique to or shared by web_events
page_url,
referrer,
revenue,
-- Putting in NULLs for the data that doesn't apply to web_events
CAST(NULL AS Nullable(String)) AS screen_name,
CAST(NULL AS Nullable(String)) AS app_version,
-- Other system fields
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM web_events

A few things to notice about this materialized view:
- Field mapping: cookie_id from the web table gets mapped to device_id in the union table. Fields that don't exist in this source (like screen_name) get NULLs. Each source has its own mapping like this.
- Anonymous event handling: Not every event arrives with a user identifier (privacy regulations, browsers blocking cookies, etc.), so person_id defaults to the device_id and falls back to event_id when even that is missing, keeping each anonymous event distinct. The is_identified flag marks whether the row has a real identifier (1) or a fallback (0), which drives downstream behavior like excluding anonymous rows from sessionization and person-level segments or metrics.
- Defensive defaults: Core fields use COALESCE to prevent null values from slipping through, generateUUIDv4() ensures every row gets a unique event ID even if the source doesn't provide one, and parseDateTimeBestEffortOrNull handles messy or malformed timestamps without blocking the entire materialized view.
- event_source tagging: Every row gets tagged with which dataset it came from, making it easy to filter or break down by source later.
The mobile events MV would look similar but with its own mapping - device_id maps directly, screen_name and app_version are populated, while page_url and referrer get NULLs.
Step 4: Backfilling Historical Data
There's one gap to fill: materialized views only capture data that arrives after the view is created. Any data that was already synced into the source tables before the Data Group was set up won't be picked up automatically.
So after creating the union table and the materialized views, the system runs a one-time backfill:
-- Backfill web events (same SELECT as the MV, run once against full history)
INSERT INTO unified_data_group
SELECT
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
coalesce(cookie_id, '') AS device_id,
coalesce(nullIf(nullIf(cookie_id, 'null'), ''), event_id) AS person_id,
if(nullIf(nullIf(cookie_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
'web_events' AS event_source,
page_url,
referrer,
revenue,
CAST(NULL AS Nullable(String)) AS screen_name,
CAST(NULL AS Nullable(String)) AS app_version,
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM web_events;
-- Backfill mobile events
INSERT INTO unified_data_group
SELECT
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
coalesce(device_id, '') AS device_id,
coalesce(nullIf(nullIf(device_id, 'null'), ''), event_id) AS person_id,
if(nullIf(nullIf(device_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
'mobile_events' AS event_source,
CAST(NULL AS Nullable(String)) AS page_url,
CAST(NULL AS Nullable(String)) AS referrer,
revenue,
screen_name,
app_version,
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM mobile_events

This is essentially the same SELECT as the materialized views, just executed once against the full historical data. After the backfill completes, the union table has the complete history from all sources - and the materialized views keep it current going forward as new data arrives.
Depending on data volume, backfills can take anywhere from a few seconds to several minutes. The system tracks backfill progress so the UI can show the user how things are coming along.
The Full Picture
Let's zoom out and look at the complete flow from raw warehouse data to a queryable Data Group:
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Warehouse   │────▶│   Airbyte    │────▶│  ClickHouse  │────▶│  Data Group  │
│  (BigQuery,  │     │   CDC Sync   │     │    Source    │     │  Union Table │
│  Snowflake)  │     │              │     │    Tables    │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘     └──────────────┘
                                                 │                     ▲
                                                 │    Materialized     │
                                                 └─────── Views ───────┘
- Warehouse holds the source of truth
- Airbyte syncs it to individual ClickHouse source tables (with CDC, append-only, or full refresh)
- Materialized Views transform and route data in real-time to the union table
- The Union Table (Data Group) is one unified, queryable table with data from all sources
From the user's perspective, none of this complexity is visible. They see a single Data Group called something like "Web + Mobile Events" with a clean list of fields they can use. They don't need to know that page_url comes from the web dataset and screen_name comes from mobile - they just see both as available fields in their report.
To orchestrate all of this, Claude and I built a handful of Cloud Function APIs that the UI calls:
- createDataGroup - The big one. Takes the user's dataset selections and field mappings, infers the unified schema, creates the union table and materialized views, runs the backfill, and enriches the schema with any auto-detected JSON fields. One API call kicks off the entire pipeline.
- getDataGroupSchema / getSampleData - Returns the unified schema and sample rows so the UI can show users what their Data Group looks like and let them preview data.
- deleteDataGroup - Tears down the materialized views, dictionaries, and union table in ClickHouse when a Data Group is removed.
Each of these is a Firebase Cloud Function (v2 HTTP) - lightweight, stateless, and easy to deploy. The heavy lifting all happens in ClickHouse; the API layer is mostly just orchestration and configuration management.
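As a rough sketch of what that orchestration looks like, createDataGroup just runs the pipeline steps in order. The step and helper names below are illustrative placeholders, not the actual API:

```python
# Hypothetical orchestration flow for createDataGroup. The run_step
# callback is injected so the sequence can be shown (and tested)
# without a real ClickHouse connection.

def create_data_group(sources, run_step):
    """Execute the Data Group creation pipeline, step by step."""
    steps = [
        ("infer_schema", sources),            # read system.columns, merge
        ("create_union_table", sources),      # CREATE TABLE ... ReplacingMergeTree
        ("create_materialized_views", sources),  # one MV per source
        ("backfill_history", sources),        # one-time INSERT ... SELECT
        ("detect_json_fields", sources),      # enrich schema with JSON paths
    ]
    return [run_step(name, args) for name, args in steps]

# Example: record the step order instead of touching a database.
log = []
create_data_group(["web_events", "mobile_events"],
                  lambda name, args: log.append(name))
print(log)
```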
The Complexity Budget
If you've been following this series, you might notice a pattern: each layer we build isn't wildly complex on its own. Airbyte handles sync. Materialized views handle routing. ReplacingMergeTree handles deduplication. The interesting part is in how these pieces compose together - and in making the right design choices so that downstream layers (like the query engine) can be as fast as possible.
That physical sort order we chose (person_id, toDate(event_timestamp), event_id) isn't just convenient - it's foundational. Every window function, every sessionization query, every attribution model that will run against this table benefits from data being physically sorted and partitioned by person and time. Getting this right here means the query layer can focus on analytical logic instead of fighting data layout.
In a future post, we'll see exactly how that pays off when we build a layered SQL architecture that turns these unified events into actual journey analytics - sessions, computed metrics, segments, and more. If you thought materialized views were fun, wait until you see what ClickHouse can do with nested window functions.
If you're interested in following along, I'd love to have you!
Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.