How to Build a Unified Events Table for DIY Journey Analytics
By Trevor Paulsen
This is part of a series where we're building a DIY journey analytics platform from scratch. If you're just joining, check out the earlier posts to catch up!
Demo: Building a unified events table with Data Groups
Fair warning: this is a "data plumbing" post. We're talking about table schemas, materialized views, and sort orders - not the most glamorous topics. But the decisions we make at this layer have an outsized impact on everything downstream. Get the plumbing wrong, and you'll feel it every time you try to run a query, build a session, or stitch an identity. Get it right, and everything else just... works. So let's dive in.
In the last post, we set up Airbyte to sync warehouse data into ClickHouse. If you followed along, you now have individual ClickHouse tables for each dataset you configured - web events in one table, mobile events in another, maybe CRM records in a third.
But here's the thing: journey analytics is all about analyzing behavior across those data sources. A customer browses your website, opens your mobile app, calls your support line, and then makes a transaction at a physical location. To see that full journey, you can't be querying four separate tables with manual UNIONs - you need one unified view of everything that happened.
That's the problem we're solving in this post. I call this layer a "Data Group" for lack of a better name - it's the step that takes your individual synced datasets and combines them into a single, queryable events table (with associated lookups etc.) that the analytics engine can work with.
The Problem: Same Events, Different Shapes
Let's say you have two event datasets:
Web events table (web_events):
| event_id | cookie_id | event_timestamp | page_url | referrer | revenue |
|---|---|---|---|---|---|
| ev_001 | abc123 | 2026-01-15 10:30 | /products/shoes | google.com | NULL |
| ev_002 | abc123 | 2026-01-15 10:35 | /checkout | NULL | 89.99 |
Mobile events table (mobile_events):
| event_id | device_id | event_timestamp | screen_name | app_version | revenue |
|---|---|---|---|---|---|
| ev_101 | xyz789 | 2026-01-15 11:00 | HomeScreen | 2.1.0 | NULL |
| ev_102 | xyz789 | 2026-01-15 11:05 | ProductDetail | 2.1.0 | 45.00 |
These tables have overlapping fields (event_id, event_timestamp, revenue) but also fields unique to each source (page_url vs. screen_name, cookie_id vs. device_id). A user asking "what's the total revenue across web and mobile?" shouldn't have to think about which table to query or how to UNION them manually.
So, we need a solution that can:
- Discover the schema of each source table automatically
- Merge those schemas into one unified structure
- Route incoming data from each source into the unified table in real-time
- Backfill historical data that was already synced before the Data Group was created
Once we've done this, we'll have a rich set of data that we can then use as a foundation for our future analytics.
Step 1: Schema Inference and Merging
When you create a Data Group, the first thing the system needs to do is inspect the source tables. ClickHouse makes this straightforward - you can query system.columns to get the full schema of any table:
SELECT
name,
type
FROM system.columns
WHERE database = 'my_clickhouse_db'
AND table = 'web_events'

The system runs this for each source event dataset we want to include in our data group, then merges the results into one unified schema. The rules are pretty straightforward:
- Fields that exist in all sources keep their type (or get promoted to a wider type if there's a mismatch - for example, if one source has Int32 and another has Float64, the unified type becomes Float64)
- Fields that only exist in some sources become nullable in the unified schema (since rows from other sources won't have values for those fields)
- Core fields are standardized - every Data Group gets a consistent set of core fields: event_id, device_id, person_id, event_timestamp, event_source, and is_identified
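The merge rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual implementation - the function names and the (simplified) type-widening table are assumptions:

```python
# Illustrative sketch of the schema-merge rules: shared fields keep or
# widen their type; fields missing from any source become Nullable.
# The widening table is deliberately simplified.

WIDEN = {
    frozenset({"Int32", "Float64"}): "Float64",
    frozenset({"Int32", "Int64"}): "Int64",
    frozenset({"Float32", "Float64"}): "Float64",
}

def merge_schemas(schemas):
    """Merge per-source {field: type} dicts into one unified schema."""
    all_fields = set().union(*(s.keys() for s in schemas))
    unified = {}
    for field in sorted(all_fields):
        types = {s[field] for s in schemas if field in s}
        if len(types) == 1:
            t = types.pop()
        else:
            # Widen on mismatch; fall back to String if no rule matches
            t = WIDEN.get(frozenset(types), "String")
        in_all = all(field in s for s in schemas)
        unified[field] = t if in_all else f"Nullable({t})"
    return unified

web = {"event_id": "String", "event_timestamp": "DateTime64(3)",
       "page_url": "String", "revenue": "Float64"}
mobile = {"event_id": "String", "event_timestamp": "DateTime64(3)",
          "screen_name": "String", "revenue": "Float64"}

print(merge_schemas([web, mobile]))
```

Running this against our example schemas yields page_url and screen_name as Nullable(String) while the shared fields keep their types - exactly the unified schema shown below.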
For our example, the unified events schema would look something like this:
| Field | Type | Source |
|---|---|---|
| event_id | String | Both |
| device_id | String | Both (mapped from cookie_id / device_id) |
| person_id | String | Both (more on this in a future identity stitching post) |
| event_timestamp | DateTime64(3) | Both |
| event_source | String | System-generated |
| is_identified | UInt8 | System-generated |
| page_url | Nullable(String) | Web only |
| referrer | Nullable(String) | Web only |
| screen_name | Nullable(String) | Mobile only |
| app_version | Nullable(String) | Mobile only |
| revenue | Nullable(Float64) | Both |
Note the two system-generated fields: event_source records which dataset each row came from so you can filter or break down results by source, and is_identified flags whether the row has a real person identifier or arrived without one (more on why this matters when we get to the materialized views). The device_id / person_id mapping is where field configuration comes in - you tell the system which field in each source represents the user identifier, and the Data Group normalizes it into a standard column.
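To make the field-configuration idea concrete, here's a hypothetical config shape and the expression it would generate. The config structure and helper name are illustrative assumptions, not the actual implementation:

```python
# Hypothetical Data Group configuration: each source declares which of
# its columns is the user identifier, and the system generates the
# normalizing SELECT expression from it.

data_group_config = {
    "name": "Web + Mobile Events",
    "sources": {
        "web_events": {"device_id_field": "cookie_id"},
        "mobile_events": {"device_id_field": "device_id"},
    },
}

def device_id_expr(source: str) -> str:
    """Build the SELECT expression that maps a source's identifier
    column to the standard device_id field."""
    field = data_group_config["sources"][source]["device_id_field"]
    return f"coalesce({field}, '') AS device_id"

print(device_id_expr("web_events"))   # coalesce(cookie_id, '') AS device_id
print(device_id_expr("mobile_events"))  # coalesce(device_id, '') AS device_id
```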
Automatic JSON Detection
One thing I added to the schema inference that's pretty handy: detecting JSON columns automatically. It samples rows from each source table and uses ClickHouse's isValidJSON() function to check whether any String columns actually contain valid JSON. If the sampled values pass the check (and they often do - marketing parameters, product details, geo data all tend to arrive as JSON blobs), the system auto-discovers the nested fields with distinctJSONPathsAndTypes() and makes them available as individual queryable fields. So instead of needing to write JSON_VALUE(marketing, '$.utm_source'), users just see marketing.utm_source as a normal field in the schema.
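The detection pass looks something like the following sketch. In ClickHouse the real work is done by isValidJSON() and distinctJSONPathsAndTypes(); here Python's json.loads stands in for the validity check and a simple key walk stands in for path discovery, so treat this as an approximation of the idea rather than the implementation:

```python
import json

# Sketch of JSON auto-detection: sample a column's values, check how
# many parse as JSON objects, and collect the top-level paths seen.

def looks_like_json_column(sample_values, threshold=0.9):
    """True if most sampled non-empty values parse as JSON objects."""
    parsed = 0
    checked = 0
    for v in sample_values:
        if not v:
            continue
        checked += 1
        try:
            if isinstance(json.loads(v), dict):
                parsed += 1
        except (ValueError, TypeError):
            pass
    return checked > 0 and parsed / checked >= threshold

def discover_paths(sample_values):
    """Collect top-level JSON keys seen across the samples."""
    paths = set()
    for v in sample_values:
        try:
            obj = json.loads(v)
        except (ValueError, TypeError):
            continue
        if isinstance(obj, dict):
            paths.update(obj.keys())
    return sorted(paths)

samples = ['{"utm_source": "google", "utm_medium": "cpc"}',
           '{"utm_source": "newsletter"}']
print(looks_like_json_column(samples))  # True
print(discover_paths(samples))          # ['utm_medium', 'utm_source']
```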
Step 2: Building the Union Table
With the unified schema figured out, the system creates the actual ClickHouse table that will hold all the combined data. We can't just query the Airbyte source tables directly - they have whatever sort order and partitioning the sync tool gave them, fields might need type coercion, user identifiers need normalizing, and anonymous (cookieless) traffic needs special handling.
Said another way, we need a tightly prepped table where all of that is cleaned up and the physical layout is optimized for the kinds of queries journey analytics actually runs. This union table also becomes the foundation for identity stitching - we'll cover that in a future post, but the short version is that resolving anonymous visitors to known people greatly benefits from a single, unified table to stitch against.
So, here's what the table creation looks like (simplified for our example):
CREATE TABLE unified_data_group (
event_id String,
device_id String,
person_id String,
event_timestamp DateTime64(3),
event_source String,
page_url Nullable(String),
referrer Nullable(String),
screen_name Nullable(String),
app_version Nullable(String),
revenue Nullable(Float64),
is_identified UInt8 DEFAULT 1,
inserted_at DateTime64(3) DEFAULT now64(3),
updated_at DateTime64(3) DEFAULT now64(3),
is_deleted UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (person_id, toDate(event_timestamp), event_id)

There's a lot packed into that statement, so let's break down the three key choices:
ReplacingMergeTree(updated_at, is_deleted) - This is the ClickHouse magic that lets our unified table inherit changes from the upstream datasets. As Airbyte syncs updates and deletes from the source warehouse, ReplacingMergeTree uses updated_at to keep only the latest version of each row and is_deleted to drop rows that were removed from the source. This makes sure our data group stays in perfect sync with our source warehouse.
ORDER BY (person_id, toDate(event_timestamp), event_id) - This is arguably the most important line. In ClickHouse, ORDER BY doesn't just affect query results - it controls how data is physically sorted on disk. We're sorting by person first, then date, then event ID. This means all events for the same person on the same day are stored right next to each other on disk. Why does this matter so much? Because journey analytics queries almost always process events per-person in time order (sessionization, attribution, pathing, funnel analysis). Having the data pre-sorted this way means ClickHouse can read it sequentially without any reshuffling. This one design choice is the single biggest factor in downstream query performance, and it'll pay off big time when we get to the query layer (which we'll discuss in a future post).
PARTITION BY toYYYYMM(event_timestamp) - This creates a separate data partition for each month of data. When a query includes a date range (and pretty much every analytics query does), ClickHouse can skip entire months of data that fall outside the range. This is called partition pruning, and it can mean the difference between scanning 12 months of data and scanning just the 2 months you actually care about.
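The arithmetic behind partition pruning is easy to see with a quick sketch (the helper names here are illustrative, not ClickHouse internals):

```python
from datetime import date

# Sketch of how toYYYYMM partitioning enables pruning: a query's date
# range maps to a small set of partition keys, and every partition
# outside that set can be skipped without reading it.

def partitions_for_range(start: date, end: date):
    """Partition keys (YYYYMM) a [start, end] query would need to scan."""
    keys = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(y * 100 + m)
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return keys

# A two-month query touches only 2 partitions out of a year's 12:
print(partitions_for_range(date(2026, 1, 10), date(2026, 2, 20)))  # [202601, 202602]
```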
Together, these three choices - the engine, the sort order, and the partitioning - turn what could be a slow, brute-force table scan into a fast, targeted read. This kind of control over physical data layout is what ClickHouse gives you that general-purpose warehouses don't, and it's a key reason we chose it. Journey analytics needs super fast, interactive queries - users dragging and dropping dimensions, applying segments, and expecting sub-second responses. That kind of experience requires the data to be physically optimized for the exact access patterns you're going to run. These foundational choices make the complex query-time processing and transformations we'll cover in a future post actually viable at scale.
Step 3: Materialized Views (The Real-Time Bridge)
Now that we have our unified table, how do we ensure that future data actually gets into it as rows are added or changed? This is where ClickHouse materialized views come in.
If you haven't worked with ClickHouse materialized views before, here's the simple version: a materialized view is like a standing query that runs automatically every time new data is inserted into a source table. It transforms the incoming rows and writes the results into a target table. Think of it like a permanent INSERT INTO ... SELECT that fires whenever new data arrives - no scheduling, no batch jobs, just instant data routing.
For Data Groups, we create one materialized view per source dataset, and all of them write to the same union table:
web_events    ──▶ [MV: web]    ──▶ unified_data_group
mobile_events ──▶ [MV: mobile] ──▶ unified_data_group
Here's what the materialized view for web events looks like:
CREATE MATERIALIZED VIEW mv_web_to_unified_data_group
TO unified_data_group
AS
SELECT
-- Use the source event_id if it exists, otherwise generate a unique UUID
-- so rows never collide during ReplacingMergeTree deduplication
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
-- Map cookie_id to the system field device_id
coalesce(cookie_id, '') AS device_id,
-- Default person_id to device_id; fall back to event_id for anonymous events
-- Also handles the literal string 'null' that some sources send
-- (this is where identity stitching will plug in later)
coalesce(nullIf(nullIf(cookie_id, 'null'), ''), event_id) AS person_id,
-- Flag whether this row has a real identifier or is anonymous
-- (same nullIf checks as person_id to strip out blank and 'null' junk values)
if(nullIf(nullIf(cookie_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
-- Parse timestamp flexibly; fall back to now() if malformed or missing
-- (OrNull variant prevents bad timestamps from blocking the entire MV)
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
-- Adding the dataset source it came from
'web_events' AS event_source,
-- Adding the data unique to or shared by web_events
page_url,
referrer,
revenue,
-- Putting in NULLs for the data that doesn't apply to web_events
CAST(NULL AS Nullable(String)) AS screen_name,
CAST(NULL AS Nullable(String)) AS app_version,
-- Other system fields
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM web_events

A few things to notice about this materialized view:
- Field mapping: cookie_id from the web table gets mapped to device_id in the union table. Fields that don't exist in this source (like screen_name) get NULLs. Each source has its own mapping like this.
- Anonymous event handling: Not every event arrives with a user identifier (privacy regulations, browsers blocking cookies, etc.), so person_id defaults to the device_id and falls back to event_id when even that is missing, keeping each anonymous event distinct. The is_identified flag marks whether the row has a real identifier (1) or a fallback (0), which drives downstream behavior like excluding anonymous rows from sessionization and person-level segments or metrics.
- Defensive defaults: Core fields use COALESCE to prevent null values from slipping through, generateUUIDv4() ensures every row gets a unique event ID even if the source doesn't provide one, and parseDateTimeBestEffortOrNull handles messy or malformed timestamps without blocking the entire materialized view.
- event_source tagging: Every row gets tagged with which dataset it came from, making it easy to filter or break down by source later.
The mobile events MV would look similar but with its own mapping - device_id maps directly, screen_name and app_version are populated, while page_url and referrer get NULLs.
Step 4: Backfilling Historical Data
There's one gap to fill: materialized views only capture data that arrives after the view is created. Any data that was already synced into the source tables before the Data Group was set up won't be picked up automatically.
So after creating the union table and the materialized views, the system runs a one-time backfill:
-- Backfill web events (same SELECT as the MV, run once against full history)
INSERT INTO unified_data_group
SELECT
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
coalesce(cookie_id, '') AS device_id,
coalesce(nullIf(nullIf(cookie_id, 'null'), ''), event_id) AS person_id,
if(nullIf(nullIf(cookie_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
'web_events' AS event_source,
page_url,
referrer,
revenue,
CAST(NULL AS Nullable(String)) AS screen_name,
CAST(NULL AS Nullable(String)) AS app_version,
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM web_events;
-- Backfill mobile events
INSERT INTO unified_data_group
SELECT
coalesce(event_id, toString(generateUUIDv4())) AS event_id,
coalesce(device_id, '') AS device_id,
coalesce(nullIf(nullIf(device_id, 'null'), ''), event_id) AS person_id,
if(nullIf(nullIf(device_id, 'null'), '') IS NOT NULL, 1, 0) AS is_identified,
coalesce(parseDateTimeBestEffortOrNull(toString(event_timestamp)), now64(3)) AS event_timestamp,
'mobile_events' AS event_source,
CAST(NULL AS Nullable(String)) AS page_url,
CAST(NULL AS Nullable(String)) AS referrer,
revenue,
screen_name,
app_version,
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted
FROM mobile_events

This is essentially the same SELECT as the materialized views, just executed once against the full historical data. After the backfill completes, the union table has the complete history from all sources - and the materialized views keep it current going forward as new data arrives.
Depending on data volume, backfills can take anywhere from a few seconds to several minutes. The system tracks backfill progress so the UI can show the user how things are coming along.
The Full Picture
Let's zoom out and look at the complete flow from raw warehouse data to a queryable Data Group:
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Warehouse   │────▶│   Airbyte    │────▶│  ClickHouse  │────▶│  Data Group  │
│  (BigQuery,  │     │   CDC Sync   │     │    Source    │     │  Union Table │
│  Snowflake)  │     │              │     │    Tables    │     │              │
└──────────────┘     └──────────────┘     └──────┬───────┘     └──────────────┘
                                                 │                     ▲
                                                 │    Materialized     │
                                                 └─────── Views ───────┘
- Warehouse holds the source of truth
- Airbyte syncs it to individual ClickHouse source tables (with CDC, append-only, or full refresh)
- Materialized Views transform and route data in real-time to the union table
- The Union Table (Data Group) is one unified, queryable table with data from all sources
From the user's perspective, none of this complexity is visible. They see a single Data Group called something like "Web + Mobile Events" with a clean list of fields they can use. They don't need to know that page_url comes from the web dataset and screen_name comes from mobile - they just see both as available fields in their report.
To orchestrate all of this, Claude and I built a handful of Cloud Function APIs that the UI calls:
- createDataGroup - The big one. Takes the user's dataset selections and field mappings, infers the unified schema, creates the union table and materialized views, runs the backfill, and enriches the schema with any auto-detected JSON fields. One API call kicks off the entire pipeline.
- getDataGroupSchema / getSampleData - Returns the unified schema and sample rows so the UI can show users what their Data Group looks like and let them preview data.
- deleteDataGroup - Tears down the materialized views, dictionaries, and union table in ClickHouse when a Data Group is removed.
Each of these is a Firebase Cloud Function (v2 HTTP) - lightweight, stateless, and easy to deploy. The heavy lifting all happens in ClickHouse; the API layer is mostly just orchestration and configuration management.
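As a rough sketch of what that orchestration looks like, createDataGroup just runs the pipeline steps in order. The step and helper names below are illustrative placeholders, not the actual API:

```python
# Hypothetical orchestration flow for createDataGroup. The run_step
# callback is injected so the sequence can be shown (and tested)
# without a real ClickHouse connection.

def create_data_group(sources, run_step):
    """Execute the Data Group creation pipeline, step by step."""
    steps = [
        ("infer_schema", sources),            # read system.columns, merge
        ("create_union_table", sources),      # CREATE TABLE ... ReplacingMergeTree
        ("create_materialized_views", sources),  # one MV per source
        ("backfill_history", sources),        # one-time INSERT ... SELECT
        ("detect_json_fields", sources),      # enrich schema with JSON paths
    ]
    return [run_step(name, args) for name, args in steps]

# Example: record the step order instead of touching a database.
log = []
create_data_group(["web_events", "mobile_events"],
                  lambda name, args: log.append(name))
print(log)
```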
The Complexity Budget
If you've been following this series, you might notice a pattern: each layer we build isn't wildly complex on its own. Airbyte handles sync. Materialized views handle routing. ReplacingMergeTree handles deduplication. The interesting part is in how these pieces compose together - and in making the right design choices so that downstream layers (like the query engine) can be as fast as possible.
That physical sort order we chose (person_id, toDate(event_timestamp), event_id) isn't just convenient - it's foundational. Every window function, every sessionization query, every attribution model that will run against this table benefits from data being physically sorted and partitioned by person and time. Getting this right here means the query layer can focus on analytical logic instead of fighting data layout.
In a future post, we'll see exactly how that pays off when we build a layered SQL architecture that turns these unified events into actual journey analytics - sessions, computed metrics, segments, and more. If you thought materialized views were fun, wait until you see what ClickHouse can do with nested window functions.
If you're interested in following along, I'd love to have you!
Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.