Query-Time Transformation: How to Turn Raw Events Into Journey Analytics
By Trevor Paulsen
This is part of a series where we're building a DIY journey analytics platform from scratch. If you're just joining, check out the earlier posts to catch up!
In the last couple of posts, we built the Data Group layer - combining multiple source datasets into one unified, queryable ClickHouse table - and then added identity stitching to resolve users across devices into a single person ID. But having unified, stitched data is only half the story - the real magic of journey analytics happens when someone actually runs a report.
Fair warning: this is probably the most technical post in the series so far. We're going to get into SQL, window functions, and nested query architecture. If that's your thing, you're in for a treat. If not, the first section on why everything has to happen at query time is still worth reading - it's the key architectural insight that drives everything else.
And it starts with a simple observation: journey analytics queries are way more complex than the SQL most people are used to writing...
Walkthrough of the 6-layer query architecture and how ClickHouse makes it fast
Why Everything Has to Happen at Query Time
If you've worked with a BI tool like Looker or Tableau, you're used to a pattern: define some metrics and dimensions in a semantic layer, and the tool generates a fairly straightforward SELECT ... GROUP BY ... ORDER BY query. Maybe a few JOINs, maybe a subquery. Pretty standard stuff. And for most analytics, you can get away with pre-computing a lot of the hard work - materializing session tables, pre-building aggregates, storing segment membership as flags on rows.
Journey analytics can't do that. The data is just too dynamic. Identity stitching constantly rewrites who did what. Disparate data sources arrive asynchronously - a call center interaction might land hours after the web visit it relates to. Underlying data changes retroactively - prices get updated, returns get processed, records get corrected. History is never really "done," which means anything you pre-compute is at risk of being wrong the moment it's written.
When Historical Data Won't Stay Still
Consider what happens in a typical journey analytics dataset over the course of a single day:
- Identity stitching rewrites who did what. When we learn that device_abc and cookie_xyz belong to the same person, every historical event for both identities gets re-attributed to a single unified person ID. Sessions that used to belong to two people now interleave into one timeline. Segments, attribution, and dimension persistence all shift retroactively. We covered identity stitching in depth in the last post, but the key insight is this: it doesn't just add data, it rewrites the context around existing data.
- Data sources arrive out of order. A batch of in-store purchase events might not sync until overnight. A call center log might land hours after the web visit it relates to. Every late arrival can change session boundaries, segment membership, and attribution for events that were already "processed."
- Underlying records change. Prices get updated, returns get processed, product categories get reclassified, records get corrected. CDC pipelines (like Airbyte) faithfully propagate these changes, but anything derived from the old values is now wrong.
If sessions, segments, or attribution were pre-computed, every one of these changes would require rewriting the affected downstream data. For a busy digital product with multiple data sources and active identity stitching, that's potentially millions of rows being reprocessed continuously - with a constant lag between when the change happens and when the derived data catches up. You'd essentially be building a second data pipeline just to keep your data tables consistent, and it would never fully catch up (not to mention be super expensive!).
Don't Pre-Compute. Recompute.
Query-time transformation sidesteps this problem entirely. The raw events table only stores the facts: what happened, when, and for which user ID. When anything changes - a user ID gets stitched, a late event arrives, a record gets corrected - the raw data reflects it. The next time someone runs a report, sessionization, segments, attribution, and everything else is recomputed fresh against the current truth. No stale materialized views, no backfill jobs, no lag. An analyst can even change definitions on the fly - try a 15-minute session timeout instead of 30, or reclassify marketing channels - and it's just a different parameter on the next query.
The tradeoff is that we're running a lot of expensive calculations on every request. That puts a lot of pressure on both the engine choice and the query structure. So after a lot of back and forth between Claude and me, we landed on the following 6-layer nested SQL query, where each layer computes one piece of the puzzle and then feeds its output into the next:
Raw Events Table
└─ Layer 1: Sessionization
└─ Layer 2: Object Array Expansion
└─ Layer 3: Metrics & Dimensions
└─ Layer 4: Segments
└─ Layer 5: Object Array Deduplication
└─ Layer 6: Final Aggregation
Let's walk through each layer.
Layer 1: Sessionization
Sessionization groups a person's events into "sessions" based on time gaps. If the gap between consecutive events exceeds a configurable timeout (typically 30 minutes), a new session starts. In SQL, this means using window functions: one to look at the previous event's timestamp (LAG), and another to do a cumulative sum that increments at each gap. ClickHouse doesn't allow nesting window functions, so we split this into two sub-layers:
-- Sub-layer 1a: Compute previous timestamp per person
SELECT *,
lagInFrame(event_timestamp, 1, toDateTime64('1970-01-01', 3))
OVER (PARTITION BY person_id ORDER BY event_timestamp) as prev_timestamp
FROM tenant.events_table FINAL
-- Sub-layer 1b: Compute session_id from time gaps
SELECT *,
CONCAT(toString(person_id), '_',
toString(SUM(CASE
WHEN dateDiff('millisecond', prev_timestamp, event_timestamp) > 1800000 THEN 1 ELSE 0
END) OVER (
PARTITION BY person_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
))
) as session_id
FROM sub_layer_1a

A few things worth noting here. We always sessionize - even if the current report doesn't explicitly need sessions. This simplifies the architecture and means session_id is always available for downstream layers. Session IDs are globally unique by prepending the person ID, so COUNT(DISTINCT session_id) works correctly across people. And the session timeout is configurable per query - e.g. 30 minutes for web analytics, 5 minutes for a mobile app, 60 minutes for a B2B product.
This is also where ClickHouse's table design pays off. As we covered in the Data Groups post, our union tables use ORDER BY (person_id, toDate(event_timestamp), event_id), so data is physically sorted on disk in exactly the order these window functions need. ClickHouse reads rows in sequence - no shuffling, no resorting. We'll come back to why this matters in the engine comparison below. The FINAL keyword tells ClickHouse's ReplacingMergeTree engine to deduplicate rows on the fly, handling CDC updates and deletes from Airbyte without separate cleanup jobs.
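To make the two-step window-function trick concrete, here's a minimal Python sketch of the same gap-based sessionization logic - LAG to get the previous timestamp, then a running count of "new session" flags. This is an illustration of the algorithm, not the production SQL; the function name and data shape are hypothetical.

```python
from datetime import datetime, timedelta

# Configurable per query, mirroring the 1800000 ms threshold in the SQL above
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """events: list of (person_id, timestamp) tuples, in any order.
    Returns list of (person_id, timestamp, session_id) tuples."""
    out = []
    # Mirrors PARTITION BY person_id ORDER BY event_timestamp
    for pid in sorted({p for p, _ in events}):
        person_events = sorted(t for p, t in events if p == pid)
        session_num = 0
        prev = None
        for ts in person_events:
            # Equivalent of SUM(CASE WHEN gap > timeout THEN 1 END) OVER (...)
            if prev is not None and ts - prev > SESSION_TIMEOUT:
                session_num += 1
            # Prepending person_id makes session IDs globally unique
            out.append((pid, ts, f"{pid}_{session_num}"))
            prev = ts
    return out
```

A 50-minute gap between two of a person's events yields a new session ID, while events within the timeout share one - exactly the behavior the cumulative sum produces in SQL.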
Layer 2: Array Expansion
Many event datasets contain arrays - products in a shopping cart, items in an order, ads displayed on a page. To analyze these at the individual item level, we need to expand them into separate rows using ClickHouse's LEFT ARRAY JOIN.
SELECT *
FROM sessionized_events
LEFT ARRAY JOIN
products as product,
arrayEnumerate(products) as products_idx

This turns one event with 3 products into 3 rows - one per product. The LEFT ensures events with empty arrays still appear (with NULL values for product fields). The arrayEnumerate gives us an index we'll need later for deduplication.
Not every query needs array expansion - it's only added when the requested dimensions or metrics reference array fields. But when it is needed, it fundamentally changes the cardinality of the data, which creates a deduplication challenge we'll address in Layer 5.
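If LEFT ARRAY JOIN semantics are unfamiliar, this rough Python equivalent may help - one output row per array element, with empty arrays preserved as a single row of nulls. The function and field names are hypothetical, chosen to mirror the SQL above.

```python
# Sketch of LEFT ARRAY JOIN: expand an array column into rows, keeping
# events whose array is empty (the "LEFT" part) as a single null row.
def left_array_join(rows, array_field):
    out = []
    for row in rows:
        items = row.get(array_field) or []
        # Singular column name for the expanded element, e.g. products -> product
        elem_field = array_field[:-1]
        if not items:
            # LEFT: the event survives even with an empty array
            out.append({**row, elem_field: None, "products_idx": None})
        else:
            # arrayEnumerate: 1-based index per element
            for idx, item in enumerate(items, start=1):
                out.append({**row, elem_field: item, "products_idx": idx})
    return out
```

Note how the expansion changes cardinality: a 2-product event becomes 2 rows, which is exactly why the deduplication flags in Layer 5 exist.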
Layer 3: Metric and Dimension Computation
This is where the semantic layer comes alive. Every dimension and metric is defined as a SQL expression that gets computed at the row level. There's a lot of depth to how metrics and dimensions work - enough that we'll dedicate a future post to just the semantic layer - but here's a taste of what these expressions look like:
-- A CASE-based dimension: Marketing Channel
CASE
WHEN referrer LIKE '%google%' AND utm_source = '' THEN 'Organic Search'
WHEN utm_source != '' THEN 'Paid'
WHEN referrer != '' THEN 'Referral'
ELSE 'Direct'
END as marketing_channel,
-- A window function dimension: Previous Page
lagInFrame(page_url, 1, '')
OVER (PARTITION BY person_id ORDER BY event_timestamp)
as previous_page,
-- A JSON field extraction: City
CAST(geo.city AS String) as city,
-- A raw metric value (aggregation happens later in Layer 6)
1 as _event_count,
revenue as _total_revenue

Notice that metrics are prefixed with an underscore and contain NO aggregation at this point - they're just raw row-level values. The SUM, COUNT DISTINCT, etc. happen later in Layer 6. This separation is key: it lets us apply segments and deduplication to the raw values before aggregating.
Layer 4: Segment Booleans
Segments are behavioral filters - "people who purchased red shoes" or "sessions that included an organic search visit." Segments are computed before aggregation, which means they can use window functions to look across all of a person's or session's events.
-- Person-scoped segment: "Purchasers"
-- TRUE for ALL events of a person if ANY of their events was a purchase
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END)
OVER (PARTITION BY person_id) as _seg_purchasers,
-- Session-scoped segment: "Organic Search Sessions"
-- TRUE for all events in a session if any event came from organic search
MAX(CASE WHEN marketing_channel = 'Organic Search' THEN 1 ELSE 0 END)
OVER (PARTITION BY person_id, session_id) as _seg_organic_sessions

The power here is that segments can reference computed dimensions from Layer 3 (like marketing_channel). They're applied after computation but before aggregation, which is exactly what you need for questions like "show me total revenue, but only from sessions that included an organic search visit."
In the final aggregation layer, these boolean flags get used as conditional filters on metrics - so you can have one column showing total revenue and another showing revenue from organic sessions, side by side in the same report.
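The MAX(...) OVER (PARTITION BY ...) pattern is simple once you see it outside SQL: every row in a partition gets flagged if ANY row in that partition matches. Here's a small Python sketch of the idea - the helper name and row shape are made up for illustration.

```python
# Sketch of a window-max segment flag: if any row in the partition matches
# the predicate, ALL rows in that partition get flag = 1.
def add_segment_flag(rows, flag_name, predicate, partition_keys):
    # First pass: find partitions containing at least one matching row
    matched = set()
    for row in rows:
        if predicate(row):
            matched.add(tuple(row[k] for k in partition_keys))
    # Second pass: broadcast the flag to every row in matching partitions
    for row in rows:
        row[flag_name] = 1 if tuple(row[k] for k in partition_keys) in matched else 0
    return rows
```

A person-scoped segment partitions by person_id alone; a session-scoped one partitions by (person_id, session_id) - same mechanism, different granularity.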
Layer 5: Deduplication Flags
This layer only activates when array expansion happened in Layer 2, and it solves a subtle but critical problem: when we expand a single event into multiple rows (one per product), event-level metrics like revenue get duplicated across those rows.
If a purchase event has revenue of $100 and contains 3 products, after array expansion we'd have 3 rows each showing $100. If we naively SUM(revenue), we'd get $300 instead of $100.
The solution is ROW_NUMBER() window functions that flag which rows should "own" each metric:
-- Flag first row per event (for event-level metric deduplication)
ROW_NUMBER() OVER (
PARTITION BY person_id, event_id, marketing_channel
ORDER BY products_idx
) as event_dedup_flag

In the aggregation layer, event-level metrics then use CASE WHEN event_dedup_flag = 1 to only count each event's value once, while array-level metrics (like product price) aggregate normally across all expanded rows.
This gets even more interesting with multiple arrays (products AND tags on the same event) or person-level metrics that need to be attributed without double-counting. The system supports different deduplication strategies depending on the relationship between metric granularity and dimension granularity - person, session, event, or array level.
Layer 6: Final Aggregation
The outermost layer brings it all together with GROUP BY, HAVING, and ORDER BY:
SELECT
marketing_channel as col_0,
SUM(COALESCE(_total_revenue, 0)) as col_1,
COUNT(DISTINCT _unique_sessions) as col_2,
SUM(IF(_seg_organic_sessions, _total_revenue, 0)) as col_3
FROM layer_5
WHERE _seg_purchasers = 1 -- Table-level segment filter
GROUP BY marketing_channel
HAVING (col_1 != 0 OR col_2 != 0 OR col_3 != 0)
ORDER BY col_1 DESC
LIMIT 1000

A few things happening here. The table-level segment (_seg_purchasers) is applied as a WHERE clause to filter the entire dataset before aggregation. Per-metric segments (_seg_organic_sessions) are applied inline using IF() so different metrics in the same report can use different segment filters. And the HAVING clause automatically filters out rows where all metrics are zero - cleaning up the results without requiring the user to think about it.
Why ClickHouse Is the Right Engine for This
For context: during my time at Adobe, the team invested in building a completely bespoke, non-SQL database engine purpose-built for exactly this kind of layered, query-time journey processing at massive scale. That investment is a testament to how seriously Adobe takes their user experience - and it's a big part of why their journey analytics is as powerful and unique as it is.
General-purpose warehouses like BigQuery and Snowflake simply weren't designed for this workload. But ClickHouse is the closest thing I've seen to an exception to that: an open-source engine whose architecture aligns so well with what journey analytics demands that we can approach purpose-built performance using SQL. Here's why I believe it's the best general-purpose engine for this workload:
Physical sort order makes layered queries viable. As we covered in the sessionization layer, ClickHouse's ORDER BY on the table definition is a hard physical guarantee - data is sorted on disk in exactly the order our window functions need. This is what makes it possible to run 6 layers of nested subqueries with multiple window functions and still get results in seconds rather than minutes. This is a guarantee that most general-purpose analytical databases don't offer.
ReplacingMergeTree handles CDC natively. As we saw with the FINAL keyword, ClickHouse deduplicates rows automatically during background merges, handling CDC updates and deletes from Airbyte and identity stitching without separate cleanup jobs.
Sub-100ms query startup. Every BigQuery query has a minimum 500ms-1s overhead for job scheduling, regardless of data size. When users are interactively toggling dimensions and segments - running maybe 20-30 queries in a single analysis session - that overhead adds up fast. ClickHouse queries begin executing immediately.
The API Layer: From Report Definition to SQL
All the SQL complexity we've walked through so far is generated automatically by our API layer. Nobody - not the UI, not the AI agent, not a developer integrating with the platform - ever writes these queries by hand. Instead, they send a simple JSON request that describes what they want to analyze:
{
"dimensions": ["marketing_channel", "device_category"],
"metrics": ["event_count", "total_revenue"],
"segments": ["organic_sessions"],
"filters": [{ "dimension": "page_url", "operator": "contains", "value": "/checkout" }],
"dateRange": { "start": "2026-01-01", "end": "2026-01-31" }
}

The API takes that request, looks up the SQL definitions for each dimension, metric, and segment, and assembles the full 6-layer nested query we've been discussing. The caller gets back aggregated results - they never see sessionization logic, deduplication flags, or window functions.
The lookup step is where Firebase Realtime Database comes in. Every metric, dimension, and segment is stored as a definition in Firebase RTDB - essentially a name, a SQL expression, and some metadata (data type, aggregation function, which data groups it applies to, etc.). When a report request arrives, the API reads those definitions from Firebase, which typically completes in single-digit milliseconds. That matters because these lookups happen on every query request, and we don't want the definition fetch to add meaningful latency on top of the ClickHouse query itself.
For example, the total_revenue metric might be stored as:
{
"id": "total_revenue",
"name": "Total Revenue",
"sql": "revenue",
"aggregation": "sum",
"dataType": "number",
"level": "event"
}

And the organic_sessions segment as:
{
"id": "organic_sessions",
"name": "Organic Search Sessions",
"sql": "marketing_channel = 'Organic Search'",
"scope": "session"
}

The API reads these definitions, injects the SQL expressions into the appropriate layers of the query, and handles all the plumbing - wrapping segments in the right window functions based on their scope, choosing the correct aggregation function for each metric, generating deduplication flags when array fields are involved, and so on. The result is that adding a new metric or dimension to the platform is just a matter of saving a new definition to Firebase - no code changes, no deployments, no query rewrites. There's a lot more depth to how the metrics and dimensions layer works - we'll dedicate the next post to that.
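As a rough illustration of that injection step, here's a minimal Python sketch of how a stored metric definition could be turned into a Layer 3 row-level expression and a Layer 6 aggregation expression. The template strings and function name are assumptions for this post - the real API layer does considerably more (segments, dedup flags, filters).

```python
# Hypothetical mapping from a definition's "aggregation" field to a
# Layer 6 SQL template. The _-prefix convention matches Layer 3 above.
AGG_TEMPLATES = {
    "sum": "SUM(COALESCE(_{id}, 0))",
    "count_distinct": "COUNT(DISTINCT _{id})",
}

def metric_to_sql(defn):
    """Turn a stored metric definition into (row-level expr, aggregation expr)."""
    # Layer 3: raw row-level value, underscore-prefixed, no aggregation yet
    row_expr = f"{defn['sql']} as _{defn['id']}"
    # Layer 6: the actual aggregate, chosen by the definition's metadata
    agg_expr = AGG_TEMPLATES[defn["aggregation"]].format(id=defn["id"])
    return row_expr, agg_expr
```

Feeding it the total_revenue definition above would yield `revenue as _total_revenue` for Layer 3 and `SUM(COALESCE(_total_revenue, 0))` for Layer 6 - which is exactly the shape of the final aggregation query we walked through.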
The Cost Reality
Performance aside, the economics of query-time transformation strongly favor a dedicated engine. BigQuery charges $6.25 per TB scanned on its on-demand model - fine for batch analytics, but journey analytics is interactive. An analyst toggling dimensions and segments might run 20-30 queries in a single session, each scanning the full working set.
At a billion rows, a single 6-layer journey analytics query scans roughly 20 GB on BigQuery, costing about $0.13. That doesn't sound like much - until you multiply it across a team. Five analysts averaging 100 queries each per day would cost about $2,000/month in BigQuery query fees alone. ClickHouse Cloud charges for compute time with auto-idle instead of data scanned, so the cost scales with how long queries actually run rather than how much data they touch - which tends to be significantly cheaper for this kind of interactive workload.
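For the skeptical, the arithmetic behind those figures is easy to reproduce. The scan size and query counts below are the article's rough estimates, not measurements:

```python
# Back-of-envelope BigQuery cost for interactive journey analytics.
BQ_PRICE_PER_TB = 6.25           # on-demand price, USD per TB scanned
SCAN_TB_PER_QUERY = 20 / 1024    # ~20 GB per 6-layer query at ~1B rows

cost_per_query = BQ_PRICE_PER_TB * SCAN_TB_PER_QUERY   # ~ $0.12-0.13
# 5 analysts x 100 queries/day x 30 days
monthly = cost_per_query * 5 * 100 * 30                # ~ $1,800-2,000
```

The per-query cost looks negligible; the monthly total does not - and it scales linearly with both data volume and query count, which is the crux of the comparison.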
The tradeoff comes if you're not already using ClickHouse - you'll need a data sync pipeline (Airbyte in our case) to get data out of the warehouse, which adds cost and complexity. For a small dataset, that can make total costs comparable. But as data and query volume scale, BigQuery's per-TB costs compound while ClickHouse's compute-based pricing stays relatively flat - making the economics increasingly lopsided.
The Complexity Is the Point
If you've read this far, you might be thinking: "this is a lot of moving parts for what amounts to a GROUP BY query." And you'd be right - it IS complex. But that complexity is precisely what journey analytics tools exist to abstract away.
The analyst using the UI doesn't see any of this. They drag a dimension onto a report, add a metric, apply a segment, and get results in a couple of seconds. Neither does the AI agent analyzing data on your behalf - it doesn't have to waste tokens and context generating all that layered SQL, because the API handles sessionization, segments, and deduplication automatically.
And that's also the honest tradeoff for anyone thinking about building something like this themselves: the individual concepts (sessionization, window functions, array joins) are well-documented and approachable. But composing them correctly into a single query that handles all edge cases - mixed granularity levels, multiple segments, array deduplication, lookback windows for sessions that span date boundaries - is where the real engineering effort lives.
I'll keep sharing how this evolves as we add more capabilities. Next up, we'll go deeper into the metrics and dimensions layer - how definitions are structured, how computed and window-function-based dimensions work, and how the semantic layer ties it all together. If you're interested in following along, connect with me on LinkedIn!
Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.