
The Semantic Layer: How Metrics and Dimensions Turn Messy Events Into Clean Analytics

By Trevor Paulsen

This is part of a series where we're building a DIY journey analytics platform from scratch. If you're just joining, check out the earlier posts to catch up!

In the last post, we went deep on the query-time architecture that powers our analytics engine - layers of nested SQL that handle sessionization, segments, array expansion, and aggregation. This time, we're going to cover the next important building block: dimensions and metrics.

Demo: Building metrics and dimensions in Trevorwithdata

What Is a Semantic Layer (and Why Do You Need One)?

If you've ever written a SQL query on top of digital event data, you've had to grapple with the gap between the raw data and the actual business answer. The data has fields like utm_source, utm_medium, referrer, and revenue - but there's no "Marketing Channel" field, and that revenue field might be inflated by duplicate page refreshes you didn't even know about. Raw fields need to be merged into business concepts, and sometimes cleaned up before they're trustworthy. That gap - translating raw fields into business-ready analytics - is what a semantic layer bridges.

Traditional BI semantic layers (e.g. LookML, dbt's MetricFlow, DAX, Tableau) are built around column definitions and aggregation rules - and that works great when each row is self-contained - but journey analytics data is more challenging. A marketing touch on event #1 drove the purchase on event #47, and a campaign parameter needs to carry forward to every downstream interaction. SELECT channel, SUM(revenue) GROUP BY channel won't work when the channel and the revenue live on completely different events, and your analysis depends on critical sequences of activity, not self-contained rows of data.

So for our solution, we'll define dimensions (categorical breakdowns like Marketing Channel) and metrics (aggregated measures like Revenue) with a simple row-level SQL expression (or let our AI assistant generate it from plain English), and we'll have the system handle the hard sequential stuff (carrying values forward, deduplicating, scoping to sessions or people) through configuration settings rather than code.

Let's dive in!

The Starting Point: Messy Event Data

Here's a realistic slice of what our unified events table looks like after identity stitching has resolved users across devices. These are two people browsing a website, each arriving from a different source and eventually making a purchase:

| person_id | event_timestamp | event_type | utm_source | utm_medium | referrer   | revenue | order_id |
|-----------|-----------------|------------|------------|------------|------------|---------|----------|
| bob       | 10:30:00        | page_view  | google     | cpc        | google.com |         |          |
| bob       | 10:32:00        | page_view  |            |            |            |         |          |
| bob       | 10:35:00        | page_view  |            |            |            |         |          |
| bob       | 10:36:00        | purchase   |            |            |            | 89.99   | ORD-881  |
| bob       | 10:36:01        | purchase   |            |            |            | 89.99   | ORD-881  |
| jane      | 11:00:00        | page_view  |            |            | google.com |         |          |
| jane      | 11:05:00        | page_view  |            |            |            |         |          |
| jane      | 11:12:00        | purchase   |            |            |            | 45.00   | ORD-902  |

A few things to notice. Bob arrived via a paid Google ad (utm_medium = cpc), but only the first event has the UTM parameters - the rest of his session has no marketing data at all. Jane arrived from Google organically (referrer but no UTMs). And Bob's purchase event appears twice (same order_id, timestamps one second apart) - a page refresh during checkout.

Now imagine you want a simple report: Revenue by Marketing Channel. That involves three challenges hiding in this data:

  1. There's no "Marketing Channel" field - it has to be derived from multiple raw fields
  2. Marketing data only exists on the entry event, but revenue is on the purchase event
  3. One purchase got counted twice

Let's solve each of these, one step at a time.

Step 1: Creating the Marketing Channel Dimension

The first problem is straightforward: we need to create a "Marketing Channel" concept from the raw UTM and referrer fields. In our system, this means creating a dimension with a SQL expression that defines the business logic:

CASE
  WHEN utm_medium IN ('cpc', 'ppc') THEN 'Paid Search'
  WHEN utm_source IN ('facebook', 'instagram', 'linkedin') THEN 'Paid Social'
  WHEN utm_medium = 'email' THEN 'Email'
  WHEN referrer LIKE '%google%' OR referrer LIKE '%bing%' THEN 'Organic Search'
  WHEN referrer != '' THEN 'Referral'
  ELSE 'Direct'
END

This is the only SQL the user (or AI assistant) needs to write. Notice that it's purely row-level logic - it looks at the fields on a single event and decides what to call it. The hard part - carrying this value forward to events that don't have marketing data, deduplicating, scoping to sessions - gets handled through simple configuration toggles, no SQL required. We'll cover those next.
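To make the row-level nature of this concrete, here's a rough Python equivalent of the same CASE logic - a sketch for illustration only (field names match the example table; this is not the engine's actual code):

```python
def marketing_channel(event: dict) -> str:
    """Row-level channel classification, mirroring the CASE expression above.
    Looks only at fields on a single event - no cross-event logic."""
    if event.get("utm_medium") in ("cpc", "ppc"):
        return "Paid Search"
    if event.get("utm_source") in ("facebook", "instagram", "linkedin"):
        return "Paid Social"
    if event.get("utm_medium") == "email":
        return "Email"
    referrer = event.get("referrer", "")
    if "google" in referrer or "bing" in referrer:
        return "Organic Search"
    if referrer:
        return "Referral"
    return "Direct"

print(marketing_channel({"utm_medium": "cpc", "referrer": "google.com"}))  # Paid Search
print(marketing_channel({"referrer": "google.com"}))  # Organic Search
print(marketing_channel({}))  # Direct
```

Note how Bob's entry event classifies as "Paid Search" while his later, empty events fall through to "Direct" - exactly the gap persistence will close.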

The API layer stores this expression alongside a name, a data type, and which data groups it applies to:

{
  "id": "marketing_channel",
  "name": "Marketing Channel",
  "sql": "CASE WHEN utm_medium IN ('cpc', 'ppc') THEN 'Paid Search' ... END",
  "dataType": "string",
  "dataGroups": ["web_and_mobile"]
}

If you've used derived or calculated fields in analytics tools before, this will feel familiar - except instead of a visual builder with preset functions, you write any valid SQL expression. That means anything ClickHouse SQL supports is fair game: string functions, date math, conditional logic, JSON extraction, you name it.

For example, say your marketing team uses a custom URL parameter like ?cid=spring_email instead of a standard UTM field, and that value never made it into a dedicated column. No problem, you can extract it right in the dimension SQL:

CASE
  WHEN utm_medium IN ('cpc', 'ppc') THEN 'Paid Search'
  WHEN utm_source IN ('facebook', 'instagram', 'linkedin') THEN 'Paid Social'
  WHEN utm_medium = 'email'
    OR extractURLParameter(page_url, 'cid') LIKE '%email%' THEN 'Email'
  WHEN referrer LIKE '%google%' OR referrer LIKE '%bing%' THEN 'Organic Search'
  WHEN referrer != '' THEN 'Referral'
  ELSE 'Direct'
END

The extractURLParameter function pulls the value of cid directly from the raw page URL at query time - no ETL step needed. If the data is anywhere in the event row, you can get to it. This is one of the big advantages of using SQL for dimension definitions rather than a fixed menu of preset functions.
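If you want to sanity-check that kind of extraction logic outside the database, the same parsing is easy to mimic. Here's a Python analogue of extractURLParameter built on the standard library - a sketch for experimentation, not the ClickHouse implementation:

```python
from urllib.parse import urlparse, parse_qs

def extract_url_parameter(url: str, name: str) -> str:
    """Return the first value of a query-string parameter, or '' if absent -
    roughly the behavior described for ClickHouse's extractURLParameter."""
    params = parse_qs(urlparse(url).query)
    return params.get(name, [""])[0]

print(extract_url_parameter("https://example.com/landing?cid=spring_email", "cid"))
# spring_email
```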

Ok, so we've got our Marketing Channel dimension defined. To illustrate where we're going next, if we were to take our example table from above and run a Revenue by Marketing Channel report as-is, we'd wind up with something like this:

| Marketing Channel | Revenue |
|-------------------|---------|
| Paid Search       | $0.00   |
| Organic Search    | $0.00   |
| Direct            | $224.98 |

Obviously not what we're looking for. All the revenue landed in "Direct" because the purchase events don't have any UTM or referrer data (look back at the raw data - those fields are empty on the purchase rows), so the CASE statement falls through to the ELSE clause. Since we know Bob came from a paid ad and Jane came from organic search, we'd ideally want their revenue attributed to those channels. And we definitely don't want Bob's revenue doubled up from that duplicate purchase event.

So the next two things we need to solve are dimension persistence (carrying marketing data forward to downstream events) and metric deduplication (making sure we don't double-count revenue).

Step 2: Dimension Persistence

This is where things get fun. Instead of trying to bake this carry-forward logic into the SQL expression (nobody wants to write complex window functions), we'll solve it at the configuration level with a feature called dimension persistence.

Persistence tells the system: "when this dimension has a value, carry it forward to subsequent events that don't." In practice, you're adding a small configuration to the dimension definition:

{
  "id": "marketing_channel",
  "name": "Marketing Channel",
  "sql": "CASE WHEN utm_medium IN ('cpc', 'ppc') THEN 'Paid Search' ... END",
  "dataType": "string",
  "persistence": {
    "allocation": "last",
    "expiration": "session"
  }
}

Two settings, and they're intuitive:

  • Allocation: "last" means we use the most recent non-empty value. If the person's first page view has a Marketing Channel of "Paid Search" and their next three events don't have a value, all three inherit "Paid Search." (You can also choose "first" for first-touch attribution, or "accumulate" to collect all values the person has seen - useful for multi-touch analysis.)
  • Expiration: "session" means the persisted value resets at the start of each new session. Bob's "Paid Search" carries through his current session but won't bleed into a session next week. (Other options include "person" for lifetime persistence, a custom number of days, or even a SQL expression like "expires when order_id IS NOT NULL" to reset after a purchase.)

To appreciate why this is better as a config setting than hand-written SQL, here's a simplified version of what the query engine generates under the hood for last-touch session persistence:

LAST_VALUE(marketing_channel) IGNORE NULLS OVER (
  PARTITION BY person_id, session_id
  ORDER BY event_timestamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as marketing_channel

This window function scans each person's session in chronological order, carrying forward the most recent non-null value of marketing_channel to every subsequent row. And remember, these person_id values are the stitched identities we built earlier - so persistence works correctly even when a person has been resolved across multiple devices and data sources.

Now imagine writing this by hand for every dimension that needs persistence, making sure it integrates correctly with sessionization, segments, and aggregation across the rest of the query layer cake, and keeping it consistent across every report. That's exactly the kind of complexity a semantic layer should absorb - and why we made it a simple config toggle instead.
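The carry-forward behavior itself is easy to picture in a few lines of Python. Here's a sketch of last-touch, session-scoped persistence - illustrative only, with made-up field names (session_id, ts, channel), not the engine's actual code:

```python
def persist_last_touch(events):
    """Carry the most recent non-empty dimension value forward within each
    (person_id, session_id) pair, resetting at session boundaries."""
    carried = {}  # (person_id, session_id) -> last non-empty value seen
    out = []
    for e in sorted(events, key=lambda e: (e["person_id"], e["session_id"], e["ts"])):
        key = (e["person_id"], e["session_id"])
        if e.get("channel"):          # non-empty value: update the carried state
            carried[key] = e["channel"]
        out.append({**e, "channel": carried.get(key)})
    return out

events = [
    {"person_id": "bob", "session_id": 1, "ts": 1, "channel": "Paid Search"},
    {"person_id": "bob", "session_id": 1, "ts": 2, "channel": None},  # inherits
    {"person_id": "bob", "session_id": 2, "ts": 3, "channel": None},  # new session: resets
]
print([e["channel"] for e in persist_last_touch(events)])
# ['Paid Search', 'Paid Search', None]
```

Swapping "last" for "first" allocation would just mean never overwriting an already-carried value; "person" expiration would mean keying on person_id alone.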

Now if we were to pull that report again:

| Marketing Channel | Revenue |
|-------------------|---------|
| Paid Search       | $179.98 |
| Organic Search    | $45.00  |

Getting closer! Revenue is attributed to the channels that actually drove it. Bob's purchase correctly shows as "Paid Search" because the value carried forward from his entry event. Jane's purchase shows as "Organic Search" for the same reason. No more misleading "Direct" bucket.

You'll notice we still have a key problem though - Bob's revenue got doubled up because of that duplicate purchase event. $179.98 instead of $89.99. Let's talk about the second configuration toggle: metric deduplication.

Step 3: Metric Deduplication

Bob's two purchase events have the same order_id (ORD-881), timestamps one second apart - a classic page refresh during checkout. Both rows have revenue = 89.99, so when the system sums them up, we get double the actual revenue.

This happens all the time in real-world data. Page refreshes, client-side retry logic, pipeline retries, cross-device event duplication - there are a hundred ways the same event can show up more than once. And it quietly inflates your numbers in ways that are hard to prevent.

The fix is metric deduplication. When we create our Revenue metric, we add a dedup configuration:

{
  "id": "total_revenue",
  "name": "Total Revenue",
  "sql": "revenue",
  "aggregation": "sum",
  "deduplication": {
    "scope": "person",
    "dedupField": "order_id",
    "keepInstance": "first"
  }
}

This tells the system: "for each unique order_id per person, only count the first occurrence." Bob's two purchase events with ORD-881 get deduplicated down to one. Here's a simplified version of the SQL that gets generated:

-- Flag which occurrence to keep per person + order_id
ROW_NUMBER() OVER (
  PARTITION BY person_id, order_id
  ORDER BY event_timestamp ASC
) as _dedup_flag
 
-- Then in the final aggregation layer:
SUM(CASE WHEN _dedup_flag = 1 THEN revenue ELSE 0 END) as total_revenue

Straightforward enough in isolation - but in a real query, this needs to compose correctly with dimension persistence, session boundaries, segment filters, and array expansion across multiple metrics, each with their own dedup rules. It's the kind of thing that's manageable to write once and miserable to maintain by hand across dozens of reports. Much easier to let the system handle it.
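In plain Python, the same first-occurrence rule looks something like this - a simplified sketch of the described behavior (made-up field names, not the generated SQL's real structure):

```python
def dedup_sum(events):
    """Sum revenue, counting only the first occurrence of each
    (person_id, order_id) pair in timestamp order."""
    seen = set()
    total = 0.0
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["person_id"], e["order_id"])
        if e["order_id"] and key not in seen:
            seen.add(key)
            total += e["revenue"]
    return round(total, 2)

events = [
    {"person_id": "bob", "ts": 1, "order_id": "ORD-881", "revenue": 89.99},
    {"person_id": "bob", "ts": 2, "order_id": "ORD-881", "revenue": 89.99},  # refresh dupe
    {"person_id": "jane", "ts": 3, "order_id": "ORD-902", "revenue": 45.00},
]
print(dedup_sum(events))  # 134.99
```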

You can also skip the dedup field entirely and just say "this metric should only fire once per person" (or once per session) - useful for metrics like "Purchasers" where you want to count whether someone purchased, not how many purchase events they generated.
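A person-scoped "count once" metric follows the same pattern with the person as the only dedup key. A minimal sketch (hypothetical field names, for illustration):

```python
def purchasers(events):
    """Count distinct people with at least one purchase event -
    a 'fire once per person' metric with no dedup field."""
    return len({e["person_id"] for e in events if e["event_type"] == "purchase"})

events = [
    {"person_id": "bob", "event_type": "purchase"},
    {"person_id": "bob", "event_type": "purchase"},  # refresh duplicate
    {"person_id": "jane", "event_type": "purchase"},
]
print(purchasers(events))  # 2
```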

The Final Report 🎉

Looking at the final report, we'd see something like this:

| Marketing Channel | Revenue |
|-------------------|---------|
| Paid Search       | $89.99  |
| Organic Search    | $45.00  |

That's the right answer. Clean, correctly attributed, deduplicated.

Let's step back and appreciate what just happened. We started with raw events where marketing data was scattered across multiple fields, only present on entry events, and contaminated with duplicate purchase records. We ended up with a trustworthy Revenue by Marketing Channel report. And the total configuration to get here was:

  1. One SQL expression - the CASE WHEN that defines Marketing Channel
  2. Two persistence settings - last-touch allocation, session expiration
  3. Three dedup settings - person scope, order_id key, keep first

No window functions. No CTEs. No data engineering pipeline. No pre-computed tables that go stale. And because everything runs at query time through our layered query architecture, the user can change any of these settings and immediately see different results. Want to try first-touch attribution instead? Flip the allocation to "first" and re-run. Want to see what happens with person-level expiration? Change one setting. The data doesn't need to be reprocessed - the query just runs differently.

The Semantic Layer: Define Once, Use Everywhere

The Marketing Channel example highlights something deeper about how the semantic layer works. Once you've defined the Marketing Channel dimension with its persistence settings, it's available everywhere - in any report, any segment, any breakdown. You define it once, and it works consistently whether you're looking at Revenue by Marketing Channel, Sessions by Marketing Channel, or building a segment for "people who arrived via Paid Search."

Same with Total Revenue. The deduplication settings travel with the metric definition. Every report that uses Total Revenue automatically gets deduplicated - no one has to remember to add special logic each time.

This is fundamentally different from writing ad-hoc SQL. In a warehouse, every analyst who wants "Revenue by Marketing Channel" would need to write (or copy-paste) the CASE WHEN logic, figure out persistence with window functions, and handle deduplication - and hope they all did it consistently. With a semantic layer, that complexity gets encoded once and becomes an organizational standard.

It's also what makes journey analytics tools different from general-purpose BI platforms. Tools like Looker and Tableau have semantic layers too, but they're built for the SELECT ... GROUP BY world. They don't natively handle the challenges that make journey data uniquely tricky - values that appear on some events but not others, identities that span devices, events that duplicate across systems. Journey analytics semantic layers are purpose-built for exactly these problems.

Wrapping Up

If you're building your own analytics and find yourself writing the same window functions over and over to handle attribution and deduplication, a semantic layer like this can simplify things dramatically. And if you'd rather not build one from scratch - well, that's kind of the whole point of what we're building here 😄

We're continuing to add more capabilities to the metrics and dimensions layer (calculated metrics, date-based dimensions, and more), and I'll keep sharing the journey as it unfolds. If you want to see metrics and dimensions in action, check out the video demo above - I walk through the full flow of creating dimensions with persistence and metrics with deduplication in the Trevorwithdata UI.

If you want to try it out for yourself, head over to trevorwithdata.com to sign up for the waitlist (or just click the link at the bottom of this post). I'd love to get more people kicking the tires and sharing feedback!


Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.
