Identity Stitching: How to Resolve Users Across Devices for Journey Analytics
By Trevor Paulsen
This is part of a series where we're building a DIY journey analytics platform from scratch. If you're just joining, check out the earlier posts to catch up!
Demo: Cross-device identity stitching in action
Of all the product capabilities we worked on during my time at Adobe, identity stitching was always the one that generated the most questions - and I can understand why. It's really complicated, and it's foundational to any journey analysis you do. So for this post, we're going to go through what it is, and how I went about creating a version of identity stitching for our DIY journey analytics tool using ClickHouse.
Let's start with a scenario that likely plays out in your data all the time: a person browses your website on their laptop (cookie_id: abc123) and eventually logs in with their email (trevor@email.com). Later, they open your mobile app on their phone (device_id: phone_xyz) and log in with that same email. The next week, they make a purchase in your physical store (CRM ID: CRM-4821). Then they call your support line about the purchase (phone number: 555-0142).
In this example, that one person had four different identifiers across four different systems. If you're trying to understand this person's journey, you need to know that all four of those interactions belong to the same person and map them back to a single person ID.
Without identity stitching, your analytics would see four separate "people." That's the fundamental problem identity stitching solves: taking a collection of event-level identifiers (cookies, device IDs, CRM IDs, phone numbers) and resolving them to a single person_id that ties the full journey together.
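To make that concrete, here's a toy sketch of the scenario above in Python. Everything here is illustrative (the map contents and field names are assumptions, not a real schema) - it just shows how four raw identifiers collapse to one person once an identity map exists:

```python
# Hypothetical identity map for the scenario above: each raw identifier
# maps to whatever person-level IDs we've learned about it so far.
identity_map = {
    "abc123":    {"crm_id": "CRM-4821", "email": "trevor@email.com"},  # laptop cookie
    "phone_xyz": {"crm_id": "CRM-4821", "email": "trevor@email.com"},  # mobile device
    "CRM-4821":  {"crm_id": "CRM-4821", "email": None},                # in-store purchase
    "555-0142":  {"crm_id": "CRM-4821", "email": None},                # support call
}

def resolve(raw_id: str) -> str:
    """Map a raw event identifier to the best person-level ID we know."""
    known = identity_map.get(raw_id, {})
    return known.get("crm_id") or known.get("email") or raw_id

raw_ids = ["abc123", "phone_xyz", "CRM-4821", "555-0142"]
unstitched = set(raw_ids)                 # four separate "people" without stitching
stitched = {resolve(r) for r in raw_ids}  # one person with stitching
```

An unknown identifier simply resolves to itself, which is exactly the fallback behavior the real stitching logic uses later in this post.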
It sounds simple on the surface ("just map device IDs to people, right?") but it gets incredibly hard once you start dealing with real-world data at scale. When I was at Adobe, we regularly had companies with really sophisticated data engineering teams tell us they were going to build their own identity stitching - only to come back months later and ask us to do it for them.
So, I'm not going to sit here and tell you that solving this is easy (even with Claude Code), but the good news is that a system covering the fundamentals is very doable by leveraging some key capabilities of ClickHouse. Stitching, at its core, is really just two things:
- Applying existing identities to new events as they arrive. When a web event comes in with a cookie ID, we look up that cookie in an identity map (a lookup of one identity to another) and stamp the event with the best person-level identifier we have at that moment.
- Retroactively applying new identity information to historical events. When we learn something new (e.g. that cookie_abc actually belongs to trevor@email.com), we need to go back and update all the historical events so they're attributed to the right person. This is called replay.
Fundamentally, that's it. Everything we'll cover in the rest of this post is just the engineering to make those two things work reliably and efficiently, in a way that honors privacy and deletion requirements.
So, to dive in you'll want to be familiar with the last post, where we built the Data Group layer - combining multiple source datasets into one unified ClickHouse table. With that background, let's now build the identity layer on top of it.
The Identity Map: Bring Your Own
The first thing you need for identity stitching is an identity map - a dataset that maps a set of primary IDs (typically the device or cookie level identifiers you need to resolve) to "person" IDs. It might look something like this:
| primary_id | crm_id | email | updated_at |
|---|---|---|---|
| abc123 | CRM-4821 | trevor@email.com | 2026-01-15 10:30 |
| def456 | CRM-7733 | jane@co.com | 2026-01-14 09:00 |
| ghi012 | NULL | john@email.com | 2026-01-15 12:00 |
Each row maps a primary ID to one or more known person-level identifiers (crm_id, email, etc.). In the case where we have multiple types of IDs that could be used to identify a person, we'll use cascading rules - CRM ID first, then email, then fall back to the raw primary ID. Lastly, the updated_at column is important - it's how we track when mappings change, which drives the replay mechanism we'll also cover below.
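The cascade is easy to express in code. Here's an illustrative sketch (field names taken from the table above; this is not the production implementation) of how the fallback order plays out, including the row where crm_id is NULL:

```python
# Rows from the example identity map above.
rows = [
    {"primary_id": "abc123", "crm_id": "CRM-4821", "email": "trevor@email.com"},
    {"primary_id": "def456", "crm_id": "CRM-7733", "email": "jane@co.com"},
    {"primary_id": "ghi012", "crm_id": None,       "email": "john@email.com"},
]

def cascade(row: dict) -> str:
    """CRM ID first, then email, then the raw primary ID - mirroring the
    coalesce/nullIf chain used at stitch time."""
    for field in ("crm_id", "email"):   # priority order
        value = row.get(field)
        if value:                       # skips None and '' the way nullIf does
            return value
    return row["primary_id"]
```

With these rows, ghi012 resolves to john@email.com because its crm_id is NULL, and a row with nothing but a primary ID falls back to the primary ID itself.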
How do you generate this table? It depends on your company - some use dedicated identity resolution platforms, others build their own logic in dbt or Spark jobs. The simplest approach (in my opinion) is just a scheduled query that pulls the latest login events and incorporates new ID relationships into the map over time. That said, this is a whole topic unto itself, and maybe something we'll cover in a future blog post!
Regardless of how you build it, here's the key design decision: the identity map lives in the warehouse, not in the analytics system. Our approach is "bring your own identity map" - you construct it however you want, store it in BigQuery/Snowflake/Databricks alongside the rest of your data, and we consume it the same way we consume any other dataset.
Why this approach?
- The warehouse is the source of truth. If your identity resolution logic runs in dbt or a data pipeline, the warehouse already has the canonical output. Duplicating that logic in the analytics system would mean maintaining two sources of truth.
- Separation of concerns. Identity resolution is a complex domain with its own set of tradeoffs (deterministic vs. probabilistic, transitive closure, conflict resolution, graph collapse, edge pruning, etc.). Keeping it separate means the analytics system can focus on what it does well - query-time transformation and interactive analysis.
- Flexibility. Different customers have wildly different identity resolution needs. A B2C company with millions of anonymous visitors has different requirements than a B2B SaaS company where every user has a login. By accepting any identity map as input, we support all of these without opinionated logic baked into the platform.
The tradeoff is that you need to have (or build) an identity map. But in practice, most analytics teams don't have too much trouble creating this.
Part 1: Stitching New Events in Real Time
If you read the last post, you know that a Data Group combines multiple source datasets into a single unified ClickHouse table using materialized views. Identity maps plug into this architecture as a special dataset type. Here's where the Data Group architecture really pays off - remember the table engine?
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY (person_id, toDate(event_timestamp), event_id)

Both choices were designed with identity stitching in mind. ORDER BY starts with person_id so that once stitching resolves a person, all their events end up physically sorted together on disk - exactly what downstream queries need. And ReplacingMergeTree lets us soft-delete old versions of events and insert updated ones without expensive UPDATEs - the same mechanism that handles CDC updates from the warehouse handles identity replay too.
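To make the ReplacingMergeTree behavior concrete, here's a toy Python model of what "keep the newest version per key, honoring the deleted flag" means. This is a deliberately simplified sketch - ClickHouse's actual merge machinery is far more involved - but the observable semantics match:

```python
def merge_final(rows):
    """Toy model of ReplacingMergeTree(updated_at, is_deleted) + FINAL:
    keep only the newest version of each row key, then drop soft-deleted rows."""
    latest = {}
    for row in rows:
        # Stand-in for the ORDER BY key: note person_id is part of it,
        # so a re-stitched event is a *different* key from the old version.
        key = (row["person_id"], row["event_id"])
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return [r for r in latest.values() if r["is_deleted"] == 0]

rows = [
    {"event_id": "e1", "person_id": "abc123",   "updated_at": 1, "is_deleted": 0},  # original
    {"event_id": "e1", "person_id": "abc123",   "updated_at": 2, "is_deleted": 1},  # soft delete
    {"event_id": "e1", "person_id": "CRM-4821", "updated_at": 2, "is_deleted": 0},  # re-insert, stitched
]
```

Running merge_final over those three physical rows leaves exactly one logical event, now attributed to CRM-4821 - which is the whole replay trick in miniature.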
To solve the first half of the stitching problem, we plug directly into the materialized views that handle incoming data. Without stitching, the MV defaults person_id to the primary identifier. With stitching enabled, it JOINs against the identity map and cascades through resolution rules (e.g., crm_id first, then email, then fall back to the raw primary ID):
-- With identity stitching - ANY LEFT JOIN against the identity map
SELECT
...
coalesce(
-- Priority 1: Use CRM ID from identity map (strongest signal)
nullIf(im._im_crm_id, ''),
-- Priority 2: Use email from identity map
nullIf(im._im_email, ''),
-- Priority 3: Use the raw cookie ID from the event
nullIf(nullIf(cookie_id, 'null'), ''),
-- Last resort: Use event_id to prevent anonymous events from conflating
event_id
) AS person_id,
...
FROM web_events AS e
ANY LEFT JOIN (
SELECT
toString(primary_id) AS _im_key,
toString(crm_id) AS _im_crm_id,
toString(email) AS _im_email
FROM identity_map_table
) AS im ON e.cookie_id = im._im_key

The ANY LEFT JOIN returns at most one match per key, while the LEFT ensures events without a match still pass through unstitched. The COALESCE chain checks each identity field in priority order - nullIf(..., '') converts empty strings to NULL so it can skip to the next option. If nothing matches, we fall back to the raw cookie ID, and as a last resort, the event ID.
Because this runs inside the materialized view, it executes automatically for every new event. No batch jobs, no scheduling - events get stitched in real time as they're ingested.
Part 2: Retroactive Stitching (Replay)
Everything above handles the forward-looking case. But identity maps aren't static. New mappings appear when users log in for the first time. Existing mappings merge when two previously separate identities turn out to be the same person. Mappings get corrected when data quality issues are fixed.
When a mapping changes, every historical event stitched using the old mapping is now outdated and needs to be re-stitched. This is the replay problem, and it's what makes identity stitching genuinely hard - you're retroactively rewriting the existing historical data.
The Replay Mechanism: Atomic Soft Delete + Re-Insert
Our replay takes advantage of ReplacingMergeTree with a three-step strategy: detect which identity mappings changed since the last replay cursor, find the affected events, and then - in a single atomic INSERT statement - both soft-delete the old versions and re-insert them with updated person IDs:
-- Single atomic INSERT: soft delete old versions AND re-insert with updated person_id
INSERT INTO unified_data_group
SELECT * FROM (
-- Part 1: Soft delete events with outdated person_id
SELECT * REPLACE(
now64(3) AS updated_at, -- New timestamp so ReplacingMergeTree picks this up
1 AS is_deleted -- Mark as deleted
)
FROM unified_data_group
WHERE device_id IN ('abc123', 'xyz789') -- Changed primary IDs
AND event_source IN ('web_events', 'mobile_events')
AND is_deleted = 0
UNION ALL
-- Part 2: Re-insert events with updated person_id from identity map
SELECT * REPLACE(
coalesce( -- Fresh person_id from updated identity map
nullIf(im._im_crm_id, ''),
nullIf(im._im_email, ''),
e.device_id
) AS person_id,
now64(3) AS inserted_at,
now64(3) AS updated_at,
0 AS is_deleted -- Active row
)
FROM unified_data_group AS e
ANY LEFT JOIN (
SELECT toString(primary_id) AS _im_key,
toString(crm_id) AS _im_crm_id,
toString(email) AS _im_email
FROM identity_map_table
) AS im ON e.device_id = im._im_key
WHERE e.device_id IN ('abc123', 'xyz789')
AND e.event_source IN ('web_events', 'mobile_events')
AND e.is_deleted = 0
)

A few key aspects of this approach:
The operation is atomic. Both the soft delete and re-insert happen in a single INSERT. There's never a window where events are "missing" - queries during replay see either the old state or the new state, never a partial state.
ReplacingMergeTree handles cleanup. Soft-deleted rows get superseded by re-inserted rows during ClickHouse's background merges. The FINAL keyword in our query engine (more on that in a future post) ensures correct results even before cleanup runs.
It's entirely incremental. Replay is driven by a CDC cursor - the updated_at value of the last identity mapping we processed - so each cycle only touches events whose primary ID appears in the changed mappings. But it does process 100% of the device history for affected primary IDs - if a cookie was first seen 18 months ago and its mapping just changed today, that 18-month-old event gets re-stitched too.
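The cursor logic itself is simple. Here's a hedged sketch (row shapes and timestamps are invented for illustration) of the incremental driver: pick up mappings that changed since the last cursor, collect their primary IDs, and advance the cursor only when there's work to do:

```python
def find_replay_work(identity_map_rows, last_cursor):
    """Incremental replay driver: return the primary IDs whose mappings
    changed since last_cursor, plus the new cursor position."""
    changed = [r for r in identity_map_rows if r["updated_at"] > last_cursor]
    if not changed:
        return [], last_cursor  # nothing changed - skip this cycle entirely
    affected_ids = sorted({r["primary_id"] for r in changed})
    new_cursor = max(r["updated_at"] for r in changed)
    return affected_ids, new_cursor

identity_map_rows = [
    {"primary_id": "abc123", "updated_at": 105},
    {"primary_id": "def456", "updated_at": 90},   # unchanged since last cycle
    {"primary_id": "xyz789", "updated_at": 110},
]
ids, cursor = find_replay_work(identity_map_rows, last_cursor=100)
```

Only abc123 and xyz789 come back as work - def456's events are never touched, no matter how much history they have. The affected IDs are exactly what feeds the `WHERE device_id IN (...)` clause in the atomic INSERT above.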
The APIs: Built in a Weekend with Claude Code
One of the fun parts of this project is seeing how quickly AI coding tools can turn a design into working APIs. Claude and I built three endpoints to manage the full identity stitching lifecycle, all using Firebase Cloud Functions:
1. Configure cascading rules (updateIdentityMapStitching) - This is where you define the priority order for identity resolution on a given identity map. You pass in an array of cascading rules, each with an identityField and a priority number. For example:
{
"cascadingRules": [
{ "identityField": "crm_id", "priority": 1 },
{ "identityField": "email", "priority": 2 }
]
}

The API validates that each field actually exists in the identity map's schema before saving - a nice guardrail so you don't end up with stitching rules that reference nonexistent fields.
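A minimal sketch of that guardrail might look like the following. The schema set, error handling, and function name are all assumptions for illustration - the real endpoint lives in a Cloud Function:

```python
def validate_cascading_rules(rules, schema_fields):
    """Reject rules that reference fields missing from the identity map's
    schema, or that reuse a priority number; return rules in priority order."""
    seen_priorities = set()
    for rule in rules:
        if rule["identityField"] not in schema_fields:
            raise ValueError(f"unknown identity field: {rule['identityField']}")
        if rule["priority"] in seen_priorities:
            raise ValueError(f"duplicate priority: {rule['priority']}")
        seen_priorities.add(rule["priority"])
    return sorted(rules, key=lambda r: r["priority"])

# Example payload from the request body above, validated against a
# hypothetical identity map schema.
ordered = validate_cascading_rules(
    [{"identityField": "email", "priority": 2},
     {"identityField": "crm_id", "priority": 1}],
    schema_fields={"primary_id", "crm_id", "email", "updated_at"},
)
```

Sorting by priority on the way in means the stitching layer can consume the rules in order without re-checking them at query time.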
2. Configure replay scheduling (updateDataGroupIdentityReplay) - Replay frequency is configured at the Data Group level, not per-dataset or per-identity-map. You specify an interval and the API creates a Cloud Scheduler job for that specific Data Group. Each scheduled run is smart about skipping unnecessary work - if a replay is already running, or if nothing changed since the last cursor, it exits early without doing anything.
3. Trigger replay on-demand (replayDataGroupIdentities) - Sometimes you don't want to wait for the next scheduled cycle. This endpoint kicks off a replay immediately for all identity maps in a Data Group, returning detailed results - how many identities were processed, how many events were rekeyed, and how long it took.
Data Governance and Deletion Flows
Quick caveat: I'm not a lawyer, and how your company interprets data deletion requirements under GDPR, CCPA, or other regulations is entirely up to you and your legal team. What we're aiming for here is a system that respects and falls in line with whatever decisions you make in your warehouse.
The core principle here is simple: all data governance changes happen in the warehouse, and we inherit them. Just like event data and identity maps flow from the warehouse into ClickHouse via CDC sync, so do deletions. You don't need to call special deletion APIs or manage data removal in two places - make the change in your warehouse, and it propagates automatically.
In practice, there are two scenarios:
1. Deleting a primary ID (e.g., removing a specific cookie or device). This is done by flagging the events and the identity mapping row as deleted in the warehouse. Then, both changes flow through the existing CDC pipeline into ClickHouse immediately, and ReplacingMergeTree physically removes them during background merges soon after.
2. Unstitching a person ID. This is where you need to break the link between a primary ID and a resolved person without deleting the events. To accomplish this, you'll need to null out the person-level identifiers on the identity map row in the warehouse. On the next replay cycle, the system re-stitches the affected events, and since there's no person ID to resolve to, they fall back to the raw primary ID.
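The unstitch flow can be sketched with the same toy resolution logic from earlier (all names hypothetical): once the person-level fields on the mapping row are nulled out, the next replay cascades all the way down to the raw primary ID.

```python
# Toy identity map: one cookie stitched to a CRM ID.
identity_map = {
    "abc123": {"crm_id": "CRM-4821", "email": "trevor@email.com"},
}

def resolve(primary_id: str) -> str:
    """Cascade: CRM ID, then email, then the raw primary ID."""
    known = identity_map.get(primary_id, {})
    return known.get("crm_id") or known.get("email") or primary_id

before = resolve("abc123")   # stitched to the CRM ID

# Unstitch request: null out the person-level identifiers in the warehouse row.
# The events themselves are untouched.
identity_map["abc123"] = {"crm_id": None, "email": None}

after = resolve("abc123")    # falls back to the raw primary ID
```

The journey data survives; it's just no longer attributable to a resolved person - which is the behavior an unstitch request is asking for.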
The nice thing about this design is that there's no special deletion machinery. The same CDC sync that handles normal data updates handles deletions. The same ReplacingMergeTree engine that handles replay handles deletion propagation. One consistent pattern all the way through - your warehouse is the single source of truth, and ClickHouse just follows its lead. (Be aware this does require mirror mode syncing for both event and identity map datasets.)
Caveats and Scaling Considerations
I promised at the top that we'd cover the caveats, so let's be honest about what this system assumes and where it starts to strain:
- You need an identity map that updates incrementally. Our entire replay mechanism is built around detecting what changed since the last cursor. If your identity map does a full table drop-and-replace every time it runs, every row looks "new" and you'd have to replay your entire event history on every cycle. That just doesn't hold up as your data grows. The identity map needs to append new and changed rows incrementally, so the CDC pipeline can propagate just the deltas.
- Identity map datasets must use mirror mode (CDC) syncing. This is what preserves the updated_at timestamps that make incremental replay possible. Full replace and append-only incremental sync modes won't work for identity maps.
- For GDPR/CCPA deletion support, event datasets also need mirror mode. If you want deletions in your warehouse to propagate automatically into ClickHouse, those event datasets need to be synced with CDC and a designated deleted column. Without mirror mode, you'd need another mechanism to handle event-level deletions.
- At scale, the atomic delete/insert replay needs batching. The single-INSERT approach we showed works great when a replay cycle touches thousands or even tens of thousands of events. But if a single replay cycle affects millions of events (say, a major identity graph restructuring), you'll want to break it into batches to avoid overwhelming ClickHouse with one massive INSERT. I haven't needed this yet, but it's on the radar as the system grows.
- We don't handle shared devices. Our model assumes a primary ID maps to one person at a time. But in the real world, a family tablet or a shared kiosk might have one cookie that belongs to different people at different points in time. Handling that requires time-windowed identity resolution - knowing that cookie_abc was Person A before 3pm and Person B after - which is a meaningful step up in complexity. It's something I might tackle eventually, but it's not in the system today.
None of these are deal-breakers - just things we can address in the future if needed.
Wrapping Up
Identity stitching is one of those features that's invisible when it works well and painfully obvious when it doesn't. When your analytics tool shows a fractured journey across three "different" visitors, users can't trust the data. When it shows a single, coherent path from research to purchase, the insights practically write themselves 😎.
If you're thinking of building something similar, here are the key takeaways:
- Bring your own identity map - let your analytics system consume identity resolution rather than own it. Keep the warehouse as your source of truth.
- Design your storage for stitching from day one - our ORDER BY (person_id, ...) and ReplacingMergeTree choices in the data groups layer were made specifically to enable efficient stitching and replay.
- Make replay atomic - the soft delete + re-insert in a single INSERT means queries never see partial state, and ReplacingMergeTree handles cleanup automatically.
- Plan for deletion from day one - GDPR and CCPA aren't optional, and the same soft delete pattern that handles replay handles deletion too.
Trevor Paulsen is a data product professional at UKG and a former product leader of Adobe's Customer Journey Analytics. All views expressed are his own.