# Standard Metrics & Dimensions Catalog

Reference catalog of out-of-the-box metrics and dimensions that should be created for every customer
during initial onboarding, regardless of industry vertical. These all use guaranteed system fields —
no customer schema inspection needed.

## How to Use This Catalog

During onboarding or when setting up a new data group:
1. Read this file to know what standard metrics and dimensions to create
2. Create all items below — they use only system-generated fields that exist for every customer
3. For the Event Source dimension: call `listDatasets` to get dataset names, then build the CASE expression dynamically
4. After creating these, inspect the customer's schema for vertical-specific or data-specific
   metrics/dimensions (those are NOT covered here — see vertical catalogs when available)

**Important:** Always follow the metrics-creation-guide.md and dimensions-creation-guide.md rules.

---

## System Fields Reference

These fields are guaranteed to exist for every customer. They are computed automatically by the system.

| Field | Type | Description |
|-------|------|-------------|
| `person_id` | String | Resolved person identifier (from identity stitching) |
| `event_id` | String | Unique event identifier (guaranteed unique per row) |
| `event_timestamp` | DateTime64(3) | Timestamp of the event |
| `event_source` | String | Which dataset the event came from |
| `is_identified` | UInt8 | 1 if the person is identified, 0 if anonymous |
| `device_id` | String | Original device/cookie identifier before identity stitching. Useful for counting unique devices or understanding cross-device behavior. |
| `_session_id` | Nullable(String) | Session ID. NULL for anonymous events. |
| `_session_start_event` | UInt8 | 1 if this is the first event in its session, 0 otherwise. 0 for anonymous rows. |
| `_session_end_event` | UInt8 | 1 if this is the last event in its session, 0 otherwise. 0 for anonymous rows. |
| `_is_first_session` | UInt8 | 1 if this event is in the person's first-ever session, 0 otherwise. Requires `is_identified = 1`. |
| `_time_spent` | Nullable(Int64) | Seconds between this event and the next event in the same session. NULL for the last event in a session and for anonymous rows. |

---

## Standard Metrics

### 1. People
- **id:** `people`
- **name:** People
- **description:** Count of unique individuals
- **sql:** `person_id`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 2. Sessions
- **id:** `sessions`
- **name:** Sessions
- **description:** Count of unique sessions
- **sql:** `_session_id`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 3. Events
- **id:** `events`
- **name:** Events
- **description:** Count of events
- **sql:** `event_id`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 4. Session Starts
- **id:** `session_starts`
- **name:** Session Starts
- **description:** Number of session entry points (first event in each session)
- **sql:** `_session_start_event`
- **aggregation:** `sum`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 5. Session Ends
- **id:** `session_ends`
- **name:** Session Ends
- **description:** Number of session exit points (last event in each session)
- **sql:** `_session_end_event`
- **aggregation:** `sum`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 6. New Sessions
- **id:** `new_sessions`
- **name:** New Sessions
- **description:** Count of unique sessions that are the person's first-ever session
- **sql:** `CASE WHEN _is_first_session = 1 THEN _session_id ELSE NULL END`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 7. Return Sessions
- **id:** `return_sessions`
- **name:** Return Sessions
- **description:** Count of unique sessions from returning visitors (not their first session)
- **sql:** `CASE WHEN _is_first_session = 0 THEN _session_id ELSE NULL END`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 8. Single Event Sessions
- **id:** `single_event_sessions`
- **name:** Single Event Sessions
- **description:** Count of sessions where only one event occurred (session start and end on the same event)
- **sql:** `CASE WHEN _session_start_event = 1 AND _session_end_event = 1 THEN _session_id ELSE NULL END`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

### 9. Total Seconds Spent
- **id:** `total_seconds_spent`
- **name:** Total Seconds Spent
- **description:** Total time spent in seconds (time between each event and the next event in the same session). NULL for last event in session and anonymous rows.
- **sql:** `_time_spent`
- **aggregation:** `sum`
- **dataType:** `number`
- **format:** `{ type: "number", decimals: 0 }`

### 10. Devices
- **id:** `devices`
- **name:** Devices
- **description:** Count of unique devices (original device/cookie identifier before identity stitching)
- **sql:** `device_id`
- **aggregation:** `count_distinct`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`

---

## Schema-Dependent Metrics

These metrics depend on fields that may or may not exist in the customer's schema. The AI should
inspect the schema and sample data first, then create these when matching fields are detected.

### 1. Page Views
- **id:** `page_views`
- **name:** Page Views
- **description:** Total number of page view events
- **aggregation:** `sum`
- **dataType:** `integer`
- **format:** `{ type: "number", decimals: 0 }`
- **Detection:** Check for one of two patterns (in priority order):
  1. **Event type field with page view variant** — If the schema has an event type field (e.g. `event_type`, `event_name`, `action`, `type`) AND sample data shows a page view value (e.g. `page_view`, `pageview`, `Page View`, `pageView`), use:
     - **sql:** `CASE WHEN {event_type_field} = '{page_view_value}' THEN 1 ELSE 0 END`
  2. **Page name/title field presence** — If no event type field exists but there's a page name or title field (e.g. `page_name`, `page_title`, `title`, `pageName`), count rows where it's non-null:
     - **sql:** `CASE WHEN {page_field} IS NOT NULL THEN 1 ELSE 0 END`

---

## Standard Dimensions

### 1. Day
- **id:** `day`
- **name:** Day
- **description:** Date of the event (YYYY-MM-DD)
- **sql:** `toDate(event_timestamp)`
- **dataType:** `date`

### 2. Day of Week
- **id:** `day_of_week`
- **name:** Day of Week
- **description:** Day of the week the event occurred (e.g. Monday, Tuesday)
- **sql:** `dateName('weekday', event_timestamp)`
- **dataType:** `string`

### 3. Event Depth
- **id:** `event_depth`
- **name:** Event Depth
- **description:** Sequential position of the event within its session (1 = session start, 2 = second event, etc.)
- **sql:** `PER_SESSION(ROW_NUMBER())`
- **dataType:** `integer`

### 4. Hour
- **id:** `hour`
- **name:** Hour
- **description:** Date and hour of the event (YYYY-MM-DD HH:00)
- **sql:** `formatDateTime(event_timestamp, '%Y-%m-%d %H:00')`
- **dataType:** `string`

### 5. Minute
- **id:** `minute`
- **name:** Minute
- **description:** Date, hour, and minute of the event (YYYY-MM-DD HH:MM)
- **sql:** `formatDateTime(event_timestamp, '%Y-%m-%d %H:%i')`
- **dataType:** `string`

### 6. Month
- **id:** `month`
- **name:** Month
- **description:** Month name of the event (e.g. January, February)
- **sql:** `dateName('month', event_timestamp)`
- **dataType:** `string`

### 7. Week
- **id:** `week`
- **name:** Week
- **description:** Week of the event, starting on Sunday (e.g. "Week of 2024-01-07")
- **sql:** `concat('Week of ', formatDateTime(toStartOfWeek(event_timestamp, 0), '%Y-%m-%d'))`
- **dataType:** `string`

### 8. Year
- **id:** `year`
- **name:** Year
- **description:** Year of the event
- **sql:** `toYear(event_timestamp)`
- **dataType:** `integer`

### 9. New vs. Repeat Session
- **id:** `new_vs_repeat_session`
- **name:** New vs. Repeat Session
- **description:** Whether the event's session is the person's first-ever session (New Session) or a return visit (Repeat Session)
- **sql:** `CASE WHEN _is_first_session = 1 THEN 'New Session' ELSE 'Repeat Session' END`
- **dataType:** `string`

### 10. Event Source
- **id:** `event_source_{dataGroupId}` (data-group-specific because the dataset ID → name mapping differs per data group)
- **name:** Event Source
- **description:** Friendly name of the dataset the event came from
- **sql:** *(dynamically built by the API)* — The API builds a CASE expression mapping dataset IDs to their friendly names.
- **dataType:** `string`

### 11. Time Spent per Session
- **id:** `time_spent_per_session`
- **name:** Time Spent per Session
- **description:** Time spent in the session, bucketed into human-readable ranges
- **sql:**
  ```
  CASE
    WHEN PER_SESSION(SUM(_time_spent)) IS NULL THEN NULL
    WHEN PER_SESSION(SUM(_time_spent)) < 10 THEN '01 - Under 10 seconds'
    WHEN PER_SESSION(SUM(_time_spent)) < 60 THEN '02 - 10 seconds - 1 minute'
    WHEN PER_SESSION(SUM(_time_spent)) < 180 THEN '03 - 1 - 3 minutes'
    WHEN PER_SESSION(SUM(_time_spent)) < 600 THEN '04 - 3 - 10 minutes'
    WHEN PER_SESSION(SUM(_time_spent)) < 1800 THEN '05 - 10 - 30 minutes'
    WHEN PER_SESSION(SUM(_time_spent)) < 3600 THEN '06 - 30 minutes - 1 hour'
    ELSE '07 - 1+ hours'
  END
  ```
- **dataType:** `string`

### 12. Time Spent per Person
- **id:** `time_spent_per_person`
- **name:** Time Spent per Person
- **description:** Total cumulative time the person has spent across all sessions, bucketed into ranges
- **sql:**
  ```
  CASE
    WHEN PER_PERSON(SUM(_time_spent)) IS NULL THEN NULL
    WHEN PER_PERSON(SUM(_time_spent)) < 300 THEN '01 - Under 5 minutes'
    WHEN PER_PERSON(SUM(_time_spent)) < 900 THEN '02 - 5 - 15 minutes'
    WHEN PER_PERSON(SUM(_time_spent)) < 1800 THEN '03 - 15 - 30 minutes'
    WHEN PER_PERSON(SUM(_time_spent)) < 3600 THEN '04 - 30 minutes - 1 hour'
    WHEN PER_PERSON(SUM(_time_spent)) < 7200 THEN '05 - 1 - 2 hours'
    WHEN PER_PERSON(SUM(_time_spent)) < 18000 THEN '06 - 2 - 5 hours'
    WHEN PER_PERSON(SUM(_time_spent)) < 36000 THEN '07 - 5 - 10 hours'
    WHEN PER_PERSON(SUM(_time_spent)) < 72000 THEN '08 - 10 - 20 hours'
    WHEN PER_PERSON(SUM(_time_spent)) < 180000 THEN '09 - 20 - 50 hours'
    WHEN PER_PERSON(SUM(_time_spent)) < 360000 THEN '10 - 50 - 100 hours'
    ELSE '11 - 100+ hours'
  END
  ```
- **dataType:** `string`

---

## Schema-Dependent Dimensions

These dimensions depend on fields that may or may not exist in the customer's schema. The AI should
inspect the schema and sample data first, then create these when matching fields are detected.
For dimensions marked **"Requires user input"**, the AI should ask the customer for clarification
before creating.

### 1. Page Name
- **id:** `page_name`
- **name:** Page Name
- **description:** The name or title of the page where the event occurred
- **dataType:** `string`
- **Detection:** Look for fields named similar to `page_name`, `page_title`, `title`, `pageName`, `pageTitle`, `document_title`. May be a JSON subfield — CAST if needed.
- **sql:** `{page_name_field}` (use the detected field directly, or CAST if JSON)

### 2. Page URL
- **id:** `page_url`
- **name:** Page URL
- **description:** The full URL of the page where the event occurred
- **dataType:** `string`
- **Detection:** Look for fields named similar to `page_url`, `url`, `page_path`, `pageUrl`, `page_location`, `current_url`, `web_url`. May be a JSON subfield — CAST if needed.
- **sql:** `{page_url_field}`

### 3. Browser Type
- **id:** `browser_type`
- **name:** Browser Type
- **description:** Web browser name without version (e.g. Chrome, Firefox, Safari)
- **dataType:** `string`
- **Detection:** Look for fields named similar to `browser_type`, `browser_name`, `browserName`, `browser`, `user_agent_browser`, `device.browser`. May be JSON — CAST if needed. Sample the data to understand what the field contains:
  - If there's a dedicated browser type/name field (no version info) — use it directly
  - If the field contains version info (e.g. "Chrome 120") — truncate to just the name
  - If the only available field is the full browser with version — derive Browser Type by stripping the version
- **sql:** Use the field directly if it already contains just the name. Otherwise sample the data and strip version info as needed.

### 4. Browser
- **id:** `browser`
- **name:** Browser
- **description:** Web browser with version (e.g. Chrome 120, Safari 17.2)
- **dataType:** `string`
- **Detection:** Look for fields named similar to `browser`, `browser_name`, `browserName`, `user_agent_browser`, `device.browser`, and separately for version fields like `browser_version`, `browserVersion`. May be JSON — CAST if needed. Sample the data to understand what's available:
  - If a single field already contains browser + version (e.g. "Chrome 120") — use it directly
  - If browser name and version are in separate fields — concatenate them: `concat({name_field}, ' ', {version_field})`
  - If only a browser name field exists with no version available — this dimension may not be creatable (just use Browser Type instead)
- **sql:** Depends on what fields are available — see detection notes above.

### 5. Tracking Code
- **id:** `tracking_code`
- **name:** Tracking Code
- **description:** Marketing campaign or tracking code associated with the event
- **dataType:** `string`
- **Requires user input:** Ask the customer:
  - "Do you have a campaign or tracking code field in your events? What is it called?"
  - If no dedicated field exists but there's a page URL, offer to extract from URL query params: `extractURLParameter({page_url_field}, 'utm_campaign')` or whatever query param they use.
- **Detection:** Look for fields named similar to `tracking_code`, `campaign`, `utm_campaign`, `campaign_name`, `campaign_code`, `trackingCode`, `marketing.campaign`, `marketing.utm_campaign`, `trafficSource.campaign`. May be JSON — CAST if needed.
- **sql:** `{tracking_code_field}` or `COALESCE({tracking_code_field}, NULL)` — or extracted from URL if needed.

### 6. City
- **id:** `city`
- **name:** City
- **description:** City where the event originated (from geo data)
- **dataType:** `string`
- **Detection:** Look for fields named similar to `city`, `geo_city`, `geo.city`, `location_city`, `geoCity`. Often a JSON subfield — CAST if needed.
- **sql:** `{city_field}` (CAST if JSON)

### 7. Country
- **id:** `country`
- **name:** Country
- **description:** Country where the event originated (from geo data)
- **dataType:** `string`
- **Detection:** Look for fields named similar to `country`, `geo_country`, `geo.country`, `location_country`, `geoCountry`. Often a JSON subfield — CAST if needed.
- **sql:** `{country_field}` (CAST if JSON)

### 8. Page Domain
- **id:** `page_domain`
- **name:** Page Domain
- **description:** Domain extracted from the page URL (e.g. "www.example.com")
- **dataType:** `string`
- **Detection:** First look for a field similar to (`domain`, `hostname`, `page_domain`). If none exists but a page URL field is available, extract the domain from it.
- **sql:** If dedicated field exists, use it directly. Otherwise extract from URL:
  `domainWithoutWWW({page_url_field})` — ClickHouse's built-in URL function.

### 9. Marketing Channel
- **id:** `marketing_channel`
- **name:** Marketing Channel
- **description:** High-level marketing channel classification (e.g. Paid Search, Organic Social, Email)
- **dataType:** `string`
- **Requires user input:** This dimension requires a dedicated conversation with the customer. The AI should:
  1. Ask what marketing channels they use (paid search, organic search, social, email, display, affiliates, etc.)
  2. Ask how each channel is identified — UTM parameters, referrer patterns, custom fields, etc.
  3. Propose a CASE expression and get confirmation before creating
- **Common OOTB channels to suggest:**
  - **Paid Search** — typically `utm_medium = 'cpc'` or `'paid search'`, or referrer from Google/Bing with ad params
  - **Organic Search** — referrer from known search engines (google.com, bing.com, yahoo.com, duckduckgo.com, baidu.com, yandex.com) without paid indicators
  - **AI Answer Engines** — referrer from ChatGPT (chatgpt.com, chat.openai.com), Perplexity (perplexity.ai), Claude (claude.ai), Gemini (gemini.google.com), Copilot (copilot.microsoft.com)
  - **Paid Social** — `utm_medium = 'paid social'` or `'paidsocial'`, or referrer from social platforms with ad params
  - **Organic Social** — referrer from social platforms (facebook.com, instagram.com, twitter.com, x.com, linkedin.com, tiktok.com, reddit.com, youtube.com, pinterest.com) without paid indicators
  - **Email** — `utm_medium = 'email'`, or referrer containing 'mail' domains
  - **Display** — `utm_medium = 'display'` or `'banner'` or `'cpm'`
  - **Affiliates** — `utm_medium = 'affiliate'`, or customer-specific affiliate referrer patterns
  - **Other Referring Domains** — has a referrer but doesn't match any of the above
  - **Direct / None** — no referrer and no UTM params (use NULL per system conventions)
- **sql:** *(dynamically built)* — Build a CASE expression based on the customer's answers. Example structure:
  ```
  CASE
    WHEN {utm_medium_field} = 'cpc' THEN 'Paid Search'
    WHEN domainWithoutWWW({referrer_field}) IN ('www.google.com', 'www.bing.com', ...) THEN 'Organic Search'
    WHEN domainWithoutWWW({referrer_field}) IN ('chatgpt.com', 'perplexity.ai', ...) THEN 'AI Answer Engines'
    WHEN domainWithoutWWW({referrer_field}) IN ('www.facebook.com', 'www.instagram.com', ...) THEN 'Organic Social'
    WHEN {utm_medium_field} = 'email' THEN 'Email'
    WHEN {referrer_field} IS NOT NULL THEN 'Other Referring Domains'
    ELSE NULL
  END
  ```

### 10. Operating System
- **id:** `operating_system`
- **name:** Operating System
- **description:** Operating system of the user's device
- **dataType:** `string`
- **Detection:** Look for fields named similar to `os`, `operating_system`, `os_name`, `osName`, `device.os`, `platform_os`. May be JSON — CAST if needed.
- **sql:** `{os_field}` (CAST if JSON)

### 11. Referrer Type
- **id:** `referrer_type`
- **name:** Referrer Type
- **description:** Category of the referring source (e.g. Search Engine, Social Network, AI Answer Engine, Other Site)
- **dataType:** `string`
- **Requires user input:** Ask the customer if they want this and if they have referrer categories beyond the defaults.
- **Detection:** Requires a referrer field — look for fields similar to `referrer`, `referer`, `referring_url`, `referrer_url`, `page_referrer`.
- **Common categories:**
  - **Search Engines** — google.com, bing.com, yahoo.com, duckduckgo.com, baidu.com, yandex.com, ecosia.org
  - **AI Answer Engines** — chatgpt.com, chat.openai.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com
  - **Social Networks** — facebook.com, instagram.com, twitter.com, x.com, linkedin.com, tiktok.com, reddit.com, youtube.com, pinterest.com, threads.net
  - **Email** — domains containing 'mail' (gmail, outlook, yahoo mail, etc.)
  - **Other Sites** — has referrer but doesn't match above categories
  - **Direct** — no referrer (use NULL)
- **sql:** *(dynamically built)* — CASE expression on `domainWithoutWWW({referrer_field})` matching against known domain lists.

### 12. Search Engine
- **id:** `search_engine`
- **name:** Search Engine
- **description:** The search engine or AI answer engine that referred the user
- **dataType:** `string`
- **Detection:** Requires a referrer field (same detection as Referrer Type). Extract and classify the referrer domain.
- **sql:** *(dynamically built)* — Example:
  ```
  CASE
    WHEN domainWithoutWWW({referrer_field}) LIKE '%google.%' THEN 'Google'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%bing.%' THEN 'Bing'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%yahoo.%' THEN 'Yahoo'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%baidu.%' THEN 'Baidu'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%yandex.%' THEN 'Yandex'
    WHEN domainWithoutWWW({referrer_field}) IN ('chatgpt.com', 'chat.openai.com') THEN 'ChatGPT'
    WHEN domainWithoutWWW({referrer_field}) = 'perplexity.ai' THEN 'Perplexity'
    WHEN domainWithoutWWW({referrer_field}) = 'claude.ai' THEN 'Claude'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%gemini.google.%' THEN 'Gemini'
    WHEN domainWithoutWWW({referrer_field}) LIKE '%copilot.microsoft.%' THEN 'Copilot'
    ELSE NULL
  END
  ```

### 13. Search Keywords
- **id:** `search_keywords`
- **name:** Search Keywords
- **description:** Search terms used to find the site (if available)
- **dataType:** `string`
- **Requires user input:** Ask the customer about this one. Most organic search keywords are no longer available since Google moved to HTTPS in 2011 (shows as "(not provided)"). However:
  - Some customers may have keyword data from **paid search** (Google Ads, Bing Ads) passed via URL params
  - Some may have **internal site search** keywords in their data
  - AI answer engines and some smaller search engines may still pass keywords in the referrer URL
  - The AI should do a web search to check the latest status on keyword availability if the customer asks
- **Detection:** Look for fields similar to (`search_keyword`, `keyword`, `search_term`, `query`). If none exist, check if keywords can be extracted from referrer URL params (`extractURLParameter({referrer_field}, 'q')` for Google, `'p'` for Yahoo, etc.) — but note most will be empty.
- **sql:** Depends on what's available — use dedicated field if it exists, otherwise extract from URL params.

### 14. Site Section
- **id:** `site_section`
- **name:** Site Section
- **description:** Top-level section of the site derived from the URL path (e.g. "products", "blog", "support")
- **dataType:** `string`
- **Detection:** First look for a field similar to (`site_section`, `content_group`, `page_category`, `section`). If none exists but a page URL field is available, derive from the first path segment of the URL.
- **sql:** If dedicated field exists, use it directly. Otherwise extract from URL path:
  ```
  CASE
    WHEN extractURLPath({page_url_field}) IN ('/', '') THEN 'Home'
    ELSE splitByChar('/', extractURLPath({page_url_field}))[2]
  END
  ```
  The `[2]` index is because the split produces an empty string at index 1 from the leading `/`.
