# Dimensions Creation Guide Read this guide BEFORE creating any dimension. Follow every step — skipping steps leads to broken dimensions. --- ## What Is a Dimension? A dimension is a categorical grouping used in GROUP BY clauses. It breaks down metrics by categories like date, country, event type, etc. - **`sql`** — A row-level expression that produces a categorical value - **No aggregation** — Dimensions NEVER contain SUM(), COUNT(), AVG(), or any aggregation function - **Exception**: Aggregates inside `PER_SESSION()` / `PER_PERSON()` scope functions ARE allowed — these expand into window functions, not raw aggregations - Dimensions are used in the GROUP BY clause of the outer query --- ## Step-by-Step Workflow ### Step 1: Inspect the Schema **Always call `inspectSchema` first.** Never guess field names. ``` operation: "inspectSchema" dataGroupId: "" ``` This returns every field with its name, type, and whether it's a JSON field or array field. ### Step 2: Sample the Data (When Needed) If you need to see actual values (to understand what a field contains): ``` operation: "sampleData" dataGroupId: "" fields: ["event_type", "geo.country"] ``` Returns top 20 most frequent values per field. Useful for understanding cardinality and value patterns. ### Step 3: Write the SQL Expression The `sql` field is a **row-level expression**. It gets placed inside a SELECT in the inner computation query, then used in GROUP BY in the outer query. #### Simple field reference ``` sql: "event_type" ``` #### JSON / nested fields — just use the field name directly The system auto-casts JSON fields based on the `dataType` parameter. Do NOT manually CAST. ``` sql: "geo.city" dataType: "string" ``` If the schema shows a dotted path like `geo.city` or `marketing.utm_source`, just use it as-is. The system handles all necessary casting internally. **Edge case — mixed types in one expression:** If your expression uses JSON fields of DIFFERENT types, explicitly CAST the one that differs from the overall `dataType`: ``` sql: "IF(CAST(geo.country_code AS String) = 'US', geo.state, geo.city)" dataType: "string" ``` **Never use `JSONExtractString()` or `JSONExtractFloat()` — those don't work with ClickHouse's JSON type.** #### Date extraction ``` sql: "toDate(event_timestamp)" -- Date only (2024-01-15) sql: "toHour(event_timestamp)" -- Hour of day (0-23) sql: "toDayOfWeek(event_timestamp)" -- Day of week (1=Monday, 7=Sunday) sql: "toMonth(event_timestamp)" -- Month number (1-12) sql: "formatDateTime(event_timestamp, '%Y-%m')" -- Year-month (2024-01) ``` #### CASE expressions (bucketing/categorization) ``` sql: "CASE WHEN order_value < 50 THEN 'Low' WHEN order_value < 200 THEN 'Medium' ELSE 'High' END" ``` #### NULL handling ``` sql: "COALESCE(marketing.utm_source, NULL)" ``` **IMPORTANT:** Never use placeholder strings like 'Unknown', 'Direct', 'None', '-', or 'Other' as default values. Always use NULL — the system automatically groups NULL values into a display bucket (shown as '-' in the UI). #### Scope functions (session/person level) These are the **one exception** to the "no aggregation" rule — aggregates inside scope functions are OK because they expand into window functions, not raw aggregations. ``` sql: "PER_SESSION(SUM(_time_spent))" -- Session duration sql: "PER_SESSION(ROW_NUMBER())" -- Event position in session sql: "PER_SESSION(FIRST_VALUE(page_url))" -- Entry page sql: "PER_SESSION(LAST_VALUE(page_url))" -- Exit page sql: "PER_PERSON(COUNT(DISTINCT _session_id))" -- Sessions per person ``` ### Step 4: Choose the Data Type | dataType | When to Use | Examples | |----------|-------------|---------| | `string` | Categorical text values (default) | City, event type, campaign name | | `number` | Numeric grouping values | Latitude, score | | `integer` | Whole number grouping values | Hour of day, day of week | | `date` | Date values | `toDate(event_timestamp)` | | `datetime` | Timestamp values | Full timestamp grouping | | `boolean` | True/false grouping | Is mobile, is returning user | ### Step 5: Create the Dimension ``` operation: "create" id: "geo_city" name: "City" description: "City from geo data" sql: "geo.city" dataType: "string" dataGroups: ["dg_abc123"] ``` --- ## Array Dimensions If the data has array fields (like `items[]`, `products[]`), you need two additional fields: - **`level`**: Set to `"array"` - **`arrayPath`**: The array column name (e.g., `"products"`) The system applies `LEFT ARRAY JOIN` to expand each array element into its own row. Use the original array name to reference sub-fields. ``` operation: "create" id: "product_name" name: "Product Name" sql: "items.name" dataType: "string" level: "array" arrayPath: "items" dataGroups: ["dg_abc123"] ``` --- ## ID Naming Rules - Must start with a lowercase letter - Only lowercase letters, numbers, and underscores - Pattern: `^[a-z][a-z0-9_]*$` - Good: `geo_city`, `event_date`, `product_category` - Bad: `GeoCity`, `123_dim`, `geo-city` --- ## Scope Functions Reference Use `PER_SESSION()` and `PER_PERSON()` to compute values scoped to a session or person. These replace raw window function syntax. | Pattern | Example | What it computes | |---------|---------|-----------------| | `PER_SESSION(SUM(field))` | `PER_SESSION(SUM(_time_spent))` | Total per session | | `PER_SESSION(ROW_NUMBER())` | `PER_SESSION(ROW_NUMBER())` | Position in session (defaults to ORDER BY event_timestamp) | | `PER_SESSION(FIRST_VALUE(field))` | `PER_SESSION(FIRST_VALUE(page_url))` | First value in session | | `PER_SESSION(LAST_VALUE(field))` | `PER_SESSION(LAST_VALUE(page_url))` | Last value in session | | `PER_PERSON(COUNT(DISTINCT field))` | `PER_PERSON(COUNT(DISTINCT _session_id))` | Distinct count per person | **Default ORDER BY:** ROW_NUMBER, FIRST_VALUE, LAST_VALUE, RANK, LAG, LEAD automatically default to `ORDER BY event_timestamp ASC`. Only specify ORDER BY to override (e.g., `PER_SESSION(ROW_NUMBER() ORDER BY revenue DESC)`). **Important**: Never write raw `OVER (PARTITION BY ...)` clauses. Always use scope functions instead. --- ## Critical Rules 1. **NEVER put aggregation functions in `sql`** — Exception: aggregates inside PER_SESSION()/PER_PERSON() are OK because they become window functions. 2. **ALWAYS inspect schema first** — Never guess field names. 3. **JSON fields need no manual CAST** — Just use the dotted field name directly (e.g., `geo.city`). The system auto-casts based on `dataType`. 4. **Flattened fields need no CAST** — If schema shows String/Float64/Int64 type, use directly. 5. **Array dimensions need `level` and `arrayPath`** — Both are required for sub-event data. 6. **Use PER_SESSION() / PER_PERSON() for scoped computations** — Never write raw OVER/PARTITION BY clauses. 7. **Use NULL for missing values** — NEVER use placeholder strings like 'Unknown', 'Direct', 'None', or '-'. The system automatically groups NULLs into a display bucket. 8. **Use persistence settings for carrying values forward** — Don't write LAST_VALUE/FIRST_VALUE IGNORE NULLS manually. Use the `persistence` field instead (`{ allocation: "last", expiration: "session" }`).