# Dimensions Creation Guide

Read this guide BEFORE creating any dimension. Follow every step — skipping steps leads to broken dimensions.

---

## What Is a Dimension?

A dimension is a categorical grouping used in GROUP BY clauses. It breaks down metrics by categories like date, country, event type, etc.

- **`sql`** — A row-level expression that produces a categorical value
- **No aggregation** — Dimensions NEVER contain SUM(), COUNT(), AVG(), or any aggregation function
- **Exception**: Aggregates inside `PER_SESSION()` / `PER_PERSON()` scope functions ARE allowed — these expand into window functions, not raw aggregations
- Dimensions are used in the GROUP BY clause of the outer query

---

## Step-by-Step Workflow

### Step 1: Inspect the Schema

**Always call `inspectSchema` first.** Never guess field names.

```
operation: "inspectSchema"
dataGroupId: "<the data group>"
```

This returns every field with its name, type, and whether it's a JSON field or array field.

### Step 2: Sample the Data (When Needed)

If you need to see actual values (to understand what a field contains):

```
operation: "sampleData"
dataGroupId: "<the data group>"
fields: ["event_type", "geo.country"]
```

Returns top 20 most frequent values per field. Useful for understanding cardinality and value patterns.

### Step 3: Write the SQL Expression

The `sql` field is a **row-level expression**. It gets placed inside a SELECT in the inner computation query, then used in GROUP BY in the outer query.

#### Simple field reference
```
sql: "event_type"
```

#### JSON / nested fields — just use the field name directly
The system auto-casts JSON fields based on the `dataType` parameter. Do NOT manually CAST.
```
sql: "geo.city"
dataType: "string"
```

If the schema shows a dotted path like `geo.city` or `marketing.utm_source`, just use it as-is. The system handles all necessary casting internally.

**Edge case — mixed types in one expression:** If your expression uses JSON fields of DIFFERENT types, explicitly CAST the one that differs from the overall `dataType`:
```
sql: "IF(CAST(geo.country_code AS String) = 'US', geo.state, geo.city)"
dataType: "string"
```

**Never use `JSONExtractString()` or `JSONExtractFloat()` — those don't work with ClickHouse's JSON type.**

#### Date extraction
```
sql: "toDate(event_timestamp)"           -- Date only (2024-01-15)
sql: "toHour(event_timestamp)"           -- Hour of day (0-23)
sql: "toDayOfWeek(event_timestamp)"      -- Day of week (1=Monday, 7=Sunday)
sql: "toMonth(event_timestamp)"          -- Month number (1-12)
sql: "formatDateTime(event_timestamp, '%Y-%m')"  -- Year-month (2024-01)
```

#### CASE expressions (bucketing/categorization)
```
sql: "CASE WHEN order_value < 50 THEN 'Low' WHEN order_value < 200 THEN 'Medium' ELSE 'High' END"
```

#### NULL handling
```
sql: "COALESCE(marketing.utm_source, NULL)"
```
**IMPORTANT:** Never use placeholder strings like 'Unknown', 'Direct', 'None', '-', or 'Other' as default values. Always use NULL — the system automatically groups NULL values into a display bucket (shown as '-' in the UI).

#### Scope functions (session/person level)

These are the **one exception** to the "no aggregation" rule — aggregates inside scope functions are OK because they expand into window functions, not raw aggregations.

```
sql: "PER_SESSION(SUM(_time_spent))"       -- Session duration
sql: "PER_SESSION(ROW_NUMBER())"           -- Event position in session
sql: "PER_SESSION(FIRST_VALUE(page_url))"  -- Entry page
sql: "PER_SESSION(LAST_VALUE(page_url))"   -- Exit page
sql: "PER_PERSON(COUNT(DISTINCT _session_id))"  -- Sessions per person
```

### Step 4: Choose the Data Type

| dataType | When to Use | Examples |
|----------|-------------|---------|
| `string` | Categorical text values (default) | City, event type, campaign name |
| `number` | Numeric grouping values | Latitude, score |
| `integer` | Whole number grouping values | Hour of day, day of week |
| `date` | Date values | `toDate(event_timestamp)` |
| `datetime` | Timestamp values | Full timestamp grouping |
| `boolean` | True/false grouping | Is mobile, is returning user |

### Step 5: Create the Dimension

```
operation: "create"
id: "geo_city"
name: "City"
description: "City from geo data"
sql: "geo.city"
dataType: "string"
dataGroups: ["dg_abc123"]
```

---

## Array Dimensions

If the data has array fields (like `items[]`, `products[]`), you need two additional fields:

- **`level`**: Set to `"array"`
- **`arrayPath`**: The array column name (e.g., `"products"`)

The system applies `LEFT ARRAY JOIN` to expand each array element into its own row. Use the original array name to reference sub-fields.

```
operation: "create"
id: "product_name"
name: "Product Name"
sql: "items.name"
dataType: "string"
level: "array"
arrayPath: "items"
dataGroups: ["dg_abc123"]
```

---

## ID Naming Rules

- Must start with a lowercase letter
- Only lowercase letters, numbers, and underscores
- Pattern: `^[a-z][a-z0-9_]*$`
- Good: `geo_city`, `event_date`, `product_category`
- Bad: `GeoCity`, `123_dim`, `geo-city`

---

## Scope Functions Reference

Use `PER_SESSION()` and `PER_PERSON()` to compute values scoped to a session or person. These replace raw window function syntax.

| Pattern | Example | What it computes |
|---------|---------|-----------------|
| `PER_SESSION(SUM(field))` | `PER_SESSION(SUM(_time_spent))` | Total per session |
| `PER_SESSION(ROW_NUMBER())` | `PER_SESSION(ROW_NUMBER())` | Position in session (defaults to ORDER BY event_timestamp) |
| `PER_SESSION(FIRST_VALUE(field))` | `PER_SESSION(FIRST_VALUE(page_url))` | First value in session |
| `PER_SESSION(LAST_VALUE(field))` | `PER_SESSION(LAST_VALUE(page_url))` | Last value in session |
| `PER_PERSON(COUNT(DISTINCT field))` | `PER_PERSON(COUNT(DISTINCT _session_id))` | Distinct count per person |

**Default ORDER BY:** ROW_NUMBER, FIRST_VALUE, LAST_VALUE, RANK, LAG, LEAD automatically default to `ORDER BY event_timestamp ASC`. Only specify ORDER BY to override (e.g., `PER_SESSION(ROW_NUMBER() ORDER BY revenue DESC)`).

**Important**: Never write raw `OVER (PARTITION BY ...)` clauses. Always use scope functions instead.

---

## Critical Rules

1. **NEVER put aggregation functions in `sql`** — Exception: aggregates inside PER_SESSION()/PER_PERSON() are OK because they become window functions.
2. **ALWAYS inspect schema first** — Never guess field names.
3. **JSON fields need no manual CAST** — Just use the dotted field name directly (e.g., `geo.city`). The system auto-casts based on `dataType`.
4. **Flattened fields need no CAST** — If schema shows String/Float64/Int64 type, use directly.
5. **Array dimensions need `level` and `arrayPath`** — Both are required for sub-event data.
6. **Use PER_SESSION() / PER_PERSON() for scoped computations** — Never write raw OVER/PARTITION BY clauses.
7. **Use NULL for missing values** — NEVER use placeholder strings like 'Unknown', 'Direct', 'None', or '-'. The system automatically groups NULLs into a display bucket.
8. **Use persistence settings for carrying values forward** — Don't write LAST_VALUE/FIRST_VALUE IGNORE NULLS manually. Use the `persistence` field instead (`{ allocation: "last", expiration: "session" }`).
