# Segment SQL Authoring Guide

Read this guide BEFORE creating any segment. Follow the rules exactly — incorrect syntax causes query failures.

---

## What Is a Segment?

A segment is a behavioral filter that identifies subsets of events, sessions, or persons. Segments are evaluated **after** dimensions/metrics are computed but **before** final aggregation.

A segment has three core fields:

| Field | Type | Description |
|-------|------|-------------|
| `scope` | `"event"` \| `"session"` \| `"person"` | Controls granularity of evaluation |
| `sql` | string | The segment expression (see syntax below) |
| `exclude` | boolean (optional) | `true` = exclusion segment, `false`/omitted = inclusion |

There are five segment types, determined by SQL syntax:

| Type | Syntax Indicator | Scope Allowed |
|------|-----------------|---------------|
| Standard | No THEN, no temporal modifiers | All three |
| Sequential | Contains `THEN` keyword | Session, Person only |
| Sequential + Aggregate | `(THEN steps) AND aggregate` | Session, Person only |
| Temporal | Starts with `AFTER`/`FROM`/`BEFORE`/`UNTIL` | Session, Person only |
| Temporal + Sequential | Both modifiers and THEN | Session, Person only |

---

## Standard Segments

**IMPORTANT: All column references in segment SQL must use `{brace}` syntax to reference existing dimensions or metrics.** Bare column names (e.g., `event_type`, `items.category`) are NOT allowed. You must use `{dimension_id}` or `{metric_id}` syntax so that columns are properly computed with array joins, lookups, and type casting before segment evaluation.

### Option A: Row-Level Boolean (Simple)

Write a simple boolean expression. The system wraps it in `max() OVER (PARTITION BY ...)` for session/person scope automatically.

| Scope | System Generates | Meaning |
|-------|-----------------|---------|
| `event` | `your_sql` (no wrapping) | Condition applies to each individual event |
| `session` | `max(your_sql) OVER (PARTITION BY person_id, session_id)` | Session contains at least one matching event |
| `person` | `max(your_sql) OVER (PARTITION BY person_id)` | Person has at least one matching event |

Examples:
```
{ scope: 'person', sql: "{event_type} = 'purchase'" }
{ scope: 'session', sql: "{device_category} = 'Mobile'" }
{ scope: 'event', sql: "{revenue} > 100" }
```

### Option B: Convenience Keywords (Any Scope)

These keywords simplify common comparisons. They expand to standard SQL and work at any scope (event, session, or person). The raw SQL equivalents (LIKE, IN, BETWEEN, etc.) still work too.

| Keyword | Meaning | Compiles to |
|---------|---------|-------------|
| `CONTAINS(field, 'value')` | String contains value | `field LIKE '%value%'` |
| `NOT_CONTAINS(field, 'value')` | String does not contain value | `field NOT LIKE '%value%'` |
| `STARTS_WITH(field, 'value')` | String starts with value | `field LIKE 'value%'` |
| `ENDS_WITH(field, 'value')` | String ends with value | `field LIKE '%value'` |
| `IS_EMPTY(field)` | Null or empty string | `(field IS NULL OR field = '')` |
| `IS_NOT_EMPTY(field)` | Has a value | `(field IS NOT NULL AND field != '')` |
| `BETWEEN(field, low, high)` | Value in range (inclusive) | `field BETWEEN low AND high` |
| `IN_LIST(field, val1, val2, ...)` | Matches any value in list | `field IN (val1, val2, ...)` |
| `MATCHES(field, 'pattern')` | Regex match (re2 syntax) | `match(field, 'pattern')` |
| `NOT_MATCHES(field, 'pattern')` | Negated regex match | `NOT match(field, 'pattern')` |
| `CONTAINS_ALL(field, v1, v2, ...)` | All substrings present | `(field LIKE '%v1%' AND field LIKE '%v2%' AND ...)` |
| `CONTAINS_ANY(field, v1, v2, ...)` | Any substring present | `(field LIKE '%v1%' OR field LIKE '%v2%' OR ...)` |

Examples:
```
{ scope: 'event', sql: "CONTAINS({page_url}, '/pricing')" }
{ scope: 'event', sql: "STARTS_WITH({channel}, 'Paid')" }
{ scope: 'event', sql: "IS_NOT_EMPTY({coupon_code})" }
{ scope: 'event', sql: "BETWEEN({revenue}, 10, 100)" }
{ scope: 'event', sql: "IN_LIST({country}, 'US', 'CA', 'UK')" }
{ scope: 'event', sql: "MATCHES({page_url}, '^/products/[0-9]+')" }
{ scope: 'event', sql: "CONTAINS_ANY({referrer}, 'google', 'bing', 'yahoo')" }
{ scope: 'event', sql: "CONTAINS_ALL({page_url}, 'checkout', 'step')" }
{ scope: 'person', sql: "CONTAINS({page_url}, '/checkout') AND {revenue} > 50" }
```

### Option C: Aggregate Keywords (Session or Person Scope)

For cross-row logic (e.g., "3+ purchases", "never did X", "did X AND Y as separate events"), use aggregate keyword wrappers. **These require session or person scope.**

| Keyword | Meaning | Compiles to |
|---------|---------|-------------|
| `ANY(condition)` | At least one row matches | `countIf(condition) > 0` |
| `EVERY(condition)` | All rows match | `countIf(NOT(condition)) = 0` |
| `NONE(condition)` | No rows match | `countIf(condition) = 0` |
| `COUNT(condition)` | Number of matching rows | `countIf(condition)` |
| `UNIQUE(expr)` | Count of distinct values | `uniqExact(expr)` |
| `SUM(expr)` | Sum across rows | `sum(expr)` |
| `AVG(expr)` | Average across rows | `avg(expr)` |
| `MIN(expr)` | Minimum value | `min(expr)` |
| `MAX(expr)` | Maximum value | `max(expr)` |
| `FIRST(expr)` | Value of expr at first event (by time) | `argMax(expr, -timestamp)` |
| `LAST(expr)` | Value of expr at last event (by time) | `argMax(expr, timestamp)` |

Examples:
```
{ scope: 'person', sql: "ANY({event_type} = 'purchase') AND ANY({event_type} = 'page_view')" }
{ scope: 'person', sql: "NONE({event_type} = 'purchase')" }
{ scope: 'person', sql: "COUNT({event_type} = 'purchase') >= 3" }
{ scope: 'person', sql: "SUM({order_value}) > 500" }
{ scope: 'session', sql: "UNIQUE({page_url}) >= 5" }
{ scope: 'person', sql: "FIRST({marketing_channel}) = 'Organic Search'" }
```

Keywords can be combined freely — including convenience + aggregate:
```
ANY(CONTAINS({page_url}, '/checkout')) AND SUM({order_value}) > 200 AND NONE({event_type} = 'refund')
```

### IMPORTANT: Do NOT Use Raw Aggregate Functions

Always use keyword wrappers instead of raw ClickHouse aggregates:

| Do NOT write | Write instead |
|---|---|
| `countIf({event_type} = 'purchase') > 0` | `ANY({event_type} = 'purchase')` |
| `sum({revenue}) > 500` | `SUM({revenue}) > 500` |
| `uniqExact({page_url}) >= 5` | `UNIQUE({page_url}) >= 5` |
| `{page_url} LIKE '%/pricing%'` | `CONTAINS({page_url}, '/pricing')` |
| `{coupon_code} IS NOT NULL AND {coupon_code} != ''` | `IS_NOT_EMPTY({coupon_code})` |
| `match({page_url}, '^/products')` | `MATCHES({page_url}, '^/products')` |
| `{url} LIKE '%google%' OR {url} LIKE '%bing%'` | `CONTAINS_ANY({url}, 'google', 'bing')` |
| `{url} LIKE '%shop%' AND {url} LIKE '%sale%'` | `CONTAINS_ALL({url}, 'shop', 'sale')` |

### When to Use Row-Level Boolean vs Keywords

| User Intent | Approach | SQL |
|---|---|---|
| "URLs containing /pricing" | Convenience keyword | `CONTAINS({page_url}, '/pricing')` |
| "Users from US, CA, or UK" | Convenience keyword | `IN_LIST({country}, 'US', 'CA', 'UK')` |
| "Orders between $10 and $100" | Convenience keyword | `BETWEEN({revenue}, 10, 100)` |
| "Events with a coupon code" | Convenience keyword | `IS_NOT_EMPTY({coupon_code})` |
| "Product URLs matching a pattern" | Convenience keyword | `MATCHES({page_url}, '^/products/[0-9]+')` |
| "Referrer from any search engine" | Convenience keyword | `CONTAINS_ANY({referrer}, 'google', 'bing', 'yahoo')` |
| "URL has both 'checkout' and 'step'" | Convenience keyword | `CONTAINS_ALL({page_url}, 'checkout', 'step')` |
| "Users who purchased" | Row-level boolean | `{event_type} = 'purchase'` |
| "Users who purchased electronics" | Row-level boolean | `{event_type} = 'purchase' AND {item_category} = 'Electronics'` |
| "Users who purchased AND viewed a page" (two events) | Aggregate keywords | `ANY({event_type} = 'purchase') AND ANY({event_type} = 'page_view')` |
| "Users who never purchased" | Aggregate keywords | `NONE({event_type} = 'purchase')` |
| "Users with 3+ purchases" | Aggregate keywords | `COUNT({event_type} = 'purchase') >= 3` |
| "Users who spent over $500" | Aggregate keywords | `SUM({order_value}) > 500` |

**Rule of thumb**: Use convenience keywords for string matching, null checks, range checks, and multi-value matching. Use aggregate keywords when the logic requires cross-row aggregation (counting, summing, negation). Only use `ANY()` when combining it with other keyword wrappers (e.g., `ANY(x) AND NONE(y)`) — for a single condition, a plain row-level boolean is simpler.

---

## {Brace} References

**All column references in segment SQL must use `{brace}` syntax.** Bare column names are NOT allowed.

| Reference Type | Resolves To | Example |
|---|---|---|
| `{dimension_id}` | Dimension's computed column | `{event_type}` → `event_type` |
| `{metric_id}` | Underscore-prefixed metric column | `{revenue}` → `_revenue` |

All `{brace}` references are validated at create/update time. Unknown references return an error. If a needed dimension/metric doesn't exist yet, create it first.

---

## Sequential Segments (THEN / WITHIN)

Define ordered event patterns using the `THEN` keyword. Compiles to ClickHouse `sequenceMatch()`.

```
-- Two-step sequence
{event_type} = 'page_view' THEN {event_type} = 'purchase'

-- With time constraint
{event_type} = 'page_view' THEN WITHIN 30m {event_type} = 'purchase'

-- Three-step funnel with per-step constraints
{event_type} = 'product_view' THEN WITHIN 5m {event_type} = 'add_to_cart' THEN WITHIN 30m {event_type} = 'purchase'

-- WITHIN SESSION: sequence must happen within the same session, but evaluated at person scope
-- This finds PEOPLE who had a session where paid search led to a purchase
{last_touch_marketing_channel} = 'Paid Search' THEN WITHIN SESSION {event_type} = 'purchase'
```

### Combining Sequential with Aggregate Conditions

You can combine a sequential pattern with an aggregate condition by wrapping the sequential part in parentheses:

```
-- People who viewed then purchased AND spent over $100 total
({event_type} = 'page_view' THEN {event_type} = 'purchase') AND SUM({order_value}) > 100

-- Three-step funnel AND made 3+ purchases
({event_type} = 'browse' THEN {event_type} = 'add_to_cart' THEN {event_type} = 'purchase') AND COUNT({event_type} = 'purchase') >= 3

-- WITHIN constraints work inside the parens
({event_type} = 'page_view' THEN WITHIN 30m {event_type} = 'purchase') AND SUM({order_value}) > 100

-- With temporal modifiers (temporal scopes both the sequence and the aggregate)
AFTER FIRST {event_type} = 'signup': ({event_type} = 'page_view' THEN {event_type} = 'purchase') AND SUM({order_value}) > 100
```

**Operator precedence:** AND binds tighter than THEN (consistent with SQL convention). So without parentheses, `A THEN B AND C` means step 1 = A, step 2 = (B AND C) — not (A THEN B) AND C. **Parentheses are required** to separate the sequential part from an aggregate companion.

**Rules:**
- Scope must be `session` or `person` (not event)
- Each THEN step must be a row-level boolean — no aggregates in steps
- To combine a sequence with an aggregate, wrap the THEN part in parens: `(A THEN B) AND SUM(x) > 100`. The aggregate uses keyword wrappers.
- Max 32 steps (ClickHouse limit)
- WITHIN format: `<number><unit>` where unit is `s`/`m`/`h`/`d`
- WITHIN goes between THEN and the condition: `THEN WITHIN 30m <condition>`
- First step cannot have WITHIN
- WITHIN SESSION: sequence must occur within a single session. **Must use person scope.** Result promoted to person level (person matches if ANY session had the sequence). Can combine with time: `THEN WITHIN SESSION WITHIN 30m <condition>`

**Scope Behavior:**
- `session` scope: all steps must occur within the same session. Result is per-session.
- `person` scope: steps can span across sessions. All of the person's events are flagged.
- `person` scope + `WITHIN SESSION`: steps must occur within the same session, but result is promoted to person scope. Person is flagged if ANY session had the matching sequence.

**When to use WITHIN SESSION vs session scope:**
- "People who came from paid search and purchased in the same session" → person scope + WITHIN SESSION
- "Sessions where user viewed then purchased" → session scope (no WITHIN SESSION)
- Using session scope flags sessions, not people. Use person scope + WITHIN SESSION when you want people.

---

## Temporal Modifiers (AFTER / FROM / BEFORE / UNTIL)

Restrict the event stream to a time window relative to a specific event.

**Syntax:** `MODIFIER ANCHOR boundary_condition: main_expression`

| Modifier | Inclusive? | Description |
|----------|-----------|-------------|
| `AFTER` | No | Events strictly after the anchor |
| `FROM` | Yes | Events at or after the anchor |
| `BEFORE` | No | Events strictly before the anchor |
| `UNTIL` | Yes | Events up to and including the anchor |

**Anchor:** `FIRST` (earliest match) or `LAST` (latest match) — always required.

```
-- After first purchase, did they browse more?
AFTER FIRST {event_type} = 'purchase': {event_type} = 'product_view'

-- Between signup and churn (stacked modifiers, max 5)
FROM FIRST {event_type} = 'signup': UNTIL LAST {event_type} = 'churn': {event_type} = 'page_view'

-- Combined with THEN (scope must be session or person)
AFTER FIRST {event_type} = 'signup': {event_type} = 'product_view' THEN {event_type} = 'purchase'

-- Temporal + aggregate: use aggregate keywords in the main expression
FROM FIRST {event_type} = 'purchase': ANY({event_type} = 'return')
FROM FIRST {channel} = 'organic': BEFORE LAST {event_type} = 'signup': {revenue} > 0

-- Temporal-only (no main expression) — defaults to 1 = 1, meaning "all events in this window"
-- Last modifier does NOT need a trailing colon
AFTER FIRST {event_type} = 'purchase': BEFORE LAST {event_type} = 'churn'
AFTER FIRST {event_type} = 'signup'
```

**Rules:**
- Scope must be `session` or `person` (event scope not allowed)
- Main expression is optional — if omitted, defaults to `1 = 1` (all events in the window)
- Colon not needed after the last modifier if there's no main expression

---

## The Exclude Field

| exclude | Behavior |
|---------|----------|
| `false` or omitted | Include — filter TO matching rows |
| `true` | Exclude — filter OUT matching rows |

Must be a boolean. Can be overridden per-report using `{ id, exclude }` object format.

---

## Segment ID Naming

- Lowercase alphanumeric + underscores, starting with a letter
- `seg_` prefix is NOT required — name segments descriptively
- System internally prefixes with `_seg_` for column names

Good: `mobile_users`, `purchasers`, `high_value_customers`, `browse_then_purchase`
Bad: `Mobile Users`, `123_segment`, `my-segment`

---

## Using Segments in Reports

**Column-level** — apply to individual metrics (compare segmented vs base):
```javascript
metrics: [
  { metric_id: 'events', segment_ids: [] },                    // Base
  { metric_id: 'events', segment_ids: ['mobile_users'] },      // Mobile only
  { metric_id: 'events', segment_ids: ['mobile_users', 'purchasers'] }  // Mobile AND purchasers
]
```

**Table-level** — filter entire report:
```javascript
tableSegments: ['purchasers']
```

**Exclude override** — override per-report:
```javascript
tableSegments: [{ id: 'purchasers', exclude: true }]
```

---

## AI Decision Flowchart

1. **Include or exclude?** → Set `exclude: true` for exclusion
2. **Time sequence?** ("A then B") → Use THEN. Scope = session or person.
   - **Also need an aggregate condition?** ("viewed then purchased AND spent > $100") → Wrap THEN in parens: `(A THEN B) AND SUM(x) > 100`
3. **Time window?** ("after first purchase") → Use temporal modifiers (AFTER/FROM/BEFORE/UNTIL)
4. **Cross-row logic?** ("3+ purchases", "never did X") → Use keywords. Scope = session or person.
5. **What level?** Event property = event, session behavior = session, person lifetime = person.
6. **All column references must use `{braces}`** — no bare column names.

---

## Validation Errors

| Error | Fix |
|-------|-----|
| `references unknown dimension or metric: {ref}` | Check spelling, verify dimension/metric exists |
| `Sequential segments require session or person scope` | Change scope from event |
| `Sequential step conditions must be row-level expressions` | Wrap sequential part in parens: `(A THEN B) AND SUM(x) > 100` |
| `Aggregate functions require session or person scope` | Change scope from event |
| `Segment SQL contains forbidden keyword: X` | Remove DROP/DELETE/UPDATE/etc. |
| `Do not include PARTITION BY or OVER` | Remove — system auto-generates these |
| `Temporal modifiers require FIRST or LAST` | Add FIRST or LAST after modifier keyword |
