# Metrics Creation Guide

Read this guide BEFORE creating any metric. Follow every step — skipping steps leads to broken metrics.

---

## What Is a Metric?

A metric is a numeric aggregation applied to event data. There are two types:

### Standard Metrics
A standard metric has two parts that work together:
1. **`sql`** — A row-level expression (evaluated per event row)
2. **`aggregation`** — How to aggregate those row values across rows

The system computes `sql` per row in an inner query, then applies `aggregation` in the outer GROUP BY query. **Never put aggregation functions inside `sql`.**
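To make the inner/outer split concrete, here is a minimal sketch of how `sql` and `aggregation` might compose into a query. The real query builder is internal; the table name, alias, and function mapping below are illustrative assumptions, not the actual implementation.

```javascript
// Illustrative only: maps each aggregation type to an outer-query function.
const AGG_FNS = {
  sum: expr => `sum(${expr})`,
  count: expr => `count(${expr})`,
  count_distinct: expr => `count(DISTINCT ${expr})`,
};

function buildMetricQuery({ sql, aggregation }, table) {
  // Inner query: evaluate the row-level `sql` expression per event row.
  const inner = `SELECT ${sql} AS metric_value FROM ${table}`;
  // Outer query: apply the aggregation across those row values.
  return `SELECT ${AGG_FNS[aggregation]("metric_value")} FROM (${inner})`;
}
```

This is why an aggregation function inside `sql` breaks the metric: it would be applied per row in the inner query, then aggregated again in the outer one.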

### Calculated Metrics
Post-aggregation formulas that combine standard metrics. They use a `formula` field with `{metric_id}` references instead of sql/aggregation. Use the `createCalculated` operation for these.

---

## Step-by-Step Workflow

### Step 1: Inspect the Schema

**Always call `inspectSchema` first.** Never guess field names.

```
operation: "inspectSchema"
dataGroupId: "<the data group>"
```

This returns every field with its name, type, and whether it's a JSON field or array field. Use exact field names from the schema.

### Step 2: Sample the Data (When Needed)

If you need to understand actual values (for CASE expressions, filters, etc.):

```
operation: "sampleData"
dataGroupId: "<the data group>"
fields: ["event_type", "order_value"]
```

Returns the top 20 most frequent values per field with counts, plus null counts and distinct counts.
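The statistics `sampleData` reports can be sketched as below. This is not the tool itself, just an illustration of what "top-N frequent values, null count, distinct count" means for a field:

```javascript
// Illustrative sketch of per-field profiling, assuming rows are plain objects.
function profileField(rows, field, topN = 20) {
  const counts = new Map();
  let nulls = 0;
  for (const row of rows) {
    const value = row[field];
    if (value === null || value === undefined) { nulls++; continue; }
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  // Sort values by frequency, descending, and keep the top N.
  const topValues = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);
  return { topValues, nullCount: nulls, distinctCount: counts.size };
}
```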

### Step 3: Write the SQL Expression

The `sql` field is a **row-level expression only**. It gets placed inside a SELECT in the inner computation query.

#### Simple field reference
```
sql: "order_value"
aggregation: "sum"
```

#### Counting events (use count_distinct on event_id for deduplication)
```
sql: "event_id"
aggregation: "count_distinct"
```

#### Counting unique values (e.g., unique users)
```
sql: "person_id"
aggregation: "count_distinct"
```

#### Conditional counting (e.g., only purchases)
```
sql: "CASE WHEN event_type = 'purchase' THEN event_id ELSE NULL END"
aggregation: "count_distinct"
```
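The semantics of this pattern can be sketched in plain code: the CASE expression yields `event_id` only for matching rows and NULL otherwise, and `count_distinct` ignores NULLs, so only matching events are counted. Field names below are illustrative.

```javascript
// Sketch of CASE WHEN ... ELSE NULL + count_distinct semantics.
function conditionalCountDistinct(rows) {
  const ids = new Set();
  for (const row of rows) {
    // CASE WHEN event_type = 'purchase' THEN event_id ELSE NULL END
    const value = row.event_type === "purchase" ? row.event_id : null;
    // count(DISTINCT ...) skips NULLs, so non-matching rows are excluded.
    if (value !== null) ids.add(value);
  }
  return ids.size;
}
```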

#### JSON / nested fields — just use the field name directly
The system auto-casts JSON fields based on the `dataType` parameter. Do NOT manually CAST.
```
sql: "product_viewed.price"
dataType: "number"
aggregation: "sum"
```

If the schema shows a dotted path like `geo.city` or `product.price`, just use it as-is. The system handles all necessary casting internally.

**Edge case — mixed types in one expression:** If your expression uses JSON fields of DIFFERENT types, explicitly CAST the one that differs from the overall `dataType`:
```
sql: "IF(CAST(geo.city AS String) = 'NYC', product.price, 0)"
dataType: "number"
```

**Never use `JSONExtractString()` or `JSONExtractFloat()` — those don't work with ClickHouse's JSON type.**

#### Flattened fields — use directly
If the schema shows a field with type `String`, `Float64`, `Int64`, etc. (not `Nullable(JSON)`), use it directly:
```
sql: "order_value"
aggregation: "sum"
```

### Step 4: Choose the Aggregation

There are three valid aggregation types:

| Aggregation | Use For |
|-------------|---------|
| `sum` | Totals (revenue, quantities, computed values) |
| `count` | Count rows where the SQL expression is non-NULL |
| `count_distinct` | Unique counts (unique users, unique sessions, event counts) |

**For averages, ratios, and other derived calculations** — create a calculated metric instead. Build the standard metric building blocks first (e.g., `total_revenue` + `order_count`), then combine them with a calculated metric formula like `ROUND({total_revenue} / {order_count}, 2)`.
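Conceptually, a calculated metric resolves its `{metric_id}` references after the standard metrics have been computed. The sketch below illustrates that substitution step; the real evaluator is internal, and the `ROUND` helper here is a stand-in assumption.

```javascript
// Illustrative sketch: substitute {metric_id} references, then evaluate.
function evaluateFormula(formula, metricValues) {
  // Replace each {metric_id} with its already-computed value.
  const expr = formula.replace(/\{(\w+)\}/g, (_, id) => String(metricValues[id]));
  // Minimal stand-in for the ROUND(...) function used in formulas.
  const ROUND = (x, decimals) => Number(x.toFixed(decimals));
  return Function("ROUND", `return (${expr});`)(ROUND);
}
```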

**Important**: For `count_distinct`, the `sql` field should be the field you're counting distinct values of (e.g., `person_id`, `event_id`, `_session_id`).

### Step 5: Choose the Data Type

| dataType | When to Use |
|----------|-------------|
| `number` | Decimal values (revenue, percentages) |
| `integer` | Whole numbers (counts, quantities) |

### Step 6: Set Format (Optional)

```javascript
// Currency
format: { type: "currency", decimals: 2, prefix: "$" }

// Percentage
format: { type: "percent", decimals: 1, suffix: "%" }

// Plain number
format: { type: "number", decimals: 0 }
```
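Applying a format spec to a raw metric value might look like the sketch below, assuming the spec shape (`decimals`, `prefix`, `suffix`) shown above; the actual rendering logic is internal.

```javascript
// Illustrative sketch: render a raw value using a format spec.
function formatValue(value, format = {}) {
  const { decimals = 0, prefix = "", suffix = "" } = format;
  return `${prefix}${value.toFixed(decimals)}${suffix}`;
}
```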

### Step 7: Create the Metric

```
operation: "create"
id: "total_revenue"
name: "Total Revenue"
description: "Sum of all order values"
sql: "order_value"
aggregation: "sum"
dataType: "number"
format: { type: "currency", decimals: 2, prefix: "$" }
dataGroups: ["dg_abc123"]
```

---

## Array Metrics

If the data has array fields (like `items[]`, `products[]`), you need two additional fields:

- **`level`**: Set to `"array"`
- **`arrayPath`**: The array column name (e.g., `"items"`)

The system applies `LEFT ARRAY JOIN` to expand each array element into its own row. Reference sub-fields using the original array name (e.g., `items.price`).

```
operation: "create"
id: "total_product_revenue"
name: "Total Product Revenue"
sql: "items.price"
aggregation: "sum"
dataType: "number"
level: "array"
arrayPath: "items"
format: { type: "currency", decimals: 2, prefix: "$" }
dataGroups: ["dg_abc123"]
```
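The expansion semantics can be sketched as follows: each array element becomes its own row, and rows with an empty array are kept with a null element (the "LEFT" part). This is an illustration of the behavior, not the system's implementation, and the field names are illustrative.

```javascript
// Sketch of LEFT ARRAY JOIN: one output row per array element,
// keeping rows whose array is empty (with a null element).
function leftArrayJoin(rows, arrayPath) {
  return rows.flatMap(row => {
    const arr = row[arrayPath];
    if (!arr || arr.length === 0) return [{ ...row, [arrayPath]: null }];
    return arr.map(element => ({ ...row, [arrayPath]: element }));
  });
}

// After expansion, `items.price` is a per-row value the aggregation can sum.
function sumItemPrices(rows) {
  return leftArrayJoin(rows, "items")
    .reduce((total, row) => total + (row.items?.price ?? 0), 0);
}
```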

---

## ID Naming Rules

- Must start with a lowercase letter
- Only lowercase letters, numbers, and underscores
- Pattern: `^[a-z][a-z0-9_]*$`
- Good: `total_revenue`, `unique_users`, `event_count`
- Bad: `TotalRevenue`, `123_metric`, `total-revenue`
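The rules above reduce to the stated pattern; a quick check against it catches every bad example:

```javascript
// The metric ID pattern from the rules above, as a reusable check.
const METRIC_ID_PATTERN = /^[a-z][a-z0-9_]*$/;

function isValidMetricId(id) {
  return METRIC_ID_PATTERN.test(id);
}
```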

---

## Critical Rules

1. **NEVER put aggregation functions in `sql`** — No SUM(), COUNT(), AVG() in the sql field. Put aggregation in the `aggregation` field.
2. **ALWAYS inspect schema first** — Never guess field names. Use exact names from inspectSchema.
3. **JSON fields need no manual CAST** — Just use the dotted field name directly (e.g., `geo.city`). The system auto-casts based on `dataType`.
4. **Flattened fields need no CAST** — If schema shows String/Float64/Int64 type, use directly.
5. **Use CASE WHEN for conditional metrics** — Return the value or NULL based on conditions.
6. **count_distinct needs the field in `sql`** — Set sql to the field being counted.
7. **Count events with count_distinct on event_id** — Use `sql: "event_id"` with `aggregation: "count_distinct"` to correctly handle deduplication.
8. **Use calculated metrics for ratios/averages** — Don't try to compute averages in a single standard metric. Create building-block metrics and combine with a formula.
