SQL для когортного анализа
Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.
Зачем это знать
Когортный анализ — must-have для любого consumer-продукта. Показывает retention (% пользователей возвращаются в день N). Без когорт — нельзя понять, улучшился ли продукт.
На собесах задача «построить retention matrix в SQL» встречается регулярно. Middle-аналитик должен писать за 10 минут.
Что такое cohort analysis
Группируем users по когорте (обычно signup month), tracking их behavior over time.
Типичный вывод — heatmap:
| Cohort | Month 0 | Month 1 | Month 2 | Month 3 |
|---|---|---|---|---|
| Jan | 100% | 45% | 30% | 25% |
| Feb | 100% | 50% | 32% | 27% |
Каждая строка — cohort. Каждая колонка — offset месяц.
Полный SQL
WITH cohorts AS (
-- Определить cohort (signup month) каждого пользователя
SELECT
user_id,
DATE_TRUNC('month', MIN(created_at)) AS cohort_month
FROM users
GROUP BY user_id
),
user_activity AS (
-- Активность каждого пользователя по месяцам
SELECT
user_id,
DATE_TRUNC('month', created_at) AS activity_month
FROM events
GROUP BY 1, 2
),
cohort_sizes AS (
-- Размер каждой когорты
SELECT cohort_month, COUNT(*) AS cohort_size
FROM cohorts
GROUP BY 1
)
SELECT
c.cohort_month,
EXTRACT(month FROM AGE(ua.activity_month, c.cohort_month)) +
EXTRACT(year FROM AGE(ua.activity_month, c.cohort_month)) * 12 AS months_since_signup,
COUNT(DISTINCT ua.user_id) AS active_users,
cs.cohort_size,
COUNT(DISTINCT ua.user_id) * 100.0 / cs.cohort_size AS retention_pct
FROM cohorts c
JOIN user_activity ua ON ua.user_id = c.user_id
JOIN cohort_sizes cs ON cs.cohort_month = c.cohort_month
WHERE ua.activity_month >= c.cohort_month
GROUP BY 1, 2, cs.cohort_size
ORDER BY 1, 2;Пошаговое разъяснение
Step 1: Cohort assignment
WITH cohorts AS (
SELECT user_id, DATE_TRUNC('month', MIN(created_at)) AS cohort_month
FROM users
GROUP BY user_id
)Cohort = месяц первой регистрации.
Step 2: Activity per month
user_activity AS (
SELECT user_id, DATE_TRUNC('month', created_at) AS activity_month
FROM events
GROUP BY 1, 2
)Active = был event в месяце.
Step 3: Offset
activity_month - cohort_month в месяцах:
EXTRACT(month FROM AGE(activity_month, cohort_month)) +
EXTRACT(year FROM AGE(activity_month, cohort_month)) * 12Postgres-style. MySQL: TIMESTAMPDIFF(MONTH, cohort_month, activity_month).
Step 4: Aggregate
Count unique users per (cohort, offset).
Step 5: % of cohort
Normalize by cohort size.
В ClickHouse
WITH cohorts AS (
SELECT user_id, toStartOfMonth(min(created_at)) AS cohort_month
FROM users
GROUP BY user_id
),
user_activity AS (
SELECT DISTINCT user_id, toStartOfMonth(created_at) AS activity_month
FROM events
)
SELECT
c.cohort_month,
dateDiff('month', c.cohort_month, ua.activity_month) AS months_since,
uniqExact(ua.user_id) AS active
FROM cohorts c
JOIN user_activity ua USING user_id
GROUP BY 1, 2;Retention variants
Strict retention
Active именно в month N:
WHERE activity_month = cohort_month + INTERVAL '30 days'Cumulative retention
Active в month N или позже. Часто более informative.
Rolling retention
Active в last N days относительно сегодня.
Different cohort definitions
By signup week
DATE_TRUNC('week', MIN(created_at))Weekly cohorts — more granular.
By first purchase
DATE_TRUNC('month', MIN(purchase_at)) AS purchase_cohortFor transactional products.
By acquisition channel
cohort = (signup_month, acquisition_channel)Compare organic vs paid cohorts.
Pivoted output
Для visualization нужен wide format:
SELECT
cohort_month,
MAX(CASE WHEN months_since = 0 THEN retention_pct END) AS m0,
MAX(CASE WHEN months_since = 1 THEN retention_pct END) AS m1,
MAX(CASE WHEN months_since = 2 THEN retention_pct END) AS m2,
MAX(CASE WHEN months_since = 3 THEN retention_pct END) AS m3
FROM cohort_data
GROUP BY cohort_month;Dashboard считается
Retention curve
Avg retention по месяцам — один line.
Cohort heatmap
Color-coded matrix. Green good, red bad.
Cohort ridgeline
Plot каждый cohort as curve.
Performance tips
Incremental
Daily: compute only new cohorts / new months. Store в materialized view.
Sampling
Для exploration — 10% sample достаточно, faster.
Index
Index events(user_id, created_at) — critical.
Pre-aggregation
Daily active users — pre-compute → cohort analysis на этом.
Частые ошибки
Mistake: cohort = current_date
Cohort — signup time, не today.
Missing activities
Если event не fire для некоторых users — retention выглядит хуже. Check event tracking.
Time zone
User signup в 23:50 UTC → 02:50 MSK (next day). Use consistent timezone.
Mixed definitions
«Active» должен быть consistent. Event open, or transaction, or something else — choose и stick.
Анализ результатов
Retention по cohort
Sequential cohorts should trend similarly. Sharp drop — product change?
Improvement cohort on cohort
Best sign: retention ↑ для newer cohorts → product improving.
Cliff
Big drop между month N и N+1 — what's breaking?
На собесе
«Напишите SQL для retention matrix».
Walk through:
- Cohort assignment (CTE 1)
- Activity per user month (CTE 2)
- Join, compute offset
- Aggregate per (cohort, offset)
- Normalize by cohort size
10 минут — ok для middle.
Связанные темы
FAQ
Weekly или monthly?
Monthly — standard. Weekly — для B2C с high frequency.
Rolling vs static?
Static cohort — better для trend analysis.
Время computation?
10M users: несколько минут. Pre-aggregate для dashboards.
Тренируйте SQL — откройте тренажёр с 1500+ вопросами для собесов.