SQL для когортного анализа

Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.

Зачем это знать

Когортный анализ — must-have для любого consumer-продукта. Показывает retention (% пользователей возвращаются в день N). Без когорт — нельзя понять, улучшился ли продукт.

На собесах задача «построить retention matrix в SQL» встречается регулярно. Middle-аналитик должен писать за 10 минут.

Что такое cohort analysis

Группируем users по когорте (обычно signup month), tracking их behavior over time.

Типичный вывод — heatmap:

Cohort Month 0 Month 1 Month 2 Month 3
Jan 100% 45% 30% 25%
Feb 100% 50% 32% 27%

Каждая строка — cohort. Каждая колонка — offset месяц.

Полный SQL

WITH cohorts AS (
    -- Определить cohort (signup month) каждого пользователя
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM users
    GROUP BY user_id
),
user_activity AS (
    -- Активность каждого пользователя по месяцам
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) AS activity_month
    FROM events
    GROUP BY 1, 2
),
cohort_sizes AS (
    -- Размер каждой когорты
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM cohorts
    GROUP BY 1
)
SELECT
    c.cohort_month,
    EXTRACT(month FROM AGE(ua.activity_month, c.cohort_month)) +
        EXTRACT(year FROM AGE(ua.activity_month, c.cohort_month)) * 12 AS months_since_signup,
    COUNT(DISTINCT ua.user_id) AS active_users,
    cs.cohort_size,
    COUNT(DISTINCT ua.user_id) * 100.0 / cs.cohort_size AS retention_pct
FROM cohorts c
JOIN user_activity ua ON ua.user_id = c.user_id
JOIN cohort_sizes cs ON cs.cohort_month = c.cohort_month
WHERE ua.activity_month >= c.cohort_month
GROUP BY 1, 2, cs.cohort_size
ORDER BY 1, 2;

Пошаговое разъяснение

Step 1: Cohort assignment

WITH cohorts AS (
    SELECT user_id, DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM users
    GROUP BY user_id
)

Cohort = месяц первой регистрации.

Step 2: Activity per month

user_activity AS (
    SELECT user_id, DATE_TRUNC('month', created_at) AS activity_month
    FROM events
    GROUP BY 1, 2
)

Active = был event в месяце.

Step 3: Offset

activity_month - cohort_month в месяцах:

EXTRACT(month FROM AGE(activity_month, cohort_month)) +
EXTRACT(year FROM AGE(activity_month, cohort_month)) * 12

Postgres-style. MySQL: TIMESTAMPDIFF(MONTH, cohort_month, activity_month).

Step 4: Aggregate

Count unique users per (cohort, offset).

Step 5: % of cohort

Normalize by cohort size.

В ClickHouse

WITH cohorts AS (
    SELECT user_id, toStartOfMonth(min(created_at)) AS cohort_month
    FROM users
    GROUP BY user_id
),
user_activity AS (
    SELECT DISTINCT user_id, toStartOfMonth(created_at) AS activity_month
    FROM events
)
SELECT
    c.cohort_month,
    dateDiff('month', c.cohort_month, ua.activity_month) AS months_since,
    uniqExact(ua.user_id) AS active
FROM cohorts c
JOIN user_activity ua USING user_id
GROUP BY 1, 2;

Retention variants

Strict retention

Active именно в month N:

WHERE activity_month = cohort_month + INTERVAL '30 days'

Cumulative retention

Active в month N или позже. Часто более informative.

Rolling retention

Active в last N days относительно сегодня.

Different cohort definitions

By signup week

DATE_TRUNC('week', MIN(created_at))

Weekly cohorts — more granular.

By first purchase

DATE_TRUNC('month', MIN(purchase_at)) AS purchase_cohort

For transactional products.

By acquisition channel

cohort = (signup_month, acquisition_channel)

Compare organic vs paid cohorts.

Pivoted output

Для visualization нужен wide format:

SELECT
    cohort_month,
    MAX(CASE WHEN months_since = 0 THEN retention_pct END) AS m0,
    MAX(CASE WHEN months_since = 1 THEN retention_pct END) AS m1,
    MAX(CASE WHEN months_since = 2 THEN retention_pct END) AS m2,
    MAX(CASE WHEN months_since = 3 THEN retention_pct END) AS m3
FROM cohort_data
GROUP BY cohort_month;

Dashboard считается

Retention curve

Avg retention по месяцам — один line.

Cohort heatmap

Color-coded matrix. Green good, red bad.

Cohort ridgeline

Plot каждый cohort as curve.

Performance tips

Incremental

Daily: compute only new cohorts / new months. Store в materialized view.

Sampling

Для exploration — 10% sample достаточно, faster.

Index

Index events(user_id, created_at) — critical.

Pre-aggregation

Daily active users — pre-compute → cohort analysis на этом.

Частые ошибки

Mistake: cohort = current_date

Cohort — signup time, не today.

Missing activities

Если event не fire для некоторых users — retention выглядит хуже. Check event tracking.

Time zone

User signup в 23:50 UTC → 02:50 MSK (next day). Use consistent timezone.

Mixed definitions

«Active» должен быть consistent. Event open, or transaction, or something else — choose и stick.

Анализ результатов

Retention по cohort

Sequential cohorts should trend similarly. Sharp drop — product change?

Improvement cohort on cohort

Best sign: retention ↑ для newer cohorts → product improving.

Cliff

Big drop между month N и N+1 — what's breaking?

На собесе

«Напишите SQL для retention matrix».

Walk through:

  1. Cohort assignment (CTE 1)
  2. Activity per user month (CTE 2)
  3. Join, compute offset
  4. Aggregate per (cohort, offset)
  5. Normalize by cohort size

10 минут — ok для middle.

Связанные темы

FAQ

Weekly или monthly?

Monthly — standard. Weekly — для B2C с high frequency.

Rolling vs static?

Static cohort — better для trend analysis.

Время computation?

10M users: несколько минут. Pre-aggregate для dashboards.


Тренируйте SQL — откройте тренажёр с 1500+ вопросами для собесов.