Как сделать Gap Analysis в SQL

Закрепи формулу gap analysis в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать gap analysis в Telegram

Зачем Gap Analysis

«Gaps and Islands» — classic SQL pattern. Найти group consecutive events (islands) и breaks between (gaps). Streaks, missing data, sequential analysis — все это gap analysis.

Gaps and Islands pattern

Key trick: assign group_id используя ROW_NUMBER() difference. Consecutive rows have same diff.

WITH activity AS (
    SELECT
        user_id,
        active_date,
        active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY active_date) AS streak_anchor
    FROM user_activity
)
SELECT
    user_id,
    streak_anchor,
    COUNT(*) AS streak_length,
    MIN(active_date) AS streak_start,
    MAX(active_date) AS streak_end
FROM activity
GROUP BY user_id, streak_anchor
ORDER BY streak_length DESC;

active_date - row_number constant для consecutive dates.

Missing dates

Find gaps в date series:

WITH ordered AS (
    SELECT
        DATE,
        LAG(DATE) OVER (ORDER BY DATE) AS prev_date,
        DATE - LAG(DATE) OVER (ORDER BY DATE) AS gap_days
    FROM daily_metrics
)
SELECT
    prev_date AS gap_start,
    DATE AS gap_end,
    gap_days - 1 AS missing_days
FROM ordered
WHERE gap_days > 1
ORDER BY gap_start;

Consecutive active days

Streak per user:

WITH grouped AS (
    SELECT
        user_id,
        active_date,
        active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY active_date) AS group_key
    FROM user_activity
),
streaks AS (
    SELECT
        user_id,
        group_key,
        COUNT(*) AS streak_length,
        MIN(active_date) AS start_date,
        MAX(active_date) AS end_date
    FROM grouped
    GROUP BY user_id, group_key
)
SELECT
    user_id,
    MAX(streak_length) AS longest_streak,
    MAX(streak_length) FILTER (WHERE end_date = CURRENT_DATE - 1) AS current_streak
FROM streaks
GROUP BY user_id
ORDER BY longest_streak DESC;
Закрепи формулу gap analysis в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать gap analysis в Telegram

Detect gaps в sequence

Sequential IDs с пропусками:

WITH ordered AS (
    SELECT
        id,
        LAG(id) OVER (ORDER BY id) AS prev_id
    FROM events
)
SELECT
    prev_id AS missing_after,
    id AS missing_before,
    id - prev_id - 1 AS missing_count
FROM ordered
WHERE id - prev_id > 1
ORDER BY missing_count DESC;

Islands by attribute

Когда user status меняется во времени:

WITH status_changes AS (
    SELECT
        user_id,
        status_date,
        status,
        LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) AS prev_status,
        SUM(CASE
            WHEN LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) = status THEN 0
            ELSE 1
        END) OVER (PARTITION BY user_id ORDER BY status_date) AS island_id
    FROM user_status_history
)
SELECT
    user_id,
    status,
    island_id,
    MIN(status_date) AS island_start,
    MAX(status_date) AS island_end,
    COUNT(*) AS island_length_days
FROM status_changes
GROUP BY user_id, status, island_id;

Частые ошибки

Ошибка 1. ROW_NUMBER vs RANK. RANK даёт ties same number → broken для consecutive logic. Use ROW_NUMBER.

Ошибка 2. Date type mismatch. date − integer ≠ date − interval. Cast carefully.

Ошибка 3. Missing partition. Forgot PARTITION BY → cross-user contamination.

Ошибка 4. Edge dates. First row LAG = NULL → check.

Ошибка 5. Gap interpretation. 1-day gap vs 7-day gap — different meaning. Threshold.

Связанные темы

FAQ

Gaps and Islands — что это?

Classic SQL pattern. Islands = consecutive sequences. Gaps = breaks between.

ROW_NUMBER vs RANK?

ROW_NUMBER каждой строке unique. RANK дает ties same rank. ROW_NUMBER нужен.

Why date - row_number?

Consecutive dates produce constant value. Trick для grouping.

Sequential IDs или dates?

Same logic. Adjust interval (1 day или 1 unit).

Real use case?

Streaks (Duolingo). Active periods. Missing data alerts. Time-series gaps detection.