Как сделать Gap Analysis в SQL
Содержание:
Зачем Gap Analysis
«Gaps and Islands» — classic SQL pattern. Найти group consecutive events (islands) и breaks between (gaps). Streaks, missing data, sequential analysis — все это gap analysis.
Gaps and Islands pattern
Key trick: assign group_id используя ROW_NUMBER() difference. Consecutive rows have same diff.
WITH activity AS (
SELECT
user_id,
active_date,
active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY active_date) AS streak_anchor
FROM user_activity
)
SELECT
user_id,
streak_anchor,
COUNT(*) AS streak_length,
MIN(active_date) AS streak_start,
MAX(active_date) AS streak_end
FROM activity
GROUP BY user_id, streak_anchor
ORDER BY streak_length DESC;active_date - row_number constant для consecutive dates.
Missing dates
Find gaps в date series:
WITH ordered AS (
SELECT
DATE,
LAG(DATE) OVER (ORDER BY DATE) AS prev_date,
DATE - LAG(DATE) OVER (ORDER BY DATE) AS gap_days
FROM daily_metrics
)
SELECT
prev_date AS gap_start,
DATE AS gap_end,
gap_days - 1 AS missing_days
FROM ordered
WHERE gap_days > 1
ORDER BY gap_start;Consecutive active days
Streak per user:
WITH grouped AS (
SELECT
user_id,
active_date,
active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY active_date) AS group_key
FROM user_activity
),
streaks AS (
SELECT
user_id,
group_key,
COUNT(*) AS streak_length,
MIN(active_date) AS start_date,
MAX(active_date) AS end_date
FROM grouped
GROUP BY user_id, group_key
)
SELECT
user_id,
MAX(streak_length) AS longest_streak,
MAX(streak_length) FILTER (WHERE end_date = CURRENT_DATE - 1) AS current_streak
FROM streaks
GROUP BY user_id
ORDER BY longest_streak DESC;Detect gaps в sequence
Sequential IDs с пропусками:
WITH ordered AS (
SELECT
id,
LAG(id) OVER (ORDER BY id) AS prev_id
FROM events
)
SELECT
prev_id AS missing_after,
id AS missing_before,
id - prev_id - 1 AS missing_count
FROM ordered
WHERE id - prev_id > 1
ORDER BY missing_count DESC;Islands by attribute
Когда user status меняется во времени:
WITH status_changes AS (
SELECT
user_id,
status_date,
status,
LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) AS prev_status,
SUM(CASE
WHEN LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) = status THEN 0
ELSE 1
END) OVER (PARTITION BY user_id ORDER BY status_date) AS island_id
FROM user_status_history
)
SELECT
user_id,
status,
island_id,
MIN(status_date) AS island_start,
MAX(status_date) AS island_end,
COUNT(*) AS island_length_days
FROM status_changes
GROUP BY user_id, status, island_id;Частые ошибки
Ошибка 1. ROW_NUMBER vs RANK. RANK даёт ties same number → broken для consecutive logic. Use ROW_NUMBER.
Ошибка 2. Date type mismatch. date − integer ≠ date − interval. Cast carefully.
Ошибка 3. Missing partition. Forgot PARTITION BY → cross-user contamination.
Ошибка 4. Edge dates. First row LAG = NULL → check.
Ошибка 5. Gap interpretation. 1-day gap vs 7-day gap — different meaning. Threshold.
Связанные темы
- Window functions advanced
- Как посчитать active cohort в SQL
- Как посчитать sessionization в SQL
- Cumulative distinct на собесе DE
FAQ
Gaps and Islands — что это?
Classic SQL pattern. Islands = consecutive sequences. Gaps = breaks between.
ROW_NUMBER vs RANK?
ROW_NUMBER каждой строке unique. RANK дает ties same rank. ROW_NUMBER нужен.
Why date - row_number?
Consecutive dates produce constant value. Trick для grouping.
Sequential IDs или dates?
Same logic. Adjust interval (1 day или 1 unit).
Real use case?
Streaks (Duolingo). Active periods. Missing data alerts. Time-series gaps detection.