Как сделать sessionization в SQL

Закрепи формулу sessionization в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать sessionization в Telegram

Зачем sessionization

«Сессия» — серия user events близкая по времени. GA standard — 30 min idle = new session. Без session_id невозможно посчитать average pages/session, session duration, bounce rate.

Базовая логика

Algorithm: assign session_id based on gap between consecutive events.

1. Order events by user + timestamp
2. Compute gap to previous event
3. If gap > threshold → new session
4. Cumulative sum of «new session» flags = session_id

Алгоритм через gap

WITH events_with_gap AS (
    SELECT
        user_id,
        event_time,
        event_name,
        EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) / 60 AS gap_min
    FROM events
    WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
),
session_marks AS (
    SELECT
        *,
        CASE
            WHEN gap_min IS NULL OR gap_min > 30 THEN 1
            ELSE 0
        END AS new_session_flag
    FROM events_with_gap
)
SELECT
    user_id,
    event_time,
    event_name,
    SUM(new_session_flag) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id_within_user
FROM session_marks;

Глобальный session_id

WITH numbered AS (
    SELECT
        user_id,
        event_time,
        event_name,
        SUM(new_session_flag) OVER (PARTITION BY user_id ORDER BY event_time) AS user_session_num
    FROM (
        -- ...session_marks...
    ) sm
)
SELECT
    user_id || '_' || user_session_num AS session_id,
    *
FROM numbered;
Закрепи формулу sessionization в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать sessionization в Telegram

Session metrics

WITH sessioned AS (
    -- ...sessionization query...
),
session_agg AS (
    SELECT
        session_id,
        user_id,
        MIN(event_time) AS session_start,
        MAX(event_time) AS session_end,
        EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) / 60 AS duration_min,
        COUNT(*) AS event_count,
        COUNT(*) FILTER (WHERE event_name = 'page_view') AS page_views
    FROM sessioned
    GROUP BY session_id, user_id
)
SELECT
    COUNT(*) AS sessions,
    AVG(duration_min) AS avg_duration,
    AVG(event_count) AS avg_events,
    AVG(page_views) AS avg_page_views,
    COUNT(*) FILTER (WHERE event_count = 1)::NUMERIC * 100 / COUNT(*) AS bounce_rate_pct
FROM session_agg;

Multi-device sessions

User logged in on desktop + mobile within 30 min — one session или two?

Если бизнес-вопрос про «engagement journey» — merge multi-device by user_id (ignoring device). Если про UX per device — keep separate.

-- Per-device sessions (default)
PARTITION BY user_id, device_type

-- Cross-device sessions
PARTITION BY user_id

Частые ошибки

Ошибка 1. Threshold не universal. 30 min — GA convention. SaaS dashboard может быть 15 min. Mobile shopping — 60 min.

Ошибка 2. Time-zone confusion. event_time в local tz, threshold в UTC — гymn.

Ошибка 3. Mid-night cutoff. Session crosses midnight. Some systems force-cut, some don't. Be consistent.

Ошибка 4. Bot traffic. Bots have predictable patterns — could be 1 event sessions. Filter pre-sessionize.

Ошибка 5. Forgot LAG window. LAG без PARTITION BY user_id → wrong. Other user's last event used.

Связанные темы

FAQ

Какой threshold?

30 min — стандарт (GA). News sites — 5 min. Banking — 5 min. Mobile games — 60 min.

Cross-device merge?

Если есть стабильный user_id (login). Иначе hard.

Sessionization во views vs raw?

Compute once, store sessionized. Re-computing на каждом query — heavy.

Realtime sessionization?

Stream processing (Flink, Spark Structured Streaming). Hard в pure SQL.

Session vs visit?

Часто синонимы. «Visit» — web analytics term. «Session» — broader (mobile, web, server).