Как сделать sessionization в SQL
Содержание:
Зачем sessionization
«Сессия» — серия user events близкая по времени. GA standard — 30 min idle = new session. Без session_id невозможно посчитать average pages/session, session duration, bounce rate.
Базовая логика
Algorithm: assign session_id based on gap between consecutive events.
1. Order events by user + timestamp
2. Compute gap to previous event
3. If gap > threshold → new session
4. Cumulative sum of «new session» flags = session_idАлгоритм через gap
WITH events_with_gap AS (
SELECT
user_id,
event_time,
event_name,
EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) / 60 AS gap_min
FROM events
WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
),
session_marks AS (
SELECT
*,
CASE
WHEN gap_min IS NULL OR gap_min > 30 THEN 1
ELSE 0
END AS new_session_flag
FROM events_with_gap
)
SELECT
user_id,
event_time,
event_name,
SUM(new_session_flag) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id_within_user
FROM session_marks;Глобальный session_id
WITH numbered AS (
SELECT
user_id,
event_time,
event_name,
SUM(new_session_flag) OVER (PARTITION BY user_id ORDER BY event_time) AS user_session_num
FROM (
-- ...session_marks...
) sm
)
SELECT
user_id || '_' || user_session_num AS session_id,
*
FROM numbered;Session metrics
WITH sessioned AS (
-- ...sessionization query...
),
session_agg AS (
SELECT
session_id,
user_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) / 60 AS duration_min,
COUNT(*) AS event_count,
COUNT(*) FILTER (WHERE event_name = 'page_view') AS page_views
FROM sessioned
GROUP BY session_id, user_id
)
SELECT
COUNT(*) AS sessions,
AVG(duration_min) AS avg_duration,
AVG(event_count) AS avg_events,
AVG(page_views) AS avg_page_views,
COUNT(*) FILTER (WHERE event_count = 1)::NUMERIC * 100 / COUNT(*) AS bounce_rate_pct
FROM session_agg;Multi-device sessions
User logged in on desktop + mobile within 30 min — one session или two?
Если бизнес-вопрос про «engagement journey» — merge multi-device by user_id (ignoring device). Если про UX per device — keep separate.
-- Per-device sessions (default)
PARTITION BY user_id, device_type
-- Cross-device sessions
PARTITION BY user_idЧастые ошибки
Ошибка 1. Threshold не universal. 30 min — GA convention. SaaS dashboard может быть 15 min. Mobile shopping — 60 min.
Ошибка 2. Time-zone confusion. event_time в local tz, threshold в UTC — гymn.
Ошибка 3. Mid-night cutoff. Session crosses midnight. Some systems force-cut, some don't. Be consistent.
Ошибка 4. Bot traffic. Bots have predictable patterns — could be 1 event sessions. Filter pre-sessionize.
Ошибка 5. Forgot LAG window. LAG без PARTITION BY user_id → wrong. Other user's last event used.
Связанные темы
- Как посчитать bounce rate в SQL
- Как посчитать exit rate в SQL
- Как посчитать page views в SQL
- Cumulative distinct на собесе DE
FAQ
Какой threshold?
30 min — стандарт (GA). News sites — 5 min. Banking — 5 min. Mobile games — 60 min.
Cross-device merge?
Если есть стабильный user_id (login). Иначе hard.
Sessionization во views vs raw?
Compute once, store sessionized. Re-computing на каждом query — heavy.
Realtime sessionization?
Stream processing (Flink, Spark Structured Streaming). Hard в pure SQL.
Session vs visit?
Часто синонимы. «Visit» — web analytics term. «Session» — broader (mobile, web, server).