Как посчитать ticket volume anomaly в SQL
Содержание:
Зачем мониторить volume
Спайк тикетов — первый сигнал инцидента, который случился ДО того, как мониторинг alert'ит. Если за час пришло 200 «can't login» — это outage, не нужно ждать health-check alert. Volume anomaly check — main early warning для CS.
Базовая метрика
Число тикетов по часу или дню — старт.
SELECT
DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
COUNT(*) AS tickets
FROM support_tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE_TRUNC('hour', created_at)
ORDER BY hour;Z-score алерт
Сравнение с 28-дневным baseline:
WITH hourly AS (
SELECT
DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
COUNT(*) AS tickets
FROM support_tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('hour', created_at)
),
rolling AS (
SELECT
hour,
tickets,
AVG(tickets) OVER w AS rolling_avg,
STDDEV_SAMP(tickets) OVER w AS rolling_std
FROM hourly
WINDOW w AS (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) -- 7 days × 24 hours
)
SELECT
hour,
tickets,
rolling_avg,
(tickets - rolling_avg) / NULLIF(rolling_std, 0) AS z_score,
CASE
WHEN (tickets - rolling_avg) / NULLIF(rolling_std, 0) > 3 THEN 'CRITICAL_SPIKE'
WHEN (tickets - rolling_avg) / NULLIF(rolling_std, 0) > 2 THEN 'warning'
ELSE 'ok'
END AS verdict
FROM rolling
WHERE hour >= CURRENT_DATE - INTERVAL '1 day'
ORDER BY hour DESC;z > 3 — точно incident. PagerDuty alert.
По категориям
Тикет volume spike в одной категории — конкретная проблема:
WITH hourly_by_cat AS (
SELECT
category,
DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
COUNT(*) AS tickets
FROM support_tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY category, DATE_TRUNC('hour', created_at)
),
with_z AS (
SELECT
category,
hour,
tickets,
AVG(tickets) OVER (PARTITION BY category ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS baseline,
STDDEV_SAMP(tickets) OVER (PARTITION BY category ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS std
FROM hourly_by_cat
)
SELECT
category,
hour,
tickets,
baseline,
(tickets - baseline) / NULLIF(std, 0) AS z
FROM with_z
WHERE hour >= CURRENT_DATE - INTERVAL '1 day'
AND ABS((tickets - baseline) / NULLIF(std, 0)) > 2.5;Если billing spike z=5 — обычно payment processor проблема.
Day-of-week
Понедельник всегда больше тикетов, чем выходные. Стратифицируйте:
WITH hourly AS (
SELECT
EXTRACT(DOW FROM created_at)::INT AS dow,
EXTRACT(HOUR FROM created_at)::INT AS hour_of_day,
DATE(created_at) AS day,
COUNT(*) AS tickets
FROM support_tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '60 days'
GROUP BY dow, hour_of_day, DATE(created_at)
),
baseline_by_slot AS (
SELECT
dow, hour_of_day,
AVG(tickets) AS slot_avg,
STDDEV_SAMP(tickets) AS slot_std
FROM hourly
WHERE day < CURRENT_DATE - INTERVAL '7 days'
GROUP BY dow, hour_of_day
)
SELECT
h.day,
h.dow,
h.hour_of_day,
h.tickets,
b.slot_avg,
(h.tickets - b.slot_avg) / NULLIF(b.slot_std, 0) AS z_vs_slot
FROM hourly h
JOIN baseline_by_slot b USING (dow, hour_of_day)
WHERE h.day >= CURRENT_DATE - INTERVAL '7 days';Частые ошибки
Ошибка 1. Без day-of-week stratification. Пик в понедельник 9 утра — норма, не anomaly.
Ошибка 2. Слишком чувствительный threshold. z > 2 = alert каждый день. Спецы устают. z > 3 — стандарт для critical alerts.
Ошибка 3. Window не учитывает sustained growth. Если ticket count steady растёт (новые юзеры), z отрицательный для baseline-старого. Используйте relative growth.
Ошибка 4. Marketing-campaigns spike → false positive. Major campaign → expected ticket spike. Whitelist known events.
Ошибка 5. Только volume, не severity. 30 тикетов «critical» хуже 100 «easy». Включайте severity-weight.
Связанные темы
- Как посчитать data volume anomaly в SQL
- Как посчитать anomaly detection в SQL
- Как посчитать z-score outliers в SQL
- Как посчитать first response time в SQL
FAQ
Каков baseline?
7-28 дней rolling, stratified by hour-of-week.
Алерт куда?
PagerDuty / Slack channel. Auto-page on-call CS lead.
Z или MAD?
MAD устойчивее к outliers. На volatile metrics предпочтительнее.
Real-time или batch?
Real-time: 5-минутные windows. Batch: hourly.
Spike в одной категории — что значит?
Чаще — bug в feature. Иногда — рассылка/campaign trigger.