Как посчитать ticket volume anomaly в SQL

Закрепи формулу ticket volume anomaly в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать ticket volume anomaly в Telegram

Зачем мониторить volume

Спайк тикетов — первый сигнал инцидента, который случился ДО того, как мониторинг alert'ит. Если за час пришло 200 «can't login» — это outage, не нужно ждать health-check alert. Volume anomaly check — main early warning для CS.

Базовая метрика

Число тикетов по часу или дню — старт.

SELECT
    DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
    COUNT(*) AS tickets
FROM support_tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE_TRUNC('hour', created_at)
ORDER BY hour;

Z-score алерт

Сравнение с 28-дневным baseline:

WITH hourly AS (
    SELECT
        DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
        COUNT(*) AS tickets
    FROM support_tickets
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE_TRUNC('hour', created_at)
),
rolling AS (
    SELECT
        hour,
        tickets,
        AVG(tickets) OVER w AS rolling_avg,
        STDDEV_SAMP(tickets) OVER w AS rolling_std
    FROM hourly
    WINDOW w AS (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING)  -- 7 days × 24 hours
)
SELECT
    hour,
    tickets,
    rolling_avg,
    (tickets - rolling_avg) / NULLIF(rolling_std, 0) AS z_score,
    CASE
        WHEN (tickets - rolling_avg) / NULLIF(rolling_std, 0) > 3 THEN 'CRITICAL_SPIKE'
        WHEN (tickets - rolling_avg) / NULLIF(rolling_std, 0) > 2 THEN 'warning'
        ELSE 'ok'
    END AS verdict
FROM rolling
WHERE hour >= CURRENT_DATE - INTERVAL '1 day'
ORDER BY hour DESC;

z > 3 — точно incident. PagerDuty alert.

По категориям

Тикет volume spike в одной категории — конкретная проблема:

WITH hourly_by_cat AS (
    SELECT
        category,
        DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
        COUNT(*) AS tickets
    FROM support_tickets
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY category, DATE_TRUNC('hour', created_at)
),
with_z AS (
    SELECT
        category,
        hour,
        tickets,
        AVG(tickets) OVER (PARTITION BY category ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS baseline,
        STDDEV_SAMP(tickets) OVER (PARTITION BY category ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS std
    FROM hourly_by_cat
)
SELECT
    category,
    hour,
    tickets,
    baseline,
    (tickets - baseline) / NULLIF(std, 0) AS z
FROM with_z
WHERE hour >= CURRENT_DATE - INTERVAL '1 day'
  AND ABS((tickets - baseline) / NULLIF(std, 0)) > 2.5;

Если billing spike z=5 — обычно payment processor проблема.

Закрепи формулу ticket volume anomaly в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать ticket volume anomaly в Telegram

Day-of-week

Понедельник всегда больше тикетов, чем выходные. Стратифицируйте:

WITH hourly AS (
    SELECT
        EXTRACT(DOW FROM created_at)::INT AS dow,
        EXTRACT(HOUR FROM created_at)::INT AS hour_of_day,
        DATE(created_at) AS day,
        COUNT(*) AS tickets
    FROM support_tickets
    WHERE created_at >= CURRENT_DATE - INTERVAL '60 days'
    GROUP BY dow, hour_of_day, DATE(created_at)
),
baseline_by_slot AS (
    SELECT
        dow, hour_of_day,
        AVG(tickets) AS slot_avg,
        STDDEV_SAMP(tickets) AS slot_std
    FROM hourly
    WHERE day < CURRENT_DATE - INTERVAL '7 days'
    GROUP BY dow, hour_of_day
)
SELECT
    h.day,
    h.dow,
    h.hour_of_day,
    h.tickets,
    b.slot_avg,
    (h.tickets - b.slot_avg) / NULLIF(b.slot_std, 0) AS z_vs_slot
FROM hourly h
JOIN baseline_by_slot b USING (dow, hour_of_day)
WHERE h.day >= CURRENT_DATE - INTERVAL '7 days';

Частые ошибки

Ошибка 1. Без day-of-week stratification. Пик в понедельник 9 утра — норма, не anomaly.

Ошибка 2. Слишком чувствительный threshold. z > 2 = alert каждый день. Спецы устают. z > 3 — стандарт для critical alerts.

Ошибка 3. Window не учитывает sustained growth. Если ticket count steady растёт (новые юзеры), z отрицательный для baseline-старого. Используйте relative growth.

Ошибка 4. Marketing-campaigns spike → false positive. Major campaign → expected ticket spike. Whitelist known events.

Ошибка 5. Только volume, не severity. 30 тикетов «critical» хуже 100 «easy». Включайте severity-weight.

Связанные темы

FAQ

Каков baseline?

7-28 дней rolling, stratified by hour-of-week.

Алерт куда?

PagerDuty / Slack channel. Auto-page on-call CS lead.

Z или MAD?

MAD устойчивее к outliers. На volatile metrics предпочтительнее.

Real-time или batch?

Real-time: 5-минутные windows. Batch: hourly.

Spike в одной категории — что значит?

Чаще — bug в feature. Иногда — рассылка/campaign trigger.