Как посчитать latency percentiles в SQL

Закрепи формулу latency percentiles в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать latency percentiles в Telegram

Зачем percentiles

Average latency обманчив: 99 запросов по 100ms + 1 запрос за 10s даёт avg 199ms. Но 1% юзеров видит 10 секунд. P95 / p99 показывают tail — это «boundary worst case» для большинства юзеров. P99 — golden metric SRE.

Формула

p50 = median, p95 = top 5% slowest, p99 = top 1% slowest

PostgreSQL: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms).

Percentiles в SQL

SELECT
    DATE_TRUNC('minute', request_timestamp) AS minute,
    COUNT(*) AS requests,
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY latency_ms) AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms) AS p999
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY DATE_TRUNC('minute', request_timestamp)
ORDER BY minute;

P50 100ms, p95 500ms, p99 2s — типичные numbers для веб-API.

По endpoint

SELECT
    endpoint,
    COUNT(*) AS requests,
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY latency_ms) AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY endpoint
HAVING COUNT(*) >= 1000
ORDER BY p99 DESC
LIMIT 10;

Top endpoint с p99 проблемой — оптимизация.

Закрепи формулу latency percentiles в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать latency percentiles в Telegram

Tail latency

SELECT
    endpoint,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms) AS p999,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms)
    / NULLIF(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms), 0) AS p999_to_p99_ratio
FROM api_requests
GROUP BY endpoint
HAVING COUNT(*) >= 10000
ORDER BY p999_to_p99_ratio DESC;

P999/P99 > 5× — heavy tail. Тут отдельная investigation.

Частые ошибки

Ошибка 1. Average latency. Скрывает tail. P95+ требуется.

Ошибка 2. Считать на small samples. P99 на 100 точек ненадёжен. Минимум 10k-100k для accurate.

Ошибка 3. Wide window. Day-long aggregate скрывает min-by-min spikes. Per-minute buckets.

Ошибка 4. Сравнивать percentiles разных endpoint. GET /healthcheck vs POST /search — разные complexity, разные expectations.

Ошибка 5. P99 как goal. P99 dramatically variable. SLOs обычно на p99.5 или p99.9 (более стабильные).

Связанные темы

FAQ

Какие percentiles tracked?

P50, p95, p99 стандарт. P99.9 для high-traffic services.

P50 vs avg?

P50 = median. Robust to outliers. Avg manipulated by tail.

Sample size?

P99: 1000+ points minimum. P999: 100k+.

Different tools = different numbers?

Yes — DataDog, Grafana может aggregate differently. Pick один и stick.

Latency SLO target?

API: p99 меньше 500ms. Backend service: p99 меньше 100ms.