Как посчитать latency percentiles в SQL
Содержание:
Зачем percentiles
Average latency обманчив: 99 запросов по 100ms + 1 запрос за 10s даёт avg 199ms. Но 1% юзеров видит 10 секунд. P95 / p99 показывают tail — это «boundary worst case» для большинства юзеров. P99 — golden metric SRE.
Формула
p50 = median, p95 = top 5% slowest, p99 = top 1% slowestPostgreSQL: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms).
Percentiles в SQL
SELECT
DATE_TRUNC('minute', request_timestamp) AS minute,
COUNT(*) AS requests,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms) AS p999
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY DATE_TRUNC('minute', request_timestamp)
ORDER BY minute;P50 100ms, p95 500ms, p99 2s — типичные numbers для веб-API.
По endpoint
SELECT
endpoint,
COUNT(*) AS requests,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM api_requests
WHERE request_timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY endpoint
HAVING COUNT(*) >= 1000
ORDER BY p99 DESC
LIMIT 10;Top endpoint с p99 проблемой — оптимизация.
Tail latency
SELECT
endpoint,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms) AS p999,
PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY latency_ms)
/ NULLIF(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms), 0) AS p999_to_p99_ratio
FROM api_requests
GROUP BY endpoint
HAVING COUNT(*) >= 10000
ORDER BY p999_to_p99_ratio DESC;P999/P99 > 5× — heavy tail. Тут отдельная investigation.
Частые ошибки
Ошибка 1. Average latency. Скрывает tail. P95+ требуется.
Ошибка 2. Считать на small samples. P99 на 100 точек ненадёжен. Минимум 10k-100k для accurate.
Ошибка 3. Wide window. Day-long aggregate скрывает min-by-min spikes. Per-minute buckets.
Ошибка 4. Сравнивать percentiles разных endpoint. GET /healthcheck vs POST /search — разные complexity, разные expectations.
Ошибка 5. P99 как goal. P99 dramatically variable. SLOs обычно на p99.5 или p99.9 (более стабильные).
Связанные темы
- Как посчитать API latency в SQL
- Как посчитать error rate в SQL
- Как посчитать перцентили в SQL
- Как посчитать z-score outliers в SQL
FAQ
Какие percentiles tracked?
P50, p95, p99 стандарт. P99.9 для high-traffic services.
P50 vs avg?
P50 = median. Robust to outliers. Avg manipulated by tail.
Sample size?
P99: 1000+ points minimum. P999: 100k+.
Different tools = different numbers?
Yes — DataDog, Grafana может aggregate differently. Pick один и stick.
Latency SLO target?
API: p99 меньше 500ms. Backend service: p99 меньше 100ms.