Как посчитать Error Rate в SQL

Закрепи формулу error rate в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать error rate в Telegram

Зачем Error Rate

Error Rate — % requests с error response (HTTP 5xx, 4xx). 5xx = server error. 4xx = client error. SLO часто включает «success rate ≥ 99.9%».

Формула

Error Rate = error_requests / total_requests × 100%
Success Rate = 100 - Error Rate

Базовый расчёт

SELECT
    DATE_TRUNC('hour', TIMESTAMP) AS hour,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE response_code >= 500) AS errors_5xx,
    COUNT(*) FILTER (WHERE response_code BETWEEN 400 AND 499) AS errors_4xx,
    COUNT(*) FILTER (WHERE response_code BETWEEN 500 AND 599)::NUMERIC * 100 / COUNT(*) AS error_5xx_pct,
    COUNT(*) FILTER (WHERE response_code BETWEEN 400 AND 499)::NUMERIC * 100 / COUNT(*) AS error_4xx_pct
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;

По endpoints

SELECT
    endpoint,
    COUNT(*) AS requests,
    COUNT(*) FILTER (WHERE response_code >= 500) AS errors_5xx,
    COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100 / COUNT(*) AS error_rate_pct,
    MAX(response_code) AS worst_code
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '24 hours'
GROUP BY endpoint
HAVING COUNT(*) >= 100  -- meaningful
ORDER BY error_rate_pct DESC
LIMIT 30;

По status code distribution

SELECT
    response_code,
    COUNT(*) AS occurrences,
    COUNT(*)::NUMERIC * 100 / SUM(COUNT(*)) OVER () AS pct
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '1 hour'
  AND response_code >= 400
GROUP BY response_code
ORDER BY occurrences DESC;
Закрепи формулу error rate в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать error rate в Telegram

Error budget

SLO 99.9% success → 0.1% error budget. Например, 1M requests/day → budget 1000 errors:

WITH stats AS (
    SELECT
        DATE_TRUNC('day', TIMESTAMP) AS day,
        COUNT(*) AS total,
        COUNT(*) FILTER (WHERE response_code >= 500) AS errors
    FROM api_logs
    WHERE TIMESTAMP >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)
SELECT
    day,
    total,
    errors,
    errors::NUMERIC * 100 / total AS error_pct,
    total * 0.001 AS budget_99_9pct,
    errors - total * 0.001 AS over_budget,
    CASE
        WHEN errors <= total * 0.001 THEN 'WITHIN BUDGET'
        WHEN errors <= total * 0.005 THEN 'WARNING'
        ELSE 'BREACH'
    END AS status
FROM stats
ORDER BY day DESC;

Trend

SELECT
    DATE_TRUNC('hour', TIMESTAMP) AS hour,
    COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100 / COUNT(*) AS error_rate,
    AVG(COUNT(*) FILTER (WHERE response_code >= 500)::NUMERIC * 100 / COUNT(*)) OVER (
        ORDER BY DATE_TRUNC('hour', TIMESTAMP)
        ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
    ) AS error_rate_ma24h
FROM api_logs
WHERE TIMESTAMP >= NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;

Spike vs baseline easily visible.

Частые ошибки

Ошибка 1. 4xx vs 5xx mixed. 4xx — client error (bad input). 5xx — server error. Different actions.

Ошибка 2. Retries inflate. Failed request retried 3 times → 4 logged errors. Dedup by request_id.

Ошибка 3. Ignore 401/403. Auth errors — could be legit (expired session) или attack. Track separately.

Ошибка 4. Background jobs не counted. Async jobs могут fail silently. Track too.

Ошибка 5. Health check 200 OK. Healthcheck returns 200 always — masks real error rate. Exclude.

Связанные темы

FAQ

Какой Error Rate ok?

Critical API: < 0.1%. Public API: < 0.5%. Internal: < 1%. > 5% — emergency.

4xx как error?

Some — yes (5xx обязательно). 401, 403 — auth issues, count separately.

Error budget concept?

SRE practice. Defines acceptable failure (e.g., 0.1% per quarter). Spend на innovation, не availability.

Spike of errors — first action?

  1. Check recent deploys. 2) DB health. 3) External dependencies (downstream services). 4) Rollback если recent change.

Error rate per user?

User-impact metric. # users hitting errors > # error events.