Как посчитать Standard Deviation в SQL

Закрепи формулу standard deviation в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать standard deviation в Telegram

Зачем Standard Deviation

SD = насколько values разбросаны вокруг mean. Mean 100 ± 5 vs 100 ± 50 — разные продукты. SD — основа для CI, z-test, outlier detection, A/B-планирования.

STDDEV vs POP vs SAMP

STDDEV(x)       -- alias для STDDEV_SAMP в PostgreSQL
STDDEV_POP(x)   -- popular (deviation от mean / N)
STDDEV_SAMP(x)  -- sample (deviation от mean / (N-1))

Sample SD больше (denominator N-1 < N). Bessel's correction.

При reporting: используй SAMP (default), it's unbiased estimator.

Базовый расчёт

SELECT
    AVG(amount) AS mean,
    STDDEV(amount) AS sd,
    MIN(amount) AS min_val,
    MAX(amount) AS max_val,
    COUNT(*) AS n
FROM transactions
WHERE status = 'paid'
  AND created_at >= CURRENT_DATE - INTERVAL '30 days';

По группам

SELECT
    country,
    COUNT(*) AS n,
    AVG(amount) AS mean,
    STDDEV(amount) AS sd,
    -- Coefficient of variation
    STDDEV(amount) / NULLIF(AVG(amount), 0) AS cv,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median
FROM transactions
JOIN users USING (user_id)
WHERE status = 'paid'
  AND created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY country
HAVING COUNT(*) >= 30  -- meaningful SD
ORDER BY cv;  -- most consistent first
Закрепи формулу standard deviation в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать standard deviation в Telegram

Coefficient of variation

CV = SD / Mean. Сравнивает variability across groups с разными scales:

SELECT
    metric_name,
    AVG(value) AS mean,
    STDDEV(value) AS sd,
    STDDEV(value) / NULLIF(AVG(value), 0) AS cv,
    -- CV interpretation
    CASE
        WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 0.1 THEN 'very consistent'
        WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 0.3 THEN 'moderate variance'
        WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 1.0 THEN 'high variance'
        ELSE 'extreme variance'
    END AS variance_level
FROM metric_data
GROUP BY metric_name;

Rolling SD

SELECT
    DATE,
    daily_revenue,
    AVG(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS ma_30d,
    STDDEV(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS sd_30d,
    -- Z-score (anomaly indicator)
    (daily_revenue - AVG(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW))
    / NULLIF(STDDEV(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW), 0) AS z_score
FROM daily_revenue
ORDER BY DATE;

|z| > 2 — possible anomaly.

Частые ошибки

Ошибка 1. SD на skewed data. Heavy right tail — SD inflated. Use IQR / MAD для robust spread.

Ошибка 2. STDDEV vs VAR. VAR = SD². SD same units as data. SD more interpretable.

Ошибка 3. SAMP vs POP confusion. Sample (default) почти всегда правильный.

Ошибка 4. Compare SDs different scales. SD revenue (USD) vs SD count — apples-oranges. CV normalize.

Ошибка 5. Outliers скew SD. Один whale × SD doubles. Winsorize или trim.

Связанные темы

FAQ

STDDEV vs STDDEV_POP?

STDDEV alias for SAMP в PostgreSQL. SAMP — unbiased estimator (divide by N-1).

SD vs Variance?

Variance = SD². SD has same units as data → more interpretable.

SD vs IQR?

SD assumes normal distribution. IQR robust для skewed.

CV percent?

Yes — multiply by 100. CV 15% = SD = 15% of mean.

When 0 SD?

All values identical. Or N=1 (sample) → SAMP undefined.