Как посчитать Standard Deviation в SQL
Содержание:
Зачем Standard Deviation
SD = насколько values разбросаны вокруг mean. Mean 100 ± 5 vs 100 ± 50 — разные продукты. SD — основа для CI, z-test, outlier detection, A/B-планирования.
STDDEV vs POP vs SAMP
STDDEV(x) -- alias для STDDEV_SAMP в PostgreSQL
STDDEV_POP(x) -- popular (deviation от mean / N)
STDDEV_SAMP(x) -- sample (deviation от mean / (N-1))Sample SD больше (denominator N-1 < N). Bessel's correction.
При reporting: используй SAMP (default), it's unbiased estimator.
Базовый расчёт
SELECT
AVG(amount) AS mean,
STDDEV(amount) AS sd,
MIN(amount) AS min_val,
MAX(amount) AS max_val,
COUNT(*) AS n
FROM transactions
WHERE status = 'paid'
AND created_at >= CURRENT_DATE - INTERVAL '30 days';По группам
SELECT
country,
COUNT(*) AS n,
AVG(amount) AS mean,
STDDEV(amount) AS sd,
-- Coefficient of variation
STDDEV(amount) / NULLIF(AVG(amount), 0) AS cv,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median
FROM transactions
JOIN users USING (user_id)
WHERE status = 'paid'
AND created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY country
HAVING COUNT(*) >= 30 -- meaningful SD
ORDER BY cv; -- most consistent firstCoefficient of variation
CV = SD / Mean. Сравнивает variability across groups с разными scales:
SELECT
metric_name,
AVG(value) AS mean,
STDDEV(value) AS sd,
STDDEV(value) / NULLIF(AVG(value), 0) AS cv,
-- CV interpretation
CASE
WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 0.1 THEN 'very consistent'
WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 0.3 THEN 'moderate variance'
WHEN STDDEV(value) / NULLIF(AVG(value), 0) < 1.0 THEN 'high variance'
ELSE 'extreme variance'
END AS variance_level
FROM metric_data
GROUP BY metric_name;Rolling SD
SELECT
DATE,
daily_revenue,
AVG(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS ma_30d,
STDDEV(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS sd_30d,
-- Z-score (anomaly indicator)
(daily_revenue - AVG(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW))
/ NULLIF(STDDEV(daily_revenue) OVER (ORDER BY DATE ROWS BETWEEN 29 PRECEDING AND CURRENT ROW), 0) AS z_score
FROM daily_revenue
ORDER BY DATE;|z| > 2 — possible anomaly.
Частые ошибки
Ошибка 1. SD на skewed data. Heavy right tail — SD inflated. Use IQR / MAD для robust spread.
Ошибка 2. STDDEV vs VAR. VAR = SD². SD same units as data. SD more interpretable.
Ошибка 3. SAMP vs POP confusion. Sample (default) почти всегда правильный.
Ошибка 4. Compare SDs different scales. SD revenue (USD) vs SD count — apples-oranges. CV normalize.
Ошибка 5. Outliers скew SD. Один whale × SD doubles. Winsorize или trim.
Связанные темы
- Как посчитать confidence interval в SQL
- Как посчитать z-score в SQL
- Как посчитать anomaly detection в SQL
- Медиана vs среднее
FAQ
STDDEV vs STDDEV_POP?
STDDEV alias for SAMP в PostgreSQL. SAMP — unbiased estimator (divide by N-1).
SD vs Variance?
Variance = SD². SD has same units as data → more interpretable.
SD vs IQR?
SD assumes normal distribution. IQR robust для skewed.
CV percent?
Yes — multiply by 100. CV 15% = SD = 15% of mean.
When 0 SD?
All values identical. Or N=1 (sample) → SAMP undefined.