Как посчитать корреляцию в SQL
Содержание:
Зачем корреляция
Pearson r ∈ [-1, 1]. r = 0.8 — strong positive linear relationship. r = -0.5 — moderate negative. r = 0 — no linear relationship (но могут быть non-linear).
Формула
r = Σ((xi - x̄)(yi - ȳ)) / √(Σ(xi - x̄)² × Σ(yi - ȳ)²)В SQL — встроенная функция CORR(x, y) в PostgreSQL.
CORR в SQL
SELECT
CORR(user_age, lifetime_value) AS corr_age_ltv,
COUNT(*) AS n
FROM users
WHERE lifetime_value IS NOT NULL
AND user_age IS NOT NULL;По группам
SELECT
country,
COUNT(*) AS n,
CORR(user_age, lifetime_value) AS r,
CORR(user_age, lifetime_value) * CORR(user_age, lifetime_value) AS r_squared
FROM users
WHERE lifetime_value IS NOT NULL
GROUP BY country
HAVING COUNT(*) >= 30 -- meaningful sample
ORDER BY ABS(r) DESC;Multiple variables
Pairwise correlation matrix:
WITH stats AS (
SELECT user_id, age, ltv, days_active, n_sessions
FROM users
WHERE ltv IS NOT NULL
)
SELECT
CORR(age, ltv) AS age_ltv,
CORR(age, days_active) AS age_days_active,
CORR(age, n_sessions) AS age_n_sessions,
CORR(ltv, days_active) AS ltv_days_active,
CORR(ltv, n_sessions) AS ltv_n_sessions,
CORR(days_active, n_sessions) AS days_active_n_sessions
FROM stats;Spearman (rank-based)
Pearson assumes linear + normal. Spearman robust для non-linear monotonic:
WITH ranked AS (
SELECT
RANK() OVER (ORDER BY x) AS rank_x,
RANK() OVER (ORDER BY y) AS rank_y
FROM data
)
SELECT CORR(rank_x::NUMERIC, rank_y::NUMERIC) AS spearman_r
FROM ranked;Интерпретация r
| r | Strength |
|---|---|
| 0.9-1.0 | Very strong |
| 0.7-0.9 | Strong |
| 0.5-0.7 | Moderate |
| 0.3-0.5 | Weak |
| 0.0-0.3 | Very weak / none |
Negative: same scale, but inverse direction.
R² (= r²) — % variance explained. r = 0.5 → R² = 0.25 (25% explained).
Частые ошибки
Ошибка 1. Correlation = causation. r = 0.9 не значит X → Y. Confounders.
Ошибка 2. Non-linear. r = 0 для perfect parabola. Plot data first.
Ошибка 3. Outliers. Один extreme point может flip sign of r. Winsorize.
Ошибка 4. Small sample. N < 30 — unreliable r. Check CI.
Ошибка 5. Spurious correlations. Random time-series correlate. «X people on Mars and US per-capita cheese consumption» — fun examples.
Связанные темы
- Linear regression для аналитика
- Как посчитать standard deviation в SQL
- Correlation vs covariance
- Как проверить значимо ли среднее
FAQ
CORR в other databases?
PostgreSQL — CORR. MySQL — нет, считать manually. ClickHouse — corr. BQ — CORR.
Что значит r = 1?
Perfect positive linear. Almost never в real data unless tautology.
r vs r²?
r — strength + direction. r² — variance explained. R² = r² для linear regression.
Spearman vs Pearson?
Pearson — linear assumption. Spearman — monotonic, rank-based, robust.
CORR(x, x)?
Always 1 (perfect with itself).