Как посчитать корреляцию в SQL

Закрепи формулу correlation в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать correlation в Telegram

Зачем корреляция

Pearson r ∈ [-1, 1]. r = 0.8 — strong positive linear relationship. r = -0.5 — moderate negative. r = 0 — no linear relationship (но могут быть non-linear).

Формула

r = Σ((xi - x̄)(yi - ȳ)) / √(Σ(xi - x̄)² × Σ(yi - ȳ)²)

В SQL — встроенная функция CORR(x, y) в PostgreSQL.

CORR в SQL

SELECT
    CORR(user_age, lifetime_value) AS corr_age_ltv,
    COUNT(*) AS n
FROM users
WHERE lifetime_value IS NOT NULL
  AND user_age IS NOT NULL;

По группам

SELECT
    country,
    COUNT(*) AS n,
    CORR(user_age, lifetime_value) AS r,
    CORR(user_age, lifetime_value) * CORR(user_age, lifetime_value) AS r_squared
FROM users
WHERE lifetime_value IS NOT NULL
GROUP BY country
HAVING COUNT(*) >= 30  -- meaningful sample
ORDER BY ABS(r) DESC;

Multiple variables

Pairwise correlation matrix:

WITH stats AS (
    SELECT user_id, age, ltv, days_active, n_sessions
    FROM users
    WHERE ltv IS NOT NULL
)
SELECT
    CORR(age, ltv) AS age_ltv,
    CORR(age, days_active) AS age_days_active,
    CORR(age, n_sessions) AS age_n_sessions,
    CORR(ltv, days_active) AS ltv_days_active,
    CORR(ltv, n_sessions) AS ltv_n_sessions,
    CORR(days_active, n_sessions) AS days_active_n_sessions
FROM stats;
Закрепи формулу correlation в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать correlation в Telegram

Spearman (rank-based)

Pearson assumes linear + normal. Spearman robust для non-linear monotonic:

WITH ranked AS (
    SELECT
        RANK() OVER (ORDER BY x) AS rank_x,
        RANK() OVER (ORDER BY y) AS rank_y
    FROM data
)
SELECT CORR(rank_x::NUMERIC, rank_y::NUMERIC) AS spearman_r
FROM ranked;

Интерпретация r

r Strength
0.9-1.0 Very strong
0.7-0.9 Strong
0.5-0.7 Moderate
0.3-0.5 Weak
0.0-0.3 Very weak / none

Negative: same scale, but inverse direction.

R² (= r²) — % variance explained. r = 0.5 → R² = 0.25 (25% explained).

Частые ошибки

Ошибка 1. Correlation = causation. r = 0.9 не значит X → Y. Confounders.

Ошибка 2. Non-linear. r = 0 для perfect parabola. Plot data first.

Ошибка 3. Outliers. Один extreme point может flip sign of r. Winsorize.

Ошибка 4. Small sample. N < 30 — unreliable r. Check CI.

Ошибка 5. Spurious correlations. Random time-series correlate. «X people on Mars and US per-capita cheese consumption» — fun examples.

Связанные темы

FAQ

CORR в other databases?

PostgreSQL — CORR. MySQL — нет, считать manually. ClickHouse — corr. BQ — CORR.

Что значит r = 1?

Perfect positive linear. Almost never в real data unless tautology.

r vs r²?

r — strength + direction. r² — variance explained. R² = r² для linear regression.

Spearman vs Pearson?

Pearson — linear assumption. Spearman — monotonic, rank-based, robust.

CORR(x, x)?

Always 1 (perfect with itself).