Как посчитать R² в SQL

Закрепи формулу r2 в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать r2 в Telegram

Зачем R²

R² (coefficient of determination) = % variance в target explained by model. Range 0-1 (sometimes negative для very bad). Простой single-number summary regression quality.

Формула

R² = 1 - (SSres / SStot)
SSres = Σ(actual - predicted)²
SStot = Σ(actual - mean(actual))²

R² = 0.8 → model explains 80% of variance. Remaining 20% — noise / missing features.

Базовый расчёт

WITH stats AS (
    SELECT
        actual_value,
        predicted_value,
        AVG(actual_value) OVER () AS mean_actual
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
ss AS (
    SELECT
        SUM(POWER(actual_value - predicted_value, 2)) AS ss_res,
        SUM(POWER(actual_value - mean_actual, 2)) AS ss_tot
    FROM stats
)
SELECT
    ss_res,
    ss_tot,
    1 - (ss_res::NUMERIC / NULLIF(ss_tot, 0)) AS r_squared
FROM ss;

Интерпретация

Quality
1.0 Perfect fit (suspicious)
0.9-0.99 Excellent
0.7-0.9 Good
0.5-0.7 Moderate
0.3-0.5 Weak
< 0.3 Poor
< 0 Worse than mean (broken model)

Adjusted R²

Penalizes adding more features:

Adjusted R² = 1 - (1 - R²) × (n-1) / (n-k-1)

n = sample size, k = features count.

WITH calc AS (
    SELECT
        1 - SUM(POWER(actual_value - predicted_value, 2))::NUMERIC
            / SUM(POWER(actual_value - mean_actual, 2)) AS r2,
        COUNT(*) AS n,
        5 AS k  -- number of features
    FROM (
        SELECT
            actual_value,
            predicted_value,
            AVG(actual_value) OVER () AS mean_actual
        FROM predictions
    ) t
)
SELECT
    r2,
    n,
    k,
    1 - (1 - r2) * (n - 1)::NUMERIC / (n - k - 1) AS adjusted_r2
FROM calc;
Закрепи формулу r2 в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать r2 в Telegram

По segments

WITH segment_stats AS (
    SELECT
        segment,
        actual_value,
        predicted_value,
        AVG(actual_value) OVER (PARTITION BY segment) AS mean_actual
    FROM predictions
    JOIN entities USING (entity_id)
)
SELECT
    segment,
    COUNT(*) AS n,
    1 - SUM(POWER(actual_value - predicted_value, 2))::NUMERIC
        / NULLIF(SUM(POWER(actual_value - mean_actual, 2)), 0) AS r_squared
FROM segment_stats
GROUP BY segment
ORDER BY r_squared DESC;

R² vs Correlation

-- For simple linear regression: R² = correlation²
SELECT
    CORR(actual_value, predicted_value) AS r,
    POWER(CORR(actual_value, predicted_value), 2) AS r_squared
FROM predictions;

Для multiple regression: R² ≠ correlation².

Частые ошибки

Ошибка 1. R² = 1 too good. Likely overfitting / data leakage / target leak in features.

Ошибка 2. Negative R². Model worse than mean. Broken — debug.

Ошибка 3. R² compared across models with different N. Use Adjusted R² для comparison.

Ошибка 4. R² for classification. R² for regression. Classification → AUC / F1.

Ошибка 5. Train R² high, Test R² low. Overfitting. Reduce model complexity или more data.

Связанные темы

FAQ

R² range?

0 to 1 для linear regression. Negative для models worse than mean (very bad).

R² high — всегда good?

Not necessarily. Could be overfitting. Test on holdout.

R² vs Adjusted R²?

R² inflates с features. Adjusted R² penalizes — compare models fairly.

R² для time series?

Less meaningful (lots of patterns). Use MAPE / RMSE.

R² 0.95 in business — suspicious?

Yes — verify target leakage. Real-world R² rarely 0.9+.