Как посчитать R² в SQL
Содержание:
Зачем R²
R² (coefficient of determination) = % variance в target explained by model. Range 0-1 (sometimes negative для very bad). Простой single-number summary regression quality.
Формула
R² = 1 - (SSres / SStot)
SSres = Σ(actual - predicted)²
SStot = Σ(actual - mean(actual))²R² = 0.8 → model explains 80% of variance. Remaining 20% — noise / missing features.
Базовый расчёт
WITH stats AS (
SELECT
actual_value,
predicted_value,
AVG(actual_value) OVER () AS mean_actual
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
ss AS (
SELECT
SUM(POWER(actual_value - predicted_value, 2)) AS ss_res,
SUM(POWER(actual_value - mean_actual, 2)) AS ss_tot
FROM stats
)
SELECT
ss_res,
ss_tot,
1 - (ss_res::NUMERIC / NULLIF(ss_tot, 0)) AS r_squared
FROM ss;Интерпретация
| R² | Quality |
|---|---|
| 1.0 | Perfect fit (suspicious) |
| 0.9-0.99 | Excellent |
| 0.7-0.9 | Good |
| 0.5-0.7 | Moderate |
| 0.3-0.5 | Weak |
| < 0.3 | Poor |
| < 0 | Worse than mean (broken model) |
Adjusted R²
Penalizes adding more features:
Adjusted R² = 1 - (1 - R²) × (n-1) / (n-k-1)n = sample size, k = features count.
WITH calc AS (
SELECT
1 - SUM(POWER(actual_value - predicted_value, 2))::NUMERIC
/ SUM(POWER(actual_value - mean_actual, 2)) AS r2,
COUNT(*) AS n,
5 AS k -- number of features
FROM (
SELECT
actual_value,
predicted_value,
AVG(actual_value) OVER () AS mean_actual
FROM predictions
) t
)
SELECT
r2,
n,
k,
1 - (1 - r2) * (n - 1)::NUMERIC / (n - k - 1) AS adjusted_r2
FROM calc;По segments
WITH segment_stats AS (
SELECT
segment,
actual_value,
predicted_value,
AVG(actual_value) OVER (PARTITION BY segment) AS mean_actual
FROM predictions
JOIN entities USING (entity_id)
)
SELECT
segment,
COUNT(*) AS n,
1 - SUM(POWER(actual_value - predicted_value, 2))::NUMERIC
/ NULLIF(SUM(POWER(actual_value - mean_actual, 2)), 0) AS r_squared
FROM segment_stats
GROUP BY segment
ORDER BY r_squared DESC;R² vs Correlation
-- For simple linear regression: R² = correlation²
SELECT
CORR(actual_value, predicted_value) AS r,
POWER(CORR(actual_value, predicted_value), 2) AS r_squared
FROM predictions;Для multiple regression: R² ≠ correlation².
Частые ошибки
Ошибка 1. R² = 1 too good. Likely overfitting / data leakage / target leak in features.
Ошибка 2. Negative R². Model worse than mean. Broken — debug.
Ошибка 3. R² compared across models with different N. Use Adjusted R² для comparison.
Ошибка 4. R² for classification. R² for regression. Classification → AUC / F1.
Ошибка 5. Train R² high, Test R² low. Overfitting. Reduce model complexity или more data.
Связанные темы
- Как посчитать RMSE в SQL
- Как посчитать MAE в SQL
- Linear regression для аналитика
- Как посчитать correlation в SQL
FAQ
R² range?
0 to 1 для linear regression. Negative для models worse than mean (very bad).
R² high — всегда good?
Not necessarily. Could be overfitting. Test on holdout.
R² vs Adjusted R²?
R² inflates с features. Adjusted R² penalizes — compare models fairly.
R² для time series?
Less meaningful (lots of patterns). Use MAPE / RMSE.
R² 0.95 in business — suspicious?
Yes — verify target leakage. Real-world R² rarely 0.9+.