Как посчитать Skewness в SQL
Содержание:
Зачем Skewness
Skewness measures асимметрию distribution. Right skew (positive) — long tail справа (revenue, salaries). Left skew (negative) — long tail слева (test scores близко к max). Skewness = 0 — symmetric (normal).
Формула
Sample skewness:
γ = (1/n) × Σ((xi - x̄) / s)³Pearson's median skewness (simpler):
γ = 3 × (mean - median) / sdPearson skewness (простая)
WITH stats AS (
SELECT
AVG(amount) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
STDDEV(amount) AS sd
FROM transactions
WHERE status = 'paid'
)
SELECT
mean,
median,
sd,
3.0 * (mean - median) / NULLIF(sd, 0) AS pearson_skew
FROM stats;Sample skewness
WITH stats AS (
SELECT
AVG(amount) AS mean,
STDDEV(amount) AS sd,
COUNT(*) AS n
FROM transactions
WHERE status = 'paid'
),
moments AS (
SELECT
AVG(POWER((amount - s.mean) / NULLIF(s.sd, 0), 3)) AS skew
FROM transactions t, stats s
WHERE t.status = 'paid'
)
SELECT skew FROM moments;Интерпретация
| Skew | Interpretation |
|---|---|
| 0 | Symmetric (normal) |
| > 0 | Right-skewed (long right tail) |
| < 0 | Left-skewed (long left tail) |
| > 1 | Highly skewed |
| > 2 | Extremely skewed |
-- Decision rule
SELECT
mean,
median,
3.0 * (mean - median) / NULLIF(sd, 0) AS skew,
CASE
WHEN ABS(3.0 * (mean - median) / NULLIF(sd, 0)) < 0.5 THEN 'symmetric'
WHEN 3.0 * (mean - median) / NULLIF(sd, 0) > 1 THEN 'heavily RIGHT-skewed (whales)'
WHEN 3.0 * (mean - median) / NULLIF(sd, 0) > 0.5 THEN 'RIGHT-skewed'
WHEN 3.0 * (mean - median) / NULLIF(sd, 0) < -1 THEN 'heavily LEFT-skewed'
ELSE 'LEFT-skewed'
END AS shape
FROM stats;По группам
SELECT
country,
AVG(amount) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
STDDEV(amount) AS sd,
3.0 * (AVG(amount) - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)) / NULLIF(STDDEV(amount), 0) AS skew
FROM transactions
JOIN users USING (user_id)
WHERE status = 'paid'
GROUP BY country
ORDER BY skew DESC;Why care?
- Mean vs Median: Right-skew → mean > median. Report median.
- A/B tests: Heavy skew → use non-parametric / bootstrap, not t-test.
- Outliers: Skew indicates outliers presence.
- Transformations: log(x) often normalize right-skewed data.
Частые ошибки
Ошибка 1. Skew = 0 = normal. Skew = 0 — symmetric. Не гарантирует normality (kurtosis тоже).
Ошибка 2. Sample size matters. N < 50 — skew estimate unreliable.
Ошибка 3. Pearson vs sample formula. Different formulas. Pearson — simpler approximation.
Ошибка 4. Log-transform на 0/negative. Log requires > 0. Add constant первый: log(x + 1).
Ошибка 5. SD undefined. Pearson skew = (mean - median) / sd. Sd = 0 → undefined.
Связанные темы
- Как посчитать standard deviation в SQL
- Как посчитать медиану в SQL
- Как посчитать IQR в SQL
- Медиана vs среднее
FAQ
Skewness vs Kurtosis?
Skew — asymmetry. Kurtosis — tail heaviness (peaked vs flat).
Negative skew когда?
Test scores где max каpped (100% — test). Tail слева.
Skew > 1 — что значит?
Heavy right tail. Use median вместо mean.
Log transform?
Right-skewed (positive only) → log сглаживает. Normality test после.
Bootstrap для skewed?
Yes. Non-parametric, не assumes normality. CI на median through bootstrap.