A/B testing cheat sheet
Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews: SQL, Python, A/B, metrics. Free.
Why this matters
A/B tests are an analyst's daily bread, and interviewers ask about them regularly. This cheat sheet is a quick reference for key concepts, formulas, and pitfalls.
Fundamentals
Hypotheses
H0 (null): no effect. H1 (alternative): there is an effect.
An A/B test tries to reject H0.
P-value
The probability of seeing data at least as extreme as the observed data, assuming H0 is true.
A small p (< 0.05) means rejecting H0 and calling the result "significant".
Alpha (α)
Significance threshold. Typically 0.05.
It equals the false positive rate when H0 is true.
Beta (β)
False negative rate. Typically 0.20 (power = 1 - β = 0.80).
Key metrics
Primary
The main metric that drives the decision.
Secondary
Additional context.
Guardrail
Metrics that must not get worse (crashes, churn).
Sample size
Formula (proportions)
n = 2 × (z_α/2 + z_β)² × p̄ × (1 - p̄) / (p1 - p2)²
Here n is the sample size per group and p̄ = (p1 + p2) / 2. For α = 0.05, power = 0.80:
- z_α/2 = 1.96
- z_β = 0.84
MDE relationship
n ∝ 1 / MDE²
Halving the MDE multiplies the required N by 4.
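A minimal sketch of the proportions formula above, using only the standard library. The function name and the 10% baseline are illustrative assumptions, not part of the original cheat sheet.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return ceil(n)


# Hypothetical baseline conversion of 10%.
n_mde_2pp = sample_size_per_group(0.10, 0.12)  # MDE = 2 percentage points
n_mde_1pp = sample_size_per_group(0.10, 0.11)  # MDE = 1 percentage point
ratio = n_mde_1pp / n_mde_2pp
print(n_mde_2pp, n_mde_1pp, round(ratio, 2))
```

The ratio comes out close to 4 rather than exactly 4, because p̄ also shifts slightly when p2 changes.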
Ladder of analyses
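T-test
Compares the means of two groups (continuous metric).
from scipy.stats import ttest_ind
# a, b: per-user metric values for control and treatment
# (passing equal_var=False would switch to Welch's t-test)
t, p = ttest_ind(a, b)
Z-test for proportions
Compares conversion rates.
from statsmodels.stats.proportion import proportions_ztest
# conv_*: number of conversions, n_*: number of users in each group
z, p = proportions_ztest([conv_a, conv_b], [n_a, n_b])
Chi-square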
Categorical outcomes; tests whether the outcome is independent of the variant.
Mann-Whitney U
Non-parametric; suited to skewed data.
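To show what the z-test for proportions computes under the hood, here is a hedged standard-library sketch with a pooled standard error. The function name and the counts are hypothetical; the sign convention here is B minus A, which may differ from a library's.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equal conversion rates, using a pooled SE."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 500/5000 conversions in A, 570/5000 in B.
z, p = two_proportion_ztest(500, 5000, 570, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```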
Pitfalls
Peeking
Checking results repeatedly and stopping at the first significant one inflates the false positive rate (FPR).
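The inflation can be seen in an A/A simulation, where both groups come from the same distribution and any "significant" result is a false positive. This is a sketch under simplifying assumptions (normal data with known σ = 1, ten equally spaced looks); all names and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05


def simulate_aa(n_sims=1000, n_per_group=500, n_looks=10, seed=0):
    """Return (FPR with one final look, FPR when any interim look counts)."""
    rng = random.Random(seed)
    step = n_per_group // n_looks
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        significant_at_any_look = False
        z = 0.0
        for i in range(1, n_per_group + 1):
            sum_a += rng.gauss(0, 1)  # both groups share the same distribution
            sum_b += rng.gauss(0, 1)
            if i % step == 0:
                # z-statistic for a difference in means with known sigma = 1
                z = (sum_b / i - sum_a / i) / sqrt(2 / i)
                if abs(z) > Z_CRIT:
                    significant_at_any_look = True
        fixed_hits += abs(z) > Z_CRIT  # z now holds the final look
        peeking_hits += significant_at_any_look
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_fpr, peeking_fpr = simulate_aa()
print(f"single look: {fixed_fpr:.3f}, peeking at 10 looks: {peeking_fpr:.3f}")
```

The single-look rate should land near the nominal 0.05, while the peeking rate is noticeably higher.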
Multiple testing
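With 10 metrics tested at α = 0.05, the chance of at least one false positive rises to 1 - (1 - α)^10 ≈ 0.40 (assuming independent metrics). Bonferroni: test each metric at α/k.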
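A short sketch of the family-wise error rate and the Bonferroni threshold, assuming independent tests. The function names and the p-values are made up for illustration.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** k


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni threshold alpha / k."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


fwer_uncorrected = familywise_error_rate(0.05, 10)
fwer_bonferroni = familywise_error_rate(0.05 / 10, 10)
# Hypothetical p-values for 4 metrics; the threshold becomes 0.05 / 4 = 0.0125.
flags = bonferroni_significant([0.003, 0.02, 0.04, 0.30])
print(round(fwer_uncorrected, 3), round(fwer_bonferroni, 3), flags)
```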
Novelty effect
A temporary boost because the feature is new.
Primacy effect
Users temporarily resist the change.
Sample ratio mismatch (SRM)
Expected 50/50, got 49/51: is assignment biased? Check with a chi-square goodness-of-fit test.
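Whether 49/51 signals a problem depends on the traffic volume. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom using only the standard library; the function name and user counts are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# The same 49/51 split at two hypothetical traffic levels.
p_small = srm_p_value(490, 510)        # 1,000 users
p_large = srm_p_value(49_000, 51_000)  # 100,000 users
print(f"{p_small:.3f} {p_large:.2e}")
```

At 1,000 users the split is well within chance; at 100,000 users the same ratio is strong evidence of a mismatch.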
Network effects
Treatment affects control (social features, marketplaces).
Use a cluster-randomized test.
Statistical vs practical significance
- p < 0.05: statistical significance
- Effect size: practical significance
With a large sample, even trivial effects become statistically significant.
Always check both.
CI for the difference
CI = diff ± z × SE
where SE depends on the metric type.
For proportions:
SE = √(p1(1-p1)/n1 + p2(1-p2)/n2)
95% CI: diff ± 1.96 × SE.
If the interval excludes 0, the difference is significant.
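A sketch of the interval for a difference in proportions, using the unpooled SE given above. The function name and the counts are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """CI for p_b - p_a using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


lower, upper = diff_ci(500, 5000, 570, 5000)  # hypothetical counts
significant = not (lower <= 0 <= upper)
print(f"95% CI: [{lower:.4f}, {upper:.4f}], significant: {significant}")
```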
Effect size
Cohen's d
d = (mean1 - mean2) / pooled_std
- 0.2: small
- 0.5: medium
- 0.8: large
Lift
Relative: (treatment - control) / control × 100%.
Easy for the business to interpret.
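Both effect-size measures can be sketched in a few lines. The pooled standard deviation below uses sample variances, which is one common convention; the data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation built from sample variances."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def relative_lift(treatment_value, control_value):
    """Relative lift in percent."""
    return (treatment_value - control_value) / control_value * 100


# Hypothetical per-user values
treatment = [12.0, 15.0, 11.0, 14.0, 13.0]
control = [10.0, 13.0, 9.0, 12.0, 11.0]
d = cohens_d(treatment, control)
lift = relative_lift(mean(treatment), mean(control))
print(round(d, 2), round(lift, 1))
```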
CUPED
Variance reduction:
Y_cuped = Y - θ × (X - mean(X))
where X is a pre-experiment covariate and θ = Cov(Y,X) / Var(X).
Reduces variance, so a smaller sample size is needed.
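The adjustment can be illustrated on synthetic data where the in-experiment metric is correlated with the pre-period metric. This is a sketch, not a production implementation: the data-generating numbers are made up, and in a real test θ is typically estimated on the pooled data from both groups.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return Y_cuped = Y - theta * (X - mean(X)) with theta = Cov(Y, X) / Var(X)."""
    mx, my = mean(x), mean(y)
    cov_yx = sum((yi - my) * (xi - mx) for yi, xi in zip(y, x)) / (len(x) - 1)
    theta = cov_yx / variance(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]


# Synthetic data: y (in-experiment) depends on x (pre-experiment) plus noise.
rng = random.Random(42)
x = [rng.gauss(100, 20) for _ in range(2000)]
y = [0.8 * xi + rng.gauss(0, 10) for xi in x]

y_cuped = cuped_adjust(y, x)
print(f"mean: {mean(y):.2f} -> {mean(y_cuped):.2f}")
print(f"variance: {variance(y):.1f} -> {variance(y_cuped):.1f}")
```

The mean is unchanged because the correction term averages to zero, while the variance drops by roughly the share explained by the covariate.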
Randomization
Simple
Assign each user at random, 50/50.
Stratified
Randomize within strata (platform, country) to keep the groups balanced.
Cluster
Randomize by city or cohort rather than by individual user.
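One possible way to implement stratified assignment is to shuffle users inside each stratum and alternate variants. This is only a sketch: the `(user_id, stratum)` input shape, the function name, and the platform labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_assign(users, seed=0):
    """Shuffle users within each stratum and alternate A/B so each stratum splits about 50/50.

    `users` is a list of (user_id, stratum) pairs (a hypothetical input format).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    assignment = {}
    for ids in by_stratum.values():
        rng.shuffle(ids)
        for i, user_id in enumerate(ids):
            assignment[user_id] = "A" if i % 2 == 0 else "B"
    return assignment


users = [(i, "ios" if i % 3 == 0 else "android") for i in range(1000)]
assignment = stratified_assign(users)
ios_total = sum(1 for _, s in users if s == "ios")
ios_a = sum(1 for uid, s in users if s == "ios" and assignment[uid] == "A")
print(ios_a, ios_total)
```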
Choosing the duration
Min: 1-2 weeks.
Ideal: run until the pre-calculated N is reached. Stopping as soon as significance is reached is peeking, unless a sequential method is used.
Consider:
- Full weeks (weekly patterns)
- Novelty wearing off
- Seasonality
Results analysis
Confirm
Primary metric significant?
Guardrails
Not degraded?
Segments
Heterogeneity? Does it work for all segments?
Duration
Ran full planned period?
Decision
Ship / not ship / iterate.
Decision framework
- Significant + positive + guardrails ok: ship
- Significant + negative: don't ship
- Not significant: don't ship, or iterate
- Mixed segments: consider partial ship
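The framework above can be written as a small function. This is a sketch with assumptions: the argument names are hypothetical, the order in which the rules are checked is a choice made here, and the case "significant, positive, guardrails degraded" is not covered by the list, so the fallback is labeled as an assumption.

```python
def decide(p_value, effect, guardrails_ok, mixed_segments=False, alpha=0.05):
    """Map test results to a decision.

    `effect` is treatment minus control on the primary metric, where higher is better.
    """
    if p_value >= alpha:
        return "don't ship or iterate"
    if effect < 0:
        return "don't ship"
    if mixed_segments:
        return "consider partial ship"
    if guardrails_ok:
        return "ship"
    # Assumption: the list above does not say what to do when guardrails degrade.
    return "iterate"


print(decide(p_value=0.01, effect=0.02, guardrails_ok=True))
print(decide(p_value=0.20, effect=0.02, guardrails_ok=True))
```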
SQL for A/B
Basic
-- converted is assumed to be stored as 0/1
SELECT variant, AVG(converted) AS conversion_rate, COUNT(*) AS n
FROM experiment
GROUP BY variant;
CI
SELECT
    variant,
    AVG(metric) AS mean,
    STDDEV(metric) AS sd,
    COUNT(*) AS n,
    AVG(metric) - 1.96 * STDDEV(metric) / SQRT(COUNT(*)) AS ci_lower,
    AVG(metric) + 1.96 * STDDEV(metric) / SQRT(COUNT(*)) AS ci_upper
FROM experiment
GROUP BY variant;
Bayesian A/B
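An alternative to frequentist testing:
- Posterior ∝ prior × likelihood
- P(B > A | data) can be interpreted directly
- Peeking does less harm to the interpretation of the posterior, though stopping at the first favorable result can still inflate false positives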
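For conversion rates, P(B > A | data) can be estimated by sampling from Beta posteriors. The sketch assumes uniform Beta(1, 1) priors and uses hypothetical counts; the function name is illustrative.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Posterior for a conversion rate: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / n_draws


p_b_better = prob_b_beats_a(500, 5000, 570, 5000)  # hypothetical counts
print(f"P(B > A | data) = {p_b_better:.3f}")
```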
Variance reduction techniques
- CUPED
- Stratification
- Regression adjustment
- Variance-weighted estimator
All of them reduce the N needed for a given detectable effect.
Sequential testing
Proper methods for peeking:
- AVI (Always Valid Inference)
- mSPRT
- Alpha spending
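As an illustration of alpha spending, the sketch below evaluates two widely used spending functions (O'Brien-Fleming-type and Pocock-type) at four equally spaced looks. It shows only the cumulative α budget at each look; converting that budget into critical values needs the joint distribution of the test statistics across looks, which is outside this sketch. The look schedule and function names are assumptions.

```python
from math import e, log, sqrt
from statistics import NormalDist

_norm = NormalDist()


def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha spent at information fraction t."""
    z = _norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _norm.cdf(z / sqrt(t)))


def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha spent at information fraction t."""
    return alpha * log(1 + (e - 1) * t)


looks = [0.25, 0.5, 0.75, 1.0]  # hypothetical: four equally spaced looks
obf = [obf_spending(t) for t in looks]
pocock = [pocock_spending(t) for t in looks]
for t, a_obf, a_poc in zip(looks, obf, pocock):
    print(f"t = {t:.2f}  OBF: {a_obf:.5f}  Pocock: {a_poc:.5f}")
```

Both functions spend the full α by the last look; the O'Brien-Fleming type spends very little early on, while the Pocock type spreads it more evenly.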
Common errors
1. Tiny sample
100 users and a "significant 10% lift" is probably noise.
2. Cherry-picking
Test 10 metrics and one may come out significant by chance. Adjust α.
3. Wrong assumptions
Assuming normality on log-normal data. The t-test can be unreliable on heavily skewed data, especially at small N.
4. Ignoring context
"Ran the test, got a result, shipped" misses cultural and seasonal context.
Interview questions
"T-test vs Z-test?" Z for large N or known σ. T for small N with unknown σ.
"How do you choose N?" Based on MDE, α, power, and baseline variance.
"What is power?" 1 - β: the probability of detecting a true effect.
"CUPED?" Variance reduction using a pre-experiment covariate.
"How do you handle network effects?" Cluster-randomized experiments.
FAQ
What is the default α?
0.05 by convention. Stricter: 0.01 for critical decisions.
One-tailed vs two-tailed?
Two-tailed is the safer default. One-tailed only with a strong prior about the direction of the effect.
Bootstrap or analytical?
Analytical is fast. Bootstrap is flexible but heavier on compute.
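For reference, a percentile bootstrap interval for a difference in means might look like the sketch below. The data are synthetic log-normal draws, and the function name, resample count, and seeds are illustrative assumptions.

```python
import random


def bootstrap_diff_ci(a, b, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_b) / len(b) - sum(resample_a) / len(a))
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[round(tail * n_boot)]
    upper = diffs[round((1 - tail) * n_boot) - 1]
    return lower, upper


# Synthetic skewed (log-normal) data with a shift between groups.
data_rng = random.Random(1)
a = [data_rng.lognormvariate(0, 1) for _ in range(500)]
b = [data_rng.lognormvariate(0.3, 1) for _ in range(500)]
observed_diff = sum(b) / len(b) - sum(a) / len(a)
lower, upper = bootstrap_diff_ci(a, b)
print(f"observed diff: {observed_diff:.3f}, 95% bootstrap CI: [{lower:.3f}, {upper:.3f}]")
```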