Rules of a good A/B test
Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews. SQL, Python, A/B, metrics. Free.
Why this matters
A/B tests are often run poorly. The rules below put you in the top 20% of analysts.
Interviewers ask about best practices. Knowing them is a senior signal.
Do's
1. Pre-register
Hypothesis, metric, MDE, duration — all decided before the test starts.
This prevents p-hacking.
2. Single primary metric
One primary metric. Secondary metrics give context. Guardrail metrics must not break.
3. Calculate sample size
Use the formula based on α, power, MDE, and the baseline rate.
Don't eyeball it.
4. Random assignment
Hash user_id → bucket. The assignment must not rotate between sessions.
5. Monitor guardrails
Set real-time alerts on degradation of critical metrics.
6. Run at least a full week
Cover weekly seasonality patterns.
7. Document
Plan, results, decision — all in writing.
8. Share results
Share both good and bad results. It builds a culture of transparency.
9. Learn from failures
A null result is still a learning.
10. A/A test occasionally
Validate the system. It should show no difference.
Don'ts
1. Don't peek
Don't look at results before the planned end date.
If you must peek — use sequential testing methods.
2. Don't change midway
«Let's add another variant» mid-test breaks the statistics.
3. Don't rely on p < 0.05 alone
Check the effect size. Statistical significance != practical significance.
4. Don't test trivial changes
A button color has tiny potential impact. Spend traffic on meaningful changes.
5. Don't ignore SRM
If the assignment is unbalanced, investigate before analyzing.
6. Don't ship on pre-final data
Even if the trend is positive — wait.
7. Don't forget a holdout
Long-term effects matter. A holdout group helps measure them.
8. Don't run conflicting tests in parallel
Treatment X + Treatment Y running simultaneously → the interaction is not isolated.
9. Don't ignore segments
The aggregate result can hide segment-level dynamics.
10. Don't forget the novelty effect
Shipping based on a week-1 lift overstates the impact.
Technical checks
SRM test
Chi-square on assignment counts:
from scipy.stats import chisquare

# observed counts n_a, n_b vs the expected 50/50 split
stat, p = chisquare([n_a, n_b], f_exp=[(n_a + n_b) / 2] * 2)
if p < 0.01:
    print("SRM issue!")
Outlier check
Was any variant's result dominated by one extreme user?
SELECT variant,
       MAX(metric) AS max_metric,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY metric) AS p99_metric
FROM experiment
GROUP BY variant;
Sanity check
Do the basic counts look reasonable?
Ethical checks
Harm
Could the treatment hurt users?
Never test anything with extreme potential harm. Even a harm-detection test should catch problems fast.
Consent
Ethics boards might require it.
Fairness
Is the effect unequal across demographics?
Monitor it.
Statistical
Type I and Type II errors
Balance them via α and power.
Multiple testing
Testing 10 metrics at α = 0.05 pushes the family-wise false positive rate above 40%.
Use Bonferroni correction or pre-register a single primary metric.
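The arithmetic behind both numbers, as a sketch:

```python
# Family-wise false positive rate for m independent tests at level alpha,
# and the Bonferroni-corrected per-test threshold.
alpha, m = 0.05, 10
family_fpr = 1 - (1 - alpha) ** m   # P(at least one false positive)
alpha_corrected = alpha / m         # Bonferroni: each test at 0.005

print(round(family_fpr, 3))  # ~0.401, i.e. 40%+
print(alpha_corrected)       # 0.005
```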
Effect size
Not just the p-value. Report Cohen's d or the lift in %.
CI
Report the interval estimate. Its width shows precision.
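A sketch of a 95% CI for the difference in conversion rates via the normal approximation; the counts are hypothetical:

```python
# 95% CI for the lift in conversion rate (normal approximation).
from scipy.stats import norm

x_t, n_t = 520, 10_000  # treatment conversions / sample (hypothetical)
x_c, n_c = 480, 10_000  # control
p_t, p_c = x_t / n_t, x_c / n_c
se = (p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c) ** 0.5
z = norm.ppf(0.975)
low, high = (p_t - p_c) - z * se, (p_t - p_c) + z * se

print(f"lift: {p_t - p_c:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")
```

Here the interval crosses zero: the point estimate alone would mislead.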
Design
Proper randomization
The assignment unit must be consistent over time.
Don't switch a user's variant every session.
Clusters
With network effects, randomize by city / cluster.
Stratify
Balance known confounders within strata.
Analysis
Not only the t-test
Pick a test specific to the metric type:
- Conversion: z-test proportions
- Revenue: t-test or bootstrap (log-normal)
- Count: Poisson / Mann-Whitney
- Time-to-event: survival analysis
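For the conversion case above, a minimal statsmodels sketch; the counts are hypothetical:

```python
# Two-proportion z-test for a conversion metric.
from statsmodels.stats.proportion import proportions_ztest

conversions = [520, 480]    # treatment, control (hypothetical)
samples = [10_000, 10_000]
stat, p = proportions_ztest(conversions, samples)

print(round(p, 3))  # two-sided p-value
```

With these numbers the test is not significant, even though the raw lift looks appealing.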
CUPED
Variance reduction using pre-experiment data. Speeds up the test.
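A CUPED sketch on synthetic data; the pre-experiment covariate and its correlation with the metric are made up for illustration:

```python
# CUPED: adjusted_metric = metric - theta * (covariate - mean(covariate)),
# where theta = cov(metric, covariate) / var(covariate).
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(100, 20, 5_000)              # pre-experiment spend (synthetic)
post = pre * 0.8 + rng.normal(0, 10, 5_000)   # in-experiment metric, correlated

theta = np.cov(post, pre)[0, 1] / np.var(pre)
cuped = post - theta * (pre - pre.mean())

print(np.var(cuped) < np.var(post))  # True: variance is reduced
```

The stronger the correlation with the pre-period, the bigger the reduction, and the faster the test reaches significance.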
Regression adjustment
Include covariates in a regression. Similar effect to CUPED.
Ship decision
All positive
Primary metric significant, guardrails ok → ship.
All negative
Don't ship.
Mixed
Do a trade-off analysis. It's a business call.
Partial rollout
If only some segments benefit → ship selectively.
Common mistakes
Gambling
«The trend is positive, ship now» — premature.
Ignoring limitations
Small N, short duration, narrow segment — don't overclaim.
Re-analyzing
«We changed the primary metric» — that's p-hacking.
Ignoring outliers
One customer in the treatment group made a huge purchase → it inflates the mean.
Check distribution.
Reporting
What to include
- Hypothesis
- Setup (duration, size, randomization)
- Results (primary, secondary, guardrails)
- Segments
- Decision
- Learnings
Audience
- Team: detailed
- Exec: summary + decision
Long-term
Holdout
Some users never receive new features. Measure the cumulative impact.
Follow-up
After shipping, monitor the metric long-term. Has the novelty worn off?
Meta-analysis
Look for patterns across experiments. «We tend to ship expecting a 5% impact and get 3%».
Calibrate accordingly.
Tools
Platforms
- Optimizely, VWO (commercial)
- GrowthBook, Flagsmith (open-source)
- Custom (big tech)
Analysis
- SQL + Python for flexibility
- Platforms — easy but limited
Power analysis
- Online calculators
- Python: statsmodels
Team culture
Experimentation ritual
Weekly or bi-weekly review of results.
Learning
Share wins and losses so sub-teams learn from each other.
Continuous improvement
Keep process, templates, and tools evolving.
Advanced
Sequential testing
Lets you peek properly.
Bayesian A/B
Posterior probability of a variant winning, instead of a p-value.
Multi-armed bandits
Allocate traffic dynamically to the best variant.
Causal inference
Observational methods for when an A/B test isn't possible.
In the interview
«What's the golden rule of A/B testing?»
Answer:
- Pre-register
- Sample size
- No peeking
- Monitor guardrails
- Full cycle
Covering several rules shows breadth.
«Common mistakes?»
Peeking, multiple testing, ignoring segments, the novelty effect.
Related topics
- A/B testing in simple terms
- Designing an A/B test step by step
- How to calculate sample size
- CUPED
- Sequential testing
- The peeking problem
- The novelty effect
FAQ
Is a perfect A/B test possible?
Rarely. Balance rigor against speed.
What if the sample is small?
A small sample only detects large effects; catching a smaller MDE means a longer duration.
New to A/B testing?
Start with simple tests. Learn the infrastructure. Ramp up to complex designs.
Practice — open the trainer with 1500+ interview questions.