Why A/B tests fail
Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews: SQL, Python, A/B testing, metrics. Free.
Why this matters
Most A/B tests — around 80% by common industry estimates — come back null or fail. Knowing why helps you avoid the usual traps and raise your success rate.
In interviews, questions about A/B failures are a signal of real experience.
The base-rate problem
Reality
Most features don't produce a significant impact; a true null is the common case.
A "ship everything that looks positive" strategy mostly ships noise — that's selection bias, not evidence of impact.
Solutions
- Prioritize tests by effort vs. expected reward
- Accept that wins are rare and let many small wins accumulate
Technical failures
1. Underpowered test
The sample is too small to detect the true effect.
Fix: calculate the required sample size in advance and run the test for its full planned duration.
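The up-front calculation can be done with the standard two-proportion formula; a minimal pure-Python sketch (the baseline rate and MDE below are hypothetical, and the z-values are fixed for alpha = 0.05, power = 0.80):

```python
from math import ceil

def sample_size_per_group(p_base, mde_abs):
    """Approximate per-group N for a two-proportion z-test.
    z-values hardcoded for alpha = 0.05 (two-sided) and power = 0.80."""
    z_alpha, z_beta = 1.96, 0.84
    p_avg = p_base + mde_abs / 2          # average rate under H1
    var = 2 * p_avg * (1 - p_avg)         # variance of the difference
    return ceil(var * (z_alpha + z_beta) ** 2 / mde_abs ** 2)

# Example: 10% baseline conversion, want to detect a +1 p.p. lift
n = sample_size_per_group(0.10, 0.01)    # ~14.7k users per group
```

For production use, a power-analysis library (e.g. statsmodels) gives exact values; the point is that N is decided before the test starts, not after.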
2. Peeking
Checking results midway and stopping early inflates the false positive rate.
Fix: pre-register the duration, or use sequential testing methods.
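A quick way to feel the damage: simulate an A/A test where you look at the data five times and stop at the first "significant" reading. The nominal alpha is 5%, but the realized false positive rate lands far higher (stdlib-only sketch, all parameters illustrative):

```python
import random

random.seed(0)

def peeking_fpr(n_sims=2000, n_per_check=100, n_checks=5, z_crit=1.96):
    """A/A simulation: reject if |z| > z_crit at ANY interim check.
    The true effect is zero, so every rejection is a false positive."""
    false_pos = 0
    for _ in range(n_sims):
        a_sum = b_sum = 0.0
        n = 0
        for _ in range(n_checks):
            for _ in range(n_per_check):
                a_sum += random.gauss(0, 1)
                b_sum += random.gauss(0, 1)
            n += n_per_check
            # known unit variance -> se of the mean difference is sqrt(2/n)
            z = (a_sum / n - b_sum / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                false_pos += 1
                break
    return false_pos / n_sims

fpr = peeking_fpr()   # well above the nominal 0.05
```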
3. Sample ratio mismatch (SRM)
Group sizes don't match the planned split — usually a silent assignment bug.
Fix: run a chi-square test on the observed ratio; exclude the experiment if it fails.
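The ratio check itself is a small chi-square goodness-of-fit test (df = 1 for two groups); a stdlib sketch with made-up counts, using the strict 0.001-level critical value that SRM checks typically apply:

```python
def srm_check(n_control, n_treatment, expected_share=0.5):
    """Chi-square goodness-of-fit against the planned split (df = 1).
    10.83 is the critical value at alpha = 0.001."""
    total = n_control + n_treatment
    exp_c = total * expected_share
    exp_t = total - exp_c
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    return chi2, chi2 > 10.83

# 50/50 split planned, but treatment got 1,200 extra users
chi2, srm_suspected = srm_check(50_000, 51_200)   # SRM flagged
```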
4. Multiple testing
With 10 metrics, one will come out "significant" by chance.
Fix: pre-define a single primary metric and apply a correction when testing several.
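The simplest correction is Bonferroni: divide alpha by the number of metrics. A sketch with hypothetical p-values — note how the single nominally "significant" p = 0.03 stops being significant once ten metrics are tested:

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant against alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# 10 metrics: one looks "significant" at the naive 0.05 threshold
pvals = [0.03, 0.40, 0.75, 0.22, 0.61, 0.08, 0.90, 0.55, 0.12, 0.33]
flags = bonferroni(pvals)   # nothing survives 0.05 / 10 = 0.005
```

Benjamini-Hochberg is the usual less conservative alternative when many metrics genuinely matter.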
5. Leakage
The treatment reaches the control group through bugs or shared state.
Fix: audit the assignment logic carefully and keep the groups isolated.
6. Novelty effect
Metrics jump short-term, then fade to no effect.
Fix: run longer and keep a holdout group.
7. Wrong metric
"We shipped and revenue is up" — while users are uninstalling.
Fix: one primary metric plus guardrails plus long-term tracking.
8. Selection bias
Testing only on engaged users — the results don't generalize.
Fix: randomize properly over the whole eligible population.
9. Network effects
The treatment spills over into control (social products).
Fix: cluster-randomized designs.
10. Outliers
One whale makes a huge purchase and inflates the mean.
Fix: inspect the distribution; use the median or clip extreme values.
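Clipping at a high quantile is a common robustness trick; a stdlib sketch with a toy revenue sample (one whale among 99 ordinary users — the naive order-statistic quantile here is a simplification):

```python
import statistics

def clipped_mean(values, pct=0.99):
    """Cap values at the pct-quantile before averaging, so a single
    whale cannot dominate the mean."""
    ordered = sorted(values)
    cap = ordered[int(pct * (len(ordered) - 1))]
    return statistics.fmean(min(v, cap) for v in values)

revenue = [10] * 99 + [100_000]            # one whale
raw = statistics.fmean(revenue)            # 1009.9 — dominated by the whale
robust = clipped_mean(revenue)             # 10.0
```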
Product failures
1. The feature doesn't address a real problem
The solution was built without validating the problem.
Fix: do user research first.
2. Poor implementation
A buggy feature hurts instead of helping.
Fix: QA before the experiment.
3. Wrong segment tested
Power users and casual users react differently.
Fix: segment analysis.
4. Conflicts with other features
Simultaneous experiments interfere with each other.
Fix: track the whole portfolio of running tests.
5. Wrong timing
Holidays, sales, external events distort behavior.
Fix: choose stable periods.
6. Long-term vs. short-term
Shipping a short-term win that ends up hurting retention.
Fix: keep a holdout group and track long-term.
Interpretation failures
1. "p < 0.05 means ship"
The p-value is misread: an effect can be statistically significant yet too small to matter for the business.
Fix: always check the effect size as well.
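Concretely: report the confidence interval for the lift, not just the p-value. A normal-approximation sketch with hypothetical numbers — the result is significant, yet the lift is only ~0.1 p.p.:

```python
def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Point estimate and 95% CI for the difference of two proportions
    (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    d = p_b - p_a
    return d, (d - z * se, d + z * se)

# Huge sample: the CI excludes zero, but the lift is ~0.1 p.p.
d, (lo, hi) = diff_ci(100_000, 1_000_000, 101_000, 1_000_000)
```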
2. "Significant means causal"
"The feature caused X" — maybe the result is confounded.
Fix: randomization is what licenses the causal claim — verify it was implemented properly.
3. Ignoring segments
Average +5%, but -20% for premium users.
Fix: segment analysis.
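The arithmetic of how an aggregate can hide a regression (all numbers hypothetical):

```python
# Segment-level lifts: 90k casual users at +8%, 10k premium at -20%
segments = {
    "casual":  {"users": 90_000, "lift": 0.08},
    "premium": {"users": 10_000, "lift": -0.20},
}
total_users = sum(s["users"] for s in segments.values())
avg_lift = sum(s["users"] * s["lift"] for s in segments.values()) / total_users
# avg_lift comes out around +5.2%, while premium users are badly hurt
```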
4. Over-generalizing
"Users like the feature, so everyone wants it."
It may hold only for a specific segment.
5. Cherry-picking
"Five metrics look fine, one is bad — ignore it."
Fix: weigh all the metrics together.
Process failures
1. No hypothesis
"Let's just test it" — no prediction stated.
The result is hard to interpret.
2. Unclear success criteria
"If the lift is positive" is too vague.
Fix: pre-commit to thresholds.
3. Post-hoc rationalization
The story changes after the results come in.
Fix: pre-register.
4. Politics
"The PM wants to ship" — and the analyst finds significance.
Fix: analyst independence, separated from PM incentives.
5. No learnings
Win or lose, there is no team review.
Fix: a post-mortem for every test.
Scenario-specific failures
A/A test
It should come back null, but shows "significant" — a bug in the system.
Fix: investigate; validate the platform periodically.
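Periodic validation can be as simple as re-running many A/A comparisons and checking that the rejection rate sits near the nominal alpha; a stdlib sketch:

```python
import random

random.seed(1)

def aa_rejection_rate(n_sims=4000, n=200, z_crit=1.96):
    """On a healthy platform, repeated A/A tests should reject ~5% of
    the time. A rate far above that points at broken assignment or stats."""
    rejects = 0
    for _ in range(n_sims):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        z = (sum(a) / n - sum(b) / n) / (2 / n) ** 0.5
        if abs(z) > z_crit:
            rejects += 1
    return rejects / n_sims

rate = aa_rejection_rate()   # should land near 0.05
```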
Holiday shift
Retention normally drops over Christmas. "The treatment is down" — except it isn't.
Fix: adjust for seasonality, use year-over-year comparisons.
Ramp-up
A gradual rollout (1% → 10% → 50%) exposes a different user mix at each phase.
Fix: wait until the rollout reaches full scale and stabilizes.
Changing populations
New users flood in (a marketing spike) and the audience composition shifts.
Fix: segment new vs. returning users.
Long tails
A small effect on 1% of users, but huge per user.
Fix: analyze the distribution, not just the mean.
Recovering after a failure
1. Understand why
Run a post-mortem and write it up.
2. Improve the process
Document the learning; update the templates.
3. Communicate to the team
Transparency — the whole team learns.
4. Retry carefully
After the fix, try again in a structured way.
When not to run an A/B test
1. Very small expected effect
Statistical significance is practically unattainable.
2. Very little traffic
You cannot detect any reasonable MDE.
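You can check this before committing: invert the power formula to get the minimum detectable effect for the traffic you actually have (hypothetical numbers; z-values fixed at alpha = 0.05, power = 0.80):

```python
def mde_abs(p_base, n_per_group, z_alpha=1.96, z_beta=0.84):
    """Smallest absolute lift detectable with the given per-group N
    (two-proportion z-test, normal approximation)."""
    var = 2 * p_base * (1 - p_base)
    return (z_alpha + z_beta) * (var / n_per_group) ** 0.5

# 2,000 users per group at a 5% baseline conversion:
m = mde_abs(0.05, 2000)   # ~1.9 p.p. — nearly a 40% relative lift
```

If the only detectable effect is one you would never realistically achieve, the test is not worth running.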
3. Obvious or critical changes
"Fix the data loss" — don't A/B test it, just fix it.
4. Ethical concerns
The treatment could harm users.
5. Assignment can't be randomized
Geography, regulation.
Alternative methods
When an A/B test doesn't fit:
- Diff-in-diff
- Synthetic control
- Propensity matching
- Pre-post analysis
- Holdout analysis
Each comes with its own assumptions and pitfalls.
Culture
Healthy experimentation
- Hypothesize
- Test
- Learn regardless of the outcome
- Share findings
Unhealthy
- Only celebrating wins
- Hiding failures
- Cherry-picking
- Chasing significance
This directly affects the quality of decisions.
Build the experimentation muscle
Start small
Simple tests first; build comfort.
Get a mentor
Have an experienced person review your early tests.
Review results
Hold regular team review meetings.
Track your success rate
Goal: experiment velocity plus quality.
Common quality metrics
- % of experiments properly powered
- % with pre-registered hypotheses
- % with a complete post-mortem
- Ship rate (30-50% is a reasonable target)
In the interview
"The test came back positive — what next?"
Check:
- Was it properly powered?
- Is the SRM check clean?
- Are the guardrail metrics fine?
- Are the segments consistent?
- What about long-term impact?
Don't ship blindly.
"The test is not significant":
- What is the effect size? The test may have been underpowered.
- Which way does the trend point? Maybe more data is needed.
- What do the segments show?
- Extract a learning — don't just move on.
"Give an example of an A/B failure?"
Tell a specific story: what happened and what you learned.
FAQ
Are that many failures OK?
Yes — 70-80% null results is the industry norm.
How do I improve the success rate?
Better hypotheses, larger MDEs, smarter prioritization.
Should failures be hidden?
No. That's how the team learns.
Practice A/B testing — open the trainer with 1500+ interview questions.