Why A/B Tests Fail

Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions: SQL, Python, A/B testing, metrics. Free.

Why this matters

Roughly 80% of A/B tests come back null or negative. Knowing why helps you avoid common traps and raise your success rate.

In interviews, questions about A/B failures are a signal of real experience.

The probability problem

Reality

Most features don't produce a significant impact; a true null is the common case.

A "ship anything that looks positive" strategy is biased: with true nulls this common, many of those "wins" are false positives.

Solutions

  • Prioritize by effort vs. expected reward
  • Let many small wins accumulate over time

Technical failures

1. Underpowered

The sample size is too small to detect the true effect.

Fix: calculate the required sample size up front and run for the full planned duration.
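A minimal sketch of the up-front calculation (pure-stdlib Python; the function name and defaults are mine), using the standard two-proportion z-test approximation:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Approximate per-arm N for a two-proportion z-test.

    p_base: baseline conversion rate
    mde: absolute minimum detectable effect (e.g. 0.01 = 1 pp)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = z.inv_cdf(power)
    p_test = p_base + mde
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1 pp lift on a 10% baseline takes roughly 15k users per arm
print(sample_size_per_arm(0.10, 0.01))
```

Note that halving the MDE roughly quadruples the required sample, which is why tiny expected effects are often untestable in practice.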

2. Peeking

Checking results mid-test and stopping at the first "significant" reading inflates the false positive rate.

Fix: pre-register the duration, or use sequential testing methods.
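You can see the inflation directly in an A/A simulation (illustrative Python sketch; the function name and the simulation parameters are arbitrary choices of mine):

```python
import random
from statistics import NormalDist

random.seed(42)

def peeking_fpr(n_sims=2000, n_per_look=200, looks=5, alpha=0.05):
    """Simulate A/A tests (no true effect), peeking after every batch
    and stopping at the first 'significant' z-test. Returns realised FPR."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        for look in range(1, looks + 1):
            conv_a += sum(random.random() < 0.1 for _ in range(n_per_look))
            conv_b += sum(random.random() < 0.1 for _ in range(n_per_look))
            n = look * n_per_look
            p_pool = (conv_a + conv_b) / (2 * n)
            se = (2 * p_pool * (1 - p_pool) / n) ** 0.5
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

# With 5 peeks, the realised FPR lands well above the nominal 5%
print(peeking_fpr())
```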

3. Sample ratio mismatch (SRM)

Group sizes deviate from the planned split, usually a silent bug in the assignment pipeline.

Fix: run a chi-square test on the observed ratio; if it fails, distrust the experiment's results.
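A minimal SRM check for a planned 50/50 split (stdlib-only sketch; with one degree of freedom the chi-square p-value follows directly from the normal CDF):

```python
from statistics import NormalDist

def srm_pvalue(n_control, n_treatment, expected_share=0.5):
    """Chi-square goodness-of-fit (1 df) for sample ratio mismatch.
    p < 0.001 is a commonly used SRM alarm threshold."""
    total = n_control + n_treatment
    exp_c = total * expected_share
    exp_t = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # With 1 df, chi2 equals z**2, so convert via the standard normal CDF
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

print(srm_pvalue(50_000, 50_300))  # small wobble: p is large, no alarm
print(srm_pvalue(50_000, 52_000))  # a 2% skew at this scale: p near 0, SRM alarm
```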

4. Multiple testing

Test 10 metrics and one will likely come out "significant" by chance alone.

Fix: pre-define a single primary metric; apply a multiple-comparison correction when testing several.
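For example, a Holm step-down correction (a uniformly more powerful cousin of Bonferroni; plain-Python sketch, the function name is mine):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns which hypotheses to reject
    while controlling the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails as well
    return reject

# 10 metrics; the lone p = 0.03 looks significant at face value...
pvals = [0.03, 0.20, 0.41, 0.55, 0.62, 0.70, 0.74, 0.81, 0.88, 0.95]
print(holm_reject(pvals))  # ...but does not survive the correction
```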

5. Leakage

The treatment leaks into the control group through bugs in the exposure logic.

Fix: audit the assignment logic carefully; segment out contaminated users.

6. Novelty effect

Metrics jump in the short term, then drift back to baseline.

Fix: run longer, or keep a long-term holdout.

7. Wrong metric

"We shipped it, revenue is up," yet users are uninstalling.

Fix: a primary metric plus guardrail metrics plus long-term tracking.

8. Selection bias

Testing on engaged users only means the result doesn't generalize to everyone else.

Fix: randomize over the full target population.

9. Network effects

The treatment affects the control group through social connections.

Fix: use a cluster-randomized design (randomize by community, city, or market).

10. Outliers

One whale makes a huge purchase and inflates the mean.

Fix: inspect the distribution; use medians, or clip/winsorize extreme values.
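A tiny illustration of capping (the helper name and the aggressive 10% cap are just for the demo):

```python
from math import ceil

def winsorize_top(values, pct=0.01):
    """Cap the top pct of values at the (1 - pct) empirical quantile."""
    s = sorted(values)
    cap = s[max(0, ceil(len(s) * (1 - pct)) - 1)]
    return [min(v, cap) for v in values]

revenue = [5, 7, 4, 6, 5, 8, 6, 5, 7, 10_000]  # one whale
mean_raw = sum(revenue) / len(revenue)                         # ~1005
mean_capped = sum(winsorize_top(revenue, 0.1)) / len(revenue)  # 6.1
print(mean_raw, mean_capped)
```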

Product failures

1. Feature doesn't address real problem

The solution was built without validating the problem.

Fix: do user research first.

2. Poor implementation

A buggy feature hurts instead of helping.

Fix: QA the feature before running the experiment.

3. Wrong segment tested

Power users and casual users react differently.

Fix: run segment-level analysis.

4. Conflicts with other features

Simultaneous experiments interfere with each other.

Fix: track the whole portfolio of running tests and check for overlap.

5. Wrong timing

A holiday, a sale, or an external event skews behavior.

Fix: choose stable periods.

6. Long-term vs short-term

Shipping on a short-term positive result that hurts retention later.

Fix: keep a holdout group and track long-term.

Interpretation failures

1. p < 0.05 = ship

The p-value is misread as importance: an effect can be statistically significant yet too small to matter for the business.

Fix: also check the effect size.
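A quick way to see this is to compute a confidence interval for the lift rather than just the verdict (stdlib sketch; the function name and numbers are mine):

```python
from statistics import NormalDist

def lift_ci(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Normal-approximation CI for the absolute difference in conversion."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Huge sample: the CI excludes zero (significant at 5%),
# yet the lift is roughly 0.1 pp, which may be meaningless for the business
print(lift_ci(100_000, 1_000_000, 101_000, 1_000_000))
```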

2. Significant = causal

"The feature caused X," but the result may still be confounded.

Fix: randomization supports causal claims only when properly implemented, so verify it.

3. Ignoring segments

+5% on average, but -20% for premium users.

Fix: segment analysis.

4. Over-generalizing

"Users like the feature, so everyone wants it."

In reality the effect may be limited to a specific segment.

5. Cherry-picking

"5 metrics look fine, 1 is bad, so ignore it."

Fix: weigh all pre-registered metrics together, including the bad one.

Process failures

1. No hypothesis

"Let's just test it," with no prediction stated.

The result is then hard to interpret.

2. Unclear success criteria

"If the lift is positive" is too vague.

Fix: pre-commit to explicit decision thresholds.

3. Post-hoc rationalization

The story changes after the results come in.

Fix: pre-register the hypothesis and analysis plan.

4. Politics

"The PM wants to ship," so the analyst finds significance.

Fix: keep analysts independent, separated from PM incentives.

5. No learnings

Win or lose, the team never reviews the result.

Fix: run a post-mortem on every test.

Scenario-specific failures

A/A test

It should come back null, yet shows "significant": a bug in the experimentation system.

Fix: investigate, and validate the platform with periodic A/A runs.

Holiday shift

Retention normally drops over Christmas, so "the treatment is down" may not be real.

Fix: adjust for seasonality and use year-over-year comparisons.

Ramp-up

A gradual rollout (1% → 10% → 50%) means a different user mix at each phase.

Fix: wait for the full rollout and let metrics stabilize before reading results.

Changing populations

New users flood in (a marketing spike) and the population composition shifts.

Fix: segment new vs. returning users.

Long tails

The effect hits only 1% of users, but is huge per user.

Fix: analyze the full distribution, not just the mean.

Recovery after failure

1. Understand why

Run a post-mortem and write it up.

2. Improve process

Document the learning and update your templates.

3. Communicate to the team

Be transparent so the whole team learns.

4. Retry carefully

After fixing the cause, try again in a structured way.

When not to run an A/B test

1. Very small expected effect

Statistical significance is practically unreachable.

2. Very small traffic

You can't detect any reasonable MDE with the traffic available.

3. Obvious / critical

"Fix the data loss bug": don't A/B test it, just fix it.

4. Ethical concerns

The treatment could be harmful to the users exposed to it.

5. Assignment can't be randomized

Geography or regulation dictates who gets the treatment.

Alternative methods

When an A/B test doesn't fit:

  • Difference-in-differences
  • Synthetic control
  • Propensity score matching
  • Pre-post analysis
  • Holdout analysis

Each comes with its own assumptions and pitfalls.
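For instance, the first item reduces to simple arithmetic in the classic 2x2 case (illustrative numbers; the key assumption is parallel trends: without treatment, both groups would have moved alike):

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Classic 2x2 difference-in-differences estimate of the treatment effect."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Feature launched in one city; a comparable city serves as control.
# Treated city: retention 0.20 -> 0.26; control city: 0.21 -> 0.23.
effect = diff_in_diff(0.20, 0.26, 0.21, 0.23)
print(round(effect, 4))  # 0.04: a 4 pp lift net of the common trend
```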

Culture

Healthy experimentation

  • Hypothesize
  • Test
  • Learn regardless of outcome
  • Share findings

Unhealthy

  • Only celebrate wins
  • Hide failures
  • Cherry-pick
  • Chase significance

This culture directly affects the quality of decisions.

Build experimentation muscle

Start small

Start with simple tests and build comfort.

Mentor

Have an experienced person review your early tests.

Review results

Hold a regular team review meeting.

Track success rate

Goal: both experiment velocity and quality.

Common quality metrics

  • % of experiments properly powered
  • % with pre-registered hypotheses
  • % with a complete post-mortem
  • Ship rate (a 30-50% target is reasonable)

In the interview

"The test came back positive. What next?"

Check:

  • Was it properly powered?
  • Is the sample ratio okay (no SRM)?
  • Are guardrail metrics healthy?
  • Are segments consistent?
  • What about long-term impact?

Don't ship blindly.

"The test isn't significant":

  • What's the effect size? The test may have been underpowered.
  • Which way does the trend point? Maybe you need more data.
  • Any segment differences?
  • Extract the learning instead of just moving on.

"Got an example of a failed A/B test?"

Tell a specific story: what happened and what you learned.

FAQ

Is it okay that so many tests fail?

Yes, it's normal: 70-80% null results is the industry norm.

How do I improve the success rate?

Better hypotheses, larger MDEs, smarter prioritization.

Should I hide failures?

No. Shared failures are how the team learns.


Practice A/B testing: open the trainer with 1500+ interview questions.