Rules of a Good A/B Test

Карьерник is a quiz trainer in Telegram with 1,500+ questions for analyst interviews: SQL, Python, A/B testing, metrics. Free.

Why this matters

A/B tests are often run poorly. Following the rules below puts you in the top 20% of analysts.

Interviewers ask about best practices; knowing them is a senior signal.

Do's

1. Pre-register

Hypothesis, metric, MDE, and duration are all decided before the test starts.

This prevents p-hacking.

2. Single primary metric

One. Secondary — context. Guardrail — не broken.

3. Calculate sample size

Use the formula based on α, power, MDE, and the baseline rate.

Don't eyeball it.
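
As a sketch of that formula, here is the standard two-proportion sample-size calculation; the 10% baseline and +1 pp MDE are hypothetical numbers:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.8):
    """n per variant for a two-sided z-test on proportions."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for significance level
    z_beta = norm.ppf(power)            # critical value for desired power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / mde ** 2
    return ceil(n)

n = sample_size_per_variant(0.10, 0.01)  # baseline 10%, MDE +1 pp
print(n)  # roughly 14,750 users per variant
```

Note that halving the MDE roughly quadruples the required sample, which is why tiny effects are so expensive to detect.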

4. Random assignment

Hash the user_id into a bucket so assignment is deterministic. Don't rotate users between variants.
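
A minimal deterministic-assignment sketch; the experiment name and salt format are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")):
    """Same user + experiment always maps to the same bucket."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

v = assign_variant("user_42", "checkout_test")
print(v)
```

Salting the hash with the experiment name keeps assignments independent across experiments.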

5. Monitor guardrails

Set real-time alerts on degradation of critical metrics.

6. Run at least a full week

This covers weekly seasonality patterns.

7. Document

Plan, results, and decision: all in writing.

8. Share results

Both good and bad. This builds a culture of transparency.

9. Learn from failures

A null result is still a learning.

10. Run an A/A test occasionally

It validates the system: an A/A test should show no difference.
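
An A/A check can also be simulated offline to validate the stats pipeline. With no true effect, roughly 5% of tests should come out "significant" at α = 0.05; the data below are synthetic:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
false_positives = 0
for _ in range(1000):                  # 1,000 simulated A/A tests
    a = rng.normal(0, 1, 500)
    b = rng.normal(0, 1, 500)          # same distribution: no true effect
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

rate = false_positives / 1000
print(rate)  # should hover around 0.05
```

A rate far from 5% signals a bug in assignment or in the analysis code.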

Don'ts

1. Don't peek

Don't look at the results before the planned end date.

If you must peek, use sequential methods.

2. Don't change the test midway

"Let's add another variant" mid-test breaks the statistics.

3. Don't rely on p < 0.05 alone

Check the effect size. Statistically significant != practically significant.

4. Don't test trivial changes

A button color has tiny potential impact. Spend your traffic on meaningful changes.

5. Don't ignore SRM

A sample ratio mismatch (unbalanced assignment) means broken randomization: check it before analyzing.

6. Don't ship on pre-final data

Even if the trend is positive, wait for the planned end.

7. Don't forget a holdout

Long-term effects matter; a holdout group helps measure them.

8. Don't run conflicting tests in parallel

Treatment X and Treatment Y running simultaneously means their interaction is not isolated.

9. Don't ignore segments

The aggregate can hide opposite dynamics in individual segments.

10. Don't forget the novelty effect

Shipping based on a week-1 lift overstates the long-term impact.

Technical checks

SRM test

A chi-square test on assignment counts (expecting a 50/50 split here; the observed counts are example numbers):

from scipy.stats import chisquare

n_a, n_b = 10_103, 9_897                       # observed users per variant
expected = [(n_a + n_b) / 2, (n_a + n_b) / 2]  # expected under a 50/50 split
stat, p = chisquare([n_a, n_b], f_exp=expected)
if p < 0.01:
    print("SRM issue! Fix randomization before analyzing.")

Outlier check

Did one extreme user dominate a variant? Compare the max against the 99th percentile:

SELECT
  variant,
  MAX(metric) AS max_value,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY metric) AS p99
FROM experiment
GROUP BY variant;

Sanity check

Basic counts per variant. Do they look reasonable?

Ethical checks

Harm

Could the treatment hurt users?

Don't test changes with extreme potential for harm at all; for milder risks, monitoring should catch degradation fast.

Consent

Ethics boards may require informed consent for sensitive experiments.

Fairness

Is the effect unequal across demographic groups?

Monitor for it.

Statistical

Type I and Type II errors

Balance α against statistical power.

Multiple testing

Testing 10 metrics at α = 0.05 pushes the family-wise false-positive rate above 40%.

Use a Bonferroni correction or pre-register a single primary metric.
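
The inflation is easy to verify:

```python
alpha, m = 0.05, 10
family_fpr = 1 - (1 - alpha) ** m   # P(at least one false positive in 10 tests)
print(f"{family_fpr:.1%}")          # 40.1%

bonferroni_alpha = alpha / m        # Bonferroni: stricter per-metric threshold
print(bonferroni_alpha)             # 0.005
```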

Effect size

Not just the p-value: report Cohen's d or the lift in %.

Confidence intervals

Report the interval estimate; its width shows the precision.

Design

Proper randomization

Keep the unit of assignment consistent over time.

Don't switch a user's variant every session.

Clusters

With network effects, randomize at the city or cluster level.

Stratification

For known confounders, balance within strata.

Analysis

Not only the t-test

Pick the test for the metric type:

  • Conversion: z-test for proportions
  • Revenue: t-test or bootstrap (distributions are heavy-tailed, roughly log-normal)
  • Counts: Poisson test or Mann-Whitney
  • Time-to-event: survival analysis
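
For the conversion case, a two-proportion z-test looks like this (the conversion counts are hypothetical):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical: 1,000/10,000 conversions in control vs 1,100/10,000 in treatment
stat, p = proportions_ztest(count=[1000, 1100], nobs=[10000, 10000])
print(f"z = {stat:.2f}, p = {p:.3f}")  # z ≈ -2.31, p ≈ 0.021: significant at α = 0.05
```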

CUPED

Variance reduction using pre-experiment data. It speeds up tests.
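
A minimal CUPED sketch on synthetic data, using the pre-experiment metric as the covariate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100, 20, 10_000)            # pre-experiment metric (covariate)
y = 0.8 * x + rng.normal(0, 10, 10_000)    # in-experiment metric, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x)     # regression slope of y on x
y_cuped = y - theta * (x - x.mean())       # adjusted metric: same mean, less variance

ratio = np.var(y) / np.var(y_cuped)
print(ratio)  # variance shrinks by ~1 / (1 - rho^2)
```

The variance drop translates directly into shorter tests: with correlation ρ ≈ 0.85 here, variance falls roughly 3-4x.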

Regression adjustment

Include covariates in a regression model. The effect is similar to CUPED.

Ship decision

All positive

Primary metric significant, guardrails OK: ship.

All negative

Don't ship.

Mixed

Run a trade-off analysis. It's a business call.

Partial rollout

If only some segments benefit, ship selectively.

Common mistakes

Gambling

"The trend is positive, ship now" is premature.

Ignoring limitations

Small N, short duration, narrow segment: don't overclaim.

Re-analyzing

"We changed the primary metric afterwards" is p-hacking.

Ignoring outliers

One customer in the treatment made a huge purchase and it inflates the mean.

Check the distribution.
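
Robust statistics make the problem visible; the revenue numbers below are made up, with one whale:

```python
import numpy as np
from scipy.stats import trim_mean

revenue = np.array([12.0, 0, 8, 5, 0, 7, 9500, 11, 6, 3])  # one whale order
mean = revenue.mean()                # 955.2: dominated by a single user
median = np.median(revenue)         # 6.5: robust center
trimmed = trim_mean(revenue, 0.1)   # 6.5: drop top/bottom 10% before averaging
print(mean, median, trimmed)
```

If whales are legitimate, consider capping (winsorizing) the metric rather than dropping users.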

Reporting

What to include

  • Hypothesis
  • Setup (duration, size, randomization)
  • Results (primary, secondary, guardrails)
  • Segments
  • Decision
  • Learnings

Audience

  • Team: detailed
  • Exec: summary + decision

Long-term

Holdout

Some users never receive new features. Measure the cumulative impact against them.

Follow-up

After shipping, monitor the metric long-term. Has the novelty worn off?

Meta-analysis

Patterns emerge across experiments: "when we expect a 5% impact, we tend to get 3%".

Calibrate your forecasts.

Tools

Platforms

  • Optimizely, VWO (commercial)
  • GrowthBook, Flagsmith (open-source)
  • Custom (big tech)

Analysis

  • SQL + Python for flexible analysis
  • Platforms: easy but limited

Power analysis

  • Online calculators
  • Python statsmodels

Team culture

Experimentation ritual

A weekly or bi-weekly review of results.

Learning

Share wins and losses so sub-teams learn from each other.

Continuous improvement

Keep the process, templates, and tools evolving.

Advanced

Sequential testing

Lets you peek at results properly, without inflating false positives.

Bayesian A/B testing

Reports the posterior probability that the treatment beats the control.
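
A conjugate Beta-Binomial sketch (conversion counts are hypothetical, with a flat Beta(1, 1) prior):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical: 1,000/10,000 conversions in control vs 1,100/10,000 in treatment
post_a = rng.beta(1 + 1000, 1 + 9000, 100_000)   # posterior draws for control
post_b = rng.beta(1 + 1100, 1 + 8900, 100_000)   # posterior draws for treatment
prob = (post_b > post_a).mean()
print(prob)  # P(treatment beats control), here around 0.99
```

"P(B beats A) = 99%" is usually easier to communicate to stakeholders than a p-value.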

Multi-armed bandits

Dynamically allocate traffic to the best-performing variant.

Causal inference

Observational methods for when an A/B test isn't possible.

In the interview

"What's the golden rule of A/B testing?"

Answer:

  • Pre-register
  • Sample size
  • No peeking
  • Monitor guardrails
  • Full cycle

Naming several of these shows breadth.

"Common mistakes?"

Peeking, multiple testing, ignoring segments, the novelty effect.

FAQ

Is a perfect A/B test possible?

Rarely. Balance rigor against speed.

What if the sample is small?

A small MDE is still detectable, but it requires a longer test duration.

New to A/B testing?

Start with simple tests, learn the infrastructure, then ramp up to complex designs.


To practice, open the trainer with 1,500+ interview questions.