How to design an A/B test step by step

Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews. SQL, Python, A/B, metrics. Free.

Why you need this

An A/B test is not just "launch the code." A badly designed test leads to wrong conclusions. Following the right sequence of steps is a matter of discipline.

In interviews you may be asked to walk through an A/B test design. It demonstrates experience.

Workflow steps

  1. Hypothesis — what we test and why
  2. Primary metric — what to measure
  3. Secondary / guardrail metrics
  4. Sample size — how many users
  5. Randomization — assignment strategy
  6. Duration — how long to run
  7. Launch — deploy
  8. Monitor — during the run
  9. Analyze — after completion
  10. Decide — ship / rollback / iterate

Step 1: Hypothesis

Clear statement:

«If we change X, then Y will improve because Z».

Example:

«If we simplify the checkout form (remove 3 fields), then conversion rate will increase by 5% because there is less friction».

Bad hypotheses

  • «Improve UX» (vague)
  • «The redesign will be better» (no mechanism)
  • «Something happens» (no prediction)

Good hypothesis

  • Specific change
  • Specific metric
  • Predicted direction
  • Mechanism

Step 2: Primary metric

One main metric.

  • Measurable
  • Sensitive (can detect the expected lift)
  • Aligned with the business

Typical choices

  • Conversion rate
  • Revenue per user
  • Retention D7
  • NPS

Tradeoffs

  • Revenue: the ultimate business metric, but high variance
  • Conversion rate: sensitive, but a proxy
  • Retention: a long-term signal, but slow

Pick based on the goal.

Step 3: Secondary / guardrail metrics

Secondary

Additional context:

  • CR across segments
  • Average order value
  • Time on site

Guardrail

These must not get worse:

  • Churn
  • Support tickets
  • Errors / crashes

«Treatment increased conversion by 5% but tripled refunds» is a bad tradeoff.

Step 4: Sample size

See how to calculate sample size.

Inputs:

  • Alpha (typically 0.05)
  • Power (typically 0.80)
  • Baseline conversion rate
  • MDE (minimum detectable effect)

Use a formula or a calculator.
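
These inputs plug into the standard two-proportion power formula. A minimal stdlib-only sketch (the function name and the relative-MDE convention are illustrative assumptions, not from the article):

```python
from statistics import NormalDist

def sample_size_per_group(baseline, mde_rel, alpha=0.05, power=0.80):
    """Approximate N per group for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)                  # treatment rate at the relative MDE
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    num = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p2 - p1) ** 2) + 1

# 10% baseline, 5% relative lift: tens of thousands of users per group
n = sample_size_per_group(0.10, 0.05)
```

Note how a larger MDE shrinks N quadratically: detecting a 20% relative lift needs roughly 15x fewer users than a 5% lift.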

Step 5: Randomization

Simple random

Users split 50/50. The most common option.

Stratified

Random within strata (platform, country). Balances confounders.

Cluster

Randomize whole groups (city, team). Used when there are network effects.

Identifier

User ID → hash → bucket.

Ensures consistent assignment over time.
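
A minimal sketch of hash-based bucketing (names are illustrative; salting the hash with the experiment name keeps assignments independent across experiments):

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministic 50/50 assignment: the same user always gets the same bucket."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    return "control" if bucket < 50 else "treatment"

# Stable over time: repeated calls for the same user agree
assert assign_variant("user_42", "checkout_v2") == assign_variant("user_42", "checkout_v2")
```

Because the hash is uniform, any large set of users lands close to 50/50 without storing assignments anywhere.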

Step 6: Duration

Minimum

Usually at least 2 weeks, to cover weekly patterns.

Maximum

Run until the planned N is reached, or until practical reasons force a stop.

Considerations

  • Novelty effect (settles in 2-4 weeks)
  • Seasonality
  • Traffic fluctuations

Pre-register

Set the duration and N upfront. This prevents peeking.
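
The planned duration follows from the required N and the daily eligible traffic. A sketch under those assumptions (names illustrative):

```python
import math

def planned_duration_days(n_per_group, daily_eligible_users, groups=2, min_days=14):
    """At least two full weeks; longer if traffic is too low to reach N sooner."""
    days_to_reach_n = math.ceil(n_per_group * groups / daily_eligible_users)
    return max(min_days, days_to_reach_n)

planned_duration_days(58000, 20000)  # N is reached in ~6 days, but still run 14
```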

Шаг 7: Launch

Ramp-up

Sometimes roll out gradually:

  • 1% → 10% → 50% → full
  • Catches critical bugs without full exposure

Feature flags

Use a feature flag system:

  • Routes each user based on their bucket
  • Can be turned off quickly

Observability

  • Log treatment assignment
  • Track key events

Step 8: Monitor

During the run

  • Sample ratio mismatch (SRM)
  • Error rate spikes
  • Severe regression
  • Guardrail violations

If there is severe harm, kill the experiment early.
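
An SRM check is a chi-square goodness-of-fit test on group sizes. A stdlib-only sketch for a two-group split (with 1 degree of freedom the p-value can be computed from the normal distribution; names and the alpha threshold are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_control, n_treatment, expected_share=0.5, alpha=0.001):
    """Chi-square test for sample ratio mismatch; a tiny p-value means
    assignment is broken and the results should not be trusted."""
    total = n_control + n_treatment
    expected_c = total * expected_share
    expected_t = total - expected_c
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # For 1 degree of freedom: P(Chi2 > x) = 2 * (1 - Phi(sqrt(x)))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return p_value, p_value < alpha

# 5000 vs 5500 users on a 50/50 split is a clear mismatch
p, flagged = srm_check(5000, 5500)
```

A strict alpha (0.001 rather than 0.05) is common here, because an SRM alert invalidates the whole experiment.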

Don't peek at the primary metric

The results are tempting, but don't look at the lift midway (the peeking problem).

Peeking at guardrails

That's fine. It's a safety check.

Step 9: Analyze

Verify

  • SRM test
  • Assignment correct
  • Data integrity

Compute

  • Primary metric per variant
  • CI
  • P-value
  • Effect size

Segments

  • Per segment analysis
  • Heterogeneous effects

Validate

  • Cross-check with another metric
  • Sanity check
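
For a conversion-rate primary metric, the computation above amounts to a two-proportion z-test plus a confidence interval on the absolute lift. A stdlib-only sketch (names illustrative):

```python
from math import sqrt
from statistics import NormalDist

def compare_conversions(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided z-test for two proportions with a CI on the absolute lift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    lift = p_t - p_c
    # Pooled standard error for the test statistic
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = lift / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {"lift": lift, "p_value": p_value,
            "ci": (lift - z_crit * se, lift + z_crit * se)}

# 10% -> 11% conversion on 10k users per group
result = compare_conversions(1000, 10000, 1100, 10000)
```

Report the CI alongside the p-value: «lift 1 pp, 95% CI [0.15 pp, 1.85 pp]» is far more informative than «p < 0.05».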

Step 10: Decide

Framework

  • Primary metric significant and positive → strong case to ship
  • Primary not significant → usually don't ship
  • Primary positive but a guardrail negative → tradeoff analysis
  • Mixed results across segments → partial ship possible
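
The framework above can be sketched as a small decision function (labels and ordering are illustrative; it is not a substitute for the tradeoff and business analysis below):

```python
def decide(primary_significant: bool, primary_positive: bool, guardrails_ok: bool) -> str:
    """Map experiment outcomes to the default decision from the framework."""
    if primary_significant and primary_positive and guardrails_ok:
        return "ship"
    if primary_significant and primary_positive:
        return "tradeoff analysis"   # wins on primary, but a guardrail suffered
    if not primary_significant:
        return "don't ship"
    return "rollback"                # significant and negative
```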

Other considerations

  • Business value
  • Engineering cost to maintain
  • Long-term effects (if measured)

Plan document template

# A/B Test Plan: [Feature Name]

## Hypothesis
If we change X, then Y will improve because Z.

## Primary metric
[Metric, definition, target MDE]

## Secondary
- X, Y, Z

## Guardrails
- Must not decrease: A, B, C

## Sample size
[Calculated N per group]

## Duration
[Planned weeks, based on traffic]

## Randomization
[Unit, split]

## Analysis plan
- T-test / chi-square
- Segments to check
- Stopping rules

## Risks
- Novelty effect possible
- Network effects from X

## Timeline
- Week 1: implementation
- Week 2-3: run
- Week 4: analyze

Common designs

Classic A/B

1 control, 1 treatment. Simplest.

A/B/n

Multiple variants (A, B, C, D). More options to compare, but needs more traffic.

Factorial

Test multiple variables. 2×2 design: e.g., button color × copy.
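
Factorial assignment can reuse hash bucketing with a separate salt per factor, so the two variables are randomized independently. A sketch (names illustrative):

```python
import hashlib

def factor_level(user_id: str, factor: str, levels=("A", "B")) -> str:
    """Independent deterministic assignment for one factor of a factorial design."""
    h = int(hashlib.md5(f"{factor}:{user_id}".encode()).hexdigest(), 16)
    return levels[h % len(levels)]

# Each user lands in one of 4 cells of the 2x2 design
cell = (factor_level("user_42", "button_color"), factor_level("user_42", "copy"))
```

Independent hashes per factor give all four combinations roughly equal traffic, which is what lets you estimate both main effects and the interaction.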

Holdout

Some users never get the new features. Used for long-term impact measurement.

Switchback

Variants alternate over time for the same units (e.g., by day or hour). Used for network effects.

Pitfalls

Peeking

Multiple looks at the data inflate the false positive rate.

SRM

Assignment unbalanced → biased results.

Leakage

Treatment affects control.

Novelty

Temporary lift from freshness.

Survivorship

Only the «survivors» are analyzed.

Multiple testing

Test 10 metrics and one will be significant by chance. Apply a correction.
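
The simplest correction is Bonferroni: with m metrics, require p < alpha / m. A sketch:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """With m metrics, a result counts as significant only if p < alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# 10 metrics: each must clear 0.005, not 0.05
flags = bonferroni_significant([0.004, 0.03, 0.2] + [0.5] * 7)
```

Bonferroni is conservative; Holm or FDR-based corrections keep more power, but the principle is the same.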

Examples of bad tests

Without hypothesis

«Test the new design». What effect is expected? How will it be measured?

Too short

A 2-day run. Not enough data.

Wrong metric

Testing for revenue impact but measuring only clicks.

Changing midway

«Let's add another variant» mid-test breaks the statistics.

Ethical

Informed consent?

Do users know? Usually not (this is standard in A/B testing).

Harm potential

Test must not significantly harm users.

Fairness

Be especially careful with pricing experiments.

Tools

Platform

  • Optimizely
  • VWO
  • GrowthBook
  • Internal (Yandex ABT, Meta's platform)

Custom

Feature flags plus SQL analysis. Common at startups.

In the interview

«Design an A/B test for [feature]»

Walk through the 10 steps.

Hypothesis → metric → N → duration → analysis → decision.

Show process maturity.

FAQ

Do you always need an A/B test?

Not for everything. Big, reversible changes — yes. Small / OBS — no.

What about the Bayesian approach?

An alternative, probability-based approach. Safer with respect to peeking.

How many tests at once?

There is a limit. 5-10 major parallel tests per team is OK. Beyond that, check for interactions.


Practice A/B testing — open the trainer with 1500+ interview questions.