How to design an A/B test, step by step
Why this matters
An A/B test is more than shipping code. A poor design leads to wrong conclusions, and following the steps in the right order is a matter of discipline.
In interviews you may be asked to walk through an A/B test design. Doing it well shows hands-on experience.
Workflow steps
- Hypothesis: what we are testing and why
- Primary metric: what to measure
- Secondary / guardrail: what adds context and what must not get worse
- Sample size: how many users
- Randomization: assignment strategy
- Duration: how long to run
- Launch: deploy
- Monitor: during the run
- Analyze: after completion
- Decide: ship / roll back / iterate
Step 1: Hypothesis
A clear statement:
"If we change X, then Y will improve because Z."
Example:
"If we simplify the checkout form (remove 3 fields), then the conversion rate will increase by 5% because there is less friction."
Bad hypothesis
- "Improve UX" (vague)
- "The redesign will be better" (no mechanism)
- "Something will happen" (no prediction)
Good hypothesis
- Specific change
- Specific metric
- Predicted direction
- Mechanism
Step 2: Primary metric
Pick one main metric. It should be:
- Measurable
- Sensitive (able to detect the expected lift)
- Aligned with the business goal
Typical choices
- Conversion rate
- Revenue per user
- Retention D7
- NPS
Tradeoffs
- Revenue: the ultimate business metric, but high variance
- Conversion rate (CR): sensitive, but only a proxy for business value
- Retention: captures long-term value, but slow to measure
Pick based on the goal of the test.
Step 3: Secondary / guardrail
Secondary
Metrics that add context:
- CR across segments
- Average order value
- Time on site
Guardrail
Metrics that must not get worse:
- Churn
- Support tickets
- Errors / crashes
"Treatment increased CR by 5% but tripled refunds" is a bad tradeoff.
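Step 4: Sample size
See "How to calculate sample size".
Inputs:
- Alpha: significance level (commonly 0.05)
- Power (commonly 0.80)
- Baseline: current value of the primary metric
- MDE: minimum detectable effect
Use a formula or a calculator.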
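As a rough illustration of the formula route, here is a minimal sketch for a conversion-rate metric, using the standard normal approximation for comparing two proportions. The function name `sample_size_per_group` and the example numbers (10% baseline, 2 percentage point absolute MDE) are assumptions for illustration, not values from this article.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided test of two proportions.

    baseline: current conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.02
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical inputs: 10% baseline conversion, detect a lift to 12%.
n = sample_size_per_group(baseline=0.10, mde_abs=0.02)
print(n)
```

Halving the MDE roughly quadruples the required N, which is why the MDE is usually the input worth negotiating first.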
Step 5: Randomization
Simple random
Split users 50/50 at random. The most common approach.
Stratified
Randomize within strata (platform, country). Balances known confounders across groups.
Cluster
Randomize whole groups (city, team). Used when network effects are present.
Identifier
User ID → hash → bucket.
Ensures consistent assignment over time.
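The "User ID → hash → bucket" idea can be sketched as follows. This is one possible implementation, not a specific platform's algorithm: the function name `assign_variant`, the experiment name, and the choice of SHA-256 are all assumptions. Salting the hash with the experiment name keeps assignments in different experiments independent of each other.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant of a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 8 hex characters as an integer and take it modulo
    # the number of variants to get the bucket index.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same variant of the same experiment.
print(assign_variant("user_42", "checkout_form_v2"))
print(assign_variant("user_42", "checkout_form_v2"))
```

Because the assignment is a pure function of the user ID and the experiment name, no assignment table has to be stored to keep users in their group across sessions.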
Step 6: Duration
Minimum
Usually 2 weeks, to cover weekly patterns.
Maximum
Until the planned N is reached or practical reasons force a stop.
Considerations
- Novelty effect (settles in 2-4 weeks)
- Seasonality
- Traffic fluctuations
Pre-register
Set the duration and N upfront. This guards against stopping early on a peek.
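One way to turn the planned N into a pre-registered duration is sketched below. It assumes a steady flow of new eligible users per day (`daily_users`, a hypothetical input), applies the 2-week minimum, and rounds up to whole weeks so that every weekday is covered equally.

```python
import math


def planned_duration_days(n_per_group, daily_users, groups=2, min_days=14):
    """Days needed to reach the planned N, rounded up to full weeks."""
    days_for_n = math.ceil(n_per_group * groups / daily_users)
    days = max(days_for_n, min_days)  # never shorter than the minimum
    return math.ceil(days / 7) * 7    # round up to whole weeks


# Hypothetical inputs: 3839 users per group, 500 new eligible users per day.
print(planned_duration_days(3839, 500))  # 21
```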
Step 7: Launch
Ramp-up
Sometimes the rollout is gradual:
- 1% → 10% → 50% → full
- Catches critical bugs without full exposure
Flag
Use a feature flag system:
- Users are exposed based on their bucket
- The feature can be turned off quickly
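A ramp-up gate with a kill switch might look like the sketch below. The in-memory `FLAGS` dictionary, the flag name `new_checkout`, and the helper names are hypothetical; a real feature flag system would load this configuration from a service. Users get a stable bucket from 0 to 99, so raising `exposure_pct` only adds users and never reshuffles those already exposed.

```python
import hashlib

# Hypothetical flag configuration; a real system would fetch this remotely.
FLAGS = {"new_checkout": {"enabled": True, "exposure_pct": 10}}


def bucket_0_99(user_id, flag_name):
    """Stable bucket in [0, 99] derived from the flag name and user ID."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest()[:8], 16) % 100


def is_exposed(user_id, flag_name, flags=FLAGS):
    """True if the user is inside the current ramp-up percentage."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:  # kill switch or unknown flag
        return False
    return bucket_0_99(user_id, flag_name) < flag["exposure_pct"]


print(is_exposed("user_42", "new_checkout"))
```

Setting `enabled` to `False` turns the feature off for everyone at once, which is the quick rollback path.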
Observability
- Log treatment assignment
- Track key events
Step 8: Monitor
During the run
Watch for:
- Sample ratio mismatch (SRM)
- Error rate spikes
- Severe regressions
- Guardrail metrics breaking down
If there is severe harm, kill the experiment early.
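The SRM check from the list above can be sketched as a chi-square goodness-of-fit test on the group sizes. The function name `srm_p_value` and the example counts are assumptions. With two groups there is one degree of freedom, so the p-value can be computed with the standard library alone.

```python
import math


def srm_p_value(n_control, n_treatment, expected_share_control=0.5):
    """P-value of a chi-square test that group sizes match the planned split."""
    total = n_control + n_treatment
    expected_control = total * expected_share_control
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a planned 50/50 split.
print(srm_p_value(5000, 5050))  # small imbalance, large p-value
print(srm_p_value(5000, 5500))  # large imbalance, tiny p-value
```

A very small p-value (teams often use a strict threshold such as 0.001, which is an assumption here) suggests the assignment or logging is broken and the results should not be trusted until the cause is found.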
Don't peek at the primary metric
The results are tempting, but don't look at the lift midway (the peeking problem).
Peek at guardrails
This is fine: it is a safety check.
Step 9: Analyze
Verify
- SRM test
- Assignment is correct
- Data integrity
Compute
- Primary metric per variant
- Confidence interval (CI)
- P-value
- Effect size
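For a conversion-rate primary metric, the quantities in the "Compute" list can be sketched with a two-proportion z-test. This is one common choice, not the only valid analysis; the function name `compare_conversion` and the example counts are assumptions. The test statistic uses the pooled standard error, while the confidence interval for the difference uses the unpooled one.

```python
import math
from statistics import NormalDist


def compare_conversion(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Rates per variant, absolute and relative lift, CI and two-sided p-value."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error under the null hypothesis of equal rates.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval of the difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "abs_lift": diff,
        "rel_lift": diff / p_c,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
    }


# Hypothetical counts: 400/4000 conversions in control, 480/4000 in treatment.
result = compare_conversion(400, 4000, 480, 4000)
print(result)
```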
Segments
- Per segment analysis
- Heterogeneous effects
Validate
- Cross-check with another metric
- Sanity check
Step 10: Decide
Framework
- Primary significant + positive → strong ship
- Primary not significant → don't ship (usually)
- Primary positive, guardrail negative → tradeoff analysis
- Mixed by segment → partial ship possible
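The framework above can be written down as a small decision helper, which makes the order of the checks explicit. The function name `decide` and its boolean inputs are hypothetical. The branch for a significant negative result is not in the list above; it is an assumption based on the "roll back" option from the workflow.

```python
def decide(primary_significant, primary_positive, guardrails_ok,
           mixed_by_segment=False):
    """Map analysis outcomes to a recommendation, following the list above."""
    if not primary_significant:
        return "don't ship (usually)"
    if not primary_positive:
        # Assumption: a significant negative effect means rolling back.
        return "roll back"
    if not guardrails_ok:
        return "tradeoff analysis"
    if mixed_by_segment:
        return "consider partial ship"
    return "ship"


print(decide(primary_significant=True, primary_positive=True, guardrails_ok=True))
print(decide(primary_significant=True, primary_positive=True, guardrails_ok=False))
```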
Other considerations
- Business value
- Engineering cost to maintain
- Long-term effects (if measured)
Plan document template
# A/B Test Plan: [Feature Name]
## Hypothesis
If we change X, then Y will improve because Z.
## Primary metric
[Metric, definition, target MDE]
## Secondary
- X, Y, Z
## Guardrails
- Must not decrease: A, B, C
## Sample size
[Calculated N per group]
## Duration
[Planned weeks, based on traffic]
## Randomization
[Unit, split]
## Analysis plan
- T-test / chi-square
- Segments to check
- Stopping rules
## Risks
- Novelty effect possible
- Network effects from X
## Timeline
- Week 1: implementation
- Week 2-3: run
- Week 4: analyze
Common designs
Classic A/B
1 control, 1 treatment. Simplest.
Multi-variant (A/B/n)
Multiple variants (A/B/C/D). Compares more options but needs more traffic.
Factorial
Test multiple variables. 2×2 design: e.g., button color × copy.
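Holdout
A share of users never gets the new features, which enables long-term measurement. See "Holdout test".
Switchback
The same units (users or markets) alternate between control and treatment over time. Used when network effects are present.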
Pitfalls
Peeking
Repeatedly checking results and stopping on the first significant one inflates the false positive rate (FPR).
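The inflation can be seen in a small A/A simulation, where there is no true effect and every "significant" result is a false positive. This is a sketch under simplifying assumptions (normally distributed data with known unit variance, a z-test at each look, and stopping at the first look that crosses 1.96); the function name and parameters are illustrative.

```python
import math
import random


def false_positive_rate(looks, n_per_look=50, sims=1000, z_crit=1.96, seed=0):
    """Share of A/A experiments declared significant at any of the looks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(sims):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(looks):
            # Collect another batch of observations for both groups.
            for _ in range(n_per_look):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += n_per_look
            # z-statistic for the difference in means with unit variance.
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "win"
    return false_positives / sims


fixed = false_positive_rate(looks=1, n_per_look=500)   # one look at the end
peeking = false_positive_rate(looks=10, n_per_look=50)  # ten interim looks
print(fixed, peeking)
```

Both runs use the same total sample size per group; only the number of looks differs, and the run with ten looks rejects noticeably more often than the nominal 5%.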
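SRM
Unbalanced assignment leads to biased results.
Leakage
The treatment affects the control group.
Novelty
A temporary lift caused by freshness.
Survivorship
Only the "survivors" are analyzed.
Multiple testing
With 10 independent metrics tested at alpha = 0.05, the chance of at least one false positive is about 40%. Apply a multiple-testing correction.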
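To make the multiple testing problem concrete, the sketch below computes the chance of at least one false positive across independent metrics and applies the Bonferroni correction, which is one simple (and conservative) option among several. The metric names and p-values are made up for illustration.

```python
def familywise_error_rate(n_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni(p_values, alpha=0.05):
    """Flag metrics that stay significant after the Bonferroni correction."""
    threshold = alpha / len(p_values)
    return {metric: p <= threshold for metric, p in p_values.items()}


print(round(familywise_error_rate(10), 3))  # 0.401

# Hypothetical p-values for three metrics; the threshold becomes 0.05 / 3.
print(bonferroni({"cr": 0.003, "aov": 0.04, "time_on_site": 0.2}))
```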
Examples of bad tests
Without hypothesis
"Test the new design." What effect is expected? How will it be measured?
Too short
A 2-day run. Not enough data.
Wrong metric
The test targets revenue impact, but only clicks are measured.
Changing midway
"Let's add another variant" breaks the statistics.
Ethical
Informed?
Do users know they are in a test? Often not (this is common in A/B testing).
Harm potential
A test must not significantly harm users.
Fairness
Be especially careful with pricing experiments.
Tools
Platform
- Optimizely
- VWO
- GrowthBook
- Internal (Yandex ABT, Meta's platform)
Custom
Feature flags plus SQL analysis. Common at many startups.
In the interview
"Design an A/B test for [feature]"
Walk through the 10 steps.
Hypothesis → metric → N → duration → analysis → decision.
This shows process maturity.
Related topics
- A/B testing in plain words
- How to calculate sample size
- A/B cheat sheet
- Peeking problem
- CUPED
- Holdout test
- Novelty effect
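FAQ
Is an A/B test always needed?
Not for everything. For big, reversible changes, yes. For small or obvious ones, no.
What about the Bayesian approach?
It is an alternative that reports probabilities. Peeking is considered safer under it.
How many tests can run at the same time?
Keep it limited. 5-10 major tests running in parallel within a team is OK. With more, check for interactions.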