A/B Testing Cheat Sheet

Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews: SQL, Python, A/B, metrics. Free.

Why this matters

A/B tests are an analyst's daily bread, and interviews ask about them regularly. This cheat sheet is a quick reference for key concepts, formulas, and pitfalls.

Fundamentals

Hypotheses

H0 (null): there is no effect. H1 (alternative): there is an effect.

An A/B test tries to reject H0.

P-value

The probability of seeing data at least as extreme as the observed data, assuming H0 is true.

Small p (< 0.05): reject H0 and call the result "significant".

Alpha (α)

The significance threshold. Typically 0.05.

Equals the false positive rate when H0 is true.

Beta (β)

False negative rate. Typically 0.20 (power = 1 - β = 0.80).

Key metrics

Primary

The main metric that drives the decision.

Secondary

Additional context.

Guardrail

Metrics that must not get worse (crashes, churn).

Sample size

Formula (proportions)

n = 2 × (z_α/2 + z_β)² × p̄ × (1 - p̄) / (p1 - p2)²

Here n is the sample size per group and p̄ = (p1 + p2) / 2. For α = 0.05, power = 0.80:

  • z_α/2 = 1.96
  • z_β = 0.84

MDE relationship

n ∝ 1 / MDE²

Halving the MDE roughly quadruples the required N.
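
The formula above can be sketched with the Python standard library alone. The function name and the 10% baseline are illustrative assumptions, not part of the original sheet.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH group to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return ceil(n)


# Hypothetical baseline of 10% conversion; MDE of 2 points, then 1 point.
n_mde_2pt = sample_size_per_group(0.10, 0.12)
n_mde_1pt = sample_size_per_group(0.10, 0.11)
print(n_mde_2pt, n_mde_1pt, n_mde_1pt / n_mde_2pt)
```

The ratio comes out close to 4 rather than exactly 4, because p̄ also shifts slightly when p2 changes.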

Ladder of analyses

T-test

Compares the means of two groups (continuous metric).

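from scipy.stats import ttest_ind
# a, b: arrays of per-user values; equal_var=False runs Welch's t-test (no equal-variance assumption)
t, p = ttest_ind(a, b, equal_var=False)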

Z-test for proportions

Compare conversion rates.

from statsmodels.stats.proportion import proportions_ztest
# conv_a, conv_b: conversion counts; n_a, n_b: users in each group
z, p = proportions_ztest([conv_a, conv_b], [n_a, n_b])

Chi-square

Categorical data, independence.

Mann-Whitney U

Non-parametric, skewed data.

Pitfalls

Peeking

Checking results repeatedly and stopping at the first significant one inflates the false positive rate (FPR).
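
A small A/A simulation makes the inflation visible. This is a sketch under simplifying assumptions (normal data with known σ = 1, five equally spaced looks); all names and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def simulate_aa(n_sims: int = 1000, n_per_group: int = 200, looks: int = 5, seed: int = 0):
    """Return (fpr_single_look, fpr_with_peeking) over simulated A/A tests."""
    rng = random.Random(seed)
    step = n_per_group // looks
    hits_final = hits_peek = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            for _ in range(step):
                sum_a += rng.gauss(0, 1)  # both groups share one distribution
                sum_b += rng.gauss(0, 1)
            n = look * step
            # z-statistic for a difference in means with known sigma = 1
            z = (sum_a / n - sum_b / n) / sqrt(2 / n)
            if abs(z) > Z_CRIT:
                crossed = True
        hits_final += abs(z) > Z_CRIT  # only the last look counts
        hits_peek += crossed           # any look counts
    return hits_final / n_sims, hits_peek / n_sims


fpr_final, fpr_peek = simulate_aa()
print(f"single look: {fpr_final:.3f}, peeking at 5 looks: {fpr_peek:.3f}")
```

There is no real effect in either group, so every "significant" result here is a false positive. The peeking rate cannot be lower than the single-look rate, since the final look is one of the five.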

Multiple testing

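With 10 independent metrics each tested at α = 0.05, the chance of at least one false positive is 1 - 0.95^10 ≈ 0.40, not 0.05. Bonferroni: test each metric at α/k.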
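
The arithmetic can be checked in a few lines. The sketch assumes the metrics are independent; the function names are illustrative.

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / k


k = 10
fwer_raw = family_wise_error_rate(0.05, k)
fwer_corrected = family_wise_error_rate(bonferroni_alpha(0.05, k), k)
print(f"uncorrected: {fwer_raw:.3f}, with Bonferroni: {fwer_corrected:.3f}")
```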

Novelty effect

A temporary boost simply because the feature is new.

Primacy effect

Users temporarily resist a change, so the early effect looks worse than the long-run one.

Sample ratio mismatch (SRM)

Expected 50/50, got 49/51: is the split biased? Check with a chi-square goodness-of-fit test.
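
Whether 49/51 is alarming depends on traffic. Below is a minimal sketch of the chi-square goodness-of-fit check for two groups, using only the standard library. The function name and the user counts are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-group split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# The same 49/51 split at two hypothetical traffic levels.
p_small = srm_p_value(490, 510)
p_large = srm_p_value(49_000, 51_000)
print(p_small, p_large)
```

With 1,000 users the split is compatible with chance. With 100,000 users it points to a broken assignment or logging pipeline.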

Network effects

Treatment affects control (social, marketplace).

Mitigation: a cluster-randomized test.

Statistical vs practical significance

  • p < 0.05: statistical
  • Effect size — practical

With a large sample, even trivial effects become statistically significant.

Always report both.

CI for the difference

CI = (diff) ± z × SE

where SE depends on the metric type.

For proportions:

SE = √(p1(1-p1)/n1 + p2(1-p2)/n2)

95% CI: diff ± 1.96 × SE.

If the interval does not include 0, the difference is significant at the 5% level.
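
A worked version of the proportions case, as a sketch with hypothetical counts (1,000 vs 1,100 conversions out of 10,000 users each):

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, level: float = 0.95):
    """Confidence interval for p_b - p_a using the unpooled standard error."""
    p1, p2 = conv_a / n_a, conv_b / n_b
    se = sqrt(p1 * (1 - p1) / n_a + p2 * (1 - p2) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    diff = p2 - p1
    return diff - z * se, diff + z * se


low, high = diff_ci(1_000, 10_000, 1_100, 10_000)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

The whole interval sits above 0 here, so the one-point lift is significant at the 5% level.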

Effect size

Cohen's d

d = (mean1 - mean2) / pooled_std
  • 0.2: small
  • 0.5: medium
  • 0.8: large

Lift

Relative: (treatment - control) / control × 100%.

Easy for the business to interpret.
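
Both measures fit in a few lines. The sketch uses made-up toy data, and pooled_std is computed from the two sample variances weighted by their degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def relative_lift(treatment: float, control: float) -> float:
    """Relative lift in percent."""
    return (treatment - control) / control * 100


# Toy per-user values, invented for illustration.
treatment = [12.0, 15.0, 11.0, 14.0, 13.0]
control = [10.0, 13.0, 9.0, 12.0, 11.0]
d = cohens_d(treatment, control)
lift = relative_lift(mean(treatment), mean(control))
print(f"d = {d:.2f}, lift = {lift:.1f}%")
```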

CUPED

Variance reduction:

Y_cuped = Y - θ × (X - mean(X))

where X is a pre-experiment covariate and θ = Cov(Y,X) / Var(X).

Reduces variance, so a smaller sample size is enough.
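
A minimal sketch of the adjustment on synthetic data. The data-generating numbers are invented, and θ is estimated on all users together, which is a common choice but an assumption here.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return Y_cuped = Y - theta * (X - mean(X)) for paired lists y and x."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


rng = random.Random(42)
x = [rng.gauss(100, 20) for _ in range(2000)]    # pre-experiment metric
y = [0.8 * xi + rng.gauss(0, 10) for xi in x]    # in-experiment metric, correlated with x
y_cuped = cuped_adjust(y, x)
print(f"var(Y) = {variance(y):.1f}, var(Y_cuped) = {variance(y_cuped):.1f}")
```

The mean of Y is unchanged by the adjustment, and the variance drops by more when X and Y are more strongly correlated.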

Randomization

Simple

Assign users at random, 50/50.

Stratified

Randomize within strata (platform, country) to keep the groups balanced.

Cluster

Randomize by city or cohort, not by individual.
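
Stratified assignment can be sketched as shuffling within each stratum and splitting in half. The input format, names, and platform labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(users, seed: int = 0):
    """users: list of (user_id, stratum). Returns {user_id: "A" or "B"}."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)
    assignment = {}
    for ids in by_stratum.values():
        rng.shuffle(ids)          # random order inside the stratum
        half = len(ids) // 2
        for i, user_id in enumerate(ids):
            assignment[user_id] = "A" if i < half else "B"
    return assignment


# 600 hypothetical users: every third one is on iOS, the rest on Android.
users = [(i, "ios" if i % 3 == 0 else "android") for i in range(600)]
assignment = stratified_assign(users)
ios_in_a = sum(1 for uid, s in users if s == "ios" and assignment[uid] == "A")
print(ios_in_a, sum(1 for g in assignment.values() if g == "A"))
```

Each platform ends up split evenly between A and B, which simple randomization only achieves on average.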

Choosing the duration

Min: 1-2 weeks.

Ideal: until the pre-calculated N is reached. Stopping as soon as the result turns significant is peeking, unless a sequential method is used.

Consider:

  • Full weeks (weekly patterns)
  • Novelty wearing off
  • Seasonality
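
One way to turn a pre-calculated N into a calendar plan is sketched below. The helper name, the one-week minimum default, and the traffic numbers are illustrative assumptions.

```python
from math import ceil


def planned_duration_days(n_per_group: int, n_groups: int, daily_users: int, min_days: int = 7) -> int:
    """Days needed to collect the pre-calculated N, rounded up to whole weeks."""
    days_for_n = ceil(n_per_group * n_groups / daily_users)
    days = max(days_for_n, min_days)
    return ceil(days / 7) * 7  # full weeks cover weekly patterns


# Hypothetical numbers: 14,752 users per group, 1,500 eligible users per day.
duration = planned_duration_days(14_752, 2, 1_500)
print(duration)
```

Novelty and seasonality still need a judgment call on top of this arithmetic.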

Results analysis

Confirm

Primary metric significant?

Guardrails

Not degraded?

Segments

Any heterogeneity? Does it work for all segments?

Duration

Did the test run for the full planned period?

Decision

Ship, don't ship, or iterate.

Decision framework

  • Significant + positive + guardrails ok: ship
  • Significant + negative: don't ship
  • Not significant: don't ship, or iterate
  • Mixed segments: consider partial ship

SQL for A/B

Basic

SELECT
    variant,
    AVG(converted * 1.0) AS conversion_rate,  -- assumes converted is stored as 0/1
    COUNT(*) AS n
FROM experiment
GROUP BY variant;

CI (per-variant mean)

SELECT
    variant,
    AVG(metric) AS mean,
    STDDEV(metric) AS sd,
    COUNT(*) AS n,
    AVG(metric) - 1.96 * STDDEV(metric) / SQRT(COUNT(*)) AS ci_lower,
    AVG(metric) + 1.96 * STDDEV(metric) / SQRT(COUNT(*)) AS ci_upper
FROM experiment
GROUP BY variant;

Bayesian A/B

Alternative:

  • Posterior ∝ prior × likelihood
  • P(B > A | data) can be interpreted directly
  • Peeking is safer: the posterior stays interpretable at any look, though frequentist error rates are not guaranteed
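
For conversion rates, P(B > A | data) can be estimated by sampling from Beta posteriors. This is a sketch assuming uniform Beta(1, 1) priors; the counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a rate: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


p_b_better = prob_b_beats_a(100, 1_000, 130, 1_000)
print(f"P(B > A | data) is about {p_b_better:.3f}")
```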

Variance reduction techniques

  • CUPED
  • Stratification
  • Regression adjustment
  • Variance-weighted estimator

All of them reduce the N needed for a given detectable effect.

Sequential testing

Proper methods for peeking:

  • AVI (Always Valid Inference)
  • mSPRT
  • Alpha spending
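
To illustrate alpha spending, the sketch below evaluates two standard Lan-DeMets spending functions at five equally spaced looks. It only shows how much cumulative α may be used by each look. Turning that into per-look critical values needs numerical integration, which is not covered here.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obf_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: almost no alpha early, most of it at the end."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha is used up more evenly across looks."""
    return alpha * log(1 + (e - 1) * t)


# t is the information fraction: share of the planned sample collected so far.
for look in range(1, 6):
    t = look / 5
    print(f"t = {t:.1f}  OBF: {obf_spending(t):.5f}  Pocock: {pocock_spending(t):.5f}")
```

Both functions reach the full α = 0.05 at t = 1, so the overall false positive rate stays controlled however the looks are spaced.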

Common errors

1. Tiny sample

100 users and a "significant 10% lift": probably noise.

2. Cherry-picking

Try 10 metrics and one may come out significant by chance. Adjust α.

3. Wrong assumptions

Assuming normality on log-normal data: with heavy skew and modest samples, t-test p-values become unreliable.

4. Ignoring context

"Ran the test, got a result, shipped" misses cultural and seasonal context.

Interview questions

"T-test vs Z-test?" Z for large N or known σ. T for small N with unknown σ.

"How do you choose N?" Based on MDE, α, power, and baseline variance.

"What is power?" 1 - β. The probability of detecting a true effect.

"CUPED?" Variance reduction through a pre-experiment covariate.

"How do you handle network effects?" Cluster-randomized experiments.

FAQ

What is the default α?

0.05 by convention. Stricter: 0.01 for critical decisions.

One-tailed vs two-tailed?

Two-tailed is the safer default. One-tailed only with a strong prior expectation about the direction.

Bootstrap or analytical?

Analytical is fast. Bootstrap is flexible but heavier on compute.
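
For reference, a percentile bootstrap for the difference in means might look like the sketch below. The skewed "revenue" data is synthetic, and the resample count and function name are illustrative.

```python
import random


def bootstrap_diff_ci(a, b, n_boot: int = 2000, level: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_b) / len(b) - sum(resample_a) / len(a))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Synthetic log-normal data standing in for a skewed revenue metric.
data_rng = random.Random(1)
control = [data_rng.lognormvariate(0.0, 1.0) for _ in range(500)]
treatment = [data_rng.lognormvariate(0.3, 1.0) for _ in range(500)]
boot_low, boot_high = bootstrap_diff_ci(control, treatment)
print(f"95% bootstrap CI: [{boot_low:.3f}, {boot_high:.3f}]")
```

The same function works for medians or ratios by swapping the statistic, which is where the extra compute pays off.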

