Bootstrap in A/B Tests

Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews: SQL, Python, A/B, metrics. Free.

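Why you need to know this

Revenue per user is heavy-tailed, roughly log-normal. The t-test relies on the sample mean being approximately normal, and on small, skewed samples that assumption breaks down, so its results become unreliable. Bootstrap is a non-parametric alternative.

Bootstrap comes up often in middle+ interviews and in real A/B analysis, so it is worth knowing.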

What is bootstrap

Bootstrap resamples the observed data with replacement to estimate the sampling distribution of a statistic.

Simple and powerful.

Algorithm

  1. Start with the observed data: N values
  2. Sample N values with replacement from the observed data to get a bootstrap sample
  3. Compute the statistic (mean, median, etc.) on that sample
  4. Repeat steps 2 and 3 between 1,000 and 10,000 times
  5. The distribution of the computed statistics approximates the sampling distribution

In Python

import numpy as np

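def bootstrap_mean(data, n_iter=10000):
    """Return n_iter bootstrap replicates of the sample mean."""
    data = np.asarray(data)
    means = np.empty(n_iter)
    for i in range(n_iter):
        # Draw len(data) values with replacement from the observed data
        sample = np.random.choice(data, size=len(data), replace=True)
        means[i] = sample.mean()
    return means

# Usage: synthetic skewed data for illustration
data = np.random.exponential(scale=10, size=1000)
boot_means = bootstrap_mean(data)

# 95% percentile confidence interval for the mean
print(np.percentile(boot_means, [2.5, 97.5]))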
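
If SciPy 1.7 or newer is available, the same percentile interval can be obtained from `scipy.stats.bootstrap` instead of a hand-written loop. This is a minimal sketch on synthetic data; the seed, sample size and `n_resamples` value are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.exponential(scale=10, size=1000)  # synthetic skewed data

# data must be passed as a sequence of samples, hence the one-element tuple
res = stats.bootstrap(
    (data,),
    np.mean,
    n_resamples=5000,
    confidence_level=0.95,
    method="percentile",  # same percentile CI as the manual version
    random_state=0,
)
print(res.confidence_interval.low, res.confidence_interval.high)
```

SciPy's default method is `"BCa"`, which adjusts the interval for bias and skewness; `"percentile"` is chosen here only to match the manual loop.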

For A/B

Compare two groups:

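def bootstrap_diff(control, treatment, n_iter=10000):
    """Bootstrap distribution of mean(treatment) - mean(control)."""
    control, treatment = np.asarray(control), np.asarray(treatment)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Resample each group independently, keeping its original size
        c_sample = np.random.choice(control, len(control), replace=True)
        t_sample = np.random.choice(treatment, len(treatment), replace=True)
        diffs[i] = t_sample.mean() - c_sample.mean()
    return diffs

# Usage: control_revenue and treatment_revenue are 1-D arrays of per-user revenue
diffs = bootstrap_diff(control_revenue, treatment_revenue)

# 95% CI for the difference
ci_lower, ci_upper = np.percentile(diffs, [2.5, 97.5])

# Approximate two-sided p-value for H0: diff = 0.
# The bootstrap distribution is centered on the observed difference, not on 0,
# so this measures how far into its tails the value 0 falls.
p_value = 2 * min(
    np.mean(diffs > 0),
    np.mean(diffs < 0)
)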
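
To try the two-group bootstrap end to end, here is a minimal sketch on synthetic data. The log-normal parameters, the group sizes and the 5% uplift are assumptions made purely for illustration, not real results. The helper is repeated so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-user revenue: log-normal, treatment scaled up by 5%
control_revenue = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
treatment_revenue = 1.05 * rng.lognormal(mean=3.0, sigma=1.0, size=5000)

def bootstrap_diff(control, treatment, n_iter=2000):
    """Bootstrap distribution of mean(treatment) - mean(control)."""
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        c_sample = rng.choice(control, size=len(control), replace=True)
        t_sample = rng.choice(treatment, size=len(treatment), replace=True)
        diffs[i] = t_sample.mean() - c_sample.mean()
    return diffs

diffs = bootstrap_diff(control_revenue, treatment_revenue)
ci_lower, ci_upper = np.percentile(diffs, [2.5, 97.5])
p_value = 2 * min(np.mean(diffs > 0), np.mean(diffs < 0))
print(f"95% CI for the difference: [{ci_lower:.2f}, {ci_upper:.2f}], p ~ {p_value:.3f}")
```

If the interval excludes 0, the approximate p-value will be below 0.05. Both are read off the same bootstrap distribution.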

Permutation test

A related technique, and a more principled one for testing the null hypothesis:

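def permutation_test(control, treatment, n_iter=10000):
    """Two-sided permutation test for a difference in means."""
    control, treatment = np.asarray(control), np.asarray(treatment)
    observed_diff = treatment.mean() - control.mean()

    # Under H0 the group labels are exchangeable, so pool the data
    combined = np.concatenate([control, treatment])
    n_c = len(control)

    null_diffs = np.empty(n_iter)
    for i in range(n_iter):
        np.random.shuffle(combined)  # reassign labels at random (in place)
        new_c = combined[:n_c]
        new_t = combined[n_c:]
        null_diffs[i] = new_t.mean() - new_c.mean()

    # Share of shuffled differences at least as extreme as the observed one
    p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))
    return p_value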

Bootstrap vs t-test

T-test

  • Assumes normally distributed data (or a sample large enough for the CLT to apply)
  • Analytical
  • Fast
  • Standard

Bootstrap

  • Non-parametric
  • Works for almost any statistic (median, percentile)
  • Slower to compute
  • Flexible

When to use bootstrap

1. Non-normal data

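Revenue and session length are heavy-tailed. With such data the sample mean approaches normality slowly, so t-test p-values and confidence intervals can be miscalibrated at small to moderate N.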

2. Custom metrics

Metrics such as "average LTV per cohort" involve complex aggregation, so there is no simple analytical variance formula to plug into a t-test.

3. Ratio metrics

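"Revenue per session" is a ratio of two random quantities. Its variance is awkward to derive analytically (it needs the delta method), while bootstrap handles it directly, provided you resample the randomization unit (users) rather than individual sessions.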
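
A minimal sketch of a ratio-metric bootstrap, assuming one row per user. The arrays `revenue` and `sessions`, the helper name `bootstrap_ratio` and the synthetic data are all illustrative assumptions. Users are resampled as whole rows, so the numerator and denominator stay paired.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-user data: number of sessions and total revenue
n_users = 2000
sessions = rng.poisson(lam=5, size=n_users) + 1  # at least one session each
revenue = sessions * rng.exponential(scale=2.0, size=n_users)

def bootstrap_ratio(numerator, denominator, n_iter=2000):
    """Bootstrap sum(numerator) / sum(denominator), resampling users."""
    n = len(numerator)
    # Each row of idx is one bootstrap sample of user indices
    idx = rng.integers(0, n, size=(n_iter, n))
    return numerator[idx].sum(axis=1) / denominator[idx].sum(axis=1)

ratios = bootstrap_ratio(revenue, sessions)
ci_lower, ci_upper = np.percentile(ratios, [2.5, 97.5])
point = revenue.sum() / sessions.sum()
print(f"Revenue per session: {point:.3f}, 95% CI [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The same index array is applied to both columns on purpose: drawing separate indices for revenue and sessions would break the per-user pairing and misstate the variance.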

4. Median / percentile

The t-test is built for means. Bootstrap works for medians, percentiles and most other statistics.
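
As a sketch, the same resampling loop works for any statistic passed in as a function. The helper name `bootstrap_stat` and the synthetic exponential data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_stat(data, stat_fn, n_iter=5000):
    """Bootstrap distribution of an arbitrary statistic stat_fn(sample)."""
    data = np.asarray(data)
    out = np.empty(n_iter)
    for i in range(n_iter):
        sample = rng.choice(data, size=len(data), replace=True)
        out[i] = stat_fn(sample)
    return out

data = rng.exponential(scale=10, size=1000)  # synthetic skewed data

boot_median = bootstrap_stat(data, np.median)
boot_p90 = bootstrap_stat(data, lambda x: np.percentile(x, 90))

median_ci = np.percentile(boot_median, [2.5, 97.5])
p90_ci = np.percentile(boot_p90, [2.5, 97.5])
print("Median 95% CI:", median_ci)
print("P90 95% CI:   ", p90_ci)
```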

Bootstrap pitfalls

1. Small N

If N = 20, bootstrap does not magically fix the problem: it just reuses the same 20 points.

Rule of thumb: at least N = 100 for bootstrap to be reliable.

2. Dependence

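Bootstrap assumes IID observations. Time series and clustered data violate that assumption.

Use block bootstrap for time series, and resample whole clusters (for example, users) for clustered data.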
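
A minimal sketch of a moving-block bootstrap for the mean of a time series. The function name, the block length of 20 and the synthetic AR(1) series are illustrative assumptions; in practice the block length has to be tuned to how far the dependence reaches.

```python
import numpy as np

rng = np.random.default_rng(2)

def moving_block_bootstrap_mean(series, block_len=20, n_iter=2000):
    """Bootstrap the mean of a time series by resampling contiguous blocks."""
    series = np.asarray(series)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    means = np.empty(n_iter)
    for i in range(n_iter):
        # Random block start positions; every block fits inside the series
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets).ravel()[:n]  # trim to original length
        means[i] = series[idx].mean()
    return means

# Synthetic AR(1) series: each value depends on the previous one
n = 1000
noise = rng.normal(size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + noise[t]

block_means = moving_block_bootstrap_mean(series)
iid_means = np.array([rng.choice(series, size=n).mean() for _ in range(2000)])

print("Block bootstrap SE:", block_means.std())
print("IID bootstrap SE:  ", iid_means.std())
```

For positively autocorrelated data like this, the plain IID bootstrap typically reports a standard error that is too small, because it destroys the dependence between neighboring points.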

3. Computational cost

10,000 iterations × complex metric × large data can mean hours of compute.

Parallelize or subsample.

4. Extreme values

If the metric has heavy outliers, they dominate the resampled means.

Use robust statistics (median, winsorized mean).
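
A minimal sketch comparing the bootstrap of a plain mean with a winsorized mean on heavy-tailed data. The 99th-percentile cap, the helper names and the synthetic log-normal revenue are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def winsorized_mean(x, upper_pct=99):
    """Mean after capping values above the given percentile."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap).mean()

def bootstrap_stat(data, stat_fn, n_iter=3000):
    """Bootstrap distribution of stat_fn over resamples of data."""
    out = np.empty(n_iter)
    for i in range(n_iter):
        out[i] = stat_fn(rng.choice(data, size=len(data), replace=True))
    return out

# Synthetic heavy-tailed revenue with a few very large purchases
data = rng.lognormal(mean=2.0, sigma=1.5, size=2000)

boot_mean = bootstrap_stat(data, np.mean)
boot_wins = bootstrap_stat(data, winsorized_mean)

mean_lo, mean_hi = np.percentile(boot_mean, [2.5, 97.5])
wins_lo, wins_hi = np.percentile(boot_wins, [2.5, 97.5])
print("Plain mean 95% CI width:     ", mean_hi - mean_lo)
print("Winsorized mean 95% CI width:", wins_hi - wins_lo)
```

Winsorizing changes what is being estimated (it is no longer the raw mean), so apply the same capping rule to both groups when comparing them.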

Bayesian A/B

A related approach:

Compute a posterior distribution for the metric and report the probability that B beats A.

# Simplified Beta-Binomial model
# conversions_c, users_c, conversions_t, users_t are the observed counts per group
from scipy.stats import beta

# Prior: Beta(1, 1) = uniform
alpha_c, beta_c = 1 + conversions_c, 1 + (users_c - conversions_c)
alpha_t, beta_t = 1 + conversions_t, 1 + (users_t - conversions_t)

# Sample from posteriors
samples_c = beta.rvs(alpha_c, beta_c, size=10000)
samples_t = beta.rvs(alpha_t, beta_t, size=10000)

# Probability treatment > control
prob = np.mean(samples_t > samples_c)

This is not bootstrap, but the idea is similar: draw samples (here, from the posterior) and summarize them.

Use in industry

  • Airbnb, Uber: bootstrap for booking metrics
  • Netflix: complex metrics via bootstrap
  • Microsoft ExP: combines t-test and bootstrap

Modern A/B platforms usually support bootstrap internally.

Performance tricks

NumPy vectorization

Avoid Python for-loops. Draw the resample indices in bulk and operate on NumPy arrays.
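
A minimal sketch of a vectorized bootstrap of the mean. The function name and the `batch_size` parameter are assumptions for illustration; batching keeps the index matrix from growing to `n_iter * len(data)` elements at once.

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_mean_vectorized(data, n_iter=10000, batch_size=1000):
    """Bootstrap means without a per-iteration Python loop.

    Indices are drawn in batches, so peak memory is about
    batch_size * len(data) elements.
    """
    data = np.asarray(data)
    n = len(data)
    means = np.empty(n_iter)
    for start in range(0, n_iter, batch_size):
        stop = min(start + batch_size, n_iter)
        # One row of indices per bootstrap sample in this batch
        idx = rng.integers(0, n, size=(stop - start, n))
        means[start:stop] = data[idx].mean(axis=1)
    return means

data = rng.exponential(scale=10, size=1000)  # synthetic data
boot_means = bootstrap_mean_vectorized(data)
ci = np.percentile(boot_means, [2.5, 97.5])
print(ci)
```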

Subsample

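For very large data, subsample first and bootstrap afterwards. Keep in mind that the resulting interval reflects the subsample size, so it will be wider than one based on the full data.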

Parallel

Use multiprocessing to parallelize the iterations.

JAX / numba

Compile the resampling loop for a speedup.

Connection to traditional methods

Bootstrap CI ≈ normal-theory CI for normal data (large N).

For non-normal data, bootstrap often gives better coverage.

The t-test is a good default; bootstrap is a safer fallback.
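
A quick way to see the agreement on well-behaved data is to compute both intervals side by side. This is a sketch on synthetic normal data; the mean, standard deviation, sample size and seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.normal(loc=100, scale=15, size=1000)  # synthetic normal data

# Analytical t-based 95% CI for the mean
se = data.std(ddof=1) / np.sqrt(len(data))
t_crit = stats.t.ppf(0.975, df=len(data) - 1)
t_ci = (data.mean() - t_crit * se, data.mean() + t_crit * se)

# Percentile bootstrap 95% CI for the mean (5000 resamples)
idx = rng.integers(0, len(data), size=(5000, len(data)))
boot_means = data[idx].mean(axis=1)
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print("t-based CI:  ", t_ci)
print("bootstrap CI:", boot_ci)
```

The same side-by-side comparison doubles as a sanity check on real data: a large gap between the two intervals is a hint that the normal approximation is doing poorly.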

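In the interview

"What is bootstrap?" Resample the observed data with replacement to estimate the distribution of a statistic.

"When?" Non-normal data, complex or ratio metrics, and samples too small for the CLT to be trusted (but not tiny).

"Alternatives?" T-test (normal data or large N), Bayesian (priors), permutation test (null hypothesis).

"Is bootstrap always better?" No. It is slower and needs roughly N = 100 or more.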

Common mistakes

Bootstrap on tiny N

It does not work well.

Ignoring assumptions

You still need independent observations and representative data.

Applying it blindly

Check whether bootstrap fits the problem. It is not magic.

No verification

Cross-check against another method (for example, a t-test) as a sanity check.

FAQ

How many iterations?

10,000 is usually enough.
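
One way to check this for your own metric is to rerun the whole bootstrap several times and see how much the CI bound itself moves. The sketch below does that on synthetic data; the helper name `ci_lower_bound`, the 20 reruns and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.exponential(scale=10, size=500)  # synthetic data

def ci_lower_bound(data, n_iter):
    """Lower end of the 95% percentile bootstrap CI for the mean."""
    idx = rng.integers(0, len(data), size=(n_iter, len(data)))
    return np.percentile(data[idx].mean(axis=1), 2.5)

# Rerun the whole bootstrap 20 times per setting and measure how much
# the CI bound moves purely because of Monte Carlo noise
spread = {}
for n_iter in (100, 1000, 10000):
    bounds = [ci_lower_bound(data, n_iter) for _ in range(20)]
    spread[n_iter] = np.std(bounds)
    print(f"n_iter={n_iter:>6}: std of lower bound = {spread[n_iter]:.4f}")
```

When the spread at the chosen `n_iter` is small relative to the width of the interval, adding more iterations buys little.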

Does it work for medians?

Yes. Bootstrap estimates the sampling distribution of the median.

Bayesian vs Bootstrap?

Different frameworks. Bayesian: priors plus posterior. Bootstrap: frequentist resampling. They are complementary.

