Statistics cheat sheet for analysts

Why this matters

Statistics is the foundation of analytics, and interviewers regularly ask about the basics. This cheat sheet is a quick refresher before an interview.

Descriptive statistics

Central tendency

  • Mean — arithmetic average
  • Median — middle value
  • Mode — most frequent

When to use:

  • Mean: roughly symmetric data without heavy outliers
  • Median: skewed data or data with outliers
  • Mode: categorical data
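
A minimal sketch of why the median is preferred for skewed data. The order values below are made up for illustration; a single large outlier is enough to pull the mean away from the typical value.

```python
import statistics

# Hypothetical order values with one large outlier (right-skewed data).
orders = [10, 12, 11, 13, 12, 500]

mean_value = statistics.mean(orders)      # pulled up by the outlier
median_value = statistics.median(orders)  # stays near the typical order
mode_value = statistics.mode(orders)      # most frequent value

print(mean_value, median_value, mode_value)
```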

Spread

  • Variance (σ²) — average squared deviation from mean
  • Standard deviation (σ) — √variance
  • IQR — Q3 - Q1
  • Range — max - min
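
The spread measures above can be computed with the standard library. This is a sketch on an illustrative sample; note that `pvariance` divides by N (matching the "average squared deviation" definition), while `variance` divides by N - 1 and is the usual choice for a sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample, mean = 5

pop_variance = statistics.pvariance(data)    # divide by N
pop_std = statistics.pstdev(data)            # square root of pop_variance
sample_variance = statistics.variance(data)  # divide by N - 1

# Quartile cut points; the exact values depend on the interpolation method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
value_range = max(data) - min(data)

print(pop_variance, pop_std, sample_variance, iqr, value_range)
```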

Shape

  • Skewness — asymmetry
  • Kurtosis — tailedness

Distributions

Normal

Bell curve. Parameters: μ (mean), σ (std).

68-95-99.7 rule: about 68%, 95%, and 99.7% of values fall within 1σ, 2σ, and 3σ of the mean.
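
The rule can be checked directly from the normal CDF. A small sketch using the standard library; `share_within` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability mass within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```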

Bernoulli

One binary trial. P(1) = p.

Binomial

Number of successes in N independent Bernoulli trials with the same p. Mean = Np, variance = Np(1 - p).

Poisson

Number of events in a fixed interval, occurring at rate λ. Mean = variance = λ.
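
The Binomial and Poisson moments above can be verified numerically from the probability mass functions. This is a sketch with arbitrary parameters (N = 20, p = 0.3, λ = 4); the Poisson sums are truncated at k = 100, where the tail is negligible for this λ.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 20, 0.3
binom_mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

lam = 4.0
ks = range(100)  # truncated support; assumption: tail mass is negligible
pois_mean = sum(k * poisson_pmf(k, lam) for k in ks)
pois_var = sum((k - pois_mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(binom_mean, pois_mean, pois_var)
```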

Exponential

Time between events. Mean = 1/λ.

Log-normal

log(X) is normally distributed. Right-skewed. Typical for income, revenue, and durations.

Uniform

All values equally likely.

Probability basics

Conditional

P(A|B) = P(A and B) / P(B)

Bayes

P(A|B) = P(B|A) × P(A) / P(B)

Independence

P(A and B) = P(A) × P(B)
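
The three formulas above can be checked by enumerating two dice. In this sketch the events are chosen for illustration: A = "first die is even", B = "sum is 7". They happen to be independent.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    """Exact probability of an event over the 36 outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_even(o):
    return o[0] % 2 == 0

def sum_is_seven(o):
    return o[0] + o[1] == 7

p_a = prob(first_even)
p_b = prob(sum_is_seven)
p_ab = prob(lambda o: first_even(o) and sum_is_seven(o))

p_b_given_a = p_ab / p_a               # conditional probability
p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' rule
is_independent = p_ab == p_a * p_b

print(p_b_given_a, p_a_given_b, is_independent)
```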

Sampling

Sample

Subset of population.

Population

All individuals of interest.

Sampling bias

The sample is not representative of the population.

IID

Independent and identically distributed.

Theorems

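Central Limit Theorem (CLT)

The distribution of the sample mean approaches normal as N → ∞.

This holds regardless of the original distribution, as long as observations are IID with finite variance.

Law of large numbers

The sample mean converges to the true mean as N → ∞.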
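
Both theorems can be illustrated with a quick simulation. This sketch draws from an Exponential(1) distribution (strongly skewed, true mean 1, true std 1); the seed and sample sizes are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)

def sample_mean(n):
    # Mean of n draws from Exponential(1).
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

# CLT: means of many samples of size n cluster around 1
# with spread close to sigma / sqrt(n).
n = 50
means = [sample_mean(n) for _ in range(2000)]
center = statistics.fmean(means)
spread = statistics.stdev(means)
expected_spread = 1 / n**0.5

# Law of large numbers: one very large sample has a mean close to 1.
big_sample_mean = sample_mean(100_000)

print(center, spread, expected_spread, big_sample_mean)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying data are far from normal.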

Estimation

Point estimate

Single value (mean, proportion).

Confidence interval

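Range that would cover the true parameter in a stated share of repeated samples (e.g. 95%).

CI_95 = estimate ± 1.96 × SE

SE = standard error of the estimate (for a mean: s / √N). The 1.96 multiplier assumes an approximately normal estimate.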
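
A minimal sketch of the formula for a sample mean. The data are simulated (an assumption for illustration), and the sample is large enough for the 1.96 multiplier; for small samples a t-quantile would be more appropriate.

```python
import math
import random
import statistics

rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(400)]  # simulated metric values

estimate = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

ci_low = estimate - 1.96 * se
ci_high = estimate + 1.96 * se

print(round(ci_low, 2), round(estimate, 2), round(ci_high, 2))
```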

Hypothesis testing

Steps

  1. Formulate H0 / H1
  2. Choose α
  3. Compute test statistic
  4. Compare to the critical value or compute the p-value
  5. Make the decision: reject or fail to reject H0

P-value

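Probability of observing data at least as extreme as the data actually seen, assuming H0 is true. It is not the probability that H0 is true.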

p < α: reject H0.
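
For a test statistic that is standard normal under H0, the two-sided p-value follows from the normal CDF. A small sketch; `two_sided_p_value` is a helper name introduced here.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
p_borderline = two_sided_p_value(1.96)  # right at the usual 5% threshold
p_strong = two_sided_p_value(3.0)
reject_h0 = p_strong < alpha

print(round(p_borderline, 4), round(p_strong, 4), reject_h0)
```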

Type I error

False positive. Reject true H0. Rate = α.

Type II error

False negative. Failing to reject a false H0. Rate = β.

Power

1 - β. Probability of detecting a true effect.
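
Power can be approximated for a two-sided comparison of two means. This sketch assumes a normal approximation, equal group sizes, and a standardized effect size d (difference in means divided by the common std); `approx_power` is a hypothetical helper, not a library function.

```python
from statistics import NormalDist

def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the alternative.
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

power_medium = approx_power(0.5, 64)  # medium effect, 64 users per group
power_null = approx_power(0.0, 64)    # no effect: rejection rate equals alpha

print(round(power_medium, 3), round(power_null, 3))
```

With no true effect the function returns α, which ties power back to the Type I error rate.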

Common tests

T-test

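Compare the means of two groups. Assumes approximately normal data or a large N.

from scipy.stats import ttest_ind
# a, b: 1-D arrays of observations, one per independent group
t, p = ttest_ind(a, b)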

Z-test

Compare means (known σ or large N).

Chi-square

Tests independence of two categorical variables.

from scipy.stats import chi2_contingency
# table: 2-D array of observed counts (contingency table)
chi2, p, dof, exp = chi2_contingency(table)

ANOVA

Compare the means of more than two groups.

from scipy.stats import f_oneway
# a, b, c: 1-D arrays of observations, one per group
f, p = f_oneway(a, b, c)

Mann-Whitney

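Non-parametric alternative to the independent t-test; compares two groups by ranks.

Kolmogorov-Smirnov

Compare a sample with a reference distribution, or two samples with each other.

Fisher's exact

Exact test for a 2x2 table; use when counts are too small for chi-square.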

Correlation

Pearson

Linear correlation. Range: [-1, 1].

Spearman

Rank correlation. Robust to outliers.

Kendall

Another rank correlation, based on concordant and discordant pairs.
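
The difference between Pearson and Spearman shows up on a monotonic but non-linear relationship. This is a hand-rolled sketch for illustration (in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` do this); the `ranks` helper assumes there are no ties.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    # Ranks 1..N in ascending order; assumption: no tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    # Spearman is Pearson computed on the ranks.
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v**3 for v in x]  # monotonic, but not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```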

Regression

Linear

Y = β0 + β1×X + ε

Coefficient β1: the expected change in Y per unit change in X, other things equal.
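
For one predictor, least squares has a closed form: β1 = cov(X, Y) / var(X) and β0 = mean(Y) - β1 × mean(X). A sketch on synthetic data generated as Y = 2 + 3X plus small noise (the data and the `fit_line` helper are illustrative).

```python
def fit_line(x, y):
    """Ordinary least squares for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    beta0 = my - beta1 * mx
    return beta0, beta1

x = [0, 1, 2, 3, 4]
noise = [0.1, -0.2, 0.0, 0.2, -0.1]  # sums to zero and is uncorrelated with x
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]

beta0, beta1 = fit_line(x, y)
print(round(beta0, 6), round(beta1, 6))
```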

Logistic

Binary Y. Predict probability.

Multiple regression

Multiple X's.

Assumptions

  • Linearity of the relationship
  • Independence of errors
  • Homoscedasticity (constant error variance)
  • Normally distributed residuals

Effect size

Cohen's d

d = (mean1 - mean2) / pooled_std
  • 0.2: small, 0.5: medium, 0.8: large
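
A sketch of the formula, using the usual pooled standard deviation weighted by degrees of freedom. The two groups are made-up numbers.

```python
import math
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # Pooled variance: sample variances weighted by degrees of freedom.
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

group_a = [2, 4, 6, 8, 10]
group_b = [1, 3, 5, 7, 9]

d = cohens_d(group_a, group_b)
print(round(d, 3))  # between "small" and "medium" on the scale above
```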

Eta squared (η²)

The ANOVA equivalent of R²: the share of variance explained by the grouping factor.

Odds ratio

Effect size for logistic regression: odds ratio = exp(β).

Common mistakes

Correlation = causation

Never assume that correlation implies causation. Establishing causality needs an experiment.

P-hacking

Running many tests and reporting only the significant ones.
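
Why this is a problem: with many independent tests at level α, the chance of at least one false positive grows quickly. A short sketch, assuming the tests are independent and all null hypotheses are true.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests under H0."""
    return 1 - (1 - alpha) ** n_tests

fwer_1 = familywise_error_rate(0.05, 1)
fwer_20 = familywise_error_rate(0.05, 20)

print(round(fwer_1, 4), round(fwer_20, 4))
```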

Confounding

A hidden variable influences both X and Y.

Survivorship bias

Ignoring the data that dropped out of the sample.

Selection bias

Non-random sample.

Base rate fallacy

Ignoring the background frequency.

Small sample

High variance; tiny effects can seem significant.

Probability puzzles

Coin flip

P(heads 3 times in a row) = (1/2)³ = 1/8

Dice

P(sum 7 with 2 dice) = 6/36 = 1/6
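
Both answers above can be confirmed by brute-force enumeration of equally likely outcomes. A sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product

# Coin: all 2^3 sequences of three flips.
flips = list(product("HT", repeat=3))
p_three_heads = Fraction(sum(1 for f in flips if f == ("H", "H", "H")), len(flips))

# Dice: all 36 ordered pairs.
rolls = list(product(range(1, 7), repeat=2))
p_sum_seven = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))

print(p_three_heads, p_sum_seven)
```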

Birthday paradox

In a group of 23 people, the probability that at least two share a birthday is over 50%.
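
The number comes from the complement: 1 minus the probability that all birthdays are distinct. A sketch assuming 365 equally likely birthdays and no leap years.

```python
def p_shared_birthday(n_people, days=365):
    """P(at least two of n_people share a birthday)."""
    p_all_distinct = 1.0
    for i in range(n_people):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

p_22 = p_shared_birthday(22)
p_23 = p_shared_birthday(23)
print(round(p_22, 4), round(p_23, 4))
```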

Monty Hall

Switching doors wins with probability 2/3; staying wins with 1/3.
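
If the 2/3 is hard to believe, a simulation helps. This sketch assumes the standard rules: the host always opens a door that hides no car and is not the player's pick. The seed and trial count are arbitrary.

```python
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(7)
trials = 20_000
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials

print(round(switch_rate, 3), round(stay_rate, 3))
```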

Bayes classic

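1% disease prevalence, 99% test accuracy (both sensitivity and specificity). Positive test → P(disease) = ?

Using Bayes: ~50% (not 99%!). True positives (0.99 × 1%) are as frequent as false positives (1% × 99%).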
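
The calculation, as a sketch. It assumes "99% accuracy" means 99% sensitivity and 99% specificity; `posterior` is a helper name introduced here.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

p_disease = posterior(prior=0.01, sensitivity=0.99, specificity=0.99)
print(round(p_disease, 3))
```

Raising the prior (for example, testing only people with symptoms) raises the posterior sharply, which is the point of the base rate.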

A/B statistics

Conversion z-test

from statsmodels.stats.proportion import proportions_ztest
# x_a, x_b: conversions; n_a, n_b: users per variant
z, p = proportions_ztest([x_a, x_b], [n_a, n_b])
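
For intuition, here is a standard-library sketch of what a pooled two-proportion z-test computes. The function name and the conversion counts are illustrative; this is a hand-written version, not the statsmodels implementation.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test for equal proportions with a pooled variance estimate."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10.0% vs 13.0% conversion, 1000 users each.
z_stat, p_value = two_proportion_ztest(100, 1000, 130, 1000)
print(round(z_stat, 2), round(p_value, 4))
```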

CUPED

Variance reduction using a pre-experiment covariate.
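
A minimal sketch of the commonly described CUPED adjustment: Y_cuped = Y - θ × (X - mean(X)), with θ = cov(X, Y) / var(X), where X is the same metric measured before the experiment. The data are simulated, and all variable names are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Simulated users: x = pre-experiment metric, y = in-experiment metric.
x = [rng.gauss(100, 20) for _ in range(5000)]
y = [0.8 * xi + rng.gauss(0, 10) for xi in x]

def covariance(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

theta = covariance(x, y) / statistics.variance(x)
x_mean = statistics.fmean(x)
y_cuped = [yi - theta * (xi - x_mean) for xi, yi in zip(x, y)]

var_before = statistics.variance(y)
var_after = statistics.variance(y_cuped)
print(round(var_before, 1), round(var_after, 1))
```

The adjusted metric keeps the same mean but has lower variance, so the same experiment detects smaller effects.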

Sequential

Methods for checking results before the planned end of a test: alpha spending, mSPRT.

Python packages

  • scipy.stats — comprehensive
  • statsmodels — regression, testing
  • numpy — math
  • seaborn — viz
  • pingouin — alternative stats

Interview FAQ

t-test vs ANOVA?

t — 2 groups. ANOVA — 3+.

Paired vs independent t-test?

Paired — same individuals (before/after). Independent — different.

When to use non-parametric tests?

Non-normal data, small samples, ranked data.

Why 80% power?

A convention: a balance between sample size and the risk of missing a real effect.

Is a correlation of 0.9 strong?

Depends on the domain. Physics: moderate. Psychology: strong.

FAQ

Statistics books?

  • «OpenIntro Statistics» (free online)
  • «All of Statistics» — Wasserman

Courses?

  • Harvard Stat110 (YouTube free)
  • Coursera Andrew Ng

Practice?

StrataScratch, DataLemur — statistics problems.

