Statistics cheat sheet for analysts
Why you need to know this
Statistics is the foundation of analytics, and interviews regularly cover the basics. This cheat sheet is a quick refresher before an interview.
Descriptive statistics
Central tendency
- Mean — arithmetic average
- Median — middle value
- Mode — most frequent
When to use:
- Mean: roughly symmetric data without strong outliers (e.g., normal)
- Median: skewed data or data with outliers
- Mode: categorical data
Spread
- Variance (σ²) — average squared deviation from mean
- Standard deviation (σ) — √variance
- IQR — Q3 - Q1
- Range — max - min
Shape
- Skewness — asymmetry
- Kurtosis — tailedness
Distributions
Normal
Bell curve. Parameters: μ (mean), σ (std).
68-95-99.7 rule: about 68%, 95%, and 99.7% of values fall within 1σ, 2σ, and 3σ of the mean.
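The 68-95-99.7 rule can be checked directly from the normal CDF. A small sketch using `statistics.NormalDist` from the standard library; the variable name `coverage` is just illustrative.

```python
from statistics import NormalDist

# Standard normal: mu = 0, sigma = 1
z = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")
```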
Bernoulli
One binary trial. P(1) = p.
Binomial
Sum of N independent Bernoulli(p) trials. Mean = Np, variance = Np(1 - p).
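To make Mean = Np concrete, here is a short sketch that builds the binomial probability mass function by hand and checks its mean. The helper `binomial_pmf` and the values N = 10, p = 0.3 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                   # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))     # should equal n * p

print(f"sum of pmf = {total:.6f}, mean = {mean:.6f}, n*p = {n * p}")
```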
Poisson
Count events in interval. Mean = variance = λ.
Exponential
Time between events. Mean = 1/λ.
Log-normal
log(X) is normally distributed. Right-skewed. Typical for income, revenue, and durations.
Uniform
All values equally likely.
Probability basics
Conditional
P(A|B) = P(A and B) / P(B)
Bayes
P(A|B) = P(B|A) × P(A) / P(B)
Independence
P(A and B) = P(A) × P(B)
Sampling
Sample
Subset of population.
Population
All individuals of interest.
Sampling bias
The sample is not representative of the population.
IID
Independent and identically distributed.
Theorems
Central Limit Theorem (CLT)
The distribution of the sample mean approaches normal as N → ∞, regardless of the shape of the original distribution (given finite variance).
Law of large numbers
Sample mean → true mean as N → ∞.
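Both theorems are easy to see in a simulation. The sketch below draws repeated samples from a strongly skewed exponential distribution (true mean 1, true std 1) and looks at the distribution of sample means. The sample size, number of repetitions, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)   # fixed seed so the run is reproducible

n = 50           # observations per sample
reps = 2000      # number of repeated samples

# Each entry is the mean of one sample from Exponential(lambda=1)
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # LLN: close to the true mean 1
std_of_means = statistics.stdev(sample_means)    # CLT: close to sigma / sqrt(n)
theoretical_se = 1.0 / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"std of sample means:  {std_of_means:.3f} (theory: {theoretical_se:.3f})")
```

A histogram of `sample_means` would look close to a bell curve even though the underlying data are far from normal.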
Estimation
Point estimate
Single value (mean, proportion).
Confidence interval
A range built so that, over repeated samples, it contains the true parameter with a stated probability (e.g., 95%).
CI_95 = estimate ± 1.96 × SE
SE = standard error.
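The formula can be applied to a sample mean as follows. This is a sketch on hypothetical measurements; it uses the normal quantile 1.96 as in the formula, while for a sample this small a t-distribution quantile would give a slightly wider interval.

```python
import math
import statistics

# Hypothetical measurements
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0]

estimate = statistics.fmean(data)                      # point estimate
se = statistics.stdev(data) / math.sqrt(len(data))     # standard error of the mean

ci_low = estimate - 1.96 * se
ci_high = estimate + 1.96 * se

print(f"mean = {estimate:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```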
Hypothesis testing
Steps
- Formulate H0 / H1
- Choose α
- Compute test statistic
- Compare to the critical value or compute the p-value
- Decision
P-value
Probability of observing data at least as extreme as the actual data, if H0 is true.
p < α: reject H0.
Type I error
False positive. Reject true H0. Rate = α.
Type II error
False negative. Fail to reject a false H0. Rate = β.
Power
1 - β. Probability of detecting a true effect.
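The claim that the Type I error rate equals α can be checked by simulation: generate data where H0 is true, run the test many times, and count rejections. The sketch below uses a two-sided z-test with known σ = 1; the helper `z_test_p_value`, the sample size, and the seed are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)
alpha = 0.05
n = 30          # observations per experiment
sims = 2000     # number of simulated experiments
std_normal = NormalDist()

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - std_normal.cdf(abs(z)))

rejections = 0
for _ in range(sims):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # H0 is true here
    if z_test_p_value(sample) < alpha:
        rejections += 1                                  # a false positive

false_positive_rate = rejections / sims
print(f"false positive rate: {false_positive_rate:.3f} (alpha = {alpha})")
```

Shifting the true mean away from 0 in the same loop turns the rejection rate into an estimate of power (1 - β).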
Common tests
T-test
Compare the means of two groups. Assumes normally distributed data or large N.
from scipy.stats import ttest_ind
t, p = ttest_ind(a, b)
Z-test
Compare means (known σ or large N).
Chi-square
Test of independence between categorical variables.
from scipy.stats import chi2_contingency
chi2, p, dof, exp = chi2_contingency(table)
ANOVA
Compare > 2 means.
from scipy.stats import f_oneway
f, p = f_oneway(a, b, c)
Mann-Whitney
Non-parametric alternative to the independent t-test (works on ranks, not means).
Kolmogorov-Smirnov
Compare distributions.
Fisher's exact
Small sample 2x2 table.
Correlation
Pearson
Linear correlation. [-1, 1].
Spearman
Rank correlation. Robust to outliers.
Kendall
Another rank correlation.
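The difference between Pearson and Spearman shows up on a relationship that is monotone but not linear. The sketch below implements both by hand; the helpers `pearson`, `ranks`, and `spearman` are illustrative, and `ranks` assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]       # monotone, but clearly non-linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

Spearman reaches 1 because the ordering is perfectly preserved, while Pearson stays below 1 because the points do not lie on a straight line.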
Regression
Linear
Y = β0 + β1×X + ε
Coefficient: the effect of X on Y, other things being equal.
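For a single predictor, the least-squares estimates have a closed form: β1 = cov(X, Y) / var(X) and β0 = mean(Y) - β1 × mean(X). A minimal sketch on hypothetical data that roughly follow Y = 1 + 2X; the function name `fit_line` is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx            # slope: change in y per unit of x
    b0 = my - b1 * mx         # intercept
    return b0, b1

# Hypothetical observations
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```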
Logistic
Binary Y. Predict probability.
Multiple regression
Multiple X's.
Assumptions
- Linearity
- Independence
- Homoscedasticity
- Normal residuals
Effect size
Cohen's d
d = (mean1 - mean2) / pooled_std
- 0.2: small, 0.5: medium, 0.8: large
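A sketch of the calculation, assuming the usual pooled standard deviation that weights each group's sample variance by its degrees of freedom. The function name `cohens_d` and the two groups are hypothetical.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_std = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_std

# Hypothetical groups
group_a = [2, 4, 6, 8]
group_b = [1, 3, 5, 7]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

Here the means differ by 1 and the pooled standard deviation is about 2.58, so d is about 0.39, between small and medium on the scale above.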
Eta squared (η²)
The ANOVA analogue of R²: the share of variance explained by the group factor.
Odds ratio
Ratio of the odds of an outcome in two groups. In logistic regression it equals exp(coefficient).
Common mistakes
Correlation = causation
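Never infer causation from correlation alone. Establishing it usually requires an experiment.
P-hacking
Running many tests and reporting only the significant ones.
Confounding
A hidden variable influences both the presumed cause and the outcome.
Survivorship bias
Analyzing only the cases that survived and ignoring those that dropped out.
Selection bias
The sample was not drawn at random.
Base rate fallacy
Ignoring the background frequency of an event when interpreting evidence.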
Small sample
High variance; noise can make tiny effects look significant.
Probability puzzles
Coin flip
P(heads 3 times in a row) = (1/2)³ = 1/8
Dice
P(sum 7 with 2 dice) = 6/36 = 1/6
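Both answers can be verified by brute-force enumeration with exact fractions. A short sketch using only the standard library:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
favorable = [roll for roll in outcomes if sum(roll) == 7]
p_sum_7 = Fraction(len(favorable), len(outcomes))

# Three independent fair coin flips, all heads
p_three_heads = Fraction(1, 2) ** 3

print(f"P(sum = 7) = {p_sum_7}, P(three heads) = {p_three_heads}")
```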
Birthday paradox
In a group of 23 people, the probability that at least two share a birthday exceeds 50%.
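The standard way to get this number is to compute the probability that all birthdays are distinct and subtract it from 1. A sketch assuming 365 equally likely birthdays and no leap years; the function name is illustrative.

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), uniform birthdays assumed."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

p22 = p_shared_birthday(22)
p23 = p_shared_birthday(23)
print(f"n = 22: {p22:.3f}, n = 23: {p23:.3f}")
```

With 22 people the probability is still below one half; 23 is the smallest group size that crosses it.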
Monty Hall
Switching doors wins with probability 2/3; staying wins with probability 1/3.
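If the 2/3 answer feels wrong, a simulation is a quick sanity check. The sketch below assumes the classic rules: the host always opens a door that hides a goat and is not the player's pick. The function `play`, the number of trials, and the seed are arbitrary choices.

```python
import random

random.seed(42)

def play(switch):
    """Play one round; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a door that is neither the player's pick nor the car
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 10_000
switch_wins = sum(play(True) for _ in range(trials)) / trials
stay_wins = sum(play(False) for _ in range(trials)) / trials

print(f"switch: {switch_wins:.3f}, stay: {stay_wins:.3f}")
```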
Bayes classic
Disease prevalence is 1%; the test is 99% accurate (both sensitivity and specificity). Given a positive test, P(disease) = ?
Using Bayes: 50% (not 99%!)
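The arithmetic behind this answer, as a short sketch. It assumes that "99% accuracy" means both sensitivity and specificity are 0.99; the variable names are illustrative.

```python
prior = 0.01          # P(disease): 1% prevalence
sensitivity = 0.99    # P(positive | disease)
specificity = 0.99    # P(negative | no disease), assumed equal to sensitivity

# Total probability of a positive test: true positives + false positives
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / p_positive

print(f"P(disease | positive) = {posterior:.2f}")
```

True positives (0.99 × 0.01) and false positives (0.01 × 0.99) are equally common, which is why a positive result is only a coin flip.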
A/B statistics
Conversion z-test (two proportions, normal approximation)
from statsmodels.stats.proportion import proportions_ztest
z, p = proportions_ztest([x_a, x_b], [n_a, n_b])
CUPED
Variance reduction using pre-experiment data as a covariate.
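A rough sketch of the core CUPED adjustment on simulated data: subtract θ × (X - mean(X)) from the metric, where X is the same user's pre-experiment value and θ = cov(X, Y) / var(X). The data-generating numbers and the seed below are made up for illustration, and a real pipeline would apply this per experiment group.

```python
import random
import statistics

random.seed(7)
n = 1000

# Hypothetical users: pre-period metric and a correlated in-experiment metric
pre = [random.gauss(100, 20) for _ in range(n)]
post = [0.8 * x + random.gauss(0, 10) for x in pre]

mean_pre = statistics.fmean(pre)
mean_post = statistics.fmean(post)

# theta = cov(pre, post) / var(pre), i.e. the regression slope of post on pre
cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)) / (n - 1)
theta = cov / statistics.variance(pre)

# CUPED-adjusted metric: same mean, lower variance
post_cuped = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]

var_raw = statistics.variance(post)
var_cuped = statistics.variance(post_cuped)
print(f"variance before: {var_raw:.1f}, after CUPED: {var_cuped:.1f}")
```

The adjustment leaves the mean unchanged and shrinks the variance by a factor of 1 - ρ², where ρ is the correlation between the pre-period and in-experiment metric.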
Sequential
Alpha spending, mSPRT.
Python packages
- scipy.stats — comprehensive
- statsmodels — regression, testing
- numpy — math
- seaborn — viz
- pingouin — alternative stats
Interview FAQ
t-test vs ANOVA?
t — 2 groups. ANOVA — 3+.
Paired vs independent t-test?
Paired — same individuals (before/after). Independent — different.
When to use non-parametric tests?
Non-normal data, small samples, ranked data.
Why 80% power?
Convention. It balances the risk of missing a real effect against the sample size required.
Is a correlation of 0.9 strong?
Depends on the domain. Physics: moderate. Psychology: strong.
FAQ
Statistics books?
- «OpenIntro Statistics» (free online)
- «All of Statistics» — Wasserman
Courses?
- Harvard Stat110 (YouTube free)
- Coursera Andrew Ng
Practice?
StrataScratch, DataLemur — statistics problems.