Random Forest for Analysts
Карьерник: a Telegram quiz trainer with 1,500+ questions for analyst interviews. SQL, Python, A/B tests, metrics. Free.
Why this matters
Random Forest is an ML workhorse. Less fancy than XGBoost, but often just as accurate with far less tuning. Robust, and interpretable at the feature level. An excellent baseline for tabular data.
In ML interviews for analysts, Random Forest is a must-know.
What it is
An ensemble of many decision trees:
- Build N trees
- Each tree is trained on a random sample of the data (bagging)
- Each split considers a random subset of features (feature bagging)
- Predictions = majority vote (classification) or average (regression)
The core idea
Individual trees overfit. Averaging many of them reduces variance, and overfitting along with it.
"Wisdom of the crowd": many noisy predictions → a more accurate average.
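A minimal sketch of this idea on synthetic data (dataset sizes, seeds, and tree counts below are illustrative, not from the original): one unpruned tree vs. a forest built from the same kind of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic tabular data (illustrative parameters)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One unpruned tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 200 bagged trees: averaging cancels much of that variance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(f"single tree accuracy: {tree.score(X_test, y_test):.3f}")
print(f"forest accuracy:      {forest.score(X_test, y_test):.3f}")
```

On held-out data the forest typically beats the single tree, without any tuning.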
Bagging vs Boosting
Bagging (Random Forest)
- Trees are built in parallel, independently
- Each tree sees a random sample of the data
- Reduces variance
Boosting (XGBoost)
- Trees are built sequentially, each depending on the previous
- Each tree corrects the errors of the previous ones
- Reduces bias
Different strategies; both work well.
In Python

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Hyperparameters
n_estimators
Number of trees. More is better (with diminishing returns). 100-500 is typical.
max_depth
Depth of each tree. None = unlimited (overfitting risk). 10-20 is typical.
min_samples_split
Minimum samples required to split a node. Higher means more regularization.
min_samples_leaf
Minimum samples in a leaf. Higher means more regularization.
max_features
Number of random features considered at each split. Default is sqrt(n) for classification.
bootstrap
True by default: samples are drawn with replacement.
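These hyperparameters are usually tuned together rather than one at a time. A sketch with RandomizedSearchCV (the grid values and sizes here are illustrative choices, not recommendations from the original):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# Illustrative search space over the hyperparameters above
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,          # sample 10 combinations instead of the full grid
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Random search over a modest grid is usually enough for RF; it is far less tuning-sensitive than boosting.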
Advantages
- Robust: handles noisy data.
- Interpretable: feature importance.
- Missing values: fine once imputed.
- Non-linear: captures interactions.
- Built-in validation: out-of-bag (OOB) score.
- Less tuning needed than XGBoost.
Disadvantages
- Larger memory footprint: stores N trees.
- Slower prediction than a single tree.
- Typically less accurate than a tuned XGBoost.
- Extrapolation: poor at values outside the training range.
OOB score
Each tree is trained on roughly 63% of the data (sampling with replacement). The remaining ~37% is out-of-bag.
Use the OOB samples for validation without a separate test set.

```python
model = RandomForestClassifier(oob_score=True)
model.fit(X, y)
print(model.oob_score_)
```

Quick validation.
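The OOB estimate usually tracks a proper cross-validation score closely. A sketch on synthetic data (parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# OOB: each sample is scored only by trees that never saw it
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)

cv_mean = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()

print(f"OOB score: {model.oob_score_:.3f}")
print(f"5-fold CV: {cv_mean:.3f}")
```

Handy when data is scarce and you do not want to sacrifice a holdout set.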
Feature importance
model.feature_importances_ gives Gini-based impurity reduction. Same caveats as gain-based tree importance.
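A sketch of ranking features by this built-in importance (the dataset is an illustrative sklearn example):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Impurity-based importances sum to 1; biased toward high-cardinality features
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())
```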
SHAP gives better interpretability:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
```

RF vs XGBoost
| | Random Forest | XGBoost |
|---|---|---|
| Training | Parallel | Sequential |
| Training speed | Faster | Slower |
| Accuracy | Good | Usually better |
| Tuning needed | Less | More |
| Interpretability | Similar | Similar |
| Overfitting risk | Lower | Higher (without tuning) |
Both are widely used.
Use cases
Churn prediction
Robust, a good baseline.
Fraud detection
Handles noise and class imbalance.
Credit scoring
Interpretable, good for regulatory requirements.
Feature selection
Use RF feature importance to pick features for downstream models.
Quick baseline
Before trying more complex models.
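A baseline can be a one-liner. A sketch with default settings and cross-validation (synthetic data, illustrative sizes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Out-of-the-box RF: no tuning, just a reference score
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"baseline accuracy: {baseline.mean():.3f} +/- {baseline.std():.3f}")
```

Any fancier model then has a number to beat.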
Classification + regression
Classifier:

```python
from sklearn.ensemble import RandomForestClassifier
```

Regressor:

```python
from sklearn.ensemble import RandomForestRegressor
```

Same API, different target type.
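A sketch of the regressor in use, to show the API really is the same (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Identical fit/predict/score interface; predictions are leaf averages
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(X_train, y_train)
print(f"R^2 on test: {reg.score(X_test, y_test):.3f}")
```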
Class imbalance
class_weight='balanced' weights classes automatically:

```python
model = RandomForestClassifier(class_weight='balanced')
```

Or manual weights:

```python
class_weight = {0: 1, 1: 10}  # minority class gets 10x weight
```

Extrapolation problem
RF cannot predict beyond the training range.
Training: Y ranges from 10 to 100. Predict on an X where Y should be 150, and RF predicts near 100.
Linear regression and neural networks can extrapolate. RF cannot.
For trend continuation, use linear regression or Prophet, not RF.
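The failure is easy to demonstrate. A sketch on a perfectly linear relationship (all numbers below are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train on x in [0, 10] with y = 10x, so y in [0, 100]
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 10 * X_train.ravel()

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

X_new = np.array([[15.0]])  # outside the training range; true y = 150
print(f"RF predicts: {rf.predict(X_new)[0]:.1f}")  # capped near 100
print(f"LR predicts: {lr.predict(X_new)[0]:.1f}")  # extrapolates the trend
```

The forest can only average leaf values it saw during training, so its output is bounded by the training range of y.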
In the interview
"Random Forest vs a decision tree?" RF is an ensemble of many trees and reduces variance.
"Bagging?" Training each tree on a random sample of the data.
"How to interpret feature importance?" It is gain-based; use SHAP for proper attribution.
"When is RF better than XGBoost?" When you need something simpler (less tuning), robust to noise, or a quick baseline.
Common mistakes
Expecting extrapolation
RF does not extrapolate. Don't expect reasonable predictions outside the training range.
Ignoring random_state
Different runs give different results. Set random_state for reproducibility.
Too few trees
n_estimators=10 is too few. Use 100+ as a minimum.
Reading feature importance as causation
Same caveat as for any tree model: it reflects correlation, not causation.
Related topics
- XGBoost vs Random Forest
- Decision trees for analysts
- XGBoost for analysts
- Overfitting in simple terms
- Feature importance
FAQ
Is sklearn OK?
Yes. It is the primary implementation.
Parallelization?
n_jobs=-1 uses all cores.
Memory?
All trees are stored, which can get large. If memory is tight, use fewer or shallower trees.