Random Forest for Analysts

Карьерник is a Telegram quiz trainer with 1500+ analyst interview questions: SQL, Python, A/B tests, metrics. Free.

Why this matters

Random Forest is the workhorse of ML. Less fancy than XGBoost, but often just as accurate with far less tuning. Robust, and interpretable at the feature level. An excellent baseline for tabular data.

In ML interviews for analyst roles, Random Forest is a must-know.

What it is

An ensemble of many decision trees:

  1. Build N trees
  2. Train each tree on a random sample of the data (bagging)
  3. At each split, consider only a random subset of features (feature bagging)
  4. Combine predictions: majority vote (classification) or average (regression)
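The four steps can be sketched by hand with plain decision trees (a toy illustration on synthetic data; in practice, use RandomForestClassifier directly):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):                           # 1. build N trees
    idx = rng.integers(0, len(X), len(X))     # 2. bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(
        max_features="sqrt",                  # 3. random feature subset at each split
        random_state=0,
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 4. majority vote across trees
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((ensemble_pred == y).mean())  # training accuracy of the ensemble
```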

The core idea

Individual trees overfit. Average many of them and variance drops, and overfitting with it.

The "wisdom of the crowd": many inaccurate predictions average into a more accurate one.
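The effect is easy to check numerically: averaging K independent noisy estimates shrinks the spread by roughly sqrt(K). A toy simulation, not RF itself (real trees are correlated, which is exactly why RF decorrelates them with feature bagging):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# 10,000 trials: one noisy "prediction" vs the average of 100 noisy ones
single = true_value + rng.normal(0, 2.0, size=10_000)
crowd = true_value + rng.normal(0, 2.0, size=(10_000, 100)).mean(axis=1)

print(single.std())  # ≈ 2.0
print(crowd.std())   # ≈ 0.2, roughly 2.0 / sqrt(100)
```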

Bagging vs Boosting

Bagging (Random Forest)

  • Trees are built in parallel, independently
  • Each tree sees a random subset of the data
  • Reduces variance

Boosting (XGBoost)

  • Trees are built sequentially, each depending on the previous
  • Each tree corrects the errors of the previous ones
  • Reduces bias

Different strategies; both work.
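A quick way to see both strategies on the same synthetic data; GradientBoostingClassifier stands in for XGBoost to keep the sketch sklearn-only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bagging = RandomForestClassifier(n_estimators=100, random_state=42)       # parallel trees
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)  # sequential trees

bag_score = cross_val_score(bagging, X, y, cv=5).mean()
boost_score = cross_val_score(boosting, X, y, cv=5).mean()
print(bag_score, boost_score)  # both strategies land in the same ballpark here
```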

In Python

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=10,      # cap tree depth to curb overfitting
    random_state=42    # reproducibility
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)

Hyperparameters

n_estimators

Number of trees. More is better, with diminishing returns. 100-500 is typical.

max_depth

Depth of each tree. None = unlimited (overfitting risk). 10-20 is typical.

min_samples_split

Minimum samples required to split a node. Higher → more regularization.

min_samples_leaf

Minimum samples in a leaf. Higher → more regularization.

max_features

Number of features considered at each split. sqrt(n) is the default for classification.

bootstrap

True by default: each tree samples with replacement.
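These knobs are usually tuned together. A minimal RandomizedSearchCV sketch over the parameters above; the ranges and synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=15, random_state=42)

# illustrative ranges, not universal recommendations
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,      # try 10 random combinations
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```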

Advantages

  • Robust: handles noisy data.
  • Interpretable: feature importance out of the box.
  • Missing values: handled once imputed.
  • Non-linear: captures feature interactions.
  • Built-in validation: out-of-bag (OOB) score.
  • Less tuning than XGBoost.

Disadvantages

  • More memory: stores N trees.
  • Slower prediction than a single tree.
  • Typically less accurate than a tuned XGBoost.
  • Extrapolation: fails on values outside the training range.

OOB score

Each tree is trained on ~63% of the data (sampling with replacement). The remaining ~37% are its out-of-bag samples.

Use OOB samples for validation without a separate test set.

model = RandomForestClassifier(oob_score=True)
model.fit(X, y)
print(model.oob_score_)

Quick validation.

Feature importance

model.feature_importances_

Mean decrease in Gini impurity. Same caveats as gain-based tree importance: biased toward high-cardinality features.
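In practice you pair feature_importances_ with column names and sort them; the feature names below are synthetic stand-ins for your DataFrame columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # stand-in for df.columns

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)  # impurity-based; the values sum to 1.0
```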

SHAP gives better interpretability:

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

RF vs XGBoost

                  Random Forest   XGBoost
Training          Parallel        Sequential
Training speed    Faster          Slower
Accuracy          Good            Usually better
Tuning needed     Less            More
Interpretability  Similar         Similar
Overfit risk      Lower           Higher (without tuning)

Both widely used.

Use cases

Churn prediction

Robust, good baseline.

Fraud detection

Handles noise, class imbalance.

Credit scoring

Interpretable, good for regulatory requirements.

Feature selection

Use RF feature importance to pick features for downstream models.

Quick baseline

Before trying complex models.

Classification + regression

Classifier

from sklearn.ensemble import RandomForestClassifier

Regressor

from sklearn.ensemble import RandomForestRegressor

Same API, different target type.

Class imbalance

class_weight='balanced' weights classes automatically:

model = RandomForestClassifier(class_weight='balanced')

Or manual weights:

class_weight={0: 1, 1: 10}  # minority class 10x weight
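What 'balanced' computes under the hood: weight = n_samples / (n_classes * class count), so the rarer class gets the larger weight. sklearn exposes this directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# imbalanced labels: 90 zeros, 10 ones
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class gets the larger weight
```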

Extrapolation problem

RF cannot predict beyond the training range.

Training: y runs from 10 to 100. Predict at an x where y should be 150, and the model predicts near 100.

Linear regression and neural networks can extrapolate. RF cannot.

For trend continuation, use linear regression or Prophet, not RF.
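Easy to verify: train a regressor on y = x over [0, 100] and ask for a prediction at x = 150; it stays near the training maximum (toy illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(0, 101).reshape(-1, 1)  # x from 0 to 100
y_train = X_train.ravel().astype(float)     # y = x, so the largest y seen is 100

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict([[150]])[0]
print(pred)  # near 100, not 150: leaves can't go beyond seen targets
```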

In the interview

"Random Forest vs a single decision tree?" RF is an ensemble of many trees; averaging them reduces variance.

"What is bagging?" Training each tree on a random sample of the data, drawn with replacement.

"How to interpret feature importance?" It's impurity-based, with the usual biases. Use SHAP for a proper picture.

"When is RF better than XGBoost?" When you need something simpler (less tuning) and robust to noise: a good baseline.

Common mistakes

Expecting extrapolation

RF doesn't extrapolate. Don't expect sensible predictions outside the training range.

Ignoring random_state

Different runs give different results. Set random_state for reproducibility.

Too few trees

n_estimators=10 is too few. Use 100+ as a minimum.

Treating feature importance as causation

Same caveat as for any tree model: it shows correlation, not causation.

FAQ

Is sklearn enough?

Yes. It's the primary implementation.

Parallelization?

n_jobs=-1 uses all cores.

Memory?

All trees are stored in memory, which can get big. If memory is tight, use fewer trees.


Practice ML: open the trainer with 1500+ interview questions.