Random Forest for Analysts

Карьерник is a Telegram quiz trainer with 1500+ analyst interview questions: SQL, Python, A/B tests, metrics. Free.

Why this matters

Random Forest is the workhorse of ML. Less fancy than XGBoost, but often just as accurate with far less tuning. Robust, and interpretable at the feature level. An excellent baseline for tabular data.

In ML interviews for analyst roles, Random Forest is a must-know.

What it is

An ensemble of many decision trees:

  1. Build N trees
  2. Train each tree on a random sample of the data (bagging)
  3. At each split, consider only a random subset of features (feature bagging)
  4. Combine predictions: majority vote (classification) or average (regression)
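The four steps can be sketched by hand with plain decision trees (a toy illustration on synthetic data; in practice, use RandomForestClassifier directly):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):                           # 1. build N trees
    idx = rng.integers(0, len(X), len(X))     # 2. bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(
        max_features="sqrt",                  # 3. random feature subset at each split
        random_state=0,
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 4. majority vote across trees
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((ensemble_pred == y).mean())  # training accuracy of the ensemble
```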

The core idea

Individual trees overfit. Average many of them and variance drops, and overfitting with it.

The "wisdom of the crowd": many inaccurate predictions average into a more accurate one.
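The effect is easy to check numerically: averaging K independent noisy estimates shrinks the spread by roughly sqrt(K). A toy simulation, not RF itself (real trees are correlated, which is exactly why RF decorrelates them with feature bagging):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# 10,000 trials: one noisy "prediction" vs the average of 100 noisy ones
single = true_value + rng.normal(0, 2.0, size=10_000)
crowd = true_value + rng.normal(0, 2.0, size=(10_000, 100)).mean(axis=1)

print(single.std())  # ≈ 2.0
print(crowd.std())   # ≈ 0.2, roughly 2.0 / sqrt(100)
```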

Bagging vs Boosting

Bagging (Random Forest)

  • Trees are built in parallel, independently
  • Each tree sees a random subset of the data
  • Reduces variance

Boosting (XGBoost)

  • Trees are built sequentially, each depending on the previous
  • Each tree corrects the errors of the previous ones
  • Reduces bias

Different strategies; both work.
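A quick way to see both strategies on the same synthetic data; GradientBoostingClassifier stands in for XGBoost to keep the sketch sklearn-only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

bagging = RandomForestClassifier(n_estimators=100, random_state=42)       # parallel trees
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)  # sequential trees

bag_score = cross_val_score(bagging, X, y, cv=5).mean()
boost_score = cross_val_score(boosting, X, y, cv=5).mean()
print(bag_score, boost_score)  # both strategies land in the same ballpark here
```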

In Python

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=10,      # cap tree depth to curb overfitting
    random_state=42    # reproducibility
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)

Hyperparameters

n_estimators

Number of trees. More is better, with diminishing returns. 100-500 is typical.

max_depth

Depth of each tree. None = unlimited (overfitting risk). 10-20 is typical.

min_samples_split

Minimum samples required to split a node. Higher → more regularization.

min_samples_leaf

Minimum samples in a leaf. Higher → more regularization.

max_features

Number of features considered at each split. sqrt(n) is the default for classification.

bootstrap

True by default: each tree samples with replacement.
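These knobs are usually tuned together. A minimal RandomizedSearchCV sketch over the parameters above; the ranges and synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=15, random_state=42)

# illustrative ranges, not universal recommendations
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,      # try 10 random combinations
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```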

Advantages

  • Robust: handles noisy data.
  • Interpretable: feature importance out of the box.
  • Missing values: handled once imputed.
  • Non-linear: captures feature interactions.
  • Built-in validation: out-of-bag (OOB) score.
  • Less tuning than XGBoost.

Disadvantages

  • More memory: stores N trees.
  • Slower prediction than a single tree.
  • Typically less accurate than a tuned XGBoost.
  • Extrapolation: fails on values outside the training range.

OOB score

Each tree is trained on ~63% of the data (sampling with replacement). The remaining ~37% are its out-of-bag samples.

Use OOB samples for validation without a separate test set.

model = RandomForestClassifier(oob_score=True)
model.fit(X, y)
print(model.oob_score_)

Quick validation.

Feature importance

model.feature_importances_

Mean decrease in Gini impurity. Same caveats as gain-based tree importance: biased toward high-cardinality features.
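In practice you pair feature_importances_ with column names and sort them; the feature names below are synthetic stand-ins for your DataFrame columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # stand-in for df.columns

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)  # impurity-based; the values sum to 1.0
```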

SHAP gives better interpretability:

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

RF vs XGBoost

                  Random Forest   XGBoost
Training          Parallel        Sequential
Training speed    Faster          Slower
Accuracy          Good            Usually better
Tuning needed     Less            More
Interpretability  Similar         Similar
Overfit risk      Lower           Higher (without tuning)

Both widely used.

Use cases

Churn prediction

Robust, good baseline.

Fraud detection

Handles noise, class imbalance.

Credit scoring

Interpretable, good for regulatory requirements.

Feature selection

Use RF feature importance to pick features for downstream models.

Quick baseline

Before trying complex models.

Classification + regression

Classifier

from sklearn.ensemble import RandomForestClassifier

Regressor

from sklearn.ensemble import RandomForestRegressor

Same API, different target type.

Class imbalance

class_weight='balanced' weights classes automatically:

model = RandomForestClassifier(class_weight='balanced')

Or manual weights:

class_weight={0: 1, 1: 10}  # minority class 10x weight
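What 'balanced' computes under the hood: weight = n_samples / (n_classes * class count), so the rarer class gets the larger weight. sklearn exposes this directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# imbalanced labels: 90 zeros, 10 ones
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class gets the larger weight
```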

Extrapolation problem

RF cannot predict beyond the training range.

Training: y runs from 10 to 100. Predict at an x where y should be 150, and the model predicts near 100.

Linear regression and neural networks can extrapolate. RF cannot.

For trend continuation, use linear regression or Prophet, not RF.
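Easy to verify: train a regressor on y = x over [0, 100] and ask for a prediction at x = 150; it stays near the training maximum (toy illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(0, 101).reshape(-1, 1)  # x from 0 to 100
y_train = X_train.ravel().astype(float)     # y = x, so the largest y seen is 100

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict([[150]])[0]
print(pred)  # near 100, not 150: leaves can't go beyond seen targets
```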

In the interview

"Random Forest vs a single decision tree?" RF is an ensemble of many trees; averaging them reduces variance.

"What is bagging?" Training each tree on a random sample of the data, drawn with replacement.

"How to interpret feature importance?" It's impurity-based, with the usual biases. Use SHAP for a proper picture.

"When is RF better than XGBoost?" When you need something simpler (less tuning) and robust to noise: a good baseline.

Common mistakes

Expecting extrapolation

RF doesn't extrapolate. Don't expect sensible predictions outside the training range.

Ignoring random_state

Different runs give different results. Set random_state for reproducibility.

Too few trees

n_estimators=10 is too few. Use 100+ as a minimum.

Treating feature importance as causation

Same caveat as for any tree model: it shows correlation, not causation.

FAQ

Is sklearn enough?

Yes. It's the primary implementation.

Parallelization?

n_jobs=-1 uses all cores.

Memory?

All trees are stored in memory, which can get big. If memory is tight, use fewer trees.


Practice ML: open the trainer with 1500+ interview questions.