April 23, 2026 · 3 min read

Decision trees for analysts

Why this matters

A decision tree is one of the most interpretable ML models. When you need to explain to a manager why the model predicts churn, you plot the tree and the logic is visible at a glance. Decision trees come up regularly in interviews for middle-level analyst roles.

They are also the foundation for Random Forest and XGBoost: without understanding trees, you cannot understand ensembles.

Short explanation

A decision tree is a sequence of if/else checks on features that ends in a prediction.

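def predict_churn(churn_last_30d, payments):
    # Each "if" is a node of the tree, each "return" is a leaf
    if churn_last_30d:
        if payments < 2:
            return 0.8  # predicted churn probability
        else:
            return 0.4
    else:
        return 0.1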

Each node is a question, and each leaf is a prediction.

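How a tree is built

Step 1: Find the best split

For every feature and every candidate threshold:

Compute the information gain or the Gini impurity of the resulting groups.

The best split is the one that yields the purest child groups, that is, the largest reduction in impurity.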
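The search in Step 1 can be sketched for a single numeric feature as follows. This is a minimal illustration, not how scikit-learn implements it: the helper names `gini` and `best_split` and the toy data are assumptions made for the example, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    n = len(values)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the children, weighted by their sizes
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: number of payments and whether the user churned
payments = [0, 1, 1, 2, 3, 5, 6, 8]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, impurity = best_split(payments, churned)
print(threshold, impurity)  # 2.5 0.0
```

On this toy data the split `payments <= 2.5` separates the classes completely, so the weighted impurity drops to 0.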

Step 2: Recurse

After the split, repeat the process for each subgroup.

Step 3: Stop

Stopping criteria:

Maximum depth reached
Minimum number of samples per leaf reached
No further improvement possible
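The three steps together can be sketched as a small recursive function. This is a simplified illustration under stated assumptions: `build_tree` and `predict` are hypothetical helpers, the feature tuples `(churn_last_30d, payments)` are made-up toy data, and real libraries use far more efficient split search.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3, min_samples_leaf=1):
    """Return a nested dict (internal node) or a class label (leaf)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: depth limit reached or the node is already pure
    if depth >= max_depth or gini(labels) == 0.0:
        return majority
    best = None  # (weighted_gini, feature_index, threshold)
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if min(len(left), len(right)) < min_samples_leaf:
                continue  # split would create a leaf that is too small
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[0]:
                best = (w, j, t)
    # Stop: no valid split, or the best split does not reduce impurity
    if best is None or best[0] >= gini(labels):
        return majority
    _, j, t = best
    left_idx = [i for i, row in enumerate(rows) if row[j] <= t]
    right_idx = [i for i, row in enumerate(rows) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx],
                           depth + 1, max_depth, min_samples_leaf),
        "right": build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx],
                            depth + 1, max_depth, min_samples_leaf),
    }

def predict(node, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node

# Hypothetical toy data: (churn_last_30d, payments) -> churned
rows = [(1, 0), (1, 1), (1, 3), (1, 4), (0, 0), (0, 2), (0, 5), (0, 1)]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (1, 1)), predict(tree, (0, 1)))
```

The resulting tree first splits on the first feature and then on `payments`, which mirrors the nested if/else form of a tree.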

Split metrics

Gini impurity

Gini = 1 - Σ p_i²

Here p_i is the share of class i in the node.

0: pure node (all samples belong to the same class)
0.5: maximum impurity for 2 classes (a 50/50 split)
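To check the two reference values by hand, the formula can be written as a short function. The helper name `gini` and the label lists are assumptions made for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["yes"] * 10)                # one class only
balanced = gini(["yes"] * 5 + ["no"] * 5)  # 50/50 split
skewed = gini(["yes"] * 9 + ["no"] * 1)    # 90/10 split

print(pure)      # 0.0
print(balanced)  # 0.5
print(round(skewed, 2))
```

The 90/10 node lands between the two extremes, at roughly 0.18.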

Entropy

Entropy = -Σ p_i × log₂(p_i)

The idea is similar: 0 for a pure node and, with a base-2 logarithm, 1 for a 50/50 split of two classes.
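A minimal sketch of the entropy calculation, assuming a base-2 logarithm (so the result is measured in bits). The helper name `entropy` and the label lists are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

pure = entropy(["yes"] * 10)                 # one class only
balanced = entropy(["yes"] * 5 + ["no"] * 5)   # 50/50 split
skewed = entropy(["yes"] * 9 + ["no"] * 1)     # 90/10 split

print(pure)      # 0.0
print(balanced)  # 1.0
print(round(skewed, 3))
```

As with Gini, the value is 0 for a pure node and highest for an even split; the two metrics usually rank candidate splits similarly.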

Variance (for regression)

A tree for a continuous Y picks the splits that minimize the variance within the leaves.

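Advantages

Interpretable. You look at the tree and understand the logic.
No scaling needed. The range of the features does not matter.
Handles categorical features. The algorithm itself can split on categories, although scikit-learn's implementation requires them to be encoded as numbers first.
Non-linear relationships. Captures them without manual feature engineering.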

Disadvantages

Overfitting. A tree easily grows deep and memorizes the training data.
Instability. A small change in the data can produce a completely different tree.
Not the best accuracy. A single tree usually loses to an ensemble.

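Overfitting

Without regularization a tree keeps growing until every leaf is pure, so it memorizes the training data and performs poorly on the test set.

Fixes

max_depth: limits the depth of the tree
min_samples_split: minimum number of samples required to split a node
min_samples_leaf: minimum number of samples per leaf
Pruning: remove branches after training (in scikit-learn, cost-complexity pruning via ccp_alpha)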
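The effect of these settings can be seen by comparing an unrestricted tree with a regularized and a pruned one. This is a sketch on synthetic data: the dataset from `make_classification` and the specific values (`max_depth=5`, `min_samples_leaf=50`, `ccp_alpha=0.005`) are assumptions chosen for illustration, not recommended defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so memorizing the training set hurts
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
regularized = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50,
                                     random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", unrestricted),
                    ("regularized", regularized),
                    ("pruned", pruned)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

The unrestricted tree fits the training set perfectly; the gap between its train and test accuracy is the overfitting that the other two variants are meant to narrow.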

In Python

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# X_train, y_train and features (the list of column names) are assumed to be prepared earlier
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50)
model.fit(X_train, y_train)

# Visualize the fitted tree
plot_tree(model, feature_names=features, class_names=['no', 'yes'], filled=True)
plt.show()

# Feature importance
for feat, imp in zip(features, model.feature_importances_):
    print(f"{feat}: {imp:.3f}")

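Feature importance

A feature's importance is the total Gini/entropy reduction from all splits on that feature, normalized so that the importances sum to 1.

Pitfalls:

Biased toward high-cardinality features
Correlated features share importance between them
Does not show the direction of the effect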
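The last pitfall can be illustrated with a small sketch: the importance is the same whether the target rises or falls with the feature. The synthetic data and the names `y_up` and `y_down` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # columns: x0 (signal), x1 and x2 (noise)
y_up = (X[:, 0] > 0).astype(int)     # target rises with x0
y_down = (X[:, 0] < 0).astype(int)   # target falls with x0

imp_up = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_up).feature_importances_
imp_down = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_down).feature_importances_

print("target rises with x0:", imp_up)
print("target falls with x0:", imp_down)
print("sum of importances:", imp_up.sum())
```

Both runs assign all the importance to x0, and the values are non-negative and sum to 1, so the direction of the effect has to be read from the tree itself or from the data.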

Regression trees

For a continuous Y:

Leaf prediction = the mean of Y in the leaf
Split criterion: variance reduction
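Both points can be worked through on a handful of numbers. The helper `variance_reduction` and the order amounts are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, pvariance

def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(y_parent)
    weighted_children = (len(y_left) * pvariance(y_left)
                         + len(y_right) * pvariance(y_right)) / n
    return pvariance(y_parent) - weighted_children

# Hypothetical order amounts, split into a "small" and a "large" group
amounts = [10, 12, 11, 50, 52, 51]
left, right = amounts[:3], amounts[3:]

left_prediction = mean(left)     # what the left leaf would predict
right_prediction = mean(right)   # what the right leaf would predict
reduction = variance_reduction(amounts, left, right)

print(left_prediction, right_prediction)  # 11 51
print(round(reduction, 6))                # 400.0
```

Almost all of the parent's variance comes from the gap between the two groups, so this split removes nearly all of it.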

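Ensembles

Random Forest

Many trees trained on random subsets of rows and features. Averaging their predictions reduces variance and therefore overfitting.

Gradient Boosting (XGBoost, LightGBM)

Trees are built sequentially, each one correcting the errors of the previous ones. Usually more accurate than a single tree.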
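A quick way to compare a single tree with both ensemble types is cross-validation on the same data. This sketch uses synthetic data and swaps in scikit-learn's own `GradientBoostingClassifier` instead of XGBoost or LightGBM to avoid extra dependencies; the dataset and all settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

On noisy data like this the ensembles typically score higher than the single tree, though the size of the gap depends on the dataset.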

When to use

Single tree

Small data
Interpretability is required
Baseline / prototype
Clear if/else logic

Ensemble

Accuracy is the priority
Medium to large data
Per-row explanations matter less

Visualization

A tree plot is an excellent tool for communicating with stakeholders.

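from sklearn.tree import export_graphviz
import graphviz  # Python bindings; the Graphviz binaries must be installed separately

# Export the fitted tree as DOT source
dot = export_graphviz(model, ...)
# Render the graph to a file named 'tree'
graphviz.Source(dot).render('tree')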

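In the interview

"What is a decision tree?" A recursive split on features that minimizes impurity at each step.

"How does it overfit?" A deep tree memorizes the training data. Fix: max_depth, min_samples_split, min_samples_leaf.

"How does it compare to Random Forest?" RF is an ensemble of trees that reduces variance.

"Feature importance?" The impurity (Gini) reduction attributed to each feature. It is biased toward high-cardinality features.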

Use in analytics

Churn analysis

How do you identify users at risk? The leaves of the tree show which segments churn most.

Decision logic documentation

"If price > X and time > Y, the user is likely to buy": the tree outputs rules like this.
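Rules of this kind can be printed as plain text with `export_text` from scikit-learn. The sketch below uses made-up data in which a purchase happens when `price > 50` and `time_on_site > 20`; the feature names, thresholds and data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
price = rng.uniform(0, 100, 500)
time_on_site = rng.uniform(0, 60, 500)
X = np.column_stack([price, time_on_site])
# Hypothetical rule that generated the labels
y = ((price > 50) & (time_on_site > 20)).astype(int)

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Text representation of the learned if/else rules
rules = export_text(model, feature_names=["price", "time_on_site"])
print(rules)
```

The printed thresholds should land close to 50 and 20, and the text can be pasted directly into documentation.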

Quick EDA

A tree can surface interactions between features.

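Common mistakes

Deep tree without validation

With max_depth=None (the scikit-learn default) the tree grows until the leaves are pure and tends to overfit.

Misinterpreting feature importance

Importance is not causation: a high value means the feature was useful for splitting, not that it causes the outcome.

Using a tree when a linear model is better

If the true relationship is linear, linear regression beats a tree.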
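The last point can be checked on synthetic data where the target is a straight line plus noise. The data-generating formula and the tree depth are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=1000)  # linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# R^2 on held-out data
linear_r2 = linear.score(X_test, y_test)
tree_r2 = tree.score(X_test, y_test)
print(f"linear regression R^2: {linear_r2:.3f}")
print(f"decision tree R^2:     {tree_r2:.3f}")
```

The tree can only approximate the line with a staircase of constant leaf values, which costs accuracy compared with the linear fit.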

FAQ

Which sklearn classes?

DecisionTreeClassifier, DecisionTreeRegressor.

How to explain it non-technically?

"A sequence of questions whose answers lead to a decision."

What is the optimal depth?

Choose it with cross-validation. In practice it is often between 5 and 10.
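Choosing the depth by cross-validation can be sketched with `GridSearchCV`. The synthetic dataset and the candidate depths are assumptions for illustration; on real data the grid and the scoring metric should match the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

depths = [2, 3, 5, 7, 10, None]  # None means no depth limit
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("best max_depth:", search.best_params_["max_depth"])
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a tree refit on all the data with the selected depth.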