XGBoost for Analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews. SQL, Python, A/B tests, metrics. Free.

Why you should know this

XGBoost is the de facto standard for tabular ML. It wins Kaggle competitions and is used across industry for churn prediction, fraud detection, and credit scoring. An analyst who can't run XGBoost is limited in ML capabilities.

At middle/senior interviews in data science, fintech, and e-commerce, XGBoost comes up often.

What is XGBoost

Extreme Gradient Boosting is an ensemble of decision trees.

Core idea: trees are built sequentially, each new tree correcting the errors of the previous ones.

Advantages:

  • Fast
  • Accurate
  • Handles missing values
  • Regularization
  • Feature importance
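The "each tree corrects the previous ones" loop can be sketched in pure Python with regression stumps. This is a toy illustration of gradient boosting for squared loss, not XGBoost's actual algorithm:

```python
# Toy gradient boosting: each weak learner (a one-split stump) is fitted
# to the residuals of the current ensemble, then added with a learning rate.
def fit_stump(x, residuals):
    # Best single threshold on x minimizing squared error of two constant leaves.
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=50, lr=0.1):
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]  # step-like target
pred = boost(x, y)
```

Real XGBoost fits each tree to first- and second-order gradients of an arbitrary loss and adds regularization, but the sequential residual-fitting loop is the same idea.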

Installation

pip install xgboost

Basic example (classification)

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs end to end; replace with your own X, y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]  # probability of class 1

Hyperparameters

Core

  • n_estimators: number of trees (100-1000)
  • max_depth: tree depth (3-10)
  • learning_rate: step size shrinkage (0.01-0.3)
  • subsample: fraction of rows per tree (0.5-1)
  • colsample_bytree: fraction of features per tree (0.5-1)

Regularization

  • reg_alpha (L1): sparsity
  • reg_lambda (L2): shrinkage
  • gamma: minimum loss reduction required to make a split
  • min_child_weight: minimum sum of instance weights in a child node
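A sketch of how these knobs combine in one configuration; all values below are illustrative assumptions to tune from, not recommended defaults:

```python
# Illustrative regularized configuration (all values are assumptions to tune):
params = {
    "n_estimators": 500,
    "max_depth": 4,            # shallower trees generalize better
    "learning_rate": 0.05,
    "subsample": 0.8,          # fraction of rows per tree
    "colsample_bytree": 0.8,   # fraction of features per tree
    "reg_alpha": 0.1,          # L1: pushes leaf weights toward zero (sparsity)
    "reg_lambda": 1.0,         # L2: shrinks leaf weights
    "gamma": 1.0,              # minimum loss reduction required to split
    "min_child_weight": 5,     # minimum sum of instance weights in a child
}
# model = xgb.XGBClassifier(**params)
```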

Early stopping

Prevent overfitting:

model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

Training stops when the validation metric has not improved for 10 rounds; the model keeps the best iteration.
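The stopping rule itself is easy to sketch: track the best validation score and stop once it has not improved for `patience` rounds (a simplified sketch of the logic, assuming higher scores are better):

```python
def early_stop_round(val_scores, patience=10):
    """Return the round at which training stops, or None if it never does."""
    best, best_round = float("-inf"), 0
    for i, score in enumerate(val_scores):
        if score > best:
            best, best_round = score, i
        elif i - best_round >= patience:
            return i  # no improvement for `patience` rounds
    return None

# Validation AUC peaks at round 2, then plateaus below the best score:
scores = [0.70, 0.74, 0.76] + [0.75] * 15
stop = early_stop_round(scores, patience=10)
```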

Cross-validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(scores.mean())
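What cross_val_score does under the hood is easy to sketch: split the indices into k folds, train on k-1 of them, evaluate on the held-out fold. A simplified sketch (sklearn's KFold can also shuffle):

```python
def kfold_indices(n, k=5):
    # Split indices 0..n-1 into k contiguous folds (no shuffling).
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each element: (train indices, validation indices)
    return [([i for i in range(n) if i not in set(fold)], fold) for fold in folds]

splits = kfold_indices(10, k=5)
```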

Hyperparameter tuning

GridSearchCV

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 500]
}

grid = GridSearchCV(xgb.XGBClassifier(), params, cv=5, scoring='roc_auc')
grid.fit(X, y)
print(grid.best_params_)

This grid has 3 × 2 × 2 = 12 combinations, and with cv=5 that means 60 model fits; grids grow exponentially, which is why Bayesian search scales better.

Optuna (Bayesian)

Modern, efficient:

import optuna

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }
    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)

Imbalanced data

Class imbalance (e.g. 99/1) is a problem: the model can reach high accuracy by always predicting the majority class.

Solutions:

  • scale_pos_weight = neg / pos
  • Upsample minority
  • Downsample majority

model = xgb.XGBClassifier(scale_pos_weight=10)  # e.g. 10 negatives per positive
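A common starting point is scale_pos_weight = (number of negatives) / (number of positives); computing it from the labels is a one-liner (pure-Python sketch):

```python
def pos_weight(y):
    # Ratio of negative to positive labels, used for scale_pos_weight.
    pos = sum(1 for label in y if label == 1)
    return (len(y) - pos) / pos

y = [0] * 990 + [1] * 10   # 99:1 imbalance
weight = pos_weight(y)     # 99.0
# model = xgb.XGBClassifier(scale_pos_weight=weight)
```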

Feature importance

importance = model.feature_importances_
# Or
xgb.plot_importance(model)

Note: feature_importances_ defaults to 'gain', while plot_importance defaults to 'weight' (split counts), which is biased toward high-cardinality features.

Use SHAP for proper interpretation:

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

Missing values

XGBoost handles NaN natively: each split learns a default direction for missing values, so you don't need to impute.

import numpy as np

X_train.loc[0, 'feature'] = np.nan  # fine (assuming X_train is a DataFrame)
model.fit(X_train, y_train)

XGBoost vs alternatives

vs Random Forest

XGBoost is usually more accurate but slower to train. Random Forest is more robust without tuning.

vs LightGBM

LightGBM trains faster with similar quality. Popular in Russia.

vs CatBoost (Yandex)

CatBoost handles categorical features natively. Developed and used at Yandex.

vs Neural networks

Neural networks are better for images and text; on tabular data, XGBoost usually wins.

Regression

model = xgb.XGBRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)

Same API, but with a continuous target.

Ranking

For ranking tasks (e.g. ordering search results):

model = xgb.XGBRanker(...)

Practical tips

1. Start simple

Start with defaults plus early stopping to get a baseline.

2. Tune learning rate

A lower learning rate with more trees usually gives better accuracy.

3. Regularize

If the model overfits, increase reg_lambda and min_child_weight.

4. Feature engineering

Still matters. Good features beat hyperparameter tuning.

5. SHAP for interpretability

Use SHAP to explain model predictions to the business.

At the interview

"Why XGBoost?" Fast, accurate, handles missing values, built-in regularization.

"Main hyperparameters?" max_depth, learning_rate, n_estimators.

"How to prevent overfitting?" Regularization, early stopping, subsample.

"Imbalanced data?" scale_pos_weight, sampling techniques.

Common mistakes

Over-tuning

You can squeeze out an extra 0.5% of performance... after 100 hours of tuning. Diminishing returns.

Leakage

Features built from future information leak the target. Be careful, especially with time-based data.

No validation

Trusting training accuracy leads to overfitting. Always keep a holdout set.

Raw numbers

Accuracy to 5 decimal places is not business-meaningful. Round.

FAQ

Python or other languages?

Python is the most common; R and Scala bindings also exist.

GPU support?

Yes. It uses CUDA GPUs to accelerate training.

In production?

Yes, widely used. Inference is fast.


Practice ML: open the trainer with 1500+ interview questions.