XGBoost for Analysts
Карьерник — a Telegram quiz trainer with 1500+ analyst interview questions: SQL, Python, A/B testing, metrics. Free.
Why you should know this
XGBoost is the de facto standard for tabular ML. It wins Kaggle competitions and is used across industry for churn prediction, fraud detection, and credit scoring. An analyst who cannot run XGBoost is limited in their ML capabilities.
In middle and senior interviews for data science, fintech, and e-commerce roles, XGBoost comes up often.
What is XGBoost
Extreme Gradient Boosting — an ensemble of decision trees.
The core idea: trees are built sequentially, each new tree correcting the errors of the previous ones.
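This sequential error correction can be illustrated with plain decision trees. A toy sketch on a synthetic 1D dataset (not XGBoost itself — just the boosting idea for squared error):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# Gradient boosting for squared error, by hand:
# each tree is fit to the residuals (negative gradient) of the current ensemble,
# and its prediction is added with a shrinkage factor (the learning rate).
learning_rate = 0.1
pred = np.zeros_like(y)
trees = []
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - pred) ** 2)  # shrinks as trees are added
```

Each tree alone is weak; the ensemble fits the data because every round targets what the previous rounds got wrong.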
Strengths:
- Fast
- Accurate
- Handles missing values
- Built-in regularization
- Feature importance
Installation

```shell
pip install xgboost
```

Basic example (classification)
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X — feature matrix, y — binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]  # probability of the positive class
```

Hyperparameters
Core
- n_estimators: number of trees (100-1000)
- max_depth: tree depth (3-10)
- learning_rate: step size shrinkage (0.01-0.3)
- subsample: fraction of rows per tree (0.5-1)
- colsample_bytree: fraction of features per tree (0.5-1)
Regularization
- reg_alpha (L1): encourages sparsity
- reg_lambda (L2): shrinks leaf weights
- gamma: minimum loss reduction required to make a split
- min_child_weight: minimum sum of instance weights in a child node
Early stopping
Prevents overfitting:

```python
model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```

Training stops when the validation metric has not improved for 10 rounds.
Cross-validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(scores.mean())
```

Hyperparameter tuning
GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 500],
}
grid = GridSearchCV(xgb.XGBClassifier(), params, cv=5, scoring='roc_auc')
grid.fit(X, y)
print(grid.best_params_)
```

Optuna (Bayesian)
Modern and efficient:

```python
import optuna

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
    }
    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

Imbalanced data
A strong class imbalance (say, 99/1) is a problem.
Solutions:
- scale_pos_weight = (number of negatives) / (number of positives)
- Upsample the minority class
- Downsample the majority class
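The weight can be derived directly from the training labels. A sketch assuming a hypothetical binary 0/1 target `y_train`:

```python
import numpy as np

# Hypothetical imbalanced target: 990 negatives, 10 positives.
y_train = np.array([0] * 990 + [1] * 10)

neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
scale_pos_weight = neg / pos  # 990 / 10 = 99.0
```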
```python
model = xgb.XGBClassifier(scale_pos_weight=10)
```

Feature importance
```python
importance = model.feature_importances_
# or
xgb.plot_importance(model)
```

Note: feature_importances_ defaults to gain, while plot_importance defaults to weight (split counts); both defaults are biased, e.g. toward high-cardinality features.
Use SHAP for proper interpretation:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```

Missing values
XGBoost handles NaN natively. No imputation needed.

```python
import numpy as np

X_train.loc[0, 'feature'] = np.nan  # ok (assuming X_train is a pandas DataFrame)
model.fit(X_train, y_train)
```

XGBoost vs alternatives
vs Random Forest
XGBoost is usually more accurate but slower to train. RF is more robust without tuning.
vs LightGBM
LightGBM is faster with similar quality. Popular in Russia.
vs CatBoost (Yandex)
CatBoost handles categorical features natively. Used widely at Yandex.
vs Neural networks
NNs win on images and text. On tabular data, XGBoost usually wins.
Regression

```python
model = xgb.XGBRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
```

Same API, but with a continuous target.
Ranking
For ranking tasks (e.g., search results):

```python
model = xgb.XGBRanker(...)
```

Practical tips
1. Start simple
Defaults plus early stopping. Get a baseline first.
2. Tune the learning rate
A lower rate with more trees usually gives better accuracy.
3. Regularize
If the model overfits, increase reg_lambda and min_child_weight.
4. Feature engineering
Still matters. Good features beat hyperparameter tuning.
5. SHAP for interpretability
Essential for communicating results to the business.
In interviews
"Why XGBoost?" Fast, accurate, handles missing values, built-in regularization.
"Main hyperparameters?" max_depth, learning_rate, n_estimators.
"How do you prevent overfitting?" Regularization, early stopping, subsampling.
"Imbalanced data?" scale_pos_weight, sampling techniques.
Common mistakes
Over-tuning
You can squeeze out an extra 0.5% of performance... after 100 hours of tuning. Diminishing returns.
Leakage
Features built from future information leak the target. Be careful.
No validation
Trusting training accuracy leads to overfitting. Always keep a holdout set.
Raw numbers
Accuracy to 5 decimal places is not business-meaningful. Round.
Related topics
- XGBoost vs Random Forest
- Decision trees for analysts
- Feature importance: SHAP vs Gain
- Overfitting in simple terms
- Cross-validation in simple terms
FAQ
Python or other languages?
Python is the most common. R and Scala bindings also exist.
GPU support?
Yes, via CUDA.
In production?
Yes, widely used. Inference is fast.
Practice ML — open the trainer with 1500+ interview questions.