Boosting pitfalls in the Data Scientist interview
Overfitting
Boosting can overfit with a large n_estimators.
Defenses (see the sketch after this list):
- Early stopping on a validation set.
- Lower learning_rate (with more estimators).
- Subsample / bagging inside the boosting itself.
- Regularization (L1, L2, alpha).
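A rough mapping of these defenses onto LightGBM's scikit-learn API; the parameter names are LightGBM's, the values are purely illustrative, not recommendations:

from lightgbm import LGBMClassifier

model = LGBMClassifier(
    n_estimators=2000,      # many weak trees...
    learning_rate=0.02,     # ...each contributing a little
    subsample=0.8,          # row subsampling (bagging) per iteration
    subsample_freq=1,       # perform bagging on every iteration
    colsample_bytree=0.8,   # feature subsampling per tree
    reg_alpha=1.0,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
)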
Early stopping is set at fit time, not in the constructor:

from lightgbm import LGBMClassifier, early_stopping
model = LGBMClassifier(n_estimators=1000)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], callbacks=[early_stopping(50)])
Target leakage
A feature derived from the future or from the target.
Examples:
- total_charges: derived from the target.
- last_action_after_purchase: knows the future.
- Aggregations over the full dataset (including test rows).
Detect. Suspiciously high accuracy. Investigate the top features via permutation importance.
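A minimal sketch of this check on synthetic data (the toy dataset and feature names are illustrative): a feature derived from the target dominates permutation importance.

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
X = pd.DataFrame({
    "honest_feature": rng.normal(size=2000) + 0.3 * y,
    "leaky_feature": y + rng.normal(scale=0.05, size=2000),  # derived from the target
})
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = LGBMClassifier().fit(X_tr, y_tr)

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")  # leaky_feature dominates -> investigate it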
Fix. Compute features only from the past (relative to the event time).
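A sketch of a point-in-time aggregation in pandas (table and column names are hypothetical): each row's feature counts only strictly earlier events for that user.

import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-02-01",
                          "2024-01-03", "2024-01-10"]),
    "purchased": [0, 1, 1, 1, 0],
})
events = events.sort_values(["user_id", "ts"])
# cumulative purchases shifted by one, so the current event is excluded
events["past_purchases"] = events.groupby("user_id")["purchased"].transform(
    lambda s: s.cumsum().shift(fill_value=0)
)
print(events)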
Calibration issues
Boosting does not output calibrated probabilities.
A predicted 70% confidence may in reality be 50% or 90%.
Fix:
- Sigmoid / Platt scaling on validation.
- Isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(boosting_model, method='isotonic')  # default cv=5 refits clones of the model
calibrated.fit(X_val, y_val)
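To verify the result, a reliability check via sklearn's calibration_curve (a sketch; X_test and y_test are assumed to be a held-out split):

from sklearn.calibration import calibration_curve

proba = calibrated.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
# well calibrated: prob_true is close to prob_pred in every bin
print(list(zip(prob_pred, prob_true)))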
Categorical encoding
XGBoost. Historically no native support: one-hot or target encoding (recent versions add experimental enable_categorical support).
LightGBM, CatBoost. Native. Specify categorical_feature (LightGBM) or cat_features (CatBoost).
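A minimal LightGBM sketch on toy data (CatBoost is analogous: pass cat_features to the constructor):

import pandas as pd
from lightgbm import LGBMClassifier

X = pd.DataFrame({
    "city": pd.Categorical(["msk", "spb", "msk", "spb"] * 25),
    "age": list(range(100)),
})
y = [0, 1] * 50
# LightGBM detects pandas 'category' columns automatically;
# an explicit list also works: fit(..., categorical_feature=["city"])
LGBMClassifier(min_child_samples=5).fit(X, y)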
Target encoding done wrong → leakage. Use out-of-fold (OOF) target encoding, or CatBoost's ordered target encoding; a sketch below.
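A sketch of OOF target encoding with smoothing (the function name and smoothing scheme are illustrative, not a library API):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0):
    # Each row is encoded with statistics from the OTHER folds only,
    # so its own target value never leaks into its feature.
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        stats = df.iloc[train_idx].groupby(col)[target].agg(["mean", "count"])
        smoothed = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(smoothed).values
    return encoded.fillna(global_mean)  # unseen categories fall back to the global mean

This is applied to the train set only; test rows are encoded with statistics computed on the full train set.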
Monotonic constraints
Constrain the direction of a feature's effect on the prediction.
"Probability buy increases с income."xgb.train(..., monotone_constraints=(1, 0, -1, ...))1 increasing, -1 decreasing, 0 no constraint.
Useful in regulated settings (financial models) and for interpretability.
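A runnable sketch with the XGBoost scikit-learn API (toy data; the feature roles are illustrative):

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # columns: income, age, price (illustrative)
y = (X[:, 0] - X[:, 2] + rng.normal(size=500) > 0).astype(int)

model = XGBClassifier(
    monotone_constraints=(1, 0, -1),  # income: increasing, age: free, price: decreasing
    n_estimators=200,
    max_depth=3,
)
model.fit(X, y)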
Related topics
- XGBoost vs LightGBM vs CatBoost for DS
- Bagging vs Boosting for DS
- Feature engineering for DS
- Calibration for DS
- Preparing for the Data Scientist interview
FAQ
Is this official information?
No. The article is based on standard GBM practices.