CatBoost for Analysts
Карьерник is a quiz trainer in Telegram with 1500+ interview questions for analysts: SQL, Python, A/B tests, metrics. Free.
Why this matters
CatBoost is Yandex's ML library. In Russian companies (especially in the Yandex ecosystem) it is often the standard. It differs from XGBoost / LightGBM in its native categorical handling and robust defaults.
CatBoost comes up in interviews at Yandex and other Russian tech companies, so it is good to know as an alternative.
What is CatBoost
A gradient boosting library from Yandex:
- Open-source
- Native categorical feature handling
- Good defaults (minimal tuning)
- GPU support
- Training visualization built in
Categorical features
The classic approach
One-hot encoding: country: [RU, US, KZ] → 3 new columns.
For high-cardinality features (1000+ unique values) this explodes the dimensionality.
Label encoding
RU → 1, US → 2, KZ → 3.
The tree treats 3 > 2 > 1 as a meaningful order, which it is not.
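Both problems can be seen in a few lines of plain Python (a toy illustration; no ML library involved):

```python
# One-hot: one new column per unique value -- fine for 3 countries,
# ruinous for 1000+.
countries = ['RU', 'US', 'KZ', 'RU']
unique = sorted(set(countries))
one_hot = [[int(c == u) for u in unique] for c in countries]
# 3 columns here; a 1000-category feature would produce 1000 columns.

# Label encoding: the mapping imposes an arbitrary order.
labels = {c: i + 1 for i, c in enumerate(['RU', 'US', 'KZ'])}
encoded = [labels[c] for c in countries]
# A split like "country > 2" now means "is KZ", which is not a
# semantically meaningful ordering of countries.
```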
CatBoost approach
Ordered target statistics: each row's category is encoded with statistics computed only over rows that precede it in a random permutation, which avoids target leakage.
It works out of the box and usually gives the best results on categorical-heavy data.
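The core idea can be sketched in plain Python. This is a simplified illustration, not CatBoost's actual implementation: the real algorithm uses several random permutations and more elaborate priors, and the prior/prior_weight smoothing parameters here are hypothetical:

```python
def ordered_target_encoding(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each categorical value using the mean target of *earlier*
    rows only (plus a smoothing prior), so a row never sees its own
    label -- this is what prevents target leakage."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        sums[cat] = s + y      # update AFTER encoding the current row
        counts[cat] = n + 1
    return encoded

cats = ['RU', 'RU', 'US', 'RU']
ys = [1, 0, 1, 1]
enc = ordered_target_encoding(cats, ys)
```

Note how the first occurrence of each category falls back to the prior, and later occurrences gradually incorporate the observed target mean.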
In Python
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=['country', 'gender', 'device']  # mark the categorical columns
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
predictions = model.predict(X_test)
```

Pros
- Handles categorical features natively (the main advantage)
- Good defaults, so less tuning
- Robust to overfitting
- Built-in plots of training progress
- GPU support
Cons
- Slower training than XGBoost / LightGBM
- Larger models than LightGBM
- Smaller community than XGBoost (but solid enough)
CatBoost vs XGBoost vs LightGBM
| | CatBoost | XGBoost | LightGBM |
|---|---|---|---|
| Speed | Medium | Medium | Fast |
| Categorical | Native | Manual | Native (ish) |
| Defaults | Good | Need tuning | Medium |
| Community | Medium | Largest | Medium |
| Russian ecosystem | Primary | Common | Common |
Hyperparameters
Key parameters
- iterations (alias n_estimators): number of trees (500-5000)
- learning_rate: the default of 0.03 is usually fine
- depth: tree depth, 4-10
- l2_leaf_reg: L2 regularization strength
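As a starting point, a hedged parameter sketch (values are illustrative defaults to iterate from, not tuned results):

```python
# Illustrative starting hyperparameters for CatBoostClassifier.
# Tune from here rather than treating these as final values.
params = {
    'iterations': 1000,      # generous cap; pair with early stopping
    'learning_rate': 0.03,   # CatBoost's usual default
    'depth': 6,              # symmetric trees; 4-10 is the common range
    'l2_leaf_reg': 3.0,      # L2 regularization on leaf values
}
# model = CatBoostClassifier(**params)  # requires the catboost package
```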
Auto-tuning
CatBoost has a built-in parameter search:

```python
model.grid_search(param_grid, X, y)
```

Alternatively, use Optuna (the usual standard).
Categorical encoding
Specifying features

```python
cat_features = ['country', 'device_type']
model.fit(X, y, cat_features=cat_features)
```

You can pass either positional indices or column names.
Target encoding
Internally, CatBoost uses ordered target statistics, which avoids target leakage.
Early stopping
```python
model = CatBoostClassifier(iterations=1000, early_stopping_rounds=20)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```

Training stops when the validation metric has not improved for 20 rounds.
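The stopping rule itself is simple. A generic sketch of the patience logic (illustrative only, not CatBoost's internal code; early_stop_index is a made-up helper):

```python
def early_stop_index(val_losses, patience=20):
    """Return the index of the best iteration, stopping once the
    validation loss has not improved for `patience` rounds."""
    best_loss = float('inf')
    best_iter = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds
    return best_iter

# Example: loss improves until iteration 3, then plateaus.
losses = [0.9, 0.7, 0.6, 0.55] + [0.56] * 30
best = early_stop_index(losses, patience=20)
```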
SHAP integration
```python
import shap

explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)
```

Works the same way as for other tree models.
GPU support
```python
model = CatBoostClassifier(task_type='GPU')
```

Often faster on large datasets.
Built-in CV

```python
from catboost import Pool, cv

pool = Pool(X, y, cat_features=cat_features)
params = {'iterations': 500, 'depth': 6, 'loss_function': 'Logloss'}
cv_results = cv(pool, params, fold_count=5)
```

Imbalanced data

```python
model = CatBoostClassifier(class_weights=[1, 10])
```

Or use auto_class_weights='Balanced'.
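Per the CatBoost docs, 'Balanced' weights each class by max_class_count / class_count, so the rarest class gets the largest weight. A quick sketch of that formula (balanced_weights is a hypothetical helper, assuming unit sample weights):

```python
from collections import Counter

def balanced_weights(labels):
    """Sketch of 'Balanced' class weights: weight_k = max_count / count_k.
    Mirrors CatBoost's documented formula for unit sample weights."""
    counts = Counter(labels)
    max_count = max(counts.values())
    return {cls: max_count / n for cls, n in counts.items()}

# 90/10 imbalance: the minority class gets 9x the weight.
y = [0] * 90 + [1] * 10
weights = balanced_weights(y)
```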
Visualization
```python
model.plot_tree(tree_idx=0)
# or
from catboost import MetricVisualizer
MetricVisualizer(['train_dir']).start()
```

Shows training plots right in Jupyter.
Use cases
Yandex ecosystem
Search ranking and recommendations; CatBoost is used internally.
Fintech
Credit scoring with categorical features.
E-commerce
Product categorization, churn prediction.
Marketing
Lead scoring, attribution.
In an interview
«Why CatBoost?» Native categorical handling and good defaults.
«CatBoost vs XGBoost?» CatBoost is stronger on high-cardinality categorical data; XGBoost is faster and has the larger community.
«Is it used at Yandex?» Yes, it is the primary tool for tabular ML there.
Common mistakes
Not specifying cat_features
CatBoost then treats them as numeric, so the main benefit is lost.
Ignoring the defaults
«Every parameter must be tuned» is not always true with CatBoost; the defaults are strong.
Small dataset with all-categorical features
CatBoost can overfit here. Regularize or simplify the model.
Related topics
- XGBoost for Analysts
- Random Forest for Analysts
- XGBoost vs Random Forest
- One-hot encoding in simple terms
FAQ
Is it open source?
Yes, Apache 2.0.
Is it the best choice for Russian companies?
For Yandex-culture companies, yes. Elsewhere it is a matter of preference.
Is it production-ready?
Yes, it is widely deployed and inference is fast.
Practice ML: open the trainer with 1500+ interview questions.