CatBoost for the analyst

Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions. SQL, Python, A/B tests, metrics. Free.

Why this matters

CatBoost is Yandex's ML library. At Russian companies (especially in the Yandex ecosystem) it is often the de facto standard. It differs from XGBoost / LightGBM in its native categorical feature handling and robust defaults.

CatBoost comes up in interviews at Yandex and other Russian tech companies. Good to know as an alternative.

What CatBoost is

Gradient boosting from Yandex:

  • Open-source
  • Native categorical feature handling
  • Good defaults (minimal tuning)
  • GPU support
  • ML library with built-in visualization tools

Categorical features

The classic approach

One-hot encoding:

country: [RU, US, KZ] → 3 new columns

For high-cardinality features (1000+ unique values) this explodes the dimensionality.

Label encoding

RU → 1, US → 2, KZ → 3

The tree reads this as 3 > 2 > 1, an ordering with no real meaning.

CatBoost approach

Ordered target statistics: each row is encoded using target statistics computed only from preceding rows (over random permutations), which avoids target leakage.

Works out of the box and usually gives the best performance on categorical data.
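
A minimal pure-Python sketch of the idea (simplified to a single pass in data order with prior smoothing; real CatBoost averages over several random permutations):

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using target stats from *earlier* rows only,
    so a row's own label never leaks into its encoding."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        sums[cat] = s + y          # update AFTER encoding the row
        counts[cat] = n + 1
    return encoded

enc = ordered_target_stats(["RU", "RU", "US", "RU"], [1, 0, 1, 1])
# The first "RU" row sees no history, so it falls back to the prior (0.5).
```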

In Python

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=['country', 'gender', 'device']  # list the categorical columns
)

model.fit(X_train, y_train, eval_set=(X_val, y_val))
predictions = model.predict(X_test)

Pros

  • Handles categoricals natively: the main advantage
  • Good defaults: less tuning needed
  • Robust to overfitting
  • Built-in visualization of training progress
  • GPU support

Cons

  • Slower training than XGBoost / LightGBM
  • Bigger models than LightGBM
  • Smaller community than XGBoost (but solid enough)

CatBoost vs XGBoost vs LightGBM

                    CatBoost   XGBoost       LightGBM
Speed               Medium     Medium        Fast
Categorical         Native     Manual        Native (ish)
Defaults            Good       Need tuning   Medium
Community           Medium     Largest       Medium
Russian ecosystem   Primary    Common        Common

Hyperparameters

Key parameters

  • iterations (n_estimators): number of trees (500-5000)
  • learning_rate: the default 0.03 is usually fine
  • depth: tree depth, typically 4-10
  • l2_leaf_reg: L2 regularization strength

Auto-tuning

CatBoost has a built-in grid search:

model.grid_search(param_grid, X, y)

Or use Optuna (the usual choice).
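
The tuning loop itself is simple. A minimal random-search sketch with a stand-in `score` function (hypothetical; swap in real CatBoost training plus a validation metric):

```python
import random

# Search space over the key CatBoost parameters listed above.
param_space = {
    "iterations": [500, 1000, 2000],
    "learning_rate": [0.01, 0.03, 0.1],
    "depth": [4, 6, 8, 10],
    "l2_leaf_reg": [1, 3, 10],
}

def score(params):
    # Stand-in for: train CatBoost with `params`, return validation loss.
    return abs(params["depth"] - 6) + params["learning_rate"]

random.seed(0)
best, best_params = float("inf"), None
for _ in range(20):
    candidate = {k: random.choice(v) for k, v in param_space.items()}
    s = score(candidate)
    if s < best:
        best, best_params = s, candidate
```

Optuna does the same loop with smarter sampling and pruning of bad trials.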

Categorical encoding

Specify features

cat_features = ['country', 'device_type']
model.fit(X, y, cat_features=cat_features)

Indices or names

You can pass positional indices or column names. CatBoost does not auto-detect categoricals, so declare them explicitly.

Target encoding

CatBoost uses ordered target statistics internally, which avoids target leakage.

Early stopping

model = CatBoostClassifier(iterations=1000, early_stopping_rounds=20)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

Training stops when the validation metric stops improving.
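
The mechanism is a patience counter. A minimal pure-Python sketch, assuming a lower-is-better metric:

```python
def best_iteration(val_metrics, patience=20):
    """Return the index to stop at: the best iteration so far, once the
    metric has not improved for `patience` consecutive rounds."""
    best, best_i = float("inf"), 0
    for i, m in enumerate(val_metrics):
        if m < best:
            best, best_i = m, i
        elif i - best_i >= patience:
            break  # no improvement for `patience` rounds -> stop
    return best_i

# Metric improves, then plateaus: training stops, the best model is kept.
metrics = [0.9, 0.7, 0.6] + [0.65] * 30
```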

SHAP integration

import shap

explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)

Works the same as with other tree models.

GPU support

model = CatBoostClassifier(task_type='GPU')

Often faster on big datasets.

Built-in CV

from catboost import Pool, cv

pool = Pool(X, y, cat_features=['country'])
params = {'iterations': 500, 'depth': 6, 'loss_function': 'Logloss'}
cv_results = cv(pool, params, fold_count=5)

Imbalanced data

model = CatBoostClassifier(class_weights=[1, 10])

Or auto_class_weights='Balanced'.
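
Roughly, balancing upweights rare classes inversely to their frequency. A sketch of the common n / (k * count) formula (the sklearn-style variant; CatBoost's exact 'Balanced' formula may differ in scaling):

```python
from collections import Counter

def balanced_weights(y):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 90/10 imbalance: the minority class gets ~9x the majority's weight.
w = balanced_weights([0] * 90 + [1] * 10)
```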

Visualization

model.plot_tree(tree_idx=0)
# Or
from catboost import MetricVisualizer
MetricVisualizer(['train_dir']).start()

Training plots render live in Jupyter.

Use cases

Yandex ecosystem

Search ranking and recommendations use CatBoost internally.

Fintech

Credit scoring with categorical features.

E-commerce

Product categorization, churn.

Marketing

Lead scoring, attribution.

In interviews

«Why CatBoost?» Native categorical handling, good defaults.

«vs XGBoost?» CatBoost is better on high-cardinality categoricals; XGBoost is faster and its models are smaller.

«At Yandex?» Yes, the primary tool for tabular ML.

Common mistakes

Not specifying cat_features

CatBoost then treats them as numeric → no benefit.

Ignoring the defaults

«Every parameter must be tuned» is not always true with CatBoost.

Small dataset + all-categorical features

CatBoost can overfit here. Regularize / simplify.


FAQ

Open-source?

Yes, Apache 2.0.

Best for Russian companies?

For Yandex-culture companies, yes. For others it is a matter of preference.

Production-ready?

Yes, widely deployed. Fast inference.


Practice ML: open the trainer with 1500+ interview questions.