CatBoost for the analyst

Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions. SQL, Python, A/B tests, metrics. Free.

Why this matters

CatBoost is Yandex's ML library. At Russian companies (especially in the Yandex ecosystem) it is often the de facto standard. It differs from XGBoost / LightGBM in its native categorical feature handling and robust defaults.

CatBoost comes up in interviews at Yandex and other Russian tech companies. Good to know as an alternative.

What CatBoost is

Gradient boosting from Yandex:

  • Open-source
  • Native categorical feature handling
  • Good defaults (minimal tuning)
  • GPU support
  • ML library with built-in visualization tools

Categorical features

The classic approach

One-hot encoding:

country: [RU, US, KZ] → 3 new columns

For high-cardinality features (1000+ unique values) this explodes the dimensionality.

Label encoding

RU → 1, US → 2, KZ → 3

The tree reads this as 3 > 2 > 1, an ordering with no real meaning.

CatBoost approach

Ordered target statistics: each row is encoded using target statistics computed only from preceding rows (over random permutations), which avoids target leakage.

Works out of the box and usually gives the best performance on categorical data.
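
A minimal pure-Python sketch of the idea (simplified to a single pass in data order with prior smoothing; real CatBoost averages over several random permutations):

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using target stats from *earlier* rows only,
    so a row's own label never leaks into its encoding."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        sums[cat] = s + y          # update AFTER encoding the row
        counts[cat] = n + 1
    return encoded

enc = ordered_target_stats(["RU", "RU", "US", "RU"], [1, 0, 1, 1])
# The first "RU" row sees no history, so it falls back to the prior (0.5).
```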

In Python

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    cat_features=['country', 'gender', 'device']  # list the categorical columns
)

model.fit(X_train, y_train, eval_set=(X_val, y_val))
predictions = model.predict(X_test)

Pros

  • Handles categoricals natively: the main advantage
  • Good defaults: less tuning needed
  • Robust to overfitting
  • Built-in visualization of training progress
  • GPU support

Cons

  • Slower training than XGBoost / LightGBM
  • Bigger models than LightGBM
  • Smaller community than XGBoost (but solid enough)

CatBoost vs XGBoost vs LightGBM

                    CatBoost   XGBoost       LightGBM
Speed               Medium     Medium        Fast
Categorical         Native     Manual        Native (ish)
Defaults            Good       Need tuning   Medium
Community           Medium     Largest       Medium
Russian ecosystem   Primary    Common        Common

Hyperparameters

Key parameters

  • iterations (n_estimators): number of trees (500-5000)
  • learning_rate: the default 0.03 is usually fine
  • depth: tree depth, typically 4-10
  • l2_leaf_reg: L2 regularization strength

Auto-tuning

CatBoost has a built-in grid search:

model.grid_search(param_grid, X, y)

Or use Optuna (the usual choice).
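
The tuning loop itself is simple. A minimal random-search sketch with a stand-in `score` function (hypothetical; swap in real CatBoost training plus a validation metric):

```python
import random

# Search space over the key CatBoost parameters listed above.
param_space = {
    "iterations": [500, 1000, 2000],
    "learning_rate": [0.01, 0.03, 0.1],
    "depth": [4, 6, 8, 10],
    "l2_leaf_reg": [1, 3, 10],
}

def score(params):
    # Stand-in for: train CatBoost with `params`, return validation loss.
    return abs(params["depth"] - 6) + params["learning_rate"]

random.seed(0)
best, best_params = float("inf"), None
for _ in range(20):
    candidate = {k: random.choice(v) for k, v in param_space.items()}
    s = score(candidate)
    if s < best:
        best, best_params = s, candidate
```

Optuna does the same loop with smarter sampling and pruning of bad trials.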

Categorical encoding

Specify features

cat_features = ['country', 'device_type']
model.fit(X, y, cat_features=cat_features)

Indices or names

You can pass positional indices or column names. CatBoost does not auto-detect categoricals, so declare them explicitly.

Target encoding

CatBoost uses ordered target statistics internally, which avoids target leakage.

Early stopping

model = CatBoostClassifier(iterations=1000, early_stopping_rounds=20)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

Training stops when the validation metric stops improving.
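
The mechanism is a patience counter. A minimal pure-Python sketch, assuming a lower-is-better metric:

```python
def best_iteration(val_metrics, patience=20):
    """Return the index to stop at: the best iteration so far, once the
    metric has not improved for `patience` consecutive rounds."""
    best, best_i = float("inf"), 0
    for i, m in enumerate(val_metrics):
        if m < best:
            best, best_i = m, i
        elif i - best_i >= patience:
            break  # no improvement for `patience` rounds -> stop
    return best_i

# Metric improves, then plateaus: training stops, the best model is kept.
metrics = [0.9, 0.7, 0.6] + [0.65] * 30
```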

SHAP integration

import shap

explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)

Works the same as with other tree models.

GPU support

model = CatBoostClassifier(task_type='GPU')

Often faster on big datasets.

Built-in CV

from catboost import Pool, cv

pool = Pool(X, y, cat_features=['country'])
params = {'iterations': 500, 'depth': 6, 'loss_function': 'Logloss'}
cv_results = cv(pool, params, fold_count=5)

Imbalanced data

model = CatBoostClassifier(class_weights=[1, 10])

Or auto_class_weights='Balanced'.
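
Roughly, balancing upweights rare classes inversely to their frequency. A sketch of the common n / (k * count) formula (the sklearn-style variant; CatBoost's exact 'Balanced' formula may differ in scaling):

```python
from collections import Counter

def balanced_weights(y):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 90/10 imbalance: the minority class gets ~9x the majority's weight.
w = balanced_weights([0] * 90 + [1] * 10)
```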

Visualization

model.plot_tree(tree_idx=0)
# Or
from catboost import MetricVisualizer
MetricVisualizer(['train_dir']).start()

Training plots render live in Jupyter.

Use cases

Yandex ecosystem

Search ranking and recommendations use CatBoost internally.

Fintech

Credit scoring with categorical features.

E-commerce

Product categorization, churn.

Marketing

Lead scoring, attribution.

In interviews

«Why CatBoost?» Native categorical handling, good defaults.

«vs XGBoost?» CatBoost is better on high-cardinality categoricals; XGBoost is faster and its models are smaller.

«At Yandex?» Yes, the primary tool for tabular ML.

Common mistakes

Not specifying cat_features

CatBoost then treats them as numeric → no benefit.

Ignoring the defaults

«Every parameter must be tuned» is not always true with CatBoost.

Small dataset + all-categorical features

CatBoost can overfit here. Regularize / simplify.


FAQ

Open-source?

Yes, Apache 2.0.

Best for Russian companies?

For Yandex-culture companies, yes. For others it is a matter of preference.

Production-ready?

Yes, widely deployed. Fast inference.


Practice ML: open the trainer with 1500+ interview questions.