Tabular deep learning in Data Scientist interviews
Карьерник — a Duolingo for analysts: practice SQL, Python, A/B testing, statistics, metrics, and 3 more interview topics 10 minutes a day. 1500+ questions in a Telegram bot. Free.
Contents:
- Why DL for tabular data
- TabNet
- FT-Transformer
- SAINT
- DL vs gradient boosting
Why DL for tabular data
Gradient-boosted trees (XGBoost / LightGBM) dominate tabular data; deep learning is typically slower and weaker in this niche.
But research keeps progressing, and in some scenarios DL is emerging as the better option:
- Very large datasets.
- Multi-task learning.
- Pre-training + transfer.
- Mixed modality (tabular + text + image).
TabNet
Google, 2019. Attention-based feature selection per sample: different samples attend to different feature subsets.
Pros: interpretable, sample-level feature importance.
Cons: does not consistently beat LightGBM on benchmarks.
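The core idea — each sample gets its own soft feature-selection mask — can be sketched in a few lines of NumPy. This is a toy illustration, not TabNet itself: the real model learns the mask weights and uses sparsemax (which zeroes features out entirely) rather than the softmax used here; `W` below is just a random stand-in for a learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy batch: 4 samples, 5 features (synthetic data).
X = rng.normal(size=(4, 5))

# A learned projection would produce per-sample mask logits;
# W is random here, purely to illustrate the mechanism.
W = rng.normal(size=(5, 5))
masks = softmax(X @ W)     # one soft feature mask per sample
X_selected = masks * X     # each row keeps "its own" features

print(masks.shape)                          # (4, 5)
print(np.allclose(masks.sum(axis=1), 1.0))  # each mask is a distribution
```

Because the mask depends on the sample, two rows with the same columns can end up relying on different features — which is exactly what gives TabNet its sample-level feature importance.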
FT-Transformer
Yandex, 2021. A standard Transformer applied to tabular data.
Tokens are features: each feature gets its own embedding, then standard self-attention.
Input: [age=30, income=50k, country=RU]
Tokens: [emb_age(30), emb_income(50k), emb_country(RU)]
→ Transformer
→ Prediction (e.g., classification)
Competitive with GBM on larger datasets.
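The tokenization step above can be sketched directly: numeric features become tokens via a per-feature affine map (x_i * w_i + b_i), categorical features via an embedding lookup. All weights below are random stand-ins for learned parameters, and the real FT-Transformer also prepends a [CLS] token before the self-attention layers, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)
d_token = 8  # embedding width (hypothetical choice)

# Numeric features: token_i = x_i * weight_i + bias_i.
num_names = ["age", "income"]
W_num = rng.normal(size=(len(num_names), d_token))
b_num = rng.normal(size=(len(num_names), d_token))

# Categorical features: plain embedding lookup.
country_vocab = {"RU": 0, "US": 1}
E_country = rng.normal(size=(len(country_vocab), d_token))

def tokenize(age, income, country):
    x_num = np.array([age, income], dtype=float)
    num_tokens = x_num[:, None] * W_num + b_num          # (2, d_token)
    cat_token = E_country[country_vocab[country]][None]  # (1, d_token)
    return np.concatenate([num_tokens, cat_token])       # (3, d_token)

tokens = tokenize(30, 50_000, "RU")
print(tokens.shape)  # (3, 8): one token per feature, ready for self-attention
```

The output is a sequence of feature tokens, so the rest of the model really is a standard Transformer encoder with no tabular-specific machinery.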
SAINT
Self-Attention and Intersample Attention. 2021. Combines:
- Self-attention (between features).
- Intersample attention (between rows).
State-of-the-art among tabular deep learning models as of 2023.
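Intersample attention is ordinary attention applied across rows instead of across features: each row in the batch attends to the other rows. A simplified NumPy sketch, assuming each row has already been encoded into a single vector (the real SAINT flattens each row's token sequence first and uses learned query/key/value projections, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

# Batch of 6 rows, each encoded as an 8-dim vector (synthetic).
B, d = 6, 8
H = rng.normal(size=(B, d))

# Self-attention mixes information within one row's feature tokens;
# intersample attention applies the same mechanism across the batch:
# queries, keys, and values are the rows themselves.
H_inter = attention(H, H, H)

print(H_inter.shape)  # (6, 8): each row is now a weighted mix of similar rows
```

The practical effect is kNN-like: a row can "borrow" signal from similar rows in the batch, which helps on noisy or sparse tabular data.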
DL vs gradient boosting
| | GBM (LightGBM/XGB/CatBoost) | Tabular DL |
|---|---|---|
| Small dataset | Wins | Loses |
| Large dataset (>1M rows) | Sometimes ties | Sometimes wins |
| Training speed | Fast | Slow |
| Interpretability | OK | Variable |
| Multi-task | Hard | Native |
| Mixed modalities | Hard | Native |
Practical advice: start with a GBM baseline; explore DL only if you have a specific reason (multi-task, mixed modality, huge dataset).
Related topics
- XGBoost vs LightGBM vs CatBoost for DS
- Multi-task learning for DS
- Self-supervised learning for DS
- Feature engineering for DS
- Preparing for a Data Scientist interview
FAQ
Is this official information?
No. The article is based on Arik, 2019 (TabNet) and Gorishniy, 2021 (FT-Transformer).
Practice Data Science — open the trainer with 1500+ interview questions.