Learning curves in the Data Scientist interview
What a learning curve is
A learning curve plots model performance against training set size.
X: number of training samples (1k, 5k, 10k, ..., full set).
Y: train score and validation score.
It shows how the model learns as it gets more data.
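As a rough illustration of the plot described here, the sketch below computes the two curves with scikit-learn's `learning_curve`. The synthetic dataset, the logistic regression model, and the five training sizes are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the training fold
    cv=5,
    scoring="accuracy",
)

# Each score array has shape (n_sizes, n_folds); average over the folds
# to get one point per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):5d}  train={tr:.3f}  val={va:.3f}")
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the X/Y picture above.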
High bias signature
Score
  ^
1.0|
   |
0.6|---train------val---   (both low, plateau)
   |
0.0|________________ N samples
Both curves are low and have converged. More data does not help.
Action: increase model capacity (deeper model, more features) rather than collecting more data.
High variance signature
Score
  ^
1.0|----train-----------
   |        big gap
0.7|   val ↗ (rising with more data)
   |
0.0|________________ N samples
There is a big gap between train and validation scores. Adding data tends to close the gap.
Action: more data, regularization, dropout, simpler model.
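The two signatures can be turned into a rough rule of thumb. The sketch below is only a heuristic: the function name `diagnose`, the `target_score`, and the `gap_tol` threshold are hypothetical and should be set per project.

```python
def diagnose(train_score, val_score, target_score, gap_tol=0.05):
    """Rough bias/variance read of the last point on a learning curve.

    target_score and gap_tol are illustrative assumptions, not standard values.
    """
    gap = train_score - val_score
    if gap > gap_tol:
        return "high variance"  # train score far above validation score
    if train_score < target_score:
        return "high bias"      # both scores low and close together
    return "ok"


# Final scores read off the two sketches above.
print(diagnose(0.60, 0.58, target_score=0.85))  # high bias
print(diagnose(1.00, 0.70, target_score=0.85))  # high variance
```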
Sample size decisions
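A learning curve helps decide whether collecting more data is worth it.
If the validation score is plateauing: more data brings no benefit, so focus on model improvements.
If the validation score is still rising: more data would help, which justifies investing in labeling.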
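One simple way to make "plateauing" concrete is to look at the slope of the last segment of the validation curve. This is a minimal sketch; the helper name and the `min_gain_per_1k` threshold are assumptions, and the example curves are made up.

```python
def val_is_plateauing(sizes, val_scores, min_gain_per_1k=0.001):
    """True if the last segment of the validation curve is nearly flat.

    min_gain_per_1k is an illustrative threshold: the minimum score gain
    per 1,000 extra samples that still counts as "rising".
    """
    d_score = val_scores[-1] - val_scores[-2]
    d_size = sizes[-1] - sizes[-2]
    gain_per_1k = d_score / d_size * 1000
    return gain_per_1k < min_gain_per_1k


sizes = [1000, 5000, 10000, 20000]
print(val_is_plateauing(sizes, [0.70, 0.78, 0.80, 0.801]))  # True
print(val_is_plateauing(sizes, [0.60, 0.68, 0.74, 0.80]))   # False
```

With noisy curves, averaging over cross-validation folds before applying the check makes the answer more stable.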
Project planning:
- Need 5% more accuracy?
- Curve says: gain 1% per 10k samples.
- Need ~50k more labeled samples → budget annotation.
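The arithmetic in this planning example is a linear extrapolation, sketched below with a hypothetical helper `samples_needed`. Learning curves usually flatten as data grows, so treat the result as an optimistic lower bound rather than a forecast.

```python
import math


def samples_needed(target_gain, gain_per_batch, batch_size):
    """Linear extrapolation: extra samples required for a target score gain."""
    # round() guards against floating-point noise such as 2.9999999999999996.
    batches = math.ceil(round(target_gain / gain_per_batch, 9))
    return batches * batch_size


# 5% more accuracy at 1% per 10k samples.
print(samples_needed(0.05, 0.01, 10_000))  # 50000
```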
Validation curves
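A validation curve is a different tool: it plots performance against a hyperparameter value rather than training set size.
X: regularization strength.
Y: train score and validation score.
Underfitting region (strong regularization): both scores are low.
Sweet spot: the validation score peaks.
Overfitting region (weak regularization): train score is high, validation score drops.
Pick the hyperparameter value near the sweet spot.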
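A validation curve over regularization strength can be sketched with scikit-learn's `validation_curve`. The dataset, model, and grid of `C` values below are assumptions for illustration. Note that in `LogisticRegression`, `C` is the inverse of regularization strength, so the X axis is flipped relative to the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Small C = strong regularization, large C = weak regularization.
C_values = np.logspace(-4, 2, 7)

train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    param_name="C",
    param_range=C_values,
    cv=5,
    scoring="accuracy",
)

# Average over folds, then take the value where validation score peaks.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_C = C_values[val_mean.argmax()]
print(f"best C = {best_C:g}")
```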
Related topics
- Bias-variance trade-off for DS
- Cross-validation for DS
- Hyperparameter tuning for DS
- Active learning for DS
- Preparing for a Data Scientist interview
FAQ
Is this official information?
No. The article is based on classic ML diagnostics.