Data augmentation in the Data Scientist interview
Why augmentation
Augmentation effectively expands the training set without collecting more data.
- Improves generalization.
- Reduces overfitting.
- Makes the model robust to variations in the input.
Image augmentation
Geometric.
- Random crop, resize.
- Horizontal / vertical flip.
- Rotation.
- Affine, perspective.
Color.
- Brightness, contrast, saturation.
- Hue shift.
- Gray-scale.
Noise.
- Gaussian noise.
- Blur.
- JPEG compression.
Modern tooling. The Albumentations library is fast and lets you compose transforms into a pipeline.
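To make the three families above concrete, here is a minimal library-free sketch in NumPy that chains one geometric, one color, and one noise transform. It is not the Albumentations API; the function name `augment_image`, the brightness range [0.8, 1.2], and the noise sigma of 5 are illustrative assumptions.

```python
import numpy as np

def augment_image(img, rng):
    """Randomly flip, rescale brightness, and add noise to an HxWxC uint8 image."""
    out = img.astype(np.float32)
    # Geometric: horizontal flip with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Color: multiply brightness by a factor in [0.8, 1.2] (assumed range)
    out = out * rng.uniform(0.8, 1.2)
    # Noise: additive Gaussian noise with sigma = 5 intensity levels (assumed)
    out = out + rng.normal(0.0, 5.0, size=out.shape)
    # Return to the valid pixel range and the original dtype
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment_image(image, rng)
```

In training, a function like this would be called on every sample at load time, so the model sees a different variant of the same image each epoch.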
Text augmentation
Trickier: the meaning must be preserved.
- Synonym replacement. Swap words for their synonyms.
- Back-translation. RU → EN → RU.
- Random deletion / insertion.
- Paraphrase generation (with an LLM).
EDA (Easy Data Augmentation). Simple and effective: it combines synonym replacement with random insertion, swap, and deletion.
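A rough sketch of three EDA-style operations on a tokenized sentence, using only the standard library. The `SYNONYMS` table is a hypothetical toy dictionary (a real pipeline would use a thesaurus), and unlike the original EDA recipe this version replaces every word that has a synonym rather than a fixed number of random words.

```python
import random

# Hypothetical toy synonym table, for illustration only
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def synonym_replace(words, rng):
    """Replace each word that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def random_delete(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, rng):
    """Swap the words at two distinct, randomly chosen positions."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)
sentence = "the quick dog looks happy today".split()
replaced = synonym_replace(sentence, rng)
deleted = random_delete(sentence, 0.2, rng)
swapped = random_swap(sentence, rng)
```

Deletion and swap can change the meaning of short sentences, so the deletion probability is usually kept small.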
Tabular augmentation
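SMOTE. For class imbalance: synthesizes new minority-class samples by interpolating between a minority sample and its nearest minority neighbors.
Mixup on tabular data. Linearly combine pairs of rows.
Noise injection. Add Gaussian noise to numeric features.
Categorical swap. Replace a categorical value with a random draw from the same column's distribution.
Tabular augmentation is less powerful than image augmentation, but it can still help.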
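The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration for numeric features only, not the reference implementation; the function name `smote_like` and its parameters are assumptions made for the example.

```python
import numpy as np

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating minority rows toward near neighbors."""
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distance from row i to every minority row
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        # k nearest neighbors; position 0 of the sort is row i itself when rows are distinct
        neighbors = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbors)
        # Random point on the segment between row i and its neighbor j
        gap = rng.random()
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 3))
X_new = smote_like(X_minority, n_new=10, k=5, rng=rng)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.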
MixUp / CutMix
MixUp. Mix two samples linearly.
x_mix = λ x_i + (1-λ) x_j
y_mix = λ y_i + (1-λ) y_j
λ is sampled from a Beta distribution, typically Beta(α, α).
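The two formulas translate almost directly into NumPy. The sketch below mixes each sample in a batch with a randomly permuted partner; the function name `mixup_batch`, one-hot labels, and `alpha=0.2` are assumptions chosen for the example.

```python
import numpy as np

def mixup_batch(x, y, alpha, rng):
    """Mix every sample (and its one-hot label) with a randomly chosen partner."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient, shared across the batch
    perm = rng.permutation(len(x))     # partner index j for every sample i
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]  # one-hot labels for 3 classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Each mixed label is still a valid probability vector, so it can be used with the usual cross-entropy loss on soft targets.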
CutMix. Cut a patch from one image and paste it onto another. Labels are mixed in proportion to the patch area.
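A minimal sketch of CutMix for a single pair of images, assuming HxWxC float arrays and one-hot labels. The function name `cutmix_pair` and `alpha=1.0` are illustrative choices; the label weight is recomputed from the actual patch area because clipping at the image border can shrink the patch.

```python
import numpy as np

def cutmix_pair(img_a, y_a, img_b, y_b, alpha, rng):
    """Paste a random rectangle of img_b into img_a and mix labels by area."""
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)
    # Patch sides chosen so the patch covers about (1 - lam) of the image
    cut_h = int(h * np.sqrt(1 - lam))
    cut_w = int(w * np.sqrt(1 - lam))
    # Random patch center, with the rectangle clipped to the image borders
    cy, cx = rng.integers(h), rng.integers(w)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[top:bottom, left:right] = img_b[top:bottom, left:right]
    # Label weight of img_a = fraction of pixels that still come from img_a
    lam = 1 - (bottom - top) * (right - left) / (h * w)
    return mixed, lam * y_a + (1 - lam) * y_b, lam

rng = np.random.default_rng(0)
img_a, img_b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed, y_mix, lam = cutmix_pair(img_a, y_a, img_b, y_b, alpha=1.0, rng=rng)
```

With an all-zeros and an all-ones image, the mean of `mixed` equals the pasted area fraction, which makes the label proportion easy to check by eye.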
Pros: strong regularization; often improves accuracy by 1-3%.
AutoAugment / RandAugment
AutoAugment. An augmentation policy learned with RL; specific to the dataset it was searched on.
RandAugment. Applies N randomly chosen transforms, each with magnitude M. Simpler, with comparable accuracy.
RandAugment(n=2, m=15) # apply 2 random ops with magnitude 15
Standard in modern training pipelines.
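The sampling logic of RandAugment is simple enough to sketch by hand. The version below is an illustration, not any library's implementation: the four ops, their strength formulas, and the magnitude scale `MAX_MAGNITUDE = 30` are all assumptions, and images are assumed to be float arrays in [0, 1].

```python
import numpy as np

MAX_MAGNITUDE = 30  # assumed upper bound of the magnitude scale

def brightness(img, s):
    return img * (1 + s)

def contrast(img, s):
    mean = img.mean()
    return (img - mean) * (1 + s) + mean

def shift_x(img, s):
    # Horizontal shift with wrap-around, up to 30% of the width at full strength
    return np.roll(img, int(s * 0.3 * img.shape[1]), axis=1)

def solarize(img, s):
    # Invert pixels at or above a threshold that drops as strength grows
    return np.where(img >= 1 - s, 1 - img, img)

OPS = [brightness, contrast, shift_x, solarize]

def rand_augment(img, n, m, rng):
    """Apply n uniformly sampled ops (with replacement), all at magnitude m."""
    strength = m / MAX_MAGNITUDE
    for idx in rng.integers(len(OPS), size=n):
        img = OPS[idx](img, strength)
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = rand_augment(image, n=2, m=15, rng=rng)
```

The appeal of this design is that the whole search space collapses to two integers, `n` and `m`, which can be tuned with a plain grid search.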
FAQ
Is this official information?
No. The article is based on CV / NLP industry practices.