Image classifier ML system design in a Data Scientist interview
Problem statement
Image classifier for product photos, medical images, or content moderation.
Constraints:
- 1M+ images training.
- A p99 latency target at inference (tail latency, not the average).
- Continuous data flow (new categories, new images).
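A p99 constraint is checked against measured request latencies, not the mean. The sketch below uses the nearest-rank definition of a percentile; the function name `percentile` and the latency samples are made up for illustration.

```python
def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples <= it.

    q is an integer in 1..100. Integer arithmetic avoids float rounding in the rank.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # ceil(q * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [12.0] * 98 + [40.0, 250.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The median looks healthy here while the p99 exposes the slow tail, which is why interview answers usually state the latency budget as a tail percentile.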
Data pipeline
Raw images → annotation → preprocessing → train / val / test split → augmentation → model.
Annotation. Internal labelers (using Label Studio, V7) or an external crowd (Toloka).
Preprocessing. Resize, normalize, enforce a consistent format.
Quality check. Look for label disagreement between annotators, mislabeled examples, and near-duplicates.
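Two of the pipeline steps above can be sketched with the standard library: a deterministic split keyed on the image ID (so an image never moves between splits when new data arrives), and a majority vote that flags items where annotators disagree. The function names, the 80/10/10 proportions, and the 0.75 agreement threshold are illustrative assumptions, not values from the article.

```python
import hashlib
from collections import Counter


def assign_split(image_id, val_pct=10, test_pct=10):
    """Map an image ID to train/val/test via a stable hash bucket in 0..99."""
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def majority_label(votes, min_agreement=0.75):
    """Return (label, agreement, needs_review) for one image's annotator votes."""
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return label, agreement, agreement < min_agreement


split = assign_split("img_001.jpg")
clean = majority_label(["cat", "cat", "cat"])
noisy = majority_label(["cat", "dog", "bird"])
```

To keep near-duplicates from leaking across splits, one option is to hash a group key (for example a product ID) instead of the individual image ID.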
Augmentation
Increases the effective dataset size.
- Random crop / resize.
- Flips, rotations.
- Color jitter. Brightness, contrast, saturation.
- CutMix, MixUp. Stronger: they blend pairs of images and their labels.
- AutoAugment. Learned policy.
- RandAugment. Simple, effective.
Typically improves generalization by roughly 1-5%.
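As an illustration of the stronger augmentations listed above, here is a minimal MixUp sketch on flattened pixel lists and one-hot labels. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution; `alpha=0.2` and the toy inputs are assumptions for the example, and a real pipeline would apply this to tensors per batch.

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=random):
    """Blend two examples: the same coefficient lam mixes pixels and labels."""
    lam = rng.betavariate(alpha, alpha)
    image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label, lam


# Toy example: two 2-pixel "images" with one-hot labels for two classes.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixed_image, mixed_label, lam = mixup(
    [0.0, 1.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
    rng=rng,
)
```

The mixed label is a soft target, so the loss has to accept probability vectors rather than a single class index.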
Transfer learning
Pre-trained backbone. ResNet, EfficientNet, ViT, ConvNeXt.
Pre-trained on ImageNet-1k (about 1.28M images, 1000 classes).
Fine-tune on your domain (10k-100k images).
Strategies:
- Linear probe. Freeze the backbone, train only the classifier head. Fast, with acceptable accuracy.
- Full fine-tune. Update all weights. Best accuracy, but slower.
- Discriminative fine-tune. A different learning rate per layer (high for the head, low for early layers).
A DINO (self-supervised) or CLIP (image-text contrastive) backbone often transfers better than an ImageNet-supervised one.
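One common way to realize per-layer learning rates is a geometric decay from the head back to the earliest layer. The sketch below only computes the rates; the layer names, `head_lr=1e-3`, and `decay=0.5` are hypothetical values chosen for illustration. In a framework such as PyTorch the resulting values would typically be passed to the optimizer as per-group learning rates.

```python
def layerwise_learning_rates(layer_names, head_lr=1e-3, decay=0.5):
    """Assign head_lr to the last layer and shrink it by `decay` per step
    toward the input. layer_names must be ordered from input to head."""
    last = len(layer_names) - 1
    return {
        name: head_lr * decay ** (last - index)
        for index, name in enumerate(layer_names)
    }


# Hypothetical backbone split into three stages plus a classifier head.
layers = ["stem", "stage1", "stage2", "head"]
rates = layerwise_learning_rates(layers)
```

Setting the rate of every backbone layer to zero recovers the linear probe, and `decay=1.0` recovers a plain full fine-tune with a single learning rate.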
Deployment
Model formats. ONNX → TensorRT for GPU, Core ML for iOS.
Quantization. INT8 typically gives a 2-4× speedup with < 1% accuracy drop.
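To make the accuracy cost concrete, here is a sketch of symmetric per-tensor INT8 quantization: one scale maps the largest absolute weight to 127, and the round-trip error per weight is at most half a quantization step. This is a simplified illustration with made-up weights; production toolchains such as TensorRT also calibrate activations and often use per-channel scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.4, 0.0, 0.3, 1.0]  # toy weights for the example
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

A single outlier weight inflates the scale and wastes resolution for all other weights, which is one reason per-channel scales are popular.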
Batching. Triton Inference Server with dynamic batching for throughput.
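The idea behind dynamic batching can be sketched in a few lines: wait for the first request, then keep collecting until the batch is full or a short deadline passes. This is a conceptual illustration, not Triton's implementation; `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.005):
    """Block for the first request, then fill the batch until it is full
    or max_wait_s has elapsed since the first request arrived."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


# Ten queued requests, served in batches of at most four.
pending = queue.Queue()
for request_id in range(10):
    pending.put(request_id)
first_batch = collect_batch(pending, max_batch_size=4, max_wait_s=1.0)
```

The wait deadline is the trade-off knob: a longer wait raises GPU utilization and throughput but adds directly to tail latency.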
Edge. MobileNet / EfficientNet-Lite for phones.
Cloud. Bigger models, better accuracy. Latency is network-bound.
Edge cases
Class imbalance. Rare classes underperform. Use class weights or oversampling.
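A simple way to derive class weights is inverse frequency, normalized so a perfectly balanced dataset gets weight 1.0 for every class: weight = N / (K × count), with N examples and K classes. The sketch below uses made-up label counts; the weights would then be passed to a weighted loss.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy dataset: 90 examples of a common class and 10 of a rare one.
labels = ["common"] * 90 + ["rare"] * 10
weights = balanced_class_weights(labels)
```

Here each rare example counts nine times as much as a common one in the loss, which matches the 90:10 imbalance.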
Domain shift. Trained on stock photos, deployed on noisy real-world photos. Mitigate with augmentation and fine-tuning on target-domain data.
Adversarial. Improve robustness with adversarial training.
OOD detection. Calibrate confidence and reject low-confidence inputs instead of misclassifying them confidently.
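A baseline for the reject option is a threshold on the maximum softmax probability, optionally after temperature scaling for calibration. In the sketch below the threshold of 0.7, the class names, and the logits are illustrative assumptions; in practice the temperature and threshold would be fit on a held-out validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, class_names, threshold=0.7, temperature=1.0):
    """Return (class_name, confidence), or (None, confidence) when rejecting."""
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return class_names[best], probs[best]


classes = ["shoe", "bag", "watch"]  # hypothetical product categories
confident = predict_or_reject([4.0, 0.5, 0.1], classes)
uncertain = predict_or_reject([1.0, 0.9, 0.8], classes)
```

Rejected inputs can be routed to human review, which also produces labeled examples of new or shifted categories.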
Concept drift. New categories appear over time. Use continual learning or periodic retraining.
Related topics
- CNN architectures for DS
- Image segmentation for DS
- YOLO and DETR for DS
- Inference optimization for DS
- Preparing for a Data Scientist interview
FAQ
Is this official information?
No. The article is based on industry computer vision practices.