Model serving in the ML Engineer interview

Q: Which framework should you choose?

PyTorch model → TorchServe or BentoML. TensorFlow → TF Serving. Multi-framework → ONNX Runtime / Triton.

Q: Self-hosted or managed?

Managed (Yandex DataSphere, SberCloud) suits a small team. Self-hosted (k8s + your own stack) suits teams that need control or lower cost.

Q: GPU vs CPU?

GPU for deep learning (BERT-class models and larger, LLMs). CPU is enough for classical models (LightGBM, sklearn), even at thousands of QPS.

Q: Latency budget: how do you allocate it?

A typical split: feature lookup 10-20 ms, model forward pass 20-50 ms, post-processing 5-10 ms. Profile, then optimize the most expensive stage first.
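To make the "profile first" advice concrete, here is a minimal sketch of per-stage timing. The `StageTimer` class, the stage names, and the `time.sleep` calls standing in for real work are all illustrative assumptions, not part of any serving framework.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage, in milliseconds."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Name of the stage that consumed the most time so far."""
        return max(self.durations_ms, key=self.durations_ms.get)


def handle_request(timer):
    # The sleeps are placeholders for real feature lookup, inference, etc.
    with timer.stage("feature_lookup"):
        time.sleep(0.002)
    with timer.stage("model_forward"):
        time.sleep(0.030)
    with timer.stage("post_processing"):
        time.sleep(0.001)


if __name__ == "__main__":
    timer = StageTimer()
    handle_request(timer)
    for name, ms in timer.durations_ms.items():
        print(f"{name}: {ms:.1f} ms")
    print("optimize first:", timer.slowest())
```

In a real service the same numbers would normally go to a metrics system rather than stdout, but the idea is the same: measure each stage before deciding what to optimize.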

Q: How do you A/B test models?

With traffic splitting at the serving level, e.g. 95% to prod and 5% to the candidate. Compare prediction quality and business metrics.
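One common way to implement such a split is to hash a stable user identifier, so the same user always lands on the same model. The sketch below assumes this hash-based approach. The function name, the `salt` parameter, and the 10,000-bucket resolution are illustrative choices, not something the text prescribes.

```python
import hashlib


def assign_variant(user_id: str, candidate_share: float = 0.05, salt: str = "exp-1") -> str:
    """Deterministically route a user to 'prod' or 'candidate'.

    The salt (hypothetical experiment name) reshuffles users between experiments.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets; the first N buckets go to the candidate.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = int(round(candidate_share * 10_000))
    return "candidate" if bucket < threshold else "prod"


if __name__ == "__main__":
    counts = {"prod": 0, "candidate": 0}
    for i in range(100_000):
        counts[assign_variant(f"user-{i}")] += 1
    print(counts)
```

Deterministic assignment matters for comparing business metrics: a user who bounced between models would contaminate both groups.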

Why model serving comes up in the MLE interview

Model serving means delivering a model's predictions in production. In an ML Engineer interview it is the core technical skill: the DS trains the model, the MLE serves it. Weak serving means slow or unreliable predictions, and that makes the model useless.

A weak answer is "a Flask app". A strong one covers batch vs online, latency targets, throughput, scaling, and hardware acceleration.

Batch vs online inference

Batch:

Predictions for all users once a day / hour
Stored in a database and read at request time
Cheap to compute; serving latency is just a lookup
Use case: recommendations (predicted in the morning, shown all day)

Online (real-time):

Prediction on every user request
Low latency required (typically < 100 ms)
Use case: anti-fraud, dynamic pricing, search ranking

Near real-time / streaming:

Predictions for events with a delay of seconds
Via Kafka Streams / Flink
Use case: notification triggers

Often a hybrid: batch for slow features + online for real-time scoring.
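A minimal sketch of the hybrid pattern: slow features are precomputed by a batch job and looked up at request time, then merged with features that are only known when the request arrives. The feature names, the in-memory `batch_store` dict (standing in for an online feature store), and `toy_model` are all hypothetical.

```python
from typing import Callable, Dict, Mapping

Features = Dict[str, float]

# Hypothetical defaults for users with no batch row yet (e.g. brand-new users).
DEFAULT_BATCH_FEATURES: Features = {"purchases_30d": 0.0, "avg_check_30d": 0.0}


def score_request(
    user_id: str,
    request_features: Features,
    batch_store: Mapping[str, Features],
    model: Callable[[Features], float],
) -> float:
    """Merge precomputed (batch) features with request-time features and score."""
    slow = batch_store.get(user_id, DEFAULT_BATCH_FEATURES)
    features = {**slow, **request_features}
    return model(features)


def toy_model(features: Features) -> float:
    # Stand-in for a real model: a fixed linear combination.
    return 0.1 * features["purchases_30d"] + 0.01 * features["cart_value"]


if __name__ == "__main__":
    store = {"u1": {"purchases_30d": 3.0, "avg_check_30d": 50.0}}
    print(score_request("u1", {"cart_value": 100.0}, store, toy_model))
    print(score_request("new-user", {"cart_value": 100.0}, store, toy_model))
```

The fallback to defaults is a design choice worth mentioning in an interview: the online path must still return something for users the batch job has not seen.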

Latency и throughput

Latency = time to produce one prediction (ms).

Targets:

Search ranking: < 50ms
Anti-fraud: < 100ms
Recommendation: < 200ms
Async (notifications): seconds OK
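The targets above do not say which statistic they apply to. In practice a target is usually checked against a tail percentile rather than the mean, and the sketch below assumes p99. The helper names and the sample values are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile; p is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(latencies_ms, target_ms, p=99):
    """True if the p-th percentile latency is under the target."""
    return percentile(latencies_ms, p) < target_ms


if __name__ == "__main__":
    # Mostly fast requests with a slow tail (made-up numbers).
    latencies = [20.0] * 95 + [120.0] * 5
    print("mean:", sum(latencies) / len(latencies))
    print("p50:", percentile(latencies, 50))
    print("p99:", percentile(latencies, 99))
    print("meets 100 ms anti-fraud target:", meets_target(latencies, 100))
```

The example is built so that the mean looks healthy while the tail violates the target, which is the usual argument for tracking percentiles.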

Throughput = predictions per second (QPS).

Optimization:

Batching (if the added latency is acceptable)
Horizontal scaling (more replicas)
Hardware: GPU vs CPU
Model optimization (quantization, distillation)
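Latency and throughput are linked, and a back-of-the-envelope calculation shows how many replicas horizontal scaling would need. The sketch assumes each worker handles one request at a time with no batching, and that replicas should run below full load. The 70% utilization figure and the function names are assumptions for illustration.

```python
import math


def per_replica_qps(latency_ms: float, concurrent_workers: int) -> float:
    """Max QPS of one replica if each worker serves one request at a time."""
    return concurrent_workers * 1000.0 / latency_ms


def replicas_needed(
    target_qps: float,
    latency_ms: float,
    concurrent_workers: int,
    target_utilization: float = 0.7,
) -> int:
    """Replicas required to serve target_qps while leaving headroom for spikes."""
    if latency_ms <= 0 or concurrent_workers <= 0:
        raise ValueError("latency_ms and concurrent_workers must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps = per_replica_qps(latency_ms, concurrent_workers) * target_utilization
    return max(1, math.ceil(target_qps / usable_qps))


if __name__ == "__main__":
    # 50 ms per prediction, 4 workers per replica: 80 QPS per replica at most.
    print(per_replica_qps(50, 4))
    print(replicas_needed(10_000, latency_ms=50, concurrent_workers=4))
```

The same arithmetic shows why batching and model optimization matter: halving the per-request latency halves the replica count.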

Model serving frameworks

TorchServe (PyTorch):

PyTorch-native
REST + gRPC API
Multi-model serving
Versioning

TensorFlow Serving:

TF-native
Production-grade
High performance

BentoML:

Framework-agnostic
Containerization out-of-box
Adapters for different model types

ONNX Runtime:

Cross-framework (PyTorch + TF + sklearn → ONNX)
Hardware optimization

FastAPI + custom:

Maximum flexibility
DIY production-ready
Popular for small / simple models

Triton Inference Server:

NVIDIA, multi-framework, multi-GPU

The choice depends on the training framework and the load.

Containerization и deployment

Docker:

Reproducible runtime
Dependencies frozen
Один image на model version

Kubernetes:

Horizontal Pod Autoscaling (HPA) on CPU or custom metrics (e.g. GPU utilization, queue depth)
Service mesh для routing
Rolling updates

Serverless:

AWS Lambda / Yandex Cloud Functions
Cold start problems for ML (model load time)
Good for rare / unpredictable traffic

Hardware acceleration

CPU:

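Enough for most models
Lower cost
Neural models: fine at low load (roughly < 100 QPS); classical models (LightGBM, sklearn) can reach thousands of QPS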

GPU (NVIDIA):

Effectively required for large deep learning models (BERT-class and up, LLMs)
Often 10-100x speedup vs CPU
Higher cost ($$$/hour)
Batching is critical for full GPU utilization

Other accelerators:

TPU (Google, not available in Russia)
AWS Inferentia (not available in Russia)
In Russia: A100 / H100, if you can get them

Model optimization

The goal: reduce latency / cost without a significant loss in accuracy.

Quantization: float32 → int8. 4x less memory, faster inference, typically ~1-2% accuracy loss.

Pruning: remove unimportant weights. Smaller model, fewer ops.

Distillation: a student model learns from a teacher. Smaller, with similar accuracy.
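To show where the 4x memory saving comes from, here is a minimal sketch of symmetric per-tensor int8 quantization using only the standard library. Real toolchains (PyTorch, ONNX Runtime, TensorRT) use more elaborate schemes with calibration, so treat this as an illustration of the idea only. The function names are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Symmetric quantization: map floats to int8 in [-127, 127] with one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = array("f", [0.5, -1.0, 0.25, 0.0, 0.8])  # float32 storage
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("bytes float32:", weights.itemsize * len(weights))
    print("bytes int8:   ", q.itemsize * len(q))
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The rounding error per weight is bounded by half the scale, which is why accuracy usually drops only slightly. The actual loss depends on the model and has to be measured.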

ONNX optimization: graph fusion, kernel optimization.

Typical questions

"Real-time vs batch: when do you use which?"

Real-time: the prediction depends on the request (search, dynamic pricing, fraud). Batch: predictions stay stable over the day (daily recommendations).

"How do you cut latency from 200ms to 50ms?"

Profile: find what takes the time (feature lookup vs model forward).
Feature: batch lookup, cache, online store.
Model: quantize, distill, switch to ONNX.
Infra: GPU instead of CPU, larger pod, dedicated hardware.

"Throughput 100 → 10000 QPS: what is the plan?"

Horizontal scale (more replicas).
Batching requests within window.
GPU + larger batch size.
Cache predictions for popular queries.
Caching at the edge / CDN if applicable.

"Cold start for serverless inference?"

Pre-warm replicas or use provisioned concurrency (AWS). Pre-load the model into the image. Alternative: long-running k8s pods.
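"Batching requests within a window" from the plan above can be sketched as follows: a worker pulls requests from a queue until the batch is full or a short deadline passes, then runs one forward pass for the whole batch. The function name and the default sizes are assumptions. Serving frameworks such as Triton and TorchServe provide their own built-in batching with their own configuration.

```python
import queue
import time


def collect_batch(requests, max_batch_size=32, max_wait_s=0.005):
    """Block for the first request, then gather more until full or the window closes."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with no new requests
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(10):
        q.put(i)
    while not q.empty():
        batch = collect_batch(q, max_batch_size=4, max_wait_s=0.05)
        # A real worker would run one model forward pass over the batch here.
        print(len(batch), batch)
```

The window is the trade-off knob: a longer wait yields larger batches and better GPU utilization, at the cost of added latency for the first request in each batch.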

Common mistakes

Flask's built-in dev server with no gunicorn in front (or FastAPI without uvicorn workers). Not production-ready.
No batching on GPU. Utilization sits around 5% and expensive hardware idles.
Loading the model on every request. It should load once at startup.
No autoscaling. A traffic spike makes all requests time out.
No latency / throughput monitoring. You do not know how the service performs in production.
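The "load once" rule can be sketched with a cached loader: the first call pays the load cost and every later request reuses the same object. The `time.sleep` stands in for reading real weights from disk, and the toy model and the `load_stats` counter exist only to make the behaviour observable.

```python
import functools
import time

# Counter used only to demonstrate how many times the model was loaded.
load_stats = {"loads": 0}


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model once per process; later calls return the cached object."""
    load_stats["loads"] += 1
    time.sleep(0.05)  # placeholder for an expensive load from disk / object storage
    return lambda features: sum(features)  # toy stand-in for a real model


def predict(features):
    return get_model()(features)


if __name__ == "__main__":
    get_model()  # warm up at startup so the first user request is not slow
    for request in ([1.0, 2.0], [3.0], [0.5, 0.5]):
        print(predict(request))
    print("model loads:", load_stats["loads"])
```

Calling the loader explicitly at startup, rather than lazily on the first request, also keeps the load time out of user-facing latency.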

FAQ

Which framework should you choose?