MLOps in the ML Engineer Interview

Why MLOps comes up in an MLE interview

MLOps is the practice of automating the ML pipeline from experiment to production. In an ML Engineer interview, MLOps is the main differentiating competency: a data scientist (DS) experiments, an ML engineer (MLE) automates.

A weak answer is "I know MLflow". A strong answer describes the full MLOps lifecycle: data → training → registry → deployment → monitoring.

ML lifecycle

1. Data:

Source connectivity
Data versioning (DVC, lakeFS)
Data quality checks
Feature store

2. Training:

Experiment tracking (MLflow, W&B, ClearML)
Hyperparameter tuning (Optuna, Ray Tune)
Reproducible builds (Docker, dependency pinning)

3. Registry:

Model registry (MLflow Model Registry, BentoML, SageMaker)
Versioning + lineage
Approval workflow (staging → production)

4. Deployment:

Containers (Docker)
Kubernetes / serverless
Strategies: canary, blue-green, shadow
A/B tests of models

5. Monitoring:

Data drift, concept drift
Performance metrics
Latency / cost monitoring
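
To make the data drift item concrete, here is a minimal sketch of one possible drift score, the Population Stability Index (PSI), for a single numeric feature. PSI is not named in the list above; it is swapped in here as an illustration. The function name, the bin count, and the `eps` floor are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a current (production) sample."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 for a constant feature.
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        total = len(values)
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / total, eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


reference_sample = [i / 100 for i in range(100)]
shifted_sample = [x + 0.5 for x in reference_sample]
psi_same = population_stability_index(reference_sample, reference_sample)
psi_shifted = population_stability_index(reference_sample, shifted_sample)
```

A PSI above roughly 0.2 is often quoted as a rule of thumb for a significant shift, but the alert threshold is something to tune per feature rather than a fixed rule.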

Experiment tracking

Without tracking, every experiment run in a notebook is effectively lost. Standard tools as of 2026:

MLflow: open-source and widely used. Tracks params, metrics, and artifacts. Has a UI for comparing runs.

Weights & Biases: managed, with the best UI. Used in both academia and industry.

ClearML: open-source alternative to W&B.

What to track in a run:

Hyperparameters
Train / validation metrics
Model artifact (saved model)
Data version
Code commit hash
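
The fields above can be sketched as a single run record. This is a library-free illustration of what a tracker stores per run, not the API of MLflow, W&B, or ClearML; `RunRecord`, `log_run`, `load_runs`, and `best_run` are hypothetical names, and the sample values are made up.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List


@dataclass
class RunRecord:
    run_id: str
    params: Dict[str, float]   # hyperparameters
    metrics: Dict[str, float]  # train / validation metrics
    artifact_path: str         # where the saved model lives
    data_version: str          # e.g. a DVC or lakeFS revision
    code_commit: str           # git commit hash of the training code


def log_run(record: RunRecord, tracking_dir: Path) -> Path:
    """Persist one run as a JSON file so it can be compared later."""
    tracking_dir.mkdir(parents=True, exist_ok=True)
    out = tracking_dir / f"{record.run_id}.json"
    out.write_text(json.dumps(asdict(record), indent=2, sort_keys=True))
    return out


def load_runs(tracking_dir: Path) -> List[RunRecord]:
    return [RunRecord(**json.loads(p.read_text())) for p in sorted(tracking_dir.glob("*.json"))]


def best_run(runs: List[RunRecord], metric: str) -> RunRecord:
    """Pick the run with the highest value of the given metric."""
    return max(runs, key=lambda r: r.metrics[metric])


with tempfile.TemporaryDirectory() as tmp:
    tracking = Path(tmp)
    log_run(RunRecord("run-a", {"lr": 0.1}, {"val_auc": 0.81}, "models/a.pkl", "data-v3", "abc123"), tracking)
    log_run(RunRecord("run-b", {"lr": 0.01}, {"val_auc": 0.86}, "models/b.pkl", "data-v3", "def456"), tracking)
    winner = best_run(load_runs(tracking), "val_auc")
```

Because the data version and the commit hash are stored next to the metrics, the best run can be traced back to the exact inputs that produced it.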

Model registry

After an experiment, the production-ready model goes into the registry.

MLflow Model Registry:

Versioned models
Stages: None → Staging → Production → Archived (deprecated since MLflow 2.9 in favor of model version aliases and tags)
Annotations: "approved by SRE", "passed A/B"
API for CI/CD pipelines

Lineage:

Each model references: code commit, data version, training config
Reproducibility is the main goal
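
A toy in-memory registry can illustrate how versions, stages, and lineage fit together. This is a sketch under stated assumptions, not the MLflow Model Registry API: the class and method names are hypothetical, and archiving the previous Production version on promotion is one possible policy chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    code_commit: str      # lineage: which code produced the model
    data_version: str     # lineage: which data it was trained on
    training_config: str  # lineage: which config was used
    stage: str = "None"


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, code_commit: str, data_version: str, training_config: str) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, code_commit, data_version, training_config)
        versions.append(mv)
        return mv

    def transition(self, name: str, version: int, stage: str) -> ModelVersion:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "Production":
            # Assumed policy: only one Production version at a time.
            for mv in self._versions[name]:
                if mv.stage == "Production":
                    mv.stage = "Archived"
        target.stage = stage
        return target

    def production(self, name: str) -> Optional[ModelVersion]:
        for mv in self._versions.get(name, []):
            if mv.stage == "Production":
                return mv
        return None


registry = ModelRegistry()
v1 = registry.register("churn", "abc123", "data-v3", "configs/base.yaml")
v2 = registry.register("churn", "def456", "data-v4", "configs/base.yaml")
registry.transition("churn", 1, "Production")
registry.transition("churn", 2, "Staging")
registry.transition("churn", 2, "Production")
live = registry.production("churn")
```

With a registry like this, the question "which model is in production and what was it trained on" has one authoritative answer.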

CI/CD for models

Continuous Integration (CI):

Unit tests for feature engineering and model code
Integration tests with stub data
Reproducibility test (retraining gives the same metrics)

Continuous Deployment (CD):

Deploy the new model version to staging
Shadow mode: the model makes predictions but does not influence responses (compare with prod)
Canary: 5% traffic → 25% → 100%
Rollback on degradation
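
The canary and rollback steps can be sketched as two small functions. This is an illustration under assumptions: routing by a hash of a request ID, the error-rate comparison, and the `max_regression` margin are choices made for the example, and all names are hypothetical.

```python
import hashlib

CANARY_STEPS = (5, 25, 100)  # percent of traffic, matching the rollout above


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically assign a request to the canary by hashing its ID into 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


def next_canary_percent(
    current_percent: int,
    canary_error_rate: float,
    prod_error_rate: float,
    max_regression: float = 0.01,
) -> int:
    """Advance to the next traffic step if the canary is healthy, otherwise roll back to 0."""
    if canary_error_rate > prod_error_rate + max_regression:
        return 0  # rollback on degradation
    for step in CANARY_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully rolled out


request_ids = [f"req-{i}" for i in range(10_000)]
canary_count = sum(routes_to_canary(rid, 5) for rid in request_ids)
```

Hashing the request (or user) ID keeps routing stable across retries, and a request that lands in the canary at 5% stays there at 25%, because the bucket number does not change.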

Tools: GitHub Actions, GitLab CI, Jenkins + MLflow / BentoML.

ML pipelines

Orchestration for training and batch inference.

Airflow: general-purpose orchestrator, also used for ML.

Kubeflow Pipelines: Kubernetes-native ML pipelines.

Metaflow: open-sourced by Netflix, plain-Python API.

Prefect / Dagster: modern alternatives to Airflow.

ZenML / Flyte: ML-specific orchestrators.

In most real-world teams, ML pipelines run as Airflow DAGs.
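
What every orchestrator listed above has in common is running tasks in dependency order. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+) and is not Airflow code; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(tasks: Dict[str, Callable[[], None]], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Run each task after all of its upstream tasks; return the execution order."""
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        tasks[name]()
    return order


executed: List[str] = []
task_names = ["ingest", "validate", "build_features", "train", "evaluate", "register"]
# Each task only records its own name here; real tasks would do the actual work.
tasks = {name: (lambda n=name: executed.append(n)) for name in task_names}

# Maps each task to the set of tasks it depends on.
depends_on = {
    "validate": {"ingest"},
    "build_features": {"validate"},
    "train": {"build_features"},
    "evaluate": {"train"},
    "register": {"evaluate"},
}

order = run_pipeline(tasks, depends_on)
```

A real orchestrator adds scheduling, retries, logging, and distributed execution on top of this ordering.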

Reproducibility

ML without reproducibility is debugging hell.

Components:

Docker container with frozen dependencies
Random seeds fixed
Data version pinned
Code commit recorded
Config in YAML / hydra

Test: Same training → same metrics within tolerance.
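
A reproducibility test of this kind can be sketched with a toy training loop. Everything here is an assumption made for illustration: the synthetic linear data, the SGD settings, and the function names. The point is only that all randomness flows from one explicit seed.

```python
import math
import random


def train_and_evaluate(seed: int, n_samples: int = 200, epochs: int = 20, lr: float = 0.05) -> float:
    """Fit y = w*x + b with SGD on synthetic data and return the final MSE."""
    rng = random.Random(seed)  # local RNG, so no hidden global state leaks in
    xs = [rng.uniform(-1, 1) for _ in range(n_samples)]
    data = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]
    w, b = rng.uniform(-0.5, 0.5), 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # shuffle order is also controlled by the seed
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return sum(((w * x + b) - y) ** 2 for x, y in data) / n_samples


def is_reproducible(seed: int, tolerance: float = 1e-9) -> bool:
    """Train twice with the same seed and compare metrics within a tolerance."""
    first = train_and_evaluate(seed)
    second = train_and_evaluate(seed)
    return math.isclose(first, second, rel_tol=0.0, abs_tol=tolerance)


reproducible = is_reproducible(seed=42)
```

On real models the tolerance usually has to be looser than here, since GPU kernels and parallel data loading can introduce small nondeterministic differences.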

Typical questions

"Describe the full MLOps lifecycle"

Data ingestion → feature engineering (feature store) → experiment tracking (MLflow) → model registry → deployment (canary) → monitoring (drift + performance) → retraining trigger.

"How do you ensure reproducibility?"

Docker, fixed seeds, pinned data version, pinned dependencies, code commit recorded. Hydra / config files for tunable params.

"What goes into a CI/CD pipeline for ML?"

PR: unit tests + integration tests + sanity training run. Merge: docker build → push to registry → deploy to staging → shadow / canary → production.
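
The shadow step in that answer can be sketched as follows: the candidate model sees the same requests as production, its predictions are only recorded, and a disagreement rate decides whether it may move on to canary. The threshold, the toy models, and all names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ShadowReport:
    served: List[int]          # what callers actually received (production only)
    disagreement_rate: float   # share of requests where the candidate differed
    promotable: bool           # True if the candidate may proceed to canary


def shadow_compare(
    prod_model: Callable[[float], int],
    candidate_model: Callable[[float], int],
    requests: Sequence[float],
    max_disagreement: float = 0.05,
) -> ShadowReport:
    served: List[int] = []
    disagreements = 0
    for features in requests:
        prod_pred = prod_model(features)
        shadow_pred = candidate_model(features)  # recorded, never returned to the caller
        served.append(prod_pred)
        disagreements += int(prod_pred != shadow_pred)
    rate = disagreements / len(served) if served else 0.0
    return ShadowReport(served, rate, rate <= max_disagreement)


def prod(x: float) -> int:
    return int(x >= 0.5)


def candidate(x: float) -> int:
    return int(x >= 0.6)


report = shadow_compare(prod, candidate, [i / 10 for i in range(10)])
```

In this toy run the two models disagree on one request out of ten, so the candidate is held back under the assumed 5% limit.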

"When should a model be retrained?"

Schedule (e.g., weekly).
Performance drop detected.
Data drift detected.
Major data source change.
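
The four triggers above can be combined into one decision function. This is a sketch: the thresholds (7 days, a 0.02 metric drop, a drift score of 0.2) and the `ModelHealth` fields are hypothetical defaults, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ModelHealth:
    last_trained_at: datetime
    baseline_metric: float     # metric at deployment time, higher is better
    current_metric: float      # same metric measured on recent production data
    drift_score: float         # any drift statistic, e.g. PSI
    data_source_changed: bool  # set by whoever owns the upstream data


def retrain_reasons(
    health: ModelHealth,
    now: datetime,
    max_age: timedelta = timedelta(days=7),
    max_metric_drop: float = 0.02,
    drift_threshold: float = 0.2,
) -> List[str]:
    """Return every trigger that currently fires; an empty list means no retraining."""
    reasons: List[str] = []
    if now - health.last_trained_at >= max_age:
        reasons.append("schedule")
    if health.baseline_metric - health.current_metric > max_metric_drop:
        reasons.append("performance_drop")
    if health.drift_score > drift_threshold:
        reasons.append("data_drift")
    if health.data_source_changed:
        reasons.append("data_source_change")
    return reasons


now = datetime(2026, 1, 15)
stale = ModelHealth(datetime(2026, 1, 1), 0.90, 0.85, 0.05, False)
fresh = ModelHealth(datetime(2026, 1, 14), 0.90, 0.90, 0.05, False)
stale_reasons = retrain_reasons(stale, now)
fresh_reasons = retrain_reasons(fresh, now)
```

Returning the list of reasons rather than a bare boolean makes the retraining decision auditable.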

"Airflow or Kubeflow for ML pipelines?"

Airflow if the team already runs it. Kubeflow for Kubernetes-heavy setups. In practice, the implementation is more often Airflow plus ML tools (MLflow, KFP operators).

Common mistakes

No model registry. Where is the production model? Nobody knows.
No reproducibility. Models that cannot be reproduced cannot be retrained.
Train/serve skew. Feature engineering differs between training and production → silent bugs.
No monitoring. Models in production go stale without any alerts.
Manual deploy. SSH plus copying a file by hand → end of career.
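
For the train/serve skew item, one common guard is a single feature function shared by training and serving, plus a parity check that would catch a divergent reimplementation. The sketch below assumes hypothetical feature names and a deliberately skewed `legacy_serving_features` to show what the check detects.

```python
import math
from typing import Callable, Dict, List, Sequence

RawRecord = Dict[str, float]


def build_features(raw: RawRecord) -> List[float]:
    """Single source of truth, meant to be imported by both training and serving code."""
    return [
        math.log1p(raw["purchase_amount"]),
        raw["items_in_cart"] / max(raw["session_minutes"], 1.0),
    ]


def legacy_serving_features(raw: RawRecord) -> List[float]:
    """A separately maintained copy that drifted: log instead of log1p."""
    return [
        math.log(raw["purchase_amount"]),
        raw["items_in_cart"] / max(raw["session_minutes"], 1.0),
    ]


def feature_parity(
    offline_fn: Callable[[RawRecord], List[float]],
    online_fn: Callable[[RawRecord], List[float]],
    raw_rows: Sequence[RawRecord],
    tolerance: float = 1e-12,
) -> bool:
    """True if both code paths produce the same features for every raw record."""
    for raw in raw_rows:
        offline, online = offline_fn(raw), online_fn(raw)
        if len(offline) != len(online):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(offline, online)):
            return False
    return True


sample_rows = [
    {"purchase_amount": 100.0, "items_in_cart": 3.0, "session_minutes": 12.0},
    {"purchase_amount": 15.5, "items_in_cart": 1.0, "session_minutes": 0.5},
]
shared_ok = feature_parity(build_features, build_features, sample_rows)
legacy_ok = feature_parity(build_features, legacy_serving_features, sample_rows)
```

Run as a CI test on sampled production records, a check like this turns a silent skew bug into a failing build.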

FAQ

What about MLOps platforms (Vertex AI, SageMaker)?

In Russia, Vertex AI and SageMaker are not available. The self-hosted route is MLflow + Airflow + k8s + custom glue. Yandex Cloud offers managed ML services.

How long does MLOps take on a project?

A basic pipeline: 1-2 weeks. A mature one with monitoring and drift detection: 1-2 months.

MLOps engineer or ML engineer?

The MLOps engineer role is narrower, with a focus on the platform. The ML engineer role is broader. Large companies have both roles; in small ones they overlap.

Where to practice MLOps?

Set up MLflow + Airflow + Docker. Run a simple model through the full lifecycle on personal infrastructure.

Books / courses on MLOps?

"Designing Machine Learning Systems" by Chip Huyen; the "MLOps Specialization" on Coursera (Andrew Ng / DeepLearning.AI).
