Fraud detection ML system design in a Data Scientist interview
Problem statement
Real-time fraud detection — payments, account takeover.
Constraints:
- Score per transaction < 100ms.
- Imbalanced (0.1-1% fraud).
- Label delayed (chargeback after weeks).
- Adversaries adapt.
Real-time scoring
Transaction → Feature service → Model → Decision (allow / review / block).
Decision tiers:
- Allow. Low risk.
- Review. Manual queue.
- Block. High risk.
- Step-up auth. Send a 2FA challenge so the user confirms the transaction.
Threshold tuning depends on business cost: false positives are expensive because they block legitimate customers.
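A minimal sketch of how a model score could be mapped to the decision tiers listed above. The threshold values, the tier ordering (step-up auth sitting between allow and review), and the name `decide` are illustrative assumptions; real cut-offs would come from cost-based tuning.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    STEP_UP = "step_up_auth"
    REVIEW = "review"
    BLOCK = "block"


# Hypothetical thresholds; real values come from cost-based tuning.
STEP_UP_THRESHOLD = 0.30
REVIEW_THRESHOLD = 0.60
BLOCK_THRESHOLD = 0.90


def decide(score: float) -> Decision:
    """Map a fraud probability in [0, 1] to a decision tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= BLOCK_THRESHOLD:
        return Decision.BLOCK
    if score >= REVIEW_THRESHOLD:
        return Decision.REVIEW
    if score >= STEP_UP_THRESHOLD:
        return Decision.STEP_UP
    return Decision.ALLOW
```

Raising `BLOCK_THRESHOLD` trades fewer false blocks for more fraud reaching the review queue, which is the cost trade-off an interviewer is likely to probe.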
Features
Transaction. Amount, currency, country, merchant, time.
User history. Recent transactions, mean amount, country pattern.
Velocity. N transactions last 1h / 1day. Rapid changes — risk.
Device. Fingerprint, IP geolocation, mismatch with the user's country.
Network. Graph features (linked accounts, shared payment methods).
Historical risk. Score the user based on past activity.
A feature store is required: features must be precomputed and served in real time.
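As an illustration of the velocity features described above, here is a rough in-memory sketch of a sliding-window transaction counter. The class name `VelocityCounter` is hypothetical, and a production feature store would keep this state in a low-latency store rather than in process memory.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts each user's transactions inside a sliding time window."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps (seconds)

    def add_and_count(self, user_id: str, ts: float) -> int:
        """Record a transaction at time ts and return the count in the window.

        Assumes timestamps arrive in non-decreasing order per user.
        """
        events = self._events[user_id]
        events.append(ts)
        # Evict timestamps outside the half-open window (ts - window, ts].
        cutoff = ts - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

One counter per window (for example 3600 seconds and 86400 seconds) would give the "last 1h" and "last 1 day" features.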
Label delay
Fraud is confirmed only when a chargeback arrives (1-90 days later).
Implications:
- Training is possible only on data that is "old enough" for labels to have arrived.
- Recent transactions have no labels yet, so the model cannot be retrained on them immediately.
Mitigation:
- Weak labels. Use user reports or suspicious patterns as a proxy.
- Sequence modeling. Predict pre-chargeback signals.
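The "old enough" constraint can be sketched as a filter on transaction age. This is only an illustration: the 90-day maturity window mirrors the upper end of the chargeback range mentioned above, and the dict layout and function name are assumptions.

```python
from datetime import datetime, timedelta

# Assumption: chargebacks arrive within 90 days of the transaction.
LABEL_MATURITY = timedelta(days=90)


def split_by_label_maturity(transactions, now):
    """Split transactions into (mature, immature) lists by age.

    Each transaction is a dict with a 'timestamp' (datetime) key.
    For mature rows, the absence of a chargeback can be treated as a
    legitimate label; for immature rows the label is still unknown.
    """
    cutoff = now - LABEL_MATURITY
    mature = [t for t in transactions if t["timestamp"] <= cutoff]
    immature = [t for t in transactions if t["timestamp"] > cutoff]
    return mature, immature
```

The immature set is where weak labels or other proxies would be applied, since true labels for it do not exist yet.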
Class imbalance
99%+ legitimate.
Approaches:
- Sampling. Undersample majority, oversample minority (SMOTE).
- Class weights.
- Focal loss.
- Anomaly detection instead of supervised learning (if labels are rare).
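Two of the approaches above, class weights and focal loss, are small enough to sketch in plain Python. The weighting rule shown is the common "balanced" heuristic, n_samples / (n_classes * class_count); the focal loss defaults (gamma = 2, alpha = 0.25) are commonly cited values, used here as assumptions rather than tuned settings.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p is the predicted fraud probability, y is the true label (0 or 1).
    Well-classified examples are down-weighted by (1 - p_t) ** gamma.
    """
    eps = 1e-12  # guards against log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
```

With 99 legitimate rows and 1 fraud row, the fraud class receives a weight of 50, roughly 99 times the weight of the legitimate class.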
Adversarial nature
Fraudsters adapt. Models drift fast.
Defense:
- Frequent retrain (daily / hourly).
- Monitor model output drift.
- Manual review feedback loop.
- Adversarial training (simulate attacks).
- Layered defenses (rules + ML + manual).
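One possible way to monitor model output drift is the population stability index (PSI) between a baseline score sample and a recent one. PSI is a technique chosen here for illustration and is not named in the text above; the bin count and the floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10):
    """PSI between two samples of scores in [0, 1], using equal-width bins."""

    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # score 1.0 goes to last bin
            counts[idx] += 1
        # Small floor avoids log(0) and division by zero on empty bins.
        return [max(c / len(scores), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though the alert level is a judgment call for each system.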
Metrics
Offline:
- PR-AUC (better suited to imbalanced data than ROC-AUC).
- Precision @ K (K set by manual review capacity).
- Recall @ K.
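The offline metrics above can be sketched without any libraries. Average precision is used here as a step-wise estimate of PR-AUC; the sketch assumes binary 0/1 labels and does not treat tied scores specially.

```python
def precision_recall_at_k(y_true, scores, k):
    """Precision and recall among the k highest-scored transactions."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_fraud = sum(y_true)
    precision = hits / k
    recall = hits / total_fraud if total_fraud else 0.0
    return precision, recall


def average_precision(y_true, scores):
    """Average precision: mean of precision at the rank of each fraud case."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_fraud = sum(y_true)
    if total_fraud == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank
    return ap / total_fraud
```

Setting `k` to the number of cases the review team can handle per day ties the metric directly to review capacity.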
Online:
- $ saved (prevented fraud) vs $ lost (false blocks).
- Manual review queue size.
- Customer complaints / chargeback rate.
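The "$ saved vs $ lost" comparison can be approximated as a net value for a given block threshold. This is a simplified sketch: the `margin_rate` parameter (the share of a legitimate transaction the business earns) and the dict fields are hypothetical, and missed fraud below the threshold is deliberately left out.

```python
def net_value_of_blocking(transactions, threshold, margin_rate=0.02):
    """Dollars saved by blocked fraud minus revenue lost on false blocks.

    Each transaction is a dict with 'amount', 'score', and 'is_fraud'.
    margin_rate is a hypothetical share of the amount earned on a
    legitimate transaction.
    """
    saved = lost = 0.0
    for t in transactions:
        if t["score"] < threshold:
            continue  # allowed, so no block-related gain or loss
        if t["is_fraud"]:
            saved += t["amount"]
        else:
            lost += t["amount"] * margin_rate
    return saved - lost
```

Evaluating this function across a grid of thresholds gives a rough picture of where blocking stops paying for itself.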
FAQ
Is this official information?
No. This article is based on industry fraud detection practices.