Fraud detection ML system design in a Data Scientist interview


Карьерник is Duolingo for analysts: 10 minutes a day to practice SQL, Python, A/B testing, statistics, metrics, and 3 more interview topics. 1500+ questions in a Telegram bot. Free.

Problem statement

Real-time fraud detection for payments and account takeover.

Constraints:

  • Score each transaction in under 100 ms.
  • Imbalanced classes (0.1-1% fraud).
  • Labels are delayed (chargebacks arrive weeks later).
  • Adversaries adapt.

Real-time scoring

Transaction → Feature service → Model → Decision (allow / review / block).

Decision tiers:

  • Allow. Low risk.
  • Review. Manual queue.
  • Block. High risk.
  • Step-up auth. Send a 2FA challenge to confirm the transaction with the user.

Threshold tuning depends on business cost: false positives are expensive because they block legitimate customers.
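
The tiers and the cost-based tuning can be sketched roughly as follows. This is a minimal illustration, not a production policy: the cut-off values, the placement of step-up auth between allow and review, the flat `fp_cost` per false block, and all function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; real values come from cost-based tuning.
    step_up: float = 0.30
    review: float = 0.60
    block: float = 0.90


def decide(score: float, t: Thresholds = Thresholds()) -> str:
    """Map a fraud score in [0, 1] to a decision tier."""
    if score >= t.block:
        return "block"
    if score >= t.review:
        return "review"
    if score >= t.step_up:
        return "step_up_auth"
    return "allow"


def expected_cost(scores, labels, amounts, threshold, fp_cost=5.0):
    """Total cost of blocking every transaction scored >= threshold.

    Missed fraud costs the transaction amount; a false block costs a
    flat fp_cost (an assumed figure for lost goodwill and support).
    """
    cost = 0.0
    for s, y, a in zip(scores, labels, amounts):
        blocked = s >= threshold
        if y == 1 and not blocked:
            cost += a          # fraud that slipped through
        elif y == 0 and blocked:
            cost += fp_cost    # legitimate customer blocked
    return cost


def best_threshold(scores, labels, amounts, candidates, fp_cost=5.0):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(scores, labels, amounts, t, fp_cost),
    )
```

In practice the candidate thresholds would be evaluated on a held-out set with mature labels, and the cost of a false block would be estimated by the business rather than fixed.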

Features

Transaction. Amount, currency, country, merchant, time.

User history. Recent transactions, mean amount, country pattern.

Velocity. Number of transactions in the last 1 hour / 1 day. Rapid changes signal risk.

Device. Fingerprint, IP geolocation, mismatch with the user's country.

Network. Graph features (linked accounts, shared payment methods).

Historical risk. Score the user based on past activity.

A feature store is required: features are pre-computed and served in real time.
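
As an illustration of a velocity feature, here is a small in-memory sliding-window counter. It is only a sketch: the class name `VelocityCounter` is hypothetical, timestamps are assumed to arrive in ascending order per user, and a real feature store would keep this state in a low-latency store rather than in process memory.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts each user's transactions inside a sliding time window."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        # user_id -> timestamps in ascending order (assumed by add()).
        self._events = defaultdict(deque)

    def add(self, user_id: str, ts: float) -> None:
        """Record a transaction at time ts (seconds)."""
        self._events[user_id].append(ts)

    def count(self, user_id: str, now: float) -> int:
        """Number of transactions in the interval (now - window, now]."""
        q = self._events[user_id]
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)


hourly = VelocityCounter(window_seconds=3600)
hourly.add("u1", 0)
hourly.add("u1", 1800)
hourly.add("u1", 3500)
tx_last_hour = hourly.count("u1", now=3500)
```

Keeping one counter per window (1 hour, 1 day) gives the pair of velocity features listed above.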

Label delay

Fraud is confirmed only when a chargeback arrives (1-90 days later).

Implications:

  • You can train only on data that is "old enough".
  • Recent transactions have no labels yet, so the model cannot be retrained on them immediately.

Mitigation:

  • Weak labels. User reports / suspicious patterns as a proxy.
  • Sequence modeling. Predict pre-chargeback signals.
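
The "old enough" rule can be sketched as a maturity filter. The 90-day cut-off is an assumption taken from the upper end of the chargeback window, and the record fields `ts` and `chargeback` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta

# Assumption: a transaction's label is final after 90 days.
MATURITY = timedelta(days=90)


def split_by_maturity(transactions, now, maturity=MATURITY):
    """Split transactions into labeled training rows and pending rows.

    Returns (mature, pending): mature is a list of (tx, label) pairs
    where label is 1 if a chargeback was recorded and 0 otherwise;
    pending holds transactions whose label may still change.
    """
    mature, pending = [], []
    for tx in transactions:
        if now - tx["ts"] >= maturity:
            mature.append((tx, 1 if tx["chargeback"] else 0))
        else:
            pending.append(tx)
    return mature, pending


now = datetime(2024, 6, 1)
txs = [
    {"id": 1, "ts": datetime(2024, 1, 1), "chargeback": True},
    {"id": 2, "ts": datetime(2024, 2, 1), "chargeback": False},
    {"id": 3, "ts": datetime(2024, 5, 20), "chargeback": False},
]
mature, pending = split_by_maturity(txs, now)
```

The pending rows are where weak labels would be applied if the model has to react faster than the chargeback window allows.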

Class imbalance

99%+ of transactions are legitimate.

Approaches:

  • Sampling. Undersample majority, oversample minority (SMOTE).
  • Class weights.
  • Focal loss.
  • Anomaly detection instead of supervised learning (if labels are rare).
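
Two of the approaches above, class weights and focal loss, are easy to show in a few lines. This is a plain-Python sketch for intuition: the weight formula is the common "balanced" heuristic n_samples / (n_classes * class_count), and the focal loss defaults (gamma = 2, alpha = 0.25) are the values usually quoted for it, not settings tuned for fraud data.

```python
import math


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for a single example.

    p: predicted fraud probability; y: 1 for fraud, 0 for legitimate.
    The (1 - p_t) ** gamma factor shrinks the loss of examples the
    model already classifies confidently, so hard cases dominate.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# With 1% fraud, the fraud class gets a weight of 50 and the
# legitimate class a weight of about 0.505.
weights = balanced_class_weights([0] * 99 + [1])
```

With gamma = 0 the focal loss reduces to alpha-weighted cross-entropy, which is one way to sanity-check an implementation.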

Adversarial nature

Fraudsters adapt. Models drift fast.

Defense:

  • Frequent retrain (daily / hourly).
  • Monitor model output drift.
  • Manual review feedback loop.
  • Adversarial training (simulate attacks).
  • Layered defenses (rules + ML + manual).
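
One common way to monitor model output drift is the Population Stability Index (PSI) between a reference sample of scores and a recent one. The sketch below assumes scores lie in [0, 1], both samples are non-empty, and ten equal-width bins are enough; the function name and the `eps` floor for empty bins are choices made for the example.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def histogram(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # put s == 1.0 in the last bin
            counts[idx] += 1
        total = len(scores)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]          # last month's scores
recent = [min(s + 0.3, 1.0) for s in reference]    # scores shifted upward
drift = psi(reference, recent)
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the alert level should be calibrated on the system's own history.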

Metrics

Offline:

  • PR-AUC (better suited to imbalanced data).
  • Precision @ K (K = manual review capacity).
  • Recall @ K.
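
Precision @ K and Recall @ K can be computed directly from ranked scores. In this sketch K stands for the number of cases the review team can handle; the function name and the toy data are illustrative only.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall over the k highest-scored transactions.

    labels are 1 for fraud and 0 for legitimate.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(y for _, y in ranked[:k])   # fraud cases inside the top k
    total_fraud = sum(labels)
    precision = hits / k if k else 0.0
    recall = hits / total_fraud if total_fraud else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]
labels = [1, 0, 1, 0, 0, 1]
p_at_3, r_at_3 = precision_recall_at_k(scores, labels, k=3)
```

Here two of the three reviewed cases are fraud and two of the three fraud cases are caught, so both values are 2/3.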

Online:

  • $ saved (prevented fraud) vs $ lost (false blocks).
  • Manual review queue size.
  • Customer complaints / chargeback rate.

FAQ

Is this official information?

No. The article is based on industry fraud detection practices.

