Recommender Systems for Analysts

Why this matters

Netflix, YouTube, Amazon, and Wildberries all rely on recommender systems for a large share of their revenue. Even if you never build the models yourself, understanding how recommenders work matters for an analyst role in consumer tech.

Interviews at Amazon, Yandex, Ozon, and Avito may include questions on the basics of recommender systems.

What it is

A system that suggests items (products, videos, music, friends) to users based on patterns in the data.

Goals:

  • Increase engagement and consumption
  • Discovery
  • Conversion (purchase)
  • Retention

Main approaches

1. Popularity-based

"Top 10 most viewed today."

Pros: works without any user data. Cons: no personalization.

Often used as a baseline.
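
A minimal sketch of a popularity baseline, assuming view events arrive as (user, item) pairs. The function name `top_k_popular` and the sample data are illustrative, not taken from any particular system.

```python
from collections import Counter

def top_k_popular(events, k=10):
    """Return the k most-viewed item ids, most popular first."""
    counts = Counter(item for _user, item in events)
    return [item for item, _count in counts.most_common(k)]

# Hypothetical view log: (user_id, item_id)
views = [("u1", "a"), ("u2", "a"), ("u3", "b"),
         ("u1", "c"), ("u2", "b"), ("u4", "a")]

top2 = top_k_popular(views, k=2)
print(top2)
```

Every user gets the same list, which is exactly why this serves as a baseline rather than a personalized model.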

2. Content-based

Matches users to items based on item attributes.

"You liked action movies, so here are more action movies."

Requires:

  • Item features (genre, price, brand)
  • User history
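
One way to sketch content-based matching: average the feature vectors of the items a user liked into a profile, then rank unseen items by cosine similarity to that profile. The feature table `ITEM_FEATURES` and the genre encoding below are assumptions made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical item features: [action, comedy, drama]
ITEM_FEATURES = {
    "m1": [1, 0, 0],
    "m2": [1, 0, 1],
    "m3": [0, 1, 0],
    "m4": [1, 1, 0],
}

def recommend_content_based(liked, k=2):
    """Rank unseen items by similarity to the mean vector of liked items."""
    dims = len(next(iter(ITEM_FEATURES.values())))
    profile = [sum(ITEM_FEATURES[i][d] for i in liked) / len(liked)
               for d in range(dims)]
    scored = [(cosine(profile, feats), item)
              for item, feats in ITEM_FEATURES.items() if item not in liked]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, id breaks ties
    return [item for _score, item in scored[:k]]

recs = recommend_content_based(["m1"], k=2)
print(recs)
```

A user who liked only the pure action title `m1` gets the two items that contain action, while the pure comedy `m3` scores zero.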

3. Collaborative filtering

"Users like you also liked..."

No item attributes needed, only user-item interactions.

Types:

  • User-based: find similar users, then recommend what they liked
  • Item-based: find items similar to the ones the user liked
  • Matrix factorization: decompose the user-item matrix into latent factors
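
The item-based variant can be sketched with nothing but interaction sets. Here two items count as similar when many of the same users interacted with both (cosine similarity over binary interactions). The function name and the `history` data are hypothetical.

```python
import math
from collections import defaultdict

def item_based_recs(user_items, target_user, k=2):
    """Score unseen items by their co-occurrence similarity to the user's items."""
    # Invert user -> items into item -> users
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    seen = user_items[target_user]
    scores = defaultdict(float)
    for liked in seen:
        for cand, users in item_users.items():
            if cand in seen:
                continue
            overlap = len(item_users[liked] & users)
            if overlap:
                # Cosine similarity for binary interaction vectors
                scores[cand] += overlap / math.sqrt(len(item_users[liked]) * len(users))
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]

# Hypothetical interaction history
history = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "c"},
    "u4": {"d"},
}

cf_recs = item_based_recs(history, "u1")
print(cf_recs)
```

Item `d` is never recommended to `u1` because no user connects it to anything `u1` has interacted with, and no item attributes were needed at any point.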

4. Hybrid

Combines several approaches. Netflix and Amazon are typical examples.

Matrix factorization

Rating matrix R ≈ U × Vᵀ

  • U: users × latent factors
  • V: items × latent factors

A missing rating is predicted as the dot product of the user's row in U and the item's row in V.

Popular algorithms: SVD, ALS, NMF.
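
To make the factorization concrete, here is a small sketch that fits U and V with plain stochastic gradient descent on the observed ratings only. This is a simplified stand-in for illustration, not the SVD, ALS, or NMF algorithms named above. The hyperparameters and the toy ratings are assumptions.

```python
import math
import random

def predict(U, V, u, i):
    """Predicted rating: dot product of user row u and item row i."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Fit R ≈ U × Vᵀ on observed (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(n_factors):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step with L2 penalty
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def train_rmse(U, V, ratings):
    """RMSE over the observed ratings."""
    sq = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))

# Hypothetical observed ratings: (user index, item index, rating 1..5)
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5),
           (0, 2, 1), (2, 0, 1)]

U0, V0 = factorize(ratings, 4, 4, epochs=0)  # untrained starting point
U, V = factorize(ratings, 4, 4)
rmse_before = train_rmse(U0, V0, ratings)
rmse_after = train_rmse(U, V, ratings)
missing = predict(U, V, 1, 2)  # user 1 never rated item 2
print(rmse_before, rmse_after, missing)
```

The training error here is measured on the same ratings the model was fit on, so it only shows that the factors learned something. A real evaluation would hold out part of the ratings.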

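Deep learning

Embeddings plus neural networks capture complex interactions.

Large platforms such as YouTube and TikTok rely on deep models; transformers and GNNs are common architectures in this space.

These models can capture what classical methods miss, but they need massive amounts of data.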

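Cold start problem

A new user or a new item has no interaction history, so personalized methods have nothing to work with.

Solutions:

  • Popularity-based recommendations for new users
  • Content-based recommendations for new items
  • Active learning: ask users for their preferences
  • Demographics-based recommendations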
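
The new-user case from the list above can be sketched as a simple fallback rule: serve popular items until the user has enough history, then switch to a personalized model. The threshold `min_events`, the function names, and the stub model are all hypothetical.

```python
def recommend_with_fallback(history, personalized_fn, popular_items,
                            min_events=3, k=3):
    """Use popularity for users with little history, a personalized model otherwise."""
    if len(history) < min_events:
        # Cold start: popular items the user has not interacted with yet
        return [item for item in popular_items if item not in history][:k]
    return personalized_fn(history)[:k]

def toy_personalized(history):
    """Stand-in for a real model (assumption for the example)."""
    return ["x", "y", "z"]

popular = ["a", "b", "c", "d"]

new_user = recommend_with_fallback(["b"], toy_personalized, popular)
active_user = recommend_with_fallback(["a", "b", "c"], toy_personalized, popular)
print(new_user, active_user)
```

New items need a separate path, for example the content-based matching described earlier, because a popularity list will never surface an item with zero views.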

Metrics

Offline

Evaluated on historical data.

Rating prediction

RMSE, MAE (prediction accuracy).
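
Both metrics compare predicted ratings with the actual ones. A small sketch with made-up ratings:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of a miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out ratings and model predictions
actual = [4, 3, 5, 2]
predicted = [3.5, 3, 4, 3]

print(rmse(actual, predicted), mae(actual, predicted))
```

For this data RMSE is 0.75 and MAE is 0.625. RMSE is the larger of the two because squaring gives the two one-point misses more weight.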

Ranking

  • Precision@K
  • Recall@K
  • MAP (Mean Average Precision)
  • NDCG (Normalized Discounted Cumulative Gain)
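
A sketch of three of these metrics for a single user, assuming binary relevance (an item is either relevant or not). Averaging over users and graded relevance are left out to keep it short.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG of the list divided by the DCG of an ideal ordering."""
    dcg = sum(1 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranked list and ground-truth relevant set
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

p3 = precision_at_k(recommended, relevant, 3)
r3 = recall_at_k(recommended, relevant, 3)
n3 = ndcg_at_k(recommended, relevant, 3)
print(p3, r3, n3)
```

Precision@3 and Recall@3 are both 2/3 here: two of the three shown items are relevant, and two of the three relevant items were shown. NDCG additionally rewards placing the hits near the top of the list.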

Online (A/B)

  • CTR on recommendations
  • Conversion rate
  • Revenue per session
  • Long-term retention

Online metrics are the ground truth; offline metrics are only an approximation.

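Exploration vs exploitation

  • Exploitation: show recommendations already known to perform well
  • Exploration: try new items to learn about them

The balance matters. Without exploration the system produces a filter bubble and users get bored.

Tools:

  • Epsilon-greedy
  • Thompson sampling
  • Multi-armed bandits (the general framework the two strategies above belong to)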
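
Epsilon-greedy is the simplest of these to sketch: with probability epsilon show a random item (exploration), otherwise show the item with the best observed click rate (exploitation). The class name, the click rates, and the simulation loop are assumptions for the example.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of items (arms)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {arm: 0 for arm in arms}
        self.clicks = {arm: 0 for arm in arms}

    def _ctr(self, arm):
        return self.clicks[arm] / self.shows[arm] if self.shows[arm] else 0.0

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))  # explore
        return max(self.shows, key=self._ctr)         # exploit

    def update(self, arm, clicked):
        self.shows[arm] += 1
        self.clicks[arm] += int(clicked)

# Simulated environment with hypothetical true click rates
true_ctr = {"a": 0.05, "b": 0.10, "c": 0.30}
bandit = EpsilonGreedy(list(true_ctr), epsilon=0.1, seed=42)
env = random.Random(7)
for _ in range(1000):
    arm = bandit.select()
    bandit.update(arm, env.random() < true_ctr[arm])

print(bandit.shows)
```

Setting epsilon to 0 turns this into pure exploitation, which is the filter-bubble case: an item that never gets shown can never prove itself.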

Recommendation pipeline

  1. Candidate generation: retrieve roughly 100 to 1,000 items the user might like
  2. Ranking: score each candidate
  3. Re-ranking: apply business rules (diversity, fairness)
  4. Serving: return the top K

Different stages typically use different models.
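
The four stages above can be sketched as one function. Everything here is a toy assumption: the catalog, the scoring function (raw popularity), and the business rule (at most one item per category).

```python
def recommend_pipeline(user, catalog, score_fn, k=3, max_per_category=1):
    # 1. Candidate generation: here simply everything the user has not seen
    candidates = [item for item in catalog if item["id"] not in user["seen"]]

    # 2. Ranking: score each candidate, best first (id breaks ties)
    ranked = sorted(candidates, key=lambda it: (-score_fn(user, it), it["id"]))

    # 3. Re-ranking: diversity rule, cap the number of items per category
    reranked, per_category = [], {}
    for item in ranked:
        cat = item["category"]
        if per_category.get(cat, 0) < max_per_category:
            reranked.append(item)
            per_category[cat] = per_category.get(cat, 0) + 1

    # 4. Serving: return the top K ids
    return [item["id"] for item in reranked[:k]]

# Hypothetical catalog and user
catalog = [
    {"id": "i1", "category": "shoes", "popularity": 0.9},
    {"id": "i2", "category": "shoes", "popularity": 0.8},
    {"id": "i3", "category": "shoes", "popularity": 0.7},
    {"id": "i4", "category": "bags", "popularity": 0.6},
    {"id": "i5", "category": "hats", "popularity": 0.5},
]
user = {"seen": {"i1"}}

def popularity_score(user, item):
    """Stand-in ranking model: ignores the user and scores by popularity."""
    return item["popularity"]

served = recommend_pipeline(user, catalog, popularity_score)
print(served)
```

Without the re-ranking step the result would be `i2`, `i3`, `i4`, with two pairs of shoes in the top three. The cap swaps `i3` for a hat.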

Data you need

  • User interactions: views, clicks, purchases, likes
  • User features: demographics, tenure, segment
  • Item features: categories, price, descriptions

Richer data generally leads to better recommendations.

Implicit vs explicit feedback

Explicit

Ratings, likes: a clear signal, but sparse.

Implicit

Clicks, views, time on page: noisy, but dense.

Modern systems rely mostly on implicit feedback.

In analytics

Analyst roles around recommenders:

1. Analyze impact

A/B test the recommendation algorithm against a baseline and measure the revenue lift.
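
A sketch of the impact readout, using conversion rate as a stand-in for revenue because it needs only counts. It computes the relative lift and a two-proportion z-test. The traffic numbers are invented, and revenue per user would normally call for a t-test or a bootstrap instead.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of B over A, z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (p_b - p_a) / p_a, z, p_value

# Hypothetical experiment: A = baseline, B = new recommendation algorithm
lift, z, p_value = two_proportion_ztest(conv_a=500, n_a=10_000,
                                        conv_b=560, n_b=10_000)
print(f"lift={lift:.1%}, z={z:.2f}, p={p_value:.3f}")
```

In this made-up case a 12% relative lift on 10,000 users per group is still not significant at the usual 5% level, which is a common situation worth being ready to explain.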

2. Content insights

Which items are trending and which are underserved.

3. User segmentation

Different users come to recommendations for different reasons (discovery vs efficient shopping).

4. Fairness / diversity

Look beyond the most clickable items and track exposure across a variety of items.
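
Two simple exposure checks an analyst might run on a log of recommendation impressions: catalog coverage (what share of the catalog was shown at all) and concentration (what share of impressions went to the top items). The function name and the data are hypothetical.

```python
from collections import Counter

def exposure_report(impressions, catalog_size, top_n=1):
    """Return (catalog coverage, share of impressions taken by the top_n items)."""
    counts = Counter(impressions)
    coverage = len(counts) / catalog_size
    top_share = sum(c for _item, c in counts.most_common(top_n)) / len(impressions)
    return coverage, top_share

# Hypothetical impression log for a catalog of 5 items
impressions = ["a"] * 6 + ["b"] * 3 + ["c"]

coverage, top_share = exposure_report(impressions, catalog_size=5)
print(coverage, top_share)
```

Here only 60% of the catalog was ever shown and a single item took 60% of all impressions, a pattern that would be worth raising even if CTR looks healthy.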

Python libraries

  • Surprise: classical algos
  • LightFM: hybrid
  • implicit: collaborative filtering
  • RecBole: modern benchmarks
  • TensorFlow Recommenders: deep learning

In the interview

"How does collaborative filtering work?" Build a user-item matrix, find similar users or items, recommend what they liked.

"How do you solve cold start?" Popularity, content-based, demographics.

"Metrics?" Precision@K and NDCG offline; CTR and revenue online.

"Explore vs exploit?" Bandits: show known-good items, but periodically try new ones.

Common mistakes

Only offline metrics

Offline metrics do not correlate perfectly with online results. Always run an A/B test.

Ignoring business rules

Showing only the most similar items leaves no diversity. Re-rank.

Stale data

Preferences change, so fresh data matters.

No cold start plan

New users and items can make up a large share of traffic. Plan for them.

FAQ

Does an analyst need to be able to build one?

No. Understand the concepts and be able to interpret the results.

Which approach is best?

Usually hybrid, but it depends on the data.

LLM-based recommendations?

Emerging. ChatGPT-style recommendations are a new area.

