Recommender systems for analysts
Why this matters
Netflix, YouTube, Amazon, and Wildberries all rely on recommender systems for a large share of their revenue. Even if you never build the models yourself, understanding how recommenders work is important for an analyst role in consumer tech.
Interviewers at Amazon, Yandex, Ozon, and Avito may ask about recommender system basics.
What it is
A system that suggests items (products, videos, music, friends) to users based on patterns in the data.
Goals:
- Increase engagement / consumption
- Discovery
- Conversion (buy)
- Retention
Main approaches
1. Popularity-based
"Top 10 most viewed today."
Pro: works without any user data. Con: no personalization.
Often used as a baseline.
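A popularity baseline can be as simple as counting views. The sketch below assumes a hypothetical log of (user_id, item_id) view events; the function name and the data are illustrative only.

```python
from collections import Counter

def top_k_popular(events, k=10):
    """Return the k most-viewed item ids from (user_id, item_id) view events."""
    counts = Counter(item for _user, item in events)
    return [item for item, _count in counts.most_common(k)]

# Hypothetical view log for one day: (user_id, item_id)
events = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
]

top2 = top_k_popular(events, k=2)
print(top2)
```

Every user gets the same list, which is exactly why this works with no user data and offers no personalization.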
2. Content-based
Matches items to a user based on item attributes.
"You liked action movies → here's more action."
Requires:
- Item features (genre, price, brand)
- User history
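One possible way to turn item features and user history into recommendations is to average the feature vectors of liked items into a user profile and rank the remaining items by cosine similarity to it. The item ids and the three genre features below are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_content_based(liked, item_features, k=2):
    """Rank unseen items by similarity to the average of the liked items."""
    n_features = len(next(iter(item_features.values())))
    profile = [
        sum(item_features[item][f] for item in liked) / len(liked)
        for f in range(n_features)
    ]
    scores = {
        item: cosine(profile, feats)
        for item, feats in item_features.items()
        if item not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical catalog; features are [action, comedy, drama]
item_features = {
    "action_1": [1, 0, 0],
    "action_2": [1, 0, 0],
    "action_comedy": [1, 1, 0],
    "comedy_1": [0, 1, 0],
}

recs = recommend_content_based(["action_1"], item_features, k=2)
print(recs)
```

The pure action title scores highest, the mixed title second, and the pure comedy is not recommended at all.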
3. Collaborative filtering
"Users like you also liked."
No item attributes are needed, only user-item interactions.
Types:
- User-based: find similar users → recommend what they liked
- Item-based: find similar items → recommend items similar to the ones the user liked
- Matrix factorization: decompose user-item matrix
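As a rough illustration of the item-based variant, the sketch below measures item similarity by how many users two items share (cosine similarity on binary interactions) and scores each candidate by its total similarity to the items the user already liked. The interaction data is hypothetical.

```python
import math
from collections import defaultdict

def item_based_recommend(interactions, user, k=2):
    """Item-based CF on binary interactions (user -> set of item ids)."""
    # Invert the data: item -> set of users who interacted with it
    users_by_item = defaultdict(set)
    for u, items in interactions.items():
        for item in items:
            users_by_item[item].add(u)

    def similarity(i, j):
        # Cosine similarity between two binary user vectors
        common = len(users_by_item[i] & users_by_item[j])
        return common / math.sqrt(len(users_by_item[i]) * len(users_by_item[j]))

    liked = interactions[user]
    scores = {
        candidate: sum(similarity(candidate, item) for item in liked)
        for candidate in users_by_item
        if candidate not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical interaction data
interactions = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"A", "D"},
    "u5": {"A"},
}

recs = item_based_recommend(interactions, "u5", k=2)
print(recs)
```

Note that no item attributes appear anywhere: the similarity comes entirely from who interacted with what.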
4. Hybrid
Combines multiple approaches. Used by Netflix and Amazon.
Matrix factorization
Rating matrix R ≈ U × Vᵀ
- U: users × latent factors
- V: items × latent factors
Predict a missing rating as the dot product of the user's row in U and the item's row in V.
Popular algos: SVD, ALS, NMF.
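To make the idea concrete, here is a minimal sketch that learns the two factor matrices with plain stochastic gradient descent, approximating each known rating by the dot product of a user factor vector and an item factor vector. This is not SVD, ALS, or NMF as implemented in any library; the hyperparameters and the ratings are assumptions chosen for a toy example.

```python
import random

def predict(U, V, u, i):
    """Predicted rating: dot product of user u's and item i's factors."""
    return sum(uf * vf for uf, vf in zip(U[u], V[i]))

def rmse(ratings, U, V):
    squared = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Fit U (users x factors) and V (items x factors) on known ratings."""
    rng = random.Random(seed)
    U = [[rng.random() for _ in range(n_factors)] for _ in range(n_users)]
    V = [[rng.random() for _ in range(n_factors)] for _ in range(n_items)]
    history = [rmse(ratings, U, V)]  # training error before any update
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(n_factors):
                uf, vf = U[u][f], V[i][f]
                # Gradient step with L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
        history.append(rmse(ratings, U, V))
    return U, V, history

# Hypothetical known ratings: (user index, item index, rating 1-5)
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 2, 1),
    (2, 0, 1), (2, 1, 2), (2, 2, 5),
    (3, 1, 1), (3, 2, 4),
]

U, V, history = factorize(ratings, n_users=4, n_items=3)
# User 1 never rated item 1; the model fills the gap
missing = predict(U, V, 1, 1)
print(round(history[0], 2), round(history[-1], 2), round(missing, 2))
```

The training error only shows that the factors fit the known cells; whether the filled-in ratings are any good has to be checked on held-out data.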
Deep learning
Embeddings plus neural networks to model complex interactions.
YouTube, TikTok: transformers, GNNs.
Goes beyond what classical methods can do, but requires massive amounts of data.
Cold start problem
A new user or a new item has no history, so there is nothing to base recommendations on.
Solutions:
- Popularity-based recs for new users
- Content-based recs for new items
- Active learning: ask preferences
- Demographics-based
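The first solution above often comes down to a fallback rule in the serving code. The sketch below is one hypothetical way to write it: users with too little history get the popular list, everyone else gets the personalized model. The threshold `min_events` and the stand-in `toy_personalized` function are assumptions for illustration.

```python
def recommend_with_fallback(user_id, history, personalized, popular_items,
                            k=3, min_events=3):
    """Serve popular items to cold users, personalized recs otherwise."""
    events = history.get(user_id, [])
    if len(events) < min_events:
        # Cold start: not enough history to personalize
        return popular_items[:k]
    return personalized(user_id, events)[:k]

def toy_personalized(user_id, events):
    # Stand-in for a real model; returns placeholder ids
    return [f"similar_to_{item}" for item in events]

popular_items = ["P1", "P2", "P3", "P4"]
history = {"old_user": ["A", "B", "C"]}

new_recs = recommend_with_fallback("new_user", history, toy_personalized, popular_items)
old_recs = recommend_with_fallback("old_user", history, toy_personalized, popular_items)
print(new_recs, old_recs)
```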
Metrics
Offline
Evaluated on historical data:
Rating prediction
RMSE and MAE measure rating prediction accuracy.
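Both metrics compare predicted ratings with the actual ones; RMSE penalizes large errors more heavily because the errors are squared before averaging. A small sketch with made-up ratings:

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical held-out ratings and model predictions
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]

mae_value = mae(actual, predicted)    # errors 1, 0, 1, 2 -> mean 1.0
rmse_value = rmse(actual, predicted)  # squared errors 1, 0, 1, 4 -> sqrt(1.5)
print(mae_value, round(rmse_value, 3))
```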
Ranking
- Precision@K
- Recall@K
- MAP (Mean Average Precision)
- NDCG (Normalized Discounted Cumulative Gain)
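The ranking metrics above can be sketched for a single user as follows, assuming binary relevance (an item is either relevant or not). Definitions of average precision vary in how they normalize; this version divides by min(number of relevant items, K), which is an assumption rather than the only convention. MAP is the mean of this value over all users.

```python
import math

def precision_at_k(recommended, relevant, k):
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def average_precision_at_k(recommended, relevant, k):
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at each hit position
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def ndcg_at_k(recommended, relevant, k):
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranked list and ground-truth relevant items for one user
recommended = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "F"}

p3 = precision_at_k(recommended, relevant, 3)
r3 = recall_at_k(recommended, relevant, 3)
ap3 = average_precision_at_k(recommended, relevant, 3)
ndcg3 = ndcg_at_k(recommended, relevant, 3)
print(round(p3, 3), round(r3, 3), round(ap3, 3), round(ndcg3, 3))
```

Precision and recall ignore the order within the top K, while average precision and NDCG reward putting relevant items higher.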
Online (A/B)
- CTR on recommendations
- Conversion rate
- Revenue per session
- Long-term retention
Online metrics are the ground truth. Offline metrics are an approximation.
Exploration vs exploitation
- Exploitation: show known-good recommendations
- Exploration: try new items (learn)
The balance matters. Without exploration you get a filter bubble and users get bored.
Tools:
- Multi-armed bandits (the general framework)
- Epsilon-greedy (a simple bandit strategy)
- Thompson sampling (a Bayesian bandit strategy)
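Epsilon-greedy is the simplest of these to sketch: with probability epsilon show a random item (explore), otherwise show the item with the best observed CTR (exploit). The class below and the simulated "true" click rates are hypothetical and exist only to illustrate the loop.

```python
import random

class EpsilonGreedy:
    def __init__(self, items, epsilon=0.1, seed=0):
        self.items = list(items)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {item: 0 for item in self.items}
        self.clicks = {item: 0 for item in self.items}

    def ctr(self, item):
        """Observed click-through rate; 0.0 for items never shown."""
        return self.clicks[item] / self.shows[item] if self.shows[item] else 0.0

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.items)   # explore
        return max(self.items, key=self.ctr)     # exploit

    def update(self, item, clicked):
        self.shows[item] += 1
        self.clicks[item] += int(clicked)

# Simulated environment: assumed true click probabilities, unknown to the bandit
true_ctr = {"A": 0.05, "B": 0.20, "C": 0.10}
env_rng = random.Random(1)

bandit = EpsilonGreedy(true_ctr, epsilon=0.1)
for _ in range(5000):
    item = bandit.select()
    clicked = env_rng.random() < true_ctr[item]
    bandit.update(item, clicked)

most_shown = max(bandit.shows, key=bandit.shows.get)
print(most_shown, bandit.shows)
```

Setting epsilon to 0 would lock the system onto whatever looked best early on, which is the filter-bubble failure mode in miniature.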
Recommendation pipeline
- Candidate generation: retrieve roughly 100 to 1,000 items the user might like
- Ranking: score each candidate
- Re-ranking: apply business rules (diversity, fairness)
- Serving: return top-K
Different stages use different models.
Data you need
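The four stages can be sketched as a chain of small functions. Everything here is a toy assumption: real candidate generation and ranking are done by models, whereas this example filters by category and reads a precomputed `score` field, and the diversity rule is a simple per-category cap.

```python
def generate_candidates(catalog, user_categories):
    """Stage 1: cheap filter that narrows the catalog."""
    return [item for item in catalog if item["category"] in user_categories]

def rank(candidates):
    """Stage 2: order candidates by a (here precomputed) relevance score."""
    return sorted(candidates, key=lambda item: item["score"], reverse=True)

def rerank_with_diversity(ranked, max_per_category=2):
    """Stage 3: business rule, at most N items per category."""
    kept, per_category = [], {}
    for item in ranked:
        seen = per_category.get(item["category"], 0)
        if seen < max_per_category:
            kept.append(item)
            per_category[item["category"]] = seen + 1
    return kept

def serve(items, k):
    """Stage 4: return the top-K item ids."""
    return [item["id"] for item in items[:k]]

# Hypothetical catalog with model scores already attached
catalog = [
    {"id": "a1", "category": "action", "score": 0.90},
    {"id": "a2", "category": "action", "score": 0.80},
    {"id": "a3", "category": "action", "score": 0.70},
    {"id": "c1", "category": "comedy", "score": 0.60},
    {"id": "d1", "category": "drama", "score": 0.95},
]

candidates = generate_candidates(catalog, {"action", "comedy"})
top3 = serve(rerank_with_diversity(rank(candidates)), k=3)
print(top3)
```

Without the re-ranking step the top 3 would be three action titles; the cap swaps the third one for the comedy.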
- User interactions: views, clicks, purchases, likes
- User features: demographics, tenure, segment
- Item features: categories, price, descriptions
Richer data → better recs.
Implicit vs explicit feedback
Explicit
Ratings, likes: a clear signal, but sparse.
Implicit
Clicks, views, time on page: noisy, but dense.
Modern systems rely mostly on implicit feedback.
In analytics
Analyst roles around recommenders:
1. Analyze impact
A/B test a recommendation algorithm against a baseline and measure the revenue lift.
2. Content insights
Which items are trending, which are underserved.
3. User segmentation
Different users come to recommendations for different reasons (discovery vs. efficient shopping).
4. Fairness / diversity
Not just the most clickable items, but also exposure for a variety of items.
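For the impact analysis in item 1, a common first pass is to compute the relative lift and check it with a two-proportion z-test. The sketch below uses conversion counts rather than revenue to keep the math simple, and the numbers are invented; revenue per session would typically need a test on means instead.

```python
import math

def ab_lift(control_conv, control_n, treat_conv, treat_n):
    """Relative lift, z-statistic and two-sided p-value for two proportions."""
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    lift = (p_t - p_c) / p_c
    # Pooled standard error under the null hypothesis of equal rates
    p_pool = (control_conv + treat_conv) / (control_n + treat_n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
    z = (p_t - p_c) / se
    # Normal CDF via the error function
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return lift, z, p_value

# Hypothetical experiment: baseline vs new recommendation algorithm
lift, z, p_value = ab_lift(control_conv=500, control_n=10_000,
                           treat_conv=560, treat_n=10_000)
print(f"lift={lift:.1%}, z={z:.2f}, p={p_value:.3f}")
```

In this made-up case a 12% relative lift still falls just short of the conventional 0.05 threshold, a reminder that a lift that looks large can be within noise at this sample size.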
Python libraries
- Surprise: classical algos
- LightFM: hybrid
- implicit: collaborative filtering
- RecBole: modern benchmarks
- TensorFlow Recommenders: deep learning
In the interview
"How does collaborative filtering work?" Build a user-item matrix, find similar users or items, and recommend what they liked.
"How do you solve cold start?" Popularity, content-based, demographic.
"Metrics?" Precision@K and NDCG offline; CTR and revenue online.
"Explore vs exploit?" Bandits. Show what is known to work, but periodically try something new.
Common mistakes
Only offline metrics
Offline metrics do not correlate perfectly with online ones. Always run an A/B test.
Ignoring business rules
"Show the most similar" gives no diversity. Re-rank.
Old data
Preferences change. Fresh data matters.
No cold start plan
New users and items can make up a large share of traffic. Plan for them.
Related topics
- What is clustering
- K-nearest neighbors for analysts
- Classification vs regression
- Precision and recall
FAQ
Does an analyst need to be able to build one?
No. Understand the concepts and know how to interpret the results.
Which approach is best?
Usually hybrid. It depends on the data.
LLM-based recs?
Emerging. ChatGPT-style recommendations are a new area.