ML Cheat Sheet for Analysts
Карьерник is a Telegram quiz trainer with 1500+ analyst interview questions: SQL, Python, A/B testing, metrics. Free.
Why this matters
A data analyst doesn't build neural networks daily, but knowing ML basics is a must. Churn models, lead scoring, recommendations: all ML-based.
This cheat sheet helps you understand ML terminology.
Supervised vs Unsupervised
Supervised
There are labels (Y). The task: predict Y.
- Classification (discrete Y)
- Regression (continuous Y)
Unsupervised
No labels. Find patterns.
- Clustering
- Dimensionality reduction
Reinforcement
An agent acts in an environment and learns from rewards.
Classification
Algorithms
- Logistic regression: linear, interpretable
- Decision tree: if-else rules
- Random Forest: ensemble trees
- XGBoost / LightGBM / CatBoost: gradient boosting
- KNN: nearest neighbors
- SVM: support vector machines
- Neural networks: complex patterns
Metrics
- Accuracy: (TP+TN) / total
- Precision: TP / (TP+FP) — out of predicted positive
- Recall: TP / (TP+FN) — out of actual positive
- F1: harmonic mean precision & recall
- AUC-ROC: ranking quality
- AUC-PR: better for imbalanced data
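A quick sketch of these metrics with scikit-learn, on toy labels and predictions (all numbers hypothetical):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy ground truth and model outputs (hypothetical numbers)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # hard labels
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # P(y=1)

acc  = accuracy_score(y_true, y_pred)    # (TP+TN) / total
prec = precision_score(y_true, y_pred)   # TP / (TP+FP)
rec  = recall_score(y_true, y_pred)      # TP / (TP+FN)
f1   = f1_score(y_true, y_pred)          # harmonic mean of precision & recall
auc  = roc_auc_score(y_true, y_prob)     # ranking quality, needs probabilities
```

Note that AUC takes probabilities, not hard labels: it measures how well the model ranks positives above negatives.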
Regression
Algorithms
- Linear regression
- Polynomial
- Ridge / Lasso (regularization)
- Tree-based (RF, XGBoost)
- Neural networks
Metrics
- MSE: mean squared error
- RMSE: √MSE
- MAE: mean absolute error
- MAPE: mean absolute percentage error
- R²: coefficient of determination
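The same formulas in plain NumPy, on a toy vector (numbers are illustrative):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse  = np.mean((y_true - y_pred) ** 2)                    # mean squared error
rmse = np.sqrt(mse)                                       # same units as target
mae  = np.mean(np.abs(y_true - y_pred))                   # robust to outliers
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # breaks if y_true == 0
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```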
Clustering
Algorithms
- K-means: partitional
- Hierarchical: tree-based
- DBSCAN: density-based
- Gaussian Mixture
Evaluation
- Silhouette score
- Davies-Bouldin index
- Business intuition
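A minimal sketch: K-means on two synthetic blobs, checked with the silhouette score (data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated synthetic "user segments"
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km.labels_)  # close to 1 => well-separated clusters
```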
Train / test split
Basic
Random split of the data, typically 70/30 or 80/20.
Stratified
Preserves the class distribution; use for classification.
Cross-validation
Split into K folds, train K times, average the metrics.
Time series
Chronological split! Never random.
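Both ideas in scikit-learn: a stratified split preserves the class ratio, and TimeSeriesSplit never lets train see the future (toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)     # imbalanced labels

# Stratified: both parts keep the 90/10 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Time series: every train index precedes every test index
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()   # no future leaks into train
```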
Feature engineering
Categorical encoding
- One-hot: K categories → K columns
- Label encoding: convert to integers
- Target encoding: mean target per category
Numerical
- Standardization: (x - mean) / std
- Normalization: (x - min) / (max - min)
- Log transform: log(x + 1) for skewed distributions
- Binning: continuous → categorical
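The encoding and scaling transforms above, sketched with scikit-learn on toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

x = np.array([[1.0], [2.0], [3.0], [10.0]])
z  = StandardScaler().fit_transform(x)   # (x - mean) / std
mm = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min)
lg = np.log1p(x)                         # log(x + 1), tames the skewed tail

cats = np.array([["a"], ["b"], ["a"]])
onehot = OneHotEncoder().fit_transform(cats).toarray()  # K categories -> K columns
```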
Interactions
Product features: A × B.
Dates
Extract day, month, weekday, etc.
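A typical pandas version of this extraction (the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-15", "2024-03-02"])})
df["day"]        = df["order_date"].dt.day
df["month"]      = df["order_date"].dt.month
df["weekday"]    = df["order_date"].dt.weekday          # Monday = 0
df["is_weekend"] = df["weekday"].isin([5, 6])
```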
Model evaluation pipeline
- Split data
- Preprocess (scale, encode)
- Train on train set
- Validate on val set (hyperparameter tuning)
- Evaluate final model on test set
- Deploy with monitoring
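The steps above as a scikit-learn Pipeline on synthetic data; keeping preprocessing inside the pipeline means the scaler is fit on train only, so no test statistics leak in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaler is fit on the train part only, automatically, at every .fit call
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```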
Hyperparameter tuning
Grid search
Tests every combination. Exhaustive but slow.
Random search
Samples combinations at random. Often finds good values faster than grid search.
Bayesian (Optuna)
Models the search space to focus on promising trials. Efficient.
Cross-validation
Always use it for reliable tuning.
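Grid search with built-in cross-validation, sketched on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,                # cross-validation for a reliable estimate
    scoring="roc_auc",
)
grid.fit(X, y)           # 4 combinations x 5 folds = 20 fits
```

Swap GridSearchCV for RandomizedSearchCV (or Optuna) when the grid gets large.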
Overfitting
Signs
Train error low, test error high.
Fixes
- More data
- Regularization (L1, L2)
- Dropout (NN)
- Early stopping
- Simpler model
Underfitting
Signs: both errors high.
Fixes: a more complex model, more features.
Bias-variance tradeoff
- High bias: underfitting; the model is too simple → add complexity.
- High variance: overfitting; the model is too complex → simplify or regularize.
Find the balance.
Imbalanced data
Classes unequal (99/1).
Methods
- Class weights
- SMOTE (synthetic oversampling)
- Undersampling
- Threshold tuning
- Different metric (F1, AUC-PR, not accuracy)
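Two of these fixes (class weights and threshold tuning) in one sketch on a synthetic ~95/5 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# ~95/5 class imbalance (synthetic)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: errors on the rare class cost more
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Threshold tuning: move the cutoff instead of keeping the default 0.5
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.6).astype(int)
score = f1_score(y_te, pred)   # judge by F1, not accuracy
```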
Data leakage
A feature contains information that will not be available at prediction time (from the future).
Examples:
- Using target statistics (target encoding without proper fold separation)
- Include post-event features
Detect: suspiciously high performance.
Feature importance
Tree-based
Gini / gain importance. Fast, but biased toward high-cardinality features.
Permutation
Shuffle column, measure drop. Model-agnostic.
SHAP
Grounded in game theory (Shapley values). Best for interpretation.
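Permutation importance in scikit-learn; with one informative feature and one noise column, shuffling the informative one should hurt the score far more (synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# shuffle=False keeps the single informative feature in column 0
X, y = make_regression(n_samples=300, n_features=2, n_informative=1,
                       shuffle=False, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each column in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```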
Interpretability
Simple models
Linear, logistic, tree — interpretable.
Complex models
NN, XGBoost — black box.
Tools
- SHAP: feature contribution
- LIME: local explanations
- Partial dependence plots
Common algorithms for analysts
Predictive
- Logistic regression (churn, fraud)
- XGBoost / Random Forest (most tabular)
Clustering
- K-means (user segments)
Recommendations
- Collaborative filtering
- Matrix factorization
Time series
- Prophet
- ARIMA
- XGBoost with lag features
Libraries
Python
- scikit-learn — general-purpose
- XGBoost / LightGBM / CatBoost — gradient boosting
- statsmodels — statistical models
- TensorFlow / PyTorch — neural networks
- prophet — time series
- lifelines — survival
Deployment
Batch
Predictions on a schedule. Simple.
Online
Real-time API. Complex.
Monitoring
- Performance drift
- Feature drift
- Data drift
Model decays. Retrain periodically.
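A common drift check is the Population Stability Index (PSI); a minimal NumPy sketch, assuming the usual rule of thumb that PSI > 0.2 signals significant drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a new sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
drifted  = rng.normal(1, 1, 10_000)   # mean shifted by one sigma
# psi(baseline, drifted) > 0.2 => retrain candidate
```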
MLOps
Key concepts
- Versioning models
- A/B testing of model versions
- Logging inputs/outputs
- Continuous retraining
Tools: MLflow, Airflow, Kubeflow.
Interview questions
"Supervised vs unsupervised?" Supervised: labels. Unsupervised: no labels, discover patterns.
"Is AUC 0.9 good?" Usually yes, very good. But it depends on the task.
"Imbalanced classes?" Weighted loss, SMOTE, threshold tuning, F1.
"Overfitting?" Regularization, more data, a simpler model.
"Feature importance?" SHAP is the best choice.
The bottom line
An analyst is not an ML engineer. But knowing the basics lets you:
- Evaluate ML projects
- Interpret results
- Communicate with stakeholders
- Build simple models yourself
FAQ
How deep does an analyst need to go?
Middle analyst: the basics. Senior / data scientist: in depth.
Python or R?
Python is the mainstream choice. R lives in academia and statistics.
Is Kaggle useful?
Yes: practice and a portfolio.
Practice ML: open the trainer with 1500+ interview questions.