An ML Cheat Sheet for Analysts

Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions. SQL, Python, A/B testing, metrics. Free.

Why this matters

A data analyst doesn't build neural networks daily, but knowing the ML basics is a must: churn models, lead scoring, and recommendations are all ML-based.

This cheat sheet is about understanding ML terminology.

Supervised vs Unsupervised

Supervised

Labels (Y) are available. The model learns to predict Y.

  • Classification (discrete Y)
  • Regression (continuous Y)

Unsupervised

No labels. Find patterns.

  • Clustering
  • Dimensionality reduction

Reinforcement

An agent acts in an environment and learns from rewards.
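The supervised/unsupervised split in a few lines of scikit-learn (a sketch on synthetic data; the dataset and model choices here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: X are features, y are labels
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Supervised: use the labels y to fit a predictor of Y
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)

# Unsupervised: ignore y, look for structure in X alone
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```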

Classification

Algorithms

  • Logistic regression: linear, interpretable
  • Decision tree: if-else rules
  • Random Forest: ensemble trees
  • XGBoost / LightGBM / CatBoost: gradient boosting
  • KNN: nearest neighbors
  • SVM: support vector machines
  • Neural networks: complex patterns

Metrics

  • Accuracy: (TP+TN) / total
  • Precision: TP / (TP+FP) — out of predicted positive
  • Recall: TP / (TP+FN) — out of actual positive
  • F1: harmonic mean precision & recall
  • AUC-ROC: ranking quality
  • AUC-PR: better suited to imbalanced classes
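The metrics above, computed with scikit-learn on a toy prediction (the vectors are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # hard predictions: TP=3, FP=1, FN=1, TN=3
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted P(class = 1)

acc = accuracy_score(y_true, y_pred)    # (TP+TN)/total = 6/8
prec = precision_score(y_true, y_pred)  # TP/(TP+FP) = 3/4
rec = recall_score(y_true, y_pred)      # TP/(TP+FN) = 3/4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # needs probabilities, not hard labels
```

Note that AUC-ROC takes the probability column, not the thresholded predictions.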

Regression

Algorithms

  • Linear regression
  • Polynomial
  • Ridge / Lasso (regularization)
  • Tree-based (RF, XGBoost)
  • Neural networks

Metrics

  • MSE: mean squared error
  • RMSE: √MSE
  • MAE: mean absolute error
  • MAPE: mean absolute percentage error
  • R²: coefficient of determination
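The same regression metrics on a small made-up example (values chosen only to make the arithmetic easy to check):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.5])

mse = mean_squared_error(y_true, y_pred)           # mean of squared errors
rmse = np.sqrt(mse)                                # back in target units
mae = mean_absolute_error(y_true, y_pred)          # robust to outliers vs MSE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) # breaks if y_true has zeros
r2 = r2_score(y_true, y_pred)                      # 1 - SS_res / SS_tot
```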

Clustering

Algorithms

  • K-means: partitional
  • Hierarchical: tree-based
  • DBSCAN: density-based
  • Gaussian Mixture

Evaluation

  • Silhouette score
  • Davies-Bouldin index
  • Business intuition
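A sketch of cluster evaluation on well-separated synthetic blobs (toy data; real segmentations rarely score this cleanly):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Three well-separated blobs as an easy clustering target
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)     # range [-1, 1], higher is better
db = davies_bouldin_score(X, labels)  # >= 0, lower is better
```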

Train / test split

Basic

Randomly split the data, typically 70/30 or 80/20.

Stratified

Preserves the class distribution; use it for classification.

Cross-validation

Split into K folds, train K times, average the metrics.

Time series

Chronological split! Never random.
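The stratified and time-series variants side by side (a sketch on synthetic arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # pretend rows are ordered in time
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels, 80/20

# Stratified split preserves the 80/20 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Time series: every training fold strictly precedes its test fold
tscv = TimeSeriesSplit(n_splits=4)
folds = [(train.max(), test.min()) for train, test in tscv.split(X)]
```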

Feature engineering

Categorical encoding

  • One-hot: K categories → K columns
  • Label encoding: convert to integers
  • Target encoding: mean target per category

Numerical

  • Standardization: (x - mean) / std
  • Normalization: (x - min) / (max - min)
  • Log transform: log(x + 1) for skewed distributions
  • Binning: continuous → categorical

Interactions

Product features: A × B.

Dates

Extract day, month, weekday, etc.
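Several of these transformations in pandas (the column names and values are invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["msk", "spb", "msk", "nsk"],
    "income": [50_000.0, 120_000.0, 80_000.0, 30_000.0],
    "signup": pd.to_datetime(["2024-01-05", "2024-02-14",
                              "2024-03-01", "2024-03-08"]),
})

# One-hot: K categories -> K columns
one_hot = pd.get_dummies(df["city"], prefix="city")

# Standardization and log transform for the skewed numeric column
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_log"] = np.log1p(df["income"])

# Date parts
df["signup_month"] = df["signup"].dt.month
df["signup_weekday"] = df["signup"].dt.weekday  # Monday = 0
```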

Model evaluation pipeline

  1. Split data
  2. Preprocess (scale, encode)
  3. Train on train set
  4. Validate on val set (hyperparameter tuning)
  5. Evaluate final model on test set
  6. Deploy with monitoring
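Steps 2–5 map naturally onto a scikit-learn Pipeline, which keeps preprocessing fitted on the training data only (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The scaler is fit inside the pipeline, on the train split only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

test_score = pipe.score(X_test, y_test)  # accuracy on held-out data
```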

Hyperparameter tuning

Grid search

Tries every combination. Exhaustive but slow.

Random search

Samples combinations at random. Often finds a good configuration faster than grid search.

Bayesian (Optuna)

Uses the results of past trials to pick the next hyperparameters. More sample-efficient.

Cross-validation

Always tune against cross-validation for reliable estimates.
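Grid search with built-in cross-validation, as a sketch (the parameter grid is arbitrary, just to show the mechanics):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Every combination in param_grid is scored with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5,
)
grid.fit(X, y)

best_params = grid.best_params_  # combination with the best mean CV score
```

Swapping `GridSearchCV` for `RandomizedSearchCV` with a distribution per parameter gives the random-search variant with the same interface.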

Overfitting

Signs

Train error low, test error high.

Fixes

  • More data
  • Regularization (L1, L2)
  • Dropout (NN)
  • Early stopping
  • Simpler model
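The train/test gap is easy to reproduce: an unconstrained tree memorizes noisy training data, while limiting depth (the "simpler model" fix) shrinks the gap. A sketch on synthetic data with deliberately noisy labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects 20% label noise, so a perfect train fit is memorization
X, y = make_classification(n_samples=400, n_informative=5, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

gap_deep = deep.score(X_tr, y_tr) - deep.score(X_te, y_te)
gap_shallow = shallow.score(X_tr, y_tr) - shallow.score(X_te, y_te)
```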

Underfitting

Signs: both train and test errors are high.

Fixes: a more complex model, more features.

Bias-variance tradeoff

  • High bias: underfitting. The model is too simple.
  • High variance: overfitting. The model is too complex.

The goal is to balance the two.

Imbalanced data

Class sizes are very unequal (e.g., 99/1).

Methods

  • Class weights
  • SMOTE (synthetic oversampling)
  • Undersampling
  • Threshold tuning
  • A different metric (F1, AUC-PR, not accuracy)
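Class weights are the cheapest fix to try first. A sketch on a synthetic 95/5 dataset; `class_weight="balanced"` pushes the model to predict the minority class more often, which typically trades precision for recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# ~95% of samples in class 0, ~5% in class 1
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
# "balanced" reweights the loss inversely to class frequency
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

f1_plain = f1_score(y_te, plain.predict(X_te))
f1_weighted = f1_score(y_te, weighted.predict(X_te))
```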

Data leakage

A feature contains information that won't be available at prediction time.

Examples:

  • Using target statistics (target encoding without proper fold separation)
  • Including features generated after the predicted event

How to detect it: suspiciously high performance.
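A deliberately leaky feature makes the "suspiciously high performance" symptom concrete (everything here is synthetic; the leak is a column derived directly from the target):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X_ok = rng.normal(size=(500, 3))            # genuine features (pure noise here)
leak = y + rng.normal(scale=0.1, size=500)  # "feature" computed from the target

X_leaky = np.column_stack([X_ok, leak])

score_ok = cross_val_score(LogisticRegression(), X_ok, y, cv=5).mean()
score_leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()
# score_ok hovers near chance; score_leaky is near-perfect — a red flag
```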

Feature importance

Tree-based

Gini / gain importance. Fast but biased toward high-cardinality.

Permutation

Shuffle column, measure drop. Model-agnostic.

SHAP

Based on game theory (Shapley values). The best choice for interpretation.
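Permutation importance is the easiest of the three to run, since scikit-learn ships it out of the box (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
importances = result.importances_mean  # one value per feature
```

Measuring on held-out data (not train) is what makes this estimate honest about generalization.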

Interpretability

Simple models

Linear, logistic, tree — interpretable.

Complex models

NN, XGBoost — black box.

Tools

  • SHAP: feature contribution
  • LIME: local explanations
  • Partial dependence plots

Common algorithms for analysts

Predictive

  • Logistic regression (churn, fraud)
  • XGBoost / Random Forest (most tabular)

Clustering

  • K-means (user segments)

Recommendations

  • Collaborative filtering
  • Matrix factorization

Time series

  • Prophet
  • ARIMA
  • XGBoost with lag features

Libraries

Python

  • scikit-learn — general-purpose
  • XGBoost / LightGBM / CatBoost — gradient boosting
  • statsmodels — statistical models
  • TensorFlow / PyTorch — neural networks
  • prophet — time series
  • lifelines — survival analysis

Deployment

Batch

Predictions on a schedule. Simple.

Online

Real-time API. Complex.

Monitoring

  • Performance drift
  • Feature drift
  • Data drift

Models decay over time. Retrain periodically.

MLOps

Key concepts

  • Versioning models
  • A/B tests of models
  • Logging inputs/outputs
  • Continuous retraining

Tools: MLflow, Airflow, Kubeflow.

Interview questions

«Supervised vs unsupervised?» Supervised learns from labels. Unsupervised discovers patterns without them.

«Is AUC 0.9 good?» Usually very good, but it depends on the task.

«Imbalanced classes?» Weighted loss, SMOTE, threshold tuning, F1.

«Overfitting?» Regularization, more data, a simpler model.

«Feature importance?» SHAP gives the best picture.

Bottom line

An analyst is not an ML engineer. But knowing the basics lets you:

  • Evaluate ML projects
  • Interpret results
  • Communicate with stakeholders
  • Build simple models yourself

FAQ

How deep does an analyst need to go?

Middle analyst: the basics. Senior DS: in depth.

Python or R?

Python is mainstream. R lives in academia and statistics.

Is Kaggle useful?

Yes — practice and portfolio.


Practice ML — open the trainer with 1500+ interview questions.