Churn modeling for analysts
Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions. SQL, Python, A/B tests, metrics. Free.
Why this matters
Churn prediction is a staple of analytics work at subscription, telecom, and banking companies. Identifying at-risk customers early enables proactive interventions.
It is a typical case-study question in interviews at SaaS companies, telecoms, and banks.
Defining churn
Depends on the business model:
Subscription
Usually: the customer cancels the subscription.
Usage-based
Inactive for N days (N is configurable).
Banking
Closes all accounts, or just the primary account.
Mobile app
Has not opened the app for N days (usually 7 / 14 / 30).
Pick one clear definition for the model.
Problem setup
Binary classification:
- Target (Y): did the customer churn in the prediction window?
- Features (X): behavior, demographics, engagement
A typical prediction window is 30 or 90 days.
Data preparation
1. Define the prediction point
Today, or a historical cutoff.
2. Define the window
Predicting 30 days ahead: will the customer churn in the next 30 days?
3. Label
For historical data: did the customer churn in [cutoff, cutoff + 30]? 1 or 0.
4. Features
Pre-cutoff data only. NO data from the prediction window (leakage).
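The four preparation steps can be sketched in pandas. The events table, column names, and dates below are illustrative assumptions, not a real schema:

```python
import pandas as pd

# Hypothetical activity log: one row per user action (illustrative data)
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-03", "2024-01-20", "2024-02-25"]
    ),
})

cutoff = pd.Timestamp("2024-02-01")           # step 1: historical prediction point
window_end = cutoff + pd.Timedelta(days=30)   # step 2: 30-day prediction window

# Step 3: label = 1 if the user had NO activity inside [cutoff, cutoff + 30d)
active_in_window = events[
    (events["event_date"] >= cutoff) & (events["event_date"] < window_end)
]["user_id"].unique()
labels = (
    events[["user_id"]].drop_duplicates()
    .assign(churned=lambda d: (~d["user_id"].isin(active_in_window)).astype(int))
)

# Step 4: features come strictly from pre-cutoff data, so nothing leaks
pre = events[events["event_date"] < cutoff]
features = pre.groupby("user_id")["event_date"].agg(
    sessions="count", last_seen="max"
).reset_index()
```

Note that the same cutoff date drives both the label and the feature windows; using two unrelated dates is a common source of leakage.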
Features
Behavioral
- Days active last 30
- Session count
- Actions per session
- Features used
- Time of day patterns
Transactional
- Spend last N months
- Frequency trend (last period vs previous)
- Product categories
Engagement
- Email opens
- Support tickets
- NPS score
Tenure
- Days since signup
- Subscription length
- Plan tier
Demographics
- Country
- Age (if known)
- Device
Recent changes
- Recent price change?
- Feature deprecation?
- Experienced an issue or outage?
Feature engineering
Aggregations
Avg, max, min over time windows (7d, 30d, 90d).
Trends
Ratio of the last period to the previous period.
Recency
Days since last action.
Flags
"Has premium", "Has integrated payments", etc.
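One way to build these aggregation, trend, and recency features in pandas; the transaction table and the 30/60-day windows are assumptions for illustration:

```python
import pandas as pd

cutoff = pd.Timestamp("2024-03-01")  # prediction point
# Hypothetical per-user spend log (illustrative data)
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2024-01-15", "2024-02-10", "2024-02-25", "2024-01-05", "2024-01-10"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 5.0],
})

def window_sum(df, days):
    """Aggregation: total spend in the `days` before the cutoff."""
    start = cutoff - pd.Timedelta(days=days)
    mask = (df["date"] >= start) & (df["date"] < cutoff)
    return df[mask].groupby("user_id")["amount"].sum()

f = pd.DataFrame(index=tx["user_id"].unique())
f["spend_30d"] = window_sum(tx, 30)
f["spend_60d"] = window_sum(tx, 60)
# Trend: last 30 days vs the 30 days before that
f["trend"] = f["spend_30d"] / (f["spend_60d"] - f["spend_30d"])
# Recency: days since the last transaction
f["days_since_last"] = (cutoff - tx.groupby("user_id")["date"].max()).dt.days
f = f.fillna(0.0)  # users with no recent activity get zeros
```

A trend near zero combined with high recency (long silence) is exactly the profile a churn model learns to flag.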
Modeling
Logistic regression
Interpretable. Good baseline.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Probability of churn
probs = model.predict_proba(X_test)[:, 1]

XGBoost / LightGBM
Often higher accuracy. Still interpretable via SHAP.
Random Forest
Robust baseline.
Survival analysis
Predicts time-to-event, not just a binary outcome.
Cox Proportional Hazards is the classic choice. Good when timing matters.
Evaluation
Imbalance
Churn is typically rare (1-5% per period). Avoid plain accuracy.
Metrics:
- AUC-ROC
- Precision-recall curve
- F1
- Lift curve
Lift
Of the top decile by predicted risk, how many actually churn?
Ideally, 50%+ of the top 10% predicted churners actually churn, which is a 5x lift over a 10% base rate.
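A decile-lift check can be sketched like this; the scores are synthetic stand-ins for real model probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated labels and scores: churners (y=1) tend to score higher (illustrative)
y = np.array([0] * 950 + [1] * 50)  # ~5% base churn rate
scores = np.where(
    y == 1,
    rng.uniform(0.3, 1.0, size=y.size),  # churners: higher predicted risk
    rng.uniform(0.0, 0.7, size=y.size),  # non-churners: lower predicted risk
)

def decile_lift(y_true, y_score):
    """Churn rate in the top 10% by score, relative to the overall rate."""
    order = np.argsort(-y_score)        # highest predicted risk first
    top = order[: len(order) // 10]     # top decile
    top_rate = y_true[top].mean()       # churn rate among the top 10%
    base_rate = y_true.mean()           # overall churn rate
    return top_rate / base_rate

lift = decile_lift(y, scores)
```

A lift well above 1 means the ranking is useful for targeting; a lift near 1 means the model ranks no better than random.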
Business metric
Ultimately, what matters is reduced churn after interventions.
Interventions
The model flags a likely churner, then what?
Retention campaigns
- Email with discount
- Personal outreach
- Feature education
- Success manager call
Product fixes
If churn reasons are identifiable:
- Address UX issues
- Offer alternatives
- Pricing flexibility
Prevent vs rescue
The earlier you catch it, the easier the save.
Score daily and intervene promptly.
Test effectiveness
A/B test the intervention:
- Split predicted churners
- Half get intervention
- Half control
- Compare retention
Without an A/B test you can mistake correlation for causation.
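Comparing the two groups is a standard two-proportion z-test. The campaign numbers here are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical outcomes among predicted churners (illustrative numbers)
treat_n, treat_retained = 1000, 620  # got the retention offer
ctrl_n, ctrl_retained = 1000, 550    # control, no intervention

p1 = treat_retained / treat_n
p2 = ctrl_retained / ctrl_n
p_pool = (treat_retained + ctrl_retained) / (treat_n + ctrl_n)

# Two-proportion z-test: is the retention lift statistically significant?
se = sqrt(p_pool * (1 - p_pool) * (1 / treat_n + 1 / ctrl_n))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
```

With a 7-point retention lift on 1000 users per arm, the test is comfortably significant; with small arms the same lift could easily be noise.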
In SQL + Python
Data prep (SQL)
-- Historical cutoff: 30 days ago; prediction window = the last 30 days
WITH labels AS (
SELECT
user_id,
-- Churned = no activity during the prediction window
CASE WHEN MAX(last_active) < CURRENT_DATE - 30 THEN 1 ELSE 0 END AS churned
FROM user_activity
GROUP BY user_id
),
features AS (
SELECT
user_id,
COUNT(*) AS session_count_30d,
AVG(session_duration) AS avg_session,
COUNT(DISTINCT feature) AS features_used,
MAX(event_date) AS last_seen
FROM events
WHERE event_date BETWEEN CURRENT_DATE - 60 AND CURRENT_DATE - 31
-- Features come strictly from before the cutoff (no leakage)
GROUP BY user_id
)
SELECT * FROM features JOIN labels USING (user_id);

Model (Python)
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
df = pd.read_sql(query, conn)
X = df.drop(['user_id', 'churned'], axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
model = XGBClassifier(scale_pos_weight=(len(y) - sum(y)) / sum(y))
model.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Pitfalls
1. Label leakage
Features include data from the prediction window, so the model "cheats".
2. Wrong label timeframe
Too short: incomplete signal. Too long: the model is slow to react.
3. Ignoring changes
After a product change, the model is trained on old behavior. Retrain periodically.
4. Static models
The business is dynamic. Retrain quarterly.
5. One-size-fits-all
Different segments may need different models: B2C vs B2B, paying vs free.
Survival analysis
For time-to-event modeling:
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df, duration_col='tenure', event_col='churned')
# Predict survival curves
cph.predict_survival_function(new_customers)

This answers "how long will this customer survive?", not just yes or no.
Useful for LTV calculations.
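A sketch of that LTV link: summing a survival curve gives the expected number of paying periods. The survival values and ARPU below are assumed for illustration:

```python
import numpy as np

# Hypothetical monthly survival probabilities from a fitted model (illustrative)
survival = np.array([1.00, 0.90, 0.82, 0.75, 0.70, 0.66])
arpu = 20.0  # assumed average revenue per user per month

# Expected paying months over the horizon = sum of survival probabilities
expected_months = survival.sum()
ltv = arpu * expected_months
```

Cutting the survival curve at a fixed horizon, as here, gives a conservative finite-horizon LTV rather than an extrapolated lifetime value.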
In the interview
"Build a churn model for [product]?"
Walk through:
- Churn definition
- Prediction window
- Features (behavior + other signals)
- Model (start simple: logistic regression)
- Evaluation (AUC, lift)
- Actions (interventions)
- A/B test of effectiveness
Structured thinking is what gets graded.
"What about class imbalance?"
Class weights, oversampling (SMOTE), threshold tuning.
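Two of those remedies, class weights and threshold tuning, can be sketched with scikit-learn on synthetic data (SMOTE lives in the separate imblearn package and is omitted here to keep the dependencies minimal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic imbalanced data, roughly 5% positives (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Class weights: reweight the loss toward the rare class
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# Threshold tuning: pick the cutoff that maximizes F1 instead of the default 0.5
thresholds = np.linspace(0.1, 0.9, 17)
best_t = max(thresholds, key=lambda t: f1_score(y, probs >= t))
```

In practice the threshold should be tuned on a validation set, not on the training data as in this compressed sketch.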
Related topics
- Churn in simple terms
- Customer health score
- How to compute churn in SQL
- Logistic regression
- XGBoost
- Survival analysis
FAQ
Modeling vs rules?
Rules are simple, so start there. Move to ML when rules are insufficient.
A neural network for churn?
Usually overkill. On tabular data, tree-based models often win.
Proprietary data?
Usually. Real-world churn data is messy and access-limited.
Practice ML: open the quiz trainer with 1500+ interview questions.