Churn modeling for analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews. SQL, Python, A/B, metrics. Free.

Why this matters

Churn prediction is a major share of analytics work at subscription, telecom, and banking companies. Identifying at-risk customers early enables proactive interventions.

It is also a typical case study in interviews at SaaS companies, telecoms, and banks.

Defining churn

The definition depends on the business model:

Subscription

Usually: the customer cancels the subscription.

Usage-based

Inactive for N days (N is chosen per product).

Banking

Closing all accounts, or just the primary account.

Mobile app

Hasn't opened the app for N days (usually 7, 14, or 30).

Pick one clear definition before building the model.
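The usage-based definition can be sketched in pandas; the table and column names here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical activity snapshot: one row per user with their last active date
activity = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_active": pd.to_datetime(["2024-01-05", "2024-02-20", "2024-03-01"]),
})

cutoff = pd.Timestamp("2024-03-01")
N = 30  # inactivity threshold in days; tune per product

# Usage-based definition: churned = inactive for N+ days as of the cutoff
activity["churned"] = (cutoff - activity["last_active"]).dt.days >= N
```

The same frame covers the mobile-app definition; only N changes.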

Problem setup

Binary classification:

  • Target (Y): did the customer churn in the prediction window?
  • Features (X): behavior, demographics, engagement

A typical prediction window is 30 or 90 days.

Data preparation

1. Define prediction point

Today, or a historical cutoff date.

2. Define window

With a 30-day window, you predict: will the customer churn in the next 30 days?

3. Label

For historical data: did the customer churn in [cutoff, cutoff + 30]? Label 1 or 0.

4. Features

Use pre-cutoff data only. NO data from the prediction window (that's leakage).
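The four steps can be sketched in pandas; `events` and its columns are made up, and the key point is that features come strictly from before the cutoff:

```python
import pandas as pd

# Hypothetical events table; features must come strictly from before the cutoff
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-10", "2024-02-15", "2024-03-20", "2024-02-01", "2024-02-10"]),
})

cutoff = pd.Timestamp("2024-03-01")          # step 1: prediction point
window_end = cutoff + pd.Timedelta(days=30)  # step 2: 30-day window

# Step 4: features from pre-cutoff events only (later events would be leakage)
features = (events[events["event_date"] < cutoff]
            .groupby("user_id").size().rename("event_count"))

# Step 3: label = 1 if the user had no activity inside [cutoff, cutoff + 30)
active = events[(events["event_date"] >= cutoff)
                & (events["event_date"] < window_end)]["user_id"].unique()
labels = pd.Series({u: int(u not in active) for u in features.index},
                   name="churned")

df = pd.concat([features, labels], axis=1)
```

In this toy data, user 1 is active inside the window (label 0) and user 2 is not (label 1).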

Features

Behavioral

  • Days active last 30
  • Session count
  • Actions per session
  • Features used
  • Time of day patterns

Transactional

  • Spend last N months
  • Frequency trend (last vs previous)
  • Product categories

Engagement

  • Email opens
  • Support tickets
  • NPS score

Tenure

  • Days since signup
  • Subscription length
  • Plan tier

Demographics

  • Country
  • Age (if known)
  • Device

Recent changes

  • Price change?
  • Feature deprecation?
  • Issue / outage experience?

Feature engineering

Aggregations

Avg, max, min over time windows (7d, 30d, 90d).

Trends

Ratio of the last period to the previous period.

Recency

Days since last action.

Flags

«Has premium», «Has integrated payment», etc.
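All four patterns (aggregation, trend, recency, flag) in one pandas sketch; the `log` table and its columns are made-up examples:

```python
import pandas as pd

# Made-up per-event spend log
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2024-02-01", "2024-02-20", "2024-02-25", "2024-01-15", "2024-02-05"]),
    "spend": [10.0, 5.0, 20.0, 30.0, 15.0],
})
cutoff = pd.Timestamp("2024-03-01")

def window_sum(start_days, end_days):
    """Total spend per user in [cutoff - start_days, cutoff - end_days)."""
    lo = cutoff - pd.Timedelta(days=start_days)
    hi = cutoff - pd.Timedelta(days=end_days)
    return log[(log["date"] >= lo) & (log["date"] < hi)] \
        .groupby("user_id")["spend"].sum()

feats = pd.DataFrame({
    "spend_30d": window_sum(30, 0),        # aggregation: last 30 days
    "spend_prev_30d": window_sum(60, 30),  # the 30 days before that
}).fillna(0)

# Trend: current vs previous window (+1 guards against division by zero)
feats["spend_trend"] = (feats["spend_30d"] + 1) / (feats["spend_prev_30d"] + 1)
# Recency: days since last action
feats["recency_days"] = (cutoff - log.groupby("user_id")["date"].max()).dt.days
# Flag
feats["active_last_7d"] = (feats["recency_days"] <= 7).astype(int)
```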

Modeling

Logistic regression

Interpretable. Good baseline.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Probability of churn
probs = model.predict_proba(X_test)[:, 1]

XGBoost / LightGBM

Often higher accuracy. Still interpretable via SHAP.

Random Forest

Robust baseline.

Survival analysis

Predicts time-to-event, not just a binary label.

Cox proportional hazards is the standard choice. Good when timing matters.

Evaluation

Imbalanced

Churn is typically rare (1-5%). Avoid plain accuracy.

Metrics:

  • AUC-ROC
  • Precision-recall curve
  • F1
  • Lift curve

Lift

«Of the top decile by predicted score, how many actually churn?»

Ideally far more than the base rate. For example, if the top 10% of predicted churners are 50% actual churners against a 10% base rate, that is a 5x lift.
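Top-decile lift can be computed directly from scores and labels; here with synthetic data where churners tend to score higher:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Synthetic labels (~5% churners) and scores correlated with them
y = rng.random(n) < 0.05
scores = y * 0.5 + rng.random(n)  # churners tend to score higher

k = n // 10                         # top decile
top = np.argsort(scores)[::-1][:k]  # indices of the highest scores

base_rate = y.mean()
top_rate = y[top].mean()
lift = top_rate / base_rate
print(f"base {base_rate:.3f}, top decile {top_rate:.3f}, lift {lift:.1f}x")
```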

Business metric

Ultimately, what matters is reduced churn after interventions.

Interventions

The model flags a likely churner. Then what?

Retention campaigns

  • Email with discount
  • Personal outreach
  • Feature education
  • Success manager call

Product fixes

If churn reasons are identifiable:

  • Address UX issues
  • Offer alternatives
  • Pricing flexibility

Prevent vs rescue

The earlier the catch, the easier the save.

Score daily and intervene promptly.

Test effectiveness

A/B test the intervention:

  • Split predicted churners
  • Half get intervention
  • Half control
  • Compare retention

Without an A/B test you cannot attribute the retention change to the intervention: correlation is not causation.
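Comparing retention between the two halves is a two-proportion z-test; a stdlib-only sketch with made-up campaign numbers:

```python
from math import sqrt, erf

# Made-up campaign results among predicted churners
treated_n, treated_retained = 1000, 620   # got the retention offer
control_n, control_retained = 1000, 550   # held out as control

p1, p0 = treated_retained / treated_n, control_retained / control_n

# Two-proportion z-test with a pooled standard error
p_pool = (treated_retained + control_retained) / (treated_n + control_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / treated_n + 1 / control_n))
z = (p1 - p0) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

print(f"uplift {p1 - p0:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```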

In SQL + Python

Data prep (SQL)

WITH features AS (
    -- Behavior in the 30 days before the cutoff (here cutoff = CURRENT_DATE - 30)
    SELECT
        user_id,
        COUNT(*) AS session_count_30d,
        AVG(session_duration) AS avg_session,
        COUNT(DISTINCT feature) AS features_used,
        MAX(event_date) AS last_seen
    FROM events
    WHERE event_date >= CURRENT_DATE - 60
      AND event_date <  CURRENT_DATE - 30
    GROUP BY user_id
),
labels AS (
    -- Churned = no activity at all inside the 30-day prediction window
    SELECT
        f.user_id,
        CASE WHEN NOT EXISTS (
            SELECT 1
            FROM events e
            WHERE e.user_id = f.user_id
              AND e.event_date >= CURRENT_DATE - 30
        ) THEN 1 ELSE 0 END AS churned
    FROM features f
)
SELECT * FROM features JOIN labels USING (user_id);

Model (Python)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_sql(query, conn)
X = df.drop(['user_id', 'churned'], axis=1)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# scale_pos_weight = (# negatives) / (# positives) to offset class imbalance
model = XGBClassifier(scale_pos_weight=(len(y) - sum(y)) / sum(y))
model.fit(X_train, y_train)

print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Pitfalls

1. Label leakage

If features include data from the prediction window, the model «cheats».

2. Wrong label timeframe

Too short: incomplete signal. Too long: the model reacts slowly.

3. Ignore changes

After a product change, the model is still trained on old behavior. Retrain periodically.

4. Static models

The business keeps changing; quarterly retraining is a reasonable default.

5. One-size-fits-all

Different segments may need different models: B2C vs B2B, paying vs free.

Survival analysis

For time-to-event modeling:

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col='tenure', event_col='churned')

# Predict survival curves
cph.predict_survival_function(new_customers)

Answers «how long will this customer survive?», not just yes / no.

Useful for LTV calculations.

In the interview

«How would you build a churn model for [product]?»

Walk through:

  1. Churn definition
  2. Prediction window
  3. Features (behavior + other)
  4. Model (start simple — logistic)
  5. Evaluation (AUC, lift)
  6. Actions (interventions)
  7. A/B test effectiveness

Structured thinking is the point.

«How do you handle class imbalance?»

Class weights, oversampling (SMOTE), or threshold tuning.
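A sketch of two of these on synthetic imbalanced data: class weights in the loss, then tuning the decision threshold instead of using the default 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic data with ~5% positives, mimicking rare churn
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Class weights rebalance the loss without resampling
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# Threshold tuning: pick the cutoff maximizing F1 instead of the default 0.5
grid = np.linspace(0.1, 0.9, 17)
best_t = max(grid, key=lambda t: f1_score(y, probs >= t))
print(best_t, f1_score(y, probs >= best_t))
```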


FAQ

Modeling vs rules?

Rules are simple; start there. Move to ML when rules are insufficient.

A neural network for churn?

Typically overkill. On tabular data, tree-based models often win.

Proprietary data?

Usually, yes. Real-world churn data is messy and limited.


Practice ML: open the trainer with 1500+ interview questions.