Churn modeling for analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews. SQL, Python, A/B, metrics. Free.

Why this matters

Churn prediction is a major share of analytics work at subscription, telecom, and banking companies. Identifying at-risk customers early enables proactive interventions.

It is also a typical case study in interviews at SaaS companies, telecoms, and banks.

Defining churn

The definition depends on the business model:

Subscription

Usually: the customer cancels the subscription.

Usage-based

Inactive for N days (N is chosen per product).

Banking

Closing all accounts, or just the primary account.

Mobile app

Hasn't opened the app for N days (usually 7, 14, or 30).

Pick one clear definition before building the model.
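The usage-based definition can be sketched in pandas; the table and column names here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical activity snapshot: one row per user with their last active date
activity = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_active": pd.to_datetime(["2024-01-05", "2024-02-20", "2024-03-01"]),
})

cutoff = pd.Timestamp("2024-03-01")
N = 30  # inactivity threshold in days; tune per product

# Usage-based definition: churned = inactive for N+ days as of the cutoff
activity["churned"] = (cutoff - activity["last_active"]).dt.days >= N
```

The same frame covers the mobile-app definition; only N changes.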

Problem setup

Binary classification:

  • Target (Y): did the customer churn in the prediction window?
  • Features (X): behavior, demographics, engagement

A typical prediction window is 30 or 90 days.

Data preparation

1. Define prediction point

Today, or a historical cutoff date.

2. Define window

With a 30-day window, you predict: will the customer churn in the next 30 days?

3. Label

For historical data: did the customer churn in [cutoff, cutoff + 30]? Label 1 or 0.

4. Features

Use pre-cutoff data only. NO data from the prediction window (that's leakage).
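The four steps can be sketched in pandas; `events` and its columns are made up, and the key point is that features come strictly from before the cutoff:

```python
import pandas as pd

# Hypothetical events table; features must come strictly from before the cutoff
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-10", "2024-02-15", "2024-03-20", "2024-02-01", "2024-02-10"]),
})

cutoff = pd.Timestamp("2024-03-01")          # step 1: prediction point
window_end = cutoff + pd.Timedelta(days=30)  # step 2: 30-day window

# Step 4: features from pre-cutoff events only (later events would be leakage)
features = (events[events["event_date"] < cutoff]
            .groupby("user_id").size().rename("event_count"))

# Step 3: label = 1 if the user had no activity inside [cutoff, cutoff + 30)
active = events[(events["event_date"] >= cutoff)
                & (events["event_date"] < window_end)]["user_id"].unique()
labels = pd.Series({u: int(u not in active) for u in features.index},
                   name="churned")

df = pd.concat([features, labels], axis=1)
```

In this toy data, user 1 is active inside the window (label 0) and user 2 is not (label 1).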

Features

Behavioral

  • Days active last 30
  • Session count
  • Actions per session
  • Features used
  • Time of day patterns

Transactional

  • Spend last N months
  • Frequency trend (last vs previous)
  • Product categories

Engagement

  • Email opens
  • Support tickets
  • NPS score

Tenure

  • Days since signup
  • Subscription length
  • Plan tier

Demographics

  • Country
  • Age (if known)
  • Device

Recent changes

  • Price change?
  • Feature deprecation?
  • Issue / outage experience?

Feature engineering

Aggregations

Avg, max, min over time windows (7d, 30d, 90d).

Trends

Ratio of the last period to the previous period.

Recency

Days since last action.

Flags

«Has premium», «Has integrated payment», etc.
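All four patterns (aggregation, trend, recency, flag) in one pandas sketch; the `log` table and its columns are made-up examples:

```python
import pandas as pd

# Made-up per-event spend log
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2024-02-01", "2024-02-20", "2024-02-25", "2024-01-15", "2024-02-05"]),
    "spend": [10.0, 5.0, 20.0, 30.0, 15.0],
})
cutoff = pd.Timestamp("2024-03-01")

def window_sum(start_days, end_days):
    """Total spend per user in [cutoff - start_days, cutoff - end_days)."""
    lo = cutoff - pd.Timedelta(days=start_days)
    hi = cutoff - pd.Timedelta(days=end_days)
    return log[(log["date"] >= lo) & (log["date"] < hi)] \
        .groupby("user_id")["spend"].sum()

feats = pd.DataFrame({
    "spend_30d": window_sum(30, 0),        # aggregation: last 30 days
    "spend_prev_30d": window_sum(60, 30),  # the 30 days before that
}).fillna(0)

# Trend: current vs previous window (+1 guards against division by zero)
feats["spend_trend"] = (feats["spend_30d"] + 1) / (feats["spend_prev_30d"] + 1)
# Recency: days since last action
feats["recency_days"] = (cutoff - log.groupby("user_id")["date"].max()).dt.days
# Flag
feats["active_last_7d"] = (feats["recency_days"] <= 7).astype(int)
```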

Modeling

Logistic regression

Interpretable. Good baseline.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Probability of churn
probs = model.predict_proba(X_test)[:, 1]

XGBoost / LightGBM

Often higher accuracy. Still interpretable via SHAP.

Random Forest

Robust baseline.

Survival analysis

Predicts time-to-event, not just a binary label.

Cox proportional hazards is the standard choice. Good when timing matters.

Evaluation

Imbalanced

Churn is typically rare (1-5%). Avoid plain accuracy.

Metrics:

  • AUC-ROC
  • Precision-recall curve
  • F1
  • Lift curve

Lift

«Of the top decile by predicted score, how many actually churn?»

Ideally far more than the base rate. For example, if the top 10% of predicted churners are 50% actual churners against a 10% base rate, that is a 5x lift.
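Top-decile lift can be computed directly from scores and labels; here with synthetic data where churners tend to score higher:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Synthetic labels (~5% churners) and scores correlated with them
y = rng.random(n) < 0.05
scores = y * 0.5 + rng.random(n)  # churners tend to score higher

k = n // 10                         # top decile
top = np.argsort(scores)[::-1][:k]  # indices of the highest scores

base_rate = y.mean()
top_rate = y[top].mean()
lift = top_rate / base_rate
print(f"base {base_rate:.3f}, top decile {top_rate:.3f}, lift {lift:.1f}x")
```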

Business metric

Ultimately, what matters is reduced churn after interventions.

Interventions

The model flags a likely churner. Then what?

Retention campaigns

  • Email with discount
  • Personal outreach
  • Feature education
  • Success manager call

Product fixes

If churn reasons are identifiable:

  • Address UX issues
  • Offer alternatives
  • Pricing flexibility

Prevent vs rescue

The earlier the catch, the easier the save.

Score daily and intervene promptly.

Test effectiveness

A/B test the intervention:

  • Split predicted churners
  • Half get intervention
  • Half control
  • Compare retention

Without an A/B test you cannot attribute the retention change to the intervention: correlation is not causation.
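Comparing retention between the two halves is a two-proportion z-test; a stdlib-only sketch with made-up campaign numbers:

```python
from math import sqrt, erf

# Made-up campaign results among predicted churners
treated_n, treated_retained = 1000, 620   # got the retention offer
control_n, control_retained = 1000, 550   # held out as control

p1, p0 = treated_retained / treated_n, control_retained / control_n

# Two-proportion z-test with a pooled standard error
p_pool = (treated_retained + control_retained) / (treated_n + control_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / treated_n + 1 / control_n))
z = (p1 - p0) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

print(f"uplift {p1 - p0:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```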

In SQL + Python

Data prep (SQL)

WITH features AS (
    -- Behavior in the 30 days before the cutoff (here cutoff = CURRENT_DATE - 30)
    SELECT
        user_id,
        COUNT(*) AS session_count_30d,
        AVG(session_duration) AS avg_session,
        COUNT(DISTINCT feature) AS features_used,
        MAX(event_date) AS last_seen
    FROM events
    WHERE event_date >= CURRENT_DATE - 60
      AND event_date <  CURRENT_DATE - 30
    GROUP BY user_id
),
labels AS (
    -- Churned = no activity at all inside the 30-day prediction window
    SELECT
        f.user_id,
        CASE WHEN NOT EXISTS (
            SELECT 1
            FROM events e
            WHERE e.user_id = f.user_id
              AND e.event_date >= CURRENT_DATE - 30
        ) THEN 1 ELSE 0 END AS churned
    FROM features f
)
SELECT * FROM features JOIN labels USING (user_id);

Model (Python)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_sql(query, conn)
X = df.drop(['user_id', 'churned'], axis=1)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# scale_pos_weight = (# negatives) / (# positives) to offset class imbalance
model = XGBClassifier(scale_pos_weight=(len(y) - sum(y)) / sum(y))
model.fit(X_train, y_train)

print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Pitfalls

1. Label leakage

If features include data from the prediction window, the model «cheats».

2. Wrong label timeframe

Too short: incomplete signal. Too long: the model reacts slowly.

3. Ignore changes

After a product change, the model is still trained on old behavior. Retrain periodically.

4. Static models

The business keeps changing; quarterly retraining is a reasonable default.

5. One-size-fits-all

Different segments may need different models: B2C vs B2B, paying vs free.

Survival analysis

For time-to-event modeling:

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col='tenure', event_col='churned')

# Predict survival curves
cph.predict_survival_function(new_customers)

Answers «how long will this customer survive?», not just yes / no.

Useful for LTV calculations.

In the interview

«How would you build a churn model for [product]?»

Walk through:

  1. Churn definition
  2. Prediction window
  3. Features (behavior + other)
  4. Model (start simple — logistic)
  5. Evaluation (AUC, lift)
  6. Actions (interventions)
  7. A/B test effectiveness

Structured thinking is the point.

«How do you handle class imbalance?»

Class weights, oversampling (SMOTE), or threshold tuning.
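A sketch of two of these on synthetic imbalanced data: class weights in the loss, then tuning the decision threshold instead of using the default 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic data with ~5% positives, mimicking rare churn
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Class weights rebalance the loss without resampling
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# Threshold tuning: pick the cutoff maximizing F1 instead of the default 0.5
grid = np.linspace(0.1, 0.9, 17)
best_t = max(grid, key=lambda t: f1_score(y, probs >= t))
print(best_t, f1_score(y, probs >= best_t))
```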


FAQ

Modeling vs rules?

Rules are simple; start there. Move to ML when rules are insufficient.

A neural network for churn?

Typically overkill. On tabular data, tree-based models often win.

Proprietary data?

Usually, yes. Real-world churn data is messy and limited.


Practice ML: open the trainer with 1500+ interview questions.