How to Compute a Propensity Score in SQL


Why Propensity Scores

A propensity score is the probability that an event will happen for a given user: a purchase, churn, a click. It is the output of a model. Typical uses: 1) lift-based targeting (email the top deciles), 2) matching in causal analysis, 3) risk scoring.

Basic Calculation

The score itself is usually generated by an ML model (outside the scope of SQL). In SQL you work with the stored results, for example bucketing scores into segments:

SELECT
    user_id,
    propensity_to_purchase,
    propensity_to_churn,
    -- Bucketed
    CASE
        WHEN propensity_to_churn >= 0.8 THEN 'high-risk'
        WHEN propensity_to_churn >= 0.5 THEN 'medium-risk'
        WHEN propensity_to_churn >= 0.2 THEN 'low-risk'
        ELSE 'safe'
    END AS churn_risk_segment
FROM user_scores
WHERE scoring_date = CURRENT_DATE;

Calibration

A model is calibrated if its predicted probabilities match reality: among cases with a predicted probability of 0.8, about 80% should actually be positive. Checking calibration by score bucket:

WITH buckets AS (
    SELECT
        -- note: WIDTH_BUCKET returns bucket 11 for a score of exactly 1.0
        WIDTH_BUCKET(propensity_score, 0, 1, 10) AS decile,
        AVG(propensity_score) AS avg_predicted,
        AVG(CASE WHEN actual = 1 THEN 1.0 ELSE 0 END) AS actual_rate,
        COUNT(*) AS n
    FROM scored_predictions
    WHERE scoring_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)
SELECT
    decile,
    n,
    avg_predicted,
    actual_rate,
    actual_rate - avg_predicted AS calibration_gap
FROM buckets
ORDER BY decile;

If the gap is significant, the model is miscalibrated and needs isotonic regression or Platt scaling.
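Isotonic regression itself can be sketched outside SQL. Below is a minimal pure-Python implementation of the pool-adjacent-violators algorithm, which fits a monotone non-decreasing mapping from raw score to calibrated probability; the function and variable names are illustrative, not from any specific library.

```python
def isotonic_calibrate(scores, labels):
    """Fit a monotone non-decreasing mapping score -> P(positive)."""
    pairs = sorted(zip(scores, labels))  # sort by raw score
    blocks = []                          # each block: [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # merge adjacent blocks while their means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, n = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += n
    # expand block means back to one calibrated value per point
    calibrated = []
    for total, n in blocks:
        calibrated.extend([total / n] * n)
    return [s for s, _ in pairs], calibrated

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 1, 0, 1]
xs, ys = isotonic_calibrate(scores, labels)
print(ys)  # monotone non-decreasing calibrated probabilities
```

In production you would more likely use a library implementation (e.g. scikit-learn's IsotonicRegression) and store the calibrated scores back into the scoring table.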

Top deciles

SELECT
    decile,
    COUNT(*) AS users,
    AVG(propensity_score) AS avg_score,
    SUM(actual_purchases) AS total_purchases,
    SUM(actual_purchases)::NUMERIC * 100 / SUM(SUM(actual_purchases)) OVER () AS purchase_share_pct
FROM (
    SELECT
        user_id,
        propensity_score,
        actual_purchases,
        -- window functions cannot appear in GROUP BY,
        -- so assign deciles in a subquery first
        NTILE(10) OVER (ORDER BY propensity_score DESC) AS decile
    FROM user_scores
    WHERE scoring_date = CURRENT_DATE
) t
GROUP BY decile
ORDER BY decile;

The top decile often captures 20-30% of all positives, which is what makes targeting it efficient.
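The decile-capture logic can be sketched in pure Python as well: split users into 10 buckets by descending score (mirroring NTILE(10)) and compute each bucket's share of all positives. Data and names here are made up for illustration.

```python
def decile_capture(scores, positives):
    """Return the share of all positive outcomes captured by each decile."""
    rows = sorted(zip(scores, positives), reverse=True)  # best scores first
    n, k = len(rows), 10
    base, extra = divmod(n, k)  # like NTILE: first `extra` tiles get one extra row
    total_pos = sum(p for _, p in rows) or 1
    shares, i = [], 0
    for tile in range(k):
        size = base + (1 if tile < extra else 0)
        chunk = rows[i:i + size]
        shares.append(sum(p for _, p in chunk) / total_pos)
        i += size
    return shares

# toy data: 100 users, the 30 highest-scored ones actually purchased
scores = [1 - i / 100 for i in range(100)]
positives = [1] * 30 + [0] * 70
shares = decile_capture(scores, positives)
print(shares[0])  # top decile captures 10 of the 30 positives
```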


Lift chart

WITH ordered AS (
    SELECT
        actual,
        ROW_NUMBER() OVER (ORDER BY propensity_score DESC) AS rn,
        COUNT(*) OVER () AS total
    FROM scored_users
),
cumulative AS (
    -- cumulative counts must be computed BEFORE filtering to decile
    -- boundaries; a WHERE on the same level would shrink the running sums
    SELECT
        rn,
        total,
        SUM(CASE WHEN actual = 1 THEN 1 ELSE 0 END) OVER (ORDER BY rn) AS cum_positives,
        SUM(CASE WHEN actual = 1 THEN 1 ELSE 0 END) OVER () AS total_positives
    FROM ordered
)
SELECT
    rn::NUMERIC / total AS pct_of_population_targeted,
    cum_positives::NUMERIC / NULLIF(total_positives, 0) AS pct_positives_captured
FROM cumulative
WHERE rn % (total / 10) = 0  -- keep one point per decile of the population
ORDER BY rn;

Targeting

Take the top X% by propensity to build a marketing list:

SELECT
    user_id,
    email,
    propensity_to_purchase,
    percentile
FROM (
    SELECT
        user_id,
        email,
        propensity_to_purchase,
        NTILE(100) OVER (ORDER BY propensity_to_purchase DESC) AS percentile
    FROM user_scores
) t
WHERE percentile <= 10  -- top 10%; window functions are not allowed in WHERE
ORDER BY propensity_to_purchase DESC;

Common Mistakes

Mistake 1. Uncalibrated scores. Random forest scores are not calibrated out of the box; logistic regression scores are.

Mistake 2. Defaulting to a 0.5 threshold. 0.5 is only a default; optimize the threshold for business costs (false positives vs false negatives).

Mistake 3. Stale scores. Scores that are days old mislead; refresh frequently for high-velocity events.

Mistake 4. Causal misuse. Propensity scores also power matching for causal estimation, not just targeting.

Mistake 5. Scoring per user instead of per event. The same user may face multiple events; per-event scoring is more precise.
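Choosing a threshold by business cost instead of the 0.5 default can be sketched as follows; the costs and data here are invented for illustration, and the function name is not from any library.

```python
def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the threshold minimizing total misclassification cost."""
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(scores)):  # candidate thresholds = observed scores
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = fp * cost_fp + fn * cost_fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

scores = [0.1, 0.3, 0.45, 0.6, 0.9]
labels = [0,   0,   1,    1,   1]
# assume missing a churner (FN) costs 10x a wasted retention offer (FP)
t, cost = best_threshold(scores, labels, cost_fp=1, cost_fn=10)
print(t, cost)
```

With asymmetric costs the optimal cutoff typically lands well away from 0.5.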


FAQ

Which models produce propensity scores?

Logistic regression (naturally calibrated), gradient boosting, and neural networks (the latter two often need calibration).

Is calibration important?

Yes, if the score is used as a probability. For pure ranking it is less critical.

What lift should the top decile give?

Random targeting captures 10% of positives per decile (uniform). A good model captures 30-50% of positives in the top decile.

How often should scores be refreshed?

Daily for most use cases; hourly for critical ones such as fraud.

Can the score be used for matching (causal inference)?

Yes, that is the propensity score matching technique: compare treated and untreated units with the same propensity to estimate the ATE.
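A minimal sketch of nearest-neighbor propensity matching: each treated unit is matched (with replacement) to the control unit with the closest propensity score, and the average outcome difference estimates the effect on the treated. Data and names are illustrative.

```python
def psm_effect(treated, control):
    """treated/control: lists of (propensity, outcome) tuples."""
    diffs = []
    for score, outcome in treated:
        # nearest-neighbor match on propensity, with replacement
        _, matched_outcome = min(control, key=lambda c: abs(c[0] - score))
        diffs.append(outcome - matched_outcome)
    return sum(diffs) / len(diffs)

treated = [(0.52, 120.0), (0.61, 150.0), (0.80, 200.0)]
control = [(0.50, 100.0), (0.60, 130.0), (0.78, 170.0)]
print(psm_effect(treated, control))  # average uplift among matched pairs
```

Real implementations add a caliper (a maximum allowed score distance) so that poorly matched pairs are dropped rather than biasing the estimate.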