Как посчитать Propensity Score в SQL
Содержание:
Зачем Propensity Score
Propensity Score = probability event happens (purchase, churn, click). Model output. Использует для: 1) Lift в targeting (top deciles → email). 2) Matching в causal analysis. 3) Risk scoring.
Базовый расчёт
Score обычно generated ML model (out of scope SQL). В SQL: query results:
SELECT
user_id,
propensity_to_purchase,
propensity_to_churn,
-- Bucketed
CASE
WHEN propensity_to_churn >= 0.8 THEN 'high-risk'
WHEN propensity_to_churn >= 0.5 THEN 'medium-risk'
WHEN propensity_to_churn >= 0.2 THEN 'low-risk'
ELSE 'safe'
END AS churn_risk_segment
FROM user_scores
WHERE scoring_date = CURRENT_DATE;Calibration
Calibrated model: predicted prob 0.8 = 80% actually positive.
WITH buckets AS (
SELECT
WIDTH_BUCKET(propensity_score, 0, 1, 10) AS decile,
AVG(propensity_score) AS avg_predicted,
AVG(CASE WHEN actual = 1 THEN 1.0 ELSE 0 END) AS actual_rate,
COUNT(*) AS n
FROM scored_predictions
WHERE scoring_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
)
SELECT
decile,
n,
avg_predicted,
actual_rate,
actual_rate - avg_predicted AS calibration_gap
FROM buckets
ORDER BY decile;If gap значительный → model miscalibrated, нужна изotropic regression или Platt scaling.
Top deciles
SELECT
NTILE(10) OVER (ORDER BY propensity_score DESC) AS decile,
COUNT(*) AS users,
AVG(propensity_score) AS avg_score,
SUM(actual_purchases) AS total_purchases,
SUM(actual_purchases)::NUMERIC * 100 / SUM(SUM(actual_purchases)) OVER () AS purchase_share_pct
FROM (
SELECT user_id, propensity_score, actual_purchases
FROM user_scores
WHERE scoring_date = CURRENT_DATE
) t
GROUP BY decile
ORDER BY decile;Top decile часто 20-30% of all positives — efficient targeting.
Lift chart
WITH ordered AS (
SELECT
propensity_score,
actual,
ROW_NUMBER() OVER (ORDER BY propensity_score DESC) AS rn,
COUNT(*) OVER () AS total
FROM scored_users
)
SELECT
rn::NUMERIC / total AS pct_of_population_targeted,
SUM(CASE WHEN actual = 1 THEN 1 ELSE 0 END) OVER (ORDER BY rn)::NUMERIC
/ NULLIF(SUM(CASE WHEN actual = 1 THEN 1 ELSE 0 END) OVER (), 0) AS pct_positives_captured
FROM ordered
WHERE rn IN (
SELECT total * x / 10 FROM ordered, UNNEST(ARRAY[1,2,3,4,5,6,7,8,9,10]) AS x
GROUP BY total, x
)
ORDER BY rn;Targeting
Top X% propensity → marketing list:
SELECT
user_id,
email,
propensity_to_purchase,
NTILE(100) OVER (ORDER BY propensity_to_purchase DESC) AS percentile
FROM user_scores
WHERE NTILE(100) OVER (ORDER BY propensity_to_purchase DESC) <= 10 -- top 10%
ORDER BY propensity_to_purchase DESC;Частые ошибки
Ошибка 1. Uncalibrated scores. Random Forest scores не calibrated. Logistic regression — yes.
Ошибка 2. Threshold = 0.5. 0.5 default. Optimize for business cost (FP vs FN).
Ошибка 3. Stale scores. Scores days old. Refresh frequently для high-velocity events.
Ошибка 4. Causal misuse. Propensity Score → matching for causal estimation. Not just targeting.
Ошибка 5. Score per user vs event. Same user, multi events. Per-event scoring more precise.
Связанные темы
- Как посчитать F1 score в SQL
- Как посчитать AUC ROC в SQL
- Как посчитать uplift в SQL
- Churn prediction
FAQ
Какие models дают propensity?
Logistic regression (calibrated naturally). Gradient boosting, neural nets (often need calibration).
Calibration важна?
Yes если score используется как probability. Для ranking — less critical.
Top decile lift?
Random = 10% (uniform). Good model: 30-50% positives in top decile.
Score refresh frequency?
Daily для most use cases. Hourly для critical (fraud).
Score for matching (causal)?
Yes — propensity matching technique. Same propensity treated / not treated for ATE estimation.