SQL для сегментации пользователей

Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.

Зачем это знать

Сегментация users — базовая задача marketing / product. Кого retarget, кому email, кому premium offer. Аналитик без segmentation skills — ограничен.

На собесах в retail / e-commerce / subscription businesses segmentation — bread and butter.

Типы сегментации

1. Demographic

Age, gender, location, company size (B2B).

2. Behavioral

Actions, frequency, recency. Most actionable.

3. RFM

Recency, Frequency, Monetary — classic для e-commerce.

4. Psychographic

Interests, lifestyle. Harder в quantify.

5. Tier-based

Free / Basic / Premium. Straightforward.

RFM в SQL

WITH rfm_raw AS (
    SELECT
        user_id,
        MAX(created_at) AS last_purchase,
        COUNT(*) AS frequency,
        SUM(amount) AS monetary
    FROM orders
    WHERE created_at >= CURRENT_DATE - 365
    GROUP BY user_id
),
rfm_scores AS (
    SELECT
        user_id,
        NTILE(5) OVER (ORDER BY last_purchase DESC) AS r_score,
        NTILE(5) OVER (ORDER BY frequency DESC) AS f_score,
        NTILE(5) OVER (ORDER BY monetary DESC) AS m_score
    FROM rfm_raw
)
SELECT
    user_id,
    r_score, f_score, m_score,
    r_score * 100 + f_score * 10 + m_score AS rfm_code,
    CASE
        WHEN r_score = 5 AND f_score = 5 AND m_score = 5 THEN 'Champions'
        WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Loyal'
        WHEN r_score = 5 AND f_score <= 2 THEN 'New Customers'
        WHEN r_score <= 2 AND f_score >= 4 THEN 'At Risk'
        WHEN r_score = 1 THEN 'Churned'
        ELSE 'Normal'
    END AS segment
FROM rfm_scores;

Behavioral

Power user

SELECT
    user_id,
    CASE
        WHEN sessions_30d >= 20 AND minutes_30d >= 300 THEN 'Power'
        WHEN sessions_30d >= 10 THEN 'Regular'
        WHEN sessions_30d >= 3 THEN 'Occasional'
        ELSE 'Casual'
    END AS segment
FROM (
    SELECT
        user_id,
        COUNT(*) AS sessions_30d,
        SUM(duration_min) AS minutes_30d
    FROM sessions
    WHERE created_at >= CURRENT_DATE - 30
    GROUP BY user_id
) s;

Engagement tiers

Percentile-based:

SELECT
    user_id,
    NTILE(4) OVER (ORDER BY engagement_score DESC) AS tier
FROM (
    SELECT user_id, SUM(actions_count) AS engagement_score
    FROM events
    WHERE created_at >= CURRENT_DATE - 30
    GROUP BY user_id
) s;

1 = top 25%, 4 = bottom 25%.

New vs returning

SELECT
    user_id,
    CASE
        WHEN DATE_DIFF('day', signup_at, CURRENT_DATE) <= 30 THEN 'New'
        WHEN last_active >= CURRENT_DATE - 30 THEN 'Active Returning'
        WHEN last_active >= CURRENT_DATE - 90 THEN 'At Risk'
        ELSE 'Churned'
    END AS lifecycle_stage
FROM users;

Premium / Free

SELECT
    user_id,
    CASE
        WHEN current_plan = 'premium' THEN 'Premium'
        WHEN free_trial_active THEN 'Trial'
        WHEN premium_history THEN 'Former Premium'
        ELSE 'Free'
    END AS subscription_segment
FROM users;

Combined segments

WITH segments AS (
    SELECT
        user_id,
        rfm_segment,
        behavior_segment,
        lifecycle_stage
    FROM segmentation_combined
)
SELECT
    rfm_segment,
    lifecycle_stage,
    COUNT(*) AS users,
    AVG(ltv) AS avg_ltv
FROM segments
GROUP BY rfm_segment, lifecycle_stage
ORDER BY 1, 2;

Crosstab shows interactions.

Clustering через SQL (approximation)

Simple K-means-like через NTILE:

WITH features AS (
    SELECT
        user_id,
        NTILE(3) OVER (ORDER BY recency DESC) AS r,
        NTILE(3) OVER (ORDER BY frequency DESC) AS f,
        NTILE(3) OVER (ORDER BY monetary DESC) AS m
    FROM rfm
)
SELECT
    r, f, m,
    COUNT(*) AS segment_size,
    AVG(ltv) AS avg_ltv
FROM features
GROUP BY r, f, m
ORDER BY avg_ltv DESC;

27 «clusters». Больше — harder.

Для proper clustering — use Python K-means.

Propensity scoring

Score по model → segment by threshold:

SELECT
    user_id,
    churn_score,
    CASE
        WHEN churn_score >= 0.8 THEN 'High Risk'
        WHEN churn_score >= 0.5 THEN 'Medium Risk'
        ELSE 'Low Risk'
    END AS churn_segment
FROM users_with_scores;

Churn_score from ML model loaded.

Actionable segments

Каждый segment → action:

  • Power users: increase engagement, referrals, case studies
  • New customers: welcome journey, first-purchase discount
  • At Risk: re-engagement email, win-back offer
  • Churned: major discount, product updates

Validation

Business sense

Segments должны быть actionable, distinct, stable.

Size distribution

5% power, 20% regular, 40% casual, 35% occasional — reasonable.

10% super-power, 0.1% at risk — might be mis-defined.

Behavior differs

Segments должны differ in KPIs significantly.

Частые ошибки

Too many segments

20 segments → ops team confused.

Keep 5-7 max.

Static definitions

Segments меняются over time. Refresh periodically.

Single dimension

«High LTV» only → miss engagement. Combine dimensions.

No business adoption

Best segmentation — none if не used. Align with marketing / CS.

На собесе

«Как бы сегментировали users для [product]?»

Walk through:

  1. Business goal (retention? LTV?)
  2. Features relevant
  3. Method (rule-based vs data-driven)
  4. Sizes
  5. Actions

Show product thinking.

Связанные темы

FAQ

SQL или Python?

SQL для простых. Python для ML-clustering.

Как часто обновлять?

Weekly / monthly. Слишком often — unstable для marketing workflows.

Overlap ok?

Hierarchical (tier) — no. Cross-cutting (power × premium) — often yes.


Тренируйте SQL — откройте тренажёр с 1500+ вопросами для собесов.