Как сделать Lookalike Audience в SQL

Закрепи формулу lookalike audience в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать lookalike audience в Telegram

Зачем Lookalike Audience

Lookalike = audience похожий на successful customers. Facebook / Google Ads use embeddings. В SQL — простой similarity by feature matching.

Базовая идея

  1. Define «seed» (high-value customers)
  2. Compute их profile (features avg / distribution)
  3. Score other users по similarity
  4. Pick top X% match — lookalike audience

Define seed cohort

WITH seed AS (
    SELECT
        user_id,
        age,
        country,
        device,
        signup_source
    FROM users
    WHERE user_id IN (
        SELECT user_id FROM transactions
        WHERE status = 'paid'
        GROUP BY user_id
        HAVING SUM(amount) >= 500  -- high-value threshold
    )
)
SELECT
    age,
    country,
    AVG(amount) AS avg_revenue
FROM seed
GROUP BY age, country;

Match features

Find non-seed users matching seed profile:

WITH seed AS (
    SELECT *
    FROM users
    WHERE has_paid_premium = TRUE
),
seed_profile AS (
    SELECT
        country,
        device,
        AGE_BUCKET(age) AS age_bucket,
        signup_source,
        COUNT(*) AS seed_count
    FROM seed
    GROUP BY 1, 2, 3, 4
)
SELECT
    u.user_id,
    u.email,
    sp.seed_count AS lookalike_score
FROM users u
JOIN seed_profile sp ON sp.country = u.country
  AND sp.device = u.device
  AND sp.age_bucket = AGE_BUCKET(u.age)
  AND sp.signup_source = u.signup_source
WHERE u.user_id NOT IN (SELECT user_id FROM seed)
ORDER BY sp.seed_count DESC;
Закрепи формулу lookalike audience в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать lookalike audience в Telegram

Cosine similarity (advanced)

For higher accuracy use feature vectors:

WITH seed_centroid AS (
    SELECT
        AVG(feature_1) AS f1_mean,
        AVG(feature_2) AS f2_mean,
        AVG(feature_3) AS f3_mean
    FROM user_features
    WHERE is_seed = TRUE
)
SELECT
    u.user_id,
    -- Cosine similarity (simplified, assume normalized)
    (u.feature_1 * s.f1_mean + u.feature_2 * s.f2_mean + u.feature_3 * s.f3_mean) AS similarity_score
FROM user_features u, seed_centroid s
WHERE u.is_seed = FALSE
ORDER BY similarity_score DESC
LIMIT 10000;

Score lookalikes

WITH features AS (
    SELECT
        user_id,
        country,
        device,
        AGE_BUCKET(age) AS age_bucket
    FROM users
),
seed_probabilities AS (
    SELECT
        country,
        device,
        AGE_BUCKET(age) AS age_bucket,
        COUNT(*) FILTER (WHERE is_seed) AS seed_count,
        COUNT(*) AS total
    FROM users
    GROUP BY 1, 2, AGE_BUCKET(age)
)
SELECT
    f.user_id,
    s.seed_count::NUMERIC / NULLIF(s.total, 0) AS lookalike_likelihood
FROM features f
JOIN seed_probabilities s USING (country, device, age_bucket)
WHERE NOT EXISTS (SELECT 1 FROM users WHERE user_id = f.user_id AND is_seed)
ORDER BY lookalike_likelihood DESC;

Частые ошибки

Ошибка 1. Seed too narrow. 50 users — small base. Need 500+ для stable lookalike.

Ошибка 2. Old seed. Seed updated yearly — yes. Daily — overfit.

Ошибка 3. Feature engineering. Raw features (signed up Tue vs Wed) — noise. Use semantically meaningful.

Ошибка 4. Lookalike across countries. US whale ≠ India whale. Segment lookalike by major dimensions.

Ошибка 5. Lookalike vs targeting overlap. Already-targeted users excluded? Yes typically.

Связанные темы

FAQ

Что такое lookalike?

Audience similar к defined high-value seed. Marketing targeting tactic.

Seed size minimum?

500-1000 для stable patterns. Facebook recommends 100 минимум, lebih ok.

Lookalike в Facebook Ads vs SQL?

Facebook uses platform embeddings (way more accurate). SQL — quick / simple alternative когда нет ad platform.

Update frequency?

Monthly typical. Seasonal businesses — quarterly.

Lookalike vs retargeting?

Retargeting — already-touched users. Lookalike — similar-to-customers новые.