Как сделать Lookalike Audience в SQL
Содержание:
Зачем Lookalike Audience
Lookalike = audience похожий на successful customers. Facebook / Google Ads use embeddings. В SQL — простой similarity by feature matching.
Базовая идея
- Define «seed» (high-value customers)
- Compute их profile (features avg / distribution)
- Score other users по similarity
- Pick top X% match — lookalike audience
Define seed cohort
WITH seed AS (
SELECT
user_id,
age,
country,
device,
signup_source
FROM users
WHERE user_id IN (
SELECT user_id FROM transactions
WHERE status = 'paid'
GROUP BY user_id
HAVING SUM(amount) >= 500 -- high-value threshold
)
)
SELECT
age,
country,
AVG(amount) AS avg_revenue
FROM seed
GROUP BY age, country;Match features
Find non-seed users matching seed profile:
WITH seed AS (
SELECT *
FROM users
WHERE has_paid_premium = TRUE
),
seed_profile AS (
SELECT
country,
device,
AGE_BUCKET(age) AS age_bucket,
signup_source,
COUNT(*) AS seed_count
FROM seed
GROUP BY 1, 2, 3, 4
)
SELECT
u.user_id,
u.email,
sp.seed_count AS lookalike_score
FROM users u
JOIN seed_profile sp ON sp.country = u.country
AND sp.device = u.device
AND sp.age_bucket = AGE_BUCKET(u.age)
AND sp.signup_source = u.signup_source
WHERE u.user_id NOT IN (SELECT user_id FROM seed)
ORDER BY sp.seed_count DESC;Cosine similarity (advanced)
For higher accuracy use feature vectors:
WITH seed_centroid AS (
SELECT
AVG(feature_1) AS f1_mean,
AVG(feature_2) AS f2_mean,
AVG(feature_3) AS f3_mean
FROM user_features
WHERE is_seed = TRUE
)
SELECT
u.user_id,
-- Cosine similarity (simplified, assume normalized)
(u.feature_1 * s.f1_mean + u.feature_2 * s.f2_mean + u.feature_3 * s.f3_mean) AS similarity_score
FROM user_features u, seed_centroid s
WHERE u.is_seed = FALSE
ORDER BY similarity_score DESC
LIMIT 10000;Score lookalikes
WITH features AS (
SELECT
user_id,
country,
device,
AGE_BUCKET(age) AS age_bucket
FROM users
),
seed_probabilities AS (
SELECT
country,
device,
AGE_BUCKET(age) AS age_bucket,
COUNT(*) FILTER (WHERE is_seed) AS seed_count,
COUNT(*) AS total
FROM users
GROUP BY 1, 2, AGE_BUCKET(age)
)
SELECT
f.user_id,
s.seed_count::NUMERIC / NULLIF(s.total, 0) AS lookalike_likelihood
FROM features f
JOIN seed_probabilities s USING (country, device, age_bucket)
WHERE NOT EXISTS (SELECT 1 FROM users WHERE user_id = f.user_id AND is_seed)
ORDER BY lookalike_likelihood DESC;Частые ошибки
Ошибка 1. Seed too narrow. 50 users — small base. Need 500+ для stable lookalike.
Ошибка 2. Old seed. Seed updated yearly — yes. Daily — overfit.
Ошибка 3. Feature engineering. Raw features (signed up Tue vs Wed) — noise. Use semantically meaningful.
Ошибка 4. Lookalike across countries. US whale ≠ India whale. Segment lookalike by major dimensions.
Ошибка 5. Lookalike vs targeting overlap. Already-targeted users excluded? Yes typically.
Связанные темы
- Как сделать customer segments в SQL
- Как посчитать propensity score в SQL
- Как посчитать RFM segmentation в SQL
- Как посчитать LTV в SQL
FAQ
Что такое lookalike?
Audience similar к defined high-value seed. Marketing targeting tactic.
Seed size minimum?
500-1000 для stable patterns. Facebook recommends 100 минимум, lebih ok.
Lookalike в Facebook Ads vs SQL?
Facebook uses platform embeddings (way more accurate). SQL — quick / simple alternative когда нет ad platform.
Update frequency?
Monthly typical. Seasonal businesses — quarterly.
Lookalike vs retargeting?
Retargeting — already-touched users. Lookalike — similar-to-customers новые.