How to compute the Duplicate Rate in SQL
Why track Duplicate Rate
Duplicates are a frequent source of bugs in analytics: overcounted revenue, double-counted users, inflated metrics. The duplicate rate is the share of rows that are extra copies of an already-seen key, and tracking it gives an early warning before those bugs reach dashboards.
Basic calculation
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT user_id) AS unique_users,
COUNT(*) - COUNT(DISTINCT user_id) AS duplicate_rows,
(COUNT(*) - COUNT(DISTINCT user_id))::NUMERIC * 100 / NULLIF(COUNT(*), 0) AS duplicate_rate_pct
FROM users;
If the duplicate rate is above zero in a users table, the primary key is effectively broken.
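In practice this check is often wired into a pipeline as an assertion. A minimal sketch, assuming the same users table: the query returns a row only when duplicates exist, so a dbt test or an orchestrator task can fail on a non-empty result.
-- Returns one row only when user_id is duplicated; an empty result means the check passed.
SELECT
COUNT(*) - COUNT(DISTINCT user_id) AS duplicate_rows
FROM users
HAVING COUNT(*) - COUNT(DISTINCT user_id) > 0;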
On a composite key
Event tracking usually keys events on (user_id, event_name, event_time):
WITH dedup AS (
SELECT
COUNT(*) AS total,
COUNT(DISTINCT (user_id, event_name, event_time)) AS unique_combos
FROM events
WHERE date = CURRENT_DATE - 1
)
SELECT
total,
unique_combos,
total - unique_combos AS duplicates,
(total - unique_combos)::NUMERIC * 100 / NULLIF(total, 0) AS duplicate_pct
FROM dedup;
Finding the duplicates
SELECT
user_id, event_name, event_time,
COUNT(*) AS occurrences
FROM events
WHERE date = CURRENT_DATE - 1
GROUP BY user_id, event_name, event_time
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;
Deduplication with ROW_NUMBER
«Keep the first row, drop the rest»:
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id, event_name, event_time
ORDER BY ingested_at
) AS rn
FROM events
)
SELECT *
FROM ranked
WHERE rn = 1;
An alternative via DISTINCT ON (PostgreSQL):
SELECT DISTINCT ON (user_id, event_name, event_time) *
FROM events
ORDER BY user_id, event_name, event_time, ingested_at;
Daily duplicate rate
WITH daily AS (
SELECT
date,
COUNT(*) AS row_count,
COUNT(DISTINCT (user_id, event_name, event_time)) AS unique_combos
FROM events
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date
)
SELECT
date,
row_count,
unique_combos,
row_count - unique_combos AS dups,
(row_count - unique_combos)::NUMERIC * 100 / NULLIF(row_count, 0) AS dup_pct
FROM daily
ORDER BY date;
A sudden spike in dup_pct means it is time to check the ingest pipeline.
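To turn the daily view into an alert, keep only the days that breach a threshold. A sketch on top of the same query (the 0.1% threshold is an assumption, tune it per table):
WITH daily AS (
SELECT
date,
COUNT(*) AS row_count,
COUNT(DISTINCT (user_id, event_name, event_time)) AS unique_combos
FROM events
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date
)
SELECT
date,
(row_count - unique_combos)::NUMERIC * 100 / NULLIF(row_count, 0) AS dup_pct
FROM daily
-- 0.1 means 0.1% of rows; any returned day can trigger an alert
WHERE (row_count - unique_combos)::NUMERIC * 100 / NULLIF(row_count, 0) > 0.1
ORDER BY date;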
Common mistakes
Mistake 1. COUNT(DISTINCT) on a single column. You usually need the composite key: (user_id) alone does not identify a unique event.
Mistake 2. NULLs in the key. COUNT(DISTINCT) ignores NULLs (per the SQL standard), so rows with a NULL key column drop out of the unique count and duplicates can be missed.
Mistake 3. Edge cases. A single-row table reports «no duplicates», yet the data can still be broken: 1 row arrived where 100 were expected. Duplicate rate says nothing about completeness.
Mistake 4. Case-sensitive keys. "User@Email.com" and "user@email.com" count as different values. Normalize case before comparing (see the sketch after this list).
Mistake 5. Timestamp precision. The same event landing with slightly different millisecond timestamps will not match an exact key, while truncating too aggressively turns genuinely different events into false duplicates. Choose the truncation level deliberately.
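A sketch of a normalized dedup key covering mistakes 4 and 5: LOWER handles case (the same idea applies to email-like keys), DATE_TRUNC drops sub-second jitter; the one-second granularity is an assumption.
SELECT
COUNT(*) AS total,
COUNT(DISTINCT (user_id,
LOWER(event_name), -- mistake 4: case-insensitive comparison
DATE_TRUNC('second', event_time) -- mistake 5: ignore millisecond jitter
)) AS unique_combos
FROM events;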
Related topics
- How to compute a data quality score in SQL
- How to compute data freshness in SQL
- Data quality dimensions in a DE interview
- Cumulative distinct in a DE interview
FAQ
What duplicate rate is acceptable?
Critical tables (users, transactions): 0%. Event tables: <0.1%. Analytical marts: <1%.
Where do duplicates come from?
1) Retries without idempotency. 2) ETL re-runs. 3) Bugs in business logic. 4) Multi-source ingestion.
DISTINCT ON vs ROW_NUMBER?
DISTINCT ON is PostgreSQL-only but more concise. ROW_NUMBER works in every major dialect.
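Both patterns in the article only select the deduplicated rows. To physically delete duplicates in PostgreSQL, a common ROW_NUMBER-based sketch uses the system column ctid (if the table has a surrogate id, join on that instead):
DELETE FROM events e
USING (
SELECT
ctid,
ROW_NUMBER() OVER (
PARTITION BY user_id, event_name, event_time
ORDER BY ingested_at
) AS rn
FROM events
) d
WHERE e.ctid = d.ctid
AND d.rn > 1;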
How do I count duplicates on a composite key?
Use a row value: COUNT(DISTINCT (col1, col2, col3)), and PARTITION BY the same columns when deduplicating.
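COUNT(DISTINCT (col1, col2, col3)) relies on row-value support. In engines without it, a common fallback is concatenating the key columns with a separator that cannot occur in the data (the '|' separator is an assumption; shown in PostgreSQL syntax, swap the casts for your dialect):
SELECT
COUNT(*) AS total,
-- cast non-text columns explicitly; note that a NULL in any column makes the whole key NULL
COUNT(DISTINCT user_id::TEXT || '|' || event_name || '|' || event_time::TEXT) AS unique_combos
FROM events;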
How are NULLs handled in a composite key?
In plain equality comparisons NULL ≠ NULL, so join- or concatenation-based dedup can silently miss such rows (GROUP BY and DISTINCT, by contrast, put NULLs into one group). Use COALESCE on nullable key columns for consistent behavior.
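Continuing the concatenation fallback from the composite-key answer above, a NULL-safe sketch replaces NULLs with a sentinel (the '(none)' sentinel is an assumption and must not collide with real values):
SELECT
COUNT(*) AS total,
COUNT(DISTINCT COALESCE(user_id::TEXT, '(none)')
|| '|' || COALESCE(event_name, '(none)')
|| '|' || COALESCE(event_time::TEXT, '(none)')) AS unique_combos
FROM events;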