How to Calculate Duplicate Rate in SQL

Lock in the duplicate rate formula in Карьерник
Remember it for the long term — 5 short sessions with practice problems on this topic. Free
Practice duplicate rate in Telegram

Why track duplicate rate

Duplicates are a common source of bugs in analytics: revenue gets overcounted, users get double-counted, metrics come out inflated. Tracking the duplicate rate gives you an early-warning signal.

Basic calculation

SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT user_id) AS unique_users,
    COUNT(*) - COUNT(DISTINCT user_id) AS duplicate_rows,
    (COUNT(*) - COUNT(DISTINCT user_id))::NUMERIC * 100 / COUNT(*) AS duplicate_rate_pct
FROM users;

If duplicate_rate_pct > 0 in a users table, the primary key is missing or not enforced.
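The basic query can be sanity-checked end to end. Below is a minimal sketch run against an in-memory SQLite database from Python; the table contents are made up, and SQLite needs `* 100.0` in place of the PostgreSQL `::NUMERIC` cast:

```python
# Sketch: basic duplicate rate on a toy users table (illustrative data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (2,), (3,)])

row = conn.execute("""
    SELECT
        COUNT(*)                           AS total_rows,
        COUNT(DISTINCT user_id)            AS unique_users,
        COUNT(*) - COUNT(DISTINCT user_id) AS duplicate_rows,
        (COUNT(*) - COUNT(DISTINCT user_id)) * 100.0 / COUNT(*) AS duplicate_rate_pct
    FROM users
""").fetchone()

print(row)  # (4, 3, 1, 25.0) -- one duplicated user_id out of four rows
```

With one repeated user_id out of four rows, the rate comes out to 25% — anything above zero here means the key is not actually unique.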

Composite keys

Event tracking often keys on the combination (user_id, event_name, event_time):

WITH dedup AS (
    SELECT
        COUNT(*) AS total,
        COUNT(DISTINCT (user_id, event_name, event_time)) AS unique_combos
    FROM events
    WHERE date = CURRENT_DATE - 1
)
SELECT
    total,
    unique_combos,
    total - unique_combos AS duplicates,
    (total - unique_combos)::NUMERIC * 100 / NULLIF(total, 0) AS duplicate_pct
FROM dedup;
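The tuple form `COUNT(DISTINCT (a, b, c))` is PostgreSQL-specific. A portable equivalent counts the rows of a `SELECT DISTINCT` subquery — a sketch on SQLite with illustrative data:

```python
# Portable composite-key duplicate count: count distinct (user_id, event_name,
# event_time) combos via a subquery instead of PostgreSQL's tuple DISTINCT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_name TEXT, event_time TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "click", "10:00"),
    (1, "click", "10:00"),  # duplicate combo
    (2, "view",  "10:01"),
])

total, combos = conn.execute("""
    SELECT
        (SELECT COUNT(*) FROM events) AS total,
        (SELECT COUNT(*) FROM
            (SELECT DISTINCT user_id, event_name, event_time FROM events)) AS unique_combos
""").fetchone()

print(total, combos)  # 3 2 -> one duplicate row
```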

Finding the duplicates

SELECT
    user_id, event_name, event_time,
    COUNT(*) AS occurrences
FROM events
WHERE date = CURRENT_DATE - 1
GROUP BY user_id, event_name, event_time
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;
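The same GROUP BY ... HAVING pattern, demonstrated on a toy SQLite table (schema and rows are illustrative):

```python
# Sketch: list duplicate (user_id, event_name, event_time) groups with their counts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_name TEXT, event_time TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "click", "2024-01-01 10:00:00"),
    (1, "click", "2024-01-01 10:00:00"),  # duplicate
    (1, "click", "2024-01-01 10:00:00"),  # duplicate
    (2, "view",  "2024-01-01 10:01:00"),
])

dups = conn.execute("""
    SELECT user_id, event_name, event_time, COUNT(*) AS occurrences
    FROM events
    GROUP BY user_id, event_name, event_time
    HAVING COUNT(*) > 1
    ORDER BY occurrences DESC
""").fetchall()

print(dups)  # [(1, 'click', '2024-01-01 10:00:00', 3)]
```

Only groups that occur more than once survive the HAVING filter, so a clean table returns an empty result.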

Deduplication with ROW_NUMBER

"Keep the first row, drop the rest":

WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, event_name, event_time
            ORDER BY ingested_at
        ) AS rn
    FROM events
)
SELECT *
FROM ranked
WHERE rn = 1;
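The ROW_NUMBER() pattern above runs unchanged on SQLite 3.25+ (window-function support); a sketch with made-up rows, keeping the earliest ingested_at per group:

```python
# Sketch: ROW_NUMBER() dedup keeping the first-ingested row per composite key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events
    (user_id INTEGER, event_name TEXT, event_time TEXT, ingested_at TEXT)""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (1, "click", "10:00", "10:00:05"),
    (1, "click", "10:00", "10:00:09"),  # later ingest of the same event, dropped
    (2, "view",  "10:01", "10:01:02"),
])

kept = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, event_name, event_time
                   ORDER BY ingested_at
               ) AS rn
        FROM events
    )
    SELECT user_id, event_name, event_time, ingested_at
    FROM ranked
    WHERE rn = 1
""").fetchall()

print(sorted(kept))  # the duplicate with ingested_at = '10:00:09' is gone
```

ORDER BY ingested_at is the tie-breaker: it decides *which* copy counts as "first", so pick a column that reflects arrival order.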

An alternative via DISTINCT ON (PostgreSQL only):

SELECT DISTINCT ON (user_id, event_name, event_time) *
FROM events
ORDER BY user_id, event_name, event_time, ingested_at;

Daily duplicate rate

WITH daily AS (
    SELECT
        date,
        COUNT(*) AS total_rows,
        COUNT(DISTINCT (user_id, event_name, event_time)) AS unique_combos
    FROM events
    WHERE date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY date
)
SELECT
    date,
    total_rows,
    unique_combos,
    total_rows - unique_combos AS dups,
    (total_rows - unique_combos)::NUMERIC * 100 / NULLIF(total_rows, 0) AS dup_pct
FROM daily
ORDER BY date;

A sudden spike in dup_pct is a signal to check the ingest pipeline.
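One hedged way to act on this in a monitoring job: compare each day's rate against a threshold. The `daily_rates` data and the threshold below are illustrative stand-ins for the output of the daily query:

```python
# Sketch of a monitoring check: flag days whose duplicate rate exceeds a
# threshold. Both the threshold and the sample rates are illustrative.
THRESHOLD_PCT = 0.1  # example threshold for an event table

daily_rates = [
    ("2024-01-01", 0.02),
    ("2024-01-02", 0.03),
    ("2024-01-03", 1.80),  # spike: likely an ETL re-run or retry storm
]

alerts = [(day, pct) for day, pct in daily_rates if pct > THRESHOLD_PCT]
print(alerts)  # [('2024-01-03', 1.8)]
```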

Common mistakes

Mistake 1. COUNT(DISTINCT ...) on a single column. Events need a composite key: user_id alone does not identify a unique event.

Mistake 2. NULL in the key. COUNT(DISTINCT ...) ignores NULLs — that is standard SQL behavior — so the metric can miscount rows with NULL keys.
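A tiny SQLite demonstration of the pitfall (toy data): the NULL rows disappear from COUNT(DISTINCT ...), so `total - distinct` no longer reflects real duplicates:

```python
# Sketch: COUNT(DISTINCT ...) skips NULLs per the SQL standard, so the
# total-minus-distinct formula counts every NULL row as a "duplicate".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), (None,), (None,)])

total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT k) FROM t"
).fetchone()

print(total, distinct)  # 3 1 -- both NULL rows vanish from the distinct count
```

Here `total - distinct` reports 2 duplicates, whether or not the two NULL-keyed rows are actually the same event.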

Mistake 3. Edge cases. A single-row table reports "no duplicates", yet the pipeline may still be broken (1 row arrived where 100 were expected). Duplicate rate says nothing about missing data.

Mistake 4. Case-sensitive keys. "User@Email.com" and "user@email.com" count as different values. Normalize with LOWER() before deduplicating.

Mistake 5. Timestamp precision. Truncating timestamps too coarsely can merge genuinely distinct events into false duplicates; choose the truncation level deliberately.
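A sketch of the precision trap on SQLite (data is illustrative): two genuinely distinct events inside the same second collapse into one combo after second-level truncation:

```python
# Sketch: second-level truncation merges two distinct sub-second events,
# making them look like duplicates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_time TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2024-01-01 10:00:00.120"),
    (1, "2024-01-01 10:00:00.480"),  # a distinct event in the same second
])

full, truncated = conn.execute("""
    SELECT COUNT(DISTINCT event_time),
           COUNT(DISTINCT strftime('%Y-%m-%d %H:%M:%S', event_time))
    FROM events
""").fetchone()

print(full, truncated)  # 2 1 -- truncation fabricates a "duplicate"
```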

FAQ

What duplicate rate is acceptable?

Critical tables (users, transactions): 0%. Event tables: under 0.1%. Analytics marts: under 1%.

Where do duplicates come from?

  1. Retries without idempotency keys.
  2. ETL re-runs.
  3. Bugs in business logic.
  4. Ingestion from multiple sources.

DISTINCT ON vs ROW_NUMBER?

DISTINCT ON is PostgreSQL-only but concise; ROW_NUMBER works in every dialect.

How do I deduplicate on a composite key?

Use a row value in PostgreSQL: COUNT(DISTINCT (col1, col2, col3)). Rows with the same tuple fall into one dedup group.

How are NULLs handled in a composite key?

NULL ≠ NULL in comparisons, so NULL-keyed rows can be treated as distinct. Use COALESCE to build a consistent dedup key.
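One place this bites in practice is the concatenation fallback used in dialects without tuple COUNT(DISTINCT): any NULL part nullifies the whole concatenated key. A SQLite sketch with a made-up `'(null)'` sentinel:

```python
# Sketch: a NULL component erases the whole concatenated key, so the naive
# distinct count silently drops rows; COALESCE with a sentinel fixes it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, campaign TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "summer"), (2, None), (3, None)])

naive, safe = conn.execute("""
    SELECT
        COUNT(DISTINCT user_id || '|' || campaign),
        COUNT(DISTINCT user_id || '|' || COALESCE(campaign, '(null)'))
    FROM events
""").fetchone()

print(naive, safe)  # 1 3 -- NULL campaigns wiped two keys from the naive count
```

The sentinel must be a value that cannot occur in real data, otherwise it introduces false collisions of its own.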