Как посчитать Data Quality Score в SQL

Закрепи формулу data quality score в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать data quality score в Telegram

Зачем Data Quality Score

«Garbage in, garbage out». Без DQ tracking insights могут быть completely wrong. DQ score — single number — easy для monitoring trends.

6 dimensions

  1. Completeness — % non-null
  2. Uniqueness — no duplicates на key
  3. Validity — values within constraints
  4. Consistency — same value across systems
  5. Accuracy — matches real-world
  6. Timeliness — fresh data

Completeness

SELECT
    'user_email' AS column_name,
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE user_email IS NOT NULL) AS non_null,
    COUNT(*) FILTER (WHERE user_email IS NOT NULL)::NUMERIC * 100 / COUNT(*) AS completeness_pct
FROM users
UNION ALL
SELECT 'created_at', COUNT(*), COUNT(*) FILTER (WHERE created_at IS NOT NULL),
    COUNT(*) FILTER (WHERE created_at IS NOT NULL)::NUMERIC * 100 / COUNT(*)
FROM users;

Uniqueness

WITH dup_stats AS (
    SELECT
        user_id,
        COUNT(*) AS occurrences
    FROM users
    GROUP BY user_id
)
SELECT
    COUNT(*) AS total_unique_keys,
    COUNT(*) FILTER (WHERE occurrences > 1) AS duplicate_keys,
    COUNT(*) FILTER (WHERE occurrences = 1)::NUMERIC * 100 / COUNT(*) AS uniqueness_pct
FROM dup_stats;

Validity

Rule-based check (email format, range constraints):

SELECT
    -- Email valid format
    COUNT(*) FILTER (WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')::NUMERIC * 100 / COUNT(*) AS email_valid_pct,
    -- Age valid range
    COUNT(*) FILTER (WHERE age BETWEEN 0 AND 120)::NUMERIC * 100 / COUNT(*) AS age_valid_pct,
    -- Country in allowed list
    COUNT(*) FILTER (WHERE country IN ('US', 'CA', 'UK', 'RU', 'DE'))::NUMERIC * 100 / COUNT(*) AS country_valid_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';
Закрепи формулу data quality score в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать data quality score в Telegram

Composite Score

WITH metrics AS (
    SELECT
        'users' AS table_name,
        (SELECT COUNT(*) FILTER (WHERE user_id IS NOT NULL)::NUMERIC * 100 / COUNT(*) FROM users) AS completeness,
        (SELECT COUNT(DISTINCT user_id)::NUMERIC * 100 / COUNT(*) FROM users) AS uniqueness,
        (SELECT COUNT(*) FILTER (WHERE email ~ '^.+@.+\..+$')::NUMERIC * 100 / COUNT(*) FROM users) AS validity
)
SELECT
    table_name,
    completeness,
    uniqueness,
    validity,
    (completeness + uniqueness + validity) / 3 AS overall_score
FROM metrics;

Weighted score (some dimensions matter more):

Score = 0.4 × completeness + 0.3 × uniqueness + 0.2 × validity + 0.1 × timeliness

Daily monitoring

SELECT
    snapshot_date,
    table_name,
    completeness_pct,
    uniqueness_pct,
    validity_pct,
    (completeness_pct + uniqueness_pct + validity_pct) / 3 AS overall_dq
FROM dq_snapshots
WHERE snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY snapshot_date DESC, table_name;

Sudden drop → investigate ingest.

Частые ошибки

Ошибка 1. Single metric not enough. Completeness 99%, но nulls в critical column → still broken.

Ошибка 2. Validity rules drift. «valid country» list changes. Re-evaluate periodically.

Ошибка 3. Stale baseline. Compare with last week, not 2 years ago. Drift normal.

Ошибка 4. Same weights для всех tables. Critical PII (users) — high weights. Analytics events — lower.

Ошибка 5. DQ score 100% suspicious. Real data has issues. 100% suggests checks weak.

Связанные темы

FAQ

6 dimensions vs 4?

Original Wang & Strong — 6. Some org consolidate to 4-5. Pick whatever your team trusts.

Weights устанавливать как?

Business impact. Critical column down 10% — major. Optional column down 10% — minor.

DQ tools?

Great Expectations, dbt tests, Soda. SQL-based DQ — possible но maintenance.

Daily snapshot vs real-time?

Daily — manageable. Real-time — needed для critical pipelines (fraud, billing).

Acceptable threshold?

Critical: 99%+. Operational: 95%+. Analytics: 90%+. Anything <80% — fix.