Как посчитать Data Quality Score в SQL
Содержание:
Зачем Data Quality Score
«Garbage in, garbage out». Без DQ tracking insights могут быть completely wrong. DQ score — single number — easy для monitoring trends.
6 dimensions
- Completeness — % non-null
- Uniqueness — no duplicates на key
- Validity — values within constraints
- Consistency — same value across systems
- Accuracy — matches real-world
- Timeliness — fresh data
Completeness
SELECT
'user_email' AS column_name,
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE user_email IS NOT NULL) AS non_null,
COUNT(*) FILTER (WHERE user_email IS NOT NULL)::NUMERIC * 100 / COUNT(*) AS completeness_pct
FROM users
UNION ALL
SELECT 'created_at', COUNT(*), COUNT(*) FILTER (WHERE created_at IS NOT NULL),
COUNT(*) FILTER (WHERE created_at IS NOT NULL)::NUMERIC * 100 / COUNT(*)
FROM users;Uniqueness
WITH dup_stats AS (
SELECT
user_id,
COUNT(*) AS occurrences
FROM users
GROUP BY user_id
)
SELECT
COUNT(*) AS total_unique_keys,
COUNT(*) FILTER (WHERE occurrences > 1) AS duplicate_keys,
COUNT(*) FILTER (WHERE occurrences = 1)::NUMERIC * 100 / COUNT(*) AS uniqueness_pct
FROM dup_stats;Validity
Rule-based check (email format, range constraints):
SELECT
-- Email valid format
COUNT(*) FILTER (WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')::NUMERIC * 100 / COUNT(*) AS email_valid_pct,
-- Age valid range
COUNT(*) FILTER (WHERE age BETWEEN 0 AND 120)::NUMERIC * 100 / COUNT(*) AS age_valid_pct,
-- Country in allowed list
COUNT(*) FILTER (WHERE country IN ('US', 'CA', 'UK', 'RU', 'DE'))::NUMERIC * 100 / COUNT(*) AS country_valid_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';Composite Score
WITH metrics AS (
SELECT
'users' AS table_name,
(SELECT COUNT(*) FILTER (WHERE user_id IS NOT NULL)::NUMERIC * 100 / COUNT(*) FROM users) AS completeness,
(SELECT COUNT(DISTINCT user_id)::NUMERIC * 100 / COUNT(*) FROM users) AS uniqueness,
(SELECT COUNT(*) FILTER (WHERE email ~ '^.+@.+\..+$')::NUMERIC * 100 / COUNT(*) FROM users) AS validity
)
SELECT
table_name,
completeness,
uniqueness,
validity,
(completeness + uniqueness + validity) / 3 AS overall_score
FROM metrics;Weighted score (some dimensions matter more):
Score = 0.4 × completeness + 0.3 × uniqueness + 0.2 × validity + 0.1 × timelinessDaily monitoring
SELECT
snapshot_date,
table_name,
completeness_pct,
uniqueness_pct,
validity_pct,
(completeness_pct + uniqueness_pct + validity_pct) / 3 AS overall_dq
FROM dq_snapshots
WHERE snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY snapshot_date DESC, table_name;Sudden drop → investigate ingest.
Частые ошибки
Ошибка 1. Single metric not enough. Completeness 99%, но nulls в critical column → still broken.
Ошибка 2. Validity rules drift. «valid country» list changes. Re-evaluate periodically.
Ошибка 3. Stale baseline. Compare with last week, not 2 years ago. Drift normal.
Ошибка 4. Same weights для всех tables. Critical PII (users) — high weights. Analytics events — lower.
Ошибка 5. DQ score 100% suspicious. Real data has issues. 100% suggests checks weak.
Связанные темы
- Как посчитать data freshness в SQL
- Как посчитать duplicate rate в SQL
- Data quality dimensions на собесе DE
- Schema evolution на собесе DE
FAQ
6 dimensions vs 4?
Original Wang & Strong — 6. Some org consolidate to 4-5. Pick whatever your team trusts.
Weights устанавливать как?
Business impact. Critical column down 10% — major. Optional column down 10% — minor.
DQ tools?
Great Expectations, dbt tests, Soda. SQL-based DQ — possible но maintenance.
Daily snapshot vs real-time?
Daily — manageable. Real-time — needed для critical pipelines (fraud, billing).
Acceptable threshold?
Critical: 99%+. Operational: 95%+. Analytics: 90%+. Anything <80% — fix.