Как посчитать Data Freshness в SQL
Содержание:
Зачем Data Freshness
Data Freshness = lag между event happen и available в DWH. Stale data = bad decisions. SLA обычно: critical metrics — <1 hour, daily reports — <24 hours.
Формула
Freshness Lag = max(event_time) - now()или для batch jobs:
Freshness = now() - last_successful_run_timestampБазовый расчёт
Latest event vs now:
SELECT
MAX(event_time) AS latest_event,
NOW() AS now_ts,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes
FROM events;Per table:
SELECT
'events' AS table_name,
MAX(event_time) AS latest,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes
FROM events
UNION ALL
SELECT
'transactions',
MAX(created_at),
EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) / 60
FROM transactions;По pipelines
Если есть metadata table:
SELECT
pipeline_name,
last_run_timestamp,
last_successful_timestamp,
EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS minutes_since_success,
rows_processed_last_run,
status_last_run
FROM pipeline_metadata
WHERE pipeline_name LIKE 'critical_%'
ORDER BY minutes_since_success DESC;SLA monitoring
WITH freshness AS (
SELECT
pipeline_name,
EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS lag_min,
sla_minutes
FROM pipeline_metadata
)
SELECT
pipeline_name,
lag_min,
sla_minutes,
CASE
WHEN lag_min > sla_minutes * 1.5 THEN 'CRITICAL'
WHEN lag_min > sla_minutes THEN 'WARN'
ELSE 'OK'
END AS status,
lag_min / sla_minutes AS sla_ratio
FROM freshness
ORDER BY sla_ratio DESC;Historical freshness
Track freshness trend (не только current):
SELECT
DATE_TRUNC('hour', TIMESTAMP) AS hour,
pipeline_name,
AVG(lag_minutes) AS avg_lag,
MAX(lag_minutes) AS max_lag,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_minutes) AS p95_lag
FROM freshness_log
WHERE TIMESTAMP >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;Частые ошибки
Ошибка 1. Wall-clock vs event-time. Lag = ingest time − event time? Or run time − previous run? Define.
Ошибка 2. Late-arriving events. Mobile events delayed by sync. event_time stale relative to ingest.
Ошибка 3. Empty table = 0 lag. MAX(...) NULL → no rows. Don't confuse with «fresh».
Ошибка 4. Different time-zones. Source UTC, DWH local. Lag artificially huge.
Ошибка 5. Hot path vs batch. Streaming pipeline freshness in minutes. Batch — hours. Don't compare.
Связанные темы
- Как посчитать data quality score в SQL
- Как посчитать duplicate rate в SQL
- SLA / SLO / SLI на собесе SA
- Data quality dimensions на собесе DE
FAQ
Какой freshness SLA ok?
Real-time: <5 min. Near-real-time: <1 hour. Daily batch: <24 hours.
Wall-clock vs event-time?
Wall-clock — operational. Event-time — analytical. Track both.
Late events how handle?
Watermarking (Flink/Beam concept). Acknowledge late events до threshold, drop after.
Freshness vs latency?
Latency = single event. Freshness = aggregate (max lag).
Schema for freshness tracking?
pipeline_metadata(pipeline, last_run, last_success, sla, status). Append-only log + materialized view current state.