Как посчитать Data Freshness в SQL

Закрепи формулу data freshness в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать data freshness в Telegram

Зачем Data Freshness

Data Freshness = lag между event happen и available в DWH. Stale data = bad decisions. SLA обычно: critical metrics — <1 hour, daily reports — <24 hours.

Формула

Freshness Lag = max(event_time) - now()

или для batch jobs:

Freshness = now() - last_successful_run_timestamp

Базовый расчёт

Latest event vs now:

SELECT
    MAX(event_time) AS latest_event,
    NOW() AS now_ts,
    EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes
FROM events;

Per table:

SELECT
    'events' AS table_name,
    MAX(event_time) AS latest,
    EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes
FROM events
UNION ALL
SELECT
    'transactions',
    MAX(created_at),
    EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) / 60
FROM transactions;

По pipelines

Если есть metadata table:

SELECT
    pipeline_name,
    last_run_timestamp,
    last_successful_timestamp,
    EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS minutes_since_success,
    rows_processed_last_run,
    status_last_run
FROM pipeline_metadata
WHERE pipeline_name LIKE 'critical_%'
ORDER BY minutes_since_success DESC;
Закрепи формулу data freshness в Карьернике
Запомнить надолго — 5 коротких сессий с задачами на эту тему. Бесплатно
Тренировать data freshness в Telegram

SLA monitoring

WITH freshness AS (
    SELECT
        pipeline_name,
        EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS lag_min,
        sla_minutes
    FROM pipeline_metadata
)
SELECT
    pipeline_name,
    lag_min,
    sla_minutes,
    CASE
        WHEN lag_min > sla_minutes * 1.5 THEN 'CRITICAL'
        WHEN lag_min > sla_minutes THEN 'WARN'
        ELSE 'OK'
    END AS status,
    lag_min / sla_minutes AS sla_ratio
FROM freshness
ORDER BY sla_ratio DESC;

Historical freshness

Track freshness trend (не только current):

SELECT
    DATE_TRUNC('hour', TIMESTAMP) AS hour,
    pipeline_name,
    AVG(lag_minutes) AS avg_lag,
    MAX(lag_minutes) AS max_lag,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_minutes) AS p95_lag
FROM freshness_log
WHERE TIMESTAMP >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;

Частые ошибки

Ошибка 1. Wall-clock vs event-time. Lag = ingest time − event time? Or run time − previous run? Define.

Ошибка 2. Late-arriving events. Mobile events delayed by sync. event_time stale relative to ingest.

Ошибка 3. Empty table = 0 lag. MAX(...) NULL → no rows. Don't confuse with «fresh».

Ошибка 4. Different time-zones. Source UTC, DWH local. Lag artificially huge.

Ошибка 5. Hot path vs batch. Streaming pipeline freshness in minutes. Batch — hours. Don't compare.

Связанные темы

FAQ

Какой freshness SLA ok?

Real-time: <5 min. Near-real-time: <1 hour. Daily batch: <24 hours.

Wall-clock vs event-time?

Wall-clock — operational. Event-time — analytical. Track both.

Late events how handle?

Watermarking (Flink/Beam concept). Acknowledge late events до threshold, drop after.

Freshness vs latency?

Latency = single event. Freshness = aggregate (max lag).

Schema for freshness tracking?

pipeline_metadata(pipeline, last_run, last_success, sla, status). Append-only log + materialized view current state.