How to work with real-time data

Карьерник is a quiz trainer in Telegram with 1500+ questions for analyst interviews. SQL, Python, A/B, metrics. Free.

Why this matters

Fraud detection, live dashboards, alerting — all of it is real-time. A batch ETL workflow is too limited for these. In modern companies, real-time is becoming the standard.

At interviews in growing tech companies, real-time basics are expected.

What real-time is

Data is processed quickly after the event — within seconds or minutes, not hours.

Contrast:

Batch

  • Processed daily / hourly
  • Large chunks
  • Predictable
  • Standard warehouse

Streaming (real-time)

  • Continuous flow
  • Incremental updates
  • Low latency
  • Separate tools
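The contrast can be sketched in a few lines of Python (toy data, invented for illustration):

```python
# Same metric computed batch-style vs streaming-style (toy events)
events = [{"user": 1, "amount": 10}, {"user": 2, "amount": 5}, {"user": 1, "amount": 7}]

# Batch: wait for the full chunk, then process it once
batch_total = sum(e["amount"] for e in events)

# Streaming: update the running total as each event arrives
stream_total = 0
for e in events:
    stream_total += e["amount"]  # at any moment this is "the metric right now"

print(batch_total, stream_total)  # 22 22
```

Both end at the same number; the difference is when the answer is available.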

Use cases

Fraud detection

Transaction happens → score → block if fraudulent.

Seconds matter.

Live dashboards

Active users now. Sales live.

Alerting

Metric crosses threshold → page someone.

Personalization

Recent user actions → tailored content.

Operational

Monitoring, outage detection.

Architecture

Producers

Apps, sensors — send events.

Streaming platform

  • Kafka — most common
  • Pulsar — alternative
  • Kinesis (AWS)
  • Pub/Sub (GCP)
  • Event Hubs (Azure)

Processors

  • Flink
  • Spark Streaming
  • KSQL / ksqlDB
  • Beam

Storage

  • ClickHouse — real-time OLAP
  • Druid
  • Pinot
  • TimescaleDB
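A minimal in-memory sketch of the producer → platform → processor chain, with `queue.Queue` standing in for a Kafka topic (all names here are illustrative):

```python
import queue

topic = queue.Queue()  # stand-in for a Kafka topic

def produce(events):
    """Producer: an app pushes events onto the platform."""
    for e in events:
        topic.put(e)

def process():
    """Processor: consumes events and keeps a running aggregate (its state)."""
    counts = {}
    while not topic.empty():
        e = topic.get()
        counts[e["event_name"]] = counts.get(e["event_name"], 0) + 1
    return counts

produce([{"event_name": "click"}, {"event_name": "view"}, {"event_name": "click"}])
counts = process()
print(counts)  # {'click': 2, 'view': 1}
```

In a real system the storage step follows: the processor writes its aggregates into something like ClickHouse.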

For the analyst

Typically:

Consume data

Don't build streams. Use existing.

Query real-time DB

ClickHouse ingests data in real time — and you query it with standard SQL.

Dashboards

Tools supporting streaming (Grafana, internal).

Alerts

Configure queries → notify on conditions.

ClickHouse real-time

Streaming inserts

-- Kafka table engine: ClickHouse reads directly from the topic
CREATE TABLE events_kafka (
    user_id UInt64,
    event_name String,
    event_time DateTime
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_events',
    kafka_format = 'JSONEachRow';

-- Materialized view — consume into a MergeTree table
CREATE MATERIALIZED VIEW events_mv
TO events_store
AS SELECT * FROM events_kafka;

Events arrive → get stored → become queryable.

Query

SELECT COUNT(*), uniqExact(user_id)
FROM events_store
WHERE event_time >= now() - INTERVAL 5 MINUTE;

«Active users over the last 5 minutes» becomes possible.

Metrics over a sliding window

SELECT
    toStartOfMinute(event_time) AS minute,
    COUNT(*) AS events
FROM events_store
WHERE event_time >= now() - INTERVAL 60 MINUTE
GROUP BY minute
ORDER BY minute;

Last 60 minutes, broken down by minute.
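The same per-minute rollup, mirrored in Python over hypothetical timestamps, shows what `toStartOfMinute` is doing:

```python
from collections import Counter
from datetime import datetime

event_times = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
]

# Truncate each timestamp to the start of its minute, then count per bucket
per_minute = Counter(t.replace(second=0, microsecond=0) for t in event_times)
for minute, n in sorted(per_minute.items()):
    print(minute, n)
# 2024-01-01 12:00:00 2
# 2024-01-01 12:01:00 1
```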

Alerts

Via Grafana / similar

A SQL query runs every minute.

If conditions → Slack / email / PagerDuty.

Via a custom script

import time

while True:
    result = query_clickhouse("SELECT COUNT(*) ...")  # hypothetical query helper
    if result > threshold:
        send_alert(...)
    time.sleep(60)  # re-check once a minute

This doesn't scale well. Dedicated tools are better.
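If a custom script is unavoidable, at least add a cooldown so one sustained spike doesn't page someone every minute. A sketch — thresholds and timings are invented:

```python
COOLDOWN = 15 * 60          # seconds between repeat alerts
last_alert = float("-inf")  # time the last alert was sent

def check_once(value, threshold, now):
    """Fire at most one alert per cooldown window."""
    global last_alert
    if value > threshold and now - last_alert >= COOLDOWN:
        last_alert = now
        return True
    return False

# Simulated checks: fires, stays silent during cooldown, fires again after
results = [
    check_once(120, threshold=100, now=0),
    check_once(130, threshold=100, now=60),
    check_once(140, threshold=100, now=16 * 60),
]
print(results)  # [True, False, True]
```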

Anomaly detection

Separate article.

Real-time dashboards

Live

Refreshing every 10s / 1min.

Tools:

  • Grafana
  • Tableau Server (live connections)
  • Custom React + WebSocket

Streaming visualization

Continuous chart updates.

Challenges

  • Balancing refresh rate against query load
  • Consistency (the current window has partial data)
  • Storage (real-time tables are kept smaller)

Problems with real-time

Data accuracy

Late events arrive with a delay and distort running totals.

Deduplication

An event can be emitted twice (retries). It must be counted once.
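Assuming each event carries a unique `event_id` (the usual prerequisite), a first-occurrence filter is the simplest fix — streaming engines maintain the equivalent `seen` state for you:

```python
def dedupe(events):
    """Keep only the first occurrence of each event_id (retries cause duplicates)."""
    seen = set()
    unique = []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            unique.append(e)
    return unique

events = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]  # "a" was retried
deduped = dedupe(events)
print(len(deduped))  # 2
```

In production the `seen` set can't grow forever; real systems bound it by a time window.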

State management

Running aggregates — need state.

Ordering

Events arrive out of order because of network delays. This has to be handled.
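One common approach is to buffer events briefly and emit them sorted once a watermark passes — a much-simplified version of what Flink and Spark do (parameters invented):

```python
def reorder(events, max_lateness):
    """events: (event_time, payload) pairs in arrival order; emit in event-time order."""
    emitted, buffer = [], []
    max_seen = float("-inf")
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        buffer.append((event_time, payload))
        # Everything older than (newest time seen - max_lateness) is safe to emit
        watermark = max_seen - max_lateness
        buffer.sort()
        while buffer and buffer[0][0] <= watermark:
            emitted.append(buffer.pop(0))
    emitted.extend(sorted(buffer))  # flush what's left at end of stream
    return emitted

arrived = [(3, "c"), (1, "a"), (2, "b")]  # event times, out of arrival order
print([p for _, p in reorder(arrived, max_lateness=2)])  # ['a', 'b', 'c']
```

The trade-off: a larger `max_lateness` tolerates later events but delays every result.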

When batch is enough

  • Daily reports
  • Weekly cohorts
  • Monthly financial close
  • Not time-sensitive

Batch is simpler and cheaper. Use it when it fits.

When real-time is needed

  • Fraud / abuse
  • Operational monitoring
  • User-facing (personalization)
  • Business KPIs with minute-level latency

Lambda architecture

Combine both:

  • Batch layer: historical, precise
  • Speed layer: real-time, approximate
  • Serve: merge results

Increasingly replaced by Kappa (streaming only).
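In miniature, the merge at the serving layer can look like this (tables and numbers are invented):

```python
from datetime import date

batch_layer = {date(2024, 1, 1): 1000, date(2024, 1, 2): 1200}  # historical, precise
speed_layer = {date(2024, 1, 3): 87}                            # today, approximate

def serve(day):
    """Batch wins where it exists; the speed layer fills the fresh tail."""
    if day in batch_layer:
        return batch_layer[day]
    return speed_layer.get(day, 0)

print(serve(date(2024, 1, 2)), serve(date(2024, 1, 3)))  # 1200 87
```

The operational pain of Lambda is maintaining the same logic twice — one reason Kappa's single streaming path is winning.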

For the interview

Not expected to build

Analysts typically don't build streaming platforms.

Expected understand

  • Batch vs streaming tradeoffs
  • Basic architecture
  • Use cases
  • Tools (Kafka, ClickHouse, etc.)

«Have you worked with real-time data?» — «Yes: I queried real-time ClickHouse for dashboards and set up alerts».

Specific companies

Yandex

Real-time everywhere (search, ads, delivery).

Tinkoff

Transaction monitoring, fraud.

Ozon / WB

Inventory, pricing.

Smaller companies

Often batch-only. Real-time is expensive.

Learning

Basic

Understand Kafka concepts. «Topics», «producers», «consumers».

Intermediate

Query real-time OLAP (ClickHouse), set up alerts.

Advanced

Build streaming pipeline (Flink, Spark).

The last tier is usually data-engineer scope.

Performance tips

Query efficiently

Same tips as for batch SQL: indexes, partition pruning.

Time window limits

Don't query an unbounded history in a real-time table — it's slow.

Pre-aggregate

Materialized views for minute/hour aggregations.

Subsample

For exploration.

Ethics

Real-time has surveillance potential. Consider privacy.

Data retention

How long to keep the data? Privacy laws set limits.

Access

Restrict access to sensitive real-time data.

At the interview

«Real-time experience?»

Be specific:

  • Queried ClickHouse real-time tables
  • Set up alerts on metric changes
  • Built dashboards on live data

Be honest, don't exaggerate.

«Batch vs streaming?»

Trade-offs: latency, cost, complexity, accuracy.

«How does live fraud detection work?»

Event → feature extraction → model → decision. Milliseconds.
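The chain can be sketched end to end; the features, weights, and threshold below are invented for illustration, not a real model:

```python
def extract_features(txn):
    """Turn a raw transaction into model features."""
    return {
        "amount": txn["amount"],
        "is_new_device": txn["device_id"] not in txn["known_devices"],
    }

def score(features):
    # Stand-in for a real model: a large amount on a new device looks risky
    risk = 0.0
    if features["amount"] > 1000:
        risk += 0.5
    if features["is_new_device"]:
        risk += 0.4
    return risk

def decide(txn, threshold=0.7):
    return "block" if score(extract_features(txn)) >= threshold else "allow"

txn = {"amount": 5000, "device_id": "d9", "known_devices": {"d1", "d2"}}
print(decide(txn))  # block
```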

FAQ

Alternatives to Kafka?

Pulsar, Kinesis, Pub/Sub.

Do analysts build real-time systems?

Usually not. They understand and consume them.

What does real-time cost?

Much more than batch. Worth it only when needed.


Practice — open the trainer with 1500+ interview questions.