How to work with real-time data
Карьерник is a Telegram quiz trainer with 1500+ interview questions for analyst roles: SQL, Python, A/B tests, metrics. Free.
Why this matters
Fraud detection, live dashboards, alerting: these are all real-time problems where a batch ETL workflow is too slow. In modern companies, real-time is becoming the standard.
At interviews in growing tech companies, real-time basics are expected.
What real-time means
Data is processed shortly after the event occurs: within seconds or minutes, not hours.
Contrast:
Batch
- Processed daily / hourly
- Large chunks of data at once
- Predictable load and cost
- Standard warehouse tooling
Streaming (real-time)
- Continuous flow of events
- Incremental updates
- Low latency
- Separate tooling
Use cases
Fraud detection
Transaction happens → score → block if fraudulent.
Seconds matter.
Live dashboards
Active users right now. Live sales numbers.
Alerting
Metric crosses threshold → page someone.
Personalization
Recent user actions → tailored content.
Operational
Monitoring, outage detection.
Architecture
Producers
Apps, sensors — send events.
Streaming platform
- Kafka — most common
- Pulsar — alternative
- Kinesis (AWS)
- Pub/Sub (GCP)
- Event Hubs (Azure)
Processors
- Flink
- Spark Streaming
- KSQL / ksqlDB
- Beam
Storage
- ClickHouse — real-time OLAP
- Druid
- Pinot
- TimescaleDB
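The producer → platform → processor chain above can be sketched in-memory; this is a toy stand-in for Kafka (all names are illustrative, a real topic is a distributed, persistent log, not a local queue):

```python
import queue

# Toy stand-in for a streaming platform: a "topic" is just an in-memory queue
topic = queue.Queue()

def produce(events):
    """Producer: apps / sensors push events onto the topic."""
    for e in events:
        topic.put(e)

def consume_and_count():
    """Processor: drain the topic, keep a running count per event name."""
    counts = {}
    while not topic.empty():
        event = topic.get()
        counts[event["event_name"]] = counts.get(event["event_name"], 0) + 1
    return counts

produce([{"event_name": "click"}, {"event_name": "view"}, {"event_name": "click"}])
print(consume_and_count())  # {'click': 2, 'view': 1}
```

The storage layer (ClickHouse, Druid, Pinot) would sit after the processor, persisting what the counts dict only holds in memory here.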
For the analyst
Typically:
Consume data
You don't build the streams; you use existing ones.
Query the real-time DB
ClickHouse ingests data in real time; query it with standard SQL.
Dashboards
Use tools that support streaming (Grafana, internal tooling).
Alerts
Configure queries that send notifications when conditions are met.
Real-time in ClickHouse
Insert streaming data

-- Kafka table engine reads events from a topic
CREATE TABLE events_kafka (
    user_id UInt64,
    event_name String,
    TIMESTAMP DateTime
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_events',
    kafka_format = 'JSONEachRow';

-- Target MergeTree table the materialized view writes into
CREATE TABLE events_store (
    user_id UInt64,
    event_name String,
    TIMESTAMP DateTime
) ENGINE = MergeTree()
ORDER BY TIMESTAMP;

-- Materialized view: consume from Kafka into MergeTree
CREATE MATERIALIZED VIEW events_mv
TO events_store
AS SELECT * FROM events_kafka;

Events arrive → get stored → become queryable.
Query

SELECT COUNT(*), uniqExact(user_id)
FROM events_store
WHERE TIMESTAMP >= now() - INTERVAL 5 MINUTE;

Queries like «unique users in the last 5 minutes» become possible.
Metrics over a sliding window

SELECT
    toStartOfMinute(TIMESTAMP) AS minute,
    COUNT(*) AS events
FROM events_store
WHERE TIMESTAMP >= now() - INTERVAL 60 MINUTE
GROUP BY minute
ORDER BY minute;

Events per minute over the last 60 minutes.
Alerts
Via Grafana or similar tools
A SQL query runs every minute.
If the condition fires → Slack / email / PagerDuty.
Via a custom script

# Naive polling loop; query_clickhouse and send_alert are your own helpers
while True:
    result = query_clickhouse("SELECT COUNT(*) ...")
    if result > threshold:
        send_alert(...)
    time.sleep(60)

This doesn't scale well; dedicated alerting tools are better.
Anomaly detection
Separate article.
Real-time dashboards
Live
Refreshing every 10s / 1min.
Tools:
- Grafana
- Tableau Server (live connections)
- Custom React + WebSocket
Streaming visualization
Continuous chart updates.
Challenges
- Balancing refresh rate against database load
- Consistency (partially arrived data)
- Storage (real-time tables are kept smaller)
Real-time problems
Data accuracy
Late events arrive with a delay and distort running totals.
Deduplication
An event can be emitted twice (producer retries). It must be counted once.
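Deduplication is usually keyed on an event id; a minimal sketch (the `event_id` field is an assumption about the event schema):

```python
seen_ids = set()

def process_once(event):
    """Count an event only the first time its event_id is seen."""
    if event["event_id"] in seen_ids:
        return False  # duplicate from a producer retry: skip it
    seen_ids.add(event["event_id"])
    return True

events = [{"event_id": 1}, {"event_id": 2}, {"event_id": 1}]  # id 1 retried
unique = sum(process_once(e) for e in events)
print(unique)  # 2
```

In production the seen-set must be bounded (a time window or TTL), otherwise it grows without limit.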
State management
Running aggregates require keeping state between events.
Ordering
Events arrive out of order due to network delays; this has to be handled.
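Out-of-order arrival is commonly handled with a reordering buffer and a watermark: events are held back, then emitted in timestamp order once the watermark passes them. A toy sketch (real stream processors like Flink manage this for you):

```python
import heapq

buffer = []  # min-heap ordered by event timestamp

def accept(ts, payload):
    """Buffer an event regardless of arrival order."""
    heapq.heappush(buffer, (ts, payload))

def emit_up_to(watermark):
    """Emit buffered events with ts <= watermark, in timestamp order."""
    out = []
    while buffer and buffer[0][0] <= watermark:
        out.append(heapq.heappop(buffer))
    return out

accept(10, "a"); accept(8, "b"); accept(12, "c")  # arrived out of order
print(emit_up_to(11))  # [(8, 'b'), (10, 'a')]; ts=12 waits for the watermark
```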
When batch is enough
- Daily reports
- Weekly cohorts
- Monthly financial close
- Anything not time-sensitive
Batch is simpler and cheaper. Use it when it fits.
When real-time is needed
- Fraud / abuse detection
- Operational monitoring
- User-facing features (personalization)
- Business KPIs with minute-level granularity
Lambda architecture
Combine both:
- Batch layer: historical, precise
- Speed layer: real-time, approximate
- Serving layer: merges results from both
Increasingly replaced by the Kappa architecture (streaming only).
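The serving-layer merge above can be sketched: precise batch results for closed periods, approximate speed-layer results for the current one (all numbers are made up):

```python
# Batch layer: precise daily totals up to yesterday
batch_view = {"2024-01-01": 1000, "2024-01-02": 1200}

# Speed layer: approximate running total for today
speed_view = {"2024-01-03": 340}

def serve(day):
    """Serving layer: prefer the precise batch value, fall back to speed."""
    if day in batch_view:
        return batch_view[day]
    return speed_view.get(day, 0)

print(serve("2024-01-02"))  # 1200 (batch, precise)
print(serve("2024-01-03"))  # 340 (speed, approximate)
```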
For the interview
You're not expected to build it
Analysts typically don't build streaming platforms.
You are expected to understand
- Batch vs streaming tradeoffs
- Basic architecture
- Use cases
- Tools (Kafka, ClickHouse, etc.)
«Have you worked with real-time data?» «Yes: I queried real-time ClickHouse tables for dashboards and set up alerts».
Specific companies
Yandex
Real-time everywhere (search, ads, delivery).
Tinkoff
Transaction monitoring, fraud.
Ozon / WB
Inventory, pricing.
Smaller companies
Often batch-only: real-time is expensive.
Learning
Basic
Understand Kafka concepts: «topics», «producers», «consumers».
Intermediate
Query real-time OLAP (ClickHouse), set up alerts.
Advanced
Build streaming pipeline (Flink, Spark).
The last one is usually data engineer territory.
Performance tips
Query efficiently
The same tips as for batch SQL apply: indexes, partition pruning.
Limit time windows
Don't query unlimited history in a real-time table; it's slow.
Pre-aggregate
Use materialized views for minute/hour aggregations.
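What a materialized view does can be sketched in Python: events are rolled up into minute buckets on arrival, so later queries read the small aggregate instead of raw rows (timestamps are toy integers in seconds):

```python
from collections import Counter

minute_counts = Counter()  # the "materialized view": event count per minute

def ingest(ts_seconds):
    """Roll each incoming event up into its minute bucket on arrival."""
    minute_counts[ts_seconds // 60] += 1

for ts in [5, 30, 65, 70, 125]:  # five raw events
    ingest(ts)

print(dict(minute_counts))  # {0: 2, 1: 2, 2: 1}
```

A dashboard query then scans 3 aggregate rows instead of 5 raw ones; at real volumes the ratio is millions to thousands.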
Subsample
Query a sample of the data for exploration.
Ethics
Real-time data enables surveillance. Consider privacy.
Data retention
How long should data be kept? Privacy laws limit retention.
Access
Restrict access to sensitive real-time data.
At the interview
«Real-time experience?»
Be specific:
- Queried ClickHouse real-time tables
- Set up alerts on metric changes
- Built dashboards on live data
Be honest; don't exaggerate.
«Batch vs streaming?»
Trade-offs: latency, cost, complexity, accuracy.
«How does live fraud detection work?»
Event → feature extraction → model → decision, all within milliseconds.
Related topics
- ClickHouse cheat sheet
- Batch vs stream processing
- Anomaly detection
- Airflow for the analyst
- SQL for fraud detection
FAQ
Kafka alternatives?
Pulsar, Kinesis, Pub/Sub.
Does an analyst build real-time pipelines?
Usually not; analysts understand and consume them.
How much does real-time cost?
Much more than batch. Worth it only when actually needed.
Practice: open the trainer with 1500+ interview questions.