How to build an ETL pipeline: a guide for analysts
Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions: SQL, Python, A/B tests, metrics. Free.
Why you should know this
An analyst always works with data that has been processed through a pipeline. Understanding that pipeline means better analyses, faster debugging, and the ability to propose improvements.
Analysts don't usually build pipelines themselves, but they should know the concepts.
ETL vs ELT
ETL (classic)
- Extract from the source
- Transform (Python / Spark)
- Load into the warehouse
The transform happens BEFORE the load.
ELT (modern)
- Extract from the source
- Load the raw data into the warehouse
- Transform inside the warehouse (SQL)
The transform runs in the DWH, typically with dbt.
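The difference can be sketched in a few lines of Python, using sqlite3 as a stand-in warehouse. The table, columns, and sample rows are purely illustrative:

```python
import sqlite3

# Toy source rows: Stripe-like charges, amounts in cents (illustrative data).
source_rows = [(1, 1999, "succeeded"), (2, 500, "test"), (3, 4500, "succeeded")]

# --- ETL: transform in application code BEFORE loading ---
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE charges (id INTEGER, amount_usd REAL, status TEXT)")
clean = [(i, cents / 100, s) for i, cents, s in source_rows if s != "test"]
etl.executemany("INSERT INTO charges VALUES (?, ?, ?)", clean)

# --- ELT: load raw rows first, then transform with SQL inside the warehouse ---
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE raw_charges (id INTEGER, amount INTEGER, status TEXT)")
elt.executemany("INSERT INTO raw_charges VALUES (?, ?, ?)", source_rows)
elt.execute("""
    CREATE TABLE charges AS
    SELECT id, amount / 100.0 AS amount_usd, status
    FROM raw_charges
    WHERE status != 'test'
""")

# Both paths end with the same clean table; only WHERE the transform runs differs.
print(etl.execute("SELECT COUNT(*) FROM charges").fetchone()[0])  # 2
print(elt.execute("SELECT COUNT(*) FROM charges").fetchone()[0])  # 2
```

Note that in the ELT path the raw table survives, so you can re-derive the clean table with a different SQL transform later without re-extracting anything.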
Modern stack
ELT dominates, for several reasons:
- Storage is cheap
- Warehouses are fast
- It is SQL-centric (analyst-friendly)
- The dbt ecosystem
Typical architecture
Sources (DB, APIs, Events)
↓ Extract (Fivetran / Airbyte)
Cloud Storage / Raw DWH
↓ Transform (dbt)
Clean DWH (dim / fact tables)
↓ Serve
BI / Dashboards / ML
Components
Sources
- Production databases (Postgres, MySQL)
- SaaS APIs (Stripe, Salesforce, Google Ads)
- Event streams (Kafka)
- Files (CSV, JSON)
Extraction
- Fivetran — managed, expensive, reliable
- Airbyte — open source, can be self-hosted
- Custom Python — maximum flexibility
- Singer — open-source tap/target specification
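A custom Python extractor is usually just a loop that pages through a source API and dumps the raw records. A minimal sketch, with the paginated API simulated in memory so it runs as-is (in a real extractor `fetch_page` would be an HTTP call):

```python
import json

# A hypothetical paginated API, simulated in memory so the sketch is runnable.
PAGES = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}], 2: []}

def fetch_page(page: int) -> list[dict]:
    # Stand-in for e.g. requests.get(f"{API_URL}?page={page}").json()
    return PAGES.get(page, [])

def extract_all() -> list[dict]:
    """Page through the source until an empty page signals the end."""
    rows, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows

rows = extract_all()
# In the ELT spirit, a custom extractor dumps raw JSON to cloud storage
# or a raw table and leaves all cleaning to the warehouse transform.
raw_payload = json.dumps(rows)
print(len(rows))  # 3
```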
Storage
- Cloud storage (S3, GCS) — raw files
- Warehouse — the loading destination
Warehouses
- Snowflake — cloud-native
- BigQuery — Google
- Redshift — AWS
- Databricks SQL
- ClickHouse (real-time focus)
Transformation
- dbt — SQL-centric, the modern default
- Apache Spark — big data
- Airflow + Python — custom pipelines
Orchestration
- Airflow — scheduler
- Dagster — modern alternative
- Prefect — Python-first
BI / serve
- Tableau, Metabase, Looker
- Reverse ETL (Hightouch) — warehouse → apps
Analyst touch points
Consume
You query the warehouse with SQL.
Understand the data lineage:
- Where does the data come from?
- How was it transformed?
- What is the refresh schedule?
Contribute
- Write dbt models (transformations)
- Propose new sources
- Define metrics
Debug
- Data looks wrong — trace it back through the pipeline
- Pipeline broke — understand what failed
You need pipeline knowledge even if you never build one.
Typical workflow
1. Data becomes available
Fivetran syncs Stripe every hour.
Raw table: raw.stripe.charges.
2. Staging
A dbt model cleans the raw data:
-- models/staging/stg_stripe_charges.sql
SELECT
    id AS charge_id,
    amount / 100 AS amount_usd,
    status,
    created
FROM {{ source('stripe', 'charges') }}
WHERE status != 'test'
3. Core dimensional models
-- models/core/fact_payments.sql
SELECT ... FROM {{ ref('stg_stripe_charges') }}
JOIN ...
4. Mart (use-case specific)
-- models/marts/finance/revenue_daily.sql
SELECT DATE(created) AS day, SUM(amount_usd) FROM ...
GROUP BY 1
5. BI reads the mart
A Tableau dashboard queries the mart.
dbt project structure
dbt_project/
├── models/
│   ├── staging/
│   │   └── stg_*.sql
│   ├── intermediate/
│   │   └── int_*.sql
│   └── marts/
│       ├── core/
│       │   └── dim_*.sql, fact_*.sql
│       └── marketing/
│           └── ...
├── tests/
├── macros/
└── sources.yml
Naming conventions help readability.
Data quality
Tests
dbt tests are declared in YAML:
models:
  - name: fact_payments
    columns:
      - name: charge_id
        tests: [unique, not_null]
      - name: amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: '>= 0'
Monitoring
- Freshness (is new data arriving?)
- Volume (do row counts match expectations?)
- Schema changes
Tools: Great Expectations, dbt test, Monte Carlo.
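The freshness and volume checks are simple enough to sketch by hand; this is the kind of logic the tools above automate. The thresholds and table stats here are illustrative:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Has new data arrived recently enough?"""
    return datetime.now(timezone.utc) - latest_loaded_at <= max_lag

def check_volume(row_count: int, expected: int, tolerance: float = 0.5) -> bool:
    """Is the row count within +/- tolerance of the expected daily volume?"""
    return abs(row_count - expected) <= expected * tolerance

now = datetime.now(timezone.utc)
# A table synced within the last 30 minutes passes a 2-hour freshness SLA.
assert check_freshness(now - timedelta(minutes=30), max_lag=timedelta(hours=2))
# A table last loaded yesterday fails it.
assert not check_freshness(now - timedelta(days=1), max_lag=timedelta(hours=2))
# 9,500 rows against an expected 10,000 is within the 50% band.
assert check_volume(9_500, expected=10_000)
# 100 rows is a suspicious drop that should trigger an alert.
assert not check_volume(100, expected=10_000)
```

In practice these checks run on a schedule (or after each pipeline run) and page someone instead of raising AssertionError.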
Performance
Incremental models
Don't rebuild everything; process only the new data:
{{ config(materialized='incremental') }}
SELECT * FROM source
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
This can speed up large tables 100x.
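The high-watermark idea behind incremental models can be simulated with sqlite: each run loads only rows newer than MAX(created_at) in the target. The schema and dates are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (id INTEGER, created_at TEXT)")
con.execute("CREATE TABLE target (id INTEGER, created_at TEXT)")

def incremental_load():
    # The high watermark: the newest timestamp already in the target.
    watermark = con.execute("SELECT MAX(created_at) FROM target").fetchone()[0]
    con.execute(
        "INSERT INTO target SELECT * FROM source WHERE created_at > ?",
        (watermark or "",),  # first run: empty watermark loads everything
    )

con.executemany("INSERT INTO source VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-01-02")])
incremental_load()   # first run: full load (2 rows)

con.execute("INSERT INTO source VALUES (3, '2024-01-03')")
incremental_load()   # second run: only the new row passes the watermark filter

print(con.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 3
```

The second run re-reads nothing that is already loaded, which is where the speedup on large tables comes from.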
Partitioning
A warehouse feature: queries scan only the relevant partitions.
Clustering
Snowflake / BigQuery specific.
Materialization strategy
- View: dynamic, no storage
- Table: materialized, fast reads
- Incremental: a table with incremental updates
- Ephemeral: inlined into downstream queries, no table
Cost
Cloud warehouse
Can get expensive. Optimize:
- Choose the right materialization
- Partition large tables
- Kill runaway queries
- Monitor usage
SaaS ETL
Usually billed per row synced. Costly at high volume.
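Per-row billing is worth a back-of-the-envelope calculation before committing. The price and volume below are purely illustrative assumptions, not any vendor's real rate:

```python
# Hypothetical per-row rate; real SaaS ETL pricing varies widely and is often
# tiered, so always check the vendor's actual price sheet.
price_per_row = 0.0005          # $ per synced row (assumed)
rows_per_day = 2_000_000        # e.g. a busy events table (assumed)

monthly_cost = price_per_row * rows_per_day * 30
print(f"${monthly_cost:,.0f}/month")  # $30,000/month at these assumptions
```

At that scale a self-hosted Airbyte or a custom extractor often pays for the engineering time it costs.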
Common pitfalls
1. No schema contracts
Source changes silently break everything downstream.
2. Silent failures
The pipeline "succeeds" but the data is wrong, because nothing validates it.
3. Manual processes
Anything manual eventually breaks. Automate.
4. No monitoring
You discover issues through broken dashboards instead of alerts.
5. Sprawl
Hundreds of models with no structure; complexity grows exponentially.
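The first two pitfalls can be caught with a cheap check before loading: a minimal "schema contract" that asserts the source still exposes the columns downstream models depend on. The expected column set here is illustrative:

```python
# Columns and types the downstream models depend on (illustrative contract).
EXPECTED = {"id": "INTEGER", "amount": "INTEGER", "status": "TEXT"}

def check_contract(actual_schema: dict) -> list[str]:
    """Return human-readable violations; an empty list means the contract holds."""
    problems = []
    for col, typ in EXPECTED.items():
        if col not in actual_schema:
            problems.append(f"missing column: {col}")
        elif actual_schema[col] != typ:
            problems.append(f"type changed: {col} {typ} -> {actual_schema[col]}")
    return problems

# The source dropped `status` and retyped `amount`: fail loudly before loading
# instead of letting a "successful" run produce wrong data downstream.
print(check_contract({"id": "INTEGER", "amount": "TEXT"}))
```

Tools like dbt model contracts and Great Expectations formalize exactly this idea.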
Modern trends
Streaming
Real-time ingestion keeps growing.
Reverse ETL
DWH → operational apps (Salesforce, marketing tools).
Data mesh
Decentralized data ownership, one domain per team.
AI-assisted
AI generates dbt models and SQL queries.
Lakehouse
Data lake plus warehouse in one platform. Databricks, Apache Iceberg.
Russian context
The stack is similar:
- DWH: ClickHouse is widespread; Snowflake is less common (geopolitics).
- Orchestration: Airflow dominates.
- Transform: dbt adoption is growing.
- BI: Yandex DataLens, Metabase, Tableau.
Plus internal variants (Yandex YQL).
For the interview
Typical questions
- ETL vs ELT?
- Do you have dbt experience?
- Have you built Airflow workflows?
- How do you test data quality?
Not in depth
An analyst isn't expected to build pipelines, only to understand them.
The architecture question
"How would you design a pipeline for X?"
Walk through it at a high level: source → extract → load → transform → serve.
Showing you know the basics is enough.
How to learn
dbt
Do the official tutorial. Build a small project.
Airflow
Run it locally (Docker). Write a sample DAG.
Warehouse
Try the Snowflake or BigQuery free tier.
Real project
Contribute to your team's pipeline. Start small (a staging model).
Analytics engineer
A role between analyst and data engineer:
- dbt
- Data modeling
- Testing
- Documentation
A hybrid role, and a growing one.
A good transition path.
Related topics
- dbt for analysts
- Airflow for analysts
- ETL vs ELT
- Data warehouse vs database
- Dimensional modeling
- Data governance
FAQ
Does an analyst build pipelines?
Typically no, but contributing dbt models is common.
Is Python mandatory?
For building pipelines, yes. For dbt, no: SQL is primary.
Cloud vs on-prem?
Cloud is dominant; on-prem is mostly legacy.
Practice: open the trainer with 1500+ analyst interview questions.