How to build an ETL pipeline: a guide for analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews: SQL, Python, A/B tests, metrics. Free.

Why this matters

The data an analyst works with has always been processed through some pipeline. Understanding that pipeline means better analyses, faster debugging, and the ability to propose improvements.

Analysts do not usually build pipelines themselves, but they should know the concepts.

ETL vs ELT

ETL (classic)

  • Extract from the source
  • Transform (Python / Spark)
  • Load into the warehouse

Transformation happens BEFORE loading.

ELT (modern)

  • Extract from the source
  • Load raw data into the warehouse
  • Transform inside the warehouse (SQL)

Transformation happens in the DWH, typically with dbt.
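The difference in the order of operations can be shown in a minimal Python sketch. The "warehouse" here is just a dict of in-memory tables, and the cents-to-dollars transform is an illustrative assumption, not a real pipeline:

```python
# Minimal illustration of ETL vs ELT order of operations.
# The "warehouse" is just a dict of tables; real pipelines load
# into Snowflake / BigQuery / ClickHouse instead.

def transform(rows):
    # Convert cents to dollars and drop test records.
    return [
        {"id": r["id"], "amount_usd": r["amount"] / 100}
        for r in rows
        if r["status"] != "test"
    ]

def etl(source_rows, warehouse):
    # ETL: transform BEFORE loading; only clean data reaches the warehouse.
    warehouse["payments"] = transform(source_rows)

def elt(source_rows, warehouse):
    # ELT: load raw data first, transform later inside the warehouse
    # (here in Python; in practice with SQL/dbt over the raw table).
    warehouse["raw_payments"] = list(source_rows)
    warehouse["payments"] = transform(warehouse["raw_payments"])

rows = [
    {"id": 1, "amount": 1999, "status": "succeeded"},
    {"id": 2, "amount": 500, "status": "test"},
]
wh_etl, wh_elt = {}, {}
etl(rows, wh_etl)
elt(rows, wh_elt)
print("raw_payments" in wh_etl)  # False: ETL never stored the raw copy
print("raw_payments" in wh_elt)  # True: ELT keeps raw data for reprocessing
```

The practical consequence: with ELT the raw copy stays in the warehouse, so you can fix a transformation bug and rebuild without re-extracting from the source.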

Modern stack

ELT dominates today. Reasons:

  • Storage is cheap
  • Warehouses are fast
  • SQL-centric, so analyst-friendly
  • The dbt ecosystem

Typical architecture

Sources (DB, APIs, Events)
    ↓ Extract (Fivetran / Airbyte)
Cloud Storage / Raw DWH
    ↓ Transform (dbt)
Clean DWH (dim / fact tables)
    ↓ Serve
BI / Dashboards / ML

Components

Sources

  • Production databases (Postgres, MySQL)
  • SaaS APIs (Stripe, Salesforce, Google Ads)
  • Event streams (Kafka)
  • Files (CSV, JSON)

Extraction

  • Fivetran — managed, expensive, reliable
  • Airbyte — open source, self-hosted
  • Custom Python — maximum flexibility
  • Singer — open-source tap/target specification
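The "custom Python" option usually means a small extractor that pages through a source until it is exhausted. A runnable sketch, where `fetch_page` is a hypothetical stand-in for a real HTTP call (e.g. `requests.get` with offset/limit parameters) serving canned data:

```python
# Sketch of a custom extractor: pull pages from an API until exhausted.
# FAKE_API and fetch_page are stand-ins for a real remote dataset and
# HTTP call, so the sketch runs without a network.

FAKE_API = [{"id": i} for i in range(1, 6)]  # pretend remote dataset

def fetch_page(offset, limit):
    # Hypothetical API call: return at most `limit` records from `offset`.
    return FAKE_API[offset:offset + limit]

def extract_all(page_size=2):
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page means everything has been read
        rows.extend(page)
        offset += len(page)
    return rows

records = extract_all()
print(len(records))  # 5
```

Tools like Fivetran and Airbyte do exactly this loop for hundreds of sources, plus retries, schema handling, and scheduling — which is what you pay for.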

Storage

  • Cloud object storage (S3, GCS) — raw data
  • Warehouse — the loading destination

Warehouses

  • Snowflake — cloud-native
  • BigQuery — Google
  • Redshift — AWS
  • Databricks SQL
  • ClickHouse (real-time focus)

Transformation

  • dbt — SQL-centric, the modern default
  • Apache Spark — big data
  • Airflow + Python — custom jobs

Orchestration

  • Airflow — the standard scheduler
  • Dagster — a modern alternative
  • Prefect — Python-first

BI / serving

  • Tableau, Metabase, Looker
  • Reverse ETL (Hightouch) — warehouse → operational apps

Where the analyst touches the pipeline

Consume

Query the warehouse with SQL.

Understand data lineage:

  • Where does the data come from?
  • How was it transformed?
  • On what schedule is it refreshed?

Contribute

  • Write dbt models (transformations)
  • Propose new sources
  • Define metrics

Debug

  • Data looks wrong — trace it back through the pipeline
  • Pipeline is broken — understand where and why

You need pipeline knowledge even if you never build one.

Typical workflow

1. Data lands

Fivetran syncs Stripe every hour.

Raw table: raw.stripe.charges.

2. Staging

A dbt staging model cleans the raw data:

-- models/staging/stg_stripe_charges.sql
SELECT
    id AS charge_id,
    amount / 100.0 AS amount_usd,  -- Stripe stores amounts in cents
    status,
    created
FROM {{ source('stripe', 'charges') }}
WHERE status != 'test'

3. Core dimensional models

-- models/core/fact_payments.sql
SELECT ... FROM {{ ref('stg_stripe_charges') }}
JOIN ...

4. Mart (use-case specific)

-- models/marts/finance/revenue_daily.sql
SELECT DATE(created) AS day, SUM(amount_usd) AS revenue_usd FROM ...
GROUP BY 1

5. BI reads the mart

A Tableau dashboard is built on top.

dbt project structure

dbt_project/
├── models/
│   ├── staging/
│   │   └── stg_*.sql
│   ├── intermediate/
│   │   └── int_*.sql
│   ├── marts/
│   │   ├── core/
│   │   │   └── dim_*.sql, fact_*.sql
│   │   └── marketing/
│   │       └── ...
├── tests/
├── macros/
└── sources.yml

Naming conventions help readability.

Data quality

Tests

dbt tests:

models:
  - name: fact_payments
    columns:
      - name: charge_id
        tests: [unique, not_null]
      - name: amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: '>= 0'
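What these three tests assert can be mimicked in a few lines of Python over sample rows (the rows themselves are made up for illustration):

```python
# The three checks dbt runs above (unique, not_null, expression >= 0),
# expressed directly over illustrative sample rows.

rows = [
    {"charge_id": "ch_1", "amount_usd": 19.99},
    {"charge_id": "ch_2", "amount_usd": 5.00},
]

ids = [r["charge_id"] for r in rows]
unique_ok = len(ids) == len(set(ids))                          # unique
not_null_ok = all(r["amount_usd"] is not None for r in rows)   # not_null
expression_ok = all(r["amount_usd"] >= 0 for r in rows)        # amount_usd >= 0

print(unique_ok, not_null_ok, expression_ok)  # True True True
```

dbt compiles each test into a SQL query that counts violating rows; a non-zero count fails the run.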

Monitoring

  • Freshness — is new data arriving on time?
  • Volume — do row counts match expectations?
  • Schema changes

Tools: Great Expectations, dbt test, Monte Carlo.
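Freshness and volume monitors boil down to simple threshold checks. A minimal sketch — the lag and tolerance values here are illustrative assumptions, not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Sketch of two basic pipeline monitors: freshness (is the newest record
# recent enough?) and volume (is the row count near the expected baseline?).
# Thresholds are illustrative.

def check_freshness(latest_loaded_at, max_lag=timedelta(hours=2)):
    # Fresh if the newest record is no older than max_lag.
    return datetime.now(timezone.utc) - latest_loaded_at <= max_lag

def check_volume(row_count, expected=10_000, tolerance=0.5):
    # Alert if the count deviates more than ±50% from the baseline.
    return abs(row_count - expected) <= expected * tolerance

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(minutes=30))
stale = check_freshness(datetime.now(timezone.utc) - timedelta(hours=5))
print(fresh, stale)         # True False
print(check_volume(9_000))  # True
print(check_volume(100))    # False
```

Tools like Monte Carlo run checks of this shape automatically across the whole warehouse; dbt has a built-in `source freshness` command for the first one.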

Performance

Incremental models

Don't rebuild everything on each run. Process only the new data:

{{ config(materialized='incremental') }}

SELECT * FROM source
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

On large tables this can speed up runs by orders of magnitude.
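The `is_incremental()` predicate above is a watermark filter: keep only rows newer than the current maximum in the target table. In plain Python, with toy rows and an in-memory "table":

```python
# The is_incremental() pattern in plain Python: append only rows newer
# than the current max(created_at) in the target table. Dates are toy data.

target = [{"id": 1, "created_at": "2024-01-01"}]  # already-built table

def incremental_load(target, source_rows):
    # Watermark = newest timestamp already loaded (empty string if no rows).
    watermark = max((r["created_at"] for r in target), default="")
    new_rows = [r for r in source_rows if r["created_at"] > watermark]
    target.extend(new_rows)
    return len(new_rows)

source = [
    {"id": 1, "created_at": "2024-01-01"},  # already loaded, skipped
    {"id": 2, "created_at": "2024-01-02"},  # new, appended
]
added = incremental_load(target, source)
print(added, len(target))  # 1 2
```

The caveat is the same as in dbt: rows that arrive late, with a timestamp below the watermark, are silently skipped, which is why incremental models often filter with a lookback window.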

Partitioning

A warehouse feature: queries scan only the relevant partitions instead of the full table.

Clustering

Snowflake and BigQuery additionally cluster data so related rows sit together, pruning scans further.

Materialization strategy

  • View: dynamic, no storage
  • Table: materialized, fast read
  • Incremental: table + incremental updates
  • Ephemeral: inline, no table

Cost

Cloud warehouse

Compute can get expensive. Optimize:

  • Pick the right materialization
  • Partitioning
  • Kill runaway queries
  • Monitor usage

SaaS ETL

Billing is usually per row synced, which gets costly at high volume.

Common pitfalls

1. No schema contracts

Source schema changes silently break downstream models.
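A schema contract can be as small as a set comparison that fails loudly when the source drifts from what downstream models expect. A sketch with illustrative column names:

```python
# Sketch of a schema contract check: flag drift between the actual source
# schema and what downstream models expect. Column names are illustrative.

EXPECTED = {"charge_id", "amount_usd", "status", "created"}

def schema_diff(actual_columns):
    actual = set(actual_columns)
    return {
        "missing": EXPECTED - actual,      # columns downstream needs but lost
        "unexpected": actual - EXPECTED,   # new columns nobody modeled yet
    }

diff = schema_diff(["charge_id", "amount_usd", "status", "created", "refund_id"])
print(diff)  # {'missing': set(), 'unexpected': {'refund_id'}}
```

Running a check like this before the transform step turns a silent downstream breakage into an explicit, attributable failure; dbt offers the same idea natively via model contracts.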

2. Silent failures

The pipeline "succeeds" but the data is wrong, because nothing validates it.

3. Manual processes

Anything manual eventually breaks. Automate.

4. No monitoring

Issues get detected via broken dashboards instead of alerts.

5. Sprawl

Hundreds of models with no structure; complexity grows exponentially.

Modern trends

Streaming

Real-time ingestion keeps growing.

Reverse ETL

DWH → operational apps (Salesforce, marketing tools).

Data mesh

Decentralized data ownership, per domain team.

AI-assisted

AI generates dbt models and SQL queries.

Lakehouse

Data lake + warehouse in one. Databricks, Apache Iceberg.

Russian context

The stack is similar:

  • DWH: ClickHouse is widespread; Snowflake is less common (geopolitics).
  • Orchestration: Airflow is dominant.
  • Transform: dbt adoption is growing.
  • BI: Yandex DataLens, Metabase, Tableau.

There are also internal variants (e.g. Yandex YQL).

For the interview

Questions to expect

  • ETL vs ELT?
  • Experience with dbt?
  • Airflow workflows?
  • Data quality tests?

Not in depth

An analyst is not expected to build pipelines — only to understand them.

Architecture question

"How would you design a pipeline for X?"

Walk through it at a high level: source → extract → load → transform → serve.

Showing the basics is enough.

How to learn

dbt

Do the official tutorial. Build a small project.

Airflow

Run it locally (Docker). Write a sample DAG.

Warehouse

Use the Snowflake or BigQuery free tier.

Real project

Contribute to your team's pipeline. Start small (a staging model).

Analytics engineer

A role between analyst and data engineer:

  • dbt
  • Data modeling
  • Testing
  • Documentation

A hybrid role, and a growing one. A good transition path for analysts.


FAQ

Does the analyst build the pipeline?

Typically no. Contributing dbt models — yes.

Is Python required?

For building pipelines — yes. For dbt — no (SQL is primary).

Cloud vs on-prem?

Cloud is dominant; on-prem is mostly legacy.


Practice: open the trainer with 1500+ analyst interview questions.