How to build an ETL pipeline: a guide for analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews: SQL, Python, A/B tests, metrics. Free.

Why this matters

The data an analyst works with has always been processed through some pipeline. Understanding that pipeline means better analyses, faster debugging, and the ability to propose improvements.

Analysts do not usually build pipelines themselves, but they should know the concepts.

ETL vs ELT

ETL (classic)

  • Extract from the source
  • Transform (Python / Spark)
  • Load into the warehouse

Transformation happens BEFORE loading.

ELT (modern)

  • Extract from the source
  • Load raw data into the warehouse
  • Transform inside the warehouse (SQL)

Transformation happens in the DWH, typically with dbt.
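The difference in the order of operations can be shown in a minimal Python sketch. The "warehouse" here is just a dict of in-memory tables, and the cents-to-dollars transform is an illustrative assumption, not a real pipeline:

```python
# Minimal illustration of ETL vs ELT order of operations.
# The "warehouse" is just a dict of tables; real pipelines load
# into Snowflake / BigQuery / ClickHouse instead.

def transform(rows):
    # Convert cents to dollars and drop test records.
    return [
        {"id": r["id"], "amount_usd": r["amount"] / 100}
        for r in rows
        if r["status"] != "test"
    ]

def etl(source_rows, warehouse):
    # ETL: transform BEFORE loading; only clean data reaches the warehouse.
    warehouse["payments"] = transform(source_rows)

def elt(source_rows, warehouse):
    # ELT: load raw data first, transform later inside the warehouse
    # (here in Python; in practice with SQL/dbt over the raw table).
    warehouse["raw_payments"] = list(source_rows)
    warehouse["payments"] = transform(warehouse["raw_payments"])

rows = [
    {"id": 1, "amount": 1999, "status": "succeeded"},
    {"id": 2, "amount": 500, "status": "test"},
]
wh_etl, wh_elt = {}, {}
etl(rows, wh_etl)
elt(rows, wh_elt)
print("raw_payments" in wh_etl)  # False: ETL never stored the raw copy
print("raw_payments" in wh_elt)  # True: ELT keeps raw data for reprocessing
```

The practical consequence: with ELT the raw copy stays in the warehouse, so you can fix a transformation bug and rebuild without re-extracting from the source.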

Modern stack

ELT dominates today. Reasons:

  • Storage is cheap
  • Warehouses are fast
  • SQL-centric, so analyst-friendly
  • The dbt ecosystem

Typical architecture

Sources (DB, APIs, Events)
    ↓ Extract (Fivetran / Airbyte)
Cloud Storage / Raw DWH
    ↓ Transform (dbt)
Clean DWH (dim / fact tables)
    ↓ Serve
BI / Dashboards / ML

Components

Sources

  • Production databases (Postgres, MySQL)
  • SaaS APIs (Stripe, Salesforce, Google Ads)
  • Event streams (Kafka)
  • Files (CSV, JSON)

Extraction

  • Fivetran — managed, expensive, reliable
  • Airbyte — open source, self-hosted
  • Custom Python — maximum flexibility
  • Singer — open-source tap/target specification
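The "custom Python" option usually means a small extractor that pages through a source until it is exhausted. A runnable sketch, where `fetch_page` is a hypothetical stand-in for a real HTTP call (e.g. `requests.get` with offset/limit parameters) serving canned data:

```python
# Sketch of a custom extractor: pull pages from an API until exhausted.
# FAKE_API and fetch_page are stand-ins for a real remote dataset and
# HTTP call, so the sketch runs without a network.

FAKE_API = [{"id": i} for i in range(1, 6)]  # pretend remote dataset

def fetch_page(offset, limit):
    # Hypothetical API call: return at most `limit` records from `offset`.
    return FAKE_API[offset:offset + limit]

def extract_all(page_size=2):
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page means everything has been read
        rows.extend(page)
        offset += len(page)
    return rows

records = extract_all()
print(len(records))  # 5
```

Tools like Fivetran and Airbyte do exactly this loop for hundreds of sources, plus retries, schema handling, and scheduling — which is what you pay for.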

Storage

  • Cloud object storage (S3, GCS) — raw data
  • Warehouse — the loading destination

Warehouses

  • Snowflake — cloud-native
  • BigQuery — Google
  • Redshift — AWS
  • Databricks SQL
  • ClickHouse (real-time focus)

Transformation

  • dbt — SQL-centric, the modern default
  • Apache Spark — big data
  • Airflow + Python — custom jobs

Orchestration

  • Airflow — the standard scheduler
  • Dagster — a modern alternative
  • Prefect — Python-first

BI / serving

  • Tableau, Metabase, Looker
  • Reverse ETL (Hightouch) — warehouse → operational apps

Where the analyst touches the pipeline

Consume

Query the warehouse with SQL.

Understand data lineage:

  • Where does the data come from?
  • How was it transformed?
  • On what schedule is it refreshed?

Contribute

  • Write dbt models (transformations)
  • Propose new sources
  • Define metrics

Debug

  • Data looks wrong — trace it back through the pipeline
  • Pipeline is broken — understand where and why

You need pipeline knowledge even if you never build one.

Typical workflow

1. Data lands

Fivetran syncs Stripe every hour.

Raw table: raw.stripe.charges.

2. Staging

A dbt staging model cleans the raw data:

-- models/staging/stg_stripe_charges.sql
SELECT
    id AS charge_id,
    amount / 100.0 AS amount_usd,  -- Stripe stores amounts in cents
    status,
    created
FROM {{ source('stripe', 'charges') }}
WHERE status != 'test'

3. Core dimensional models

-- models/core/fact_payments.sql
SELECT ... FROM {{ ref('stg_stripe_charges') }}
JOIN ...

4. Mart (use-case specific)

-- models/marts/finance/revenue_daily.sql
SELECT DATE(created) AS day, SUM(amount_usd) AS revenue_usd FROM ...
GROUP BY 1

5. BI reads the mart

A Tableau dashboard is built on top.

dbt project structure

dbt_project/
├── models/
│   ├── staging/
│   │   └── stg_*.sql
│   ├── intermediate/
│   │   └── int_*.sql
│   ├── marts/
│   │   ├── core/
│   │   │   └── dim_*.sql, fact_*.sql
│   │   └── marketing/
│   │       └── ...
├── tests/
├── macros/
└── sources.yml

Naming conventions help readability.

Data quality

Tests

dbt tests:

models:
  - name: fact_payments
    columns:
      - name: charge_id
        tests: [unique, not_null]
      - name: amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: '>= 0'
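What these three tests assert can be mimicked in a few lines of Python over sample rows (the rows themselves are made up for illustration):

```python
# The three checks dbt runs above (unique, not_null, expression >= 0),
# expressed directly over illustrative sample rows.

rows = [
    {"charge_id": "ch_1", "amount_usd": 19.99},
    {"charge_id": "ch_2", "amount_usd": 5.00},
]

ids = [r["charge_id"] for r in rows]
unique_ok = len(ids) == len(set(ids))                          # unique
not_null_ok = all(r["amount_usd"] is not None for r in rows)   # not_null
expression_ok = all(r["amount_usd"] >= 0 for r in rows)        # amount_usd >= 0

print(unique_ok, not_null_ok, expression_ok)  # True True True
```

dbt compiles each test into a SQL query that counts violating rows; a non-zero count fails the run.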

Monitoring

  • Freshness — is new data arriving on time?
  • Volume — do row counts match expectations?
  • Schema changes

Tools: Great Expectations, dbt test, Monte Carlo.
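Freshness and volume monitors boil down to simple threshold checks. A minimal sketch — the lag and tolerance values here are illustrative assumptions, not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Sketch of two basic pipeline monitors: freshness (is the newest record
# recent enough?) and volume (is the row count near the expected baseline?).
# Thresholds are illustrative.

def check_freshness(latest_loaded_at, max_lag=timedelta(hours=2)):
    # Fresh if the newest record is no older than max_lag.
    return datetime.now(timezone.utc) - latest_loaded_at <= max_lag

def check_volume(row_count, expected=10_000, tolerance=0.5):
    # Alert if the count deviates more than ±50% from the baseline.
    return abs(row_count - expected) <= expected * tolerance

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(minutes=30))
stale = check_freshness(datetime.now(timezone.utc) - timedelta(hours=5))
print(fresh, stale)         # True False
print(check_volume(9_000))  # True
print(check_volume(100))    # False
```

Tools like Monte Carlo run checks of this shape automatically across the whole warehouse; dbt has a built-in `source freshness` command for the first one.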

Performance

Incremental models

Don't rebuild everything on each run. Process only the new data:

{{ config(materialized='incremental') }}

SELECT * FROM source
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

On large tables this can speed up runs by orders of magnitude.
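The `is_incremental()` predicate above is a watermark filter: keep only rows newer than the current maximum in the target table. In plain Python, with toy rows and an in-memory "table":

```python
# The is_incremental() pattern in plain Python: append only rows newer
# than the current max(created_at) in the target table. Dates are toy data.

target = [{"id": 1, "created_at": "2024-01-01"}]  # already-built table

def incremental_load(target, source_rows):
    # Watermark = newest timestamp already loaded (empty string if no rows).
    watermark = max((r["created_at"] for r in target), default="")
    new_rows = [r for r in source_rows if r["created_at"] > watermark]
    target.extend(new_rows)
    return len(new_rows)

source = [
    {"id": 1, "created_at": "2024-01-01"},  # already loaded, skipped
    {"id": 2, "created_at": "2024-01-02"},  # new, appended
]
added = incremental_load(target, source)
print(added, len(target))  # 1 2
```

The caveat is the same as in dbt: rows that arrive late, with a timestamp below the watermark, are silently skipped, which is why incremental models often filter with a lookback window.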

Partitioning

A warehouse feature: queries scan only the relevant partitions instead of the full table.

Clustering

Snowflake and BigQuery additionally cluster data so related rows sit together, pruning scans further.

Materialization strategy

  • View: dynamic, no storage
  • Table: materialized, fast read
  • Incremental: table + incremental updates
  • Ephemeral: inline, no table

Cost

Cloud warehouse

Compute can get expensive. Optimize:

  • Pick the right materialization
  • Partitioning
  • Kill runaway queries
  • Monitor usage

SaaS ETL

Billing is usually per row synced, which gets costly at high volume.

Common pitfalls

1. No schema contracts

Source schema changes silently break downstream models.
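A schema contract can be as small as a set comparison that fails loudly when the source drifts from what downstream models expect. A sketch with illustrative column names:

```python
# Sketch of a schema contract check: flag drift between the actual source
# schema and what downstream models expect. Column names are illustrative.

EXPECTED = {"charge_id", "amount_usd", "status", "created"}

def schema_diff(actual_columns):
    actual = set(actual_columns)
    return {
        "missing": EXPECTED - actual,      # columns downstream needs but lost
        "unexpected": actual - EXPECTED,   # new columns nobody modeled yet
    }

diff = schema_diff(["charge_id", "amount_usd", "status", "created", "refund_id"])
print(diff)  # {'missing': set(), 'unexpected': {'refund_id'}}
```

Running a check like this before the transform step turns a silent downstream breakage into an explicit, attributable failure; dbt offers the same idea natively via model contracts.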

2. Silent failures

The pipeline "succeeds" but the data is wrong, because nothing validates it.

3. Manual processes

Anything manual eventually breaks. Automate.

4. No monitoring

Issues get detected via broken dashboards instead of alerts.

5. Sprawl

Hundreds of models with no structure; complexity grows exponentially.

Modern trends

Streaming

Real-time ingestion keeps growing.

Reverse ETL

DWH → operational apps (Salesforce, marketing tools).

Data mesh

Decentralized data ownership, per domain team.

AI-assisted

AI generates dbt models and SQL queries.

Lakehouse

Data lake + warehouse in one. Databricks, Apache Iceberg.

Russian context

The stack is similar:

  • DWH: ClickHouse is widespread; Snowflake is less common (geopolitics).
  • Orchestration: Airflow is dominant.
  • Transform: dbt adoption is growing.
  • BI: Yandex DataLens, Metabase, Tableau.

There are also internal variants (e.g. Yandex YQL).

For the interview

Questions to expect

  • ETL vs ELT?
  • Experience with dbt?
  • Airflow workflows?
  • Data quality tests?

Not in depth

An analyst is not expected to build pipelines — only to understand them.

Architecture question

"How would you design a pipeline for X?"

Walk through it at a high level: source → extract → load → transform → serve.

Showing the basics is enough.

How to learn

dbt

Do the official tutorial. Build a small project.

Airflow

Run it locally (Docker). Write a sample DAG.

Warehouse

Use the Snowflake or BigQuery free tier.

Real project

Contribute to your team's pipeline. Start small (a staging model).

Analytics engineer

A role between analyst and data engineer:

  • dbt
  • Data modeling
  • Testing
  • Documentation

A hybrid role, and a growing one. A good transition path for analysts.


FAQ

Does the analyst build the pipeline?

Typically no. Contributing dbt models — yes.

Is Python required?

For building pipelines — yes. For dbt — no (SQL is primary).

Cloud vs on-prem?

Cloud is dominant; on-prem is mostly legacy.


Practice: open the trainer with 1500+ analyst interview questions.