dbt для аналитика с нуля

Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.

Зачем это знать

dbt — революция в analytics engineering. Transforms DWH через SQL, с tests, docs, lineage. За last 5 лет стал industry-standard в modern data stack. В Yandex, Tinkoff, Avito и международных tech companies — requirement middle+ аналитика.

Что такое dbt

dbt (data build tool) — framework для SQL transformations в warehouse:

  • Write SELECT queries (models)
  • dbt создаёт tables / views
  • Manage dependencies
  • Tests / docs / lineage

Не ETL (extraction, load — другие tools). Только Transform (T в ELT).

Философия

ELT не ETL

Modern stack: Extract + Load в DWH, Transform в DWH (SQL).

Раньше: python scripts transform → DWH.

Теперь: dbt transforms in warehouse (fast, scalable).

Analytics engineering

Новая роль между analyst и data engineer. Ответственна за trusted layer в DWH.

Main концепции

Models

SQL files с SELECT — становятся tables / views:

-- models/daily_revenue.sql
SELECT
    DATE(created_at) AS day,
    SUM(total) AS revenue
FROM raw.orders
GROUP BY 1

dbt runCREATE TABLE daily_revenue AS ....

ref

Dependencies между models:

SELECT * FROM {{ ref('stg_orders') }}

{{ ref() }} — dbt resolves к actual table name.

Sources

External tables:

SELECT * FROM {{ source('raw', 'orders') }}

Tests

# models/schema.yml
models:
  - name: orders
    columns:
      - name: id
        tests: [unique, not_null]
      - name: user_id
        tests:
          - relationships:
              to: ref('users')
              field: id

dbt test — runs all tests.

Docs

Auto-generated documentation site с data lineage graph.

Структура проекта

dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── intermediate/
│   └── marts/
│       ├── core/
│       └── marketing/
├── tests/
├── macros/
└── seeds/

Layers: staging (cleaned), intermediate, marts (business-ready).

Макросы (Jinja)

SQL + Python-like templating:

{% SET regions = ['RU', 'UA', 'KZ'] %}

SELECT
    {% for r IN regions %}
    SUM(CASE WHEN country = '{{ r }}' THEN revenue END) AS rev_{{ r }}{% if NOT loop.last %},{% endif %}
    {% endfor %}
FROM orders

Dry-up repeating patterns.

Incremental models

Large tables — не rebuild fully каждый run:

{{ config(materialized='incremental') }}

SELECT * FROM raw.events
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

Only new rows processed.

dbt Cloud vs Core

dbt Core

  • Open-source CLI
  • Self-host
  • Free

dbt Cloud

  • Managed UI
  • Scheduling
  • $100+/user/month

Teams часто: Core для dev, Cloud для production scheduling.

Workflow

# Develop
dbt run --select model_name

# Test
dbt test --select model_name

# Docs
dbt docs generate
dbt docs serve

# Full run
dbt run
dbt test

Deployment

Scheduler

  • dbt Cloud
  • Airflow + dbt Core
  • GitHub Actions

Обычно: dbt runs каждые 1-24 hours.

Git integration

Standard SWE workflow:

  • Feature branch
  • PR review
  • Merge to main
  • Deploy

Связь с BI

dbt выводит clean tables → BI (Looker, Metabase, Tableau) reads.

Single source of truth — metric definition в dbt, не в каждом BI.

Macros, packages

dbt-utils, dbt-expectations — collections of utilities:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0
-- Use
SELECT {{ dbt_utils.generate_surrogate_key(['col1', 'col2']) }} AS id
FROM TABLE

Tests

Built-in

  • not_null
  • unique
  • accepted_values
  • relationships

Custom

-- tests/assert_positive_revenue.sql
SELECT * FROM {{ ref('revenue') }}
WHERE revenue < 0

Test fails если row returned.

Snapshots

Track slowly-changing dimensions:

{% snapshot users_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='check',
      check_cols=['email', 'plan']
    )
}}

SELECT * FROM {{ source('raw', 'users') }}
{% endsnapshot %}

Roles

Data engineer

Builds ingestion, manages infra.

Analytics engineer

Writes dbt models, owns transformations.

Analyst

Uses dbt models for analyses.

Blurring роли, но dbt — primary tool для analytics engineer.

На собесе

«dbt experience?» Modeling layers, tests, docs, incremental.

«dbt vs stored procedures?» dbt — git-based, testable, documented. Stored procs — legacy.

«Why dbt?» Software engineering best practices в SQL world.

Связанные темы

FAQ

Python нужен?

Нет. Базово — только SQL + Jinja templating.

Cloud vs Core?

Core для start. Cloud когда нужен scheduling UI и team scale.

Learning curve?

Пара недель для analyst с SQL background.


Тренируйте Data — откройте тренажёр с 1500+ вопросами для собесов.