Dimensional modeling для аналитика

Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.

Зачем это знать

Data warehouse design — если вы работаете с DWH, это необходимо. Kimball dimensional modeling — industry standard.

На собесах middle+ questions о star schema, dim / fact, SCD — expected.

Foundation: Kimball approach

Ralph Kimball — data warehouse thought leader.

Idea: analytics-optimized schema, business-user-friendly.

Contrast OLTP (operational)

  • Normalized
  • Write-optimized
  • Entities normalized tables

OLAP (analytics)

  • Denormalized
  • Read-optimized
  • Fact + dimension tables

Dimensional modeling — for OLAP.

Fact tables

Core business events / measurements.

Properties

  • Many rows (millions-billions)
  • Few columns (numeric measures + foreign keys)
  • Grain — what each row represents

Example: Sales fact

fact_sales
- date_key (FK)
- product_key (FK)
- customer_key (FK)
- store_key (FK)
- quantity
- price
- revenue

Each row: one sale.

Types

  • Transactional: single event per row
  • Periodic snapshot: state at time (e.g., daily inventory)
  • Accumulating snapshot: milestones (e.g., order stages)

Dimension tables

Descriptive context.

Properties

  • Fewer rows (< 1M usually)
  • Many columns (descriptive attributes)
  • Primary key natural or surrogate

Example: Product dim

dim_product
- product_key (PK)
- product_id (natural)
- name
- category
- brand
- price
- color

Each row: one product.

Star schema

Fact table center, dim tables around.

        dim_product
             |
dim_date — fact_sales — dim_customer
             |
         dim_store

Simple joins (fact ↔ dim). Analytics-friendly.

Snowflake schema

Like star, but dim tables normalized (sub-dims).

dim_product → dim_category → dim_supercategory

Saves space, but more JOINs. Often less useful на modern scale.

Grain

Critical concept: what single row represents.

Examples

  • One sale transaction
  • One user-day
  • One order-line

Matters

Ambiguous grain → double-counting, wrong metrics.

Document clearly.

Foreign keys

Fact → dim через keys.

Surrogate keys

Artificial ID (integer). Not business key.

Reasons:

  • Performance (integer JOIN fast)
  • Handle history (SCD)
  • Simpler schema

Natural keys

Business identifier (product_sku).

Problems:

  • Can change
  • Hard standardize
  • Slower JOIN

Warehouse convention: surrogate keys mostly.

Slowly Changing Dimensions (SCD)

Dim attributes change. How track?

Type 1: Overwrite

New value replaces old. No history.

Simple. Lose history.

dim_customer.city: 'Moscow' → 'SPb'

Type 2: Add row

New row с new values. Old row valid_until timestamp.

customer_key | customer_id | city   | valid_from | valid_to
1            | 123         | Moscow | 2020-01    | 2023-06
2            | 123         | SPb    | 2023-06    | 9999-12

History preserved. Common choice.

Type 3: Add column

Current + previous.

city_current | city_previous

Limited history.

Type 4: History table

Separate table tracks changes.

Type 6: Hybrid

Combines multiple.

Most common: Type 1 (для unimportant) + Type 2 (important historical).

Degenerate dimensions

Dim attribute stored в fact table (no separate dim).

Example: order_id в fact_sales. No separate dim_order (no other attributes).

Saves JOIN.

Junk dimensions

Combine unrelated small dims:

dim_junk: (payment_method, shipping_method, promo_code_flag)

Instead 3 separate dims.

Role-playing dimensions

Same dim used multiple contexts.

Example: dim_date как «order_date» и «shipment_date». Same dim, different FK.

Bridge tables

Many-to-many между dim и fact.

Example: Sale → Multiple salespeople.

bridge_sale_salesperson: (sale_key, salesperson_key, commission_pct)

Example: e-commerce

Facts

  • fact_orders
  • fact_order_items
  • fact_web_sessions
  • fact_inventory (snapshot)

Dimensions

  • dim_date
  • dim_customer (SCD Type 2)
  • dim_product
  • dim_store
  • dim_campaign

Star queries

«Revenue по category месячно»:

SELECT
    d.month,
    p.category,
    SUM(f.revenue) AS total
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category;

Clean, fast.

In practice

Tools

  • dbt — build dimensional model в SQL
  • Informatica / other ETL — legacy
  • Fivetran / Airbyte — extraction + transform

Stages

  1. Raw: source system replica
  2. Staged: cleaned, typed
  3. Core: dimensional model (facts + dims)
  4. Marts: specific use cases

dbt example

-- models/core/dim_customer.sql
SELECT
    ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,
    customer_id,
    first_name,
    last_name,
    city,
    signup_date,
    CURRENT_DATE AS valid_from
FROM {{ ref('stg_customers') }};

Для analyst

Benefits

  • Simple JOINs
  • Fast aggregations
  • Business-friendly
  • Reusable

Query

SELECT dim.*, fact.* FROM fact_x
JOIN dim_y ON fact_x.dim_y_key = dim_y.dim_y_key;

Predictable.

SCD aware

When querying history — filter valid_to (Type 2).

Альтернативы

Data vault

Alternative methodology (Dan Linstedt). Hub / link / satellite.

Less common. Enterprise sometimes.

Flat tables

Modern approach: wide denormalized tables. Ok при cheap storage.

Trade-off: less flexible queries.

Activity schema

Event-based. Each row = event. Simpler.

Common mistakes

Wrong grain

Ambiguous counts.

No surrogate keys

Production issues when natural changes.

Too many joins

If query requires 10 joins — questionable design.

Ignoring history

No SCD → cannot answer historical queries.

Over-engineering

Simple business → over-complex schema.

На собесе

«Star schema что это?»

Central fact, surrounding dims. Analytics-friendly.

«Star vs snowflake?»

Star flat. Snowflake normalized dims.

«SCD types?»

1: overwrite. 2: history rows. 3: column add.

«Fact vs dim?»

Fact — measurements, events. Dim — descriptive attributes.

«Grain?»

What single row represents в fact table.

Books

  • Kimball «Data Warehouse Toolkit»
  • «Star Schema Complete Reference»

Связанные темы

FAQ

Analyst должен знать?

Middle+ — yes. Especially working с DWH.

Обязательно Kimball?

Standard. Other approaches niches.

Modern stack меняет?

Some flat-table trends. Fundamentals same.


Тренируйте — откройте тренажёр с 1500+ вопросами для собесов.