Dimensional modeling для аналитика
Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.
Зачем это знать
Data warehouse design — если вы работаете с DWH, это необходимо. Kimball dimensional modeling — industry standard.
На собесах middle+ questions о star schema, dim / fact, SCD — expected.
Foundation: Kimball approach
Ralph Kimball — data warehouse thought leader.
Idea: analytics-optimized schema, business-user-friendly.
Contrast OLTP (operational)
- Normalized
- Write-optimized
- Entities normalized tables
OLAP (analytics)
- Denormalized
- Read-optimized
- Fact + dimension tables
Dimensional modeling — for OLAP.
Fact tables
Core business events / measurements.
Properties
- Many rows (millions-billions)
- Few columns (numeric measures + foreign keys)
- Grain — what each row represents
Example: Sales fact
fact_sales
- date_key (FK)
- product_key (FK)
- customer_key (FK)
- store_key (FK)
- quantity
- price
- revenueEach row: one sale.
Types
- Transactional: single event per row
- Periodic snapshot: state at time (e.g., daily inventory)
- Accumulating snapshot: milestones (e.g., order stages)
Dimension tables
Descriptive context.
Properties
- Fewer rows (< 1M usually)
- Many columns (descriptive attributes)
- Primary key natural or surrogate
Example: Product dim
dim_product
- product_key (PK)
- product_id (natural)
- name
- category
- brand
- price
- colorEach row: one product.
Star schema
Fact table center, dim tables around.
dim_product
|
dim_date — fact_sales — dim_customer
|
dim_storeSimple joins (fact ↔ dim). Analytics-friendly.
Snowflake schema
Like star, but dim tables normalized (sub-dims).
dim_product → dim_category → dim_supercategorySaves space, but more JOINs. Often less useful на modern scale.
Grain
Critical concept: what single row represents.
Examples
- One sale transaction
- One user-day
- One order-line
Matters
Ambiguous grain → double-counting, wrong metrics.
Document clearly.
Foreign keys
Fact → dim через keys.
Surrogate keys
Artificial ID (integer). Not business key.
Reasons:
- Performance (integer JOIN fast)
- Handle history (SCD)
- Simpler schema
Natural keys
Business identifier (product_sku).
Problems:
- Can change
- Hard standardize
- Slower JOIN
Warehouse convention: surrogate keys mostly.
Slowly Changing Dimensions (SCD)
Dim attributes change. How track?
Type 1: Overwrite
New value replaces old. No history.
Simple. Lose history.
dim_customer.city: 'Moscow' → 'SPb'Type 2: Add row
New row с new values. Old row valid_until timestamp.
customer_key | customer_id | city | valid_from | valid_to
1 | 123 | Moscow | 2020-01 | 2023-06
2 | 123 | SPb | 2023-06 | 9999-12History preserved. Common choice.
Type 3: Add column
Current + previous.
city_current | city_previousLimited history.
Type 4: History table
Separate table tracks changes.
Type 6: Hybrid
Combines multiple.
Most common: Type 1 (для unimportant) + Type 2 (important historical).
Degenerate dimensions
Dim attribute stored в fact table (no separate dim).
Example: order_id в fact_sales. No separate dim_order (no other attributes).
Saves JOIN.
Junk dimensions
Combine unrelated small dims:
dim_junk: (payment_method, shipping_method, promo_code_flag)Instead 3 separate dims.
Role-playing dimensions
Same dim used multiple contexts.
Example: dim_date как «order_date» и «shipment_date». Same dim, different FK.
Bridge tables
Many-to-many между dim и fact.
Example: Sale → Multiple salespeople.
bridge_sale_salesperson: (sale_key, salesperson_key, commission_pct)Example: e-commerce
Facts
- fact_orders
- fact_order_items
- fact_web_sessions
- fact_inventory (snapshot)
Dimensions
- dim_date
- dim_customer (SCD Type 2)
- dim_product
- dim_store
- dim_campaign
Star queries
«Revenue по category месячно»:
SELECT
d.month,
p.category,
SUM(f.revenue) AS total
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category;Clean, fast.
In practice
Tools
- dbt — build dimensional model в SQL
- Informatica / other ETL — legacy
- Fivetran / Airbyte — extraction + transform
Stages
- Raw: source system replica
- Staged: cleaned, typed
- Core: dimensional model (facts + dims)
- Marts: specific use cases
dbt example
-- models/core/dim_customer.sql
SELECT
ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,
customer_id,
first_name,
last_name,
city,
signup_date,
CURRENT_DATE AS valid_from
FROM {{ ref('stg_customers') }};Для analyst
Benefits
- Simple JOINs
- Fast aggregations
- Business-friendly
- Reusable
Query
SELECT dim.*, fact.* FROM fact_x
JOIN dim_y ON fact_x.dim_y_key = dim_y.dim_y_key;Predictable.
SCD aware
When querying history — filter valid_to (Type 2).
Альтернативы
Data vault
Alternative methodology (Dan Linstedt). Hub / link / satellite.
Less common. Enterprise sometimes.
Flat tables
Modern approach: wide denormalized tables. Ok при cheap storage.
Trade-off: less flexible queries.
Activity schema
Event-based. Each row = event. Simpler.
Common mistakes
Wrong grain
Ambiguous counts.
No surrogate keys
Production issues when natural changes.
Too many joins
If query requires 10 joins — questionable design.
Ignoring history
No SCD → cannot answer historical queries.
Over-engineering
Simple business → over-complex schema.
На собесе
«Star schema что это?»
Central fact, surrounding dims. Analytics-friendly.
«Star vs snowflake?»
Star flat. Snowflake normalized dims.
«SCD types?»
1: overwrite. 2: history rows. 3: column add.
«Fact vs dim?»
Fact — measurements, events. Dim — descriptive attributes.
«Grain?»
What single row represents в fact table.
Books
- Kimball «Data Warehouse Toolkit»
- «Star Schema Complete Reference»
Связанные темы
- Star schema vs snowflake
- Fact vs dimension table
- OLTP vs OLAP
- Data warehouse vs database
- Нормализация БД
- dbt для аналитика
FAQ
Analyst должен знать?
Middle+ — yes. Especially working с DWH.
Обязательно Kimball?
Standard. Other approaches niches.
Modern stack меняет?
Some flat-table trends. Fundamentals same.
Тренируйте — откройте тренажёр с 1500+ вопросами для собесов.