Data tiering in Data Engineer interviews
Why tiering
Storage costs scale with data volume, and not all data is accessed equally.
Recent data is hot (queried often). Old data is cold (rarely accessed).
Tiered storage matches each dataset's access pattern to an appropriately priced storage class.
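To make the cost argument concrete, here is a minimal sketch that compares keeping everything in the hot tier with splitting the same volume across hot, warm, and cold tiers. The per-GB prices and the 10/30/60 split are illustrative assumptions, not quotes from any provider.

```python
# Hypothetical per-GB monthly prices (assumed for illustration only;
# real prices depend on provider, region, and retrieval fees).
PRICE_PER_GB_MONTH = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}


def monthly_cost(gb_by_tier):
    """Sum the monthly storage cost for a {tier: gigabytes} mapping."""
    return sum(PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())


total_gb = 100_000

# Baseline: every gigabyte stays in the hot tier.
all_hot = monthly_cost({"hot": total_gb})

# Tiered layout: an assumed split of 10% hot, 30% warm, 60% cold.
tiered = monthly_cost({"hot": 10_000, "warm": 30_000, "cold": 60_000})

# Fraction of the baseline bill saved by tiering.
savings = 1 - tiered / all_hot

print(f"all hot: {all_hot:.2f}, tiered: {tiered:.2f}, savings: {savings:.1%}")
```

Under these assumed numbers the tiered layout costs about 845 per month against 2300 for the all-hot baseline, roughly 63% less. The sketch ignores retrieval and transition fees, which can matter for cold data that is read more often than expected.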
Hot tier
Storage: SSD, NVMe, in-memory caches.
Latency: ms.
Cost: highest per GB.
Use case:
- Recent data (last 30 days).
- Frequently accessed.
- Active dashboards.
Examples: Postgres, ClickHouse hot tables, Redis cache, S3 Standard.
Warm tier
Storage: HDD, S3 Standard-IA.
Latency: seconds.
Cost: medium.
Use case:
- 1-12 months old.
- Less frequent queries.
- End-of-month reporting.
Cold tier
Storage: S3 Glacier, tape, archive.
Latency: minutes-hours retrieval.
Cost: lowest, roughly 80-90% cheaper per GB than the hot tier.
Use case:
- Older than 1 year.
- Compliance retention (financial, regulatory).
- Disaster recovery.
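The age thresholds from the three tiers above can be sketched as a small classifier. This is only an illustration: the function name `tier_for` is hypothetical, "12 months" is approximated as 365 days, and a real policy would usually also consider access frequency, not age alone.

```python
from datetime import date


def tier_for(data_date: date, today: date) -> str:
    """Pick a storage tier from data age, using the thresholds in the text."""
    age_days = (today - data_date).days
    if age_days <= 30:
        return "hot"    # recent data, last 30 days
    if age_days <= 365:
        return "warm"   # roughly 1-12 months old
    return "cold"       # older than 1 year


today = date(2024, 6, 30)
print(tier_for(date(2024, 6, 15), today))  # 15 days old
print(tier_for(date(2024, 1, 1), today))   # about 6 months old
print(tier_for(date(2022, 1, 1), today))   # more than 2 years old
```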
Lifecycle policies
Auto-move data between tiers.
S3 lifecycle (the empty prefix filter applies the rule to every object in the bucket):
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ]
  }]
}
ClickHouse TTL:
ALTER TABLE events MODIFY TTL
    event_date + INTERVAL 30 DAY TO VOLUME 'warm',
    event_date + INTERVAL 90 DAY TO VOLUME 'cold',
    event_date + INTERVAL 365 DAY DELETE;
Iceberg: partition-level compaction and archival.
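One way to reason about the S3 lifecycle rule shown earlier in this section is as a lookup from object age to storage class. The sketch below models that lookup in plain Python. It does not call AWS, the function name `storage_class_at` is hypothetical, and it assumes objects start in `STANDARD`. In practice transitions are applied asynchronously, so an object may change class somewhat later than the exact day count.

```python
# Transitions copied from the lifecycle rule in the text: (age in days, target class).
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_at(age_days, transitions=TRANSITIONS, initial="STANDARD"):
    """Return the storage class an object would have reached at a given age."""
    current = initial
    # Walk the thresholds in ascending order and keep the last one reached.
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 45, 120, 400):
    print(age, storage_class_at(age))
```

The same threshold logic describes the ClickHouse TTL clause, with volumes in place of storage classes and a final delete step instead of a last transition.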
Related topics
- S3 and object storage for DE
- Table partitioning for DE
- ClickHouse MergeTree for DE
- Lakehouse: Iceberg and Delta for DE
- Preparing for the Data Engineer interview
FAQ
Is this official information?
No. The article is based on common industry storage practices.