Шпаргалка ClickHouse для аналитика
Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.
Зачем это знать
ClickHouse — de facto DB для analytics в Yandex, Avito, Ozon. Строит совершенно иначе чем Postgres. Специфика нужна знать.
На собесах в tech-companies на основе ClickHouse — ожидают. Много specific features.
Что делает ClickHouse специально
- Column-oriented
- Massive batch inserts
- Lightning-fast aggregations
- Poor for individual row lookups / updates
- Purpose-built для analytics
Table engines
MergeTree
Primary analytics engine.
CREATE TABLE events (
user_id UInt64,
event_type String,
TIMESTAMP DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, TIMESTAMP)
PARTITION BY toYYYYMM(TIMESTAMP);- ORDER BY — sort key (important для performance)
- PARTITION BY — partitioning strategy
ReplacingMergeTree
Dedup on merge.
SummingMergeTree
Auto-aggregate during merge.
AggregatingMergeTree
For materialized views с aggregations.
Specific functions
Aggregations
uniq(col)— approximate distinct countuniqExact(col)— exact (slower)uniqCombined(col)— hybrid
Quantiles
quantile(0.95)(col)— approximatequantileExact(0.95)(col)— exactquantileTDigest(0.95)(col)— fast approximate
Arrays
ClickHouse массивы — powerful.
-- Array aggregation
SELECT groupArray(value) FROM t;
-- Array manipulation
SELECT arrayJoin(['a', 'b', 'c']); -- разворачивает
-- Array filter
SELECT arrayFilter(x -> x > 10, [1, 20, 5, 30]); -- [20, 30]Dates
-- Current
now()
-- Truncate
toStartOfDay(dt)
toStartOfMonth(dt)
toStartOfHour(dt)
-- Diff
dateDiff('day', d1, d2)
dateDiff('hour', d1, d2)
-- Format
formatDateTime(dt, '%Y-%m-%d')Strings
splitByChar('/', url)— splitextractAll(str, pattern)— regex extractreplaceAll(str, pattern, replacement)lower()/upper()
JSON
-- Extract
JSONExtract(json, 'field', 'Int32')
JSONExtractString(json, 'field')
-- Has
JSONHas(json, 'field')Sampling
Unique ClickHouse feature:
CREATE TABLE events
ENGINE = MergeTree()
ORDER BY (user_id, ...)
SAMPLE BY user_id;
-- Queries c sampling
SELECT count() FROM events
SAMPLE 0.1; -- 10% sample — fasterApproximate results quickly.
IF / conditional aggregation
Efficient pivoting:
SELECT
sumIf(revenue, country = 'RU') AS revenue_ru,
sumIf(revenue, country = 'US') AS revenue_us,
countIf(event = 'click') AS clicks
FROM events
GROUP BY ...;Instead of CASE WHEN.
Window functions
Supported в recent versions:
SELECT
day,
revenue,
sum(revenue) OVER (ORDER BY day) AS cumulative
FROM daily;Newer addition.
Performance tips
1. ORDER BY matters
Sort key — основа performance. Choose carefully.
2. Partition wisely
Too small — too many parts. Too big — scan whole.
3. Pre-aggregate
Materialized views для common aggregations.
4. Avoid JOIN'ов больших таблиц
ClickHouse не лучший в JOINs. Denormalize.
5. Final
SELECT ... FINAL — use collapse/replace before read. Slow, но accurate.
Materialized views
Auto-aggregate:
CREATE MATERIALIZED VIEW daily_events
ENGINE = SummingMergeTree()
ORDER BY day
POPULATE
AS SELECT
toStartOfDay(TIMESTAMP) AS day,
count() AS events
FROM events
GROUP BY day;Updates as data inserts.
Array unrolling
Events often arrays:
-- Multiple rows per row
SELECT user_id, arrayJoin(tags) AS tag
FROM users;Opposite groupArray.
Common pitfalls
INSERT performance
Many small INSERTs — slow. Batch.
DELETE / UPDATE
Expensive. Avoid.
Row-level lookups
Not strength. Index к ORDER BY helps.
Complex JOINs
Limited. Denormalize или use distributed tables.
Specific useful
argMin / argMax
Get col where other minimum/maximum:
SELECT argMin(product_name, price) AS cheapest FROM products;topK
Most frequent values:
SELECT topK(10)(category) FROM products;arrayMap / arraySum
SELECT arraySum(arr) FROM t;
SELECT arrayMap(x -> x * 2, [1, 2, 3]); -- [2, 4, 6]Типы данных
Numeric
- Int8, Int16, Int32, Int64
- UInt8, UInt16, UInt32, UInt64
- Float32, Float64
- Decimal
String
- String (variable length)
- FixedString(N)
Date / Time
- Date, Date32
- DateTime, DateTime64
Arrays
- Array(Int32)
- Array(String)
- Array(Array(Int32))
Nullable
- Nullable(Int32) — allows NULL
Enum
Efficient categorical.
Client tools
- ClickHouse CLI (clickhouse-client)
- DBeaver с ClickHouse driver
- DataGrip support
- Python:
clickhouse-driver,clickhouse-connect
Practical workflow
Analyst daily
- Connect ClickHouse (Python / DBeaver)
- Explore data с
DESCRIBE,SAMPLE - Write aggregations (
GROUP BY) - Optimize (
ORDER BY, partition pruning) - Export results
На собесе
«Почему ClickHouse?» Column-oriented, fast aggregations, scales к TB.
«Когда не подходит?» Individual rows, много updates, complex JOINs.
«Особые функции?» uniq, quantile, sumIf, arrays, JSON, sampling.
«Materialized view?» Auto-computed, accelerates common queries.
Связанные темы
FAQ
ClickHouse managed?
Yandex Managed, ClickHouse Cloud available.
Learning curve?
Неделя для SQL-literate аналитика.
Replace Postgres?
Не для OLTP. Для OLAP — often yes.
Тренируйте SQL — откройте тренажёр с 1500+ вопросами для собесов.