Шпаргалка ClickHouse для аналитика

Карьерник — квиз-тренажёр в Telegram с 1500+ вопросами для собесов аналитика. SQL, Python, A/B, метрики. Бесплатно.

Зачем это знать

ClickHouse — de facto DB для analytics в Yandex, Avito, Ozon. Строит совершенно иначе чем Postgres. Специфика нужна знать.

На собесах в tech-companies на основе ClickHouse — ожидают. Много specific features.

Что делает ClickHouse специально

  • Column-oriented
  • Massive batch inserts
  • Lightning-fast aggregations
  • Poor for individual row lookups / updates
  • Purpose-built для analytics

Table engines

MergeTree

Primary analytics engine.

CREATE TABLE events (
    user_id UInt64,
    event_type String,
    TIMESTAMP DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, TIMESTAMP)
PARTITION BY toYYYYMM(TIMESTAMP);
  • ORDER BY — sort key (important для performance)
  • PARTITION BY — partitioning strategy

ReplacingMergeTree

Dedup on merge.

SummingMergeTree

Auto-aggregate during merge.

AggregatingMergeTree

For materialized views с aggregations.

Specific functions

Aggregations

  • uniq(col) — approximate distinct count
  • uniqExact(col) — exact (slower)
  • uniqCombined(col) — hybrid

Quantiles

  • quantile(0.95)(col) — approximate
  • quantileExact(0.95)(col) — exact
  • quantileTDigest(0.95)(col) — fast approximate

Arrays

ClickHouse массивы — powerful.

-- Array aggregation
SELECT groupArray(value) FROM t;

-- Array manipulation
SELECT arrayJoin(['a', 'b', 'c']);  -- разворачивает

-- Array filter
SELECT arrayFilter(x -> x > 10, [1, 20, 5, 30]);  -- [20, 30]

Dates

-- Current
now()

-- Truncate
toStartOfDay(dt)
toStartOfMonth(dt)
toStartOfHour(dt)

-- Diff
dateDiff('day', d1, d2)
dateDiff('hour', d1, d2)

-- Format
formatDateTime(dt, '%Y-%m-%d')

Strings

  • splitByChar('/', url) — split
  • extractAll(str, pattern) — regex extract
  • replaceAll(str, pattern, replacement)
  • lower() / upper()

JSON

-- Extract
JSONExtract(json, 'field', 'Int32')
JSONExtractString(json, 'field')

-- Has
JSONHas(json, 'field')

Sampling

Unique ClickHouse feature:

CREATE TABLE events
ENGINE = MergeTree()
ORDER BY (user_id, ...)
SAMPLE BY user_id;

-- Queries c sampling
SELECT count() FROM events
SAMPLE 0.1;  -- 10% sample — faster

Approximate results quickly.

IF / conditional aggregation

Efficient pivoting:

SELECT
    sumIf(revenue, country = 'RU') AS revenue_ru,
    sumIf(revenue, country = 'US') AS revenue_us,
    countIf(event = 'click') AS clicks
FROM events
GROUP BY ...;

Instead of CASE WHEN.

Window functions

Supported в recent versions:

SELECT
    day,
    revenue,
    sum(revenue) OVER (ORDER BY day) AS cumulative
FROM daily;

Newer addition.

Performance tips

1. ORDER BY matters

Sort key — основа performance. Choose carefully.

2. Partition wisely

Too small — too many parts. Too big — scan whole.

3. Pre-aggregate

Materialized views для common aggregations.

4. Avoid JOIN'ов больших таблиц

ClickHouse не лучший в JOINs. Denormalize.

5. Final

SELECT ... FINAL — use collapse/replace before read. Slow, но accurate.

Materialized views

Auto-aggregate:

CREATE MATERIALIZED VIEW daily_events
ENGINE = SummingMergeTree()
ORDER BY day
POPULATE
AS SELECT
    toStartOfDay(TIMESTAMP) AS day,
    count() AS events
FROM events
GROUP BY day;

Updates as data inserts.

Array unrolling

Events often arrays:

-- Multiple rows per row
SELECT user_id, arrayJoin(tags) AS tag
FROM users;

Opposite groupArray.

Common pitfalls

INSERT performance

Many small INSERTs — slow. Batch.

DELETE / UPDATE

Expensive. Avoid.

Row-level lookups

Not strength. Index к ORDER BY helps.

Complex JOINs

Limited. Denormalize или use distributed tables.

Specific useful

argMin / argMax

Get col where other minimum/maximum:

SELECT argMin(product_name, price) AS cheapest FROM products;

topK

Most frequent values:

SELECT topK(10)(category) FROM products;

arrayMap / arraySum

SELECT arraySum(arr) FROM t;
SELECT arrayMap(x -> x * 2, [1, 2, 3]);  -- [2, 4, 6]

Типы данных

Numeric

  • Int8, Int16, Int32, Int64
  • UInt8, UInt16, UInt32, UInt64
  • Float32, Float64
  • Decimal

String

  • String (variable length)
  • FixedString(N)

Date / Time

  • Date, Date32
  • DateTime, DateTime64

Arrays

  • Array(Int32)
  • Array(String)
  • Array(Array(Int32))

Nullable

  • Nullable(Int32) — allows NULL

Enum

Efficient categorical.

Client tools

  • ClickHouse CLI (clickhouse-client)
  • DBeaver с ClickHouse driver
  • DataGrip support
  • Python: clickhouse-driver, clickhouse-connect

Practical workflow

Analyst daily

  1. Connect ClickHouse (Python / DBeaver)
  2. Explore data с DESCRIBE, SAMPLE
  3. Write aggregations (GROUP BY)
  4. Optimize (ORDER BY, partition pruning)
  5. Export results

На собесе

«Почему ClickHouse?» Column-oriented, fast aggregations, scales к TB.

«Когда не подходит?» Individual rows, много updates, complex JOINs.

«Особые функции?» uniq, quantile, sumIf, arrays, JSON, sampling.

«Materialized view?» Auto-computed, accelerates common queries.

Связанные темы

FAQ

ClickHouse managed?

Yandex Managed, ClickHouse Cloud available.

Learning curve?

Неделя для SQL-literate аналитика.

Replace Postgres?

Не для OLTP. Для OLAP — often yes.


Тренируйте SQL — откройте тренажёр с 1500+ вопросами для собесов.