K-means for Analysts

Карьерник is a quiz trainer in Telegram with 1500+ analyst interview questions: SQL, Python, A/B tests, metrics. Free.

Why this matters

Segmenting users is a frequent analyst task: "power users vs casual", "price-sensitive vs premium", "new vs loyal". Clustering is the unsupervised method for it.

K-means is the most popular clustering algorithm. Knowing it is a must for any analyst working with segmentation.

What K-means is

Unsupervised learning: it finds K clusters in the data without pre-labeled examples.

Each cluster is a group of similar points.

The algorithm

  1. Choose K (the number of clusters)
  2. Randomly initialize K centers (centroids)
  3. Assign each point to the nearest centroid (Euclidean distance)
  4. Move each centroid to the mean of its cluster's points
  5. Repeat steps 3-4 until convergence
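
The loop above can be sketched from scratch in NumPy. This is a toy illustration of steps 2-5 only (the function name and the empty-cluster guard are mine); in practice use sklearn's KMeans, shown in the next section:

```python
import numpy as np

def kmeans_toy(X, k, n_iters=100, seed=42):
    """Toy K-means mirroring steps 2-5 above (illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        # (keep the old centroid if a cluster came up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```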

In Python

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centers = kmeans.cluster_centers_

Feature scaling is critical

K-means relies on distances, so features must be on comparable scales.

Without scaling, "income" (0-1M) dominates "age" (0-100).

Always standardize or normalize.
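
A quick sketch with made-up numbers showing why: before scaling, the large-range feature decides the distance almost single-handedly; after standardization both features contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical users: [age, income]
X = np.array([[30.0, 20_000.0],
              [31.0, 900_000.0],
              [65.0, 25_000.0]])

# Raw distances: income (0-1M range) swamps age (0-100 range)
d_income = np.linalg.norm(X[0] - X[1])  # these users differ mostly in income
d_age = np.linalg.norm(X[0] - X[2])     # these users differ mostly in age

# After standardization both differences matter about equally
X_scaled = StandardScaler().fit_transform(X)
d_income_s = np.linalg.norm(X_scaled[0] - X_scaled[1])
d_age_s = np.linalg.norm(X_scaled[0] - X_scaled[2])
```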

Choosing K

1. Elbow method

Plot inertia (the sum of squared distances to the nearest centroid) vs K.

The "elbow" is where the curve bends sharply.

import matplotlib.pyplot as plt

inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()

2. Silhouette score

A metric of clustering quality:

from sklearn.metrics import silhouette_score

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}: {score:.3f}")

Higher is better; the score ranges from −1 to 1.

3. Gap statistic

Compares clustering on the real data to clustering on random reference data. More complex, but theoretically sound.
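
A simplified sketch of the idea (the function name and defaults are mine; the full procedure from Tibshirani et al. also adds a standard-error rule for picking K): compare log-inertia on uniform noise in the data's bounding box with log-inertia on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=42):
    """Simplified gap statistic: mean log-inertia on uniform reference
    data minus log-inertia on the real data. A larger gap means this K
    explains more structure than it would on pure noise."""
    rng = np.random.default_rng(seed)
    inertia = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = []
    for _ in range(n_refs):
        # Reference dataset: uniform noise in the bounding box of X
        ref = rng.uniform(lo, hi, size=X.shape)
        ref_fit = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref)
        ref_logs.append(np.log(ref_fit.inertia_))
    return np.mean(ref_logs) - np.log(inertia)
```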

4. Business intuition

Often domain knowledge dictates K: five user segments? Ten customer types?

Limitations of K-means

  • Predefined K — you must choose it beforehand
  • Sensitive to initialization — use n_init=10 for multiple restarts
  • Assumes spherical clusters — performs poorly on elongated shapes
  • Sensitive to outliers — distant points pull the centroids

Alternatives

DBSCAN

Density-based. No predefined K; points in low-density regions are labeled as noise.

from sklearn.cluster import DBSCAN
clusterer = DBSCAN(eps=0.5, min_samples=5)

Works well for arbitrarily shaped clusters.
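
A minimal usage sketch on toy data (numbers are made up): unlike K-means, DBSCAN flags outliers instead of forcing them into a cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense groups plus one far-away point
X = np.vstack([
    rng.normal(0, 0.1, (20, 2)),
    rng.normal(5, 0.1, (20, 2)),
    [[100.0, 100.0]],
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Noise points get the label -1 instead of a cluster id
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```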

Hierarchical

Builds a tree of clusters (a dendrogram).

from sklearn.cluster import AgglomerativeClustering
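
A minimal sketch on toy data (the data is made up): cut the tree at a chosen number of clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(5, 0.1, (10, 2))])

# Ward linkage merges, at every step, the two clusters whose union
# causes the smallest increase in within-cluster variance
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```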

Gaussian Mixture Models

Probabilistic, with soft assignments: each point gets a probability of belonging to each component.

from sklearn.mixture import GaussianMixture
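
A sketch of the soft assignment on toy data (numbers are made up): predict gives hard labels like K-means, predict_proba gives per-component probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
hard = gmm.predict(X)        # hard labels, like K-means
soft = gmm.predict_proba(X)  # per-component membership probabilities
```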

Use cases

Customer segmentation

RFM features → K-means → cluster labels.

features = ['recency', 'frequency', 'monetary']
X = StandardScaler().fit_transform(df[features])
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['segment'] = kmeans.fit_predict(X)

Anomaly detection

A point far from all centroids is a likely anomaly.
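
A sketch of this idea (the data and the 99th-percentile threshold are my assumptions): compute each point's distance to its nearest centroid and flag the tail.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[15.0, 15.0]]])  # injected anomaly

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
# transform() returns the distance from every point to every centroid
dist_to_nearest = kmeans.transform(X).min(axis=1)
# Flag points unusually far from all centroids, e.g. above the 99th percentile
threshold = np.quantile(dist_to_nearest, 0.99)
anomalies = np.where(dist_to_nearest > threshold)[0]
```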

Dimensionality reduction

Cluster centers serve as a compressed representation of the data.

Image compression

Color quantization: K is the number of colors kept in the compressed image.
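
A sketch with a synthetic "image" (random pixels standing in for a real photo): cluster the pixels in RGB space, then replace each pixel with its centroid's color.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 50x50 RGB image standing in for a real one
img = rng.integers(0, 256, size=(50, 50, 3)).astype(float)

k = 8  # keep only 8 colors
pixels = img.reshape(-1, 3)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(pixels)
# Replace every pixel with the color of its centroid
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
```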

Interpreting the clusters

After clustering, profile each cluster:

for i in range(5):
    cluster_data = df[df['cluster'] == i]
    print(f"\nCluster {i} (n={len(cluster_data)}):")
    print(cluster_data.describe())

Look for the characteristic features of each segment.

Naming segments

"Cluster 3" is not a useful name. Name segments after the analysis:

  • "Power users" (high frequency, high monetary)
  • "At-risk" (high recency, low frequency)
  • "New users" (low recency, low frequency)
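
In code this is just a mapping from cluster id to a business-friendly name (the frame and names here are hypothetical, chosen after profiling each cluster):

```python
import pandas as pd

# Toy frame where K-means labels are already attached
df = pd.DataFrame({"user_id": [101, 102, 103], "cluster": [0, 2, 1]})

# Hypothetical mapping decided after inspecting each cluster's profile
segment_names = {0: "Power users", 1: "At-risk", 2: "New users"}
df["segment_name"] = df["cluster"].map(segment_names)
```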

Curse of dimensionality

With many features, distances become less and less meaningful: all points end up roughly equidistant.

Fixes:

  • Feature selection
  • PCA before K-means
  • Use domain features (not raw ones)
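
The "PCA before K-means" fix fits naturally into a sklearn pipeline (the data here is synthetic and the component count is an example, not a recommendation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))  # 50 raw features (synthetic stand-in)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),  # cluster in a 10-dimensional projection
    KMeans(n_clusters=5, n_init=10, random_state=42),
)
labels = pipe.fit_predict(X)
```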

Reproducibility

Random initialization means different runs can produce different clusters.

kmeans = KMeans(n_clusters=5, random_state=42)

Set random_state for reproducibility.

Validation

Internal metrics

Silhouette and inertia — computed from the data itself, no external labels needed.

External validation

If labels are available, compare the clusters against the ground truth (e.g. with the adjusted Rand index).

Business validation

Do segments behave differently? Actionable?

In the interview

"How does K-means work?" Iteratively: assign points, update centroids, repeat.

"How do you choose K?" Elbow method, silhouette score, business intuition.

"What are its assumptions?" Spherical clusters of similar size — they don't always hold.

"Alternatives?" DBSCAN, hierarchical clustering, GMM.

Common mistakes

No scaling

Unscaled features dominate the rest. Always scale.

A single run

Initialization is random. Use n_init=10 or more, plus a fixed random_state.

Outliers

They pull the centroids. Remove them before clustering, or use DBSCAN.

Overthinking K

Often 3-5 clusters are enough; twenty clusters are rarely actionable for the business.


FAQ

Are cluster sizes equal?

No. K-means doesn't guarantee equal sizes.

What about categorical data?

Use K-prototypes, or one-hot encoding (risky: it distorts distances).

Does an optimal K exist?

Strictly speaking, no. It depends on the purpose and the data.


Practice ML — open the trainer with 1500+ interview questions.