K-means for Analysts
Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews: SQL, Python, A/B tests, metrics. Free.
Why you need to know this
Segmenting users is a frequent analyst task: power users vs casual, price-sensitive vs premium, new vs loyal. Clustering is the unsupervised method for this.
K-means is the most popular clustering algorithm. Knowing it is a must for any analyst working with segmentation.
What is K-means
Unsupervised learning: the algorithm finds K clusters in the data without pre-labeled examples.
Each cluster is a group of similar points.
The algorithm
1. Choose K (the number of clusters)
2. Randomly initialize K centers (centroids)
3. Assign each point to its nearest center (Euclidean distance)
4. Move each centroid to the mean of its cluster's points
5. Repeat steps 3-4 until convergence
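The steps above can be sketched from scratch in NumPy (a minimal sketch for illustration, ignoring the empty-cluster edge case that a production implementation would handle):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs: the algorithm should recover them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

In practice you would use sklearn's `KMeans` (shown below), which adds smarter initialization (k-means++) and multiple restarts.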
In Python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_scaled)
labels = kmeans.labels_
centers = kmeans.cluster_centers_
Feature scaling is critical
K-means uses distances, so features must be on comparable scales.
Without scaling, income (0-1M) dominates age (0-100).
Always standardize or normalize.
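To see the dominance numerically, here is a toy sketch on synthetic data (the ranges are illustrative); the expected squared pairwise distance contributed by a feature is proportional to its variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical users: age in [18, 80], income in [20k, 200k]
X = np.column_stack([
    rng.uniform(18, 80, size=200),           # age
    rng.uniform(20_000, 200_000, size=200),  # income
])

def feature_share(X):
    # Expected squared pairwise distance per feature is proportional to its variance
    var = X.var(axis=0)
    return var / var.sum()

print(feature_share(X))                                  # income: ~100% of the distance
print(feature_share(StandardScaler().fit_transform(X)))  # 50/50 after scaling
```

On raw features, income accounts for essentially all of the distance, so K-means would cluster by income alone; after standardization both features contribute equally.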
Choosing K
1. Elbow method
Plot inertia (the sum of squared distances to centroids) against K.
The elbow is where the curve bends sharply.
import matplotlib.pyplot as plt
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 11), inertias)
2. Silhouette score
A metric of cluster quality:
from sklearn.metrics import silhouette_score
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}: {score:.3f}")
Higher is better.
3. Gap statistic
Compares the clustering to random reference data. Complex, but theoretically sound.
4. Business intuition
Often domain knowledge dictates K: 5 user segments? 10 customer types?
Drawbacks of K-means
- Predefined K: you must choose it beforehand
- Sensitive to initialization: use n_init=10 for multiple restarts
- Assumes spherical clusters: poor for elongated shapes
- Sensitive to outliers: distant points pull the centroid
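The outlier sensitivity is easy to see concretely: a centroid is just the arithmetic mean of its cluster, so one distant point can drag it far from the bulk. A toy sketch:

```python
import numpy as np

# A tight cluster of 20 points around (0, 0)...
rng = np.random.default_rng(0)
cluster = rng.normal(0, 0.1, size=(20, 2))

# ...plus a single outlier far away
with_outlier = np.vstack([cluster, [[50.0, 50.0]]])

# The centroid is the mean, so one distant point drags it off-center
print(cluster.mean(axis=0))       # near (0, 0)
print(with_outlier.mean(axis=0))  # pulled toward the outlier, roughly (2.4, 2.4)
```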
Alternatives
DBSCAN
Density-based. No predefined K.
from sklearn.cluster import DBSCAN
clusterer = DBSCAN(eps=0.5, min_samples=5)
Good for arbitrary shapes.
Hierarchical
Builds a tree of clusters.
from sklearn.cluster import AgglomerativeClustering
Gaussian Mixture Models
Probabilistic, soft assignment.
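The contrast with K-means' hard labels can be sketched on synthetic data: `GaussianMixture.predict_proba` returns a probability distribution over components for every point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two overlapping groups of 2-D points
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
probs = gmm.predict_proba(X)  # soft assignment: P(component | point), rows sum to 1

# Points near a component center are near-certain; boundary points split ~50/50
hard_labels = probs.argmax(axis=1)
```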
from sklearn.mixture import GaussianMixture
Use cases
Customer segmentation
RFM → K-means → cluster labels.
features = ['recency', 'frequency', 'monetary']
kmeans = KMeans(n_clusters=5)
df['segment'] = kmeans.fit_predict(df[features])
Anomaly detection
A point far from all centroids is an anomaly.
Dimensionality reduction
Cluster centers as a compressed representation.
Image compression
Color quantization (K = number of colors).
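A sketch of color quantization on a hypothetical image (random pixels stand in for real ones here; an actual image would be reshaped from (H, W, 3) to (H*W, 3) first):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
pixels = rng.integers(0, 256, size=(64 * 64, 3)).astype(float)  # stand-in for image pixels

# K = 16 colors: every pixel is replaced by its cluster's centroid color
kmeans = KMeans(n_clusters=16, random_state=42, n_init=10).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_]
```

The quantized image contains at most 16 distinct colors, one per centroid.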
Interpreting clusters
After clustering, analyze each cluster:
for i in range(5):
    cluster_data = df[df['cluster'] == i]
    print(f"\nCluster {i} (n={len(cluster_data)}):")
    print(cluster_data.describe())
Find the characteristic features of each segment.
Naming segments
"Cluster 3" is not useful. Name segments after analysis:
- "Power users" (high frequency, high monetary)
- "At-risk" (high recency, low frequency)
- "New users" (low recency, low frequency)
Curse of dimensionality
With many features, distances become meaningless.
Fixes:
- Feature selection
- PCA before K-means
- Use domain features (not raw ones)
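The PCA fix can be sketched as an sklearn pipeline (the data and dimensions here are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 40))  # 40 raw features: too many for meaningful distances

# Scale -> project to 5 components -> cluster in the reduced space
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5, random_state=42),
    KMeans(n_clusters=4, random_state=42, n_init=10),
)
labels = pipe.fit_predict(X)
```

Putting all three steps in one pipeline keeps the scaling and projection consistent between fitting and later predictions.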
Reproducibility
Random initialization means different runs give different clusters.
kmeans = KMeans(n_clusters=5, random_state=42)
Set random_state for reproducibility.
Validation
Internal metrics
Silhouette, inertia — model-based.
External validation
If labels are available, compare the clustering to the ground truth.
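With ground-truth labels, `adjusted_rand_score` from sklearn gives a label-permutation-invariant comparison (a sketch on synthetic blobs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated blobs with known labels
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
true_labels = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# ARI: 1.0 = perfect match, ~0 = random assignment
print(adjusted_rand_score(true_labels, pred))
```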
Business validation
Do the segments behave differently? Are they actionable?
In the interview
"How does K-means work?" Iteratively: assign points, update centroids, repeat.
"How do you choose K?" Elbow method, silhouette score, business intuition.
"What are its assumptions?" Spherical clusters of similar size. This does not always hold.
"Alternatives?" DBSCAN, hierarchical clustering, GMM.
Common mistakes
No scaling
Unscaled features dominate the others. Always scale.
One run
Random initialization varies between runs. Use n_init=10 or more, or a fixed random_state.
Outliers
Outliers pull the centroids. Remove them before clustering, or use DBSCAN.
Overthinking K
Sometimes 3-5 clusters are adequate; 20 clusters are rarely useful for the business.
FAQ
Are cluster sizes equal?
No. K-means does not guarantee equal sizes.
What about categorical data?
Use K-prototypes, or one-hot encoding (risky).
Does an optimal K exist?
Technically, no. It depends on the purpose and the data.
Practice ML: open the trainer with 1500+ interview questions.