PCA for Analysts

Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews. SQL, Python, A/B tests, metrics. Free.

Why you need this

You have 100 features. Can you reduce them to 5-10 that capture most of the variance? That's PCA. It's used for visualization (high-dimensional → 2D), for speeding up ML, and for removing noise. PCA comes up in ML questions at analyst interviews.

What PCA is

Principal Component Analysis is a dimensionality reduction technique.

It takes correlated features and produces uncorrelated "principal components", ranked by the variance they explain.

Intuition

Imagine 2D data with a strong diagonal pattern.

X and Y are correlated. You can find the direction along which variance is maximal (the diagonal). The second direction is perpendicular to it and carries less variance.

Projecting onto the first direction captures most of the information. That's PCA.

The math (simplified)

  1. Center the data: subtract the mean
  2. Compute the covariance matrix
  3. Eigendecomposition → eigenvectors (directions) + eigenvalues (variance)
  4. Rank eigenvectors by eigenvalue
  5. Project the data onto the top-K eigenvectors

Result: K principal components.
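
A minimal NumPy sketch of these five steps (toy data; all names are illustrative):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))             # toy data: 200 rows, 5 features

# 1. Center: subtract the column means
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (features in columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition (eigh is for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, descending
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# 5. Project onto the top-K eigenvectors
K = 2
X_pca = X_centered @ eigenvectors[:, :K]
print(X_pca.shape)                        # (200, 2)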

In Python

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale (important!)
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Variance explained
print(pca.explained_variance_ratio_)
# [0.65, 0.22]  — PC1 explains 65%, PC2 22%

Feature scaling is critical

Without scaling, features with larger variance dominate the components. Always standardize first.
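
A quick demonstration on synthetic data (the scales and seed are arbitrary):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 500),        # feature on a unit scale
    rng.normal(0, 1000, 500),     # feature on a much larger scale
])

print(PCA().fit(X).explained_variance_ratio_)
# ~[1.00, 0.00] — the large-scale feature swallows PC1

X_scaled = StandardScaler().fit_transform(X)
print(PCA().fit(X_scaled).explained_variance_ratio_)
# ~[0.5, 0.5] after standardization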

Choosing the number of components

Explained variance

Plot cumulative explained variance against the number of components:

import numpy as np, matplotlib.pyplot as plt

pca_full = PCA().fit(X_scaled)   # fit with all components first
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative)
# Choose K where the curve flattens (e.g., 90% variance)

Rule of thumb: keep 80-95% of the variance.
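
sklearn can also pick K for you: pass a float between 0 and 1, and PCA keeps just enough components to reach that share of variance:

pca = PCA(n_components=0.90)   # enough components for 90% of the variance
X_pca = pca.fit_transform(X_scaled)
print(pca.n_components_)       # the K that was actually selected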

Scree plot

Plot eigenvalues against the component index and look for the "elbow".
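
In sklearn the eigenvalues live in explained_variance_; a minimal scree plot (pca_full from the snippet above):

plt.plot(range(1, len(pca_full.explained_variance_) + 1),
         pca_full.explained_variance_, 'o-')
plt.xlabel('Component')
plt.ylabel('Eigenvalue')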

Kaiser rule

Keep components with eigenvalue > 1 (meaningful when features are standardized, i.e., PCA on the correlation matrix).
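
With standardized data this is a one-liner (again using pca_full):

n_keep = int(np.sum(pca_full.explained_variance_ > 1))
print(n_keep)   # number of components passing the Kaiser rule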

Use cases

1. Visualization

High-dimensional data → 2D / 3D for exploratory analysis.

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)  # labels: e.g., a target column

2. Speed up ML

100 features → 10 PCs → faster training.
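
A common pattern is to put PCA inside a pipeline so that scaling and projection are fitted together; a sketch, where the classifier and the X_train / y_train split are illustrative assumptions:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),
                      LogisticRegression())
model.fit(X_train, y_train)   # assumes an existing train/test split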

3. Remove multicollinearity

PCs are uncorrelated by construction, which fixes multicollinearity in regression.
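
Easy to check: the correlation matrix of the components is the identity up to numerical noise:

print(np.corrcoef(X_pca, rowvar=False).round(2))
# off-diagonal entries are ~0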

4. Noise reduction

The last components are often noise. Drop them for a cleaner signal.
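
A sketch of denoising: project onto the top components, then map back to the original space (the 10 is arbitrary):

pca = PCA(n_components=10)     # keep only the strong components
X_denoised = pca.inverse_transform(pca.fit_transform(X_scaled))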

5. Image / signal compression

Reduce dimensionality → smaller storage.

Interpretation

Each component is a linear combination of the original features.

# Loadings: rows are components, columns are original features (X is a DataFrame)
import pandas as pd
print(pd.DataFrame(pca.components_, columns=X.columns))

You might see something like: "PC1 is mostly about age and income".

Problems

Loss of interpretability

Original features have meaning. PCs are abstract combinations.

Linear only

PCA captures only linear relationships. For non-linear structure, use Kernel PCA or UMAP / t-SNE.

Sensitive to scaling

You must scale the features first.

Not for categorical data

Designed for continuous features. For categorical data, use MCA (multiple correspondence analysis).

Alternatives

t-SNE

Non-linear dimensionality reduction. Good for visualization.

UMAP

Modern and faster than t-SNE. Preserves both local and global structure.

Autoencoders

Neural-network based. Flexible.

Feature selection

Keep a subset of the original features (interpretable).

Kernel PCA

Non-linear version:

from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(X_scaled)

Good for non-linear separation.

Sparse PCA

The components are sparse (most loadings are 0), which makes them more interpretable.

from sklearn.decomposition import SparsePCA
spca = SparsePCA(n_components=2)

Incremental PCA

For datasets too big for memory:

from sklearn.decomposition import IncrementalPCA
pca = IncrementalPCA(n_components=5, batch_size=1000)
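
A sketch of streaming the fit with partial_fit; chunking an in-memory array here stands in for reading chunks from disk:

for chunk in np.array_split(X_scaled, 10):
    pca.partial_fit(chunk)
X_pca = pca.transform(X_scaled)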

Sanity checks

Reconstruction error

X_reconstructed = pca.inverse_transform(X_pca)
error = np.mean((X_scaled - X_reconstructed) ** 2)

A small error means a good approximation.

Preserving structure

Plot two original features colored by the target, then plot X_pca the same way. If the clusters are preserved, the projection is good.

At the interview

"What does PCA do?" Dimensionality reduction: finding the directions of maximum variance.

"Preprocessing?" Standardize the features.

"How do you choose K?" Cumulative variance (80-95%), scree plot, business need.

"Alternatives?" t-SNE and UMAP for non-linear structure; feature selection when interpretability matters.

Common mistakes

No scaling

The features with the largest variance dominate.

Overuse

Not every problem needs PCA. Interpretability often matters more.

Interpreting a PC as a feature

PCs are combinations. Don't interpret them as semantic features.

Fitting before the train/test split

Fit PCA only on the train set, then transform the test set. Otherwise you get data leakage.
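
The correct pattern, assuming X_train_scaled / X_test_scaled already exist (the scaler, too, must be fitted on train only):

pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)   # fit on train only
X_test_pca = pca.transform(X_test_scaled)         # reuse the fitted transform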

FAQ

PCA or feature selection?

Selection keeps the original features and stays interpretable. PCA combines them and is less interpretable.

For A/B testing?

Not a common use case. PCA is ML preprocessing.

Categorical data?

Convert to dummy variables first (risky) or use MCA.


Practice ML: open the trainer with 1500+ interview questions.