PCA for Analysts
Why this matters
You have 100 features → can you reduce them to 5-10 that capture most of the variance? That is PCA. It is used for visualization (high-dim → 2D), for speeding up ML, and for removing noise. PCA also comes up in ML interviews for analyst roles.
What PCA is
Principal Component Analysis is a dimensionality reduction technique.
It takes correlated features and produces uncorrelated «principal components», ranked by the variance they explain.
Intuition
Imagine 2D data with a strong diagonal pattern.
X and Y are correlated. You can find the direction along which variance is maximal (the diagonal). The second direction is perpendicular to it, with less variance.
Projecting onto the first direction captures most of the information. That is PCA.
The math (simplified)
- Center the data: subtract the mean
- Compute the covariance matrix
- Eigendecomposition → eigenvectors (directions) + eigenvalues (variances)
- Rank the eigenvectors by their eigenvalues
- Project the data onto the top-K eigenvectors
Result: K principal components.
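These five steps fit in a few lines of NumPy. A minimal from-scratch sketch (the random matrix X here is an assumption, just to make it runnable):
import numpy as np
X = np.random.rand(200, 5)                # toy data: 200 rows, 5 features
X_centered = X - X.mean(axis=0)           # 1. center: subtract the mean
cov = np.cov(X_centered, rowvar=False)    # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigendecomposition (eigh: covariance is symmetric)
order = np.argsort(eigvals)[::-1]         # 4. rank directions by eigenvalue, descending
eigvecs = eigvecs[:, order]
X_pca = X_centered @ eigvecs[:, :2]       # 5. project onto the top-2 eigenvectors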
In Python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Scale (this matters!)
X_scaled = StandardScaler().fit_transform(X)
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Variance explained
print(pca.explained_variance_ratio_)
# [0.65, 0.22] — PC1 explains 65%, PC2 22%
Feature scaling is critical
Without scaling, the features with the largest variance dominate. Always standardize.
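A toy illustration of why (the synthetic data is an assumption): give one feature a 100x larger scale and it takes over PC1 almost entirely.
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 500),      # unit-scale feature
                     rng.normal(0, 100, 500)])   # same shape, 100x the scale
print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# ≈ [0.9999, 0.0001] — PC1 is just the large-scale feature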
Choosing the number of components
Explained variance
Plot cumulative explained variance vs. the number of components:
import numpy as np
import matplotlib.pyplot as plt
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumulative)
# Choose K where the curve flattens (e.g., 90% variance)
Rule of thumb: keep 80-95% of the variance.
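sklearn can also pick K for you: pass a float in (0, 1) as n_components and PCA keeps just enough components to explain that fraction of the variance.
pca = PCA(n_components=0.90)         # keep enough PCs for 90% of variance
X_pca = pca.fit_transform(X_scaled)
print(pca.n_components_)             # how many components were actually kept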
Scree plot
Plot eigenvalues vs. component index. Find the «elbow».
Kaiser rule
Keep components with eigenvalue > 1.
Use cases
1. Visualization
High-dim data → 2D / 3D for exploratory analysis.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
2. Speed up ML
100 features → 10 PCs → faster training.
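A sketch of this with an sklearn Pipeline (the LogisticRegression model and the X_train / y_train names are assumptions; any estimator works):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# scale → compress to 10 PCs → fit the model, as one estimator
model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression())
model.fit(X_train, y_train)
A bonus: inside cross-validation the pipeline refits the scaler and PCA on each training fold, which avoids leakage.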
3. Remove multicollinearity
PCs are uncorrelated by construction. This fixes multicollinearity in regression.
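Easy to check on X_pca from the snippet above: the correlation matrix of the PCs is the identity, up to rounding.
import numpy as np
print(np.corrcoef(X_pca, rowvar=False).round(2))  # ≈ identity matrix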
4. Noise reduction
The last components are often noise. Drop them → cleaner signal.
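A common sketch of this (n_components=10 is an arbitrary choice): project onto the top components and map back with inverse_transform; whatever the dropped components carried is gone.
pca = PCA(n_components=10)           # keep only the strong components
X_denoised = pca.inverse_transform(pca.fit_transform(X_scaled))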
5. Image / signal compression
Reduce dimensionality → smaller storage.
Interpretation
Components are linear combinations of the original features.
# Loadings (rows = PCs, columns = original features; X must be a DataFrame)
import pandas as pd
pd.DataFrame(pca.components_, columns=X.columns)
You can see, for example: «PC1 is mostly about age and income».
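To read a component, it helps to sort its loadings by absolute value (a sketch, assuming the fitted 2-component PCA from above and a DataFrame X):
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=['PC1', 'PC2'])
print(loadings['PC1'].abs().sort_values(ascending=False).head())  # top drivers of PC1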
Problems
Loss of interpretability
The original features have meaning. PCs are abstract combinations.
Linear only
PCA captures linear relationships. For non-linear structure, use Kernel PCA or UMAP / t-SNE.
Sensitive to scaling
Must scale features.
Not for categorical data
Designed for continuous data. For categorical, use MCA (multiple correspondence analysis).
Alternatives
t-SNE
Non-linear dimensionality reduction. Good for visualization.
UMAP
Modern and faster than t-SNE. Preserves both local and global structure.
Autoencoders
Neural-network based. Flexible.
Feature selection
Keep a subset of the original features (interpretable).
Kernel PCA
Non-linear version:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(X_scaled)
Good for non-linear separation.
Sparse PCA
PCs are sparse (most loadings = 0). More interpretable.
from sklearn.decomposition import SparsePCA
Incremental PCA
For datasets too big for memory:
from sklearn.decomposition import IncrementalPCA
pca = IncrementalPCA(n_components=5, batch_size=1000)
Validation
Reconstruction error
import numpy as np
X_reconstructed = pca.inverse_transform(X_pca)
error = np.mean((X_scaled - X_reconstructed) ** 2)
A small error → good approximation.
Preserving structure
Plot the original X colored by target, then plot X_pca the same way. If the clusters are preserved — good.
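A sketch of that check (the target vector y and the choice of the first two raw features are assumptions):
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y)   # two of the original features
ax1.set_title('Original features')
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y)         # first two PCs
ax2.set_title('PCA projection')
plt.show()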
In an interview
«What does PCA do?» Dimensionality reduction: finding the directions of maximum variance.
«Preprocessing?» Standardize features.
«How do you choose K?» Cumulative variance (80-95%), scree plot, business needs.
«Alternatives?» t-SNE, UMAP — non-linear. Feature selection — interpretable.
Common mistakes
No scaling
The largest-variance features dominate.
Overuse
Not every problem needs PCA. Interpretability often matters more.
Interpreting a PC as a feature
PCs are combinations. Don't interpret them as semantic features.
Apply after the train/test split
Fit PCA only on train, then transform test. Otherwise — data leakage.
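A sketch of the leak-free order (the *_scaled names are assumptions; the scaler itself must also be fit on train only):
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)  # fit on train only
X_test_pca = pca.transform(X_test_scaled)        # reuse the same components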
FAQ
PCA or feature selection?
Selection keeps the original features — interpretable. PCA combines them — less interpretable.
For A/B testing?
Not a common use case. PCA is ML preprocessing.
Categorical data?
Convert to dummy variables first (risky) or use MCA.