Python Cheat Sheet for Analysts
Карьерник is a Telegram quiz trainer with 1500+ questions for analyst interviews. SQL, Python, A/B tests, metrics. Free.
Why you need this
Python is the second most important tool after SQL for an analyst, and this cheat sheet is a daily reference for the calls you actually use.
pandas basics
Read
import pandas as pd
df = pd.read_csv('file.csv')
df = pd.read_excel('file.xlsx')
df = pd.read_parquet('file.parquet')
df = pd.read_sql('SELECT ...', conn)
Write
df.to_csv('out.csv', index=False)
df.to_parquet('out.parquet')
df.to_excel('out.xlsx')
Inspect
df.head(10)
df.shape # (rows, cols)
df.info() # types, nulls
df.describe() # stats
df.columns
df.dtypes
Select
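To make the selection idioms in this section concrete, a minimal runnable sketch on an invented frame (column names `a`, `b`, `c` are placeholders):

```python
import pandas as pd

# Toy frame purely for illustration
df = pd.DataFrame({'a': [5, 15, 25], 'b': ['x', 'y', 'z'], 'c': [1.0, 2.0, 3.0]})

single = df['a']                             # one column -> Series
pair = df[['a', 'b']]                        # list of columns -> DataFrame
filtered = df.loc[df['a'] > 10, ['b', 'c']]  # row filter + column selection
queried = df.query('a > 10')                 # same row filter, query syntax
```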
Columns
df['col'] # Single column (Series)
df[['a', 'b']] # Multiple (DataFrame)
df.col # Same as df['col'] (only for names that are valid Python identifiers)
Rows
df.iloc[0] # First row
df.iloc[0:5] # First 5 rows
df.loc[df['a'] > 10] # Boolean index
df.query('a > 10') # Query syntax
Combined
df.loc[df['a'] > 10, ['b', 'c']] # Filter rows + select cols
df.iloc[0:5, 0:3] # Position-based
Filter
df[df['age'] > 30]
df[df['country'].isin(['RU', 'US'])]
df[df['name'].str.contains('Ivan')]
df[df['email'].str.endswith('.com')]
df[df['date'] > '2026-01-01']
Groupby / aggregate
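A runnable sketch of the groupby patterns in this section, on an invented revenue table:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['RU', 'RU', 'US'],
    'revenue': [100, 200, 50],
    'orders':  [1, 2, 1],
})

sizes = df.groupby('country').size()           # row count per group
rev = df.groupby('country')['revenue'].sum()   # one aggregated column
agg = df.groupby('country').agg({'revenue': 'sum', 'orders': 'count'})
```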
df.groupby('country').size()
df.groupby('country')['revenue'].sum()
df.groupby('country').agg({'revenue': 'sum', 'orders': 'count'})
df.groupby(['country', 'product']).mean(numeric_only=True) # pandas 2 raises on non-numeric cols without this
Multiple aggs
df.groupby('country').agg({
'revenue': ['sum', 'mean', 'max'],
'orders': 'count'
})
Join / merge
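A sketch of the merge patterns in this section, with two tiny made-up frames:

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cid']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'amount': [10, 20, 5]})

inner = pd.merge(users, orders, on='user_id')             # only matching keys
left = pd.merge(users, orders, on='user_id', how='left')  # keep all users, NaN where no order
```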
pd.merge(df1, df2, on='user_id')
pd.merge(df1, df2, on='user_id', how='left')
pd.merge(df1, df2, left_on='id', right_on='user_id')
# Concat
pd.concat([df1, df2], axis=0) # rows
pd.concat([df1, df2], axis=1) # cols
Pivot
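A runnable sketch of the pivot pattern in this section (data invented; cells with no matching rows come out as NaN):

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['RU', 'RU', 'US', 'US'],
    'product': ['a', 'b', 'a', 'a'],
    'revenue': [10, 20, 5, 15],
})

# country per row, product per column, summed revenue in the cells
pivot = df.pivot_table(values='revenue', index='country',
                       columns='product', aggfunc='sum')
```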
df.pivot_table(
values='revenue',
index='country',
columns='product',
aggfunc='sum'
)
Reshape
df.melt(id_vars=['date'], value_vars=['a', 'b']) # wide → long
String operations
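The `.str` accessor calls in this section, wired into a short sketch on invented names:

```python
import pandas as pd

s = pd.Series(['ivan-petrov', 'anna-smith'])

upper = s.str.upper()
has_ivan = s.str.contains('ivan')       # boolean mask
parts = s.str.split('-', expand=True)   # one column per split part
```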
df['col'].str.upper()
df['col'].str.lower()
df['col'].str.contains('pattern')
df['col'].str.replace('a', 'b')
df['col'].str.split('-', expand=True)
df['col'].str.len()
Dates
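A runnable sketch of the datetime idioms in this section (dates are invented):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2026-01-15', '2026-02-01']})
df['date'] = pd.to_datetime(df['date'])   # strings -> datetime64

df['year'] = df['date'].dt.year
df['weekday'] = df['date'].dt.weekday     # Monday = 0
delta = df['date'].iloc[1] - df['date'].iloc[0]   # Timedelta
```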
df['date'] = pd.to_datetime(df['date'])
df['date'].dt.year
df['date'].dt.month
df['date'].dt.day
df['date'].dt.weekday
df['date'].dt.hour
# Subtract
df['date2'] - df['date1'] # Timedelta
Apply / transform
df['new'] = df['col'].apply(lambda x: x * 2)
df['new'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
# Faster: vectorize
df['new'] = df['a'] + df['b']
Handle missing
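A sketch of the missing-data idioms in this section, on a tiny invented frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 'x', 'y']})

null_counts = df.isnull().sum()           # nulls per column
dropped = df.dropna(subset=['a'])         # drop rows where a is null
filled = df['a'].fillna(df['a'].mean())   # impute with the column mean
```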
df.isnull().sum() # Count nulls
df.dropna() # Drop rows with nulls
df.dropna(subset=['a']) # Drop only rows where a is null
df.fillna(0) # Replace nulls with 0
df['a'].fillna(df['a'].mean()) # Impute with mean
Sort
df.sort_values('col')
df.sort_values('col', ascending=False)
df.sort_values(['a', 'b'], ascending=[True, False])
Unique / duplicates
df['col'].unique()
df['col'].nunique()
df.duplicated() # Boolean
df.drop_duplicates()
df.drop_duplicates(subset=['a'])
Statistics
df['col'].mean()
df['col'].median()
df['col'].std()
df['col'].var()
df['col'].min()
df['col'].max()
df['col'].quantile(0.95)
df['col'].count()
df.corr(numeric_only=True) # Correlation matrix (pandas 2 raises on non-numeric cols without this)
Value counts
df['col'].value_counts()
df['col'].value_counts(normalize=True) # Proportions
Conditional
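The conditional-column patterns in this section, as one runnable sketch (thresholds and labels invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [5, 50, 500]})

# Two-way branch, vectorized
df['flag'] = np.where(df['col'] > 10, 'high', 'low')

# Multi-way branch: conditions are checked in order, first match wins
conditions = [df['col'] > 100, df['col'] > 10]
choices = ['high', 'medium']
df['tier'] = np.select(conditions, choices, default='low')
```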
df['new'] = df['col'].apply(lambda x: 'high' if x > 10 else 'low')
# Vectorized
df['new'] = np.where(df['col'] > 10, 'high', 'low')
# Multiple conditions
conditions = [df['col'] > 100, df['col'] > 10]
choices = ['high', 'medium']
df['new'] = np.select(conditions, choices, default='low')
numpy basics
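A few of the numpy calls in this section, wired into a runnable sketch:

```python
import numpy as np

arr = np.arange(10)         # integers 0..9
grid = np.ones((3, 3))      # 3x3 matrix of ones
pts = np.linspace(0, 1, 5)  # 5 evenly spaced points from 0 to 1

total = arr.sum()
avg = arr.mean()
```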
import numpy as np
np.array([1, 2, 3])
np.zeros(10)
np.ones((3, 3))
np.arange(10)
np.linspace(0, 1, 100)
# Ops
arr.sum(), arr.mean(), arr.std()
# Random
np.random.rand(10)
np.random.randn(10)
np.random.choice(arr, 5)
matplotlib basics
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.title('Title')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
# Subplots
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(...)
seaborn
import seaborn as sns
sns.set_theme(style='whitegrid')
sns.lineplot(data=df, x='date', y='value')
sns.barplot(data=df, x='cat', y='count')
sns.heatmap(df.corr(), annot=True)
sns.boxplot(data=df, x='group', y='value')
Regex
import re
re.search(r'\d+', text)
re.findall(r'\w+', text)
re.sub(r'\d+', 'N', text)
# In pandas
df['col'].str.extract(r'(\d+)')
Statistics functions
from scipy import stats
stats.ttest_ind(a, b)
stats.chi2_contingency(table)
stats.pearsonr(x, y)
stats.norm.cdf(value, loc, scale)
Useful scipy / sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
Common patterns
Time series rolling
df['ma7'] = df['value'].rolling(7).mean()
df['ema'] = df['value'].ewm(span=7).mean()
Percentile binning
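A runnable sketch of the `qcut` binning in this section, on an invented 1..100 column where the deciles come out exactly equal-sized:

```python
import pandas as pd

df = pd.DataFrame({'value': range(1, 101)})   # 1..100

df['decile'] = pd.qcut(df['value'], 10, labels=False)   # bin index 0..9
df['quartile'] = pd.qcut(df['value'], 4,
                         labels=['Q1', 'Q2', 'Q3', 'Q4'])
```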
df['decile'] = pd.qcut(df['value'], 10, labels=False)
df['quartile'] = pd.qcut(df['value'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
Row-level operations
df['total'] = df[['a', 'b', 'c']].sum(axis=1)
df['max'] = df[['a', 'b', 'c']].max(axis=1)
Shift / lag
df['prev'] = df['col'].shift(1)
df['diff'] = df['col'] - df['col'].shift(1)
Rank
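A small sketch of the rank idioms in this section (values invented):

```python
import pandas as pd

s = pd.Series([30, 10, 20])

rank_desc = s.rank(ascending=False)   # largest value gets rank 1.0
pct = s.rank(pct=True)                # rank as a fraction of the length
```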
df['rank'] = df['col'].rank(ascending=False)
df['pct_rank'] = df['col'].rank(pct=True)
Performance tips
- Vectorize (avoid loops)
- Use .loc, not chained indexing
- Categorical dtype for low-cardinality strings
- Downcast integers
- Read only needed columns
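The categorical tip above is easy to demonstrate; a sketch with an invented column (exact savings depend on the data):

```python
import pandas as pd

# A repetitive string column: 3 distinct values repeated 100k times
df = pd.DataFrame({'country': ['RU', 'US', 'DE'] * 100_000})

before = df['country'].memory_usage(deep=True)
df['country'] = df['country'].astype('category')
after = df['country'].memory_usage(deep=True)
# category stores small integer codes plus one lookup table,
# so memory drops sharply for low-cardinality columns
```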
Useful imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
FAQ
pandas 1 vs 2?
pandas 2.0+ brings better performance and some changed defaults; most 1.x code migrates with minor changes.
Polars or pandas?
Polars is faster on large data, but pandas is still the industry standard.
Jupyter notebooks?
Useful for exploration; VS Code supports notebooks natively.
Practice Python: open the trainer with 1500+ interview questions.