Python cheatsheet for analysts

Карьерник is a Telegram quiz trainer with 1500+ analyst interview questions: SQL, Python, A/B tests, metrics. Free.

Why this matters

Python is the second most important analyst skill after SQL. This cheatsheet collects the snippets you reach for daily.

pandas basics

Read

import pandas as pd

df = pd.read_csv('file.csv')
df = pd.read_excel('file.xlsx')
df = pd.read_parquet('file.parquet')
df = pd.read_sql('SELECT ...', conn)

Write

df.to_csv('out.csv', index=False)
df.to_parquet('out.parquet')
df.to_excel('out.xlsx')

Inspect

df.head(10)
df.shape          # (rows, cols)
df.info()         # types, nulls
df.describe()     # stats
df.columns
df.dtypes

Select

Columns

df['col']              # Single column (Series)
df[['a', 'b']]         # Multiple (DataFrame)
df.col                 # Same as df['col'], but only for valid names that don't clash with DataFrame methods

Rows

df.iloc[0]             # First row
df.iloc[0:5]           # First 5 rows
df.loc[df['a'] > 10]   # Boolean index
df.query('a > 10')     # Query syntax

Combined

df.loc[df['a'] > 10, ['b', 'c']]  # Filter rows + select cols
df.iloc[0:5, 0:3]                  # Position-based

Filter

df[df['age'] > 30]
df[df['country'].isin(['RU', 'US'])]
df[df['name'].str.contains('Ivan')]
df[df['email'].str.endswith('.com')]
df[df['date'] > '2026-01-01']

Groupby / aggregate

df.groupby('country').size()
df.groupby('country')['revenue'].sum()
df.groupby('country').agg({'revenue': 'sum', 'orders': 'count'})
df.groupby(['country', 'product']).mean(numeric_only=True)

Multiple aggs

df.groupby('country').agg({
    'revenue': ['sum', 'mean', 'max'],
    'orders': 'count'
})
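
The dict-of-lists form above returns MultiIndex columns. Named aggregation (pandas 0.25+) gives flat, explicit column names instead; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['RU', 'RU', 'US'],
    'revenue': [100, 200, 50],
    'orders': [1, 2, 1],
})

# Named aggregation: output_column=(input_column, aggfunc)
out = df.groupby('country').agg(
    revenue_sum=('revenue', 'sum'),
    revenue_mean=('revenue', 'mean'),
    orders_cnt=('orders', 'count'),
)
print(out)
```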

Join / merge

pd.merge(df1, df2, on='user_id')
pd.merge(df1, df2, on='user_id', how='left')
pd.merge(df1, df2, left_on='id', right_on='user_id')

# Concat
pd.concat([df1, df2], axis=0)  # rows
pd.concat([df1, df2], axis=1)  # cols

Pivot

df.pivot_table(
    values='revenue',
    index='country',
    columns='product',
    aggfunc='sum'
)

Reshape

df.melt(id_vars=['date'], value_vars=['a', 'b'])  # wide → long

String operations

df['col'].str.upper()
df['col'].str.lower()
df['col'].str.contains('pattern')
df['col'].str.replace('a', 'b')
df['col'].str.split('-', expand=True)
df['col'].str.len()

Dates

df['date'] = pd.to_datetime(df['date'])

df['date'].dt.year
df['date'].dt.month
df['date'].dt.day
df['date'].dt.weekday
df['date'].dt.hour

# Subtract
df['date2'] - df['date1']  # Timedelta
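
The subtraction yields a Timedelta; `.dt.days` (or `.dt.total_seconds()`) turns it into a plain number you can aggregate. A small self-contained example with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    'signup': pd.to_datetime(['2026-01-01', '2026-01-10']),
    'first_order': pd.to_datetime(['2026-01-03', '2026-01-15']),
})

# Timedelta -> integer days
df['days_to_order'] = (df['first_order'] - df['signup']).dt.days
print(df['days_to_order'].tolist())  # [2, 5]
```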

Apply / transform

df['new'] = df['col'].apply(lambda x: x * 2)
df['new'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

# Faster: vectorize
df['new'] = df['a'] + df['b']

Handle missing

df.isnull().sum()       # Count nulls
df.dropna()             # Drop rows with nulls
df.dropna(subset=['a']) # Only if 'a' is null
df.fillna(0)            # Replace with 0
df['a'] = df['a'].fillna(df['a'].mean())  # Impute mean (assign back!)

Sort

df.sort_values('col')
df.sort_values('col', ascending=False)
df.sort_values(['a', 'b'], ascending=[True, False])

Unique / duplicates

df['col'].unique()
df['col'].nunique()
df.duplicated()          # Boolean
df.drop_duplicates()
df.drop_duplicates(subset=['a'])

Statistics

df['col'].mean()
df['col'].median()
df['col'].std()
df['col'].var()
df['col'].min()
df['col'].max()
df['col'].quantile(0.95)
df['col'].count()
df.corr(numeric_only=True)  # Correlation matrix (numeric cols only)

Value counts

df['col'].value_counts()
df['col'].value_counts(normalize=True)  # Proportions

Conditional

df['new'] = df['col'].apply(lambda x: 'high' if x > 10 else 'low')

# Vectorized
df['new'] = np.where(df['col'] > 10, 'high', 'low')

# Multiple conditions
conditions = [df['col'] > 100, df['col'] > 10]
choices = ['high', 'medium']
df['new'] = np.select(conditions, choices, default='low')

numpy basics

import numpy as np

np.array([1, 2, 3])
np.zeros(10)
np.ones((3, 3))
np.arange(10)
np.linspace(0, 1, 100)

# Ops
arr.sum(), arr.mean(), arr.std()

# Random
np.random.rand(10)
np.random.randn(10)
np.random.choice(arr, 5)

matplotlib basics

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'], label='y')  # label shows up in legend
plt.title('Title')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

# Subplots
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(...)

seaborn

import seaborn as sns

sns.set_theme(style='whitegrid')
sns.lineplot(data=df, x='date', y='value')
sns.barplot(data=df, x='cat', y='count')
sns.heatmap(df.corr(), annot=True)
sns.boxplot(data=df, x='group', y='value')

Regex

import re

re.search(r'\d+', text)
re.findall(r'\w+', text)
re.sub(r'\d+', 'N', text)

# In pandas
df['col'].str.extract(r'(\d+)')

Statistics functions

from scipy import stats

stats.ttest_ind(a, b)
stats.chi2_contingency(table)
stats.pearsonr(x, y)
stats.norm.cdf(value, loc, scale)

Useful scipy / sklearn

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Common patterns

Time series rolling

df['ma7'] = df['value'].rolling(7).mean()
df['ema'] = df['value'].ewm(span=7).mean()

Percentile binning

df['decile'] = pd.qcut(df['value'], 10, labels=False)
df['quartile'] = pd.qcut(df['value'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

Row-level operations

df['total'] = df[['a', 'b', 'c']].sum(axis=1)
df['max'] = df[['a', 'b', 'c']].max(axis=1)

Shift / lag

df['prev'] = df['col'].shift(1)
df['diff'] = df['col'] - df['col'].shift(1)
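
In practice lags are usually computed per entity; `groupby(...)['col'].shift(1)` keeps the lag from leaking across users. Column names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'value': [1, 2, 10, 20],
})

# Lag within each user, not across the whole frame
df['prev'] = df.groupby('user')['value'].shift(1)
print(df['prev'].tolist())  # [nan, 1.0, nan, 10.0]
```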

Rank

df['rank'] = df['col'].rank(ascending=False)
df['pct_rank'] = df['col'].rank(pct=True)

Performance tips

  • Vectorize (avoid loops)
  • Use .loc, not chained indexing
  • Categorical dtype for low-cardinality strings
  • Downcast integers
  • Read only the columns you need
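
The tips above in one sketch (the CSV is an in-memory stand-in for a real file; column names are made up):

```python
import io
import pandas as pd

csv = io.StringIO("user_id,country,revenue,comment\n1,RU,10.5,hi\n2,US,3.0,yo\n")

# Read only the columns you need
df = pd.read_csv(csv, usecols=['user_id', 'country', 'revenue'])

# Low-cardinality strings -> categorical (much less memory)
df['country'] = df['country'].astype('category')

# Downcast integers to the smallest fitting type
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')

# .loc instead of chained indexing (df[mask]['col'] = ... may write to a copy)
df.loc[df['revenue'] > 5, 'revenue'] = 0
```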

Useful imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

FAQ

pandas 1 vs 2?

2.0+ brings better performance (including optional Arrow-backed dtypes) and some changed defaults; most teams have migrated or are migrating.

Polars or pandas?

Polars is faster on large data; pandas is still the de facto standard and what interviews expect.

Jupyter notebooks?

Useful for exploration; VS Code supports them natively.


Practice Python — open the trainer with 1500+ interview questions.