Incident response in a system analyst interview
Severity
SEV-1. Total outage / data loss / security breach. All-hands.
SEV-2. Major impact, partial outage. Senior on-call.
SEV-3. Minor functionality affected. Workaround exists.
SEV-4. Cosmetic, low impact.
Response priority matches the severity level.
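The severity levels above can be sketched as a small lookup. This is a minimal illustration, not a standard: the `classify` rules are hypothetical, and the responders for SEV-3 and SEV-4 are assumptions, since the list above only names responders for SEV-1 and SEV-2.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # total outage, data loss, or security breach
    SEV2 = 2  # major impact, partial outage
    SEV3 = 3  # minor functionality affected, workaround exists
    SEV4 = 4  # cosmetic, low impact


# Who responds at each level. SEV-1 and SEV-2 follow the list above;
# the SEV-3 and SEV-4 entries are illustrative assumptions.
RESPONDERS = {
    Severity.SEV1: "all hands",
    Severity.SEV2: "senior on-call",
    Severity.SEV3: "owning team, business hours",  # assumption
    Severity.SEV4: "backlog ticket",               # assumption
}


def classify(total_outage=False, data_loss=False, security_breach=False,
             partial_outage=False, cosmetic_only=False):
    """Map observed impact to a severity level (hypothetical rules)."""
    if total_outage or data_loss or security_breach:
        return Severity.SEV1
    if partial_outage:
        return Severity.SEV2
    if cosmetic_only:
        return Severity.SEV4
    # Anything else is treated as minor impact with a workaround.
    return Severity.SEV3


level = classify(partial_outage=True)
print(level.name, "->", RESPONDERS[level])
```

Using `IntEnum` keeps the levels comparable, so a check such as `level <= Severity.SEV2` can decide whether to page someone.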
Roles
Incident Commander. Coordinates the response and makes decisions; not the technical lead.
Tech lead. Diagnosis and remediation work.
Communications lead. Updates stakeholders / customers.
Scribe. Logs decisions, timeline.
In small teams, roles are combined. In a big incident, each role goes to a separate person.
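One way to picture role combining is a round-robin assignment. This is only a sketch under assumptions: the function `assign_roles` and the responder names are hypothetical, and real teams assign roles by skill and availability rather than by position in a list.

```python
ROLES = ["Incident Commander", "Tech lead", "Communications lead", "Scribe"]


def assign_roles(people):
    """Spread the four roles over the available responders.

    Assumes responder names are unique. With four or more people each
    role gets its own person; with fewer, roles are combined. Because
    the Incident Commander and Tech lead are the first two roles, they
    land on different people whenever at least two responders exist.
    """
    if not people:
        raise ValueError("at least one responder is required")
    assignment = {person: [] for person in people[:len(ROLES)]}
    for i, role in enumerate(ROLES):
        assignment[people[i % len(people)]].append(role)
    return assignment


print(assign_roles(["ana", "ben"]))
```

With two responders, one person coordinates and communicates while the other diagnoses and keeps the log.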
Mitigation
Priority — restore service.
Steps:
- Acknowledge — confirm incident, page team.
- Triage — assess impact, severity.
- Mitigate — restore service (rollback, failover).
- Verify — confirm fixed.
- RCA comes later, after service is restored.
Don't try to find the root cause first. During the incident, stay in fix mode.
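The ordering of the steps above can be sketched as a tiny state machine that refuses to start RCA before service is restored and verified. The class `Incident` and its method names are hypothetical, chosen only for illustration.

```python
STEPS = ["acknowledge", "triage", "mitigate", "verify", "rca"]


class Incident:
    """Tracks progress through the response steps in strict order."""

    def __init__(self, title):
        self.title = title
        self.done = []

    def advance(self, step):
        if len(self.done) == len(STEPS):
            raise ValueError("incident is already closed")
        expected = STEPS[len(self.done)]
        if step != expected:
            # Skipping ahead (for example straight to RCA) is rejected.
            raise ValueError(f"expected '{expected}' before '{step}'")
        self.done.append(step)

    @property
    def service_restored(self):
        return "verify" in self.done


inc = Incident("checkout errors")
inc.advance("acknowledge")
inc.advance("triage")
try:
    inc.advance("rca")  # root cause hunting during the outage
except ValueError as err:
    print("rejected:", err)
```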
RCA
Root Cause Analysis: what deeper cause led to the incident.
5 Whys.
Service down. Why? Out of memory.
Why? Memory leak in the new release.
Why? Bug in the caching code.
Why? Missing test case.
Why? Code review didn't catch it.
Fix the root, not the symptom.
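The same chain can be written down as data, which makes it easy to paste into a postmortem. A minimal sketch: the helper `five_whys` is hypothetical, and it simply treats the last answer as the deepest cause reached so far.

```python
def five_whys(symptom, answers):
    """Return the numbered why-chain and the deepest cause reached."""
    lines = [f"Symptom: {symptom}"]
    for i, answer in enumerate(answers, start=1):
        lines.append(f"Why #{i}? {answer}")
    # With no answers yet, the symptom itself is all we know.
    root = answers[-1] if answers else symptom
    return lines, root


chain, root = five_whys(
    "Service down",
    [
        "Out of memory",
        "Memory leak in new release",
        "Bug in caching code",
        "Missing test case",
        "Code review didn't catch it",
    ],
)
print("\n".join(chain))
print("Fix this:", root)
```

Action items then target `root` (the review and testing process), not only the first answer (restarting the service to free memory).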
Postmortem
Document incident, learnings.
Standard format:
- Timeline.
- Impact.
- Root cause.
- Resolution.
- Action items (preventions).
Blameless culture: no personal blame; focus on systemic improvements.
Some companies publish postmortems publicly to build trust and share learnings.
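The standard format above maps naturally onto a small record type with a completeness check. This is a sketch, not a prescribed template: the class `Postmortem`, its field names, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    title: str
    timeline: list = field(default_factory=list)  # "HH:MM event" strings
    impact: str = ""
    root_cause: str = ""
    resolution: str = ""
    action_items: list = field(default_factory=list)

    def missing_sections(self):
        """Names of standard sections that are still empty."""
        sections = {
            "Timeline": self.timeline,
            "Impact": self.impact,
            "Root cause": self.root_cause,
            "Resolution": self.resolution,
            "Action items": self.action_items,
        }
        return [name for name, value in sections.items() if not value]

    def render(self):
        """Plain-text document in the standard section order."""
        out = [f"Postmortem: {self.title}"]
        out += ["Timeline:"] + [f"- {e}" for e in self.timeline]
        out += [f"Impact: {self.impact}"]
        out += [f"Root cause: {self.root_cause}"]
        out += [f"Resolution: {self.resolution}"]
        out += ["Action items:"] + [f"- {a}" for a in self.action_items]
        return "\n".join(out)


# Invented sample data, for illustration only.
pm = Postmortem(
    title="Service down (out of memory)",
    timeline=["10:00 alert fired", "10:20 rollback finished"],
    impact="Service unavailable for 20 minutes",
    root_cause="Memory leak in caching code; no test covered it",
    resolution="Rolled back the release",
)
print(pm.missing_sections())  # ['Action items']
```

Note that the record has no field for who made the mistake, which fits the blameless approach: it describes what happened to the system, not who to blame.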
Related topics
- SLA, SLO, SLI for system analysts
- Chaos engineering for system analysts
- Observability for system analysts
- Severity vs priority for system analysts
- Preparing for a system analyst interview
FAQ
Is this official information?
No. The article is based on Google SRE practices.