AI metrics and evaluation in the AI PM interview

Q: LLM-as-judge: which model should you use?

One stronger than the model being evaluated, for example GPT-4o to judge GPT-3.5 outputs.

Q: How many examples should an eval dataset contain?

About 100 for quick iteration, about 1000 for a production launch decision.

Q: How do you detect a novelty effect?

Compare cohorts: early adopters vs later users. If the uplift is larger for early adopters than for later users, novelty is the likely cause.

Why AI metrics come up in the AI PM interview

LLM features are hard to evaluate: the output is text, and there is no unambiguous right or wrong. A typical AI PM interview question is "How would you measure the quality of your AI feature?" Without a clear framework, the answer comes across as weak.

A weak answer is "I'll ask users." A strong one covers evaluation pipelines, LLM-as-judge, business metrics, hallucination rate, and A/B tests with a power calculation.

Types of AI metrics

1. Output quality metrics:

Faithfulness (output is grounded in the context)
Relevance (output answers the question)
Coherence (does the output make sense?)
Style consistency
Hallucination rate

2. User experience metrics:

Thumbs up / down rate
Edit rate (how often the user rewrites the output)
Regeneration rate
Time to satisfaction
NPS / CSAT for AI features

3. Engagement metrics:

Adoption rate (% of active users who used the feature)
Repeat usage
Session duration after using the AI feature

4. Business metrics:

Revenue impact (paywall conversion / retention uplift)
Cost per request × usage
Support deflection rate

Evaluation pipeline

Offline eval (pre-deployment):

Build an eval dataset: 100-1000 examples with expected outputs.
Run the pipeline: generate an AI output for each input.
Score: automatic + manual.
Aggregate: stats per metric.
Compare: new model / prompt vs baseline.
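The offline steps above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `exact_match` is a toy scorer standing in for the automatic + manual scoring, and the dataset and the two "models" (plain dictionaries of canned answers) are hypothetical.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Toy automatic scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(dataset, generate, score=exact_match) -> float:
    """Generate an output for every input, score it, return the mean score."""
    return mean(score(generate(ex["input"]), ex["expected"]) for ex in dataset)


def compare_to_baseline(dataset, baseline, candidate, score=exact_match) -> dict:
    """Aggregate both variants on the same dataset and report the difference."""
    base = run_offline_eval(dataset, baseline, score)
    cand = run_offline_eval(dataset, candidate, score)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Hypothetical eval dataset with expected outputs.
DATASET = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "largest planet", "expected": "Jupiter"},
]

# Stand-ins for the baseline prompt/model and the new one (assumptions).
BASELINE_ANSWERS = {
    "capital of France": "Paris", "2 + 2": "5",
    "capital of Japan": "Tokyo", "largest planet": "Saturn",
}
CANDIDATE_ANSWERS = {
    "capital of France": "Paris", "2 + 2": "4",
    "capital of Japan": "Tokyo", "largest planet": "Saturn",
}

report = compare_to_baseline(DATASET, BASELINE_ANSWERS.get, CANDIDATE_ANSWERS.get)
print(report)
```

In practice `generate` would call the model and `score` would be an LLM judge or a task-specific metric; the point is that every prompt change is compared against the baseline on the same fixed dataset.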

Online eval (post-deployment):

Sample sessions for review
Implicit signals (edits, regenerations)
Explicit feedback (thumbs)
A/B tests on business metrics
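The implicit and explicit signals can be computed from a plain event log. A minimal sketch, assuming a hypothetical log where each event is a dict with a `type` field; the event names (`generation`, `edit`, `regenerate`, `thumbs_up`, `thumbs_down`) are illustrative, not a real schema.

```python
from collections import Counter


def online_rates(events):
    """Compute edit, regeneration and thumbs-up rates from a list of events."""
    counts = Counter(e["type"] for e in events)
    generations = counts["generation"]
    if generations == 0:
        raise ValueError("no generation events in the log")
    rated = counts["thumbs_up"] + counts["thumbs_down"]
    return {
        # Implicit signals, normalised by the number of generated outputs.
        "edit_rate": counts["edit"] / generations,
        "regeneration_rate": counts["regenerate"] / generations,
        # Explicit signal: share of positive ratings among all ratings.
        "thumbs_up_rate": counts["thumbs_up"] / rated if rated else None,
    }


# Hypothetical log: 10 generations, 3 edits, 2 regenerations, 4 up, 1 down.
EVENTS = (
    [{"type": "generation"}] * 10
    + [{"type": "edit"}] * 3
    + [{"type": "regenerate"}] * 2
    + [{"type": "thumbs_up"}] * 4
    + [{"type": "thumbs_down"}] * 1
)

rates = online_rates(EVENTS)
print(rates)
```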

LLM-as-judge

One LLM evaluates another LLM's output. Scalable.

Pattern:

[System] You are a judge of the quality of an AI assistant's answers.
Evaluate the output against the criteria: faithfulness, relevance.

[Input] Question: <q>
Output: <ai_output>
Expected: <gold_answer>

Score 1-5 per criterion + reason.
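The pattern can be wired up roughly as follows. This is a sketch under assumptions: the judge is asked to reply in JSON (a format choice not in the pattern itself), and `call_llm` is a hypothetical function you would replace with your provider's client; here a canned stub stands in for it.

```python
import json

CRITERIA = ("faithfulness", "relevance")

SYSTEM_PROMPT = (
    "You are a judge of the quality of an AI assistant's answers. "
    "Evaluate the output against the criteria: faithfulness, relevance. "
    "Score 1-5 per criterion and give a reason. "
    'Reply with JSON only, e.g. {"faithfulness": 4, "relevance": 5, "reason": "..."}'
)


def build_user_prompt(question: str, ai_output: str, gold_answer: str) -> str:
    """Fill the [Input] part of the judge prompt."""
    return f"Question: {question}\nOutput: {ai_output}\nExpected: {gold_answer}"


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and validate that scores are in 1-5."""
    data = json.loads(raw)
    verdict = {}
    for criterion in CRITERIA:
        score = int(data[criterion])
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} score out of range: {score}")
        verdict[criterion] = score
    verdict["reason"] = str(data.get("reason", ""))
    return verdict


def judge(question, ai_output, gold_answer, call_llm) -> dict:
    """call_llm(system, user) -> str is a placeholder for a real model call."""
    raw = call_llm(SYSTEM_PROMPT, build_user_prompt(question, ai_output, gold_answer))
    return parse_verdict(raw)


def fake_llm(system: str, user: str) -> str:
    """Canned judge reply, used only to make the sketch runnable."""
    return '{"faithfulness": 2, "relevance": 5, "reason": "Adds an unsupported date."}'


verdict = judge("When was X founded?", "In 1999, in Berlin.", "In Berlin.", fake_llm)
print(verdict)
```

Validating the reply matters in practice: judge models occasionally return malformed or out-of-range scores, and silently accepting them skews the aggregate.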

Pros:

Scales: 1000 examples per hour
Consistent (if the prompt is stable)

Cons:

Judge model bias
Disagreement with human raters
Cost

Best practice: judge + human spot-check (10-20% sample).
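A spot-check can be as simple as drawing a random sample for human review and measuring how often the judge and the human agree. A minimal sketch; the example IDs and scores are made up, and the `tolerance` parameter (how many points apart still counts as agreement) is an assumption of this example.

```python
import random


def spot_check_sample(example_ids, fraction=0.1, seed=0):
    """Pick a reproducible random subset of examples for human review."""
    ids = list(example_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Share of examples where judge and human scores differ by <= tolerance."""
    shared = judge_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no examples scored by both judge and human")
    hits = sum(abs(judge_scores[i] - human_scores[i]) <= tolerance for i in shared)
    return hits / len(shared)


# Hypothetical scores on a 1-5 scale for four reviewed examples.
JUDGE = {"ex1": 5, "ex2": 4, "ex3": 2, "ex4": 5}
HUMAN = {"ex1": 5, "ex2": 3, "ex3": 2, "ex4": 2}

sample = spot_check_sample(range(100), fraction=0.1)
exact = agreement_rate(JUDGE, HUMAN)                 # exact agreement
close = agreement_rate(JUDGE, HUMAN, tolerance=1)    # within one point
print(len(sample), exact, close)
```

If agreement is low, the usual next step is to revise the judge prompt or rubric before trusting the judge at scale.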

Hallucination measurement

Methods:

LLM-as-judge: "is the output grounded in the context?"
NLI (Natural Language Inference): does the context entail the output?
Citation accuracy: do the cited sources actually support the claim?

Tools: Ragas, TruLens, custom.

Targets:

< 5% for critical domains (medical, legal)
< 10% for general use
A 100% hallucination-free output is not realistic
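A measured hallucination rate is only an estimate, so it helps to compare the target against a confidence interval rather than the raw percentage. A minimal sketch using the Wilson score interval; treating "upper bound below the target" as the pass criterion is an assumption of this example, not a rule from the text.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def meets_target(hallucinated: int, n: int, target: float) -> bool:
    """True only if the upper bound of the interval is below the target rate."""
    _, upper = wilson_interval(hallucinated, n)
    return upper < target


# Same observed rate (3%), different sample sizes.
print(meets_target(3, 100, 0.05))    # small sample: interval is too wide
print(meets_target(30, 1000, 0.05))  # larger sample: interval is tighter
```

The same 3% observed rate passes a 5% target on 1000 examples but not on 100, which is one reason the launch-decision dataset is larger than the quick-iteration one.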

A/B tests for AI features

Specific challenges:

High variance in outputs (even for the same prompt)
Long-tail behavior (rare inputs)
Novelty effect (users are excited at first)

Setup:

50/50 traffic split (or a 5/95 canary)
Metric: a business outcome, not output quality
Sample size: usually larger than for a classic UI test (high variance)
Duration: 1-4 weeks
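For a conversion-style business metric, the sample size can be estimated with the standard two-proportion power calculation. A minimal sketch, assuming a two-sided test with equal arms; the baseline and target conversion rates in the example are made up.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_treatment: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect p_control -> p_treatment (normal approx.)."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical: conversion 10% in control, hoping for 12% with the AI feature.
n_big_effect = sample_size_per_arm(0.10, 0.12)
# A smaller expected uplift (10% -> 11%) needs far more users per arm.
n_small_effect = sample_size_per_arm(0.10, 0.11)
print(n_big_effect, n_small_effect)
```

Halving the expected uplift roughly quadruples the required sample, which is why the expected effect size should be stated before the test starts.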

Guardrails:

Cost cap per arm
Quality threshold rollback

More detail: A/B experiments.

Tools for AI evaluation

Open-source:

Ragas: RAG evaluation
TruLens: observability + evals
DeepEval: pytest-style tests for LLMs

Managed:

LangSmith: managed platform from LangChain
OpenAI Evals
Weights & Biases
Patronus AI

Human evaluation

For critical metrics, human evaluation is unavoidable.

Setup:

50-200 examples
2-3 annotators per example (to measure inter-rater agreement)
Clear rubric (faithfulness 1-5 with definitions)
Iterate rubric → re-annotate
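Inter-rater agreement for two annotators is commonly summarised with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch for two raters with made-up 1-5 scores; choosing kappa here is an assumption, since the text only asks for cross-rater agreement.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two raters labelling the same examples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set")
    n = len(rater_a)
    # Observed agreement: share of examples with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness scores (1-5) from two annotators on 8 examples.
ANNOTATOR_1 = [5, 4, 4, 2, 1, 3, 5, 4]
ANNOTATOR_2 = [5, 4, 3, 2, 1, 3, 4, 4]

kappa = cohens_kappa(ANNOTATOR_1, ANNOTATOR_2)
print(round(kappa, 3))
```

A low kappa usually points at an ambiguous rubric rather than careless annotators, which is what the "iterate rubric, re-annotate" step addresses.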

Cost: $0.1-1 per annotation, roughly $20-500 per eval cycle.

Platforms: Scale AI, Toloka (Russia), Surge AI.

Typical questions

"How would you measure the quality of your AI feature?"

Multi-layer:

Offline eval pipeline (gold dataset + LLM-judge).
Online metrics (thumbs, edit rate, regen rate).
Business metrics (conversion / retention / cost).
Human spot-checks (10% sample).

"How do you detect a hallucination in an AI answer?"

Multi-method: LLM-judge faithfulness, NLI entailment, citation accuracy. Combine them for a robust signal.

"Why does an AI A/B test take longer than a regular one?"

High output variance plus the novelty effect. You need a larger sample and a longer duration to reach confidence.

"The user liked the output. Is that a good signal?"

It is an explicit signal, but a biased one (giving a positive rating is easier). Complement it with edit rate (lower is better), regen rate (lower is better), and business outcome (eventual purchase / retention).

Common mistakes

Only output quality, no business metrics. The AI looks good, but the business does not grow.
No offline eval pipeline. Every prompt change is a gamble.
LLM-as-judge without a human spot-check. Judge bias goes unchecked.
A/B test without a power calculation. Results are inconclusive.
Ignoring the novelty effect. The initial uplift fades, so a longer test is needed.
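The cohort comparison for the novelty effect can be sketched as follows: compute the uplift (treatment minus control conversion) separately for early adopters and later users, and flag the test if the early uplift is clearly larger. The cohort numbers and the `min_drop` threshold are hypothetical, and this check does not test statistical significance.

```python
def uplift(cohort: dict) -> float:
    """Absolute uplift: treatment conversion minus control conversion."""
    control = cohort["control_conversions"] / cohort["control_users"]
    treatment = cohort["treatment_conversions"] / cohort["treatment_users"]
    return treatment - control


def novelty_check(early: dict, later: dict, min_drop: float = 0.01) -> dict:
    """Flag a possible novelty effect if uplift shrinks by more than min_drop."""
    early_uplift, later_uplift = uplift(early), uplift(later)
    return {
        "early_uplift": early_uplift,
        "later_uplift": later_uplift,
        "novelty_suspected": (early_uplift - later_uplift) > min_drop,
    }


# Hypothetical cohorts: users who joined in week 1 vs weeks 3-4 of the test.
EARLY = {"control_users": 1000, "control_conversions": 100,
         "treatment_users": 1000, "treatment_conversions": 150}
LATER = {"control_users": 1000, "control_conversions": 100,
         "treatment_users": 1000, "treatment_conversions": 110}

result = novelty_check(EARLY, LATER)
print(result)
```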
