AI cost и latency на собесе AI PM

Зачем cost и latency на собесе AI PM

LLM-фичи стоят денег per token и tendency не быстрые. AI PM должен спроектировать продукт, который не разорит компанию на масштабе и UX, который выдерживает пользователь. На собесе AI PM cost и latency спрашивают через unit economics и UX challenges.

Слабый ответ — «оптимизируем потом». Сильный — про per-request cost модель, прогноз на масштаб, UX patterns для latency, специфические optimization (caching, smaller models, streaming).

Per-token pricing

LLM-провайдеры считают по токенам (input + output).

Примеры цен (2026, ориентир):

GPT-4o: $5/M input, $15/M output
GPT-4o-mini: $0.15/M input, $0.60/M output
Claude Sonnet 4: $3/M input, $15/M output
Claude Haiku: $0.80/M input, $4/M output
GigaChat / YandexGPT: ~ ₽-prices, проверь актуальные

Калькуляция per-request:

input: 500 tokens system + 200 user = 700
output: 300 tokens
Total per req: input cost + output cost

На масштабе:

1M requests / month × $0.005 / req = $5,000 / month
100M req / month × $0.005 = $500,000

Cost optimization

1. Smaller models:

GPT-4o-mini vs GPT-4o — 10-30x cheaper
Если quality OK — переключайся
Cascade: cheap model first, escalate to expensive если confidence low

2. Caching:

Identical requests → cache hit (instant + free)
Semantic caching: similar queries → cache (через embedding match)
30-70% cache hit rate возможно для FAQ-style

3. Prompt compression:

Trim system prompt
Few-shot only когда нужны
Compress context (summarization)

4. Output limits:

max_tokens cap
Stop sequences

5. Batch processing:

Когда latency не critical — batch API скидка (50% OpenAI)
For overnight jobs

6. Self-hosted:

Open-source LLM (Llama, Mistral, Qwen) на own GPU
Fixed cost вместо variable
Break-even при > 1M req / month обычно

Latency budget

Components:

Network: 50-200ms
Embedding (если RAG): 50-200ms
Vector search: 10-50ms
LLM generation: 1-10s (зависит от output длины)
Total: 2-15s

User patience:

Synchronous response: < 3s ideal, < 10s max
Streaming: моментально start, full result в 10-30s OK
Async (notifications): seconds-minutes OK

Latency optimization

1. Streaming:

Token-by-token output
User видит progress сразу
Лучше perception even если total time same
Стандарт в 2026 для chat UX

2. Smaller models:

Mini / haiku модели — 2-5x faster
Trade-off с quality

3. Speculative decoding:

Small draft model generates → big model validates
2-3x speedup

4. Prompt caching:

Anthropic prompt caching: system prompt cached → faster + cheaper
OpenAI cached input: 50% discount

5. Parallel calls:

Если есть multi-step → parallelize

6. Edge inference:

Smaller models локально (browser / mobile)
No network latency

UX patterns для slow AI

Progress indicators:

Spinner недостаточно
«Анализирую...», «Пишу первый абзац...» — лучше
Token streaming — best

Optimistic UI:

Show prediction immediately (cached / template)
AI refines в background

Async patterns:

«Отправь себе результат по email»
Push notification when ready

Cancellation:

User должен иметь возможность cancel slow generation

Unit economics LLM-фичи

Cost-per-user / month:

Avg N requests × cost per request
Часто $0.5-5 per active user / month

Revenue uplift:

Paywall conversion +X%
Retention +Y%
ARPU +$Z

Break-even:

LTV uplift > AI cost increase
LTV / CAC сохраняется > 3

Sensitivity:

Что если usage 2x? cost 2x → ROI changes?
Что если price LLM 2x от провайдера?

Типичные вопросы

«AI-фича стоит $X / month / user. Стоит запускать?»

Compare с revenue uplift / retention uplift. LTV change > cost increase? Sensitivity на usage growth. Опасные scenarios: viral usage спайк.

«User waits 8s for AI response. Acceptable?»

Depends. For complex generation (legal contract) — fine. For chat — нет. Stream output → perceived faster. If still slow → cascade / cache / smaller model.

«AI-фича scales: 1k → 100k users. Что break?»

Cost (linear scale). Rate limits провайдера. Latency under load. Cost monitoring + alerts.

«Switch from GPT-4 to mini-model — что rules?»

Eval pipeline: quality dropped > X%? Если acceptable → switch. Cost savings 10-30x.

Частые ошибки

Без cost model. Запустили → bill $50k через месяц.
Без caching. FAQ-style queries hit LLM каждый раз.
No streaming. User waits 10s before any output.
One-size-fits-all model. Используем GPT-4 для всего, тогда как mini хватило бы.
No rate limits. User can DOS your AI cost.

FAQ

Cost-per-user expectations?

$0.5-5 / month per active user — типично для AI-features в SaaS. Less для passive (smart search), больше для heavy (AI-coding tools).

Какие монитор cost метрики?

Cost per request (P50, P99), cost per active user, total daily / monthly spend, cache hit rate.

Self-hosted LLM — окупается когда?

1M requests / month (зависит от model). Plus infra ops + maintenance. Часто после break-even cost-wise, ops cost ещё больше.

Latency p99 vs avg?

P99 важнее для UX. AVG может скрывать tail latency.

Anthropic prompt caching vs OpenAI cached input?

Оба reduce cost для повторяющихся system prompts. Implement: Anthropic — explicit cache_control, OpenAI — automatic.

Смотрите также

Тренироваться в Telegram