Multimodal LLM на собеседовании Data Scientist
Карьерник — Duolingo для аналитиков: 10 минут в день тренируй SQL, Python, A/B, статистику, метрики и ещё 3 темы собеса. 1500+ вопросов в Telegram-боте. Бесплатно.
Что такое multimodal LLM
LLM accepts multiple modalities — text + image + audio + video.
User: [image of receipt] "What's the total?"
Model: "$45.67"Архитектура
Common pattern.
Image → vision encoder (ViT) → image tokens.
Text → text tokens.
Combined → LLM decoder → output text.Vision encoder usually pre-trained CLIP-style.
Cross-attention. LLM attends к image tokens.
LLaVA. ViT + projection layer + LLaMA.
GPT-4V, Claude 3 Sonnet/Opus. Closed-source proprietary.
Models
GPT-4V / GPT-4o. OpenAI. Strong general purpose.
Claude 3.5 Sonnet / Opus. Anthropic. Strong reasoning.
Gemini. Google. Native multimodal.
LLaVA. Open source. Various sizes.
Qwen-VL. Alibaba. Strong multilingual.
Pixtral. Mistral. Open multimodal.
InternVL. Strong open research.
В РФ: GigaChat имеет multimodal capabilities.
Применения
OCR-like extraction. Receipts, invoices, IDs.
Visual QA. «What's wrong with this code screenshot?»
Image captioning / описание. For accessibility.
Document understanding. Tables, charts, mixed layouts.
Diagram interpretation. Architecture diagrams, flowcharts.
Code from screenshot. Take UI mockup → produce HTML/CSS.
Video understanding (limited). Frame-by-frame analysis.
Limitations
Hallucinations. Worse чем text-only — describes things not present.
Spatial reasoning. «Object on the left» — sometimes confused.
Counting. «How many cars?» — often wrong.
Fine details. Small text, complex tables — missed.
Cost. Image tokens expensive (1 image = thousands tokens).
Связанные темы
- BERT vs GPT для DS
- CLIP multimodal для DS
- Generative AI applications для DS
- Hallucinations и LLM evals для DS
- Подготовка к собесу Data Scientist
FAQ
Это официальная информация?
Нет. Статья основана на докуменации OpenAI / Anthropic / Google и open source models.
Тренируйте Data Science — откройте тренажёр с 1500+ вопросами для собесов.