Qwen3 VL 235B A22B Instruct

qwen3-vl-235b-a22b-instruct
by qwen|Created Sep 24, 2025

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that combines strong text generation with advanced visual understanding for images and video. Designed for general vision-language tasks like VQA, document parsing, chart and table extraction, and multilingual OCR, the model emphasizes robust perception, spatial (2D/3D) understanding, and long-form visual comprehension, with competitive results on public benchmarks. Qwen3-VL also supports agentic interaction and tool use, following complex instructions in multi-image dialogues, aligning text to video timelines, operating GUIs for automation, and enabling visual coding workflows such as turning sketches into code or debugging UIs. Its strong text-only capabilities match Qwen3 language models, making it suitable for document AI, OCR, UI/software assistance, spatial reasoning, and vision-language agent research.

Pricing

Pay-as-you-go rates for this model. More details can be found here.

Input Tokens (1M)

$0.35

Output Tokens (1M)

$1.40

Capabilities

Input Modalities

Text
Image

Output Modalities

Text

Usage Analytics

Token usage across the last 30 active days

Throughput

Time-To-First-Token (TTFT)