qwen3-vl-235b-a22b-instruct

by qwen

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that combines strong text generation with advanced visual understanding for images and video. Designed for general vision-language tasks like VQA, document parsing, chart and table extraction, and multilingual OCR, the model emphasizes robust perception, spatial (2D/3D) understanding, and long-form visual comprehension, with competitive results on public benchmarks. Qwen3-VL also supports agentic interaction and tool use, following complex instructions in multi-image dialogues, aligning text to video timelines, operating GUIs for automation, and enabling visual coding workflows such as turning sketches into code or debugging UIs. Its strong text-only capabilities match Qwen3 language models, making it suitable for document AI, OCR, UI/software assistance, spatial reasoning, and vision-language agent research.

Pricing

Pay-as-you-go rates for this model. More details can be found here.

Input Tokens (1M)

$0.35

Output Tokens (1M)

$1.40

Capabilities

Input Modalities

Text
Image

Output Modalities

Text

Rate Limits

Requests per minute (RPM) and per day (RPD) by tier. More about tiers here

TierRPMRPD
Free
Tier 110
Tier 215
Tier 325
Tier 450

Usage Analytics

Token usage across the last 30 active days

qwen3-vl-235b-a22b-instruct — Model | NagaAI