Abstract
Multimodal large language models (MLLMs), leveraging their rapidly advancing autonomous planning and tool-use capabilities, are evolving into intelligent agents that can browse the web for multimodal deep search. However, existing benchmarks remain limited in task complexity, information searchability, and evaluation dimensions, hindering comprehensive assessment of multimodal browsing agents' deep search capabilities in open-world environments. To bridge these gaps, we present BrowseComp-V³, a benchmark of 300 meticulously hand-crafted, challenging questions spanning diverse domains. By emphasizing deep, multi-level, cross-modal multi-hop reasoning, we ensure that the tasks require web browsing tools and cannot be solved from a model's parametric knowledge alone. Moreover, we strictly enforce public searchability of all supporting evidence and incorporate an expert-validated, subgoal-driven process evaluation mechanism, enabling fine-grained characterization of search behaviors and systematic analysis of capability boundaries. Beyond the dataset, we provide OmniSeeker, a general multimodal browsing agent framework, and conduct a comprehensive evaluation of MLLMs. The results show that even state-of-the-art models such as GPT-5.2 achieve only 36% accuracy. Further analysis reveals critical bottlenecks in multimodal information integration and fine-grained perception, highlighting a fundamental lack of native multimodal reasoning capabilities.
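To make the subgoal-driven process evaluation concrete, below is a minimal sketch of how a Process Score could be computed from expert-annotated subgoals. The `Subgoal` type, the judging of each subgoal against an agent's trajectory, and the uniform-fraction scoring rule are illustrative assumptions, not the benchmark's released implementation.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    """One expert-annotated intermediate step of a deep-search task."""
    description: str   # e.g. "identify the landmark shown in the query image"
    satisfied: bool    # judged against the agent's browsing trajectory

def process_score(subgoals: list[Subgoal]) -> float:
    """Process Score as the percentage of subgoals the agent completed.

    A sketch assuming uniformly weighted subgoals; the benchmark may use
    a different aggregation rule.
    """
    if not subgoals:
        return 0.0
    return 100.0 * sum(sg.satisfied for sg in subgoals) / len(subgoals)

# An agent that completes two of three subgoals scores 66.7%.
checked = [
    Subgoal("identify the landmark shown in the query image", True),
    Subgoal("find the architect who designed it", True),
    Subgoal("cross-reference the architect's award record", False),
]
print(f"PS = {process_score(checked):.1f}%")  # PS = 66.7%
```

Unlike the binary Success Rate, such a per-subgoal score gives partial credit to trajectories that make verifiable progress before failing.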
Leaderboard
Performance on BrowseComp-V³. Results are reported as Success Rate (SR, %) and Process Score (PS, %) under the Pass@1 setting; Average denotes the overall average across all 300 questions.
| Model | SR Avg. | SR Science | SR Technology | SR Society | SR Culture | SR Life | PS Avg. | PS Science | PS Technology | PS Society | PS Culture | PS Life |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Human** | | | | | | | | | | | | |
| Browser | 68.03 | 72.00 | 70.00 | 73.33 | 68.00 | 54.00 | 82.93 | 87.54 | 85.25 | 84.19 | 82.73 | 74.32 |
| **Tool-Augmented VLMs** | | | | | | | | | | | | |
| GPT-5.2-Thinking | 39.13 | 26.00 | 48.00 | 38.67 | 37.33 | 46.00 | 66.05 | 61.11 | 79.74 | 54.87 | 71.41 | 64.73 |
| Gemini-3-Pro-Preview | 22.90 | 18.00 | 16.00 | 21.33 | 24.00 | 34.00 | 62.43 | 62.02 | 73.78 | 48.70 | 66.33 | 62.50 |
| Claude-Sonnet-4.5-Thinking | 18.33 | 22.00 | 16.00 | 17.33 | 21.33 | 14.00 | 47.73 | 58.61 | 58.41 | 34.79 | 54.33 | 35.68 |
| **Tool-Free VLMs** | | | | | | | | | | | | |
| GPT-5.2 | 6.00 | 0.00 | 14.00 | 4.00 | 5.33 | 8.00 | 25.02 | 17.50 | 44.50 | 19.22 | 25.25 | 21.43 |
| o4-mini | 7.33 | 0.00 | 16.00 | 2.67 | 6.67 | 14.00 | 29.08 | 29.48 | 46.12 | 17.91 | 29.04 | 28.44 |
| GPT-4o | 2.67 | 0.00 | 10.00 | 2.67 | 1.33 | 0.00 | 11.26 | 6.78 | 26.52 | 9.45 | 7.60 | 8.66 |
| Gemini-3-Flash-Preview | 12.00 | 8.00 | 18.00 | 12.00 | 10.67 | 12.00 | 40.76 | 38.98 | 61.94 | 31.89 | 39.72 | 36.60 |
| Claude-Sonnet-4.5 | 4.00 | 4.00 | 6.00 | 2.67 | 5.33 | 2.00 | 25.74 | 33.06 | 47.56 | 15.97 | 24.53 | 13.06 |
| Doubao-Seed-1.8 | 9.00 | 8.00 | 16.00 | 1.33 | 13.33 | 8.00 | 34.74 | 36.26 | 51.00 | 22.28 | 39.87 | 27.96 |
| MiMo-V2-Flash | 3.00 | 2.00 | 4.00 | 2.67 | 4.00 | 2.00 | 8.12 | 4.52 | 17.28 | 5.43 | 9.40 | 4.70 |
| Qwen3-VL-235B-A22B-Instruct | 3.33 | 4.00 | 4.00 | 4.00 | 2.67 | 2.00 | 20.52 | 26.34 | 35.38 | 11.69 | 19.64 | 14.39 |
| Qwen3-VL-8B-Instruct | 1.00 | 0.00 | 2.00 | 1.33 | 0.00 | 2.00 | 6.64 | 3.28 | 15.36 | 6.39 | 4.09 | 5.48 |
| **MM-Browse Agent** | | | | | | | | | | | | |
| GPT-5.2 | 36.00 | 50.00 | 28.00 | 33.33 | 33.33 | 38.00 | 57.70 | 67.23 | 55.40 | 49.34 | 60.49 | 58.81 |
| o4-mini | 26.00 | 22.00 | 24.00 | 25.33 | 34.67 | 20.00 | 44.66 | 43.70 | 52.11 | 40.73 | 48.72 | 37.94 |
| GPT-4o | 11.41 | 14.00 | 14.00 | 10.67 | 14.67 | 2.00 | 24.15 | 22.32 | 43.36 | 17.35 | 25.93 | 13.04 |
| Gemini-3-Flash-Preview | 23.67 | 32.00 | 24.00 | 18.67 | 25.33 | 20.00 | 47.37 | 50.84 | 68.35 | 40.15 | 43.68 | 39.27 |
| Claude-Sonnet-4.5 | 22.67 | 32.00 | 20.00 | 21.33 | 24.00 | 16.00 | 54.17 | 60.94 | 64.25 | 45.29 | 56.73 | 46.78 |
| Doubao-Seed-1.8 | 33.67 | 42.00 | 28.00 | 37.33 | 38.67 | 18.00 | 58.44 | 52.94 | 69.70 | 57.46 | 66.27 | 42.42 |
| MiMo-V2-Flash | 16.67 | 18.00 | 12.00 | 10.67 | 29.33 | 10.00 | 31.33 | 34.84 | 35.04 | 24.45 | 42.43 | 17.76 |
| Qwen3-VL-235B-A22B-Instruct | 14.33 | 16.00 | 14.00 | 14.67 | 17.33 | 8.00 | 26.68 | 28.56 | 35.93 | 22.00 | 28.36 | 20.05 |
| Qwen3-VL-8B-Instruct | 5.33 | 2.00 | 2.00 | 9.33 | 8.00 | 2.00 | 13.40 | 8.02 | 19.30 | 15.57 | 15.55 | 6.38 |
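The Average columns above are consistent with a question-count-weighted mean over the five domains. In the sketch below, the per-domain counts (50/50/75/75/50, totaling the benchmark's 300 questions) are inferred from the reported percentages (e.g., 33.33 ≈ k/75), not taken from an official statement.

```python
# Reproducing the Average column as a count-weighted mean over domains.
# The per-domain question counts are an inference from the reported
# percentages, not an official figure.
DOMAIN_COUNTS = {"Science": 50, "Technology": 50,
                 "Society": 75, "Culture": 75, "Life": 50}

def overall_average(per_domain: dict[str, float]) -> float:
    """Weight each domain's score by its (assumed) question count."""
    total = sum(DOMAIN_COUNTS.values())  # 300 questions
    return sum(per_domain[d] * n for d, n in DOMAIN_COUNTS.items()) / total

# Per-domain success rates of GPT-5.2 under the MM-Browse Agent setting:
sr = {"Science": 50.00, "Technology": 28.00,
      "Society": 33.33, "Culture": 33.33, "Life": 38.00}
print(f"{overall_average(sr):.2f}")  # 36.00, matching the table
```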
The BrowseComp-V³ Dataset
Overview
Below, we describe the construction pipeline and summary statistics of the BrowseComp-V³ benchmark.
Data Construction Pipeline
Dataset Statistics
Experiment Analysis
Fine-grained Analysis
Test Time Scaling
Failure Mode Analysis
BibTeX
```bibtex
@article{zhang2026browsecompv3,
  title   = {BrowseComp-V³: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents},
  author  = {Huanyao Zhang and Jiepeng Zhou and Bo Li and Bowen Zhou and Yanzhe Dan and Haishan Lu and Zhiyong Cao and Jiaoyang Chen and Yuqian Han and Zinan Sheng and Zhengwei Tao and Hao Liang and Jialong Wu and Yang Shi and Yuanpeng He and Jiaye Lin and Qintong Zhang and Guochen Yan and Runhao Zhao and Zhengpin Li and Xiaohan Yu and Lang Mei and Chong Chen and Wentao Zhang and Bin Cui},
  journal = {arXiv preprint arXiv:2602.12876},
  year    = {2026}
}
```