BrowseComp-V³: A Vertical, Verifiable, and Visual Benchmark for Multimodal Browsing Agents

Huanyao Zhang1◆‡, Jiepeng Zhou2◆, Bo Li1◆, Bowen Zhou1◆, Yanzhe Dan3◆, Haishan Lu1, Zhiyong Cao4, Jiaoyang Chen5, Yuqian Han1, Zinan Sheng1, Zhengwei Tao1, Hao Liang1, Jialong Wu1, Yang Shi1, Yuanpeng He1, Jiaye Lin6, Qintong Zhang1, Guochen Yan1, Runhao Zhao1, Zhengpin Li1, Xiaohan Yu7, Lang Mei7, Chong Chen7†, Wentao Zhang1†, Bin Cui1†
1Peking University   2HKUST(GZ)   3OUC   4CASIA   5HITSZ   6THU   7Huawei Cloud BU
◆ Core Contributor   ‡ Project Leader   † Corresponding author

Abstract

Multimodal large language models (MLLMs), leveraging their increasingly advanced autonomous planning and tool-use capabilities, are evolving into intelligent agents that browse the web to perform multimodal deep search. However, existing benchmarks remain limited in task complexity, information searchability, and evaluation dimensions, hindering comprehensive assessment of multimodal browsing agents' deep-search capabilities in open-world environments. To bridge these gaps, we present BrowseComp-V³, a novel benchmark comprising 300 meticulously hand-crafted, challenging questions across diverse domains. By emphasizing deep, multi-level, cross-modal multi-hop reasoning, we ensure that these tasks require web browsing tools and cannot be solved from a model's parametric knowledge alone. Moreover, we strictly enforce the public searchability of all supporting evidence and incorporate an expert-validated, subgoal-driven process evaluation mechanism, enabling fine-grained characterization of search behaviors and systematic analysis of capability boundaries. Beyond the dataset, we provide OmniSeeker, a general multimodal browsing agent framework, and conduct a comprehensive evaluation of MLLMs. The results demonstrate that even state-of-the-art models, such as GPT-5.2, achieve only 36% accuracy. Further analysis reveals critical bottlenecks in multimodal information integration and fine-grained perception, highlighting a fundamental lack of native multimodal reasoning capabilities.
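This page does not spell out how the subgoal-driven process evaluation is computed. As a minimal sketch, assuming each question ships with an expert-validated subgoal checklist and a gold final answer, Success Rate (SR) can be read as binary final-answer correctness and Process Score (PS) as the average fraction of subgoals a trajectory satisfies. All names below (Subgoal, TrajectoryResult, the exact-match check) are hypothetical illustrations, not the authors' implementation.

from dataclasses import dataclass

# Hypothetical data model; the benchmark's actual schema is not shown on this page.
@dataclass
class Subgoal:
    description: str   # e.g. "identifies the correct conference year"
    satisfied: bool    # judged against the expert-validated rubric

@dataclass
class TrajectoryResult:
    final_answer: str
    subgoals: list[Subgoal]

def success_rate(results: list[TrajectoryResult], gold: list[str]) -> float:
    """Pass@1 SR: fraction of questions whose final answer matches the gold answer."""
    correct = sum(r.final_answer.strip().lower() == g.strip().lower()
                  for r, g in zip(results, gold))
    return 100.0 * correct / len(results)

def process_score(results: list[TrajectoryResult]) -> float:
    """PS: average fraction of subgoals satisfied per question (partial credit)."""
    per_question = [sum(s.satisfied for s in r.subgoals) / len(r.subgoals)
                    for r in results]
    return 100.0 * sum(per_question) / len(per_question)

Under this reading, PS rewards trajectories that make verifiable progress even when the final answer is wrong, which is what lets the benchmark separate search behavior from answer accuracy.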

Leaderboard

Performance on BrowseComp-V³. Results are reported as Success Rate (SR, %) and Process Score (PS, %) under the Pass@1 setting. Avg denotes the overall average; Sci, Tech, Soc, and Cult abbreviate the Science, Technology, Society, and Culture domains.

                                              Success Rate (SR, %)                           Process Score (PS, %)
Model                              Avg     Sci    Tech     Soc    Cult    Life     Avg     Sci    Tech     Soc    Cult    Life
------------------------------------------------------------------------------------------------------------------------------
Human
  Browser                        68.03   72.00   70.00   73.33   68.00   54.00   82.93   87.54   85.25   84.19   82.73   74.32
Tool-Augmented VLMs
  GPT-5.2-Thinking               39.13   26.00   48.00   38.67   37.33   46.00   66.05   61.11   79.74   54.87   71.41   64.73
  Gemini-3-Pro-Preview           22.90   18.00   16.00   21.33   24.00   34.00   62.43   62.02   73.78   48.70   66.33   62.50
  Claude-Sonnet-4.5-Thinking     18.33   22.00   16.00   17.33   21.33   14.00   47.73   58.61   58.41   34.79   54.33   35.68
Tool-Free VLMs
  GPT-5.2                         6.00    0.00   14.00    4.00    5.33    8.00   25.02   17.50   44.50   19.22   25.25   21.43
  o4-mini                         7.33    0.00   16.00    2.67    6.67   14.00   29.08   29.48   46.12   17.91   29.04   28.44
  GPT-4o                          2.67    0.00   10.00    2.67    1.33    0.00   11.26    6.78   26.52    9.45    7.60    8.66
  Gemini-3-Flash-Preview         12.00    8.00   18.00   12.00   10.67   12.00   40.76   38.98   61.94   31.89   39.72   36.60
  Claude-Sonnet-4.5               4.00    4.00    6.00    2.67    5.33    2.00   25.74   33.06   47.56   15.97   24.53   13.06
  Doubao-Seed-1.8                 9.00    8.00   16.00    1.33   13.33    8.00   34.74   36.26   51.00   22.28   39.87   27.96
  MiMo-V2-Flash                   3.00    2.00    4.00    2.67    4.00    2.00    8.12    4.52   17.28    5.43    9.40    4.70
  Qwen3-VL-235B-A22B-Instruct     3.33    4.00    4.00    4.00    2.67    2.00   20.52   26.34   35.38   11.69   19.64   14.39
  Qwen3-VL-8B-Instruct            1.00    0.00    2.00    1.33    0.00    2.00    6.64    3.28   15.36    6.39    4.09    5.48
MM-Browse Agent
  GPT-5.2                        36.00   50.00   28.00   33.33   33.33   38.00   57.70   67.23   55.40   49.34   60.49   58.81
  o4-mini                        26.00   22.00   24.00   25.33   34.67   20.00   44.66   43.70   52.11   40.73   48.72   37.94
  GPT-4o                         11.41   14.00   14.00   10.67   14.67    2.00   24.15   22.32   43.36   17.35   25.93   13.04
  Gemini-3-Flash-Preview         23.67   32.00   24.00   18.67   25.33   20.00   47.37   50.84   68.35   40.15   43.68   39.27
  Claude-Sonnet-4.5              22.67   32.00   20.00   21.33   24.00   16.00   54.17   60.94   64.25   45.29   56.73   46.78
  Doubao-Seed-1.8                33.67   42.00   28.00   37.33   38.67   18.00   58.44   52.94   69.70   57.46   66.27   42.42
  MiMo-V2-Flash                  16.67   18.00   12.00   10.67   29.33   10.00   31.33   34.84   35.04   24.45   42.43   17.76
  Qwen3-VL-235B-A22B-Instruct    14.33   16.00   14.00   14.67   17.33    8.00   26.68   28.56   35.93   22.00   28.36   20.05
  Qwen3-VL-8B-Instruct            5.33    2.00    2.00    9.33    8.00    2.00   13.40    8.02   19.30   15.57   15.55    6.38
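Note that the Avg column is not the plain mean of the five domain scores (e.g., the human Browser row averages 67.47 across domains but reports 68.03), which suggests a question-weighted mean over domains of unequal size. A minimal sketch follows; the per-domain question counts in it are our assumption, inferred from the reported numbers rather than stated anywhere on this page.

# Recompute a question-weighted overall average from per-domain scores.
# DOMAIN_COUNTS is an ASSUMED split of the 300 questions (not an official
# figure); replace it with the real counts if the authors release them.
DOMAIN_COUNTS = {"Science": 50, "Technology": 50, "Society": 75,
                 "Culture": 75, "Life": 50}

def weighted_average(domain_scores: dict[str, float],
                     counts: dict[str, int] = DOMAIN_COUNTS) -> float:
    total = sum(counts.values())
    return sum(domain_scores[d] * counts[d] for d in counts) / total

# Example: MM-Browse Agent, GPT-5.2 (SR row from the table above).
sr = {"Science": 50.00, "Technology": 28.00, "Society": 33.33,
      "Culture": 33.33, "Life": 38.00}
print(round(weighted_average(sr), 2))  # -> 36.0, matching the reported Avg of 36.00

With these assumed counts, most rows reproduce their reported Avg exactly; a few differ in the second decimal due to rounding of the per-domain scores, so treat the counts as illustrative only.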

The BrowseComp-V³ Dataset

Overview

We introduce the construction pipeline and statistics of the BrowseComp-V³ benchmark below.

Data Construction Pipeline

[Figure: Data construction pipeline.]

Dataset Statistics

[Figure: Domain statistics.]
[Figure: Dataset statistics.]

Experimental Analysis

Fine-grained Analysis

[Figure: Hop analysis.]
[Figure: Ability heatmap.]

Test-Time Scaling

[Figure: Max-turn scaling.]
[Figure: Pass@k scaling.]
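The page does not state which Pass@k estimator the scaling curve uses; a common choice is the unbiased estimator of Chen et al. (2021), pass@k = E[1 − C(n−c, k) / C(n, k)], computed per question from n sampled attempts with c successes and then averaged over questions. A sketch under that assumption:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n attempts (c of them correct) succeeds."""
    if n - c < k:   # fewer than k failures: any k-subset must contain a success
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 8 attempts per question, 2 correct -> pass@1 = 0.25, pass@4 ≈ 0.786
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 4))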

Failure Mode Analysis

[Figure: Failure mode analysis.]

BibTeX

@article{zhang2026browsecompv3,
  title   = {BrowseComp-V³: A Vertical, Verifiable, and Visual Benchmark for Multimodal Browsing Agents},
  author  = {Huanyao Zhang and Jiepeng Zhou and Bo Li and Bowen Zhou and Yanzhe Dan and Haishan Lu and Zhiyong Cao and Jiaoyang Chen and Yuqian Han and Zinan Sheng and Zhengwei Tao and Hao Liang and Jialong Wu and Yang Shi and Yuanpeng He and Jiaye Lin and Qintong Zhang and Guochen Yan and Runhao Zhao and Zhengpin Li and Xiaohan Yu and Lang Mei and Chong Chen and Wentao Zhang and Bin Cui},
  journal = {arXiv preprint arXiv:2602.12876},
  year    = {2026}
}