MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

54.63% average accuracy across 6 benchmarks

21K multi-hop training examples

Outperforms commercial LLMs (> GPT-5, > Gemini-3-Pro)

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings.

Method Overview

Tool-Augmented QA Agent

A ReAct-style agent with four tools (web search, web reader, Google Lens, and image search) iteratively gathers evidence for multi-hop QA generation.
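To make the loop concrete, here is a minimal sketch of a ReAct-style act/observe cycle over the four tools. It is an illustration only, not the released implementation: the tool functions are stubs, and the names `run_agent`, `scripted_policy`, and `TOOLS` are hypothetical. In a real agent the policy would be an MLLM choosing the next tool and its parameters from the trajectory so far.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stub tools. Real versions would call search and vision APIs.
def web_search(query: str) -> str:
    return f"[stub web results for: {query}]"

def web_reader(url: str) -> str:
    return f"[stub page text from: {url}]"

def google_lens(image_path: str) -> str:
    return f"[stub visual matches for: {image_path}]"

def image_search(query: str) -> str:
    return f"[stub images for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "web_reader": web_reader,
    "google_lens": google_lens,
    "image_search": image_search,
}

Step = Tuple[str, str, str]  # (tool name, argument, observation)
Policy = Callable[[List[Step]], Tuple[str, str]]

def run_agent(policy: Policy, max_steps: int = 8) -> List[Step]:
    """Alternate between choosing an action and recording its observation."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        action, argument = policy(trajectory)
        if action == "finish":
            break
        if action not in TOOLS:
            raise ValueError(f"unknown tool: {action}")
        observation = TOOLS[action](argument)
        trajectory.append((action, argument, observation))
    return trajectory

def scripted_policy(trajectory: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: a fixed two-hop plan, then stop."""
    plan = [
        ("google_lens", "seed.jpg"),
        ("web_search", "entity identified in the image"),
        ("finish", ""),
    ]
    return plan[min(len(trajectory), len(plan) - 1)]

trajectory = run_agent(scripted_policy)
```

The recorded `trajectory` is the kind of evidence trail from which a multi-hop question-answer pair could then be written.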

Multi-Stage Verification

Generated QA pairs are verified for factual correctness, answer uniqueness, temporal stability, and entity dependency through independent checks.
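One way to picture "independent checks" is as a set of predicates that each inspect a QA pair on their own, with a pair kept only if all of them pass. The sketch below rests on stated assumptions: the `QAPair` fields and the check functions are hypothetical placeholders (in practice each check would likely be a model- or retrieval-based judge), and "entity dependency" is read here as "the question cannot be answered without identifying the entity in the image".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QAPair:
    question: str
    answer: str
    evidence: List[str]            # retrieved passages supporting the answer
    candidate_answers: List[str]   # every answer consistent with the evidence
    time_sensitive: bool           # could the answer change over time?
    answerable_without_image: bool # solvable from the question text alone?

# Each check looks at the pair independently of the others.
def factual_correctness(qa: QAPair) -> bool:
    return any(qa.answer in passage for passage in qa.evidence)

def answer_uniqueness(qa: QAPair) -> bool:
    return qa.candidate_answers == [qa.answer]

def temporal_stability(qa: QAPair) -> bool:
    return not qa.time_sensitive

def entity_dependency(qa: QAPair) -> bool:
    return not qa.answerable_without_image

CHECKS: Dict[str, Callable[[QAPair], bool]] = {
    "factual_correctness": factual_correctness,
    "answer_uniqueness": answer_uniqueness,
    "temporal_stability": temporal_stability,
    "entity_dependency": entity_dependency,
}

def verify(qa: QAPair) -> Dict[str, bool]:
    """Run every check and report each result separately."""
    return {name: check(qa) for name, check in CHECKS.items()}

def keep(qa: QAPair) -> bool:
    """A pair survives filtering only if all checks pass."""
    return all(verify(qa).values())
```

Reporting per-check results, rather than a single boolean, makes it easy to see which stage rejects the most candidates.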

RL Training with DAPO

Models are fine-tuned using DAPO with cached tool interactions, enabling efficient training without real-time API calls.
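The caching idea can be sketched as a lookup table keyed by the tool name and its arguments: a response is recorded once, then replayed during training rollouts. This is a minimal illustration under assumptions, not the project's code. The `ToolCache` class, its hashing scheme, and the behavior on a cache miss are all hypothetical, and the DAPO update itself is out of scope here.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional

class ToolCache:
    """Record tool responses once, then replay them without live calls."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.live_calls = 0

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Assumption: identical (tool, args) pairs map to the same response.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(
        self,
        tool: str,
        args: Dict[str, Any],
        live_fn: Optional[Callable[..., Any]] = None,
    ) -> Any:
        key = self._key(tool, args)
        if key in self._store:
            return self._store[key]  # replay path: no API cost
        if live_fn is None:
            # Replay-only mode: a miss cannot be served.
            raise KeyError(f"no cached response for {tool} with {args}")
        self.live_calls += 1
        result = live_fn(**args)
        self._store[key] = result
        return result

cache = ToolCache()
# Collection phase: the (stubbed) live tool is called and recorded.
first = cache.call("web_search", {"query": "tallest lighthouse"},
                   live_fn=lambda query: f"[results for {query}]")
# Training phase: the same request is served from the cache.
replayed = cache.call("web_search", {"query": "tallest lighthouse"})
```

How a cache miss should be handled during training (error, fallback observation, or skipping the rollout) is a design choice this sketch leaves open.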

Main Results

Accuracy (%) on six deep-search benchmarks. Our MTA-DeepSearch-32B achieves the highest average accuracy (54.63%), outperforming GPT-5 and the Gemini models on average under the same tool settings.

Model	MMSrch+	HR-MMSrch	BC-VL	MMSrch	FVQA	MTA-test	Avg
Agent Workflow (same tool setting)
GPT-5	31.61	52.13	51.63	77.65	72.28	25.84	51.86
Gemini-2.5-Pro	30.65	48.20	49.50	77.65	72.33	27.53	50.98
Gemini-3-Pro	33.51	53.20	51.78	82.94	76.67	28.65	54.46
Qwen3-VL-32B-Inst.	14.84	38.69	38.69	68.52	66.94	17.42	40.85
Our Models
MTA-DeepSearch-8B	26.77	47.54	44.36	79.41	73.06	20.79	48.66
MTA-DeepSearch-32B	31.93	53.95	53.77	82.35	76.00	29.78	54.63
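As a quick sanity check on the table, the Avg column can be reproduced as the unweighted mean of the six benchmark scores in each row. The snippet below copies the numbers from the table. The helper names and the 0.01 rounding tolerance are our own choices for illustration.

```python
from typing import Dict, List, Tuple

# (six benchmark scores, reported Avg), copied from the table above.
ROWS: Dict[str, Tuple[List[float], float]] = {
    "GPT-5": ([31.61, 52.13, 51.63, 77.65, 72.28, 25.84], 51.86),
    "Gemini-2.5-Pro": ([30.65, 48.20, 49.50, 77.65, 72.33, 27.53], 50.98),
    "Gemini-3-Pro": ([33.51, 53.20, 51.78, 82.94, 76.67, 28.65], 54.46),
    "Qwen3-VL-32B-Inst.": ([14.84, 38.69, 38.69, 68.52, 66.94, 17.42], 40.85),
    "MTA-DeepSearch-8B": ([26.77, 47.54, 44.36, 79.41, 73.06, 20.79], 48.66),
    "MTA-DeepSearch-32B": ([31.93, 53.95, 53.77, 82.35, 76.00, 29.78], 54.63),
}

def mean_accuracy(scores: List[float]) -> float:
    """Unweighted mean over benchmarks."""
    return sum(scores) / len(scores)

def matches_reported(tolerance: float = 0.01) -> Dict[str, bool]:
    """True where the recomputed mean agrees with the reported Avg."""
    return {
        model: abs(mean_accuracy(scores) - reported) < tolerance
        for model, (scores, reported) in ROWS.items()
    }

def best_model() -> str:
    """Model with the highest recomputed average."""
    return max(ROWS, key=lambda model: mean_accuracy(ROWS[model][0]))

consistent = matches_reported()
winner = best_model()
```

All six rows agree with the reported averages to within rounding. The margin between MTA-DeepSearch-32B and Gemini-3-Pro on the average is 0.17 points, and Gemini-3-Pro remains ahead on MMSrch+, MMSrch, and FVQA individually.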

Key Findings

Deeper Search Behavior

Training increases average search depth from 2.27 to 4.28 steps, enabling more systematic and persistent multi-step retrieval strategies.

Structured Tool Usage

After training, Web Search usage rises to 99% and Reverse Image Search to 79%, forming a consistent two-stage retrieval pipeline.
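For readers who want to compute comparable statistics on their own agent logs, the sketch below shows one plausible way to do it. The definitions are assumptions rather than the paper's exact metrics: search depth is taken as the number of tool calls in a trajectory, and a tool's usage rate as the fraction of trajectories that call it at least once. The tool names and the toy data are made up for illustration.

```python
from typing import List

Trajectory = List[str]  # ordered tool names called while answering one question

def average_depth(trajectories: List[Trajectory]) -> float:
    """Mean number of tool calls per trajectory."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return sum(len(t) for t in trajectories) / len(trajectories)

def tool_usage_rate(trajectories: List[Trajectory], tool: str) -> float:
    """Fraction of trajectories that call `tool` at least once."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    used = sum(1 for t in trajectories if tool in t)
    return used / len(trajectories)

# Toy logs, not real data.
toy_logs: List[Trajectory] = [
    ["reverse_image_search", "web_search", "web_reader"],
    ["web_search", "web_search"],
    ["reverse_image_search", "web_search", "web_reader", "web_search"],
    ["web_search"],
]

depth = average_depth(toy_logs)
web_rate = tool_usage_rate(toy_logs, "web_search")
lens_rate = tool_usage_rate(toy_logs, "reverse_image_search")
```

Comparing these numbers before and after training, on the same set of questions, is what lets a shift such as the reported 2.27 to 4.28 steps be attributed to the training itself.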

Cost-Efficient Replay Training

Cached interaction replay enables effective RL training without real-time tool calls, significantly reducing training cost.

BibTeX

@article{peng2026mtaagent,
  title     = {MTA-Agent: An Open Recipe for Multimodal Deep Search Agents},
  author    = {Peng, Xiangyu and Qin, Can and Yan, An and Yang, Xinyi and Chen, Zeyuan and Xu, Ran and Wu, Chien-Sheng},
  journal   = {arXiv preprint arXiv:2604.06376},
  year      = {2026}
}

Contact

Name: Xiangyu Peng
Email: xiangyupeng1994@gmail.com
