MTA-Agent: An Open Recipe for
Multimodal Deep Search Agents

Multi-modal Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis

Salesforce AI Research
54.63%: average accuracy across 6 benchmarks
21K: multi-hop training examples
Outperforms commercial LLMs: > GPT-5, > Gemini-3-Pro

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings.

Method Overview

Tool-Augmented QA Agent

A ReAct-style agent with 4 tools (web search, web reader, Google Lens, image search) iteratively gathers evidence for multi-hop QA generation.
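
The loop below is a minimal sketch of a ReAct-style agent, not the authors' implementation. The four tool functions are stubs that return placeholder strings, and `scripted_policy` is a hypothetical stand-in for the multimodal model that would normally choose each tool and its parameters.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stub tools (assumption): real versions would call external services.
def web_search(query: str) -> str:
    return f"[stub search results for: {query}]"

def web_reader(url: str) -> str:
    return f"[stub page text from: {url}]"

def google_lens(image_path: str) -> str:
    return f"[stub visual matches for: {image_path}]"

def image_search(query: str) -> str:
    return f"[stub image results for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "web_reader": web_reader,
    "google_lens": google_lens,
    "image_search": image_search,
}

Step = Tuple[str, str, str, str]  # (thought, action, argument, observation)

def scripted_policy(history: List[Step]) -> Tuple[str, str, str]:
    """Hypothetical policy: returns (thought, action, argument).

    A real agent would prompt the model with the history instead.
    """
    plan = [
        ("Identify the entity in the image.", "google_lens", "seed.jpg"),
        ("Look up facts about the entity.", "web_search", "entity founding year"),
        ("Enough evidence gathered.", "finish", "stub answer"),
    ]
    return plan[min(len(history), len(plan) - 1)]

def run_agent(policy, max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Alternate thought -> action -> observation until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        if action == "finish":
            return argument, history
        observation = TOOLS[action](argument)
        history.append((thought, action, argument, observation))
    return None, history  # step budget exhausted without an answer

answer, trace = run_agent(scripted_policy)
```

The returned `trace` is the kind of evidence trajectory from which a multi-hop question-answer pair could then be written.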

Multi-Stage Verification

Generated QA pairs are verified for factual correctness, answer uniqueness, temporal stability, and entity dependency through independent checks.
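
One way to picture independent checks is as a set of separate predicates that must all pass. The sketch below is illustrative only: the `QAPair` fields and every heuristic inside the four check functions are assumptions made for this example, and in particular "entity dependency" is read here as "the question must not name the image entity outright". The actual checks in the pipeline are likely model-based.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QAPair:
    # Hypothetical record layout, not the dataset's real schema.
    question: str
    answer: str
    evidence: List[str]           # retrieved text snippets
    candidate_answers: List[str]  # answers from repeated solving attempts
    image_entity: str             # entity the image is supposed to reveal

def check_factual(qa: QAPair) -> bool:
    # Placeholder: the answer string appears in at least one evidence snippet.
    return any(qa.answer.lower() in e.lower() for e in qa.evidence)

def check_unique(qa: QAPair) -> bool:
    # Placeholder: all solving attempts agree on a single normalized answer.
    return len({a.strip().lower() for a in qa.candidate_answers}) == 1

def check_temporal(qa: QAPair) -> bool:
    # Placeholder: reject questions whose answer may change over time.
    volatile = ("current", "latest", "today", "this year")
    return not any(word in qa.question.lower() for word in volatile)

def check_entity_dependency(qa: QAPair) -> bool:
    # Placeholder: the question must not leak the entity shown in the image.
    return qa.image_entity.lower() not in qa.question.lower()

CHECKS: Dict[str, Callable[[QAPair], bool]] = {
    "factual": check_factual,
    "unique": check_unique,
    "temporal": check_temporal,
    "entity_dependency": check_entity_dependency,
}

def verify(qa: QAPair) -> Dict[str, bool]:
    """Run every check independently and report each outcome."""
    return {name: check(qa) for name, check in CHECKS.items()}

def keep(qa: QAPair) -> bool:
    """A pair survives filtering only if all checks pass."""
    return all(verify(qa).values())
```

Reporting each outcome separately, rather than stopping at the first failure, makes it easy to see which stage rejects the most examples.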

RL Training with DAPO

Models are fine-tuned using DAPO with cached tool interactions, enabling efficient training without real-time API calls.
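
The idea of training against cached tool interactions can be sketched as a lookup table keyed by the tool name and its arguments. This is an assumption-laden illustration: the `ReplayCache` class, its key format, and the cache-miss message are all hypothetical, since the page does not say how the cache is keyed or how misses are handled.

```python
import hashlib
import json
from typing import Any, Dict

class ReplayCache:
    """Hypothetical store of tool observations recorded ahead of training."""

    MISS = "[no cached result for this tool call]"  # assumed miss behavior

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Sort keys so that argument order does not change the cache key.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, tool: str, args: Dict[str, Any], observation: str) -> None:
        """Save an observation obtained from a real tool call."""
        self._store[self._key(tool, args)] = observation

    def replay(self, tool: str, args: Dict[str, Any]) -> str:
        """Return the saved observation without touching any external API."""
        return self._store.get(self._key(tool, args), self.MISS)

cache = ReplayCache()
cache.record("web_search", {"query": "tower architect", "top_k": 5}, "stub results")

# During RL rollouts the environment would answer from the cache.
hit = cache.replay("web_search", {"top_k": 5, "query": "tower architect"})
miss = cache.replay("web_search", {"query": "unseen query"})
```

Because replay is deterministic, repeated rollouts that issue the same call see the same observation, which is what removes the need for real-time API calls in this sketch.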

Main Results

Accuracy (%) on six deep-search benchmarks. Our MTA-DeepSearch-32B achieves the highest average accuracy, outperforming GPT-5 and the Gemini models on average under the same tool settings.

Model                 MMSrch+  HR-MMSrch  BC-VL  MMSrch  FVQA   MTA-test  Avg
Agent Workflow (same tool setting)
GPT-5                 31.61    52.13      51.63  77.65   72.28  25.84     51.86
Gemini-2.5-Pro        30.65    48.20      49.50  77.65   72.33  27.53     50.98
Gemini-3-Pro          33.51    53.20      51.78  82.94   76.67  28.65     54.46
Qwen3-VL-32B-Inst.    14.84    38.69      38.69  68.52   66.94  17.42     40.85
Our Models
MTA-DeepSearch-8B     26.77    47.54      44.36  79.41   73.06  20.79     48.66
MTA-DeepSearch-32B    31.93    53.95      53.77  82.35   76.00  29.78     54.63
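
The Avg column appears to be the unweighted mean of the six benchmark scores. The short check below recomputes it from the per-benchmark values in the table; the variable names are illustrative, and the tolerance allows for the two-decimal rounding of the reported figures.

```python
# Per-benchmark accuracies in table order:
# MMSrch+, HR-MMSrch, BC-VL, MMSrch, FVQA, MTA-test
scores = {
    "GPT-5":              [31.61, 52.13, 51.63, 77.65, 72.28, 25.84],
    "Gemini-2.5-Pro":     [30.65, 48.20, 49.50, 77.65, 72.33, 27.53],
    "Gemini-3-Pro":       [33.51, 53.20, 51.78, 82.94, 76.67, 28.65],
    "Qwen3-VL-32B-Inst.": [14.84, 38.69, 38.69, 68.52, 66.94, 17.42],
    "MTA-DeepSearch-8B":  [26.77, 47.54, 44.36, 79.41, 73.06, 20.79],
    "MTA-DeepSearch-32B": [31.93, 53.95, 53.77, 82.35, 76.00, 29.78],
}

# Avg column as reported in the table.
reported_avg = {
    "GPT-5": 51.86,
    "Gemini-2.5-Pro": 50.98,
    "Gemini-3-Pro": 54.46,
    "Qwen3-VL-32B-Inst.": 40.85,
    "MTA-DeepSearch-8B": 48.66,
    "MTA-DeepSearch-32B": 54.63,
}

# Unweighted mean over the six benchmarks for each model.
computed_avg = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# True when the recomputed mean matches the reported value up to rounding.
matches = {
    model: abs(computed_avg[model] - reported_avg[model]) < 0.006
    for model in scores
}

best_model = max(computed_avg, key=computed_avg.get)
```

The per-benchmark rows also show that the lead is an average one: Gemini-3-Pro scores higher on MMSrch+, MMSrch, and FVQA, while MTA-DeepSearch-32B leads on HR-MMSrch, BC-VL, and MTA-test.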

Key Findings

Deeper Search Behavior

Training increases average search depth from 2.27 to 4.28 steps, enabling more systematic and persistent multi-step retrieval strategies.

Structured Tool Usage

After training, Web Search usage rises to 99% and Reverse Image Search to 79%, forming a consistent two-stage retrieval pipeline.
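
Statistics like the search depth and tool usage figures above can be derived from logged trajectories. The sketch below assumes, without confirmation from the source, that depth is the number of tool calls in a trajectory and that a tool's usage rate is the fraction of trajectories calling it at least once. The `toy` trajectories are made-up inputs for illustration, not data from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def summarize(trajectories: List[List[str]]) -> Tuple[float, Dict[str, float]]:
    """Return (average depth, per-tool usage rate) for a list of tool-call sequences."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    n = len(trajectories)
    # Assumed definition: depth = number of tool calls in the trajectory.
    avg_depth = sum(len(t) for t in trajectories) / n
    # Assumed definition: a tool counts once per trajectory that uses it.
    used: Counter = Counter()
    for t in trajectories:
        used.update(set(t))
    usage = {tool: count / n for tool, count in used.items()}
    return avg_depth, usage

# Toy trajectories (illustrative only).
toy = [
    ["google_lens", "web_search", "web_reader"],
    ["web_search", "web_search"],
    ["google_lens", "web_search", "web_search", "web_reader"],
    ["image_search"],
]

avg_depth, usage = summarize(toy)
```

Under these definitions a tool called twice in one trajectory adds two to the depth but only one to its usage count, so the two statistics capture different aspects of behavior.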

Cost-Efficient Replay Training

Cached interaction replay enables effective RL training without real-time tool calls, significantly reducing training cost.

BibTeX

@article{peng2026mtaagent,
  title     = {MTA-Agent: An Open Recipe for Multimodal Deep Search Agents},
  author    = {Peng, Xiangyu and Qin, Can and Yan, An and Yang, Xinyi and Chen, Zeyuan and Xu, Ran and Wu, Chien-Sheng},
  journal   = {arXiv preprint arXiv:2604.06376},
  year      = {2026}
}

Contact

Name: Xiangyu Peng
Email: xiangyupeng1994@gmail.com