Multi-modal Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis
Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings.
A ReAct-style agent with 4 tools (web search, web reader, Google Lens, image search) iteratively gathers evidence for multi-hop QA generation.
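The iterative evidence-gathering loop described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the tool functions are stubs, and `scripted_policy`, `gather_evidence`, and `max_steps` are hypothetical names. In a real agent an MLLM would choose each action instead of a fixed plan.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the four tools; real versions would call external services.
def web_search(query: str) -> str:
    return f"[search results for: {query}]"

def web_reader(url: str) -> str:
    return f"[page text from: {url}]"

def google_lens(image_path: str) -> str:
    return f"[visual matches for: {image_path}]"

def image_search(query: str) -> str:
    return f"[images for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "web_reader": web_reader,
    "google_lens": google_lens,
    "image_search": image_search,
}

Step = Tuple[str, str, str]  # (tool name, tool argument, observation)

def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Assumption: a fixed plan replaces the model that would normally pick actions."""
    plan = [
        ("google_lens", "seed.jpg"),
        ("web_search", "entity found in image"),
        ("web_reader", "https://example.com/page"),
        ("finish", ""),
    ]
    return plan[min(len(history), len(plan) - 1)]

def gather_evidence(policy: Callable[[List[Step]], Tuple[str, str]],
                    max_steps: int = 8) -> List[Step]:
    """Alternate between choosing an action and observing its result."""
    history: List[Step] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "finish":
            break
        observation = TOOLS[tool](arg)
        history.append((tool, arg, observation))
    return history

trajectory = gather_evidence(scripted_policy)
for tool, arg, observation in trajectory:
    print(tool, "->", observation)
```

The collected `trajectory` is the kind of evidence trail from which a multi-hop question and answer could then be written.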
Generated QA pairs are verified for factual correctness, answer uniqueness, temporal stability, and entity dependency through independent checks.
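One way to picture the four independent checks is as separate predicates that must all pass. The sketch below uses toy string heuristics purely for illustration: the `QAPair` fields, the keyword list, and the reading of "entity dependency" as "the question must not name the seed entity directly" are assumptions, not the authors' verification logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QAPair:
    question: str
    answer: str
    evidence: List[str]           # retrieved passages supporting the answer
    candidate_answers: List[str]  # answers proposed by independent solvers (assumed)
    seed_entity: str              # entity visible only in the image (assumed)

def factual_correctness(qa: QAPair) -> bool:
    # Toy check: the answer must appear in at least one evidence passage.
    return any(qa.answer.lower() in passage.lower() for passage in qa.evidence)

def answer_uniqueness(qa: QAPair) -> bool:
    # Toy check: every independently proposed answer matches the reference answer.
    proposed = {a.strip().lower() for a in qa.candidate_answers}
    return proposed == {qa.answer.strip().lower()}

def temporal_stability(qa: QAPair) -> bool:
    # Toy check: reject questions whose answer is likely to change over time.
    time_sensitive = ("current", "latest", "today", "this year")
    return not any(word in qa.question.lower() for word in time_sensitive)

def entity_dependency(qa: QAPair) -> bool:
    # Toy check (assumed meaning): the question must not reveal the image entity.
    return qa.seed_entity.lower() not in qa.question.lower()

CHECKS: Dict[str, Callable[[QAPair], bool]] = {
    "factual_correctness": factual_correctness,
    "answer_uniqueness": answer_uniqueness,
    "temporal_stability": temporal_stability,
    "entity_dependency": entity_dependency,
}

def verify(qa: QAPair) -> Dict[str, bool]:
    """Run every check independently and report each result."""
    return {name: check(qa) for name, check in CHECKS.items()}

def accept(qa: QAPair) -> bool:
    return all(verify(qa).values())

good = QAPair(
    question="In which year was the engineer the tower in the image is named after born?",
    answer="1832",
    evidence=["Gustave Eiffel was born in 1832 in Dijon."],
    candidate_answers=["1832", "1832"],
    seed_entity="Eiffel Tower",
)
bad = QAPair(
    question="Who is the current owner of the Eiffel Tower?",
    answer="City of Paris",
    evidence=[],
    candidate_answers=["City of Paris", "SETE"],
    seed_entity="Eiffel Tower",
)
print(verify(good))
print(verify(bad))
```

Keeping the checks separate means a rejected example reports which criterion it failed, which is useful when tuning the generator.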
Models are fine-tuned using DAPO with cached tool interactions, enabling efficient training without real-time API calls.
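The idea of training against cached tool interactions can be illustrated with a small record-and-replay store. This is a sketch under assumptions: `ToolCache`, `call_tool`, the hashing scheme, and the cache-miss placeholder are hypothetical and are not taken from the paper or its code.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional

class ToolCache:
    """Stores tool results keyed by tool name and arguments (hypothetical design)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Sorting keys makes the hash independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, tool: str, args: Dict[str, Any], result: str) -> None:
        self._store[self._key(tool, args)] = result

    def replay(self, tool: str, args: Dict[str, Any]) -> Optional[str]:
        return self._store.get(self._key(tool, args))

def call_tool(cache: ToolCache, tool: str, args: Dict[str, Any],
              live_fn: Optional[Callable[..., str]] = None) -> str:
    """Serve from the cache when possible; go live only if a function is supplied."""
    cached = cache.replay(tool, args)
    if cached is not None:
        return cached
    if live_fn is None:
        # Assumption: during offline training a miss returns a placeholder observation.
        return "[cache miss: no recorded result]"
    result = live_fn(**args)
    cache.record(tool, args, result)
    return result

def fake_search(query: str) -> str:
    # Stand-in for a real search API, used only while collecting data.
    return f"[results for {query}]"

cache = ToolCache()
first = call_tool(cache, "web_search", {"query": "MTA-Agent"}, live_fn=fake_search)
second = call_tool(cache, "web_search", {"query": "MTA-Agent"})  # replayed offline
missing = call_tool(cache, "web_search", {"query": "unseen query"})
print(first == second, missing)
```

With a store like this, rollouts during training only ever read recorded results, so no external API is contacted.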
Accuracy (%) on six deep-search benchmarks. Our MTA-DeepSearch-32B achieves the highest average accuracy (54.63%), ahead of GPT-5 and the Gemini models under the same tool settings.
| Model | MMSrch+ | HR-MMSrch | BC-VL | MMSrch | FVQA | MTA-test | Avg |
|---|---|---|---|---|---|---|---|
| Agent Workflow (same tool setting) | | | | | | | |
| GPT-5 | 31.61 | 52.13 | 51.63 | 77.65 | 72.28 | 25.84 | 51.86 |
| Gemini-2.5-Pro | 30.65 | 48.20 | 49.50 | 77.65 | 72.33 | 27.53 | 50.98 |
| Gemini-3-Pro | 33.51 | 53.20 | 51.78 | 82.94 | 76.67 | 28.65 | 54.46 |
| Qwen3-VL-32B-Inst. | 14.84 | 38.69 | 38.69 | 68.52 | 66.94 | 17.42 | 40.85 |
| Our Models | | | | | | | |
| MTA-DeepSearch-8B | 26.77 | 47.54 | 44.36 | 79.41 | 73.06 | 20.79 | 48.66 |
| MTA-DeepSearch-32B | 31.93 | 53.95 | 53.77 | 82.35 | 76.00 | 29.78 | 54.63 |
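As a quick sanity check, the Avg column can be recomputed as the unweighted mean of the six benchmark scores in each row. The snippet below copies the numbers from the table; the variable names and the rounding tolerance are illustrative choices.

```python
# Per-benchmark accuracies copied from the table, followed by the reported average.
rows = {
    "GPT-5":              ([31.61, 52.13, 51.63, 77.65, 72.28, 25.84], 51.86),
    "Gemini-2.5-Pro":     ([30.65, 48.20, 49.50, 77.65, 72.33, 27.53], 50.98),
    "Gemini-3-Pro":       ([33.51, 53.20, 51.78, 82.94, 76.67, 28.65], 54.46),
    "Qwen3-VL-32B-Inst.": ([14.84, 38.69, 38.69, 68.52, 66.94, 17.42], 40.85),
    "MTA-DeepSearch-8B":  ([26.77, 47.54, 44.36, 79.41, 73.06, 20.79], 48.66),
    "MTA-DeepSearch-32B": ([31.93, 53.95, 53.77, 82.35, 76.00, 29.78], 54.63),
}

means = {name: sum(scores) / len(scores) for name, (scores, _) in rows.items()}

# Each recomputed mean should agree with the reported value up to rounding.
consistent = all(abs(means[name] - reported) < 0.01
                 for name, (_, reported) in rows.items())

best_by_average = max(means, key=means.get)
print(consistent, best_by_average)
```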
Training increases average search depth from 2.27 to 4.28 steps, enabling more systematic and persistent multi-step retrieval strategies.
After training, Web Search usage rises to 99% and Reverse Image Search to 79%, forming a consistent two-stage retrieval pipeline.
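Statistics such as average search depth and per-tool usage can be derived from logged trajectories along the following lines. The sketch assumes that depth means the number of tool calls in a trajectory and that usage means the share of trajectories calling a tool at least once; the toy trajectories are invented and do not reproduce the reported figures.

```python
from collections import Counter
from typing import Dict, List

def average_depth(trajectories: List[List[str]]) -> float:
    """Mean number of tool calls per trajectory (assumed definition of depth)."""
    return sum(len(t) for t in trajectories) / len(trajectories)

def tool_usage_rate(trajectories: List[List[str]]) -> Dict[str, float]:
    """Fraction of trajectories that call each tool at least once (assumed definition)."""
    counts: Counter = Counter()
    for trajectory in trajectories:
        counts.update(set(trajectory))  # count each tool once per trajectory
    return {tool: n / len(trajectories) for tool, n in counts.items()}

# Invented example logs: each inner list is the ordered tool calls of one episode.
toy_trajectories = [
    ["reverse_image_search", "web_search", "web_reader", "web_search"],
    ["web_search", "web_reader"],
    ["reverse_image_search", "web_search", "web_search",
     "web_reader", "web_search", "web_reader"],
]

depth = average_depth(toy_trajectories)
usage = tool_usage_rate(toy_trajectories)
print(depth, usage)
```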
Cached interaction replay enables effective RL training without real-time tool calls, significantly reducing training cost.
@article{peng2026mtaagent,
title = {MTA-Agent: An Open Recipe for Multimodal Deep Search Agents},
author = {Peng, Xiangyu and Qin, Can and Yan, An and Yang, Xinyi and Chen, Zeyuan and Xu, Ran and Wu, Chien-Sheng},
journal = {arXiv preprint arXiv:2604.06376},
year = {2026}
}
Name: Xiangyu Peng
Email: xiangyupeng1994@gmail.com