SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
| Preprint | Under review at ACL ARR 2026 | Sep. 2025 – Feb. 2026 |
Research assistant. With Prof. Chen Zhao.
Overview
Deep research agents have emerged as powerful systems for addressing complex queries, yet the role of retrieval remains underexplored. We introduce SAGE, a benchmark for scientific literature retrieval.
Contributions
- Built SAGE, a benchmark for reasoning-intensive scientific literature retrieval with 1,200 queries across four domains and a 200K-paper corpus, supporting short-form and open-ended questions.
- Evaluated six deep research agents under both web search and corpus retrieval settings, revealing that BM25 outperforms LLM-based retrievers by ~30% due to keyword-oriented sub-query generation.
- Proposed a corpus-level test-time scaling framework that augments documents with metadata and keywords using LLMs, improving retrieval accuracy by +8% (short-form) and +2% (open-ended).