SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Preprint     Under review at ACL ARR 2026     Sep. 2025 – Feb. 2026

Research assistant. With Prof. Chen Zhao.

arXiv


Overview

Deep research agents have emerged as powerful systems for addressing complex queries, yet the role of retrieval in these systems remains underexplored. We introduce SAGE, a benchmark for reasoning-intensive scientific literature retrieval.


Contributions

  • Built SAGE, a benchmark for reasoning-intensive scientific literature retrieval with 1,200 queries across four domains and a 200K-paper corpus, supporting short-form and open-ended questions.
  • Evaluated six deep research agents under both web search and corpus retrieval settings, finding that BM25 outperforms LLM-based retrievers by ~30% because the agents generate keyword-oriented sub-queries.
  • Proposed a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, improving retrieval accuracy by 8% on short-form questions and 2% on open-ended questions.
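
The corpus-level augmentation idea in the last bullet can be illustrated with a minimal sketch. This is not the SAGE implementation: the `augment` helper, the toy documents, and the hard-coded keyword lists (standing in for LLM output) are all hypothetical, and the scorer is a textbook Okapi BM25 variant written out so the example has no dependencies. It shows only why appending keywords to a document can help a lexical retriever match keyword-oriented sub-queries.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use something stronger.
    return text.lower().split()


def augment(doc, keywords):
    # Hypothetical helper: append metadata and keywords to the indexed text.
    # In the framework described above the keywords would come from an LLM;
    # here they are passed in directly.
    parts = [doc["title"], doc["abstract"], doc.get("venue", ""), " ".join(keywords)]
    return " ".join(p for p in parts if p)


def bm25_scores(query, texts, k1=1.5, b=0.75):
    # Textbook Okapi BM25 over a small in-memory corpus.
    tokenized = [tokenize(t) for t in texts]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (illustrative only, not drawn from the SAGE corpus).
docs = [
    {"title": "Paper A", "abstract": "a sequence model built on attention layers", "venue": "Venue X"},
    {"title": "Paper B", "abstract": "probabilistic ranking of text documents", "venue": "Venue Y"},
]
# Stand-in for LLM-generated keywords, one list per document.
keywords = [["transformer", "self-attention"], ["bm25", "retrieval"]]

query = "transformer"
plain_scores = bm25_scores(query, [d["title"] + " " + d["abstract"] for d in docs])
augmented_scores = bm25_scores(query, [augment(d, k) for d, k in zip(docs, keywords)])
print(plain_scores, augmented_scores)
```

In this toy setup the query term appears in neither raw document, so the unaugmented index cannot rank Paper A above Paper B; after augmentation only Paper A receives a nonzero score.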