{Seminar} @ CDS: #102, October 9th, 12:00: “Beyond Models: Rethinking Benchmarks, Data, and Evaluation for Retrieval-Augmented Generation”

When

9 Oct 25    
12:00 PM - 1:00 PM

Event Type

Department of Computational and Data Sciences
Department Seminar


Speaker : Nandan Thakur, PhD candidate at the University of Waterloo
Title   : “Beyond Models: Rethinking Benchmarks, Data, and Evaluation for Retrieval-Augmented Generation”
Date & Time : October 9th, 2025 (Thursday), 12:00 noon
Venue : # 102, CDS Seminar Hall


ABSTRACT

Retrieval systems face two major bottlenecks that limit progress: in-domain overfitting caused by unrealistic benchmarks, and the scarcity of high-quality training data. To address the overfitting challenge, we introduced two benchmarks: BEIR, which focuses on “zero-shot” evaluation of out-of-domain accuracy, and MIRACL, which measures out-of-language accuracy. Retrieval-Augmented Generation (RAG) has since emerged as a way to extend the boundaries of the parametric knowledge in LLMs by integrating external, up-to-date information through a retrieval stage before the LLM generates its response. Yet real-world demands on RAG applications have shifted, and evaluation metrics and benchmarks must evolve to better capture retrieval diversity and recall. I will present FreshStack, a benchmark of complex technical questions asked by users in niche programming domains, designed to minimize contamination from LLM pretraining corpora. Next, I will share findings from the TREC 2024 RAG track, which investigates nugget- and support-based evaluation: comparing human and LLM judges on whether answers contain the necessary facts and whether cited documents truly support those answers, while extending the evaluation framework across languages to measure LLM hallucination. Another persistent challenge is training data scarcity, which we address with approaches such as GPL and SWIM-IR that generate large, high-quality synthetic datasets. Finally, I will discuss training data quality, observing that more data is not always better: relabeling false hard negatives curates the training data and improves out-of-distribution retrieval accuracy. I will conclude with a vision for constructing complex benchmarks that support agentic retrieval systems capable of decomposing and solving multi-step information-seeking tasks.
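
For readers unfamiliar with the “zero-shot” evaluation setup that BEIR popularized, the short sketch below is a minimal, illustrative example using the open-source beir Python toolkit: it scores an off-the-shelf dense retriever on the SciFact test split without any in-domain training. The specific dataset, model checkpoint, and download location are illustrative assumptions, not details taken from the talk.

    from beir import util
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    # Download and load one BEIR dataset (SciFact) that the retriever was never trained on.
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

    # Wrap an off-the-shelf dense retriever (illustrative checkpoint) and run exact search.
    model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
    retriever = EvaluateRetrieval(model, score_function="dot")
    results = retriever.retrieve(corpus, queries)

    # Report standard ranking metrics (nDCG@k, MAP@k, Recall@k, P@k) against the relevance judgments.
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    print(ndcg)

Because no SciFact training data is used, the resulting numbers reflect out-of-domain (“zero-shot”) accuracy, which is exactly the failure mode the abstract contrasts with in-domain overfitting.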

BIO: Nandan Thakur is a PhD candidate at the University of Waterloo, advised by Prof. Jimmy Lin. His research broadly investigates constructing challenging benchmarks for robust evaluation and generating synthetic training data, with a focus on improving NLP and retrieval systems across domains and languages. His most prominent work, the BEIR benchmark, is a leading industry standard for benchmarking state-of-the-art retrieval models from organizations including Google, Microsoft, OpenAI, and several startups. Nandan led the Retrieval-Augmented Generation (RAG) track at TREC 2024, with over 50 participating teams, and is currently co-leading the TREC 2025 RAG track. His work has received over 2,400 citations and has been published in top conferences and journals, including NAACL, ICLR, EMNLP, NeurIPS, SIGIR, and TACL.

Webpage: https://thakur-nandan.github.io

Host Faculty: Dr. Danish Pruthi


ALL ARE WELCOME