{Seminar} @ CDS: #102, October 9th, 12:00: “Beyond Models: Rethinking Benchmarks, Data, and Evaluation for Retrieval-Augmented Generation”

When

9 Oct 25    
12:00 PM - 1:00 PM

Event Type

Department of Computational and Data Sciences
Department Seminar


Speaker : Nandan Thakur, PhD candidate at the University of Waterloo
Title   : “Beyond Models: Rethinking Benchmarks, Data, and Evaluation for Retrieval-Augmented Generation”
Date & Time : October 9th, 2025 (Thursday), 12:00 noon
Venue : # 102, CDS Seminar Hall


ABSTRACT

Retrieval systems face two major bottlenecks that limit progress: in-domain overfitting caused by unrealistic benchmarks, and the scarcity of high-quality training data. To address the overfitting challenge, we introduced two benchmarks: BEIR, which focuses on “zero-shot” evaluation of out-of-domain accuracy, and MIRACL, which measures out-of-language accuracy. Retrieval-Augmented Generation (RAG) has since emerged as a way to extend the boundaries of the parametric knowledge in LLMs by integrating external, up-to-date information through a retrieval stage before the LLM generates its response. Yet real-world demands on RAG applications have shifted, and evaluation metrics and benchmarks must evolve to better capture retrieval diversity and recall. I will present FreshStack, a benchmark of complex technical questions asked by users in niche programming domains, designed to minimize contamination from LLM pretraining corpora. Next, I will share findings from the TREC 2024 RAG track, which investigates nugget- and support-based evaluation: comparing human and LLM judges on whether answers contain the necessary facts and whether cited documents truly support those answers, while extending the evaluation framework across languages to measure LLM hallucination. Another persistent challenge is training data scarcity, which we address with approaches such as GPL and SWIM-IR that generate large, high-quality synthetic datasets. Finally, I will discuss training data quality, observing that more data is not always better: relabeling false hard negatives curates the training data and improves out-of-distribution retrieval accuracy. I will conclude with a vision for constructing complex benchmarks that support agentic retrieval systems capable of decomposing and solving multi-step information-seeking tasks.
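
For readers unfamiliar with the “zero-shot” evaluation setup that BEIR popularized, the short sketch below is a minimal, illustrative example using the open-source beir Python toolkit: it scores an off-the-shelf dense retriever on the SciFact test split without any in-domain training. The specific dataset, model checkpoint, and download location are illustrative assumptions, not details taken from the talk.

    from beir import util
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    # Download and load one BEIR dataset (SciFact) that the retriever was never trained on.
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

    # Wrap an off-the-shelf dense retriever (illustrative checkpoint) and run exact search.
    model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
    retriever = EvaluateRetrieval(model, score_function="dot")
    results = retriever.retrieve(corpus, queries)

    # Report standard ranking metrics (nDCG@k, MAP@k, Recall@k, P@k) against the relevance judgments.
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    print(ndcg)

Because no SciFact training data is used, the resulting numbers reflect out-of-domain (“zero-shot”) accuracy, which is exactly the failure mode the abstract contrasts with in-domain overfitting.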

BIO: Nandan Thakur is a PhD candidate at the University of Waterloo, advised by Prof. Jimmy Lin. His research broadly investigates constructing challenging benchmarks for robust evaluation and generating synthetic training data, with a focus on improving NLP and retrieval systems across domains and languages. His most prominent work, the BEIR benchmark, is a leading industry standard for benchmarking state-of-the-art retrieval models from organizations including Google, Microsoft, OpenAI, and several startups. Nandan led the Retrieval-Augmented Generation (RAG) track at TREC 2024, with over 50 participating teams, and is currently co-leading the TREC 2025 RAG track. His work has received over 2,400 citations and has been published in top conferences and journals, including NAACL, ICLR, EMNLP, NeurIPS, SIGIR, and TACL.

Webpage: https://thakur-nandan.github.io

Host Faculty: Dr. Danish Pruthi


ALL ARE WELCOME