BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//wp-events-plugin.com//7.2.3.1//EN
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
UID:150@cds.iisc.ac.in
DTSTART;TZID=Asia/Kolkata:20251009T120000
DTEND;TZID=Asia/Kolkata:20251009T130000
DTSTAMP:20250929T134925Z
URL:https://cds.iisc.ac.in/events/seminar-cds-102-october-09th-1200-beyond
 -models-rethinking-benchmarks-data-and-evaluation-for-retrieval-augmented-
 generation/
SUMMARY:{Seminar} @ CDS: #102\, October 09th : 12:00: "Beyond Models: Rethi
 nking Benchmarks\, Data\, and Evaluation for Retrieval-Augmented Generatio
 n."
DESCRIPTION:Department of Computational and Data Sciences\nDepartment Semin
 ar\n\n\n\nSpeaker : Nandan Thakur\, PhD candidate at the University of Wat
 erloo\nTitle   : "Beyond Models: Rethinking Benchmarks\, Data\, and Eval
 uation for Retrieval-Augmented Generation"\nDate & Time : October 09t
 h\, 2025 (Thursday)\, 12:00 noon\nVenue : # 102\, CDS Seminar Hall\n\n\n\n
 ABSTRACT\n\nRetrieval systems face two major bottlenecks limiting progress
 : in-domain overfitting due to unrealistic benchmarks\, and the scarcity o
 f high-quality training data. To address the overfitting challenge\, we in
 troduced two benchmarks: BEIR\, focused on “zero-shot” evaluation for 
 out-of-domain accuracy\, and MIRACL\, designed to measure out-of-language 
 accuracy. Retrieval-Augmented Generation (RAG) has since emerged as a meth
 od to extend the theoretical boundaries of parametric knowledge in LLMs by
  integrating external\, up-to-date information through a retrieval stage b
 efore the LLM’s response. Yet\, real-world demands in RAG applications h
 ave shifted\, and evaluation metrics and benchmarks must evolve to better 
 capture retrieval diversity and recall. I will present FreshStack\, a benc
 hmark with complex technical questions asked by users in niche programming
  domains\, designed to minimize contamination from LLM pretraining corpora
 . Next\, I will share findings from the TREC 2024 RAG track\, which invest
 igates nugget- and support-based evaluation. This includes comparing human
  and LLM judges on whether answers contain the necessary facts and whether
 cited documents truly support those answers\, while extending the evalu
 ation framework across languages to measure LLM hallucinations. Anothe
 r persistent challenge lies in training data scarcity\, which we addres
 s usi
 ng approaches such as GPL and SWIM-IR to generate high-quality and large s
 ynthetic datasets. Finally\, I will discuss training data quality\, obser
 ving that more data is not always better\, and that relabeling false har
 d negatives curates the training data and improves out-of-distribution r
 etrieval accuracy. I will conclude with a future vision for constructin
 g complex ben
 chmarks that support agentic retrieval systems\, capable of decomposing an
 d solving multi-step information-seeking tasks.\n\nBIO: Nandan Thakur i
 s a PhD candidate at the University of Waterloo\, advised by Prof. Jimm
 y Lin. His research broadly investigates constructing challenging bench
 marks for robust evaluation and generating synthetic training data\, wi
 th a focus on improving NLP and retrieval systems across domains and la
 nguages. His most prominent work\, the BEIR benchmark\, is a leading in
 dustry standard for evaluating state-of-the-art retrieval models and i
 s used by Google\, Microsoft\, OpenAI\, and other companies. Nandan le
 d the Retrieval-Augmented Generation (RAG) track in TREC 2024\, with ov
 er 50 teams participating\, and is currently co-leading the TREC RAG 20
 25 Track. He has over 2400 citations\, and his research has been publis
 hed in top conferences and journals\, including NAACL\, ICLR\, EMNL
 P\, NeurIPS\, SIGIR\, and TACL.\n\nWebpage: https:
 //thakur-nandan.github.io\n\nHost Faculty: Dr. Danish Pruthi\n\n\n\nALL AR
 E WELCOME
CATEGORIES:Events,Talks
END:VEVENT
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
X-LIC-LOCATION:Asia/Kolkata
BEGIN:STANDARD
DTSTART:20241009T120000
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
END:STANDARD
END:VTIMEZONE
END:VCALENDAR