BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//wp-events-plugin.com//7.2.3.1//EN
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
UID:179@cds.iisc.ac.in
DTSTART;TZID=Asia/Kolkata:20260119T110000
DTEND;TZID=Asia/Kolkata:20260119T120000
DTSTAMP:20260114T141902Z
URL:https://cds.iisc.ac.in/events/seminar-cds-102-january-19th-1100-unifie
 d-2d-3d-vision-language-models-for-long-horizon-embodied-perception/
SUMMARY:{Seminar} @ CDS: #102\, January 19th: 11:00: "Unified 2D–3D Visio
 n–Language Models for Long-Horizon Embodied Perception."
DESCRIPTION:Department of Computational and Data Sciences\nDepartment Semin
 ar\n\n\n\nSpeaker : Mr. Ayush Jain\, PhD Student\, Robotics Institute\, Ca
 rnegie Mellon University\nTitle : Unified 2D–3D Vision–Language Models
  for Long-Horizon Embodied Perception\nDate & Time : January 19th\, 2
 026 (Monday)\, 11:00 AM\nVenue : # 102\, CDS Seminar Hall\n\n\n\nABSTRACT:
 \nModern vision–language models have achieved impressive performance acr
 oss a wide range of visual and embodied perception tasks. However\, their 
 dominant mode of operation\, which tokenizes RGB frames into dense 2D pixe
 l patches and applies all-to-all attention\, scales poorly when faced with
  the long-horizon\, continuous streams of visual input encountered by embo
 died agents operating in the real world. As a result\, current models stru
 ggle to reason efficiently over time\, space\, and scene structure. This r
 aises a fundamental question: what is the right representation for scaling
  vision–language models beyond images and short video clips? In this tal
 k\, we argue that 2D pixels\, and by extension videos\, are a highly redun
 dant view of an underlying 3D world. Instead\, vision–language models sh
 ould operate over compact\, non-redundant 3D scene representations\, allow
 ing model complexity to scale with scene content rather than raw sensory i
 nput.\n\nA central challenge in building such 3D-based vision–language m
 odels is the scarcity of large-scale 3D data. To address this\, we introdu
 ce unified 2D–3D vision–language models that leverage abundant 2D data
  while simultaneously learning 3D-aware representations with strong spatia
 l structure and compactness. We show that these models achieve state-of-th
 e-art performance across a range of 2D and 3D vision–language tasks\, in
 cluding segmentation\, referential grounding\, and visual question answeri
 ng.\n\nFinally\, we discuss extensions of these models to dynamic 3D scene
 s and to video understanding in settings where explicit 3D sensory data is
  unavailable. Together\, these results point toward a path for vision–la
 nguage models that can reason over long time horizons and serve as a found
 ation for the next generation of embodied agents.\n\nBIO: Ayush Jain is a 
 Ph.D. student in the Robotics Institute at Carnegie Mellon University\, ad
 vised by Dr. Katerina Fragkiadaki. His research focuses on unified 2D–3D
  vision-language models that leverage the scale of 2D data and the spatial
  structure and compactness of 3D representations. His work has produced st
 ate-of-the-art VLMs for 2D and 3D segmentation\, referential grounding\, a
 nd visual question answering. Ayush’s research has been published at top
  machine learning and computer vision venues\, including CVPR\, ECCV\, RSS
 \, NeurIPS\, and ICML\, with Spotlight presentations at CVPR and NeurIPS. 
 He has received Outstanding Reviewer Awards at ICCV (2023\, 2025) and CVPR
  2024. He is supported by the CMU Robotics Vision Fellowship and the Meta 
 AI Mentorship Fellowship\, and has interned at Apple Machine Learning Rese
 arch\, Meta FAIR\, and Meta Reality Labs.\n\nHost Faculty: Prof. Venka
 tesh Babu\n\n\n\nALL ARE WELCOME
CATEGORIES:Events,Talks
END:VEVENT
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
X-LIC-LOCATION:Asia/Kolkata
BEGIN:STANDARD
DTSTART:19700101T000000
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
END:STANDARD
END:VTIMEZONE
END:VCALENDAR