Seminar @ CDS: #102, January 19th, 11:00: “Unified 2D–3D Vision–Language Models for Long-Horizon Embodied Perception”

When

19 Jan 26    
11:00 AM - 12:00 PM

Event Type

Department of Computational and Data Sciences
Department Seminar


Speaker : Mr. Ayush Jain, PhD Student, Robotics Institute, Carnegie Mellon University
Title : Unified 2D–3D Vision–Language Models for Long-Horizon Embodied Perception
Date & Time : January 19th, 2026 (Monday), 11:00 AM
Venue : # 102, CDS Seminar Hall


ABSTRACT:
Modern vision–language models have achieved impressive performance across a wide range of visual and embodied perception tasks. However, their dominant mode of operation, which tokenizes RGB frames into dense 2D pixel patches and applies all-to-all attention, scales poorly when faced with the long-horizon, continuous streams of visual input encountered by embodied agents operating in the real world. As a result, current models struggle to reason efficiently over time, space, and scene structure. This raises a fundamental question: what is the right representation for scaling vision–language models beyond images and short video clips? In this talk, we argue that 2D pixels, and by extension videos, are a highly redundant view of an underlying 3D world. Instead, vision–language models should operate over compact, non-redundant 3D scene representations, allowing model complexity to scale with scene content rather than raw sensory input.

A central challenge in building such 3D-based vision–language models is the scarcity of large-scale 3D data. To address this, we introduce unified 2D–3D vision–language models that leverage abundant 2D data while simultaneously learning 3D-aware representations with strong spatial structure and compactness. We show that these models achieve state-of-the-art performance across a range of 2D and 3D vision–language tasks, including segmentation, referential grounding, and visual question answering.

Finally, we discuss extensions of these models to dynamic 3D scenes and to video understanding in settings where explicit 3D sensory data is unavailable. Together, these results point toward a path for vision–language models that can reason over long time horizons and serve as a foundation for the next generation of embodied agents.

BIO: Ayush Jain is a Ph.D. student in the Robotics Institute at Carnegie Mellon University, advised by Dr. Katerina Fragkiadaki. His research focuses on unified 2D–3D vision–language models that leverage the scale of 2D data and the spatial structure and compactness of 3D representations. His work has produced state-of-the-art VLMs for 2D and 3D segmentation, referential grounding, and visual question answering. Ayush’s research has been published at top machine learning and computer vision venues, including CVPR, ECCV, RSS, NeurIPS, and ICML, with Spotlight presentations at CVPR and NeurIPS. He has received Outstanding Reviewer Awards at ICCV 2023, ICCV 2025, and CVPR 2024. He is supported by the CMU Robotics Vision Fellowship and the Meta AI Mentorship Fellowship, and has interned at Apple Machine Learning Research, Meta FAIR, and Meta Reality Labs.

Host Faculty: Prof. Venkatesh Babu


ALL ARE WELCOME