Seminar @ CDS: #102, December 23, 11:00: “Counterfactual World Modeling – A framework for constructing vision foundation models.”

When

23 December 2024
11:00 AM - 12:00 PM

Event Type

Department of Computational and Data Sciences
Department Seminar


Speaker: Mr. Rahul, PhD student at Stanford’s NeuroAILab
Title: “Counterfactual World Modeling – A framework for constructing vision foundation models”
Date & Time: December 23, 2024, 11:00 AM
Venue: #102, CDS Seminar Hall


ABSTRACT
Foundation models of natural language have shown how large pre-trained neural networks can provide solutions to a wide range of tasks. However, in machine vision, most leading approaches employ different architectures for different tasks, trained on costly task-specific labeled datasets. In this talk, I will introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model: a unified, unsupervised network that can be prompted to perform a wide variety of visual computations. CWM has two key components that resolve the core issues that have hindered the application of the foundation-model concept to vision. The first is structured masking, a generalization of masked prediction methods that encourages a prediction model to capture the low-dimensional structure in visual data. Specifically, we can sample a patch-level prompt to meaningfully control scene dynamics. This in turn enables CWM’s second main idea – the observation that many apparently distinct visual representations can be computed, in a zero-shot manner, by comparing the prediction model’s output on real inputs versus slightly modified (“counterfactual”) inputs. This talk will describe how CWM enables the extraction of low- and mid-level vision structures such as optical flow, keypoints, and object segments under a unified architecture. Further, I’ll demonstrate that patch-level prompting enables sophisticated image editing capabilities that have previously been challenging to achieve even with task-specific models. Finally, I will discuss how the CWM framework can be bootstrapped to extract increasingly powerful vision structures, paving the way for real-world robotics applications, where robust task-general perception remains a bottleneck.
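For attendees curious about the counterfactual-comparison idea before the talk, the Python sketch below is a minimal, hypothetical illustration, not the speaker’s implementation: a toy masked predictor (toy_predictor, a stand-in name) replaces CWM’s pretrained network, and differencing its predictions on a real versus a patch-perturbed prompt yields the kind of response map the abstract describes reading out zero-shot.

    import numpy as np

    PATCH = 16  # patch size in pixels (an assumption for illustration)

    def toy_predictor(frame, prompt):
        # Stand-in for CWM's pretrained masked prediction network: it simply
        # copies revealed prompt patches and falls back to the input frame
        # elsewhere, so the script runs without any trained weights.
        return np.where(np.isnan(prompt), frame, prompt)

    def counterfactual_response(frame, y, x, delta=0.5):
        # Predict once on a fully masked prompt, then once on a prompt in
        # which a single patch is revealed and perturbed; the absolute
        # difference localizes the pixels the model ties to that patch --
        # the zero-shot readout principle described in the abstract.
        prompt = np.full_like(frame, np.nan)
        pred_real = toy_predictor(frame, prompt)

        prompt_cf = prompt.copy()
        prompt_cf[y:y + PATCH, x:x + PATCH] = frame[y:y + PATCH, x:x + PATCH] + delta
        pred_cf = toy_predictor(frame, prompt_cf)

        return np.abs(pred_cf - pred_real)

    frame = np.random.rand(64, 64).astype(np.float32)
    response = counterfactual_response(frame, y=16, x=16)
    print("pixels influenced by the perturbed patch:", int((response > 0).sum()))

With a real prediction model in place of the toy one, the same comparison can be repeated with different prompt designs to extract structures such as optical flow or object segments, as the talk will explain.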

BIO: Rahul is a fourth-year CS PhD student at Stanford’s NeuroAILab, where he is advised by Prof. Dan Yamins. His research explores the mechanisms that enable the interpretation of physical dynamics from visual imagery, both in humans and machines. He holds a Master’s in Computer Vision from CMU, where he worked on 3D shape modeling. Prior to that, he was a research assistant at the Vision and AI Lab at the Indian Institute of Science, focusing on human and object pose estimation as well as domain adaptation.

Host Faculty: Prof. Venkatesh Babu


ALL ARE WELCOME