Ph.D. Thesis Colloquium: 102: CDS: 16 June 2026 “Continuous, Spatial, and Distributional Control for Faithful Image Synthesis”

When

16 Jun 26    
11:00 AM - 12:00 PM

Event Type

DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
Ph.D. Thesis Colloquium


Speaker: Mr. Rishubh Parihar
S.R. Number: 06-18-01-10-12-21-1-19480
Title: “Continuous, Spatial, and Distributional Control for Faithful Image Synthesis”
Research Supervisor: Prof. Venkatesh Babu
Date & Time : June 16, 2026 (Tuesday), 11:00 AM
Venue : #102, CDS Seminar Hall


ABSTRACT
Deep generative models have transformed image synthesis over the last decade, advancing from early Generative Adversarial Networks (GANs) to modern text-to-image models capable of generating highly photorealistic images. The primary goal of these models is to learn the underlying data distribution from a given set of samples (e.g., images), enabling the synthesis of novel instances. This typically involves learning a function that maps from a simple tractable distribution (e.g., Gaussian) to the data distribution, parameterized by a neural network. However, while these models can generate highly diverse image variations, they offer little to no direct control over the synthesized content. This lack of precision restricts a user’s ability to accurately convey their intent about scene layout, semantic attributes, and object identities during image synthesis, limiting these models’ utility as practical creative tools. To address these challenges, this thesis proposes a comprehensive suite of frameworks that introduce precise, intuitive control mechanisms for image synthesis. Specifically, we explore three crucial dimensions of control for image generation:

  • Continuous Control: Smoothly modulating the intensity of semantic attributes, such as precisely varying a facial expression by directly manipulating the model’s latent representations.
  • Spatial Control: Governing scene composition by realistically inserting new elements into existing images or specifying precise geometric properties of objects, such as 3D orientation and scale, during image generation.
  • Test-Time Distribution Control: Steering the sample distribution of pretrained generative models to achieve target criteria, such as attribute balancing for debiased face generation.

Continuous Semantic Control: Visual data inherently contains numerous semantic attributes with continuous factors of variation, such as a person’s age or an object’s size. Capturing these continuous changes is essential for fine-grained generative control, yet standard models often restrict synthesis to discrete or binary variations. In our work, FLAME, we propose an efficient method to discover disentangled edit directions in the latent space of pretrained StyleGAN models, enabling continuous control over facial attributes during the synthesis process in a training-free manner. To extend this capability to the unconstrained setting of in-the-wild, text-conditioned image generation, we next develop PreciseControl, a method to personalize diffusion models with fine-grained facial attribute control. By conditioning the diffusion model on the disentangled latent space of StyleGAN, this approach achieves smooth attribute editing while preserving the compositionality of text-guided generation. Finally, to move beyond specific domains like faces, we introduce KontinuousKontext, extending continuous control capabilities to foundational instruction-driven image editing models. This framework allows users to smoothly adjust the intensity of diverse editing tasks such as stylization, object shape, and scene lighting through an intuitive, slider-based interface.
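
As a rough illustration of the continuous control described above, the sketch below shifts a latent code along a fixed edit direction by a user-chosen strength, which is how a slider value can map to a smooth attribute change. It is a toy example under stated assumptions, not the procedure used in FLAME, PreciseControl, or KontinuousKontext: the three-dimensional latent, the `smile_direction` vector, and the function names are all hypothetical, and how the direction is discovered is not shown.

```python
import math


def edit_latent(w, direction, strength):
    """Return w shifted along the unit-normalized edit direction by `strength`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("edit direction must be non-zero")
    return [wi + strength * di / norm for wi, di in zip(w, direction)]


def slider_sweep(w, direction, strengths):
    """Apply the same edit direction at several strengths (one per slider stop)."""
    return [edit_latent(w, direction, s) for s in strengths]


# Hypothetical values: a toy 3-D latent and an assumed attribute direction.
w = [0.5, -1.0, 2.0]
smile_direction = [3.0, 0.0, 4.0]  # norm is 5, so the unit direction is [0.6, 0.0, 0.8]

sweep = slider_sweep(w, smile_direction, [-1.0, 0.0, 1.0])
for strength, latent in zip([-1.0, 0.0, 1.0], sweep):
    print(strength, latent)
```

A strength of zero leaves the latent unchanged, and intermediate strengths interpolate linearly between the extremes. In a real system each edited latent would be decoded by the generator to produce the image.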

Spatial Control: We investigate the control of spatial scene elements through two important tasks: object insertion within existing scenes, and the grounding of 3D object properties such as location, orientation, and scale during generation. For object insertion, we propose Text2Place, a test-time training approach that leverages the generative priors of text-to-image diffusion models to predict 2D affordances for realistic human placement. We extend this capability to 3D-aware object insertion in MonoPlace3D and Depth-Aware Editing, ensuring that inserted elements blend naturally into the 3D scene with accurate perspective, scale, and harmonious occlusions. For spatial grounding in text-to-image generation, we first introduce CompassControl, a method to precisely control the 3D orientation of text-described scene objects. Moving toward full scene synthesis, we develop SeeThrough3D, in which we propose a novel primitive-based scene representation that models objects as translucent 3D boxes to condition the generation process, achieving robust 3D layout grounding with accurate occlusion handling.

Test-Time Distribution Control: Beyond individual image-level control, an important but largely unexplored direction is steering the distribution of generated samples from a pretrained generative model. To achieve this without the prohibitive cost of post-training, we develop training-free guidance mechanisms that steer the empirical distribution of a sampled batch toward a user-specified target. In BalancingAct, we introduce attribute distribution guidance within the bottleneck h-space of diffusion models, enabling the generation of image batches that adhere to a user-provided reference attribute distribution over subgroups, such as for demographic balancing. We then address the problem of diversity collapse in pretrained flow models in our work “Do Not Settle at the Mode!”. The key idea is to maximize the pairwise distance between internal representations within a batch while ensuring samples remain anchored to the learned feature manifold via feature guidance during sampling. This divergence mechanism significantly enhances the diversity of the base generative model without compromising visual quality.
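
The two batch-level quantities mentioned above can be made concrete with a small sketch. The first function measures how far a batch's empirical attribute distribution is from a target distribution, and the second computes the mean pairwise distance between per-sample feature vectors. This is only an illustration under stated assumptions: the total variation distance, the Euclidean metric, the function names, and the toy labels and features are choices made here and are not taken from the thesis, and the guidance step that would use such quantities during sampling is not shown.

```python
import math
from collections import Counter


def empirical_distribution(labels, groups):
    """Fraction of the batch that falls in each subgroup."""
    if not labels:
        raise ValueError("batch must contain at least one sample")
    counts = Counter(labels)
    return {g: counts.get(g, 0) / len(labels) for g in groups}


def total_variation(p, q):
    """Total variation distance between two distributions over the same groups."""
    return 0.5 * sum(abs(p[g] - q[g]) for g in p)


def mean_pairwise_distance(features):
    """Average Euclidean distance over all unordered pairs in the batch."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two samples to form a pair")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(features[i], features[j])
    return total / (n * (n - 1) / 2)


# Hypothetical batch: predicted subgroup labels and a balanced target.
labels = ["a", "a", "a", "b"]
target = {"a": 0.5, "b": 0.5}
gap = total_variation(empirical_distribution(labels, target.keys()), target)

# Hypothetical feature batches: one collapsed to a single point, one spread out.
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
spread = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]

print(gap, mean_pairwise_distance(collapsed), mean_pairwise_distance(spread))
```

A smaller gap means the batch is closer to the requested subgroup balance, and a larger mean pairwise distance means the batch is less collapsed around a single mode.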


ALL ARE WELCOME