M.Tech Research Thesis {Colloquium}: CDS: “The Art of Control: Post-hoc Alignment of Diffusion Models for Safety, Ethics, and Fine-grained Control”

When

16 Jan 26    
10:30 AM - 11:30 AM

Event Type

DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
M.Tech Research Thesis Colloquium


Speaker : Mr. Aakash Kumar Singh
S.R. Number : 06-18-01-10-22-23-1-23884
Title : “The Art of Control: Post-hoc Alignment of Diffusion Models for Safety, Ethics, and Fine-grained Control”
Research Supervisor : Prof. Venkatesh Babu
Date & Time : January 16, 2026, 10:30 AM
Venue : # 102 CDS Seminar Hall


ABSTRACT

Diffusion models have achieved unprecedented success in generating high-fidelity images, largely due to the massive scale of their training datasets. Yet this success raises significant concerns regarding privacy, intellectual property rights, and model safety, because these models tend to reproduce copyrighted content and recognizable human identities seen during training. To address these risks, earlier approaches relied on dataset filtering and inference-time safety checkers, both of which are fundamentally fragile. These external defenses merely suppress outputs without removing the underlying internal representations, so the undesirable concepts remain intact inside the model.

These concerns highlight the need for unlearning concepts post-training. Existing unlearning methods can erase specific concepts but often introduce significant degradation in neighboring concepts. To reduce such side effects, recent approaches require extensive domain expertise to identify which other concepts must be preserved, making them impractical for large-scale deployment where thousands of concepts must be handled automatically. Moreover, many sensitive concepts (for example, NSFW content) are inherently subjective, with the desired degree of forgetting varying across cultural, regional, and application-specific contexts, further complicating the use of rigid, one-size-fits-all unlearning strategies.

To address these challenges, we introduce Concept Siever, an end-to-end framework for targeted concept removal in pre-trained text-to-image diffusion models. Concept Siever rests on two key innovations:

  1. Automatic Paired Data Generation : The framework creates paired datasets of a target concept and its negations by utilizing the diffusion model’s own latent space. These pairs differ only in the target concept, enabling precise forgetting with minimal side effects—crucially, without requiring domain expertise.
  2. Concept-Specific Localization : Concept Siever employs a novel localization method to identify and isolate model components most responsible for the target concept. By retraining only these localized components on the paired dataset, the method accurately removes concepts with negligible side effects while preserving neighboring and unrelated concepts.

Additionally, Concept Siever offers continuous inference-time control over forgetting strength, enabling flexible adaptation to context-dependent requirements without additional fine-tuning. The method achieves state-of-the-art performance on the I2P benchmark, surpassing previous methods by over 33% while demonstrating superior structure preservation. Extensive quantitative and qualitative evaluations, along with a user study, validate the effectiveness of our approach in providing a targeted, adjustable mechanism for concept erasure with minimal collateral impact.
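The abstract does not say how the continuous control over forgetting strength is realized. As a rough illustration only, one common way to get such a knob is to interpolate linearly between the original weights and the retrained weights of the localized components. The sketch below assumes that scheme; it is not taken from Concept Siever, and the function name `blend_weights`, the parameter names, and the toy values are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for model parameters: name -> flat list of values.
Weights = Dict[str, List[float]]


def blend_weights(original: Weights, edited: Weights, strength: float) -> Weights:
    """Interpolate between original and edited weights (assumed scheme).

    Only parameters present in `edited` (the localized components) change;
    every other parameter is copied from `original` untouched.
    strength = 0.0 keeps the original model, 1.0 applies the full edit.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    blended = {name: list(values) for name, values in original.items()}
    for name, new_values in edited.items():
        old_values = original[name]
        blended[name] = [
            (1.0 - strength) * old + strength * new
            for old, new in zip(old_values, new_values)
        ]
    return blended


# Hypothetical toy model: only "attn.to_v" was retrained to forget a concept.
original = {"attn.to_v": [1.0, 2.0], "conv_in": [0.5, 0.5]}
edited = {"attn.to_v": [0.0, 4.0]}

half = blend_weights(original, edited, 0.5)   # partial forgetting
full = blend_weights(original, edited, 1.0)   # full forgetting
none = blend_weights(original, edited, 0.0)   # original behaviour
```

Under this assumption, the strength can be changed per request at inference time without any further fine-tuning, since only the stored original and retrained weights are needed.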

[Project Page]


ALL ARE WELCOME