BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//wp-events-plugin.com//7.2.3.1//EN
TZID:Asia/Kolkata
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
UID:75@cds.iisc.ac.in
DTSTART;TZID=Asia/Kolkata:20240923T140000
DTEND;TZID=Asia/Kolkata:20240923T150000
DTSTAMP:20240917T154256Z
URL:https://cds.iisc.ac.in/events/ph-d-thesis-defense-hybridcds-23septembe
 r-2024-integrating-coarse-semantic-information-with-deep-image-representat
 ions-for-object-localization-model-generalization-and-efficient-training/
SUMMARY:Ph.D. Thesis Defense: HYBRID:CDS: 23 September 2024 "Integrating C
 oarse Semantic Information with Deep Image Representations for Object Loca
 lization\, Model Generalization and Efficient Training"
DESCRIPTION:DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES\nPh.D. Thesis De
 fense (HYBRID)\n\nSpeaker : Mr. Aditay Tripathi\nS.R. Number : 06-18-02-1
 0-12-18-1-15681\nTitle : "Integrating Coarse Semantic Information with De
 ep Image Representations for Object Localization\, Model Generalization a
 nd Efficient Training"\nThesis examiner: Prof. Konda Reddy Mopuri\, IIT H
 yderabad\nResearch Supervisor: Dr. Anirban Chakraborty\nDate & Time : Sep
 tember 23\, 2024 (Monday) at 02:00 PM\nVenue : The Thesis Defense will be
  held in HYBRID Mode\n# 202 CDS Class Room / MICROSOFT TEAMS\n\nPlease cl
 ick on the following link to join the Thesis Defense:\nMS Teams link (tea
 ms.microsoft.com)\n\nABSTRACT\nCoarse semantic fea
 tures are abstract descriptors capturing broad semantic information in an 
 image\, including scene labels\, crude contextual relationships between ob
 jects in the scene\, or even objects described using hand-drawn sketches. 
 Derived from external sources or pre-trained models\, these features compl
 ement fine-grained representations from deep neural networks\, enhancing o
 verall image understanding. In this thesis\, we explore applications where
  we integrate coarse semantic cues with deep image representations to addr
 ess novel visual analytics tasks and propose significantly improved soluti
 ons to existing challenges. In the first part of the thesis\, we present n
 ovel query-guided object localization frameworks\, where concepts like a h
 and-drawn sketch of an object\, a ‘gloss’ delivering a crude descripti
 on of the object of interest\, or a scene-graph representing objects and t
 heir coarse relationships provide necessary cues to localize all instances
 of the corresponding object(s) in a complex natural scene. Next\, we util
 ize the information contained in edge maps via intelligent augmentation an
 d shuffling to improve the robustness of computer vision models against th
 e adverse effects of texture bias prevalent in such models. Lastly\, in th
 e realm of self-supervised learning\, we use coarse representations derive
 d from a compact proxy model to schedule and sequence training data for ef
 ficient training of larger models.\n\nLocating objects in a scene through 
 image queries is a key problem in computer vision\, with recent work highl
 ighting the challenge of localizing both seen and unseen object categories
  at test time. A possible solution is to use object images as queries\; ho
 wever\, practical obstacles such as copyright\, privacy constraints\, and 
 difficulties in obtaining annotated data for emerging categories pose chal
 lenges. Instead\, we propose ‘sketch-guided object localization\,’ whi
 ch utilizes crude hand-drawn sketches to localize corresponding objects. W
 e employ cross-modal attention to integrate query information into the ima
 ge feature representation\, facilitating the generation of region proposal
 s relevant to the query sketch. The region proposals are then scored agains
 t the sketch query for localization. Our method outperforms baselines in bo
 th single-query and multi-query localization tasks on object detectio
 n benchmarks (MS-COCO and PASCAL-VOC) using abstract hand-drawn sketches a
 s queries.\n\nChallenges in scaling sketch-guided object localization incl
 ude the abstract appearance of hand-drawn sketches\, style variations\, an
 d a significant domain gap with respect to natural images. Existing soluti
 ons using attention-based frameworks show weak alignment and inaccurate lo
 calization because they integrate query features after extracting image fe
 atures independently. To mitigate this\, we introduce a novel sketch-guide
 d vision transformer encoder that uses cross-attention after each transfor
 mer-based image encoder block to integrate sketch information into the ima
 ge representation. Thus\, we learn query-conditioned image features for st
 ronger alignment with the sketch query. At the decoder’s output\, cross-
 attention is used to integrate sketch information into the object-level im
 age features\, thus improving their semantic alignment. The model generali
 zes to unseen object categories and achieves state-of-the-art performance 
 across both open-set and closed-set localization tasks.\n\nThe aforementio
 ned works pioneered and optimized the use of hand-drawn sketches for one-s
 hot object localization. However\, relying solely on crude hand-drawn sket
 ches may introduce ambiguity: for instance\, a rough sketch of a laptop c
 ould be mistaken for a sofa. One approach towards addressing this is to us
 e a coarse linguistic definition of the category\, e.g.\, ‘a porta
 ble computer small enough to use in your lap’\, to complement the sketch
  query. We\, therefore\, propose a multimodal integration of sketches with
  linguistic category definitions\, called ‘gloss’\, for a comprehensiv
 e representation of visual and semantic cues. We use cross-modal attention
  to integrate information from the multi-modal queries into the image repr
 esentation\, thus generating region proposals relevant to the queries. Fur
 ther\, we propose a novel orthogonal projection-based proposal scoring tec
 hnique that evaluates each proposal with respect to the multi-modal querie
 s. Experiments on the MS-COCO dataset using ‘Quick\, Draw!’ sketches and
  ‘WordNet’ glosses as queries demonstrate superior performance over rel
 ated baselines for both seen and unseen categories.\n\nIn natural scene
 s\, single-query object localization is uncertain due to factors like unde
 rrepresentation\, occlusion\, or unavailability of suitable training data.
  Scenes with multiple objects exhibit visual relationships\, offering stro
 ng contextual cues for improved grounding. Scene graphs efficiently repres
 ent objects and the coarse semantic relationships (e.g.\, laptop ‘on’ 
 table) between them. In this work\, we study the problem of grounding scen
 e graphs on natural images to improve object localization. To this end\, w
 e propose a novel graph neural network-based approach referred to as Visio
 -Lingual Message PAssing Graph neural Network (VL-MPAG Net). We first cons
 truct a directed graph with object proposals as nodes and edges representi
 ng plausible relations. Next\, we employ a three-step framework involving 
 inter-graph and intra-graph message passing. Through inter-graph message p
 assing\, the model integrates scene-graph information into the proposal re
 presentations\, facilitating the learning of query-conditioned proposal re
 presentations. Subsequently\, intra-graph message passing refines them to 
 learn context-dependent representations of these proposals as well as those
  of the query objects. These refined query representations are used to sco
 re the proposals for object localization\, outperforming baselines on publ
 ic benchmark datasets.\n\nDeep vision models often exhibit overreliance on
  texture features\, resulting in poor generalization. To address this text
 ure bias\, we propose a lightweight adversarial augmentation technique cal
 led ELeaS that explicitly incentivizes the network to learn holistic shape
 s for accurate prediction in an object classification setting. Our augment
 ations superpose coarser descriptors\, namely edgemaps\, from one image on
 to another image with shuffled patches using a randomly determined mixing 
 proportion. To be able to classify these augmented images with the label o
 f the edgemap images\, the model needs to not only detect and focus on edg
 es but also distinguish between relevant and spurious edges. We show that 
 our augmentations significantly improve classification accuracy and robust
 ness measures on a range of datasets and neural architectures. Analysis us
 ing multiple probe datasets shows substantially increased shape sensitivit
 y in our trained models\, explaining these observed improvements.\n\nSelf-s
 upervised learning (SSL) is vital for acquiring high-quality representatio
 ns from unlabeled image collections\, but the growing dataset sizes increa
 se the demand for computational resources in training SSL models. To addre
 ss this\, we propose ‘DYSCOF’\, a Dynamic data Selection method. DYSCO
 F scores and prioritizes essential samples using a Coarse-to-Fine schedul
 e for optimized data selection. It employs a small proxy model pre-trained v
 ia contrastive learning to identify a data subset based on the score diffe
 rence between the representations learned by the larger target model and t
 he coarse representations obtained from this proxy. The selected subset is
  iteratively updated in a coarse-to-fine schedule that initially selects
  a larger data fraction and gradually reduces the proportion of selected d
 ata as training progresses. To further enhance efficiency\, we introduce a
  distillation loss that leverages coarse representations obtained from the
  proxy model to guide the target model’s learning. Validated on public b
 enchmark datasets\, our method achieves a large reduction in computational
  load\, enhancing SSL training efficiency while maintaining classification
  performance.\n\nALL ARE WELCOME
CATEGORIES:Events,Thesis Defense
END:VEVENT
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
X-LIC-LOCATION:Asia/Kolkata
BEGIN:STANDARD
DTSTART:20230924T140000
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
END:STANDARD
END:VTIMEZONE
END:VCALENDAR