Ph.D. Thesis Defense (ONLINE) @ CDS : 15 April 2021 : “A Study on Deep Learning Approaches, Architectures and Training Methods for Crowd Analysis”
15 Apr @ 3:30 PM -- 5:00 PM
Ph.D. Thesis Defense (Online)
Speaker : Deepak Babu Sam
S.R. Number : 06-18-02-10-12-15-02-13199
Title : A Study on Deep Learning Approaches, Architectures and Training Methods for Crowd Analysis
Date & Time : 15 April 2021 (Thursday), 03:30 PM
Venue : Online
Analyzing large crowds quickly is a highly sought-after capability nowadays, and it assumes prime importance in public security and planning. But automated reasoning about crowd images or videos is a challenging Computer Vision task. The difficulty is so extreme in dense crowds that the task is typically narrowed down to estimating the number of people. Since the count or distribution of people in the scene can itself be very valuable information, this field of research has gained traction. The difficulty mostly stems from the drastic variability in crowd density, as any prospective approach has to scale across crowds ranging from a few tens to thousands of people. This results in large diversity in the way people appear in crowded scenes. Often people are seen only as a bunch of blobs in highly dense crowds, whereas facial or body features might be visible in less dense gatherings. Hence, the visibility and scale of features for crowd discrimination vary drastically with the density of the crowd. Severe occlusion, pose changes and viewpoint variations further compound the problem. Typical head or body detection-based methods fail to adapt to such huge diversity, paving the way for simpler crowd density regression models. Added to these is the practical difficulty of annotating millions of head locations in dense crowds. This implies that creating large-scale labeled crowd data is expensive, which directly takes a toll on the performance of existing CNN-based counting models.
Given these challenges, this thesis tackles the problem of crowd counting from multiple perspectives. A detailed inquiry addresses three major issues: diversity, data scarcity, and localization.
** Addressing Diversity **: First, the diversity issue is considered, as it causes significant prediction errors when models fail to scale well across the density categories. In such diverse scenarios, discriminating persons requires a larger spatial context and the semantics of the scene, rather than local crowd patterns alone. A set of brain-inspired top-down feedback connections from higher layers is proposed. This feedback is shown to deliver global context to the initial layers of the CNN and to help correct prediction errors in an iterative manner. Next, an alternative mixture-of-experts approach is devised, where a differential training regime jointly clusters and fine-tunes a set of experts to capture the huge diversity seen in crowd images. This approach results in a significant boost in counting performance, as different regions of the images are processed by the appropriate expert regressor based on the local density. Further performance improvement is obtained through a growing CNN that can progressively increase its capacity to account for the wide variability in crowd scenes.
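As a rough illustration of how a set of experts might be clustered over density categories, the sketch below assigns each image patch to whichever expert currently counts it best; each expert would then be fine-tuned only on its own patches. The assignment rule (least absolute count error), the function name `assign_patches_to_experts`, and the toy numbers are all assumptions made for illustration, not details taken from the thesis.

```python
import numpy as np

def assign_patches_to_experts(expert_counts, true_counts):
    """Return, for each patch, the index of the expert with the smallest count error.

    expert_counts: shape (num_experts, num_patches), predicted count per expert and patch.
    true_counts:   shape (num_patches,), annotated count per patch.
    The least-absolute-error rule is an assumed stand-in for the clustering step.
    """
    expert_counts = np.asarray(expert_counts, dtype=float)
    true_counts = np.asarray(true_counts, dtype=float)
    errors = np.abs(expert_counts - true_counts[None, :])
    return errors.argmin(axis=0)

# Hypothetical predictions from three experts on a sparse, a medium and a dense patch.
expert_counts = [
    [10.5, 40.0, 300.0],   # expert 0: closest on the sparse patch
    [9.0, 95.0, 520.0],    # expert 1: closest on the medium patch
    [3.0, 150.0, 880.0],   # expert 2: closest on the dense patch
]
true_counts = [10.0, 100.0, 900.0]

assignment = assign_patches_to_experts(expert_counts, true_counts)
print(assignment.tolist())
```

With these toy numbers each expert ends up owning one density range, which is the kind of specialization the paragraph above describes.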
** Addressing Data Scarcity **: Dense crowd counting demands millions of head annotations for training models. This annotation difficulty can be mitigated using a Grid Winner-Take-All autoencoder, which is designed to learn almost 99% of the parameters from unlabeled crowd images. The model achieves superior results compared to other unsupervised methods and beats the fully supervised baselines in limited-data scenarios. This objective is pushed further to fully eliminate the dependency on instance-level labeled data. The proposed completely self-supervised architecture does not require any annotation for training, but uses a distribution matching technique to learn the required features. The only input required for training, apart from a large set of unlabeled crowd images, is the approximate upper limit of the crowd count for the given dataset. Experiments show that the model learns crowd features effectively and delivers significant counting performance. Furthermore, the superiority of the method is established in the limited-data setting as well.
** Addressing Localization **: Typical counting models predict crowd density for an image as opposed to detecting every person. These regression methods, in general, fail to localize persons accurately enough for most applications other than counting. Hence, two detection frameworks for dense crowd counting are developed that obviate the need for the prevalent density regression paradigm. The first approach reformulates the task as localized dot prediction in dense crowds, where the model is trained for pixel-wise binary classification to detect people. In the second dense detection architecture, apart from locating persons, the spotted heads are sized with bounding boxes. This approach can detect individual persons consistently across the diversity spectrum, as opposed to regressing local crowd density values. Moreover, this improved localization is achieved without requiring any additional box annotations for training.
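The contrast between density regression and dot prediction can be made concrete with a toy example. In the minimal sketch below, a density map yields only a count (the sum over pixels), while a per-pixel person-probability map yields explicit locations once thresholded. The function names, the 0.5 threshold, and the 4x4 maps are assumptions chosen for illustration; they are not the models described in the thesis.

```python
import numpy as np

def count_from_density(density_map):
    """Density regression: the count is the sum of the predicted density map."""
    return float(np.asarray(density_map, dtype=float).sum())

def dots_from_probabilities(prob_map, threshold=0.5):
    """Dot prediction: keep pixels classified as 'person' and return their (row, col)."""
    rows, cols = np.nonzero(np.asarray(prob_map) > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy scene with two people.
density = np.zeros((4, 4))
density[0:2, 0:2] = 0.25   # first person's unit mass spread over four pixels
density[2:4, 2:4] = 0.25   # second person's unit mass spread over four pixels

prob = np.zeros((4, 4))
prob[0, 1] = 0.9           # confident 'person' pixel for the first head
prob[3, 2] = 0.8           # confident 'person' pixel for the second head

count = count_from_density(density)
dots = dots_from_probabilities(prob)
print(count, dots)
```

Both outputs agree on the number of people, but only the dot map says where each person is, which is the property the detection frameworks above aim for.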
ALL ARE WELCOME