M.Tech Research Thesis Colloquium: CDS: “Scalable Video Data Management and Visual Querying for Autonomous Camera Networks”


1 Mar 24    
3:00 PM - 4:00 PM

Event Type

M.Tech Research Thesis Colloquium

Speaker : Ms. Bharati Khanijo

S.R. Number : 06-18-02-10-12-19-1-17219

Title : “Scalable Video Data Management and Visual Querying for Autonomous Camera Networks”
Research Supervisor: Prof. Yogesh Simmhan
Date & Time : March 01, 2024 (Friday) at 03:00 PM
Venue : # 102 CDS Seminar Hall


Video data has historically been known for its unstructured nature and rich semantic content, but also for scalability issues in storage and analytics. Mobile aerial platforms like drones capture such videos across space and time. Advances in computer vision and deep learning enable automatic extraction of rich semantic information from video data, leading to applications where the stored video data can be used to study and analyze the world retrospectively and automatically. However, recent research has highlighted the compute-intensive nature of the Deep Neural Network (DNN) models involved, e.g., for accurate object detection, leading to high computing costs that limit their applicability for brute-force analysis of all historical videos. Also, an efficient design of such applications often requires co-analysis of video data along with associated geospatial and temporal metadata, which is a challenge.

We propose a geospatial-temporal video query system with support for semantic queries over drone videos, extending an existing spatial-temporal database and contemporary object detection models. We develop a heuristic to enable better reuse of semantic object detections obtained from different configurations (object detection model and its input resolution). The system further motivates the need for optimizations for retrospective semantic analysis and storage of drone videos, which is addressed by our novel DDownscale method and the associated ingest pipeline.

Prior optimizations on semantic querying over video data focus on static cameras from city-scale traffic/surveillance camera networks, often leveraging the spatial and temporal characteristics of associated videos, which are absent in videos recorded by mobile drone cameras. We specifically focus on two such characteristics of drone videos. One is that drone videos have shorter durations, unlike those captured by static cameras. Another is that there can be large variations in the level of detail of information captured across a fleet of drone cameras due to differences in the resolution of the camera, the altitude, and the orientation from which the videos were captured.

Specifically, we address the need to intelligently scale down the spatial resolution of videos to reduce video storage costs and semantic query/inferencing time. However, conventional methods of manual or profiling-based estimation of the ideal scaling ratio are compute-intensive and/or time-consuming for such heterogeneous feeds. We propose DDownscale, a novel method to dynamically select the downscale factor for a video by utilizing information on the object size in the video. We model the downscale factor and the associated drop in relative recall due to downscaling as a function of object size in the downscaled video, and demonstrate that, for a given DNN model and class of interest, DDownscale generalizes well to the evaluated datasets. We derive a DDownscale inequality between the relative recall drop and the hyper-parameters of the method. This inequality is satisfied by 98% of the dynamically downscaled videos across datasets, objects of interest and parameters. The algorithm achieves over 19% reduction in total object detection time and 24% reduction in storage on average compared to the baseline of storing/inferencing at the original resolution, for different user-specified target recall-drop values ranging from 1–30%, and 96% of the downscaled videos are within the target recall drop.
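
The idea of choosing a per-video downscale factor from object size can be illustrated with a minimal sketch. This is not the DDownscale model itself, whose functional form is not given in this abstract. It assumes a hypothetical rule: keep a representative (lower-quartile) object at or above a minimum pixel size, min_object_size_px, below which the detector's recall is presumed to drop too far. The function names and that threshold are illustrative assumptions.

```python
def pick_downscale_factor(object_sizes_px, min_object_size_px):
    """Return a scale in (0, 1] to multiply the frame width and height by.

    object_sizes_px: object sizes (e.g. bounding-box side lengths, in pixels)
        sampled from the original-resolution video.
    min_object_size_px: hypothetical smallest size at which recall is
        assumed to stay acceptable for the chosen detector.
    """
    if not object_sizes_px:
        return 1.0  # nothing to base a decision on: keep original resolution

    sizes = sorted(object_sizes_px)
    # Use the lower quartile so that small objects drive the decision.
    representative = sizes[len(sizes) // 4]

    if representative <= min_object_size_px:
        return 1.0  # objects are already small: do not downscale
    return min_object_size_px / representative


def scaled_resolution(width, height, factor):
    """Apply the factor and round down to even dimensions (common for codecs)."""
    new_w = max(2, int(width * factor) // 2 * 2)
    new_h = max(2, int(height * factor) // 2 * 2)
    return new_w, new_h


factor = pick_downscale_factor([40, 80, 120, 160], min_object_size_px=20)
resolution = scaled_resolution(1920, 1080, factor)
```

In this sketch a video whose objects are large in pixel terms is downscaled more aggressively, while a video with already small objects is left at its original resolution.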

Using the above modeling, we also derive a simpler ingest-time specification, in terms of the target level of detail (average ground spatial distance) captured in the video and the harmonic mean of the relative recall drop for the class of the smallest object of interest and the selected object detection model, to aid in the selection of a target level of detail. Additionally, we develop an ingest pipeline that reduces the time to ingest drone videos using this dynamic downscaling method over heterogeneous edge accelerators. It reduces the average turnaround time to ingest data from multiple clients by ~66%, despite the downscaling time overhead, compared to uploading the original-resolution video without downscaling.
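
To make the notion of a target level of detail concrete, the sketch below uses the standard ground sampling distance (GSD) formula for a nadir-pointing camera over flat ground and derives the downscale factor that reaches a target GSD. Treating the abstract's "average ground spatial distance" as GSD, and ignoring camera orientation, are simplifying assumptions made here for illustration. The function names are hypothetical.

```python
def ground_sampling_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Metres of ground covered by one pixel, for a camera pointing straight down."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)


def factor_for_target_gsd(native_gsd_m, target_gsd_m):
    """Scale in (0, 1] that coarsens the video from its native GSD to the target GSD.

    Downscaling by a factor s multiplies the GSD by 1/s, so s = native / target.
    A target finer than the native GSD cannot be met by downscaling.
    """
    if target_gsd_m <= native_gsd_m:
        return 1.0
    return native_gsd_m / target_gsd_m


# Illustrative values only: 100 m altitude, 13.2 mm sensor width,
# 8.8 mm focal length, 4000 px wide frames.
native = ground_sampling_distance(100.0, 13.2, 8.8, 4000)
factor = factor_for_target_gsd(native, target_gsd_m=0.075)
```

Under these assumptions, videos captured at a lower altitude or with a higher-resolution camera have a finer native GSD and can be downscaled further to reach the same target level of detail.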