DEPARTMENT OF COMPUTATIONAL AND DATA SCIENCES
Ph.D. Thesis Colloquium
Speaker : Ms. Prashanthi S. K.
S.R. Number : 06-18-01-10-12-20-1-18362
Title : “Systems Optimizations for DNN Training and Inference on Accelerated Edge Devices”
Research Supervisor : Prof. Yogesh Simmhan
Date & Time : April 2, 2025 (Wednesday), 03:00 PM
Venue : CDS # 419
ABSTRACT
Deep Neural Networks (DNNs) have had a significant impact on a wide variety of domains, such as Autonomous Vehicles, Smart Cities, and Healthcare, through low-latency inferencing on edge computing devices close to the data source. Recently, there has also been a push towards training DNN models on the edge. This is driven by the increasing data collected from edge devices in Cyber-Physical Systems (CPS), the growing computing power of edge devices, and the rise of on-device training paradigms such as Federated Learning and Continuous Learning that focus on privacy and personalization.
Existing literature has focused heavily on optimizing edge inference; there is very limited systems research on optimizing DNN training, or concurrent training and inference, on the edge. Previous work on server GPUs cannot be directly applied since edge devices are architecturally different from cloud/server GPUs and are deployed in varied field settings with power or energy constraints. Through this PhD thesis, we design system optimizations and tune edge platforms to help DNN training and inference workloads utilize the full potential of accelerated edge hardware.
Specifically, in this thesis, we make four contributions: 1) characterize the impact of training and device parameters on the performance and energy of DNN training workloads; 2) develop empirical ML models to predict and optimize the performance of training workloads in a power-constrained setting; 3) develop an analytical roofline model to understand and explain the impact of device parameters on the power and performance of training and inference; and 4) design a scheduler for concurrent training and inference workloads to meet diverse QoS goals of latency and throughput within a power budget.
We motivate the need for training on the edge and the associated systems research challenges, and conduct a rigorous empirical performance characterization of four classes of NVIDIA Jetson accelerated edge devices for DNN training. We vary training and device parameters such as I/O pipelining and parallelism, storage media, mini-batch sizes, and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also quantifying conventional wisdom.
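To illustrate the kind of parameter sweep involved, here is a minimal profiling sketch for a Jetson-class device. The nvpmodel mode IDs, batch sizes, and the run_one_epoch() training stub are hypothetical placeholders rather than the thesis's actual harness; a real harness would also sample tegrastats or the onboard INA power sensors for utilization and energy readings.

import itertools
import subprocess
import time

POWER_MODES = [0, 1, 2]      # hypothetical nvpmodel mode IDs; valid IDs vary by Jetson model
BATCH_SIZES = [16, 32, 64]   # hypothetical mini-batch sizes to sweep

def run_one_epoch(batch_size):
    """Placeholder for one training epoch of the DNN workload under study."""
    time.sleep(0.1)  # stand-in for real training work

results = []
for mode, bs in itertools.product(POWER_MODES, BATCH_SIZES):
    # nvpmodel is the standard Jetson utility for switching power modes
    subprocess.run(["sudo", "nvpmodel", "-m", str(mode)], check=True)
    start = time.monotonic()
    run_one_epoch(bs)
    elapsed = time.monotonic() - start
    results.append({"mode": mode, "batch_size": bs, "epoch_time_s": elapsed})
print(results)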
Building upon the insights from our characterization, we develop PowerTrain, a pre-training and transfer-learning approach to accurately predict the performance and power consumption of a given DNN training workload using any specified power mode (CPU/GPU/memory frequencies, core count) on NVIDIA Jetson devices. We use these predictions to instantly construct a Pareto front and return a configuration that minimizes training time within a power budget. PowerTrain requires minimal additional profiling for transfer learning to a new workload and generalizes to different models, datasets, and other edge devices. Our predictions outperform the NVIDIA prediction tool and other baselines and have low prediction errors of 5-15%.
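The budget-constrained selection step can be sketched as follows, assuming a trained predictor that maps a power mode to (training time, power). The predict() toy model, the mode tuples, and the 15 W budget are illustrative assumptions, not PowerTrain's actual predictor or API.

def predict(mode):
    """Hypothetical ML predictor: power mode -> (epoch time in s, power in W)."""
    cpu_ghz, gpu_ghz, cores = mode
    t = 100.0 / (cpu_ghz * cores + 2.0 * gpu_ghz)  # toy time model
    p = 3.0 * cpu_ghz * cores + 8.0 * gpu_ghz      # toy power model
    return t, p

modes = [(c, g, n) for c in (0.7, 1.2, 1.9)    # CPU GHz
                   for g in (0.5, 0.9, 1.3)    # GPU GHz
                   for n in (2, 4, 6)]         # CPU cores
preds = {m: predict(m) for m in modes}

# Pareto front: keep modes that no other mode dominates in both time and power.
pareto = [m for m in modes
          if not any((preds[o][0] <= preds[m][0] and preds[o][1] < preds[m][1]) or
                     (preds[o][0] < preds[m][0] and preds[o][1] <= preds[m][1])
                     for o in modes)]

BUDGET_W = 15.0  # example power budget
feasible = [m for m in pareto if preds[m][1] <= BUDGET_W]
best = min(feasible, key=lambda m: preds[m][0])  # fastest mode within budget
print("chosen power mode:", best, "predicted (time, power):", preds[best])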
In Pagoda, we investigate analytical roofline-based characterization to understand and explain the impact of power modes for various workloads. We develop a time roofline and a novel energy roofline model for diverse power modes. We couple this with an analytical model of the compute (FLOP) and memory access (bytes) of DNN workloads to analyze them from first principles. Finally, we apply these methods to modify the power mode, and hence the roofline, of the edge device to optimize the latency and energy usage of DNN inference. Our experiments show energy benefits of up to 15% without any degradation in time.
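The time roofline here is the classical relation: attainable FLOP/s = min(peak FLOP/s, memory bandwidth x arithmetic intensity). A minimal sketch follows, with illustrative peak numbers and a naive power-times-time energy estimate; the thesis's energy roofline model may well differ.

PEAK_FLOPS = 1.0e12  # hypothetical peak compute at some power mode, FLOP/s
PEAK_BW = 50.0e9     # hypothetical memory bandwidth, bytes/s
AVG_POWER = 12.0     # hypothetical average device power, W

def roofline_time(flops, bytes_moved):
    """Lower-bound execution time from the classical time roofline."""
    intensity = flops / bytes_moved                    # FLOP per byte
    attainable = min(PEAK_FLOPS, PEAK_BW * intensity)  # achievable FLOP/s
    return flops / attainable                          # seconds

def roofline_energy(flops, bytes_moved):
    """Naive energy estimate: average power times roofline time."""
    return AVG_POWER * roofline_time(flops, bytes_moved)  # joules

# Example: 2 GFLOP with 400 MB of traffic has intensity 5 FLOP/B, below this
# device's ridge point of 20 FLOP/B, so the workload is memory-bound here.
print(roofline_time(2e9, 400e6), roofline_energy(2e9, 400e6))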
Finally, we design Fulcrum, a scheduler that optimizes the power and performance of DNN training and inference workloads, both individually and when run concurrently. Specifically, we develop an interleaved approach for concurrent workload execution scheduled at the mini-batch granularity, offering low variability in the inference latency. We also propose two novel optimization strategies that satisfy the diverse QoS goals of meeting inference latency targets and maximizing training throughput while staying within a power budget for field deployments. Our gradient descent-based multi-dimensional search approach (GMD) quickly converges to a solution with less profiling of power modes, while our active-learning-based approach (ALS) generalizes well across various problem configurations. Both strategies outperform baselines and come close to the optimal solution.
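A gradient-descent-style multi-dimensional search of this flavor can be sketched as below. The measure() function and the frequency tables are hypothetical stand-ins for profiling a candidate power mode on the device; this sketches the general idea, not Fulcrum's actual GMD implementation.

CPU_FREQS = [0.7, 1.0, 1.4, 1.9]  # GHz, illustrative
GPU_FREQS = [0.3, 0.6, 0.9, 1.3]  # GHz, illustrative
CORES = [2, 4, 6, 8]
BUDGET_W = 15.0
PENALTY = 1e6  # cost for violating the power budget

def measure(i, j, k):
    """Stand-in for profiling: returns (training throughput, power) of a mode."""
    thr = CPU_FREQS[i] * CORES[k] + 3.0 * GPU_FREQS[j]        # toy throughput
    pwr = 2.0 * CPU_FREQS[i] * CORES[k] + 7.0 * GPU_FREQS[j]  # toy power
    return thr, pwr

def cost(idx):
    # Maximize throughput within the budget; penalize infeasible modes.
    thr, pwr = measure(*idx)
    return -thr + (PENALTY if pwr > BUDGET_W else 0.0)

idx = [1, 1, 1]  # start in the interior of the power-mode space
dims = [len(CPU_FREQS), len(GPU_FREQS), len(CORES)]
improved = True
while improved:  # greedy descent: step one dimension at a time while cost drops
    improved = False
    for d in range(3):
        for step in (-1, +1):
            cand = list(idx)
            cand[d] += step
            if 0 <= cand[d] < dims[d] and cost(cand) < cost(idx):
                idx, improved = cand, True
print("chosen mode:", CPU_FREQS[idx[0]], GPU_FREQS[idx[1]], CORES[idx[2]])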
ALL ARE WELCOME