Department of Computational and Data Sciences
Scalable Systems for Data Science
- Instructors: Yogesh Simmhan (email) (www)
- TA: Radhika Mittal
- Course number: DS256
- Credits: 3:1
- Semester: Jan 2025
- Lecture: Tue and Thu, 3:30-5pm (First class on Thu 9 Jan 2025)
- Room: CDS 202
- Teams: Teams Link (Join using Teams Code q450782)
- Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
- See 2017, 2018, 2019, 2020, 2021, 2022, 2023 webpages
Overview
This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to store, manage and process datasets that are large, fast and linked. This includes data engineering to pre-process data before machine learning, as well as scalable machine learning using distributed and federated approaches. If you are curious about how distributed ML, NoSQL and Big Data platforms work internally and how to use them efficiently to store and analyze terabytes of data, this is the course for you.
The course modules will cover all layers of a scalable data science stack:
- How do you store and query data at scale, using distributed file systems such as GFS/HDFS and Ceph and using cloud/NoSQL databases such as HBase and Dynamo?
- How do you pre-process data at large volumes in preparation for machine learning using distributed processing systems on the cloud, such as Apache Spark?
- How do you perform scalable training for both classic and deep learning using distributed training patterns and platforms such as parameter servers, model/pipeline parallelism, federated learning, SparkML and DistDGL? How do you serve model inference at scale on distributed systems, including for LLMs and GNNs?
- How do you process fast and linked data for applications such as Internet of Things (IoT) and fintech using platforms such as Kafka, Spark Streaming and Giraph?
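To make the data-processing layer above concrete, here is a toy, single-machine sketch of the MapReduce model that platforms like Hadoop and Spark scale out across a cluster. The function names and structure are illustrative only, not any platform's actual API.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like a mapper over an input split.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group values by key, like the shuffle between map and reduce tasks.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values, like a reducer.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "fast data is linked data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

In a real deployment, each phase runs as many parallel tasks over partitions of the data, and the shuffle moves intermediate pairs across the network; the course covers how these systems make that fast and fault-tolerant.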
There will also be guest lectures by experts from industry and academia who work on Data Science platforms and machine learning applications in the real world.
The course will have one programming assignment using Big Data platforms, and one literature review with a paper presentation. There is also a course project on scalable data science and systems topics, performed in groups of 2. Teams will have access to computing resources such as a commodity cluster, accelerators and edge devices to apply their classroom knowledge hands-on to real data and real platforms at scale. Two quizzes and a final exam make up the rest of the grade.
Pre-requisites
This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Python or Java).
Tentative Schedule
- First class on Thu 9 Jan at 3:30pm in CDS 202
- Introduction to Distributed Systems & Big Data Storage (Starts 9 Jan, ~5 lectures)
- Intro to Big Data
- Contrast Big Data systems: HBase/Big Table, Cassandra/Key-Value Store, Graph DB overview
- Role of distributed systems; distinction between weak and strong scaling
- Distributed File Systems/HDFS/GFS/Ceph
- Cloud storage
- Reading
- Scalable problems and memory-bounded speedup, Sun and Ni, JPDC, 1993
- The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, ACM SOSP, 2003
- Ceph: A Scalable, High-Performance Distributed File System, Sage Weil, et al., USENIX OSDI, 2006
- Processing Large Volumes of Big Data (Starts ~30 Jan, ~5 lectures)
- Big Data Processing with MapReduce and Apache Spark
- Spark basics: RDDs, transformations, actions, shuffle
- Spark internals & Spark tuning
- Spark DataFrames, Spark SQL and Catalyst Optimizer
- Reading
- MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, USENIX OSDI, 2004
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Matei Zaharia, et al., USENIX NSDI, 2012
- Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al., ACM SIGMOD 2015
- Select chapters from Learning Spark, Holden Karau, et al., 1st Edition, and Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee, 2nd Edition
- Tutorials (TBD)
- Quiz 1 (15 points) (TBD)
- Programming Assignment (15 points)
- Large scale data processing and analysis using Apache Spark
- Posted on TBD, due on TBD
- Machine Learning at Scale (Starts ~18 Feb, ~5 lectures)
- ML over Big Data, TensorFlow
- Data, Model and Pipeline parallelism. Parameter server.
- Federated Learning
- Scalable GNN Training
- Serving LLMs at scale
- Spark ML for ML pipelines
- Reading
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Martín Abadi, et al., arXiv, 2016
- Scaling Distributed Machine Learning with the Parameter Server, Li, Mu, et al., USENIX OSDI, 2014
- Towards federated learning at scale: System design, Bonawitz, Keith, et al., SysML Conference, 2019
- Beyond Data and Model Parallelism for Deep Neural Networks, Zhihao Jia, et al., MLSys 2019
- Orca: A Distributed Serving System for Transformer-Based Generative Models, Gyeong-In Yu, et al., USENIX OSDI 2022
- Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis, Maciej Besta and Torsten Hoefler, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
- Proposal of Project Topic and Team (Due ~20 Feb)
- NoSQL Databases (Starts ~6 Mar, ~4 lectures)
- Consistency models and CAP theorem/BASE
- Amazon Dynamo/Cassandra distributed key-value store
- Overview of HBase/Big Table, Graph Databases, Vector Databases
- Overview of Data Warehousing, Data Lakes, ETL, Cloud NoSQL
- Reading
- The dangers of replication and a solution, Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha, ACM SIGMOD Record, 1996
- CAP Twelve Years Later: How the “Rules” Have Changed, Eric Brewer, IEEE Computer, 2012
- Dynamo: Amazon's Highly Available Key-Value Store, Giuseppe DeCandia, et al., ACM SOSP, 2007
- Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 2 (15 points)
- Selection of Research Paper for Presentation (Due ~9 Mar)
- Processing Fast Data & Linked Data (Starts ~25 Mar, ~4 lectures)
- Need for Fast Data Processing. Internet of Things (IoT) application domain.
- Difference between low-latency ingest, analytics and querying.
- Publish-subscribe systems and Apache Kafka
- Streaming dataflows: Spark Streaming, Twitter Heron, Apache Flink
- Distributed graph processing, Vertex Centric Programming, Pregel, Giraph algorithms
- Reading
- Kafka: A Distributed Messaging System for Log Processing, Jay Kreps, Neha Narkhede, Jun Rao, NetDB, 2011
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Matei Zaharia, et al., ACM SOSP, 2013
- Pregel: a system for large-scale graph processing, Malewicz, et al, ACM SIGMOD 2010
- Quiz 3
- Research Reading and Presentations (10 points) (TBD)
- Presenting one research paper
- Peer review
- Guest Lectures
- Talks by industry speakers throughout the semester
- Final Exam (25 points) (TBD)
- Final Project Presentation (25 points) (TBD)
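As a flavor of the vertex-centric ("think like a vertex") model from the linked-data module, here is a toy, single-machine sketch of Pregel-style supersteps computing single-source shortest paths. Systems like Giraph run this model distributed across workers; the graph, function names and structure here are illustrative only.

```python
import math

def pregel_sssp(adj, source):
    # adj: {vertex: [(neighbor, edge_weight), ...]}
    dist = {v: math.inf for v in adj}
    messages = {source: [0]}            # superstep 0: source receives distance 0
    while messages:                     # run supersteps until no vertex is active
        next_messages = {}
        for v, incoming in messages.items():
            best = min(incoming)
            if best < dist[v]:          # vertex improves its local state...
                dist[v] = best
                for u, w in adj[v]:     # ...and messages its out-neighbors
                    next_messages.setdefault(u, []).append(best + w)
        messages = next_messages
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(pregel_sssp(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

In a real Pregel/Giraph deployment, the vertices are partitioned across machines, each superstep is a bulk-synchronous round, and messages are exchanged over the network between rounds.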
Project Topics
Some sample projects are:
- Flotilla: Federated learning using edge accelerators
- PowerTrain and Fulcrum: Optimizing edge accelerators for LLMs and DNNs
- Optimes: Scalable training and inferencing over GNNs
- XFaaS: Composing and optimizing agentic LLMs as FaaS workflows
- TARIS: Temporal graph and streaming graph mining for anomaly detection over fintech datasets
- AeroDaaS: Distributed platforms for composing applications for drone fleet as a service
- AIOpsLab: Validating LLM agents for resiliency in distributed systems
- Scalable GNN analytics for traffic flow predictions
- Optimizing FaaS workflows for Quantum computing on the cloud
- Building a social network app using AT Protocol of BlueSky
- Decentralized social media data using Solid Project
- Scaling Hyper Ledger Fabric (HLF) for fintech blockchain transactions
- Using Inter Planetary File System (IPFS) for federated data management
Papers for Presentation
Some papers that can be chosen for presentation are listed below. Students may also propose alternative papers and get them approved.
- TBD
Assessments and Grading
| Weightage | Assessment |
| --- | --- |
| 15% | One programming assignment in Spark |
| 30% | 2 Quizzes |
| 10% | Paper Presentation |
| 20% | Final exam |
| 25% | Project |
Teaching & Office Hours
- Lecture: Tue and Thu, 3:30-5pm
- Classroom: CDS 202
- Office Hours: By appointment
Resources
- Online Teams Channel
- Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
Academic Integrity
Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties.
Learning takes place both within and outside the class. Hence, discussions between students and reference to online material is encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. Unless stated otherwise, you must not take any help from others, online sources or generative AI tools (ChatGPT, Copilot, etc.) when solving any assessments. All works submitted by the student as part of their academic assessment must be their own.