Department of Computational and Data Sciences
Scalable Systems for Data Science
- Instructors: Yogesh Simmhan (email) (www)
- TA: Pranjal Naman (email), TBD
- Course number: DS256
- Credits: 3:1
- Semester: Jan 2026
- Lecture: Tue-Thu 330-5pm (First class on Tue 13 Jan, 2026), Tutorials: TBD
- Room: CDS 202
- Teams: Teams Link (Join using Teams Code v0xq5zv)
- Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required. Basic knowledge of Machine Learning and Deep Learning.
- See 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2025 webpages
Overview
This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to store, manage, pre-process and train ML models over datasets that are large, fast and linked. In particular, we will examine the data engineering pipelines required to prepare data before DNN and LLM training. We will also explore scalable machine learning methods using distributed and federated approaches, besides Big Data platforms used to scale enterprise data.
There are two key changes this term:
- Besides examining the architectural design of several contemporary data platforms (Google’s GFS/Apache HDFS, Apache Spark (RDD/DF/ML/Streaming), PyTorch Distributed, Amazon’s DynamoDB/Apache Cassandra, Apache Kafka and Google’s Pregel/Apache Giraph), this term we will also examine Design Patterns for Distributed Systems that these platforms use to achieve scalability, throughput, reliability, etc.
- Instead of a self-selected team project, the entire class will do a semester-long project on designing a pre-processing pipeline for training an LLM (SLM) using Apache Spark, and subsequently performing distributed model training using PyTorch Distributed and serving the model.
The course modules will cover all layers of a scalable data science and scalable ML stack:
- How do you store and query data at scale, using distributed file systems such as GFS/HDFS and Ceph and using cloud/NoSQL databases such as HBase and Dynamo?
- How do you pre-process data at large volumes in preparation for machine learning using distributed processing systems on the cloud, such as Apache Spark? (A short illustrative sketch follows this list.)
- How do you perform scalable training for both classic and deep learning using distributed training patterns and platforms such as parameter server, model/pipeline parallelism, federated learning, SparkML, Pytorch Distributed and DistDGL? How do we serve model inferencing at scale on distributed systems, including LLMs and GNNs?
- How do you process fast and linked data for applications such as Internet of Things (IoT) and fintech using platforms such as Kafka, Spark Streaming and Giraph?
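As a small, hedged illustration of the second question above, the sketch below shows the kind of PySpark code used to pre-process a text corpus at scale. The input/output paths and cleaning thresholds are placeholder assumptions for illustration, not a course requirement or reference solution.

```python
# A minimal sketch (not the course's reference solution) of text pre-processing
# with PySpark. Paths and thresholds below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("corpus-preprocess").getOrCreate()

# Each line of the raw corpus becomes a row with a single 'value' column.
docs = spark.read.text("data/raw_corpus/*.txt")  # hypothetical input path

# Typical cleaning before LLM/SLM training: normalise whitespace,
# drop very short lines, and remove exact duplicates.
cleaned = (docs
           .withColumn("text", F.trim(F.regexp_replace("value", r"\s+", " ")))
           .filter(F.length("text") > 20)
           .dropDuplicates(["text"]))

# Persist the cleaned corpus as Parquet for the downstream training stage.
cleaned.select("text").write.mode("overwrite").parquet("data/clean_corpus")
spark.stop()
```

The same script runs unchanged on a laptop or on a cluster; Spark parallelises the transformations across workers, which is the scaling behaviour this module studies.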
There will also be guest lectures by experts from industry and academia who work on Data Science platforms and machine learning applications in the real world.
The course will have one common hands-on course project worth 35 points, performed in teams of 2-4 (TBD) and split into 3 parts: (1) Pre-processing pipeline for LLM (SLM) training, (2) Distributed training of the LLM (SLM), and (3) Scalable inferencing of the LLM (SLM), with the bulk of the weightage on part 1 and decreasing weightage for parts 2 and 3.
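As an illustrative (not prescriptive) sketch of what part (2) involves, the snippet below shows the basic structure of data-parallel training with PyTorch DistributedDataParallel. The model, data and hyper-parameters are stand-ins, not the project's actual SLM.

```python
# Minimal data-parallel training sketch with PyTorch Distributed (DDP).
# The linear model and random tensors are placeholders for the project's SLM
# and pre-processed corpus. Launch with, e.g.: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU nodes
    model = DDP(torch.nn.Linear(128, 1))      # wrap the model for gradient all-reduce
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(100):
        x, y = torch.randn(32, 128), torch.randn(32, 1)  # stand-in data shard
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                       # gradients are averaged across ranks here
        opt.step()

    if dist.get_rank() == 0:
        print("final loss:", loss.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each rank runs the same script over its own shard of the data, and gradients are synchronised during backward(); this is the data-parallel pattern described in the PyTorch Distributed reading (Li et al., PVLDB 2020).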

There will also be one literature review and paper presentation (10 points).
Lastly, there will be 2 midterm quizzes (2×15 points) and a final exam (25 points), all held in-class and proctored, which form the rest of the grading.
Pre-requisites
This is an introductory course on the design of platforms for data engineering and analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Python). You will also need basic knowledge of Machine Learning and Deep Learning.
Tentative Schedule
- First class on Tue 13 Jan at 330PM at CDS 202
- Introduction to Distributed Systems & Big Data Storage (Starts 13 Jan, ~4 lectures)
- Intro to Big Data
- Contrast Big Data systems: HBase/Big Table, Cassandra/Key-Value Store, Graph DB overview
- Understand the role of distributed systems, and the distinction between weak and strong scaling.
- Distributed File Systems/HDFS/GFS/Ceph
- Cloud storage
- Reading
- Scalable problems and memory-bounded speedup, Sun and Ni, JPDC, 1993
- The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, ACM SOSP, 2003
- Ceph: A Scalable, High-Performance Distributed File System, Sage Weil, et al., USENIX OSDI, 2006
- Processing Large Volumes of Big Data (Starts ~27 Jan, ~5 lectures)
- Big Data Processing with MapReduce and Apache Spark
- Spark Basics, RDD, transformations, action, Shuffle
- Spark internals & Spark tuning
- Spark DataFrames, Spark SQL and Catalyst Optimizer
- Reading
- MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, USENIX OSDI, 2004
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Matei Zaharia, et al., USENIX NSDI, 2012
- Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al., ACM SIGMOD 2015
- Select chapters from Learning Spark, Holden Karau, et al., 1st Edition, and Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- The RefinedWeb Dataset for Falcon LLM, Guilherme Penedo, et al, NeurIPS 2023
- LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron, et al., arXiv 2302.13971, 2023
- Tutorials (TBD)
- Quiz 1, Modules 1 and 2 (15 points) (12 Feb)
- Project Assignment #1 (20 points) (Posted on ~30 Jan, Due on ~27 Feb)
- Pre-processing pipeline for LLM (SLM) training using Apache Spark
- Machine Learning at Scale (Starts ~17 Feb, ~5 lectures)
- ML over Big Data, Spark ML for ML pipelines.
- Data, Model and Pipeline parallelism. Parameter server. PyTorch Distributed.
- Training and Serving LLMs at scale
- Federated Learning platforms
- Scalable GNN training
- Reading
- Scaling Distributed Machine Learning with the Parameter Server, Li, Mu, et al., USENIX OSDI, 2014
- Towards federated learning at scale: System design, Bonawitz, Keith, et al., SysML Conference, 2019
- Beyond Data and Model Parallelism for Deep Neural Networks, Zhihao Jia, et al., MLSys 2019
- PyTorch distributed: experiences on accelerating data parallel training, Shen Li, et al, PVLDB, 2020
- Orca: A Distributed Serving System for Transformer-Based Generative Models, Gyeong-In Yu, et al., USENIX OSDI 2022
- Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis, Besta and Hoefler, IEEE TPAMI, 2024
- Project Assignment #2 (10 points) (Posted on ~6 Mar, Due on ~20 Mar)
- Distributed training of an LLM (SLM) using PyTorch Distributed
- NoSQL Databases (Starts ~10 Mar, ~4 lectures)
- Consistency models and CAP theorem/BASE
- Amazon Dynamo/Cassandra distributed key-value store
- Overview of HBase/Big Table, Graph Databases, Vector Databases
- Overview of Data Warehousing, Data Lakes, ETL, Cloud NoSQL
- Reading
- The dangers of replication and a solution, Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha, ACM SIGMOD Record, 1996
- CAP Twelve Years Later: How the “Rules” Have Changed, Eric Brewer, IEEE Computer, 2012
- Dynamo: Amazon's Highly Available Key-value Store, Giuseppe DeCandia, et al., ACM SOSP, 2007
- Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 2, Modules 3 and 4 (15 points) (24 Mar)
- Project Assignment #3 (5 points) (Posted on ~27 Mar, Due on ~10 Apr)
- Scalable LLM Serving
- Selection of Research Paper for Presentation (Due ~25 Mar)
- Processing Fast Data & Linked Data (Starts ~26 Mar, ~4 lectures)
- Need for Fast Data Processing. Internet of Things (IoT) application domain.
- Difference between low-latency ingest, analytics and querying.
- Publish-subscribe systems and Apache Kafka (a minimal client sketch appears after this schedule)
- Streaming dataflows: Spark Streaming, Twitter Heron, Apache Flink
- Distributed graph processing, Vertex Centric Programming, Pregel, Giraph algorithms
- Reading
- Kafka: A Distributed Messaging System for Log Processing, Jay Kreps, Neha Narkhede, Jun Rao, NetDB, 2011
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Matei Zaharia, et al., ACM SOSP, 2013
- Pregel: a system for large-scale graph processing, Malewicz, et al, ACM SIGMOD 2010
- Guest Lectures
- Talks by industry speakers throughout the semester and at the end of the semester
- Research Reading and Presentations (10 points) (~Fri 17 Apr)
- Presenting one research paper
- Peer review
- Final Exam (25 points)(TBD)
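To give a flavour of the publish-subscribe item in the fast-data module above, here is a minimal sketch using the kafka-python client. The client library, broker address and topic name are assumptions for illustration, not the course's mandated tooling.

```python
# Minimal publish-subscribe sketch with the kafka-python client (an assumption
# for illustration; the course may use a different client or language binding).
# Assumes a Kafka broker is reachable at localhost:9092 and uses a hypothetical
# topic named "sensor-readings".
from kafka import KafkaProducer, KafkaConsumer

# Producer: an IoT-style source publishing one reading to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", value=b'{"sensor": 7, "temp_c": 22.5}')
producer.flush()

# Consumer: a streaming analytics job reading from the same topic.
consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)  # downstream processing (e.g., a streaming dataflow) goes here
```

The broker decouples producers from consumers, which is what allows ingest, analytics and querying to scale independently in the streaming systems covered in this module.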
Papers for Presentation
Some papers that can be chosen for presentation are listed below. Students can also propose alternative papers and get them approved.
- TBD
Assessments and Grading
| Weightage | Assessment |
|---|---|
| 35% | One hands-on programming course project on pre-processing pipeline, training and inferencing for LLMs (20% + 10% + 5%) |
| 30% | 2 Midterm Quizzes (15% + 15%) |
| 10% | Paper Presentation at end of term |
| 25% | Final exam |
Teaching & Office Hours
- Lecture: Tue and Thu, 330-5pm
- Classroom: CDS 202
- Office Hours: By appointment
Resources
- Patterns of Distributed Systems, Unmesh Joshi, Martin Fowler, 2023
- Online Teams Channel (Use Teams Code: v0xq5zv)
- Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
Academic Integrity
Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course. Use of Generative AI (ChatGPT, Copilot, etc.) in completing any of the assessments, including project coding/reports and presentation, is not permitted. Failure to follow these guidelines will lead to sanctions and penalties.
Learning takes place both within and outside the class. Hence, discussions between students, reference to online material and conversations with chatbots are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by the student as part of their academic assessment must be their own.




