DS256: Scalable Systems for Data Science [Jan, 2025]

Department of Computational and Data Sciences

Scalable Systems for Data Science

  • Instructors: Yogesh Simmhan (email) (www)
  • TA: Radhika Mittal
  • Course number: DS256
  • Credits: 3:1
  • Semester: Jan 2025
  • Lecture: Tue and Thu, 3:30-5pm (First class on Thu 9 Jan, 2025)
  • Room: CDS 202
  • Teams: Teams Link (Join using Teams Code q450782)
  • Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
  • See 2017, 2018, 2019, 2020, 2021, 2022, 2023 webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to store, manage and process datasets that are large, fast and linked. This includes data engineering to pre-process data before machine learning, as well as scalable machine learning using distributed and federated approaches. If you are curious about how distributed ML, NoSQL and Big Data platforms work internally, and how to use them efficiently to store and analyze terabytes of data, this is the course for you.

The course modules will cover all layers of a scalable data science stack:

  • How do you store and query data at scale, using distributed file systems such as GFS/HDFS and Ceph and using cloud/NoSQL databases such as HBase and Dynamo?
  • How do you pre-process data at large volumes in preparation for machine learning using distributed processing systems on the cloud, such as Apache Spark?
  • How do you perform scalable training for both classic and deep learning using distributed training patterns and platforms such as parameter server, model/pipeline parallelism, federated learning, SparkML and DistDGL? How do you serve model inference at scale on distributed systems, including for LLMs and GNNs?
  • How do you process fast and linked data for applications such as Internet of Things (IoT) and fintech using platforms such as Kafka, Spark Streaming and Giraph?

There will also be guest lectures by experts from industry and academia who work on data science platforms and machine learning applications in the real world.

The course will have one programming assignment on Big Data platforms. There will be one literature review and paper presentation. There is also a course project on scalable data science and systems topics that will be performed in groups of 2. Teams will have access to computing resources such as a commodity cluster, accelerators and edge devices to apply their classroom knowledge hands-on to real data and real platforms at scale. Two quizzes and a final exam make up the rest of the grading.

Pre-requisites

This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, as well as good programming skills (preferably in Python or Java).

Tentative Schedule

Project Topics

Some sample projects are:

  • Flotilla: Federated learning using edge accelerators
  • PowerTrain and Fulcrum: Optimizing edge accelerators for LLMs and DNNs
  • Optimes: Scalable training and inferencing over GNNs
  • XFaaS: Composing and optimizing agentic LLMs as FaaS workflows
  • TARIS: Temporal graph and streaming graph mining for anomaly detection over fintech datasets
  • AeroDaaS: Distributed platforms for composing applications for drone fleet as a service
  • AIOpsLab: Validating LLM agents for resiliency in distributed systems
  • Scalable GNN analytics for traffic flow predictions
  • Optimizing FaaS workflows for Quantum computing on the cloud
  • Building a social network app using the AT Protocol of Bluesky
  • Decentralized social media data using Solid Project
  • Scaling Hyperledger Fabric (HLF) for fintech blockchain transactions
  • Using the InterPlanetary File System (IPFS) for federated data management

Papers for Presentation

Some papers that can be chosen for presentation are listed below. Students can also propose alternative papers and get them approved.

  • TBD

Assessments and Grading

Weightage and Assessment

  • 15%: One programming assignment in Spark
  • 30%: 2 Quizzes
  • 10%: Paper Presentation
  • 20%: Final exam
  • 25%: Project

Teaching & Office Hours

  • Lecture: Tue and Thu, 3:30-5pm
  • Classroom: CDS 202
  • Office Hours: By appointment

Resources

  • Online Teams Channel
  • Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course, and failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. Unless stated otherwise, you must not take any help from others, online sources or generative AI tools (ChatGPT, Copilot, etc.) when solving any assessments. All work submitted by students as part of their academic assessment must be their own.