DS256: Scalable Systems for Data Science [Jan, 2026]

Department of Computational and Data Sciences

  • Instructors: Yogesh Simmhan (email) (www)
  • TA: Pranjal Naman (email), TBD
  • Course number: DS256
  • Credits: 3:1
  • Semester: Jan 2026
  • Lecture: Tue-Thu 3:30-5pm (First class on Tue 13 Jan, 2026), Tutorials: TBD
  • Room: CDS 202
  • Teams: Teams Link (Join using Teams Code v0xq5zv)
  • Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required. Basic knowledge of Machine Learning and Deep Learning.
  • See 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2025 webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to store, manage, pre-process and train ML models over datasets that are large, fast and linked. In particular, we will examine the data engineering pipelines required to prepare data before DNN and LLM training. We will also explore scalable machine learning methods using distributed and federated approaches, as well as the Big Data platforms used to scale enterprise data.

  • Besides examining the architectural design of several contemporary data platforms (Google’s GFS/Apache HDFS, Apache Spark (RDD/DF/ML/Streaming), PyTorch Distributed, Amazon’s DynamoDB/Apache Cassandra, Apache Kafka and Google’s Pregel/Apache Giraph), this term we will also examine the Design Patterns for Distributed Systems that these platforms use to achieve scalability, throughput, reliability, etc.
  • Instead of a self-selected team project, the entire class will do a semester-long project on designing a pre-processing pipeline for training an LLM (SLM) using Apache Spark, and subsequently performing distributed model training using PyTorch Distributed and serving the model.
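
To give a flavor of the project’s first part, the map/filter/flatMap/aggregate shape of a Spark-style pre-processing step can be sketched in plain Python. This is a toy, dependency-free illustration only; the helper names clean and tokenize are hypothetical, and the actual project will use the Apache Spark APIs over a real corpus:

```python
# Toy sketch of a text pre-processing pipeline, mirroring the classic
# RDD stages: map -> filter -> flatMap -> reduceByKey.
from collections import Counter

corpus = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
    "",
]

def clean(line: str) -> str:
    return line.strip().lower()

def tokenize(line: str) -> list[str]:
    return line.split()

cleaned = [clean(line) for line in corpus]                      # map
nonempty = [line for line in cleaned if line]                   # filter (drop blanks)
tokens = [tok for line in nonempty for tok in tokenize(line)]   # flatMap
counts = Counter(tokens)                                        # reduceByKey analogue

print(counts["the"])  # -> 3
```

In Spark, each of these stages would be a lazy transformation over a distributed RDD or DataFrame rather than an in-memory list, but the dataflow shape is the same.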

The course modules will cover all layers of a scalable data science and scalable ML stack:

  • How do you store and query data at scale, using distributed file systems such as GFS/HDFS and Ceph and using cloud/NoSQL databases such as HBase and Dynamo?
  • How do you pre-process data at large volumes in preparation for machine learning using distributed processing systems on the cloud, such as Apache Spark?
  • How do you perform scalable training for both classic and deep learning using distributed training patterns and platforms such as parameter servers, model/pipeline parallelism, federated learning, SparkML, PyTorch Distributed and DistDGL? How do you serve model inferencing at scale on distributed systems, including for LLMs and GNNs?
  • How do you process fast and linked data for applications such as Internet of Things (IoT) and fintech using platforms such as Kafka, Spark Streaming and Giraph?
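
For a taste of the distributed training patterns listed above, here is a toy, dependency-free sketch of one synchronous data-parallel step: each “worker” computes a gradient on its own data shard, the gradients are averaged, and a single synchronized weight update is applied. This is the pattern that parameter servers and PyTorch Distributed’s DDP (via all-reduce) implement at cluster scale; the functions grad and train_step below are illustrative only:

```python
# Toy synchronous data-parallel training step for a 1-D linear model y = w * x.
# Each shard plays the role of one worker's local data partition.

def grad(w: float, shard: list[tuple[float, float]]) -> float:
    # Gradient of mean squared error on one worker's data shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w: float, shards: list[list[tuple[float, float]]],
               lr: float = 0.1) -> float:
    grads = [grad(w, s) for s in shards]   # "workers" compute local gradients
    avg = sum(grads) / len(grads)          # the all-reduce / averaging step
    return w - lr * avg                    # one synchronized weight update

# Data generated from y = 3x, split across two "workers"
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = train_step(w, shards)
print(round(w, 2))  # converges toward 3.0
```

Real systems differ in where the averaging happens (a central parameter server versus a decentralized all-reduce among workers), which is one of the design trade-offs the course examines.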

There will also be guest lectures by experts from the industry and academia who work on Data Science platforms and machine learning applications in the real-world.

The course will have one common hands-on course project for 35 points, performed in teams of 2-4 (TBD) and split into 3 parts: (1) Pre-processing pipeline for LLM (SLM) training, (2) Distributed training of LLM (SLM), and (3) Scalable inferencing of LLM (SLM), with the bulk of the weightage on part 1, reducing for parts 2 and 3.

There will also be one literature review and paper presentation (10 points).

Lastly, there will be 2 midterm quizzes (2×15 points) and a final exam (25 points), all in-class and proctored, which form the rest of the grading.

Pre-requisites

This is an introductory course on the design of platforms for data engineering and analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Python). You will also need basic knowledge of Machine Learning and Deep Learning.

Tentative Schedule

Papers for Presentation

Some papers that can be chosen for the presentation are given below. Students can also propose alternative papers and get them approved.

  • TBD

Assessments and Grading

  • 35%: One hands-on programming course project on pre-processing pipeline, training and inferencing for LLMs (20% + 10% + 5%)
  • 30%: 2 Midterm Quizzes (15% + 15%)
  • 10%: Paper Presentation at end of term
  • 25%: Final exam

Teaching & Office Hours

  • Lecture: Tue and Thu, 3:30-5pm
  • Classroom: CDS 202
  • Office Hours: By appointment

Resources

  • Patterns of Distributed Systems, Unmesh Joshi, Martin Fowler, 2023
  • Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course. Use of Generative AI (ChatGPT, Copilot, etc.) in completing any of the assessments, including project coding/reports and presentation, is not permitted. Failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students, reference to online material and conversations with chatbots are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by the student as part of their academic assessment must be their own.