Department of Computational and Data Sciences
Scalable Systems for Data Science
- Instructors: Yogesh Simmhan (email) (www)
- TA: Tuhin Khare (email)
- Course number: DS256
- Credits: 3:1
- Semester: Jan 2022
- Lecture: Mon and Wed, 3:30-5 pm
- Room: Initially on Microsoft Teams (Use Teams Code lvs00ts). Physical room TBD.
- Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
- First class on Wed 12 Jan at 3:30 PM on Teams
- See 2017, 2018, 2019, 2020, 2021 webpages
Overview
This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.
This course will address three facets of abstractions, platforms and applications for Big Data:
- How are distributed programming models such as MapReduce, vertex-centric, parameter server, federated learning, etc. designed to analyze large datasets?
- How are popular Big Data and ML platforms like HDFS, Spark ML, Cassandra, Kafka, etc. architected? What makes them scale to hundreds of servers, accelerators and edge devices over terabytes of data?
- How can you use these to develop distributed algorithms, scalable analytics and wide-area Internet of Things (IoT) applications using various design patterns?
There will also be guest lectures by experts from industry and academia who work on Big Data platforms and machine learning applications in the real world.
The course will have one programming assignment on Big Data platforms, one literature review with a paper presentation, and a project on topics the students can propose related to scalable data and ML platforms. Students will have access to a compute cluster, accelerators, edge devices and other computing resources to apply their classroom knowledge hands-on to real data and real platforms at scale. Periodic online quizzes and a final exam make up the rest of the grading.
Pre-requisites
This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, along with good programming skills (preferably in Java or Python).
Tentative Schedule
- First class on Wed 12 Jan at 3:30 PM on Teams
- Introduction to Big Data & Distributed Systems (Starts 12 Jan)
- Intro to Big Data
- Storage, compute, visualization, etc. platforms
- Files vs. relational databases vs. NoSQL databases: an overview
- Contrasting Big Data systems: HBase/BigTable, Cassandra/key-value stores, and an overview of graph databases
- Understand the role of distributed systems for data-parallel processing. Clusters, Cloud computing, Edge computing.
- Understand the distinction between weak and strong scaling (see the speedup formulas after this module).
- Distributed File Systems/HDFS/GFS
- Cloud storage
- Reading
- Scalable problems and memory-bounded speedup, Sun and Ni, JPDC, 1993
- The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, ACM SOSP, 2003
- Quiz 1
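As a pointer for the weak vs. strong scaling item above, the two classical speedup laws can be summarized as follows, where f is the serial fraction of the work and p is the number of processors (Sun and Ni's memory-bounded speedup in the reading generalizes both):

```latex
% Strong scaling (Amdahl's law): total problem size stays fixed as p grows,
% so speedup is capped by the serial fraction f.
\[ S_{\mathrm{strong}}(p) = \frac{1}{f + (1-f)/p} \le \frac{1}{f} \]

% Weak scaling (Gustafson's law): problem size per processor stays fixed,
% so achievable speedup grows linearly with p.
\[ S_{\mathrm{weak}}(p) = f + (1-f)\,p \]
```

For instance, with f = 0.1 and p = 100, strong scaling tops out near 9.2x, while weak scaling reaches about 90.1x.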
- Processing Large Volumes of Big Data (Starts 31 Jan)
- Big Data Processing with MapReduce and Spark
- Spark basics: RDDs, transformations, actions, shuffles (see the word-count sketch after this module)
- Spark internals & Spark tuning
- Reading
- MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, USENIX OSDI, 2004
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, USENIX NSDI, 2012
- Select chapters from Learning Spark, Holden Karau, et al., 1st Edition
- Tutorials (5 and 11 Feb)
- Quiz 2
- Programming Assignment 1
- Large scale data processing and analysis using Apache Spark
- Posted on 11 Feb, due on 25 Feb
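To give a flavour of the RDD model in this module, here is a minimal PySpark word-count sketch; the input path is a placeholder, and this is only an illustration, not part of the assignment:

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Load a text file as an RDD of lines (the path is a placeholder).
lines = sc.textFile("hdfs:///data/sample.txt")

# Transformations are lazy: nothing executes until an action is called.
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))    # sum counts per word (triggers a shuffle)

# take() is an action: it runs the job and returns results to the driver.
for word, n in counts.take(10):
    print(word, n)

sc.stop()
```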
- NoSQL Databases (Starts 16 Feb)
- Consistency models and CAP theorem/BASE
- Amazon Dynamo/Cassandra distributed key-value store
- Spark DataFrames, Spark SQL, Catalyst optimizer (see the DataFrame sketch after this module)
- Overview of HBase/Big Table, Graph Databases
- Overview of Data Warehousing, Data Lakes, ETL, Cloud NoSQL
- Reading
- The dangers of replication and a solution, Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha, ACM SIGMOD Record, 1996
- CAP Twelve Years Later: How the “Rules” Have Changed, Eric Brewer, IEEE Computer, 2012
- Dynamo: Amazon's Highly Available Key-value Store, Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, ACM SOSP, 2007
- Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al., ACM SIGMOD 2015
- Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 3
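To illustrate the DataFrame and Spark SQL APIs from this module, here is a small sketch over made-up data; Catalyst compiles both the DataFrame and the SQL form of the query into the same optimized plan:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# A tiny in-memory DataFrame; the schema and values are invented for illustration.
df = spark.createDataFrame(
    [("alice", "IN", 34), ("bob", "US", 7), ("carol", "IN", 19)],
    ["user", "country", "clicks"],
)

# Declarative DataFrame query: Catalyst optimizes the logical plan before execution.
(df.filter(F.col("clicks") > 10)
   .groupBy("country")
   .agg(F.sum("clicks").alias("total_clicks"))
   .show())

# The same query expressed in Spark SQL over a temporary view.
df.createOrReplaceTempView("events")
spark.sql("""SELECT country, SUM(clicks) AS total_clicks
             FROM events WHERE clicks > 10 GROUP BY country""").show()

spark.stop()
```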
- Proposal of Project Topic and Team (Due 2 Mar)
- Selection of Paper for Presentation (Due 9 Mar)
- Processing Fast Data & Linked Data (Starts 9 Mar)
- Need for Fast Data Processing. Internet of Things (IoT) application domain.
- Difference between low-latency ingest, analytics and querying.
- Publish-subscribe systems and Apache Kafka (see the pub-sub sketch after this module)
- Streaming dataflows: Spark Streaming, Twitter Heron, Apache Flink
- Distributed graph processing, vertex-centric programming, Pregel, Giraph algorithms (see the superstep sketch after this module)
- Reading
- Kafka: A Distributed Messaging System for Log Processing, Jay Kreps, Neha Narkhede, Jun Rao, NetDB, 2011
- TBD: DSTREAM/FLINK/HERON
- Pregel: a system for large-scale graph processing, Grzegorz Malewicz, et al., ACM SIGMOD, 2010
- Quiz 4
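For the publish-subscribe item above, here is a minimal sketch assuming the third-party kafka-python client and a single broker at localhost:9092; the topic name and addresses are invented for illustration:

```python
# Assumes: `pip install kafka-python` and a Kafka broker at localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few events to a topic; the broker appends them to a durable log.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("sensor-readings", f"reading-{i}".encode("utf-8"))
producer.flush()  # block until buffered records are sent

# Consumer: subscribe to the topic and read from the earliest retained offset.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if idle for 5 s
)
for record in consumer:
    print(record.offset, record.value)
```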
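And for the vertex-centric model, here is a toy single-process simulation of Pregel-style supersteps computing PageRank; the graph is made up, and a real system such as Pregel or Giraph would partition vertices across workers and exchange these messages over the network:

```python
# Toy directed graph as an adjacency list (invented example data).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 / len(graph) for v in graph}

DAMPING, SUPERSTEPS = 0.85, 20

for _ in range(SUPERSTEPS):
    # "Send" phase: each vertex sends its rank share along its out-edges.
    inbox = {v: [] for v in graph}
    for v, out_edges in graph.items():
        for u in out_edges:
            inbox[u].append(rank[v] / len(out_edges))
    # "Compute" phase: each vertex aggregates its messages and updates its value.
    rank = {v: (1 - DAMPING) / len(graph) + DAMPING * sum(msgs)
            for v, msgs in inbox.items()}

print(rank)  # approximate PageRank after the supersteps
```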
- Machine Learning at Scale (Starts 28 Mar)
- ML over Big Data
- TensorFlow
- Parameter server and federated learning (see the toy parameter-server sketch after this module)
- Spark ML for ML pipelines (see the pipeline sketch after this module)
- Reading
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Martín Abadi, et al., arXiv, 2016
- Scaling Distributed Machine Learning with the Parameter Server, Mu Li, et al., USENIX OSDI, 2014
- Towards Federated Learning at Scale: System Design, Keith Bonawitz, et al., SysML Conference, 2019
- Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 5
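To make the parameter-server idea above concrete, here is a toy single-process sketch on synthetic linear-regression data; the "workers" run sequentially here, whereas a real system shards parameters across server nodes and runs workers in parallel:

```python
import numpy as np

np.random.seed(0)
true_w = np.array([2.0, -1.0])

# Synthetic data, split into two shards, one per "worker".
X = np.random.randn(100, 2)
y = X @ true_w
shards = [(X[:50], y[:50]), (X[50:], y[50:])]

w = np.zeros(2)   # parameters held by the "server"
lr = 0.1          # learning rate

for step in range(100):
    for Xi, yi in shards:                # each worker in turn:
        w_local = w.copy()               # 1. pull current parameters
        grad = 2 * Xi.T @ (Xi @ w_local - yi) / len(yi)  # 2. compute local gradient
        w -= lr * grad                   # 3. push gradient; server applies the update

print(w)  # converges towards [2.0, -1.0]
```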
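And for Spark ML pipelines, a minimal two-stage sketch over invented toy data; the column names and values are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLPipelineDemo").getOrCreate()

# Tiny invented training set: two features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.3, 1.0), (0.1, 0.2, 0.0)],
    ["f1", "f2", "label"],
)

# Stage 1 assembles raw columns into a feature vector;
# stage 2 fits a logistic regression model on that vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(train)

# The fitted PipelineModel applies both stages to new data.
model.transform(train).select("features", "label", "prediction").show()

spark.stop()
```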
- Research Reading and Presentations (11 and 13 Apr)
- Presenting one research paper
- Peer review
- Guest Lectures
- Talks by industry speakers throughout the semester
- Final Exam (Week of 18 Apr)
- Final Project Presentation (23 Apr)
Project Topics
Some sample projects are:
- Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
- Distributed edge (Raspberry Pi) and cloud storage and querying systems
- Scalable querying over knowledge graphs
- Scalable training and inferencing over graph neural networks
- Scalable pattern mining and analysis over Twitter streams
- Distributed video analytics over drone (Tello) video feeds
Papers for Presentation
Some candidate papers for presentation are listed below. Students can also propose alternative papers and get them approved.
- TBD
Grading
| Weight | Component |
| ------ | --------- |
| 15% | One programming assignment |
| 35% | Quizzes (5 x 7 points) |
| 10% | Paper presentation |
| 20% | Final exam |
| 20% | Project |
Teaching & Office Hours
- Lecture: Mon and Wed, 3:30-5 pm
- First class on Wed 12 Jan at 3:30 PM on Teams
- Physical class room TBD
- Office Hours: By appointment
Resources
- Online Teams Channel
- Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and on Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
Academic Integrity
Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties.
Learning takes place both within and outside the class. Hence, discussions among students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by a student as part of their academic assessment must be their own.