Department of Computational and Data Sciences

Scalable Systems for Data Science

Instructors: Yogesh Simmhan (email) (www)
TA: Tuhin Khare (email)
Course number: DS256
Credits: 3:1
Semester: Jan 2022
Lecture: Mon and Wed, 3:30-5pm
Room: Initially on Microsoft Teams (Use Teams Code lvs00ts). Physical room TBD.
Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
First class on Wed 12 Jan at 3:30 PM on Teams
See 2017, 2018, 2019, 2020, 2021 webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.

This course will address three facets of abstractions, platforms and applications for Big Data:

How are distributed programming models such as MapReduce, vertex-centric, parameter server, federated learning, etc. designed to analyze large datasets?
How are popular Big Data and ML platforms like HDFS, Spark ML, Cassandra, Kafka, etc. architected? What makes them scale to hundreds of servers, accelerators and edge devices over terabytes of data?
How can you use these to develop distributed algorithms, scalable analytics and wide-area Internet of Things (IoT) applications using various design patterns?
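
As a flavour of the first facet, the MapReduce programming model can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch and not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real platform such as Hadoop or Spark would run the map and reduce steps in parallel across many servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially on one machine."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big systems"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be distributed without coordination; only the shuffle moves data between workers.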

There will also be guest lectures by experts from industry and academia who work on Big Data platforms and machine learning applications in the real world.

The course will have one programming assignment using Big Data platforms, one literature review with a paper presentation, and a project on a topic that students propose related to scalable data and ML platforms. Students will have access to a compute cluster, accelerators, edge devices and other computing resources to apply their classroom knowledge hands-on to real data and real platforms at scale. Periodic online quizzes and a final exam make up the rest of the grading.

Pre-requisites

This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Java or Python).

Tentative Schedule

Introduction to Big Data & Distributed Systems (Starts 12 Jan)
- Intro to Big Data
- Storage, compute, visualization, etc. platforms
- Overview of files vs. relational databases vs. NoSQL databases
- Contrast Big Data systems: HBase/Big Table, Cassandra/Key-Value Store, Graph DB overview
- Understand the role of distributed systems for data-parallel processing. Clusters, Cloud computing, Edge computing.
- Understand distinction between weak and strong scaling.
- Distributed File Systems/HDFS/GFS
- Cloud storage
- Reading
  - Scalable problems and memory-bounded speedup, Sun and Ni, JPDC, 1993
  - The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, ACM SOSP, 2003
- Quiz 1
Processing Large Volumes of Big Data (Starts 31 Jan)
- Big Data Processing with MapReduce and Spark
- Spark Basics, RDD, transformations, action, Shuffle
- Spark internals & Spark tuning
- Reading
  - MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, USENIX OSDI, 2004
  - Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, USENIX NSDI, 2012
  - Select chapters from Learning Spark, Holden Karau, et al., 1st Edition
- Tutorials (5 and 11 Feb)
- Quiz 2
- Programming Assignment 1
  - Large scale data processing and analysis using Apache Spark
  - Posted on 11 Feb, due on 25 Feb
NoSQL Databases (Starts 16 Feb)
- Consistency models and CAP theorem/BASE
- Amazon Dynamo/Cassandra distributed key-value store
- Spark DataFrames, Spark SQL, Catalyst optimizer
- Overview of HBase/Big Table, Graph Databases
- Overview of Data Warehousing, Data Lakes, ETL, Cloud NoSQL
- Reading
  - The dangers of replication and a solution, Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha, ACM SIGMOD Record, 1996
  - CAP Twelve Years Later: How the “Rules” Have Changed, Eric Brewer, IEEE Computer, 2012
  - Dynamo: Amazon’s highly available key-value store, G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, ACM SOSP, 2007
  - Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al., ACM SIGMOD 2015
  - Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 3
Proposal of Project Topic and Team (Due 2 Mar)
Selection of Paper for Presentation (Due 9 Mar)
Processing Fast Data & Linked Data (Starts 9 Mar)
- Need for Fast Data Processing. Internet of Things (IoT) application domain.
- Difference between low-latency ingest, analytics and querying.
- Publish-subscribe systems and Apache Kafka
- Streaming dataflows: Spark Streaming, Twitter Heron, Apache Flink
- Distributed graph processing, Vertex Centric Programming, Pregel, Giraph algorithms
- Reading
  - Kafka: A Distributed Messaging System for Log Processing, Jay Kreps, Neha Narkhede, Jun Rao, NetDB, 2011
  - TBD: DSTREAM/FLINK/HERON
  - Pregel: a system for large-scale graph processing, Malewicz, et al, ACM SIGMOD 2010
- Quiz 4
Machine Learning at Scale (Starts 28 Mar)
- ML over Big Data
- TensorFlow
- Parameter server and Federated learning
- Spark ML for ML pipelines
- Reading
  - TensorFlow: Large-scale machine learning on heterogeneous distributed systems, Abadi, Martín, et al., arXiv, 2016
  - Scaling Distributed Machine Learning with the Parameter Server, Li, Mu, et al., USENIX OSDI, 2014
  - Towards federated learning at scale: System design, Bonawitz, Keith, et al., SysML Conference, 2019
  - Select chapters from Learning Spark, Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2nd Edition
- Quiz 5
Research Reading and Presentations (11 and 13 Apr)
- Presenting one research paper
- Peer review
Guest Lectures
- Talks by industry speakers throughout the semester
Final Exam (Week of 18 Apr)
Final Project Presentation (23 Apr)
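
The first module of the schedule distinguishes weak from strong scaling. A minimal sketch of that distinction is given here, assuming the two textbook models usually associated with it: Amdahl's law for strong scaling (fixed problem size) and Gustafson's law for weak scaling (problem size grows with the number of workers). The function names and the 10% serial fraction are illustrative assumptions, not values taken from the course.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Strong scaling: the total problem size is fixed.

    The serial part is not sped up, so the speedup is bounded by
    1 / serial_fraction no matter how many workers are added.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def gustafson_speedup(serial_fraction: float, workers: int) -> float:
    """Weak scaling: the parallel part of the work grows with the workers.

    The scaled speedup grows linearly with the number of workers.
    """
    return serial_fraction + (1.0 - serial_fraction) * workers


if __name__ == "__main__":
    # Hypothetical workload in which 10% of the runtime is serial.
    for p in (1, 10, 100):
        strong = amdahl_speedup(0.1, p)
        weak = gustafson_speedup(0.1, p)
        print(f"workers={p:4d} strong={strong:6.2f} weak={weak:6.2f}")
```

Under these assumptions the strong-scaling speedup saturates below 10x, while the weak-scaling speedup keeps growing as workers are added. This is one reason Big Data platforms are usually evaluated on datasets that grow with the cluster.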

Project Topics

Some sample projects are:

Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
Distributed edge (Raspberry Pi) and cloud storage and querying systems
Scalable querying over knowledge graphs
Scalable training and inferencing over graph neural networks
Scalable pattern mining and analysis over Twitter streams
Distributed video analytics over drone (Tello) video feeds

Papers for Presentation

Some papers that can be chosen for presentation are listed below. Students can also propose alternative papers and get them approved.

TBD

Grading

15%	One programming assignment
35%	Quizzes (5 x 7 points)
10%	Paper Presentation
20%	Final exam
20%	Project

Teaching & Office Hours

Lecture: Mon and Wed, 3:30-5pm
- First class on Wed 12 Jan at 3:30 PM on Teams
- Physical class room TBD
Office Hours: By appointment

Resources

Online Teams Channel
Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and on Cloud resources. Details on accessing the cluster and running programs on it will be covered in a lab session.
- Hadoop on turing

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by a student as part of their academic assessment must be their own.
