Department of Computational and Data Sciences

Scalable Systems for Data Science

Instructor: Yogesh Simmhan
TA: Tuhin Khare
Course number: DS256
Credits: 3:1
Semester: Feb, 2021
Lecture: Mon-Wed 3:30-5pm
Room: Virtual on Teams. Use Teams Code 9l7vhi2
Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
First class on Wed 24 Feb at 5PM; see here for the Teams link to the first class
See also the 2017, 2018, 2019 and 2020 course webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.

This course will address three facets of scalable data platforms.

How are distributed programming models such as MapReduce, TensorFlow, vertex-centric and streaming dataflows designed to analyze large datasets?
How are popular Big Data and ML platforms like HDFS, Spark, MLlib, TensorFlow, Cassandra, Flink, etc. architected? What makes them scale to hundreds of machines and terabytes of data?
How can you use these to develop distributed algorithms and scalable analytics applications using various design patterns?

There will also be several guest lectures by industry experts, from companies such as Microsoft, IBM Research and VMware, who work on Big Data platforms and machine learning applications in the real world.

The course will have two programming assignments with Big Data platforms: one on training ML models and one on analyzing linked data at scale. There is also a project on a topic that students can propose, related to such scalable data and ML platforms. Students will have access to a 24-node compute cluster and other computing resources to apply their classroom knowledge hands-on to real data platforms at scale. Periodic online quizzes and a final exam make up the rest of the grade.

Pre-requisites

This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, as well as good programming skills (preferably in Java).

Tentative Schedule

First Class on Wednesday Feb 24, 2021 at 5PM
Storing and Processing large datasets
- Distributed block storage is the fundamental data platform for storing and accessing large datasets in a reliable and efficient manner. We will discuss:
  - GFS/HDFS for block data storage
  - MapReduce programming model
  - Apache Spark dataflow processing
Machine learning over large datasets
- Training of machine learning models requires computing resources at large scales. We will discuss data platforms for ML that operate on a cluster of servers:
  - Spark MLlib for machine learning
  - TensorFlow for deep learning: parameter server, federated learning
Analytics over Linked data
- Social networks and knowledge graphs are represented as linked data, with billions of vertices and edges. Processing, analyzing and querying them at scale requires specialized platforms:
  - Google Pregel model for designing large graph algorithms
  - Apache Giraph/GraphX for graph processing
NoSQL Databases
- NoSQL databases offer scalability and availability on hundreds of servers in return for relaxed consistency. Such databases are at the heart of any eCommerce or social network website you use:
  - Consistency models and CAP theorem/BASE
  - Amazon Dynamo/Cassandra distributed key-value store
  - Google Bigtable/HBase and SparkSQL for SQL-like querying
Project: Proposal
Processing high-velocity data
- From Internet of Things (IoT) to financial transactions to Twitter, processing and analyzing data arriving at 1000 events/sec is a latency-sensitive challenge. Fast data platforms help manage these:
  - Publish-subscribe systems and Apache Kafka
  - Streaming dataflows: Spark Streaming, Apache Flink
Project: Midterm review
Other topics
- Big data and ethics
- Big data on the Cloud
- IoT and Edge computing
- Data Lakes
Guest Lectures
- Talks by industry speakers from companies such as Microsoft, IBM Research, VMware, etc.
Final Exam
Project: Final review

Project Topics

Some sample projects are:

Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
Distributed edge (Raspberry Pi) and cloud storage and querying systems
Scalable querying over knowledge graphs
Scalable training and inferencing over graph neural networks
Scalable pattern mining and analysis over Twitter streams
Distributed video analytics over drone (Tello) video feeds

Grading

30%	Assignments (two programming assignments)
25%	In-class online quizzes (best 5 out of 6 quizzes)
25%	Project
20%	Final exam
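
As an illustration of how the weights above combine, here is a minimal sketch in Python. It is not the official grading procedure. It assumes every component is scored on a 0-100 scale, that the two assignments count equally, and that the best five quizzes count equally; the function name `course_score` and its parameters are hypothetical.

```python
def course_score(assignments, quizzes, project, exam):
    """Combine component scores (each assumed to be on a 0-100 scale)
    into a final percentage using the weights listed above."""
    # Assumption: the two programming assignments are weighted equally.
    assignment_avg = sum(assignments) / len(assignments)

    # Only the best 5 of the 6 quizzes count.
    # Assumption: the counted quizzes are weighted equally.
    best_quizzes = sorted(quizzes, reverse=True)[:5]
    quiz_avg = sum(best_quizzes) / len(best_quizzes)

    # 30% assignments, 25% quizzes, 25% project, 20% final exam.
    return (0.30 * assignment_avg
            + 0.25 * quiz_avg
            + 0.25 * project
            + 0.20 * exam)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    score = course_score(
        assignments=[80, 90],
        quizzes=[100, 100, 100, 100, 100, 0],  # the missed quiz is dropped
        project=80,
        exam=70,
    )
    print(f"{score:.1f}")
```

In this sketch a single missed or low quiz does not affect the result, because only the five highest quiz scores enter the average.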

Teaching & Office Hours

Lecture: TBD, will be decided after first class
Office Hours: By appointment

Resources

Select literature
Select chapters from:
- Mastering Apache Spark 2 (Spark 2.2+), Jacek Laskowski
- Spark Internals, Jerry Lead
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, 1st Edition, Morgan & Claypool Publishers, 2010
- Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman, 2nd Edition (v2.1), 2014.
- TBD
Online documentation
- Apache Hadoop/HDFS
- Apache Spark
- TBD
Online Teams Channel: TBD
Cluster Access: Students will validate their assignments and projects on the CDS turing cluster, and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
- Hadoop on turing

Public Datasets

Microsoft Data Science for Research: Dataset directory
TBD

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by a student as part of their academic assessment must be their own.

