Department of Computational and Data Sciences
Scalable Systems for Data Science
- Instructor: Yogesh Simmhan (email) (www)
- TA: Tuhin Khare (email)
- Course number: DS256
- Credits: 3:1
- Semester: Feb, 2021
- Lecture: Mon-Wed 3:30-5pm
- Room: Virtual on Teams. Use Teams Code 9l7vhi2
- Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
- First class on Wed 24 Feb at 5PM, see here for Teams Link to first class
- See 2017, 2018, 2019, 2020 webpages
Overview
This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.
This course will address three facets of scalable data platforms.
- How are distributed program models such as MapReduce, TensorFlow, Vertex-centric and streaming dataflows designed to analyze large datasets?
- How are popular Big Data and ML platforms like HDFS, Spark, MLlib, TensorFlow, Cassandra, Flink, etc. architected? What makes them scale to 100s of machines and terabytes of data?
- How can you use these to develop distributed algorithms and scalable analytics applications using various design patterns?
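As a rough illustration of the first facet, the MapReduce programming model can be sketched on a single machine. This is a minimal sketch and not how Hadoop or Spark implement it: the names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and the in-memory dictionary stands in for the distributed shuffle that a real platform performs across machines.

```python
from collections import defaultdict

def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Stand-in for the shuffle: collect values that share a key.
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

lines = ["big data big clusters", "data at scale"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

The user writes only the map and reduce functions. On a real platform the grouping step is what gets partitioned across the cluster, which is why the model scales without changes to user code.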
There will also be several guest lectures by industry experts, from organizations such as Microsoft, IBM Research, VMware, etc., who work on Big Data platforms and machine learning applications in the real world.
The course will have two programming assignments with Big Data platforms: on training ML models and analyzing linked data at scale. There is also a project on topics the students can propose related to such scalable data and ML platforms. Students will have access to a 24-node compute cluster and other computing resources to apply their classroom knowledge hands-on to real data platforms at scale. There will be periodic online quizzes and a final exam to form the rest of the grading.
Pre-requisites
This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Java).
Tentative Schedule
- First Class on Wednesday Feb 24, 2021 at 5PM
- Storing and Processing large datasets
- Distributed block storage is the fundamental data platform for storing and accessing large datasets in a reliable and efficient manner. We will discuss:
- GFS/HDFS for block data storage
- MapReduce programming model
- Apache Spark dataflow processing
- Machine learning over large datasets
- Training of machine learning models requires computing resources at large scales. We will discuss data platforms for ML that operate on a cluster of servers:
- Spark MLlib for machine learning
- TensorFlow for deep learning: Parameter server, Federated
- Analytics over Linked data
- Social networks and knowledge graphs are represented as linked data, with billions of vertices and edges. Processing, analyzing and querying them at scale requires specialized platforms:
- Google Pregel model for designing large graph algorithms
- Apache Giraph/GraphX for graph processing
- NoSQL Databases
- NoSQL databases offer scalability and availability on 100s of servers in return for relaxed consistency. Such databases are at the heart of any eCommerce or social network website you use:
- Consistency models and CAP theorem/BASE
- Amazon Dynamo/Cassandra distributed key-value store
- Google Big Table/HBase and SparkSQL for SQL-like querying
- Project: Proposal
- Processing high-velocity data
- From Internet of Things (IoT) to financial transactions to Twitter, processing and analyzing data arriving at 1000 events/sec is a latency-sensitive challenge. Fast data platforms help manage these:
- Publish-subscribe systems and Apache Kafka
- Streaming dataflows: Spark Streaming, Apache Flink
- Project: Midterm review
- Other topics
- Big data and ethics
- Big data on the Cloud
- IoT and Edge computing
- Data Lakes
- Guest Lectures
- Talks by industry speakers from organizations such as Microsoft, IBM Research, VMware, etc.
- Final Exam
- Project: Final review
Project Topics
Some sample projects are:
- Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
- Distributed edge (Raspberry Pi) and cloud storage and querying systems
- Scalable querying over knowledge graphs
- Scalable training and inferencing over graph neural networks
- Scalable pattern mining and analysis over Twitter streams
- Distributed video analytics over drone (Tello) video feeds
Grading
30% | Two programming assignments |
25% | In-class online quizzes (best 5 out of 6 quizzes) |
25% | Project |
20% | Final exam |
Teaching & Office Hours
- Lecture: TBD, will be decided after first class
- Office Hours: By appointment
Resources
- Select literature
- Select chapters from:
- Mastering Apache Spark 2 (Spark 2.2+), Jacek Laskowski
- Spark Internals, Jerry Lead
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, 1st Edition, Morgan & Claypool Publishers, 2010
- Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman, 2nd Edition (v2.1), 2014.
- TBD
- Online documentation
- Online Teams Channel: TBD
- Cluster Access: Students will validate their assignments and projects on the CDS turing cluster, and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
Public Datasets
- Microsoft Data Science for Research: Dataset directory
- TBD
Academic Integrity
Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course, and failure to follow these guidelines will lead to sanctions and penalties.
Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by students as part of their academic assessment must be their own.