DS256: Scalable Systems for Data Science [3:1] (Feb, 2021)

Department of Computational and Data Sciences

Scalable Systems for Data Science

  • Instructors: Yogesh Simmhan (email) (www)
  • TA: Tuhin Khare (email)
  • Course number: DS256
  • Credits: 3:1
  • Semester: Feb, 2021
  • Lecture: Mon-Wed 3:30-5pm
  • Room: Virtual on Teams. Use Teams Code 9l7vhi2
  • Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
  • First class on Wed 24 Feb at 5PM; see here for the Teams link to the first class
  • See the 2017, 2018, 2019 and 2020 webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.

This course will address three facets of scalable data platforms.

  • How are distributed programming models such as MapReduce, TensorFlow, vertex-centric and streaming dataflows designed to analyze large datasets?
  • How are popular Big Data and ML platforms like HDFS, Spark, MLlib, TensorFlow, Cassandra, Flink, etc. architected? What makes them scale to hundreds of machines and terabytes of data?
  • How can you use these to develop distributed algorithms and scalable analytics applications using various design patterns?
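
As a small illustration of the first facet, the MapReduce programming model can be sketched on a single machine in plain Python. This is only a toy under stated assumptions: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and are not part of Hadoop or Spark, and a real platform would run the map and reduce steps in parallel across a cluster and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big systems", "data systems scale"]
    print(word_count(docs))
```

The user writes only the mapper and the reducer; grouping by key sits between them, which is what lets a framework distribute the work without changing the user's code.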

There will also be several guest lectures by industry experts, from companies such as Microsoft, IBM Research and VMware, who work on Big Data platforms and machine learning applications in the real world.

The course will have two programming assignments with Big Data platforms: on training ML models and analyzing linked data at scale. There is also a project on topics that students can propose, related to such scalable data and ML platforms. Students will have access to a 24-node compute cluster and other computing resources to apply their classroom knowledge hands-on to real data platforms at scale. Periodic online quizzes and a final exam make up the rest of the grade.

Pre-requisites

This is an introductory course on the platforms and tools required to develop analytics over Big Data. However, you need prior knowledge of the basics of computer systems, data structures and algorithms, and good programming skills (preferably in Java).

Tentative Schedule

  • First Class on Wednesday Feb 24, 2021 at 5PM
  • Storing and Processing large datasets
    • Distributed block storage is the fundamental data platform for storing and accessing large datasets in a reliable and efficient manner. We will discuss:
      • GFS/HDFS for block data storage
      • MapReduce programming model
      • Apache Spark dataflow processing
  • Machine learning over large datasets
    • Training of machine learning models requires computing resources at large scales. We will discuss data platforms for ML that operate on a cluster of servers:
      • Spark MLlib for machine learning
      • TensorFlow for deep learning: Parameter server, Federated learning
  • Analytics over Linked data
    • Social networks and knowledge graphs are represented as linked data, with billions of vertices and edges. Processing, analyzing and querying them at scale requires specialized platforms:
      • Google Pregel model for designing large graph algorithms
      • Apache Giraph/GraphX for graph processing
  • NoSQL Databases
    • NoSQL databases gain scalability and availability on hundreds of servers in exchange for relaxed consistency. Such databases are at the heart of any eCommerce or social network website you use:
      • Consistency models and CAP theorem/BASE
      • Amazon Dynamo/Cassandra distributed key-value store
      • Google Bigtable/HBase and Spark SQL for SQL-like querying
  • Project: Proposal
  • Processing high-velocity data
    • From the Internet of Things (IoT) to financial transactions to Twitter, processing and analyzing data arriving at 1000 events/sec is a latency-sensitive challenge. Fast data platforms help manage such data:
      • Publish-subscribe systems and Apache Kafka
      • Streaming dataflows: Spark Streaming, Apache Flink
  • Project: Midterm review
  • Other topics
    • Big data and ethics
    • Big data on the Cloud
    • IoT and Edge computing
    • Data Lakes
  • Guest Lectures
    • Talks by industry speakers from companies such as Microsoft, IBM Research, VMware, etc.
  • Final Exam
  • Project: Final review
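
To give a flavor of the vertex-centric (Pregel-style) model listed under "Analytics over Linked data" above, here is a minimal single-machine sketch that propagates the maximum vertex value through a graph in supersteps. It is an illustration under assumptions, not the API of Pregel, Giraph or GraphX: the function name pregel_max and the dictionary-based graph representation are hypothetical, and every neighbor is assumed to also appear as a key in the graph.

```python
def pregel_max(graph, values, max_supersteps=30):
    """Propagate the maximum value to every reachable vertex.

    graph:  dict mapping each vertex to a list of its out-neighbors
    values: dict mapping each vertex to its initial numeric value
    """
    values = dict(values)  # work on a copy; do not mutate the caller's data

    # Superstep 0: every vertex sends its own value along its out-edges.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no messages in flight: every vertex has halted

        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays inactive in this superstep
            best = max(messages)
            if best > values[v]:
                # Value improved: update it and notify the neighbors.
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
            # Otherwise the vertex votes to halt by sending nothing.
        inbox = outbox  # messages are delivered at the superstep barrier

    return values


if __name__ == "__main__":
    g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(pregel_max(g, {"a": 3, "b": 6, "c": 1}))
```

Each vertex sees only its own value and its incoming messages, and computation stops once no messages remain. A distributed platform would partition the vertices across machines and exchange the messages between supersteps.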

Project Topics

Some sample projects are:

  • Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
  • Distributed edge (Raspberry Pi) and cloud storage and querying systems
  • Scalable querying over knowledge graphs
  • Scalable training and inferencing over graph neural networks
  • Scalable pattern mining and analysis over Twitter streams
  • Distributed video analytics over drone (Tello) video feeds

Grading

  • 30%: Assignments (two programming assignments)
  • 25%: In-class online quizzes (best 5 out of 6 quizzes)
  • 25%: Project
  • 20%: Exam (final exam)

Teaching & Office Hours

  • Lecture: TBD, will be decided after first class
  • Office Hours: By appointment

Resources

Public Datasets

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course, and failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by students as part of their academic assessment must be their own.