DS256: Scalable Systems for Data Science [Jan, 2022]

Department of Computational and Data Sciences

Scalable Systems for Data Science

  • Instructors: Yogesh Simmhan (email) (www)
  • TA: Tuhin Khare (email)
  • Course number: DS256
  • Credits: 3:1
  • Semester: Jan 2022
  • Lecture: Mon-Wed 330-5pm
  • Room: Initially on Microsoft Teams (Use Teams Code lvs00ts). Physical room TBD.
  • Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required.
  • First class on Wed 12 Jan at 330PM on Teams
  • See 2017, 2018, 2019, 2020, 2021 webpages

Overview

This course will teach the fundamental “systems” aspects of designing and using scalable data science platforms. Such platforms are used to acquire, store and query large, fast and linked datasets, to train machine learning models, and to process and analyze large datasets. If you are curious about how Big Data, NoSQL and ML platforms work internally and how to use them efficiently to store and process terabytes of data, this is the course for you.

This course will address three facets of abstractions, platforms and applications for Big Data:

  • How are distributed program models such as Map Reduce, vertex-centric, parameter server, federated learning, etc. designed to analyze large datasets?
  • How are popular Big Data and ML platforms like HDFS, Spark ML, Cassandra, Kafka, etc. architected? What makes them scale on 100s of servers, accelerators and edge devices over terabytes of data?
  • How can you use these to develop distributed algorithms, scalable analytics and wide-area Internet of Things (IoT) applications using various design patterns?

There will also be guest lectures by experts from the industry and academia who work on Big Data platforms and machine learning applications in the real-world.

The course will have one programming assignment with Big Data platforms. There will be one literature review and paper presentation. There is also a project on topics the students can propose related to such scalable data and ML platforms. Students will have access to a compute cluster, accelerators, edge devices, etc. and other computing resources to apply their classroom knowledge hands-on to real data and real platforms at scale. There will be periodic online quizzes and a final exam to form the rest of the grading.

Pre-requisites

This is an introductory course on platforms and tools required to develop analytics over Big Data. However, you need prior knowledge on basics of computer systems, data structures, algorithms and good programming skills (preferably in Java or Python).

Tentative Schedule

  • First class on Wed 12 Jan at 330PM on Teams
  • Introduction to Big Data & Distributed Systems (Starts 12 Jan)
    • Intro to Big Data
    • Storage, compute, visualization, etc. platforms
    • Files vs. Overview of Relational Databases vs. NoSQL Databases:
    • Contrast Big Data systems: HBase/Big Table, Cassandra/Key-Value Store, Graph DB overview
    • Understand the role of distributed systems for data-parallel processing. Clusters, Cloud computing, Edge computing.
    • Understand distinction between weak and strong scaling.
    • Distributed File Systems/HDFS/GFS
    • Cloud storage
    • Reading
    • Quiz 1
  • Processing Large Volumes of Big Data (Starts 31 Jan)
  • NoSQL Databases (Starts 16 Feb)
  • Proposal of Project Topic and Team (Due 2 Mar)
  • Selection of Paper for Presentation (Due 9 Mar)
  • Processing Fast Data & Linked Data (Starts 9 Mar)
  • Machine Learning at Scale (Starts 28 Mar)
  • Research Reading and Presentations (11 and 13 Apr)
    • Presenting one research paper
    • Peer review
  • Guest Lectures
    • Talks by industry speakers throughout the semester
  • Final Exam (Week of 18 Mar)
  • Final Project Presentation (23 Apr)

Project Topics

Some sample projects are:

  • Federated learning using edge computing (NVIDIA Jetson) and cloud computing resources
  • Distributed edge (Raspberry Pi) and cloud storage and querying systems
  • Scalable querying over knowledge graphs
  • Scalable training and inferencing over graph neural networks
  • Scalable pattern mining and analysis over Twitter streams
  • Distributed video analytics over drone (Tello) video feeds

Papers for Presentation

Some papers to choose that can be used for presentation are given below. Students can also propose alternative papers and get them approved.

  • TBD

Grading

15% One programming assignment
35% Quizzes (5 x 7 points)
10% Paper Presentation
20% Final exam
20% Project

Teaching & Office Hours

  • Lecture: Mon and Wed, 330-5pm
    • First class on Wed 12 Jan at 330PM on Teams
    • Physical class room TBD
  • Office Hours: By appointment

Resources

  • Online Teams Channel
  • Cluster Access: Students will validate their assignments and projects on the CDS turing cluster, and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course and failure to follow these guidelines will lead to sanctions and penalties.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material is encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All works submitted by the student as part of their academic assessment must be their own.