Department of Computational and Data Sciences

Scalable Systems for Data Science

Instructors: Yogesh Simmhan (email)
TA: Sheshadri KR (email) and Shriram R (email)
Course number: DS256
Credits: 3:1
Semester: Jan, 2020
Lecture: Tue/Thu 2-3:30PM
Room: CDS 202
Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required, preferably in Java.
First class on Tue, Jan 7, 2020 at 2pm at CDS20
Register for Course Online using IISc AcadServer
See 2017, 2018 and 2019 webpages

Overview

This course will teach the fundamental Systems aspects of designing and using Big Data platforms, which are a specialization of scalable systems for data science applications. This course will address three facets of these platforms.

The design of distributed programming models and abstractions, such as MapReduce, Dataflow and Vertex-centric models, for processing volume, velocity and linked datasets, and for storing and querying over NoSQL datasets.
The approaches and design patterns to translate existing data-intensive algorithms and analytics into these distributed programming abstractions.
Distributed software architectures, runtime and storage strategies used by Big Data platforms such as Apache Hadoop, Spark, Storm, Giraph and Hive to execute applications developed using these models on commodity clusters and Clouds in a scalable manner.
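As a rough illustration of the MapReduce abstraction named above, the sketch below counts words using the three phases of the model (map, shuffle, reduce) in plain single-machine Java. It does not use the Hadoop API; the class name WordCountSketch and its method names are hypothetical, chosen only for this example. A real platform would run the map and reduce calls in parallel across a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-machine sketch of the MapReduce phases (not the Hadoop API).
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {applications=1, big=2, data=2, platforms=1}
        System.out.println(wordCount(List.of("big data platforms", "big data applications")));
    }
}
```

Keeping map and reduce free of shared state is what lets a runtime distribute them; only the shuffle step needs to move data between workers.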

It will cover questions such as: Why are Big Data platforms necessary? How are they designed? What programming abstractions (e.g. MapReduce) are used to compose data science applications? How are the programming models translated to scalable runtime execution on clusters and Clouds (e.g. Hadoop)? How do you design algorithms for analyzing large datasets? How do you map them to Big Data platforms? And how can these be used to develop Big Data applications in an integrated manner?

Several of the lectures will adopt a “flipped classroom” model, with students going over lectures and topics before the class and using the classroom time to perform short in-class assignments that apply concepts from the lecture topic.

There will also be a hands-on Project in which students will work with real, large datasets and commodity clusters, and use scalable algorithms and platforms to develop a Big Data application. The emphasis will be on designing applications that show good “weak scaling” as the size, speed or complexity of data increases, and on using distributed systems such as commodity clusters and Clouds.
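One common way to quantify “weak scaling” is to grow the data in proportion to the resources and compare the runtime against a baseline run; ideally the runtime stays flat. The sketch below computes that ratio. The class name WeakScalingSketch and the timings in main are hypothetical values for illustration only, not measurements from the course cluster.

```java
// Hypothetical sketch: weak-scaling efficiency from measured runtimes.
final class WeakScalingSketch {

    // Efficiency = time for 1 unit of data on 1 unit of resources,
    // divided by time for n units of data on n units of resources.
    // A value of 1.0 is ideal; lower values indicate scaling overheads.
    static double efficiency(double baselineSeconds, double scaledSeconds) {
        if (baselineSeconds <= 0 || scaledSeconds <= 0) {
            throw new IllegalArgumentException("runtimes must be positive");
        }
        return baselineSeconds / scaledSeconds;
    }

    public static void main(String[] args) {
        // Assumed example: data size doubles whenever the machine count doubles.
        int[] machines = {1, 2, 4, 8};
        double[] seconds = {100.0, 104.0, 110.0, 125.0};
        for (int i = 0; i < machines.length; i++) {
            System.out.printf("machines=%d efficiency=%.2f%n",
                    machines[i], efficiency(seconds[0], seconds[i]));
        }
    }
}
```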

Several key and contemporary research papers will be discussed as part of the course, with group discussions and self-evaluations by students in the class.

There will also be several guest lectures by experts from the Industry who work on Big Data platforms, Cloud computing and data science.

This course extends the systems basics introduced in the DS 221: Introduction to Scalable Systems course at CDS, and is complementary to the DS 222: Machine Learning with Large Datasets course offered in the Aug term. This course also complements other breadth courses on data science like E0 229: Foundations of Data Science and E0 259: Data Analytics.

Intended Learning Objectives

At the end of the course, students will have learned about the following concepts.

Types of Big Data, Design goals of Big Data platforms, and where in the systems landscape these platforms fall.
Distributed programming models for Big Data, including MapReduce, Stream processing and Graph processing.
Design of and development on Big Data platforms and their optimizations on commodity clusters and Clouds.
Scaling Data Science algorithms and analytics using Big Data platforms.

Pre-requisites

This is an introductory course on platforms and tools required to develop analytics over Big Data. However, it builds upon prior knowledge that students have on computing and software systems, programming, data structures and algorithms. Students must be familiar with Data Structures (e.g. Arrays, Queues, Trees, Hashmaps, Graphs) and Algorithms (e.g. Sorting, Searching, Graph traversal, String algorithms, etc.).

It is recommended that students have good programming skills (preferably in Java), which are necessary for the programming assignments and projects. Familiarity with one or more of the following courses will also be helpful (although not mandatory): DS 292 (HPC), DS 295 (Parallel Programming), E0 253 (Operating Systems), E0 264 (Distributed Computing Systems), DS 252 (Introduction to Cloud Computing), E0 225 (Design and Analysis of Algorithms), E0 232 (Probability and Statistics), E0 259 (Data Analytics).

Assessment

The total assessment score for the course is based on a 1000-point scale. Of this, the weightage of different activities will tentatively be as follows:

30% Assignments	Several in-class assignments and activities, individually or in teams
40% Project	Proposal, Midterm and final project, to be done individually or in teams (50+100+250 points)
10% Research reading	Leading the discussion of a paper, and providing critical feedback on a presenter
20% Exam	Final exam (200 points)

Academic Integrity

Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course, and failure to follow these guidelines will lead to sanctions and penalties. These include a reduced or failing grade in the course; recurrent academic violations will be reported to the Institute and may lead to expulsion.

Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by students as part of their academic assessment must be their own.

Plagiarism: Verbatim reproduction of material from external sources (web pages, books, papers, etc.) is not acceptable. If you are paraphrasing external content (or even your own prior work) or were otherwise influenced by them while completing your assignments, projects or exams, you must clearly acknowledge them. When in doubt, add a citation!
Cheating: While you may discuss lecture topics and broad outlines of homework problems and projects with others, you cannot collaborate in completing the assignments, copy someone else’s solution or falsify results. You cannot use notes or unauthorized resources during exams, or copy from others. The narrow exception to collaboration is between team-mates when completing the project, and even there, the contribution of each team member for each project assignment should be clearly documented.
Classroom Behavior: Ensure that the course atmosphere, in the class, outside it and on the online forum, is conducive to learning. Participate in discussions but do not dominate or be abusive. There are no “stupid” questions. Be considerate of your fellow students and avoid disruptive behavior.

Resources

Select literature reading
Select chapters from:
- Mastering Apache Spark 2 (Spark 2.2+), Jacek Laskowski
- Spark Internals, Jerry Lead
- Mastering Apache Storm, Ankit Jain, Packtpub, 2017
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, 1st Edition, Morgan & Claypool Publishers, 2010
- Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman, 2nd Edition (v2.1), 2014.
Online documentation
Online Forum: ds256-2020-jan@iisc.ac.in group on O365
Cluster Access: Students will validate their assignments and projects on the CDS turing cluster, and Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
- Spark on turing
- Storm on turing
- Hadoop on turing

Teaching & Office Hours

Lecture: Tue/Thu 2-3:30PM, CDS 202 (Yogesh)
Office Hours: By appointment

Tentative Schedule

First Class on Jan 7, 2020
Processing large data volumes (~10 lectures)
- GFS/HDFS
- Hadoop MapReduce
- Spark, YARN
- Class programming activities, discussions, literature reading… (15%)
Project Proposal (5%)
Consistency & CAP Theorem (~3 lectures)
- Consistency models
- CAP, ACID, BASE
- DynamoDB, Cassandra, ElfStore
- Class programming activities, discussions, literature reading… (5%)
Processing linked data & fast data (~4 lectures)
- Pregel, Giraph, Graphite
- Storm
- Class programming activities, discussions, literature reading… (10%)
Midterm Project Review (10%)
Research literature discussions (~4 lectures)
- Literature reading
- Critical review and peer evaluation (10%)
Guest lectures (~4 lectures)
Final exam (20%)
Final Project Review (25%)

Activities (tentative)

Classroom Discussion Topics [Impact of Big Data]
- Is Big Data Good or Bad for Science and Society? [7-Jan]
  - Data driven science vs. Incomplete science
  - News at your fingertips vs. Fake news
  - Insightful recommendations vs. Targeted ads that follow you
- Can we preserve Privacy in the era of Big Data and AI? [9-Jan]
  - Free internet services vs. Knowledge of personal details by Industry
  - Right to Individual Privacy vs. Rights of Government to Ensure Security
  - Transparency in politics vs. Effect on Elections
  - Pervasive tracking through mobile phones and video cameras
  - Reading Material
    - “Will Democracy Survive Big Data and Artificial Intelligence?” Dirk Helbing, Bruno S. Frey, Gerd Gigerenzer, Ernst Hafen, Michael Hagner, Yvonne Hofstetter, Jeroen van den Hoven, Roberto V. Zicari, Andrej Zwitter, Scientific American, February 25, 2017, https://www.scientificamerican.com/article/will-democracy-survive-big-data-and-artificial-intelligence/
    - Mai, Jens-Erik. “Big data privacy: The datafication of personal information.” The Information Society 32.3 (2016): 192-199. jenserikmai.info/Papers/2016_BigDataPrivacy.pdf
    - “Big Data, Data Science, and Civil Rights”, Solon Barocas, Elizabeth Bradley, Vasant Honavar, and Foster Provost, A Computing Community Consortium (CCC) white paper, https://arxiv.org/pdf/1706.03102
    - Calude, Cristian S., and Giuseppe Longo. “The deluge of spurious correlations in big data.” Foundations of science 22, no. 3 (2017): 595-612. https://hal.archives-ouvertes.fr/hal-01380626/file/BigData-Calude-LongoAug21.pdf
- Is access to the Internet a fundamental right?
  - Costs of access, Free Basics, Net neutrality
  - Internet freedom and Censorship
  - Disability and Accessibility
  - Reading Material
    - Indian Journal of Law and Technology, http://ijlt.in/index.php/ijlt-blog/
    - Centre for Internet and Society (CIS), https://cis-india.org/
    - Software Freedom Law Center, https://sflc.in/
    - Living in Digital Darkness: A Handbook on Internet Shutdowns in India, SFLC, 2018
    - The Internet Society, https://www.internetsociety.org
Classroom Discussion Topics [Technical]
- Hadoop vs. Spark
- ACID vs. BASE
- Edge vs. Cloud
Classroom Programming Activities
- Managing and processing Big Data
- Replication in HDFS
- Spark programming
- Shuffle phase
- Scheduling in YARN
- Hands on with consistency
- Giraph programming

Papers

TBD

Project Topics

TBD

Public Datasets

Microsoft Data Science for Research: Dataset directory
…
