Department of Computational and Data Sciences
Scalable Systems for Data Science
- Instructors: Yogesh Simmhan (email)
- Course number: DS256
- Credits: 3:1
- Semester: Jan, 2017
- Lecture: Tue/Thu 3:30-5:00 PM
- Room: CDS 202
- Pre-requisites: Data Structures, Programming and Algorithm concepts. Programming experience required, preferably in Java.
- First class on Thu, Jan 5 at 3:30 PM in CDS 202
- Register for Course Online using IISc CourseReg
This course will teach the fundamental Systems aspects of designing and using Big Data platforms, which are a specialization of scalable systems for data science applications. This course will address three facets of these platforms.
- The design of distributed program models and abstractions, such as MapReduce, Dataflow and Vertex-centric models, for processing volume, velocity and linked datasets, and for storing and querying over NoSQL datasets.
- The approaches and design patterns to translate existing data-intensive algorithms and analytics into these distributed programming abstractions.
- Distributed software architectures, runtime and storage strategies used by Big Data platforms such as Apache Hadoop, Spark, Storm, Giraph and Hive to execute applications developed using these models on commodity clusters and Clouds in a scalable manner.
It will cover topics on: why Big Data platforms are necessary, how they are designed, what programming abstractions (e.g. MapReduce) are used to compose data science applications, how the programming models are translated to scalable runtime execution on clusters and Clouds (e.g. Hadoop), how to design algorithms for analyzing large datasets, how to map them to Big Data platforms, and how these can be used to develop Big Data applications in an integrated manner.
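As a flavor of one such abstraction, the MapReduce model expresses a computation as a map over input records followed by a reduce over grouped keys, with the framework handling the shuffle in between. Below is a minimal, framework-free sketch of word count in Python; the course assignments themselves use Java and Apache Hadoop, so this is only a conceptual illustration of the model, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["big data systems", "big data platforms"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {'big': 2, 'data': 2, 'systems': 1, 'platforms': 1}
```

In Hadoop, the map and reduce functions run in parallel across machines, and the shuffle moves data over the network; the sequential pipeline above only captures the programming contract.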
As part of a hands-on Project in this course, students will work with real, large datasets and commodity clusters, and use scalable algorithms and platforms to develop a Big Data application. The emphasis will be on designing applications that show good “weak scaling” as the size, speed or complexity of data increases, and using distributed systems such as commodity clusters and Clouds.
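For reference, "weak scaling" measures how well runtime holds steady as the workload and the resources grow in proportion. A small sketch of the metric, with purely illustrative timings (the numbers below are made up, not measurements):

```python
def weak_scaling_efficiency(t_1, t_n):
    """Weak-scaling efficiency: time for 1 unit of work on 1 worker,
    divided by time for N units of work on N workers.
    An ideal system keeps this at 1.0 as N grows."""
    return t_1 / t_n

# Hypothetical timings: 1 GB of tweets on 1 node vs. 8 GB on 8 nodes.
t_1_node = 100.0   # seconds, 1 unit of work on 1 worker
t_8_nodes = 125.0  # seconds, 8 units of work on 8 workers
eff = weak_scaling_efficiency(t_1_node, t_8_nodes)
# eff == 0.8, i.e. 80% weak-scaling efficiency
```

A project that maintains high weak-scaling efficiency as the dataset grows demonstrates that its design, not just its hardware, is scalable.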
Besides class lectures, there will be several guest lectures by experts from the Industry who work on Big Data platforms, Cloud computing and data science.
This is a revised format of the previous year’s course, SE 256 (Jan 2016). This course extends from the systems basics introduced in the DS 286: DSP and DS 292: HPC courses at CDS. It is complementary to the proposed DS 222: Machine Learning with Large Datasets, likely to be offered in Aug, 2017, which will teach methods and techniques to design scalable algorithms to analyze large datasets. These algorithms in turn can be translated by students of this DS 256 course into programming patterns and scalable applications using Big Data platforms. This course also complements other breadth courses on data science like E0 229: Foundations of Data Science and E0 259: Data Analytics.
Intended Learning Objectives
At the end of the course, students will have learned about the following concepts.
- Types of Big Data, Design goals of Big Data platforms, and where in the systems landscape these platforms fall.
- Distributed programming models for Big Data, including Map Reduce, Stream processing and Graph processing.
- Runtime Systems for Big Data platforms and their optimizations on commodity clusters and Clouds.
- Scaling data science algorithms and analytics using Big Data platforms.
This is an introductory course on platforms and tools required to develop analytics over Big Data. However, it builds upon prior knowledge that students have on computing and software systems, programming, data structures and algorithms. Students must be familiar with Data Structures (e.g. Arrays, Queues, Trees, Hashmaps, Graphs) and Algorithms (e.g. Sorting, Searching, Graph traversal, String algorithms, etc.).
It is recommended that students have good programming skills (preferably in Java) which is necessary for the programming assignments and projects. Familiarity with one or more of the following courses will also be helpful (although not mandatory): DS 292 (HPC), DS 295 (Parallel Programming), E0 253 (Operating Systems), E0 264 (Distributed Computing Systems), SE252 (Introduction to Cloud Computing), E0 225 (Design and Analysis of Algorithms), E0 232 (Probability and Statistics), E0 259 (Data Analytics).
The total assessment score for the course is based on a 1000-point scale. Of this, the weightage given to different activities will be as follows:

| Weightage | Activity | Details |
|---|---|---|
| 45% | Homework | Three programming assignments (150 points each) |
| 30% | Project | One final project, done individually or in teams (300 points) |
| 20% | Exams | One final exam (200 points) |
| 5% | Participation | Participation (i.e. not just “attendance”) in classroom discussions and the online course forum (50 points) |
Students must uphold IISc’s Academic Integrity guidelines. We have a zero-tolerance policy for cheating and unethical behavior in this course, and failure to follow these guidelines will lead to sanctions and penalties, including a reduced or failing grade in the course. Recurrent academic violations will be reported to the Institute and may lead to expulsion.
Learning takes place both within and outside the class. Hence, discussions between students and reference to online material are encouraged as part of the course to achieve the intended learning objectives. However, while you may learn from any valid source, you must form your own ideas and complete problems and assignments by yourself. All work submitted by a student as part of their academic assessment must be their own.
- Verbatim reproduction of material from external sources (web pages, books, papers, etc.) is not acceptable. If you are paraphrasing external content (or even your own prior work) or were otherwise influenced by them while completing your assignments, projects or exams, you must clearly acknowledge them. When in doubt, add a citation!
- While you may discuss lecture topics and broad outlines of homework problems and projects with others, you cannot collaborate in completing the assignments, copy someone else’s solution or falsify results. You cannot use notes or unauthorized resources during exams, or copy from others. The narrow exception to collaboration is between team-mates when completing the project, and even there, the contribution of each team member for each project assignment should be clearly documented.
Classroom Behavior
- Ensure that the course atmosphere, in class, outside it, and on the online forum, is conducive to learning. Participate in discussions, but do not dominate them or be abusive. There are no “stupid” questions. Be considerate of your fellow students and avoid disruptive behavior.
- Select chapters from Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, 1st Edition, Morgan & Claypool Publishers, 2010
- Select chapters from Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman, 2nd Edition (v2.1), 2014.
- Current literature and online documentation
- Online Forum: firstname.lastname@example.org | Mailman Info Webpage
- Cluster Access: Students will validate their assignments and projects on the CDS turing cluster and on Cloud resources. Details for accessing the cluster and running programs on it will be covered in a lab session.
Teaching & Office Hours
- Lecture: Tue/Thu 3:30-5:00 PM, CDS 202 (Yogesh)
- Office Hours: By appointment
- First Class on Jan 5
- Project Demos on Sat, Apr 29, 2017, 2PM-5PM
- Final exam on Fri, Apr 28, 2017, 2-5PM
| # | Topic | Slides |
|---|---|---|
| 1 | Introduction to Course: Data; Platforms; Applications; Big Data Stacks | |
| 2 | MapReduce/Hadoop: Programming Model, Basic Algorithms | Slides |
|   | MapReduce/Hadoop: HDFS, YARN | Slides |
|   | MapReduce/Hadoop: Advanced Algorithms | Slides |
| 3 | Distributed Stream Processing: Concepts | Slides |
|   | Distributed Stream Processing: Apache Storm | Slides |
|   | Distributed Stream Processing: Storm Tutorial | Slides |
|   | Distributed Graph Processing: Pregel, Giraph, GoFFish | Slides |
| 4 | Apache Spark: RDD Data Model | Slides |
|   | Apache Spark: Dataflow Language | Slides |
|   | Apache Spark: Execution Model | Slides |
| 5 | NoSQL Databases and CAP Theorem | Slides |
| 7 | Guest Lecture: Microsoft Research Academic Summit, Data Science Track | |
|   | Guest Lecture: Machine Learning using Azure, Manish Gupta, Microsoft | |
|   | Guest Lecture: Apache Hadoop and YARN: Past, Present and Future, Varun Vasudev, Hortonworks | |
|   | Guest Lecture: Kubernetes, Irfan UR Rehman, Huawei | |
|   | Guest Lecture: TensorFlow, Narayan Hegde, Google | Slides |
About the dataset: This year’s course will use an archive of tweets from Twitter for various assignments. …
- Big Data: Using the Apache Hadoop platform and the MapReduce programming model to process tweets and extract information and structure from them
- Assignment A (PDF) [Posted on Jan 25, Due on Feb 12]
- Fast Data: Using Apache Storm platform and continuous dataflow model for streaming analytics and real-time mining over tweet streams
- Assignment B (PDF) [Posted on Feb 22, Due on Mar 10]
- Linked Data: Using Apache Giraph platform and Pregel programming model for network analytics over large graph datasets
- Assignment C (PDF) [Posted on Mar 16, Due on Mar 31]
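The vertex-centric model used in Assignment C can be pictured as vertices exchanging messages in synchronized supersteps until every vertex votes to halt. Below is a toy, framework-free Python sketch of max-value propagation in that style; Giraph programs are written in Java against the Pregel API, and the graph here is purely illustrative.

```python
def propagate_max(adjacency, values):
    """Pregel-style supersteps: each active vertex sends its value to its
    neighbors; a vertex reactivates only if an incoming message raises its
    own value. Terminates when no vertex is active (all vote to halt)."""
    values = dict(values)
    active = set(adjacency)  # superstep 0: every vertex is active
    while active:
        # message-sending phase of the superstep
        inbox = {v: [] for v in adjacency}
        for v in active:
            for nbr in adjacency[v]:
                inbox[nbr].append(values[v])
        # compute phase: update state, decide who stays active
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = propagate_max(graph, {"a": 3, "b": 1, "c": 7})
# result == {"a": 7, "b": 7, "c": 7}
```

The barrier between supersteps is what lets the runtime distribute vertices across machines while keeping the programmer's view sequential per vertex.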
Project topics have to be decided and mutually agreed upon by Tuesday, March 21. The project report and code are due by Apr 28, midnight. This gives you 5 weeks from the project proposal. The project demo and presentation are on Apr 29. The project carries 300 points weightage (30%). You may work individually or in teams of up to 2 (but double the effort is expected from 2-person teams).
You can propose your own topic. Topics come in two flavors:
- Application focus: Non-trivial and realistic algorithms, real/realistic large/fast/complex dataset, implement using Hadoop/Yarn/Storm/Giraph/GoFFish and validate
- Platform focus: Non-trivially extend the capability of one of the Big Data platforms (Hadoop/Yarn/Storm/Giraph/GoFFish), and validate it.
Some sample topics include:
- Temporal graph algorithms over graph snapshots
- Machine Learning algorithms using subgraph-centric models
- Real-time stream processing and analytics over IoT or Twitter datasets
- Transformations from batch to network datasets using Spark over Common Crawl data of the entire WWW over many months.
- Better scheduling algorithm for Yarn/Storm/Giraph/GoFFish
You must send me a 1-page project proposal by Mar 21 that sets out the technical problem you plan to solve, why it is interesting and/or novel, which Big Data platforms you will use, which datasets you will use to test scalability, where you will run these experiments, and what the success criteria are.
Azure Pass: Microsoft is providing students of the course with complimentary credits on their Microsoft Azure Cloud. This is particularly useful for running your own platform-as-a-service, such as Hadoop, Storm, Spark, Hive, etc., using their HDInsight services when performing your project. Azure Pass details will be emailed to each student in the course.
- Microsoft Data Science for Research: Dataset directory