===========================================
DS221: INTRODUCTION TO SCALABLE SYSTEMS
===========================================
POSTED: 21 OCT, 2017
DUE DATE: 31 OCT, 2017, 11:59PM
POINTS: 50 points [+10 points extra credit]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You will be using US Federal Aviation Administration (FAA) [1] data on
flight travel to perform several simple analytics using Apache Spark.
The CSV dataset is available from [2], and is described in detail there
as well. The data from 1998-2008 (11 files) is already hosted under
hdfs:///user/simmhan/faa/ in the turing cluster.

Answer the following questions by writing a Spark Python script for each,
using Spark 2 over the RDD API [3]. Do NOT use dataframes, datasets or
other advanced Spark features. Most of each problem should be solved
using Spark APIs. Do NOT perform local processing within your Python
driver code unless the data size has been reduced to under
1000 items/1MB. Each problem should be a separate Python script file.
See the included sample file "a3_0.py".

1) Report statistics on the following [5x5=25 points]
   1_1) How many distinct carriers are there?
   1_2) How many flights are operated by each carrier annually?
   1_3) How many distinct airports are there?
   1_4) How many flights took off from and landed at each airport
        (airport, takeoffs, landings)?
   1_5) How many flights were diverted or cancelled on 11-Sep-2001?

2) Do the following analysis on flight delays. Consider any delay to be
   equally shared by both the starting and ending airports. [10 points]
   2_1) Which are the top-10 airports that had the highest average delay
        in a year?
   2_2) Has this delay changed between before 2001 and after 2001?
   2_3) What is the most frequent cause for the delay (Carrier, Weather,
        NAS, Security, LateAircraft)?

3) Do the following analysis on flight routes. [10 points]
   3_1) Which are the top 5 routes that have the longest average distance?
   3_2) Which are the top 5 routes where aircraft fly the fastest on
        average?

4) Do the following analysis on the flight network. [15 points]
   4_1) For any one year, generate network (graph) data of flight
        patterns. Vertices are airports. Label them with the airport
        code. Label the edges with the total number of flights between
        those airports and the average delay between them. Generate an
        edge list file for the network of the form:
        ,,,
   4_2) Convert this edge file to a ".dot" file that can be visualized
        using GraphViz. Include the figure in your report.

[1] https://www.faa.gov/data_research/aviation_data_statistics/
[2] http://stat-computing.org/dataexpo/2009/the-data.html
[3] http://spark.apache.org/docs/2.1.1/programming-guide.html

===========================================
IMPORTANT NOTE:

1) Do NOT save your code/scripts in the HDFS folder. Assume that folder
   is readable by everyone on the cluster. Keep your code only in your
   turing home directory, which is access controlled. If we find any code
   present in HDFS, you will get 0 points for that problem. Follow ethics
   at all times.

2) You may try your scripts on small dataset sizes using the PySpark
   shell on the head node. Do NOT run them on the whole dataset on the
   head node; instead, run them on the cluster's compute nodes using
   spark-submit. Those found abusing the cluster resources will have
   their jobs killed without warning.

3) Please use the existing source data in HDFS under
   hdfs:///user/simmhan/faa/ and do NOT create additional copies from the
   web in your HDFS folders. You may use one or more of these files to
   answer these questions (the more the better), but be courteous when
   using resources on the shared cluster.

===========================================
SUBMISSION INSTRUCTIONS

1) Provide a separate Python file for each problem, named "a3_*.py",
   where '*' is the problem number given above.
   Also submit a brief report ($username.pdf) that answers the questions
   in text and/or using simple plots based on the Spark scripts.

2) Tar/Gzip the 12 Python script files and the PDF file into a file named
   $username-a3.tar.gz, where $username is your login name in the turing
   cluster.

3) Email it to simmhan@iisc.ac.in before 11:59PM on Oct 31, 2017. The
   subject line should be "ds221-a3-$username". Only a single submission
   will be accepted, and late submissions will not be accepted.
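
===========================================
EXAMPLE SKETCH FOR PROBLEM 4_2

The conversion asked for in 4_2 can be done locally, since the edge list
for a single year is small. The following is only a minimal sketch, not a
reference solution. It assumes each edge-list line has four
comma-separated fields in the order source airport, destination airport,
flight count, average delay; that field order, the function name
edges_to_dot, the graph name and the sample values are all hypothetical
and chosen purely for illustration.

```python
import csv


def edges_to_dot(lines, graph_name="flights"):
    """Convert edge-list lines into the text of a GraphViz digraph.

    Assumed (hypothetical) line format: source,destination,flights,avg_delay
    """
    out = ["digraph %s {" % graph_name]
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        src, dst, flights, delay = (field.strip() for field in row)
        # Edge label carries both the flight count and the average delay.
        label = "%d flights, avg delay %.1f" % (int(flights), float(delay))
        out.append('  "%s" -> "%s" [label="%s"];' % (src, dst, label))
    out.append("}")
    return "\n".join(out) + "\n"


# Made-up sample edges, only to show the output shape.
sample = ["SFO,LAX,120,8.5", "LAX,JFK,45,15.0"]
dot_text = edges_to_dot(sample)
print(dot_text)
```

If dot_text is written to a file such as network.dot (a hypothetical
name), GraphViz can render it with "dot -Tpng network.dot -o network.png".
A directed graph is used here on the assumption that flights from A to B
and from B to A are counted as separate edges; use "graph" and "--"
instead if you treat routes as undirected.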