Assignment 0: Processing the World Wide Web (or a fraction of it)
50 points (5% weightage). Due on 26 Jan, 2017.

In this assignment, you will get familiar with the Common Crawl (CC) dataset, which provides monthly snapshots of the entire WWW. You will perform simple statistical operations on this dataset using Apache Spark's Java interface, and you will learn to use the turing cluster in the process.

CC provides the content and metadata of each web crawl in the Web ARChive (WARC) format. These files contain HTTP metadata headers as well as the response content, which is often HTML. You will learn to parse the WARC files and the HTML using Java libraries, and to perform basic analytics on them using Spark Java. You may use existing parsers from the CC GitHub pages and other standard Java libraries, such as those from Apache, but you must perform all other tasks yourself.

Specifically, perform the following operations on the WARC files:
1) Frequency distribution of the number of URLs hosted on a single IP address (a starter sketch is given at the end of this handout).
2) Histogram of the 100 most frequent words, ignoring stop words.
3) Frequency distribution of the number of outgoing (non-distinct) links from each webpage of Content-Type: text/html.
4) URLs with the largest and smallest content. What is the average page length?
5) Time taken to perform any one of the above on 1%, 5%, 10%, 25%, and 100% of the data made available.

The data is hosted in HDFS under the location on turing. The instructions for running Spark Java on turing are at .

Submit your files as a single zipped folder that compiles with Maven. Include a PDF report summarizing the above results, with plots and a basic analysis of your observations. Both code and report will be evaluated.
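As a starting point, here is a minimal Spark Java sketch for task 1 (frequency distribution of URLs per IP address). It is a sketch, not a prescribed implementation: the helper loadUrlIpPairs is a placeholder for whichever WARC parser you adopt (e.g. from the CC GitHub pages), and you are expected to replace it with your own record-reading code. The WARC-Target-URI and WARC-IP-Address headers mentioned in its comments are standard WARC named fields, but how you access them depends on your parser.

    // Sketch: frequency distribution of the number of URLs hosted per IP address.
    // The WARC-reading step is left as a placeholder; plug in the parser you choose.
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class UrlsPerIp {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("UrlsPerIp");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // (url, ip) pairs extracted from WARC response records.
            JavaPairRDD<String, String> urlToIp = loadUrlIpPairs(sc, args[0]);

            // Count URLs per IP address.
            JavaPairRDD<String, Long> urlsPerIp =
                urlToIp.mapToPair(t -> new Tuple2<>(t._2, 1L))
                       .reduceByKey(Long::sum);

            // Build the frequency distribution: how many IPs host exactly k URLs.
            JavaPairRDD<Long, Long> distribution =
                urlsPerIp.mapToPair(t -> new Tuple2<>(t._2, 1L))
                         .reduceByKey(Long::sum)
                         .sortByKey();

            distribution.saveAsTextFile(args[1]);
            sc.stop();
        }

        // Placeholder helper: parse WARC records into (URL, IP) pairs, e.g. by
        // reading the WARC-Target-URI and WARC-IP-Address headers of each
        // response record with the WARC library you choose.
        private static JavaPairRDD<String, String> loadUrlIpPairs(
                JavaSparkContext sc, String inputPath) {
            throw new UnsupportedOperationException("plug in your WARC parser here");
        }
    }

The same skeleton (mapToPair followed by reduceByKey) extends naturally to the word-count and outgoing-link tasks; build it with Maven against the Spark version available on turing and run it with spark-submit.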