Assignment B. Deadline for submission of code and documentation is 5pm IST, Mar 30, 2016.
==============

Let’s say we want to replicate Google’s search engine from 10 years back using the Common Crawl dataset. There are two parts to it: find the web pages that match a given set of search terms, and rank those web pages and return the top 100 matches in rank order. We will use inverted indexes for the search and PageRank for ranking the results.

(a) Write a MapReduce job to build an inverted index based on the contents of the Common Crawl web pages. This will output a key-values file where the key is a search term and the list of values is the URLs of the pages that contain that term. You should omit stop words [1] from this inverted index so that commonly occurring words are not indexed. (Why?) This MapReduce job will be run once to build the index.

(b) Using the web graph you constructed for the Common Crawl data in Assignment A, write a MapReduce job to run the PageRank algorithm on this graph for 30 iterations. Emit each URL and its rank as output. This MapReduce job will be run once to build the web pages’ rankings.

(c) Then, given a set of search terms, use the inverted index to find *ALL* web pages (URLs) that contain *ALL* the given input search terms, again ignoring stop words. For all these URLs, look up their PageRank and identify the 100 pages with the highest PageRank. Write one or more MapReduce jobs (the fewer the better) to return the ranked list of 100 matching URLs for a given set of search terms. You’ll run this job (or these jobs) for every search you wish to perform.

(d) Train a classifier that, given a web page’s content (e.g., title, body text, etc.) from the Common Crawl dataset, identifies the country in which it is hosted. Please train the classifier in a distributed manner. Training and evaluation data for this classifier may be obtained by doing a reverse IP lookup on the IP address associated with each page.
Of course, this information will be hidden from the classifier at test time. Please report on data preparation, classifier training strategies, brief implementation details, and final evaluation accuracy.

[1] http://xpo6.com/list-of-english-stop-words/
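
To make the data flow in parts (a) to (c) concrete, here is a minimal single-machine sketch in Python. It is not a Hadoop job and not a reference solution: it only simulates the map, shuffle, and reduce steps in memory on toy data. All function names (`tokenize`, `index_map`, `index_reduce`, `build_index`, `pagerank`, `search`) are hypothetical, the stop-word set is a tiny stand-in for the list in [1], and the damping factor of 0.85 and the even redistribution of rank from pages with no outgoing links are assumptions that the assignment does not specify.

```python
import heapq
import re
from collections import defaultdict

# Assumption: a tiny stand-in for the full stop-word list referenced in [1].
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and return its terms, with stop words removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def index_map(url, text):
    """Map step for (a): emit one (term, url) pair per distinct term."""
    for term in set(tokenize(text)):
        yield term, url


def index_reduce(term, urls):
    """Reduce step for (a): collect the URLs that contain the term."""
    return term, sorted(set(urls))


def build_index(pages):
    """Simulate the shuffle in memory: group map output by term, then reduce."""
    grouped = defaultdict(list)
    for url, text in pages.items():
        for term, page_url in index_map(url, text):
            grouped[term].append(page_url)
    return dict(index_reduce(term, urls) for term, urls in grouped.items())


def pagerank(links, iterations=30, damping=0.85):
    """Part (b): iterate PageRank over links, a dict of url -> outgoing URLs."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        contrib = defaultdict(float)
        dangling = 0.0
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = rank[u] / len(outs)
                for v in outs:
                    contrib[v] += share
            else:
                # Assumption: rank of pages without out-links is spread evenly.
                dangling += rank[u]
        rank = {
            u: (1 - damping) / n + damping * (contrib[u] + dangling / n)
            for u in nodes
        }
    return rank


def search(index, ranks, query, k=100):
    """Part (c): URLs containing all query terms, top k by PageRank."""
    terms = set(tokenize(query))
    if not terms:
        return []
    matches = set.intersection(*(set(index.get(t, [])) for t in terms))
    scored = ((url, ranks.get(url, 0.0)) for url in matches)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

For example, `search(build_index(pages), pagerank(links), "hadoop mapreduce")` returns a list of `(url, rank)` pairs in descending rank order, where `pages` maps each URL to its text and `links` maps each URL to the URLs it links to. In a real MapReduce job the `grouped` dictionary would be replaced by the framework's shuffle, and each PageRank iteration would typically be its own job.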