== README ==

0) Edit pom.xml and set your username (replace every occurrence of 'hobbes' with your own username)
1) Run 'mvn initialize' the first time you compile this project on a machine. This downloads any missing jars and installs the local jars from the lib/ folder into your .m2 repository, where later builds will pick them up.
2) Run 'mvn clean compile package' the first time and again whenever you change the code. The first run may take a while as it downloads missing jars. This command generates a jar file with dependencies in the target/ folder.
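If you prefer not to hand-edit the file, the substitution can be scripted. In this sketch 'alice' is a stand-in for your own username, and the fallback sample pom.xml is only there so the command can be tried outside the project directory; it assumes 'hobbes' appears only where the username is expected:

```shell
# Replace the placeholder username 'hobbes' in pom.xml, keeping a .bak backup.
# 'alice' is a stand-in for your real username; run from the project root.
# The first line fabricates a one-line sample pom.xml if none exists (demo only).
[ -f pom.xml ] || printf '<artifactId>se256-alpha-hobbes</artifactId>\n' > pom.xml
sed -i.bak 's/hobbes/alice/g' pom.xml
grep 'alice' pom.xml   # confirm the replacement took effect
```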
3) scp (secure copy) the dependencies jar file to the cluster.
4) Run the sample CC job as a test with default input parameters specified in the code.
hadoop jar /home/username/se256-alpha-username-0.1-jar-with-dependencies.jar in.ac.iisc.cds.se256.alpha.cc.CCAnalyticsJob &> /tmp/username_cc_1

You can tail the log file and also check the application's status with the YARN scheduler:

tail -f /tmp/username_cc_1
yarn application -status application_nnnnnnnnnnnnn_nnnn
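Rather than copying the application id out of the log by hand, you can extract it with grep. The sample log line below is fabricated so the pipeline can be tried without a live job; on the cluster the id comes from the real log written in step 4:

```shell
# Extract the YARN application id from the job log written in step 4.
LOG=/tmp/username_cc_1
if [ ! -f "$LOG" ]; then      # demo fallback: fabricate a sample log line
  LOG=$(mktemp)
  printf 'INFO impl.YarnClientImpl: Submitted application application_1700000000000_0042\n' > "$LOG"
fi
APP_ID=$(grep -o 'application_[0-9]*_[0-9]*' "$LOG" | head -n 1)
echo "$APP_ID"
yarn application -status "$APP_ID" 2>/dev/null || true   # only works on the cluster
```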


5) Look at the output files generated in hdfs under your home directory's alpha/cc/output. You should see output in the part files similar to the following:

hadoop fs -cat alpha/cc/output/*
HTML_SIZE_BYTES         4128767455
ANYFILE_SIZE_BYTES      4301533647
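The counters are raw byte counts; a quick awk pass makes them easier to eyeball. Here the expected output is pasted inline via printf, standing in for the hadoop fs -cat command above:

```shell
# Convert the byte counters to GB for a quick sanity check.
printf 'HTML_SIZE_BYTES\t4128767455\nANYFILE_SIZE_BYTES\t4301533647\n' |
awk '{printf "%s %.2f GB\n", $1, $2 / 1e9}'
# prints:
# HTML_SIZE_BYTES 4.13 GB
# ANYFILE_SIZE_BYTES 4.30 GB
```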

6) After the initial validation test, you MUST run hadoop jobs through the PBS scheduler script ONLY!

7) You can pass input parameters to the hadoop job (see the CC and Twitter jobs' source code), as in the example below, where we pass the input files (as a wildcard), the output file location, the minimum number of mappers (4), the number of reducers (2), and their task memory allocations (4GB and 8GB respectively).

hadoop jar /home/username/se256-alpha-username-0.1-jar-with-dependencies.jar in.ac.iisc.cds.se256.alpha.cc.CCAnalyticsJob  '/SE256/CC/*-0003?-*.gz' 'alpha/cc/output_2' 4 2 4000 8000 &>/tmp/username_cc_2
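Note the single quotes around the input wildcard: they stop your local shell from expanding the glob, so the pattern reaches the job (and HDFS) literally. A quick demonstration, assuming no local files happen to match the pattern:

```shell
# Quoted: the glob is passed through untouched for the job/HDFS to expand.
echo '/SE256/CC/*-0003?-*.gz'
# prints: /SE256/CC/*-0003?-*.gz
# Unquoted, the local shell would try to expand the pattern against the
# local filesystem first, which is not what you want here.
```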

