
Project 2

Implementing Distributed MapReduce in Python

Due Wed Oct 21 at 11:59pm on GitHub.

Project demos on Thu Oct 22 and Tue Oct 27.

You must work in groups of two.

For this project you are going to implement a reliable, distributed MapReduce framework in Python using ZeroRPC and Gevent. You will develop both the distributed code and the tests that validate your implementation.
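
If you have not used ZeroRPC before, the round trip below shows the style of RPC the whole project builds on. This is a minimal sketch; the EchoService name and the port number are illustrative, not part of the assignment.

import gevent
import zerorpc

class EchoService(object):
    def echo(self, text):
        # Every public method of the exposed object becomes an RPC endpoint.
        return "echo: " + text

server = zerorpc.Server(EchoService())
server.bind("tcp://0.0.0.0:4242")
gevent.spawn(server.run)               # serve in a background greenlet

client = zerorpc.Client()
client.connect("tcp://127.0.0.1:4242")
print(client.echo("hello"))            # prints "echo: hello"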

Overview of Features

  • Fixed-size data (e.g., binary data, Hamming-encoded data)
  • Line-oriented data (e.g., variable-length lines)
  • A single input file or multiple input files with the same base filename
  • Configurable split size on MapReduce execution
  • A partition function for splitting Map output into R sets of data (see the sketch after this list)
  • A single Master
  • The master does not have to tolerate failure
  • Multiple workers (all will have access to the input file)
  • A worker can have one Map task and one Reduce task outstanding at a time
  • Support for local execution
  • You must be able to tolerate worker failure
  • You can assume that the Map results can fit into worker memory
  • You can embed the MapReduce code directly in your worker (you do not have to support code shipping)
  • You must demonstrate performance improvement over sequential MapReduce
  • You must demonstrate reliability and the ability to tolerate worker failure
  • You must be able to run your code on the Bass Cluster
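
For the partition function mentioned above, here is a minimal sketch. Note that Python 3 randomizes the built-in hash() per process, so the workers need a stable hash such as zlib.crc32 to agree on bucket assignments; the function names here are illustrative, not required.

import zlib

def partition(key, R):
    # zlib.crc32 instead of hash(): Python 3 randomizes hash() per
    # process, so two workers could disagree on a key's bucket.
    return zlib.crc32(key.encode("utf-8")) % R

def partition_output(pairs, R):
    # Split a list of (key, value) Map output pairs into R lists,
    # one per reducer.
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[partition(key, R)].append((key, value))
    return buckets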

Method of Usage

1. Start the master:

$ python mr_master.py <port> <data_dir>

The data_dir is the location in the file system where input files can be found and where resulting output files will be placed.
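
A bare-bones master might start out like the sketch below. The Master class and the register_worker method are assumptions about one possible design, not a required interface.

import sys
import zerorpc

class Master(object):
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.workers = []              # addresses of registered workers

    def register_worker(self, address):
        # Called remotely by each worker when it starts up.
        self.workers.append(address)
        return len(self.workers)

if __name__ == "__main__":
    port, data_dir = sys.argv[1], sys.argv[2]
    server = zerorpc.Server(Master(data_dir))
    server.bind("tcp://0.0.0.0:" + port)
    server.run()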

2. Start the workers (locally or remotely)

$ python mr_worker.py <ip_address_master:port> [<ip_address_worker:port>]

The workers will register with the master. 
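
Registration can be a single RPC from worker to master, as in this sketch (it assumes the hypothetical register_worker method from the master sketch above):

import sys
import zerorpc

master_addr = sys.argv[1]              # e.g. 127.0.0.1:4242
worker_addr = sys.argv[2] if len(sys.argv) > 2 else "127.0.0.1:4243"

master = zerorpc.Client()
master.connect("tcp://" + master_addr)
master.register_worker(worker_addr)    # announce ourselves to the master
# ...then bind a zerorpc.Server of our own and wait for task RPCs.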

3. Start a MapReduce job:

$ python mr_job.py <ip_address_master:port> [<name> | <mr_class.py>] <split_size> <num_reducers> [<input_filename> | <base_filename>_] <output_filename_base>

For example:

$ python mr_job.py <ip_address_master:port> wordcount 100000 4 book.txt count

This will result in running wordcount across all workers. There will be 4 output files total: count_00, count_01, count_02, count_03
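
One possible shape for an embedded MapReduce class such as wordcount is sketched below. The generator-style map/reduce interface is an assumption; the exact contract between your framework and the MapReduce code is up to your design.

class WordCount(object):
    def map(self, split):
        # split is one chunk of line-oriented input text.
        for word in split.split():
            yield (word, 1)

    def reduce(self, key, values):
        # key is a word; values are all the counts emitted for it.
        yield (key, sum(values))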

And for sequential execution (no master or workers):

$ python mr_seq.py [<name> | <mr_class.py>]  <split_size> <num_reducers> [<input_filename> | <base_filename>_] <output_filename_base>
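
Conceptually, the sequential path is the same pipeline with the RPC removed: map each split, partition, group by key, and reduce. A sketch, reusing the hypothetical WordCount class and partition_output helper from the earlier sketches:

def run_sequential(job, splits, R):
    # Map phase: run every split through map() and partition the output.
    buckets = [[] for _ in range(R)]
    for split in splits:
        for r, pairs in enumerate(partition_output(list(job.map(split)), R)):
            buckets[r].extend(pairs)
    # Reduce phase: group each bucket by key, then reduce in key order.
    outputs = []
    for bucket in buckets:
        grouped = {}
        for key, value in bucket:
            grouped.setdefault(key, []).append(value)
        outputs.append([pair for key in sorted(grouped)
                        for pair in job.reduce(key, grouped[key])])
    return outputs                     # one list of (key, value) pairs per reducer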

4. Collect results from workers:

$ python mr_collect.py <filename_base> <output_filename>

For example:

$ python mr_collect.py count count_all

This will collect the distributed results and put them into a file called count_all. You need to ensure that ordering in the output file is preserved. You may consider turning this into a MapReduce program with one reducer running on the local machine.
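
A straightforward mr_collect can simply concatenate the numbered partition files in order, which preserves the overall ordering. A sketch, assuming zero-padded suffixes like count_00 so that lexicographic order matches numeric order:

import glob
import sys

base, out = sys.argv[1], sys.argv[2]
with open(out, "w") as dst:
    for part in sorted(glob.glob(base + "_*")):   # count_00, count_01, ...
        with open(part) as src:
            dst.write(src.read())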

Testing 

You need to write and demonstrate the execution of the following MapReduce programs:
  • Word Count
  • Sorting
  • Binary Hamming Encode (a Hamming(7,4) sketch follows this list)
  • Binary Hamming Decode
  • Binary Hamming Check (Report a summary of errors)
  • Binary Hamming Fix
  • Binary Hamming Error (Introduce errors into the encoded file)
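
For the Hamming programs, the classic Hamming(7,4) code fits the fixed-size input format naturally: it encodes each 4-bit nibble into 7 bits and can correct any single-bit error. A minimal encoder sketch (how you pack the resulting bits into bytes is your design decision):

def hamming74_encode(nibble):
    # nibble holds four data bits [d1, d2, d3, d4].
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

assert hamming74_encode([1, 0, 1, 1]) == [0, 1, 1, 0, 0, 1, 1]
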
You also need to write automated benchmarking and testing code:
  • Performance testing (show performance improvement over sequential MapReduce)
    • You will need to find the right test and data sizes
  • Reliability testing (this can be done locally)
    • You should be able to kill workers during execution, and the program should still complete successfully (even with just one remaining worker); see the test-harness sketch after this list.
    • You need to consider different points of failure: mapping, shuffling, reducing.
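
A local reliability test can be scripted entirely with subprocess, as in the sketch below. The file names, ports, and sleep duration are placeholders for your own scripts; it assumes a master is already listening on 127.0.0.1:4242.

import random
import subprocess
import time

workers = [subprocess.Popen(["python", "mr_worker.py", "127.0.0.1:4242",
                             "127.0.0.1:%d" % (5000 + i)])
           for i in range(4)]

job = subprocess.Popen(["python", "mr_job.py", "127.0.0.1:4242",
                        "wordcount", "100000", "4", "book.txt", "count"])

time.sleep(2)                          # let the job get under way
random.choice(workers).kill()          # simulate a worker failure mid-run

assert job.wait() == 0                 # the job must still complete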

Deliverables

  • Design and Experimental Results Report (PDF)
  • Working code
  • README with usage instructions
You will be graded on the completeness of your solution and your ability to demonstrate the required functionality.

Extra Credit

You can receive extra credit for implementing the following features:
  • Integrate with HDFS. That is, use HDFS for input and output data instead of the local filesystem.
  • Support code shipping. That is, send MapReduce code directly to the workers for execution rather than having the code embedded in the workers.
  • Support MapReduce counters. Implement and provide test code.
  • Support backup tasks toward the end of the MapReduce computation.
  • Store Map output in files rather than in memory to support larger file sizes. Demonstrate how the in-memory approach fails and disk storage succeeds.
You can also propose your own extra credit.

Points will be determined by the completeness and quality of the extra credit feature. Please get all the standard features working before attempting the extra credit.
