Simple Distributed File Indexer (Python)
A multi-process, command-line indexer application that finds the top 10 words across a collection of documents.
Have a fixed number (N) of worker processes (say, N=3) that handle text indexing. Workers should be able to run on separate machines from each other.
When a worker process receives a text blob to process, it tokenizes it into words. Words are delimited by any character other than A-Z or 0-9.
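The tokenization rule above can be sketched with a regular expression. This is a minimal illustration, not the project's actual code: the `tokenize` name is hypothetical, and lowercasing before splitting is an assumption made here to satisfy the case-insensitive matching requirement.

```python
import re

# Assumption: text is lowercased first, so the delimiter class only needs a-z and 0-9.
_DELIMITERS = re.compile(r"[^a-z0-9]+")


def tokenize(text):
    """Split text into lowercase words on any run of non-alphanumeric characters."""
    # re.split can yield empty strings at the edges, so filter them out.
    return [word for word in _DELIMITERS.split(text.lower()) if word]


print(tokenize("It was the best of times, it was the WORST of times!"))
```

Note that under this rule an apostrophe is a delimiter too, so "don't" becomes the two words "don" and "t".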
A master collection, shared between all workers, keeps track of all unique words encountered and the number of times each was encountered. Each time a word is encountered, the count for that word is incremented (the word is added to the collection if not present). Words should be matched in a case-insensitive manner and without any punctuation.
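One way to maintain such a collection is sketched below with `collections.Counter`. It assumes each worker returns a partial count that is merged into the master, which is a simplification of a truly shared collection; `count_words` and `merge_counts` are hypothetical names, not functions from this project.

```python
from collections import Counter


def count_words(words):
    """Count occurrences of each word in one worker's token list."""
    return Counter(words)


def merge_counts(master, partial):
    """Add a worker's partial counts into the master collection in place."""
    # Counter.update adds counts; unseen words are inserted automatically.
    master.update(partial)
    return master


master = Counter()
merge_counts(master, count_words(["it", "was", "the", "best"]))
merge_counts(master, count_words(["it", "was", "the", "worst"]))
print(master["it"])
```

Merging partial counts keeps network traffic low, since each worker sends one message per text blob rather than one per word.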
The application should output the top 10 words (and their counts) to standard out.
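Selecting and printing the top 10 could look like the sketch below. The `top_words` helper and the `word: count` output format are assumptions for illustration; the source does not specify a format.

```python
from collections import Counter


def top_words(counts, limit=10):
    """Return the `limit` most frequent (word, count) pairs, highest count first."""
    return Counter(counts).most_common(limit)


# Hypothetical counts, used only to demonstrate the call.
sample_counts = {"the": 12, "of": 7, "it": 5, "was": 4}
for word, count in top_words(sample_counts):
    print("%s: %d" % (word, count))
```

The order of words with equal counts is not defined by the requirements above, so this sketch leaves ties in whatever order `most_common` returns them.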
Install ZeroMQ and its Python bindings (pyzmq) first:
$tar -xzvf zeromq-4.0.4.tar.gz
$cd zeromq-4.0.4
$./configure
$make && sudo make install
$sudo pip install pyzmq
Clone this project and cd into the cloned directory.
You will need Python 2.7 installed to run this app (Python 3 should work but has not been tested):
$python main.py [-N <number of workers>] -W <worker_1 ip:port> <worker_2 ip:port> <worker_3 ip:port> -F <file_1_path> <file_2_path>
$python main.py -W 127.0.0.1:5002 127.0.0.1:5004 127.0.0.1:5006 -F test_files/sample.txt test_files/TaleOfTwoCities.txt
If -N is omitted, the number of workers defaults to 3.
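For reference, the command line shown above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of main.py: the `build_parser` and `parse_endpoint` helpers and the `dest` names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser matching the usage line: [-N n] -W ip:port ... -F path ..."""
    parser = argparse.ArgumentParser(description="Simple distributed file indexer")
    parser.add_argument("-N", dest="num_workers", type=int, default=3,
                        help="number of workers (defaults to 3)")
    parser.add_argument("-W", dest="workers", nargs="+", required=True,
                        metavar="IP:PORT", help="worker addresses")
    parser.add_argument("-F", dest="files", nargs="+", required=True,
                        metavar="PATH", help="files to index")
    return parser


def parse_endpoint(endpoint):
    """Split an 'ip:port' string into a (host, port) tuple."""
    host, port = endpoint.rsplit(":", 1)
    return host, int(port)


args = build_parser().parse_args(
    ["-W", "127.0.0.1:5002", "127.0.0.1:5004", "127.0.0.1:5006",
     "-F", "test_files/sample.txt", "test_files/TaleOfTwoCities.txt"]
)
print(args.num_workers)
print([parse_endpoint(worker) for worker in args.workers])
```

With `nargs="+"`, each option consumes values until the next flag, which is what lets `-W` and `-F` each take a variable-length list.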