Parallel processing is on the rise across all kinds of computational models. The trend shows not only in the progress of cluster architectures but also in the wide adoption of multi-core processors, to the point where large-scale data processing is hard to imagine without them. One architecture that has won wide acclaim over the last 3 years is Google’s MapReduce, a model derived from functional programming for handling computation over terabytes of data.
As its name implies, the model works by dividing the computation among a number of Map and Reduce processes. For example, when computing the number of back-links from web pages, each Map function takes a set of web pages (the input data) and emits key-value pairs like <www.unique-weblink.com, [no. of occurrences]>. A hash function on the key then routes every pair with the same key to the same Reduce process. The job of each Reduce function is to take the incoming key-value pairs and reduce them so that every unique key ends up with a single value. Each pair emitted by a Reduce function therefore gives the number of times a web link appeared across all the web pages. Needless to say, this architecture is used for a number of other applications.
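The back-link example can be sketched in a few lines of Python. This is only a single-process toy under stated assumptions, not Google's or Hadoop's implementation: the names `map_backlinks`, `partition`, `reduce_counts`, `run_job` and `NUM_REDUCERS` are made up for illustration, links are pulled out with a naive `href` regex, and `zlib.crc32` stands in for whatever hash function a real system would use.

```python
import re
from collections import Counter, defaultdict
from zlib import crc32

NUM_REDUCERS = 3  # hypothetical number of Reduce processes

# Naive link extractor (an assumption; a real crawler would parse the HTML).
LINK_RE = re.compile(r'href="([^"]+)"')


def map_backlinks(pages):
    """Map: take one set of pages and emit (link, occurrences) pairs."""
    counts = Counter()
    for html in pages:
        counts.update(LINK_RE.findall(html))
    for link, occurrences in counts.items():
        yield link, occurrences


def partition(key, num_reducers=NUM_REDUCERS):
    """Hash a key so that every pair with this key reaches the same reducer."""
    return crc32(key.encode("utf-8")) % num_reducers


def reduce_counts(key, values):
    """Reduce: collapse all values seen for one key into a single value."""
    return key, sum(values)


def run_job(page_splits, num_reducers=NUM_REDUCERS):
    """Simulate the map, shuffle and reduce phases in one process."""
    # One bucket per simulated Reduce process; each groups values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in page_splits:  # each split would go to one Map process
        for key, value in map_backlinks(split):
            buckets[partition(key, num_reducers)][key].append(value)

    results = {}
    for bucket in buckets:  # each bucket would be one Reduce process
        for key, values in bucket.items():
            link, total = reduce_counts(key, values)
            results[link] = total
    return results


# Two input splits, as if handed to two different Map processes.
splits = [
    ['<a href="http://a.example/">x</a> <a href="http://b.example/">y</a>'],
    ['<a href="http://a.example/">z</a>'],
]
counts = run_job(splits)
print(counts)
```

Only `map_backlinks` and `reduce_counts` are specific to the back-link problem; `partition` and `run_job` play the role of the framework, which in a real deployment would also handle distribution across machines and recovery from failures.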
Using MapReduce, an application programmer (at Google) needs to concentrate only on the code that dictates the processing the data must go through, rather than on the complex parallel-processing code that MapReduce already provides. The idea is not only robust but also novel, yet our research team still feels there is great room for improvement.
So the intrepid Momina Azam and I are currently working under the wing of Dr. Umar Saif to improve this architecture. We are aiming for a publication soon, so watch this space to learn more about our improvements over the MapReduce architecture and to read the publication itself :)
If you're trying to trace how Hadoop works, you might find our Hadoop call-trace doc helpful. If you do, drop a thank-you note to Momina :)
Watch this space! Our research team will soon be releasing its implementation of plain-vanilla MapReduce in Python.
General >> Google Code Uni. Distributed Systems | Google Lectures on MapReduce | MapReduce Wikipedia
Blogs >> Carnage4Life | Geeking with Greg
Implementations >> Hadoop | Skynet | Cat Programming Language | Qt Concurrent | Andrew McNabb's Mrs
General >> Apache Hadoop | Hadoop Wiki | Hadoop Summit | HDFS | Yahoo Dev. Net. Hadoop | HBase | Hadoop Wikipedia
Help + Articles >> Hadoop Docs | Hadoop API | Hadoop on Amazon EC2 and S3 | HDFS with Python | Hadoop Wiki Amazon EC2 | Berkeley CS16x Project | UCSD CSE 124 Project | IBM Hadoop Tools for Eclipse
Blogs >> Doug Cutting's Blog | Code Codex | Tom White's Blog | Jeremy Zawodny's Blog Yahoo
People >> Jeffrey Dean | Sanjay Ghemawat | Christopher Olston | Joseph M. Hellerstein | Mehul A. Shah | Doug Cutting