tinkering with Google's MapReduce
We are seeing an advent of parallel processing in all kinds of computational models. The trend can be seen not only from the progress in cluster computational architectures, but also from the wide adoption of multi-core processors, to the extent that we cannot imagine large data processing without them. One architecture that has reached critical acclaim over the last 3 years is Google’s MapReduce. It’s a model derived from functional programming for handling computation with terabytes of data. As its name implies the model works by dividing computation over to a number of Map and Reduce processes. For example when computing the number of back-links from web-pages, each map function will take a set of web-pages (the input data), and emit key-value pairs like <www.unique-weblink.com, [no. of occurrences]>. Using a hash function, all unique keys will end up with a specific Reduce process. The job of these Reduce functions would be to take the input key-value pairs and reduce them so every unique key eventually has a single value. Hence each emitted pair would tell the number of times a web-link appeared in all the web-pages. Need-less to see this architecture is used for a number of other applications.
Using MapReduce, an application programmer (at Google) needs to concentrate only on code which dictates only the processing the data needs to go through rather than on the complex parallel processing code that MapReduce already offers. The idea is not only robust but also novel, yet our research team still feeld there is great room for improvement.
So me and, the intrepid, Momina Azam are currently working under the wing of Dr. Umar Saif, to improve this architecture. We are aiming for a publication soon, so watch out for this space, to learn more about our improvements over the MapReduce architecture and to read the publication itself :)
Research Questions? we are Answering
- Changing how the Master works. Improving it, will greatly enhance the performance of the whole architecture, since it plays a pivotal role in orchestrating the computation.
- More efficient work distribution across network nodes. Currently the MapReduce architecture binds a key strongly to a certain node for the Reduce phase. Does this burden some nodes heavily? How can this limitation be laxed?
- Getting results in stages rather than at the end of the computation. It makes little sense to obtain results for large keys and small ones at the same time. How can the computation be scheduled in a way to obtain meaningful results for small keys much earlier in the computation.
- Getting meaningful partial results. The original MapReduce architecture was bound to complete the reduce of every key before a computation could successfuly end. Would it make sense to know that www.cnn.com had either greater than 2 million or exactly 7.34 million back-links? How can a user obtain approximate, but still meaningful answers.
If you're trying to trace how Hadoop works, you might find our Hadoop call-trace doc. helpful. If you do! drop a thank you note to Momina :)
Watch out for this space! Our research team will be soon releasing its implementation of plain-vanilla MapReduce in Python
- Pig Latin: A Not-So-Foreign Language for Data Processing | C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins | ACM SIGMOD/PODS '08 - International Conference on Management of Data, June 2008 | Project Page | Presentation | Video
- Sinfonia: a new paradigm for building scalable distributed systems | M. K. Aguilera, A. Merchant, M. A. Shah, A. Veitch, C. Karamanolis | ACM Symposium on Operating Systems Principles, pages 159 - 174, October 2007 | Presentation
- Map-reduce-merge: simplified relational data processing on large clusters | H. Yang, A. Dasdan, R. Hsiao, D. S. Parker | ACM SIGMOD international conference on Management of data, pages 1029 - 1040, 2007 | Presentation
- Dynamo: Amazon’s Highly Available Key-value Store | G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels | ACM Symposium on Operating Systems Principles, pages 205-220, Oct. 2007 | Presentation
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks | M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly | Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 59-72, 2007 | Presentation | Video
- Google’s MapReduce programming model—Revisited | R. Lammel | Science of Computer Programming, pages 208-237, Volume 68, Issue 3, Oct. 2007
- Flux: A Mechanism for Building Robust, Scalable Dataflows | M. A. Shah | UC Berkeley PhD Thesis, 2004.
General >> Google Code Uni. Distributed Systems | Google Lectures on MapReduce | MapReduce Wikipedia
Blogs >> Carnage4Life | Geeking with Greg |
Implementations >> Hadoop | Skynet | Cat Programming Language | Qt Concurrent | Andrew McNabb's Mrs
General >> Apache Hadoop | Hadoop Wiki | Hadoop Summit | HDFS | Yahoo Dev. Net. Hadoop | HBase | Hadoop Wikipedia
Help + Articles >> Hadoop Docs | Hadoop API | Hadoop on Amazon EC2 and S3 | HDFS with Python | Hadoop Wiki Amazon EC2 | Berkeley CS16x Project | UCSD CSE 124 Project | IBM Hadoop Tools for Eclipse
Blogs >> Doug Cutting's Blog | Code Codex | Tom White's Blog | Jeremy Zawodny's Blog Yahoo
People >> Jeffrey Dean | Sanjay Ghemawat | Christopher Olsten | Joseph M. Hellerstein | Mehul A. Shah | Doug Cutting