For the last two years, one of my research interests has been in MapReduce for computer vision tasks. My initial reason for looking into MapReduce was to encourage the Computer Vision community to move towards larger, web-based, datasets (e.g., Flickr, YouTube, Google Images). These datasets tend to be more realistic, show generality, and demonstrate computational tractability of research methods. MapReduce comes in because it enables programmers to abstract away the complexity of distributed processing at the expense of efficiency (compared to fine-grained methods such as MPI). A major hurdle for me starting out with Hadoop was that its native API is in Java, a language that gets little love in the vision community. This is primarily due to the need for fast pixel operations (often with pointer tricks) and linear algebra, tasks that come more naturally in languages like C and to some extent Python using numpy. My personal style is to write all code that touches pixels in C or in numpy arrays (similar to MATLAB) and use Python for the higher level architecture.
After a brief search I found Hadoop Streaming, a
guide to using streaming with Python, and the
dumbo library. The obvious choice is to work with dumbo, the
de facto standard; however, after a few months of use I found it to be a bit difficult to use the way I wanted as it was designed with different goals in mind. At a high level I have ~20 stage MapReduce flows with C modules, Cython, and ctypes where dumbo is really designed for fast one-off scripts; moreover, I have access to a few different clusters, some of which I am unable to install libraries or even Python on, clearly dumbo was optimized for the case where you have some amount of cluster admin access. I decided that the interface I needed was simple enough, and that rolling my own library would satisfy my own research goals while diversifying the existing Python/Hadoop community.
My goals for hadoopy are
- Similar interface to Hadoop API (design patterns usable between Python/Java interfaces)
- General compatibility with dumbo to allow users to switch back and forth (in some instances this isn't possible due to #1)
- Usable on Hadoop clusters without Python and admin access (this also simplifies use with EC2 as there is no setup)
- Fast conversion and processing (Klaas, the author of Dumbo, has done an excellent job advocating for Hadoop Streaming enhancements such as TypedBytes)
- Stay small and well documented
- Be transparent with what is going on
- Handle programs with complicated .so's, ctypes, and extensions
- Code written for hack-ability (library is targeted at longer term projects where users will likely dive into the code-base)
- Simple HDFS access (e.g., ls and cat)
- Oozie support (both dumbo and hadoopy were written before Oozie, but it is clearly the standard now)
- Protocol Buffers support (in progress)
- Cython user code support (in progress)
Hadoopy's core is written in Cython so that the plumbing that handles the parsed KeyValue pairs is efficient. In my experiments, for IO bound tasks like log parsing, the TypedBytes parsing is the bottleneck. I wrote my own implementation of TypedBytes in Cython that is optimized for the common case (input via stdin, output via stdout). I validated my implementation with the reference TypedBytes/cTypedBytes libraries and have achieved a fairly dramatic
performance boost. It would be good to see some of the optimizations make it back into those implementations so that everyone can benefit.
#3 seems impossible at first glance, how can you make a Python library that runs on a cluster without a reasonable Python?
cx_Freeze to the rescue! cx_Freeze effectively builds you a custom executable that consists of your target program and a Python parser. It also gives you the shared libraries you need to take along for the ride. In Hadoopy, there is a 'launch' function that starts a Hadoop Job like normal (similar to how dumbo's executable does) and a 'launch_frozen' function which does the same thing but auto-magically cx_Freeze's everything and handles the changes internally. If you use EC2 that means no more complicated setup scripts (I used to have a huge script to set everything up just right, now I just start the cluster). If you are running on someone else's cluster and you don't want to bug them with installing Python, the new library that you need, etc then you are in luck. Even if you have your own carefully manicured cluster it may be worth not customizing each node as it saves time and effort when you add nodes. Keep in mind this is basically what Java is doing, you just bring along all of your goodies and it lets you not have to worry about the actual cluster details. I'm experimenting with using Cython's own support for "freeze" functionality. This would remove the cx_Freeze dependency and allow for user programs to be written in Cython (basically you get C performance from Python syntax).
During analysis, I kept wanting to grab a KeyValue stream off of HDFS in python (normally while using the ipython shell) and just get a feel for the data. Instead I had to convert it to a nicer output format with another Map task. For #9 I use the hadoop command line functionality dumptb (written by Klaas in CDH3) which will uncompress the data and take it from the SequenceFile format, resulting in a stream of TypedBytes that I can parse in. This lets you get an iterator of KeyValue pairs from any TypedBytes encoded file, dramatically simplifying the integration of Hadoop into auxiliary analysis scripts that run locally. It is also possible to read off of HDFS while inside of a hadoopy job, simplifying tasks such as K-Means Clustering where you need to package together your most recent cluster centers for the next iteration.
The Hadoop framework is focused on the Map and Reduce phases of a single job; however, projects consist of several of these jobs chained together. The
Oozie framework made by Yahoo and supported by Cloudera is designed to handle this task by representing each Hadoop job as a node in an directed acyclic graph. My previous stab at a similar idea (before Oozie came out) worked fine until jobs failed. Almost all new work occurs at the end of the pipeline so that is where your new bugs are; moreover, Python exacerbates this problem as most checks occur at runtime. This makes Oozie and Python the perfect team, you can hack away your latest job, push it out, if it fails you change your code,
update the list of nodes to skip, and rerun the task. I'll provide more details in a later post.
I recently rewrote major parts of Hadoopy in Cython (I have been using it consistently for about a year now) and want to get feedback. I'm working on documenting the code, providing, tutorials, and examples of differences between the Hadoopy/Dumbo interfaces. As of this writing the documentation includes benchmarks, a full tutorial, and API specification.
Hadoopy - Source
Computer Vision Code written using Hadoopy
Papers using the Hadoopy library
A Case for Query by Image and Text Content: Searching Computer Help using Screenshots and Keywords (to appear in WWW'11)