Brandyn White http://brandynwhite.com Computer Vision, Hadoop, Mobile Computing, Kinect, and Big Data posterous.com Sun, 18 Sep 2011 22:53:00 -0700 Bitcasa: Infinite storage brings infinite cost and liability http://brandynwhite.com/bitcasa-infinite-storage-brings-infinite-cost http://brandynwhite.com/bitcasa-infinite-storage-brings-infinite-cost

There is an exciting new company, Bitcasa, that promises infinite storage for $10 a month and says your data is encrypted client side.  To break that down:  you pay a finite cost for infinite resources (bandwidth, storage, processing), and client side encryption implies they have no practical ability to read your data (plaintext).  It sounds too good to be true; the tricky part is how close they can get to that and what they sacrifice in the process.

Disclosure:  I am not affiliated with Bitcasa or any of their competitors; many of the statements here are educated guesses based on related research papers and their own public disclosures.  I have not used Bitcasa (though I'd like to!).  I am interested in security and privacy and would love a company like this to succeed, but I believe there are major problems with their model that haven't been discussed publicly.  Some of what I am going to talk about are attack models and potential legal issues.  I do not condone exploiting anything (and I hope that they have solutions internally for these problems), and I am not a lawyer.  I don't have a professional background in cryptography, but I have significant practical experience and have taken graduate level crypto courses (I am knowledgeable but not an expert).  This assumes that you know what the following terms mean (google them real quick if you don't):  symmetric key encryption, cryptographic hash, side channel attacks, and de-duplication.

Obvious initial complaints
These aren't necessarily problems with their system but they summarize the visceral issues one might have with it.
  • You "can't" de-duplicate with client side encryption
  • Infinite storage is impossible
  • $10 is not enough to make this viable
  • There are no simple ways to know how much privacy/security you actually have with their closed system architecture
What we know
They use convergent encryption to do most of the counter-intuitive cryptographic tricks.  The way that "client side encryption" works in a common use case is that the user has a key, the file is encrypted with the key, and then the encrypted file is given to the service.  Using this model the only weak points are the key and the crypto (which can both be made very strong).  The problem with that setup is that the same data encrypted with different keys should be unpredictably different (i.e., pseudorandom), which means tons of space is wasted storing copies of the same data.  Convergent encryption is a trick where you compute the key from the file itself by using a cryptographic hash.  That means if two users have the same file, they would come up with the same key and they would then produce the same encrypted file (ciphertext).
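As a rough illustration of the idea (not Bitcasa's actual scheme, which isn't public), the sketch below derives the key from the file with SHA-256 and encrypts with a toy XOR keystream built from the standard library.  The names convergent_encrypt, convergent_decrypt, and _keystream are made up for this example, and the toy cipher is only a stand-in for a real one such as AES.

```python
import hashlib


def _keystream(key, length):
    """Expand the key into `length` pseudorandom bytes (toy construction)."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def convergent_encrypt(plaintext):
    """Return (key, ciphertext) where the key is a hash of the plaintext."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key, ciphertext):
    """XOR with the same keystream to recover the plaintext."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


# Two users who hold the same file independently produce the same ciphertext,
# so the service only needs to store a single copy.
alice_key, alice_blob = convergent_encrypt(b"the same file contents")
bob_key, bob_blob = convergent_encrypt(b"the same file contents")
```

Because nothing user-specific goes into the key, the ciphertext depends only on the file contents.  That is exactly what makes de-duplication possible, and it is also what the attacks discussed later rely on.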

Assumptions
Since much of their system is secret, to talk about it we need to assume a few things.  They need not have exactly this kind of setup as many variants would have similar properties.
  • They know who their users are in terms of credit card numbers, names, etc. instead of keeping people totally anonymous.  They currently have quotas for free accounts so they need a way to budget this.
  • They have an encrypted file list for each user:  This is necessary because if the user's drive gets nuked they need to be able to provide a key and get the files back.
  • They know (at least roughly) how many files or what the total usage each user has.  Even if your filelist is encrypted, you could use its size to roughly determine the number of files.
  • Files are stored in chunks with metadata on each chunk.  They show that seeking is possible in videos so clearly this would be an obvious thing to implement.
  • They are expected to keep the files around for the lifetime of your account.  If they lose something they might get scolded but aren't liable, but the expectation is there along with reasonable service (though not necessarily with an SLA).

Potential Flaw #1: The Space Burner
There are always people who want to ruin your day.  If they provide infinite storage, what would stop you from renting a server near their datacenter and creating a tremendous number of small files (around 128 bits each)?  Small files are likely more costly to store because of the per-file metadata, and there are 2^128 possible 128-bit files, so the service can't possibly have them all on disk already to benefit from de-duplication.  From their pitch, this is well within your infinite quota, and on their side each file likely costs much more than the 128+ bits you send (the encrypted data plus its metadata).  Then you randomly read files back to ensure they aren't being put on a tape drive or other 'cold' storage.  They have to keep your data around, so the storage and maintenance cost is high; moreover, the transfer cost between the client and server (as noted by Google's Marissa Mayer @ Disrupt'11) is nontrivial.

This can be prevented by putting caps/quotas which are clearly against their primary marketing message; moreover, by reserving the ability to place a cap you open yourself up to several tricky privacy/security issues (more later).  Note that while it is true that many 'unlimited' web services can be exploited in similar ways, the reason I bring this up is that the fixes necessarily reduce the user's privacy/security from what is advertised. This isn't the case for most webservices as they don't claim to provide this level of service.  More discussion below on how countermeasures to this problem lead to additional problems.

Potential Flaw #2: The Rapidest Share
I have a lot of opinions on copyright law, illegal numbers, etc.; however, whatever system you build is still subject to the laws of the countries you operate/reside in, and those laws need to be considered (or changed).  Presently, sharing files is subject to $100K+ fines per file and is being actively litigated.  Using their system it is possible (in theory) to turn it into the ultimate file sharing site.  Say I want a specific file my friend has:  he puts the file up and tells me the key the system used to encrypt it (possibly by reverse engineering the client code).  Then I make a new client that communicates with the server (again by reversing the protocol and client) and tells it I have this file by passing the hash.  They (presumably) add that file to my list of accessible files and I can then download it whenever I want without ever actually having had it.  A copyright holder would just check to see if the file already exists on the system (as it wouldn't be re-uploaded by the off-the-shelf client), and that is evidence that their file is on the server (possibly enough evidence for them to get additional info).  If the client file lists are encrypted, the clients are mostly safe (more on that next); however, any provider of this service could be legally responsible.
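To make the attack concrete, here is a toy model of a de-duplicating server.  Everything in it is an assumption for illustration:  the class DedupStore, its exists/upload/claim/download methods, and the choice to identify files by the SHA-256 of the stored blob are all hypothetical, since the real protocol isn't public.

```python
import hashlib


class DedupStore:
    """Toy server: one stored blob per content hash, plus per-user file lists."""

    def __init__(self):
        self.blobs = {}       # content hash -> stored (encrypted) bytes
        self.file_lists = {}  # user -> set of content hashes

    def exists(self, content_hash):
        """The 'data probe': anyone can learn whether a file is already stored."""
        return content_hash in self.blobs

    def upload(self, user, blob):
        """Store the blob once and record it in the user's file list."""
        content_hash = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(content_hash, blob)
        self.file_lists.setdefault(user, set()).add(content_hash)
        return content_hash

    def claim(self, user, content_hash):
        """Hypothetical 'I already have this file' call that skips the upload."""
        if content_hash not in self.blobs:
            return False
        self.file_lists.setdefault(user, set()).add(content_hash)
        return True

    def download(self, user, content_hash):
        """Return the blob only if it is in the user's file list."""
        if content_hash not in self.file_lists.get(user, set()):
            raise PermissionError("file is not in this user's list")
        return self.blobs[content_hash]


store = DedupStore()
shared_hash = store.upload("friend", b"encrypted song bytes")
# I never held the file; knowing its hash is enough to be granted access.
claimed = store.claim("me", shared_hash)
fetched = store.download("me", shared_hash)
```

In this model the server cannot tell a client that skipped the upload because it really has the file from one that was simply told the hash.  The same exists() check is what a copyright holder would use as a probe.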

While it is true that the DMCA provides legal cover for hosting providers, it compels cooperation (see #3 below) and it isn't complete protection.  For example, Google Music makes a copy of your music even if it exists remotely; however, Apple secured an expensive license to hash match (check if the file exists remotely before sending it, which is what Bitcasa is doing, certainly without a license).  This is thought to be a response to the legal gray area caused by this technique; however, a recent ruling sided with hosting companies on this issue.

Potential Flaw #3: The Ultimate Copyright Enforcement Tool
Similar to above, a copyright holder uses the standard system to determine if something already exists and uses that as proof to get more info.  I mentioned that clients were mostly safe, which means there are times when they are not.  Let's say they use a more secure system than they clearly do, where they don't even know what files are yours but they do know you are a customer (necessary if you pay by credit card, possible to fix if you use bitcoin for payment).  If you are the only customer then this 'data probe' will prove without any doubt that you have the copyrighted file.  OK, that's a bit extreme, so let's say there are enough people that the previous argument wouldn't be enough evidence to convince anyone that the file belongs to you.  The copyright holder can determine that 'someone' put the file there, and then get a court order to watch all IP addresses that the service sends that file to.  This works even if the service doesn't keep file lists, and is only really preventable if the clients use TOR, which would destroy any reasonable expectation of speed.

An alternative method for determining that a file belongs to a user is the correlation between when a file is placed and a change in the user's file list.  This kind of information would have to be backed up regularly, so historical data can show this trend and possibly expose patterns of file sharing between users out of band (a friend emails an MP3 to another friend).

These "side channel" attacks are extremely subtle and are a common way to break a cryptographic system as the obvious routes are heavily analyzed and the side channels normally come about during implementation.  Note that preventing #1 makes #3 worse.

Potential Flaw #4: The Cache Burner
Since lots of users will access similar files (e.g., OSX system files, new MP3s) they presumably use some kind of in-memory cache.  Similar to #1, a nasty person could pollute the cache by requesting random files (potentially while masquerading as several users) in hopes of making it difficult to efficiently cache files for everyone else.
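A small simulation shows the effect.  It assumes a least-recently-used (LRU) eviction policy, which is only a guess at what a service like this might use; the LRUCache class and the file names are invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.items[key] = True  # pretend we fetched it from disk
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict least recently used


popular = ["system_file_%d" % i for i in range(4)]
cache = LRUCache(capacity=4)

for name in popular:  # warm-up: four misses fill the cache
    cache.get(name)
for name in popular:  # normal traffic: every request is a hit
    cache.get(name)
hits_before_attack = cache.hits

for i in range(4):  # attacker requests files nobody else wants
    cache.get("junk_%d" % i)
for name in popular:  # the popular files were evicted, so these all miss
    cache.get(name)
hits_after_attack = cache.hits - hits_before_attack
```

With a cache this small the attacker only needs as many junk requests as there are cache slots.  A real cache is far larger, but the attacker's cost still scales with the cache size rather than with the number of legitimate users.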

Potential Flaw #5: Decrypting Small Files
If someone were to get access to a small encrypted file with < 64 bits of plaintext, it is easy to brute force the original encryption key, and you know a guess is correct when the resulting plaintext produces the same key.  This would not be an effective attack against the more common interpretation of client side encryption (where > 128 bit keys are standard).  It is counter-intuitive, and it would be difficult to explain to end-users that their security decreases dramatically as their files approach these small sizes.  Since Bitcasa works on a whole drive level, it is nearly impossible to ensure that all important files are "large enough".

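import hashlib


def crypto_hash(data):
    """Convergent key: a cryptographic hash (here SHA-256) of the plaintext."""
    return hashlib.sha256(data).digest()


def sym_dec(ciphertext, key):
    """Toy stand-in for a real symmetric cipher (XOR with a SHA-256 keystream).

    With no padding or integrity check, very short plaintexts can yield
    false matches; a real cipher would be used in practice.
    """
    stream = b''
    counter = 0
    while len(stream) < len(ciphertext):
        stream += hashlib.sha256(key + counter.to_bytes(8, 'big')).digest()
        counter += 1
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def break_crypto(ciphertext, num_bytes=1):
    """Brute force convergent encryption.

    Sketch to show the vulnerability to small messages.
    This is an example for small plaintexts (roughly known byte size).

    Args:
        ciphertext: plaintext that has been encrypted using symmetric
            encryption with the key being a hash of the plaintext.
        num_bytes: Number of bytes in the plaintext message.

    Returns:
        plaintext

    Raises:
        ValueError: The number of bytes provided is incorrect.
    """
    # A byte has 256 possible values, so there are 256 ** num_bytes candidates
    for plaintext_val in range(256 ** num_bytes):
        # Convert the value into a fixed length bytestring
        plaintext_guess = plaintext_val.to_bytes(num_bytes, 'big')
        key_guess = crypto_hash(plaintext_guess)
        # Try to decrypt using symmetric crypto
        if sym_dec(ciphertext, key_guess) == plaintext_guess:
            return plaintext_guess
    raise ValueError('Number of bytes guessed is wrong')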

Shortsighted Solutions 
There are several things that can be implemented to address these but they cause other problems.

To fix the griefing (#1 and #4) you need more detailed user models (necessarily unencrypted) including usage, traffic, etc.  Each of those can make the service much cheaper to run, but the trade-off is that, to find the few bad guys, they reduce everyone's privacy level and open everyone up to increased liability and privacy problems (see #3 above).  As a startup, cash is your runway, so it is highly unlikely that they took the high road here; they likely keep detailed user statistics to terminate malicious accounts.

To fix #3 they need to allow anonymous payment with bitcoin, require the usage of TOR, and maintain no identifiable information (name, email, statistics, etc.).  TOR and no info are highly unlikely to be acceptable for a variety of reasons (e.g., marketing, speed, analytics, regulations).  Additional user info is a possible side channel attack vector (e.g., maximum file size for a user, number of files) and they are hidden security threats for the client.

To fix #2 they basically need to play the same cat and mouse game as DRM companies and that never works out in the end.  There will likely be ways to download files you never had access to, even if it is hard to do it.

To fix #5 you can set a threshold where you switch over to "natural" client side crypto; however, this compounds the problem with #1 where now a bad guy can find the largest such filesize and they won't even be de-duplicated.

Conclusion
I really like this new approach to efficient and (somewhat) secure file storage; however, without being more transparent about how things actually work I don't think I'd be able to trust this system.  There are just too many places to make subtle mistakes that only hurt the client (and in ways that aren't observable).  Ideally they would drop this infinite gimmick and use a more natural storage model with clear explanations of what user data is stored and what attacks are possible.

http://files.posterous.com/user_profile_pics/1596343/IMG_0071.JPG http://posterous.com/users/YXqRcSjGHwl Brandyn White bwhite Brandyn White
Sat, 19 Mar 2011 21:46:00 -0700 Thoughts on the Cr-48, cloudbooks, and the future of Chrome OS http://brandynwhite.com/thoughts-on-the-cr-48-cloudbooks-and-the-futu http://brandynwhite.com/thoughts-on-the-cr-48-cloudbooks-and-the-futu

I recently received one of the magic Cr-48 devices from Google and I wanted to share my thoughts about the platform.  I've been living the "cloud" lifestyle for about a year now:  storage on S3/GDocs, servers on Linode/EC2, and my laptops are only used for browser/ssh.  For me, it's comforting knowing that if my laptop gets broken or stolen it can be replaced without data loss.  Moreover, my view of my data/applications is the same no matter what machine I happen to be using.  Several people have done extensive Cr-48 reviews; this will not be one of them.  I'll briefly talk about the aspects that I find interesting and focus on the future of this and similar devices.

Hardware
For an experimental device, it is surprising how good the Cr-48 actually is in terms of build quality and ergonomics.  The trackpad is nice and big, very MacBook-like.  The keyboard is labeled in lowercase letters, has no function keys, and has nice chunky ctrl/alt keys.  Most importantly, the caps lock key (which happens to default to new tab) can be remapped to ctrl, which is essential to avoid emacs pinky.  It has 8 hours of battery life and it comes with 100MB of Verizon 3G free per month.

Software
Besides a browser, the only other local application I need on a regular basis is a terminal.  I planned on using a web terminal; however, Chrome has a pretty bad design decision that doesn't allow web applications to remap certain keys (e.g., Ctrl-W).  This means that using emacs in a web terminal is pretty much impossible with standard bindings.  Luckily Googlers must also have this issue, because they provided a minimalistic terminal that effectively only supports ssh.  This works perfectly for me and I've been using this combination exclusively for a week now.

Zapped by the Cloud
Part of the fun of this laptop is trying out the newest webapps like Lucid Charts and LatexLab, which for me could replace OmniGraffle and emacs (at least for editing latex); however, I quickly found things weren't as rosy as I was led to believe.  Lots of people like Lucid Charts and there are some nice qualities about it, but I find it lacks the polish of OmniGraffle.  I am deeply afraid to use services like Lucid Charts because there is no way to know if they will exist a month from now, for any number of reasons:  buyout, data loss, bankruptcy, or apathy.  This is a subtle point that has nothing to do with the $5 / month subscription cost; my problem is that without access to the source code to run on my own I am at their mercy.  If the code were free I would gladly pay for the service, knowing that if they disappear like nearly every such company does, I can still function on my own.

LatexLab is effectively a clone of GDocs that compiles latex and has a built-in preview.  It works about as well as most latex GUIs, but the lack of keyboard shortcuts (again Chrome's fault) really cripples it.  Unlike Lucid Charts, the code is available as free software (Apache License), which means I can see myself settling into it.  However, at the height of my optimism I decided to do my Crypto exam on it and it ended up not saving properly, which resulted in a few hours of lost work.  This doesn't mean you shouldn't use it; just be careful because it has different semantics than Google Docs.  In particular, compiling does not save the document.

I attempted to program using a mix of cloud9ide and github triggers to run tests and compile; however, the same editor issue forced me back to ssh+emacs.  It's such a shame because this is another open source project with a freemium price model that did everything right, but because of Chrome it is impractical to use if you rely on keyboard input for navigation.  The best suggestion I've seen so far is to have an authorization banner (like the one for location) for webapps that require more control over key bindings.  This would only affect a handful of apps such as IDEs, document editors, and others that heavily rely on the keyboard for control.

Chrome OS too little too late?
This is a question posed in this discussion on the Chrome Pilot Group and it brings up several good points.  Paraphrasing the submitter's two main concerns:
  1. Programs have increased latency when run remotely and cloud computing results in your machine being similar to a dumb terminal.
  2. Cloud providers can't be expected to care about your data as much as you do; moreover, the recent GMail problems show that data stored in the cloud is not safe.
My response, mostly verbatim:
  1. Think long term.  Wireless broadband is still in its infancy; in 5 years it will be tremendously faster.  This enables people to be more mobile, and battery life is not growing nearly as fast as CPU speed, disk storage, and internet speed.  If you have the bandwidth necessary, it will always be more efficient to carry around a "battery with a screen" and just let Google do all of the processing on their lean server infrastructure.  This is exactly like a dumb terminal, but with all of the slow parts removed.  It's true a desktop will always be more responsive, but we are humans; we don't need minimal latency, we just need acceptable latency.
  2. As a multi-billion dollar business built on reputation, Google has every incentive to keep your data secure.  Even if they don't "care" (in a warm fuzzy sense), they have so much experience that it isn't particularly hard for them to do a much better job than average users.  This is about economies of scale:  you don't generate your own power.  You let the power company produce energy and it is much more efficient; however, I'm sure the power companies don't specifically "care" about each individual home.  By the submitter's reasoning, the first time the power goes out it would show that the electricity supply isn't "safe".  Local backups suffer from "correlated failure".  Let's say you have a fire, flood, earthquake, or burglary:  you will likely lose all of your copies at once, no matter how many times you back them up locally.  Datacenters are built in relatively safe areas and your data is generally stored in 3+ data centers and multiple times in each data center.
Now given that, I think the current way that all information is centralized in a few huge companies can be argued to be problematic (watch this it's very interesting).  That may be a negative, but it has nothing to do with cloud computing itself.  You can benefit from these economies of scale and still allow users to maintain control over data (e.g., clipperz).  Something that isn't talked about enough is that if your data is in the cloud it is tremendously easier for governments to get it without your knowledge.  Compare this to storing the data in your house, where you would necessarily be notified.

So overall I'd summarize by saying if anything Google is about 2 years ahead of demand on Chrome OS, which is just where it should be.  There are some risks with storing data in the cloud but this is just society's first "real" attempt at it.  Advances in encryption will ultimately allow users to know that their data isn't being used in unauthorized ways (see this) or will allow users to store on federated trusted storage platforms that give applications data on demand client side.

What is the future of Chrome OS?
This is another question from the forums.  Paraphrasing, the submitter asks what will change in the final version in terms of form factors, software, and hardware.  My response, mostly verbatim.

Google has two major launch partners (Samsung and ASUS), it now seems that the "iMac" form factor in monitors will be available, and Verizon clearly got in early, so expect them to subsidize the cost to $0.  A major difference between Chrome OS and the "walled gardens" of iOS and Android is that the 'apps' are adaptable by design.  It took Apple a long time to change the form factor of iOS because developers had assumed the same form factor; not so with the web.  So expect dramatically different devices than we've seen for any new platform:  TV interfaces, public PCs in buildings, netbooks, tablets, and maybe even a phone (someone will try).

The dark horse here is graphics, and once again Google is on top of things with the recently standardized WebGL.  One standardized graphics interface that works on all mobile devices will be tremendous.  No more bickering about Flash, and no more settling for bad graphics in online games.  Nvidia is ahead of things here with their mobile ARM/GPU SoCs.  Once the graphics catch up to a certain point the whole picture will be complete and you can actually have a respectable laptop for a few years.  If this happens, watch for an increase in the durability of machines; 'wear' items such as the case and keyboard will be changed out more often, instead of the whole machine being thrown away as it is now.

Don't be surprised when Starbucks announces that they are purchasing 5 Chrome OS notebooks for every store, free to use when you buy a coffee.  I expect schools will buy pallets of these things and every child will get one.  For cash strapped schools, can you imagine not having "outdated" machines every 3 years because all you care about is web applications?  That will push down the cost of textbooks and make education more engaging (relevant and lucrative).  When you check into a hotel, expect to ask the concierge for toothpaste and a Chrome laptop for the night.  Driving around a foreign town?  No problem, your car has a single SoC running Chrome OS for your co-pilot; no more pecking around on your phone in a car when you can give voice commands, and enjoy that large flip down screen instead of your 4" pocket slate.

Another App Store?!
The move to consolidate webapps and streamline distribution is smart and there will certainly be competition on this front (e.g., firefox).  The key to a successful cloudbook is having the necessary applications available and accessible.  All of this is the same as any other appstore and in that sense it's boring; however, there is an opportunity for someone to advance user privacy control by standardizing user data storage.  I'm interested in storing data in a user specified location, and then using javascript libraries to assemble everything client-side so that no data is transmitted to the web-app provider.  Factoring out the application from the data improves privacy, promotes competition among storage providers, promotes data sharing between apps, and simplifies legal jurisdiction concerns.  This move would allow users to be sure that if their application developer disappears, that other developers are able to provide alternatives.

Final Thoughts
I believe the success of Chrome OS comes down to the price of the devices.  I have no doubt that many users could perform all of their computing on this platform and with such a simple design the OS and hardware can surely slim down to the single chip level.  There are only a handful of serious applications that aren't on the short term roadmap for Chrome OS (e.g., robust external device support) and for those tasks you will use something else or perhaps realize they aren't important anymore (why am I sending photos to my computer instead of directly to the web?).

While cloud computing is not a new idea, this attempt is backed by advances in web frameworks, wireless broadband, reliable infrastructure, and a tech savvy public.  Like scanning all of the world's books and photographing public roadways with cars, nobody is asking for permission to innovate, and it is up to users to decide if putting the final pieces of their identities online is something they are interested in participating in.

 

Tue, 18 Jan 2011 14:03:00 -0800 Hadoopy: Cython based MapReduce library for Python (w/ Oozie Support) http://brandynwhite.com/hadoopy-cython-based-mapreduce-library-for-py http://brandynwhite.com/hadoopy-cython-based-mapreduce-library-for-py

For the last two years, one of my research interests has been in MapReduce for computer vision tasks.  My initial reason for looking into MapReduce was to encourage the Computer Vision community to move towards larger, web-based, datasets (e.g., Flickr, YouTube, Google Images).  These datasets tend to be more realistic, show generality, and demonstrate computational tractability of research methods.  MapReduce comes in because it enables programmers to abstract away the complexity of distributed processing at the expense of efficiency (compared to fine-grained methods such as MPI).  A major hurdle for me starting out with Hadoop was that its native API is in Java, a language that gets little love in the vision community.  This is primarily due to the need for fast pixel operations (often with pointer tricks) and linear algebra, tasks that come more naturally in languages like C and to some extent Python using numpy.  My personal style is to write all code that touches pixels in C or in numpy arrays (similar to MATLAB) and use Python for the higher level architecture.

After a brief search I found Hadoop Streaming, a guide to using streaming with Python, and the dumbo library.  The obvious choice is to work with dumbo, the de facto standard; however, after a few months of use I found it to be a bit difficult to use the way I wanted, as it was designed with different goals in mind.  At a high level I have ~20 stage MapReduce flows with C modules, Cython, and ctypes, whereas dumbo is really designed for fast one-off scripts; moreover, I have access to a few different clusters, some of which I am unable to install libraries or even Python on, and dumbo was clearly optimized for the case where you have some amount of cluster admin access.  I decided that the interface I needed was simple enough, and that rolling my own library would satisfy my own research goals while diversifying the existing Python/Hadoop community.

My goals for hadoopy are
  1. Similar interface to Hadoop API (design patterns usable between Python/Java interfaces)
  2. General compatibility with dumbo to allow users to switch back and forth (in some instances this isn't possible due to #1)
  3. Usable on Hadoop clusters without Python and admin access (this also simplifies use with EC2 as there is no setup)
  4. Fast conversion and processing (Klaas, the author of Dumbo, has done an excellent job advocating for Hadoop Streaming enhancements such as TypedBytes)
  5. Stay small and well documented
  6. Be transparent with what is going on
  7. Handle programs with complicated .so's, ctypes, and extensions
  8. Code written for hack-ability (library is targeted at longer term projects where users will likely dive into the code-base)
  9. Simple HDFS access (e.g., ls and cat)
  10. Oozie support (both dumbo and hadoopy were written before Oozie, but it is clearly the standard now)
  11. Protocol Buffers support (in progress)
  12. Cython user code support (in progress)
Hadoopy's core is written in Cython so that the plumbing that handles the parsed KeyValue pairs is efficient.  In my experiments, for IO bound tasks like log parsing, the TypedBytes parsing is the bottleneck.  I wrote my own implementation of TypedBytes in Cython that is optimized for the common case (input via stdin, output via stdout).  I validated my implementation with the reference TypedBytes/cTypedBytes libraries and have achieved a fairly dramatic performance boost.  It would be good to see some of the optimizations make it back into those implementations so that everyone can benefit.
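To give a feel for what the parser has to do, here is a minimal plain-Python sketch of TypedBytes for two types.  It assumes the layout described in Hadoop's typedbytes package documentation (a one-byte type code, with code 3 for a 32-bit big-endian integer and code 7 for a length-prefixed UTF-8 string).  It is an illustration only, not Hadoopy's Cython implementation, and the function names are made up here.

```python
import struct

TYPE_INT = 3     # 32-bit signed integer, big-endian
TYPE_STRING = 7  # 32-bit length followed by UTF-8 bytes


def encode(value):
    """Encode an int or str as a single TypedBytes value."""
    if isinstance(value, int):
        return struct.pack(">bi", TYPE_INT, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return struct.pack(">bi", TYPE_STRING, len(data)) + data
    raise TypeError("this sketch only handles int and str")


def decode_stream(buf):
    """Yield the values packed back to back in `buf`."""
    offset = 0
    while offset < len(buf):
        type_code = buf[offset]
        offset += 1
        # Both supported types start with a 32-bit big-endian number:
        # the value itself for ints, the byte length for strings.
        (number,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        if type_code == TYPE_INT:
            yield number
        elif type_code == TYPE_STRING:
            yield buf[offset:offset + number].decode("utf-8")
            offset += number
        else:
            raise ValueError("unsupported type code %d" % type_code)


# A key/value pair is simply the key followed by the value.
record = encode("word") + encode(42)
pairs = list(decode_stream(record))
```

Every value costs a type dispatch plus a length or payload read, so for IO bound jobs that handle millions of small records this inner loop is a natural candidate for the kind of optimization described above.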

#3 seems impossible at first glance:  how can you make a Python library that runs on a cluster without a reasonable Python?  cx_Freeze to the rescue!  cx_Freeze effectively builds you a custom executable that consists of your target program and a Python interpreter.  It also gives you the shared libraries you need to take along for the ride.  In Hadoopy, there is a 'launch' function that starts a Hadoop Job like normal (similar to how dumbo's executable does) and a 'launch_frozen' function which does the same thing but auto-magically cx_Freeze's everything and handles the changes internally.  If you use EC2 that means no more complicated setup scripts (I used to have a huge script to set everything up just right, now I just start the cluster).  If you are running on someone else's cluster and you don't want to bug them with installing Python, the new library that you need, etc., then you are in luck.  Even if you have your own carefully manicured cluster it may be worth not customizing each node, as it saves time and effort when you add nodes.  Keep in mind this is basically what Java is doing:  you just bring along all of your goodies and you don't have to worry about the actual cluster details.  I'm experimenting with using Cython's own support for "freeze" functionality.  This would remove the cx_Freeze dependency and allow for user programs to be written in Cython (basically you get C performance from Python syntax).

During analysis, I kept wanting to grab a KeyValue stream off of HDFS in python (normally while using the ipython shell) and just get a feel for the data.  Instead I had to convert it to a nicer output format with another Map task.  For #9 I use the hadoop command line functionality dumptb (written by Klaas in CDH3) which will uncompress the data and take it from the SequenceFile format, resulting in a stream of TypedBytes that I can parse in.  This lets you get an iterator of KeyValue pairs from any TypedBytes encoded file, dramatically simplifying the integration of Hadoop into auxiliary analysis scripts that run locally.  It is also possible to read off of HDFS while inside of a hadoopy job, simplifying tasks such as K-Means Clustering where you need to package together your most recent cluster centers for the next iteration.

The Hadoop framework is focused on the Map and Reduce phases of a single job; however, projects consist of several of these jobs chained together.  The Oozie framework made by Yahoo and supported by Cloudera is designed to handle this task by representing each Hadoop job as a node in a directed acyclic graph.  My previous stab at a similar idea (before Oozie came out) worked fine until jobs failed.  Almost all new work occurs at the end of the pipeline, so that is where your new bugs are; moreover, Python exacerbates this problem as most checks occur at runtime.  This makes Oozie and Python the perfect team:  you can hack away at your latest job and push it out, and if it fails you change your code, update the list of nodes to skip, and rerun the task.  I'll provide more details in a later post.
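The rerun-with-a-skip-list workflow can be sketched in a few lines of plain Python.  This is not Oozie's API (Oozie workflows are defined in XML); the run_workflow helper and the job names are hypothetical and only illustrate the idea of walking a job graph in dependency order while skipping nodes that already succeeded.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dependencies, jobs, skip=()):
    """Run jobs in dependency order, skipping ones that already succeeded.

    dependencies: maps each job name to the set of jobs it depends on.
    jobs: maps each job name to a zero-argument callable.
    skip: names of jobs whose output from an earlier run can be reused.
    Returns the list of job names actually executed.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        if name in skip:
            continue
        jobs[name]()  # an exception here stops the run at the failed job
        executed.append(name)
    return executed


log = []
dependencies = {
    "extract_features": set(),
    "cluster": {"extract_features"},
    "classify": {"cluster"},
}
# Stand-in jobs that just record their name; real ones would launch Hadoop jobs.
jobs = {name: (lambda name=name: log.append(name)) for name in dependencies}

first_run = run_workflow(dependencies, jobs)
# Suppose "classify" failed and was fixed: rerun only that stage.
rerun = run_workflow(dependencies, jobs, skip={"extract_features", "cluster"})
```

Since the newest (and buggiest) job usually sits at the end of the graph, skipping everything upstream means a fix costs one job's runtime rather than the whole pipeline's.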

I recently rewrote major parts of Hadoopy in Cython (I have been using it consistently for about a year now) and want to get feedback.  I'm working on documenting the code and providing tutorials and examples of the differences between the Hadoopy and Dumbo interfaces.  As of this writing the documentation includes benchmarks, a full tutorial, and an API specification.


Hadoopy - Documentation

Hadoopy - Source

Computer Vision Code written using Hadoopy

Papers using the Hadoopy library
A Case for Query by Image and Text Content: Searching Computer Help using Screenshots and Keywords (to appear in WWW'11)

Fri, 26 Nov 2010 02:10:00 -0800 Fakenect - OpenKinect driver simulator, experiment with the Kinect without the hardware http://brandynwhite.com/fakenect-openkinect-driver-simulator-experime

The Kinect is quickly becoming a transformative device for Human/Computer Interaction, Computer Vision, and Robotics.  I have been fairly active in the OpenKinect community and one thing I commonly hear is "I want to experiment with the demos out there but I don't have a Kinect".  From that, I decided to make a library that identically replicates the libfreenect library's interface so that anything that dynamically links it can function without requiring a Kinect.  Instead of generating synthetic data, I made a "record" program that dumps the RGB, Depth, and accelerometer data into a directory that can be loaded by fakenect.

Examples
I will use my demo program demo_cv_async.py in libfreenect/wrappers/python/ for this example.

./demo_cv_async.py

Here is a frame from the output

Screen0

The kinect sensor data can be dumped by running 'record' in libfreenect/build/utils
./record
Records the Kinect sensor data to a directory
Result can be used as input to Fakenect
Usage: ./record <out_dir>

./record legos0
...
legos0/a-1290762884.085605-2032009319-30.dump
legos0/d-1290762884.101368-2042233973-614400.pgm
legos0/a-1290762884.151033-2042233973-30.dump
legos0/r-1290762884.174879-2044011044-921600.ppm
legos0/a-1290762884.249213-2044011044-30.dump
legos0/a-1290762884.251880-2044011044-30.dump
legos0/a-1290762884.260512-2044011044-30.dump
...

The dump output is available here.  During the example I moved the Kinect to get different viewpoints.  I can now load this into the demo shown previously by running this on Linux (the following is one big command, not multiple commands)
LD_PRELOAD="/usr/local/lib/fakenect/libfreenect.so" FAKENECT_PATH="../../build/utils/legos0" ./demo_cv_async.py

or on OS X

DYLD_LIBRARY_PATH="/usr/local/lib/fakenect/" FAKENECT_PATH="../../build/utils/legos0" ./demo_cv_async.py

Here is a frame from the output

Screen1

Note that the demo was not created with fakenect in mind; I just overrode the library it uses with fakenect and passed fakenect the sensor dump path with an environment variable.

Another example on glview in build/bin
LD_PRELOAD="/usr/local/lib/fakenect/libfreenect.so" FAKENECT_PATH="../utils/legos0" ./glview

Here is a frame from the output

Screen2

Again, nothing was recompiled and glview has been around since the beginning (so clearly it works on 'legacy' applications).
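
The invocations above differ only in which environment variables are set, so they are easy to wrap if you want to launch programs against fakenect from a script.  Here is a small sketch; fakenect_env is a hypothetical helper name, and the default library directory is simply the one used in the commands above.

```python
import os
import sys

def fakenect_env(dump_dir, lib_dir='/usr/local/lib/fakenect', platform=None):
    """Return a copy of the environment set up to run a program on fakenect."""
    platform = sys.platform if platform is None else platform
    env = dict(os.environ)
    env['FAKENECT_PATH'] = dump_dir
    if platform == 'darwin':
        # OS X: search the fakenect directory for libfreenect first.
        env['DYLD_LIBRARY_PATH'] = lib_dir
    else:
        # Linux: preload the fakenect build of libfreenect.
        env['LD_PRELOAD'] = lib_dir + '/libfreenect.so'
    return env

linux_env = fakenect_env('../../build/utils/legos0', platform='linux')
mac_env = fakenect_env('../../build/utils/legos0', platform='darwin')
# To launch the demo: subprocess.call(['./demo_cv_async.py'], env=linux_env)
```

Leaving the environment untouched runs the same program against the real sensor, so switching between live and recorded data is a one-line change in the caller.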

Applications
There are a variety of use cases for this.

I work in a research lab and it rarely makes sense to do experiments on a live sensor, as it makes reproducing results difficult and it constrains you to real-time performance (first make it work, then make it work fast).  This allows you to record Kinect data and easily switch between live and recorded modes.

If you don't have a Kinect but want to play with one, this is your opportunity.  Get a sensor dump (you can use the one I posted above), build the libfreenect driver, and run demos like I showed previously.  This can 1.) show your friends how cool you are, 2.) test your hardware out, 3.) help you understand what the Kinect does, 4.) start writing code even before you get a Kinect, and 5.) save yourself a few hundred bucks.

This can be used to mock the Kinect out for testing purposes where you can have a few test shots and run automated unit testing, etc.  

Where to get it
This was originally pending acceptance in the main repo.  It is in the mainline tree now
https://github.com/OpenKinect/libfreenect

Implementation
If you've hung around this far you probably want to know more, well here it is.

The record program takes one argument (the output directory) and saves the acceleration, depth, and rgb data as individual files with names in the form "TYPE-CURRENTTIME-TIMESTAMP" where TYPE is either (a)ccel, (d)epth, or (r)gb, TIMESTAMP corresponds to the timestamp associated with the observation (or in the case of accel, the last timestamp seen), and CURRENTTIME corresponds to a floating point version of the time in seconds.  The purpose of storing the current time is so that delays can be recreated exactly as they occurred.  For RGB and DEPTH the dump is just the entirety of the data provided in PPM and PGM formats respectively (just a 1 line header above the raw dump).  For ACCEL, the dump is the 'freenect_raw_device_state'.  Only the front part of the file name is used, with the rest left undefined (extension, extra info, etc).

A file called INDEX.txt is also output with all of the filenames local to that directory to simplify the format (e.g., no need to read the directory structure).
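
As an illustration of the naming scheme, here is a sketch of how the entries of INDEX.txt could be parsed in Python.  It is not the parser fakenect uses; DumpEntry, parse_dump_name, and parse_index are names invented for this example, and the sample file names are taken from the record output shown earlier.

```python
from collections import namedtuple

DumpEntry = namedtuple('DumpEntry', 'kind current_time timestamp filename')

def parse_dump_name(filename):
    """Split a 'TYPE-CURRENTTIME-TIMESTAMP...' name into its leading fields."""
    kind, current_time, rest = filename.split('-', 2)
    if kind not in ('a', 'd', 'r'):
        raise ValueError('unknown dump type: %r' % kind)
    # Anything after the timestamp (size, extension) is left undefined, so
    # keep only the part up to the next '-' or '.'.
    timestamp = rest.split('-', 1)[0].split('.', 1)[0]
    return DumpEntry(kind, float(current_time), int(timestamp), filename)

def parse_index(lines):
    """Parse the lines of an INDEX.txt file, ignoring blank lines."""
    return [parse_dump_name(line.strip()) for line in lines if line.strip()]

index = parse_index([
    'a-1290762884.085605-2032009319-30.dump\n',
    'd-1290762884.101368-2042233973-614400.pgm\n',
    'r-1290762884.174879-2044011044-921600.ppm\n',
])
```

Each entry keeps the original file name so the matching dump can be opened later, while the trailing size and extension are ignored as described above.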

It will keep running until you stop it: hit Ctrl-C and the signal will be caught, the run loop stopped, and everything stored cleanly.

On playback, fakenect reads one update from the index per call, so this needs to be called in a loop like usual.  If the index line is a Depth/RGB image, the provided callback is called.  If the index line is accelerometer data, it is used to update our internal state.  If you query for the accelerometer data, you get the last sensor reading that we have.  The time delays are compensated as best we can to match those from the original data under the current run conditions (e.g., if it takes longer to run this code then we wait less).
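
The delay compensation can be sketched as follows.  This is an illustration of the idea rather than fakenect's actual code: fakenect handles one index line per call, while this sketch loops over all entries for brevity.  The replay function and the FakeClock test double are hypothetical names.

```python
import time

def replay(entries, handle, clock=time.time, sleep=time.sleep):
    """Call handle(entry) for each entry, reproducing the recorded spacing.

    entries is a sequence of (recorded_time, payload) tuples sorted by time.
    """
    if not entries:
        return
    record_start = entries[0][0]
    playback_start = clock()
    for recorded_time, payload in entries:
        # Compare how far into the recording this entry is with how far
        # into the playback we are; sleep only for the remaining part.
        target = recorded_time - record_start
        elapsed = clock() - playback_start
        if target > elapsed:
            sleep(target - elapsed)
        handle((recorded_time, payload))


class FakeClock:
    """Deterministic stand-in for time.time/time.sleep, used for the demo."""
    def __init__(self):
        self.now = 100.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


fake = FakeClock()
seen = []
replay([(10.0, 'depth'), (10.5, 'rgb'), (11.0, 'depth')],
       seen.append, clock=fake.time, sleep=fake.sleep)
```

Because the wait is computed from the playback start rather than from the previous entry, time spent inside the handler is automatically subtracted from the next wait, which is the "if it takes longer to run this code then we wait less" behavior.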

Sensor Dumps
