Thursday, October 3, 2013

WebHDFS vs Native Performance

So after I heard about WebHDFS, I became curious about it's performance characteristics compared to the native client, particularly after reading this blog entry.  Oddly, I found my results to be dramatically different from Andre's findings.

Here's the results of my experiments
Size Native Avg % Faster
10 MB -20.0%
100 MB 34.3%
500 MB 48.3%
1 GB 79.4%
10 GB 90.1%

As you can see, the native client generally handily beats WebHDFS, and there seems to be a correlation between the performance gap and the file size.  I haven't had the time yet to look into the technical details of why this might be.  There are some differences between our tests to note:
  • The latency between my client and the server is much lower (about 0.29ms instead of 23ms)
  • My client is in the same data center rather than a remote data center, with 10GbE connecting it to the server
  • I used wget instead of a Python WebHDFS client

It's possible there's network or cluster configuration differences that could contribute as well (including differences in Hadoop versions).  My takeaway from this was that it's better to observe your actual performance before deciding which approach to take.

HDFS NameNode Username Hack

I created a userscript to override the username when (when programmatically detectible) to allow you to read HDFS directories and files that aren't world readable.  Nothing fancy here, you could edit the URL yourself, this just makes it easier.  The script is hosted here:, and the source is available here: