Here are the results of my experiments:
| Size   | Native Avg % Faster |
|--------|---------------------|
| 10 MB  | -20.0%              |
| 100 MB | 34.3%               |
| 500 MB | 48.3%               |
| 1 GB   | 79.4%               |
| 10 GB  | 90.1%               |
As you can see, the native client handily beats WebHDFS in most cases, and the gap widens as the file size grows. I haven't yet had time to dig into the technical details of why. There are some differences between our tests worth noting:
- The latency between my client and the server is much lower (about 0.29ms instead of 23ms)
- My client is in the same data center as the server, connected over 10GbE, rather than in a remote data center
- I used wget instead of a Python WebHDFS client
It's possible that network or cluster configuration differences contribute as well (including differences in Hadoop versions). My takeaway from this was that it's better to measure your actual performance in your own environment before deciding which approach to take.
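
If you want to run a similar comparison on your own cluster, here's a rough sketch of the kind of timing harness I mean. It's not exactly what I ran (I used wget directly against the WebHDFS endpoint), and the hostname, port, and file path below are placeholders; port 50070 is the Hadoop 2.x NameNode HTTP default (Hadoop 3.x uses 9870). It also assumes the `hdfs` CLI is on your PATH and no Kerberos authentication is involved:

```python
# Rough sketch of a timing harness comparing WebHDFS vs. the native client.
# NAMENODE, PORT, and HDFS_PATH are placeholders; adjust for your cluster.
import subprocess
import time
import urllib.request

NAMENODE = "namenode.example.com"   # placeholder host
PORT = 50070                        # NameNode HTTP port (9870 on Hadoop 3.x)
HDFS_PATH = "/benchmarks/1gb.bin"   # placeholder test file

def time_webhdfs():
    """Stream the file over WebHDFS (op=OPEN), discarding the bytes."""
    url = f"http://{NAMENODE}:{PORT}/webhdfs/v1{HDFS_PATH}?op=OPEN"
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:   # follows the 307 to a datanode
        while resp.read(1 << 20):               # read in 1 MiB chunks
            pass
    return time.monotonic() - start

def time_native():
    """Read the same file through the native client CLI."""
    start = time.monotonic()
    subprocess.run(["hdfs", "dfs", "-cat", HDFS_PATH],
                   stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    w, n = time_webhdfs(), time_native()
    # One way to express "% faster": relative reduction in wall-clock time.
    print(f"WebHDFS {w:.1f}s, native {n:.1f}s, "
          f"native {100 * (w - n) / w:.1f}% faster")
```

Run it several times and average the results: single runs are noisy, and the client's OS page cache can skew repeated reads of the same file.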