Showing posts with label hadoop.

Tuesday, May 26, 2015

Guava Hadoop Classpath Issue

Blogging this because it was slightly too large for a tweet.  If you've got a stack trace like this:


java.lang.NoClassDefFoundError: com/google/common/io/LimitInputStream
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:467)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)


You may find this problematic dependency tree in your build:
\---org.apache.hadoop:hadoop-client
    +---org.apache.hadoop:hadoop-common
        +---org.apache.hadoop:hadoop-auth

It seems Google has once again broken compatibility in Guava by removing LimitInputStream in Guava 15.  And while much of Hadoop (except the newer releases, which have upgraded their Guava version) is on an older version of Guava, the hadoop-auth module pulls in a newer version of Guava that most dependency management tools (namely Maven and Gradle) will choose over the older version.  Adding an exclusion for this transitive dependency should resolve the issue.
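For example, with Maven the exclusion plus an explicit pin of an older Guava looks something like this (the ${hadoop.version} property and the Guava 11.0.2 pin are assumptions here; use your actual Hadoop version and whatever Guava version it was built against, as long as it still contains LimitInputStream):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
        <!-- keep the newer transitive Guava off the classpath -->
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- explicitly depend on an older Guava that still contains LimitInputStream -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>11.0.2</version>
</dependency>

Gradle users can do the equivalent with an exclude on the hadoop-client dependency plus a forced Guava version.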

Thursday, October 3, 2013

WebHDFS vs Native Performance

So after I heard about WebHDFS, I became curious about its performance characteristics compared to the native client, particularly after reading this blog entry.  Oddly, I found my results to be dramatically different from Andre's findings.

Here are the results of my experiments:

Size     Native avg % faster
10 MB    -20.0%
100 MB    34.3%
500 MB    48.3%
1 GB      79.4%
10 GB     90.1%

As you can see, the native client generally handily beats WebHDFS, and there seems to be a correlation between the performance gap and the file size.  I haven't had the time yet to look into the technical details of why this might be.  There are some differences between our tests to note:
  • The latency between my client and the server is much lower (about 0.29ms instead of 23ms)
  • My client is in the same data center rather than a remote data center, with 10GbE connecting it to the server
  • I used wget instead of a Python WebHDFS client

It's possible there are network or cluster configuration differences that could contribute as well (including differences in Hadoop versions).  My takeaway from this was that it's better to observe your actual performance before deciding which approach to take.
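For what it's worth, a comparison like this can be reproduced with commands along these lines (these aren't the exact invocations from my tests; the host, port, and path are placeholders, and the NameNode HTTP port that WebHDFS listens on varies by Hadoop version):

# native client read
time hadoop fs -cat /benchmarks/1GB.dat > /dev/null
# WebHDFS read; the NameNode redirects the OPEN request to a DataNode and wget follows the redirect
time wget -q -O /dev/null "http://namenode.example.com:50070/webhdfs/v1/benchmarks/1GB.dat?op=OPEN"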

HDFS NameNode Username Hack

I created a userscript to override the username (when programmatically detectable) to allow you to read HDFS directories and files that aren't world-readable.  Nothing fancy here, you could edit the URL yourself; this just makes it easier.  The script is hosted here: http://userscripts.org/scripts/show/179132, and the source is available here: https://gist.github.com/keeganwitt/6810986.

Monday, July 22, 2013

Hadoop SleepInputFormat

I whipped up a little class to provide dummy input to Hadoop jobs for testing purposes. Hadoop had something like this, but at the time it hadn't been updated for Hadoop 2. My class works with the new API.
https://gist.github.com/keeganwitt/6053872

Edit: They've now updated it.

Thursday, April 25, 2013

Hadoop Writing Bytes

There are times when you might want to write bytes directly to HDFS.  Maybe you're writing binary data.  Maybe you're writing data with varying encodings.  In our case, we were doing both (depending on profile) and were trying to use MultipleOutputs to do so.  We discovered that there was no built-in OutputFormat that supported bytes, nor were there any examples on the web of how to do this with the new API. Granted, it's not overly complicated, but to save you a little time, here's what I came up with.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.DataOutputStream;
import java.io.IOException;


// Writes the raw bytes of each BytesWritable value, ignoring the (NullWritable) key.
// Honors the job's output compression settings, defaulting to gzip when compression is enabled.
public class BytesValueOutputFormat extends FileOutputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordWriter<NullWritable, BytesWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        Configuration conf = taskAttemptContext.getConfiguration();
        boolean isCompressed = getCompressOutput(taskAttemptContext);
        CompressionCodec codec = null;
        String extension = "";
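        // if output compression is enabled for the job, use the configured codec (gzip by default) and its file extension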
        if (isCompressed) {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(taskAttemptContext, GzipCodec.class);
            codec = ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(taskAttemptContext, extension);
        FileSystem fs = file.getFileSystem(conf);
        if (!isCompressed) {
            FSDataOutputStream fileOut = fs.create(file, false);
            return new ByteRecordWriter(fileOut);
        } else {
            FSDataOutputStream fileOut = fs.create(file, false);
            return new ByteRecordWriter(new DataOutputStream(codec.createOutputStream(fileOut)));
        }
    }

    protected static class ByteRecordWriter extends RecordWriter<NullWritable, BytesWritable> {
        private DataOutputStream out;

        public ByteRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        @Override
        public void write(NullWritable key, BytesWritable value) throws IOException {
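            // null values are skipped; everything else is written verbatim with no delimiter or length prefix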
            boolean nullValue = value == null;
            if (!nullValue) {
                out.write(value.getBytes(), 0, value.getLength());
            }
        }

        @Override
        public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
            out.close();
        }
    }

}

Here's an example usage with MultipleOutputs.
Instantiation:
MultipleOutputs<NullWritable, BytesWritable> multipleOutputs = new MultipleOutputs<NullWritable, BytesWritable>(context);
Writing:
byte[] bytesToWrite = someAppLogic();
multipleOutputs.write(NullWritable.get(), new BytesWritable(bytesToWrite), fileName);

And of course, since it's like any other OutputFormat, it can also work with LazyOutputFormat if desired (as well as just about anything else you might choose to do with an OutputFormat).
LazyOutputFormat.setOutputFormatClass(job, BytesValueOutputFormat.class);
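For completeness, here's a rough sketch of the driver-side setup that goes with this (the job name and output directory are placeholders, and it additionally assumes imports for org.apache.hadoop.mapreduce.Job and LazyOutputFormat on top of those shown above):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "write-bytes");  // placeholder job name
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(BytesWritable.class);
// LazyOutputFormat avoids creating empty part files for tasks that never write anything
LazyOutputFormat.setOutputFormatClass(job, BytesValueOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/some/output/dir"));  // placeholder path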

In our case, this was the last step in our sequence of Hadoop jobs, so we had no further need for the key. One could conceive of situations in which further manipulation is needed. In such cases, you could attempt some sort of delimited binary format (to separate the key from the value), but it might be easier to just keep it all as text and use Commons Codec's Base64 to pass the bytes value between jobs.
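With Commons Codec, for example, that round trip looks roughly like this (reusing the bytesToWrite placeholder from above):

// in the job that produces the output, write the bytes as Base64 text
String encoded = org.apache.commons.codec.binary.Base64.encodeBase64String(bytesToWrite);
// in the job that consumes it, decode back to the original bytes
byte[] decoded = org.apache.commons.codec.binary.Base64.decodeBase64(encoded);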

Tuesday, March 12, 2013

Hadoop overwrite options

There's an undocumented feature (Hadoop's documentation needs some serious love) in Hadoop dfs's cp and copyFromLocal commands that allows you to overwrite the destination, just like you can with Unix's cp -f.  I've added HADOOP-9381 with a patch to document this feature in both the help page and the web page.
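For example (the paths here are just placeholders), the flag is accepted even though the usage text doesn't mention it:

hadoop fs -cp -f /user/me/data.txt /backup/data.txt
hadoop fs -copyFromLocal -f data.txt /user/me/data.txt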

While I was looking at this, I realized that the mv and moveFromLocal commands didn't recognize the -f option, even though Unix's mv command does. Since it was simple to add, I created HADOOP-9382 with a patch to address that issue.