org.apache.avro.mapred.package.html
Run Hadoop MapReduce jobs over
Avro data, with map and reduce functions written in Java.
Avro data files do not contain key/value pairs as expected by
Hadoop's MapReduce API, but rather just a sequence of values. This
package therefore provides a layer on top of Hadoop's MapReduce API.
In all cases, input and output paths are set and jobs are submitted
as with standard Hadoop jobs:
- Specify input files with {@link
org.apache.hadoop.mapred.FileInputFormat#setInputPaths}
- Specify an output directory with {@link
org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}
- Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}
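As a rough sketch, the three steps shared by every job might look like the following. The class name `SubmitSketch` and the helpers `withPaths` and `run` are hypothetical; the mapper, reducer, and schema setup described in the cases that follow is left out.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver class; only the path and submission calls come from the steps above.
class SubmitSketch {
  /** Sets the input file(s) and the output directory on a job configuration. */
  static JobConf withPaths(JobConf conf, String input, String outputDir) {
    FileInputFormat.setInputPaths(conf, new Path(input));
    FileOutputFormat.setOutputPath(conf, new Path(outputDir));
    return conf;
  }

  /** Submits the job and blocks until it finishes. */
  static void run(JobConf conf, String input, String outputDir) throws IOException {
    JobClient.runJob(withPaths(conf, input, outputDir));
  }
}
```

A caller would first apply one of the Avro-specific configurations below to the `JobConf` and then pass it to `run`.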
For jobs whose input and output are Avro data files:
- Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
{@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's input and output schemas.
- Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper with {@link
org.apache.avro.mapred.AvroJob#setMapperClass}
- Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer and perhaps combiner, with {@link
org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
org.apache.avro.mapred.AvroJob#setCombinerClass}
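The steps above can be sketched with a word count over an Avro file of strings. This is only an illustration: the class names `AvroWordCount`, `TokenMapper`, and `SumReducer` are hypothetical, and the choice of `Pair<Utf8, Long>` as the mapper and reducer output is an assumption made for this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example job: Avro strings in, Avro (word, count) pairs out.
class AvroWordCount {
  /** Emits (word, 1) for every whitespace-separated token of an input string. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Sums the counts of one word; also usable as the combiner. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Applies the schema, mapper, combiner and reducer settings to the job. */
  static JobConf configure(JobConf conf) {
    Schema string = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);
    AvroJob.setInputSchema(conf, string);
    AvroJob.setOutputSchema(conf, Pair.getPairSchema(string, count));
    AvroJob.setMapperClass(conf, TokenMapper.class);
    AvroJob.setCombinerClass(conf, SumReducer.class);
    AvroJob.setReducerClass(conf, SumReducer.class);
    return conf;
  }
}
```

Because the reducer's output has the same shape as the mapper's output here, the same class can serve as both reducer and combiner.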
For jobs whose input is an Avro data file and which use an {@link
org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro
{@link org.apache.hadoop.mapred.Reducer} and whose output is a
non-Avro format:
- Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your
job's input schema.
- Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper with {@link
org.apache.avro.mapred.AvroJob#setMapperClass}
- Implement {@link org.apache.hadoop.mapred.Reducer} and specify
your job's reducer with {@link
org.apache.hadoop.mapred.JobConf#setReducerClass}. The input key
and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link
org.apache.avro.mapred.AvroValue}.
- Optionally implement {@link org.apache.hadoop.mapred.Reducer} and
specify your job's combiner with {@link
org.apache.hadoop.mapred.JobConf#setCombinerClass}. You cannot reuse
the Reducer class as the Combiner, because the Combiner's input and
output keys must both be {@link org.apache.avro.mapred.AvroKey} and its
input and output values must both be {@link org.apache.avro.mapred.AvroValue}.
- Specify your job's output key and value types with {@link
org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link
org.apache.hadoop.mapred.JobConf#setOutputValueClass}.
- Specify your job's output format with {@link
org.apache.hadoop.mapred.JobConf#setOutputFormat}.
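A minimal sketch of this mixed case, again using a word count, might look as follows. The class names are hypothetical, and plain-text output through `TextOutputFormat` is an illustrative choice. The call to `AvroJob.setMapOutputSchema` is not among the steps listed above; it is included on the assumption that, with no Avro output schema set, the shuffle still needs the schema of the mapper's `Pair` output.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical example job: Avro strings in, plain-text "word<TAB>count" lines out.
class AvroToTextJob {
  /** Avro mapper: emits (word, 1) for every token of an input string. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Non-Avro reducer: receives AvroKey/AvroValue, writes Text/LongWritable. */
  public static class TextSumReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
        OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      output.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  /** Applies the Avro input settings and the non-Avro reducer and output settings. */
  static JobConf configure(JobConf conf) {
    Schema string = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);
    AvroJob.setInputSchema(conf, string);
    // Assumption: not listed in the steps above, see the note before this block.
    AvroJob.setMapOutputSchema(conf, Pair.getPairSchema(string, count));
    AvroJob.setMapperClass(conf, TokenMapper.class);
    conf.setReducerClass(TextSumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setOutputFormat(TextOutputFormat.class);
    return conf;
  }
}
```

`TextSumReducer` could not double as a combiner in this sketch, since it emits `Text` and `LongWritable` rather than `AvroKey` and `AvroValue`.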
For jobs whose input is a non-Avro data file and which use a
non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer
is an {@link org.apache.avro.mapred.AvroReducer} and whose output is
an Avro data file:
- Set your input file format with {@link
org.apache.hadoop.mapred.JobConf#setInputFormat}.
- Implement {@link org.apache.hadoop.mapred.Mapper} and specify
your job's mapper with {@link
org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key
and value types should be {@link org.apache.avro.mapred.AvroKey} and
{@link org.apache.avro.mapred.AvroValue}.
- Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer and perhaps combiner, with {@link
org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
org.apache.avro.mapred.AvroJob#setCombinerClass}
- Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's output schema.
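The reverse combination can be sketched the same way: a plain Hadoop mapper reads text lines and an Avro reducer writes an Avro data file. The class names and the use of `TextInputFormat` are assumptions made for this example only.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example job: text lines in, Avro (word, count) pairs out.
class TextToAvroJob {
  /** Non-Avro mapper: wraps each word in AvroKey and each count in AvroValue. */
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroKey<Utf8>, AvroValue<Long>> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        output.collect(new AvroKey<Utf8>(new Utf8(tokens.nextToken())),
            new AvroValue<Long>(1L));
      }
    }
  }

  /** Avro reducer: sums the counts of one word; also used as the combiner. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Applies the non-Avro input settings and the Avro reducer and output settings. */
  static JobConf configure(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(LineMapper.class);
    AvroJob.setCombinerClass(conf, SumReducer.class);
    AvroJob.setReducerClass(conf, SumReducer.class);
    AvroJob.setOutputSchema(conf, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    return conf;
  }
}
```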
For jobs whose input is a non-Avro data file and which use a
non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer,
i.e., a map-only job:
- Set your input file format with {@link
org.apache.hadoop.mapred.JobConf#setInputFormat}.
- Implement {@link org.apache.hadoop.mapred.Mapper} and specify
your job's mapper with {@link
org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key
and value types should be {@link org.apache.avro.mapred.AvroWrapper} and
{@link org.apache.hadoop.io.NullWritable}.
- Call {@link
org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.
- Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's output schema.
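A map-only conversion of a text file into an Avro file of strings could be sketched as below. The class names, the use of `TextInputFormat`, and the string output schema are hypothetical choices for illustration.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example job: every text line becomes one Avro string datum.
class TextToAvroMapOnly {
  /** Non-Avro mapper whose output key is an AvroWrapper and whose value is NullWritable. */
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroWrapper<Utf8>, NullWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new AvroWrapper<Utf8>(new Utf8(line.toString())), NullWritable.get());
    }
  }

  /** Applies the input format, mapper, zero reducers and Avro output schema. */
  static JobConf configure(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(LineMapper.class);
    conf.setNumReduceTasks(0); // map-only: mapper output is written directly
    AvroJob.setOutputSchema(conf, Schema.create(Schema.Type.STRING));
    return conf;
  }
}
```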