org.apache.avro.mapred.package.html Maven / Gradle / Ivy

An org.apache.hadoop.mapred compatible API for using Avro serialization in Hadoop.





Run Hadoop MapReduce jobs over
Avro data, with map and reduce functions written in Java.

Avro data files do not contain key/value pairs as expected by
  Hadoop's MapReduce API, but rather just a sequence of values.  Thus
  we provide here a layer on top of Hadoop's MapReduce API.

In all cases, input and output paths are set and jobs are submitted
  as with standard Hadoop jobs:
 

   Specify input files with {@link
   org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.
   Specify an output directory with {@link
   org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.
   Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.
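
The three steps above might be wired together as in the following sketch. The class name `SubmitSketch`, the job name, and the use of command-line arguments for the paths are illustrative assumptions; mapper, reducer, and schema configuration is left out here.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver class; only the path setup and submission are shown.
public class SubmitSketch {
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(SubmitSketch.class);
    job.setJobName("avro-job-sketch"); // illustrative name

    // Input files and output directory, exactly as for any Hadoop job.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Mapper, reducer, and schema configuration would go here.

    // Submits the job and blocks until it completes.
    JobClient.runJob(job);
  }
}
```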
 


For jobs whose input and output are Avro data files:
 

   Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
   {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's input and output schemas.
   Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.
   Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer, and perhaps combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.
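
As a sketch of the steps above, the following word-count style job reads Avro strings and writes Avro pairs of word and count. The class names `AvroWordCountSketch`, `TokenMapper`, and `SumReducer` are hypothetical, and the choice of a string input schema and a string/long pair output schema is an assumption made for illustration. Input and output paths and job submission are omitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

public class AvroWordCountSketch {

  // Each input datum is a line of text; emit a (word, 1) pair per token.
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Sum the counts for a word. Input and output types match, so the same
  // class can also serve as the combiner.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter)
        throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  // Applies the Avro-specific configuration to an existing JobConf.
  public static void configure(JobConf job) {
    Schema word = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);

    AvroJob.setInputSchema(job, word);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(word, count));

    AvroJob.setMapperClass(job, TokenMapper.class);
    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);
  }
}
```

In this sketch no separate map output schema is set, on the understanding that the map output then uses the job's output schema.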
 


For jobs whose input is an Avro data file and which use an {@link
  org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro
  {@link org.apache.hadoop.mapred.Reducer} and whose output is a
  non-Avro format:
 

   Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your
   job's input schema.
   Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.
   Implement {@link org.apache.hadoop.mapred.Reducer} and specify
   your job's reducer with {@link
   org.apache.hadoop.mapred.JobConf#setReducerClass}.  The input key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link
   org.apache.avro.mapred.AvroValue}.
   Optionally implement {@link org.apache.hadoop.mapred.Reducer} and
   specify your job's combiner with {@link
   org.apache.hadoop.mapred.JobConf#setCombinerClass}.  You cannot
   reuse the Reducer class as the Combiner, because the Combiner's
   input and output keys must both be {@link org.apache.avro.mapred.AvroKey}
   and its input and output values must both be {@link
   org.apache.avro.mapred.AvroValue}.
   Specify your job's output key and value types with {@link
   org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link
   org.apache.hadoop.mapred.JobConf#setOutputValueClass}.
   Specify your job's output format with {@link
   org.apache.hadoop.mapred.JobConf#setOutputFormat}.
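
A possible shape for such a job is sketched below: an Avro mapper feeds a plain Hadoop reducer that writes text output. The class names are hypothetical, and the Text/LongWritable output types and `TextOutputFormat` are illustrative choices. The sketch also calls `AvroJob.setMapOutputSchema`, which the steps above do not list; that call is an assumption here, added so that the shuffle knows the schema of the mapper's output pairs when no Avro output schema is set. The optional combiner is omitted.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AvroToTextSketch {

  // Avro mapper: emit a (word, 1) pair for each token of the input string.
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Plain Hadoop reducer: its inputs arrive wrapped in AvroKey and AvroValue,
  // and its outputs are ordinary Writables.
  public static class SumReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      out.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  public static void configure(JobConf job) {
    Schema word = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);

    AvroJob.setInputSchema(job, word);
    // Assumption of this sketch: declare the mapper's pair schema explicitly.
    AvroJob.setMapOutputSchema(job, Pair.getPairSchema(word, count));
    AvroJob.setMapperClass(job, TokenMapper.class);

    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
  }
}
```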
 


For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer
  is an {@link org.apache.avro.mapred.AvroReducer} and whose output is
  an Avro data file:
 

   Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.
   Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and
   {@link org.apache.avro.mapred.AvroValue}.
   Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer, and perhaps combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.
   Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.
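
The steps above could look roughly like the following sketch, which reads plain text with a Hadoop mapper and writes Avro pairs from an Avro reducer. The class names are hypothetical, and `TextInputFormat` and the string/long pair output schema are illustrative assumptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextToAvroSketch {

  // Plain Hadoop mapper: reads text lines and wraps its output in
  // AvroKey and AvroValue so the Avro reducer can consume it.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroKey<Utf8>, AvroValue<Long>> out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        out.collect(new AvroKey<Utf8>(new Utf8(tokens.nextToken())),
            new AvroValue<Long>(1L));
      }
    }
  }

  // Avro reducer: sums the counts and emits (word, total) pairs.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter)
        throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineMapper.class);

    AvroJob.setCombinerClass(job, SumReducer.class); // optional
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
  }
}
```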
 


For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer,
  i.e., a map-only job:
 

   Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.
   Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroWrapper} and
   {@link org.apache.hadoop.io.NullWritable}.
   Call {@link
   org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.
   Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.
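
A map-only job along these lines might be configured as in the sketch below, which converts text lines into Avro pairs of byte offset and line. The class names, `TextInputFormat`, and the long/string pair output schema are illustrative assumptions, not requirements of the API.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MapOnlySketch {

  // Plain Hadoop mapper whose output goes straight to the Avro data file:
  // the key is an AvroWrapper around the datum, the value is NullWritable.
  public static class LineToPairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Pair<Long, Utf8>>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroWrapper<Pair<Long, Utf8>>, NullWritable> out,
        Reporter reporter) throws IOException {
      Pair<Long, Utf8> datum =
          new Pair<Long, Utf8>(offset.get(), new Utf8(line.toString()));
      out.collect(new AvroWrapper<Pair<Long, Utf8>>(datum), NullWritable.get());
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineToPairMapper.class);

    // Zero reduce tasks makes this a map-only job.
    job.setNumReduceTasks(0);

    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.LONG), Schema.create(Schema.Type.STRING)));
  }
}
```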