org.apache.avro.mapred.package.html

Run Hadoop MapReduce jobs over
Avro data, with map and reduce functions written in Java.

Avro data files do not contain key/value pairs as expected by
  Hadoop's MapReduce API, but rather just a sequence of values.  This
  package therefore provides a layer on top of Hadoop's MapReduce API
  that adapts it to Avro data.

In all cases, input and output paths are set and jobs are submitted
  as with standard Hadoop jobs:
 

   Specify input files with {@link
   org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.
   Specify an output directory with {@link
   org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.
   Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.
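
The three calls above can be sketched as a small helper. This is only an illustration: the class name JobSubmissionSketch and its methods setPaths and submit are hypothetical, and the JobConf passed in is assumed to be otherwise fully configured.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Hypothetical helper, not part of Avro or Hadoop.
class JobSubmissionSketch {

  // Records the input path and the output directory in the job configuration.
  static void setPaths(JobConf job, String inputPath, String outputDir) {
    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputDir));
  }

  // Sets the paths, then submits the job and waits for it to complete.
  static RunningJob submit(JobConf job, String inputPath, String outputDir)
      throws IOException {
    setPaths(job, inputPath, outputDir);
    return JobClient.runJob(job);
  }
}
```

The output directory should not already exist when the job is submitted; Hadoop's file output formats normally reject an existing directory.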
 


For jobs whose input and output are Avro data files:
 

   Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
   {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's input and output schemas.
   Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.
   Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer, and optionally as its combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.
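
A minimal sketch of this case, written as a word count over an Avro data file of strings. The names AvroWordCountSketch, TokenizeMapper, SumReducer and configure are illustrative, and the choice of a string input schema and a string/long pair output schema is an assumption of the example. Input paths, the output directory and submission are omitted; they are set as with standard Hadoop jobs.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example class; only the org.apache.* types are real API.
class AvroWordCountSketch {

  // Emits a (word, 1) pair for every whitespace-separated token of a line.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Sums the counts for one word; usable as both reducer and combiner.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  // Builds a JobConf with schemas, mapper, combiner and reducer set.
  static JobConf configure() {
    JobConf job = new JobConf();
    job.setJobName("avro-wordcount-sketch");
    Schema string = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);
    AvroJob.setInputSchema(job, string);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(string, count));
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    return job;
  }
}
```

The same AvroReducer subclass can serve as the combiner here because its output pairs have the same schema as the mapper's output.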
 


For jobs whose input is an Avro data file and which use an {@link
  org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro
  {@link org.apache.hadoop.mapred.Reducer} and whose output is a
  non-Avro format:
 

   Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your
   job's input schema.
   Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.
   Implement {@link org.apache.hadoop.mapred.Reducer} and specify
   your job's reducer with {@link
   org.apache.hadoop.mapred.JobConf#setReducerClass}.  The input key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link
   org.apache.avro.mapred.AvroValue}.
   Optionally implement a separate {@link org.apache.hadoop.mapred.Reducer}
   to serve as the combiner and specify it with {@link
   org.apache.hadoop.mapred.JobConf#setCombinerClass}.  You cannot
   reuse your Reducer class as the combiner, because a combiner's
   input and output keys must both be {@link org.apache.avro.mapred.AvroKey}
   and its input and output values must both be {@link
   org.apache.avro.mapred.AvroValue}.
   Specify your job's output key and value types with {@link
   org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link
   org.apache.hadoop.mapred.JobConf#setOutputValueClass}.
   Specify your job's output format with {@link
   org.apache.hadoop.mapred.JobConf#setOutputFormat}.
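
A minimal sketch of this case: an AvroMapper tokenizes Avro strings and a plain Hadoop Reducer writes word counts as text. The names AvroMapperTextOutputSketch, TokenizeMapper, SumToTextReducer and configure are illustrative. The call to AvroJob#setMapOutputSchema is not among the steps listed above; it is included on the assumption that the shuffle needs the schema of the mapper's output pairs when no Avro output schema is set.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical example class; only the org.apache.* types are real API.
class AvroMapperTextOutputSketch {

  // Avro mapper: emits a (word, 1) pair per token.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Non-Avro reducer: input is AvroKey/AvroValue, output is Text/LongWritable.
  public static class SumToTextReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      output.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  static JobConf configure() {
    JobConf job = new JobConf();
    Schema string = Schema.create(Schema.Type.STRING);
    Schema count = Schema.create(Schema.Type.LONG);
    AvroJob.setInputSchema(job, string);
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    // Assumption: declares the schema of the pairs passed through the shuffle.
    AvroJob.setMapOutputSchema(job, Pair.getPairSchema(string, count));
    job.setReducerClass(SumToTextReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
    return job;
  }
}
```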
 


For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer
  is an {@link org.apache.avro.mapred.AvroReducer} and whose output is
  an Avro data file:
 

   Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.
   Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and
   {@link org.apache.avro.mapred.AvroValue}.
   Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer, and optionally as its combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.
   Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.
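
A minimal sketch of this case: a plain Hadoop Mapper reads lines of text and an AvroReducer writes word counts to an Avro data file. The names TextInputAvroOutputSketch, TokenizeMapper, SumReducer and configure are illustrative, and the use of TextInputFormat is an assumption of the example. No combiner is set, and the reducer's output pairs are assumed to share the schema of the mapper's output.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example class; only the org.apache.* types are real API.
class TextInputAvroOutputSketch {

  // Non-Avro mapper: output key and value are wrapped in AvroKey and AvroValue.
  public static class TokenizeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<AvroKey<Utf8>, AvroValue<Long>> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        output.collect(new AvroKey<Utf8>(new Utf8(tokens.nextToken())),
                       new AvroValue<Long>(1L));
      }
    }
  }

  // Avro reducer: sums the counts for one word.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  static JobConf configure() {
    JobConf job = new JobConf();
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(TokenizeMapper.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    return job;
  }
}
```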
 


For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer,
  i.e., a map-only job:
 

   Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.
   Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroWrapper} and
   {@link org.apache.hadoop.io.NullWritable}.
   Call {@link
   org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.
   Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.
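
A minimal sketch of the map-only case: a plain Hadoop Mapper copies each non-empty line of a text file into an Avro data file of strings. The names MapOnlyAvroOutputSketch, LineMapper and configure are illustrative, and the use of TextInputFormat and a string output schema are assumptions of the example.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example class; only the org.apache.* types are real API.
class MapOnlyAvroOutputSketch {

  // Wraps each non-empty line in an AvroWrapper; the value is always null.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<AvroWrapper<Utf8>, NullWritable> output,
                    Reporter reporter) throws IOException {
      String text = line.toString().trim();
      if (!text.isEmpty()) {
        output.collect(new AvroWrapper<Utf8>(new Utf8(text)), NullWritable.get());
      }
    }
  }

  static JobConf configure() {
    JobConf job = new JobConf();
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0); // map-only: mapper output goes straight to the output format
    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));
    return job;
  }
}
```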