
org.apache.avro.mapred.package.html


An org.apache.hadoop.mapred compatible API for using Avro Serialization in Hadoop

Run Hadoop MapReduce jobs over Avro data, with map and reduce functions written in Java.

Avro data files do not contain key/value pairs as expected by Hadoop's MapReduce API, but rather just a sequence of values. Thus we provide here a layer on top of Hadoop's MapReduce API.
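
The mismatch described above can be pictured with a small plain-Java model. It is only a sketch and does not use Avro or Hadoop classes: `ValueSequenceAdapter`, `Wrapper` and `asPairs` are hypothetical names, with `Wrapper` standing in for {@link org.apache.avro.mapred.AvroWrapper} and a null `Void` standing in for {@link org.apache.hadoop.io.NullWritable}. Each value from a key-less sequence becomes the key of a pair whose value carries no information.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ValueSequenceAdapter {

  /** Hypothetical stand-in for AvroWrapper: a box around a single datum. */
  public static final class Wrapper<T> {
    private final T datum;

    public Wrapper(T datum) {
      this.datum = datum;
    }

    public T datum() {
      return datum;
    }
  }

  /**
   * Turns a key-less sequence of values into (wrapped value, null) pairs,
   * the shape a key/value oriented API expects.
   */
  public static <T> List<Map.Entry<Wrapper<T>, Void>> asPairs(List<T> values) {
    List<Map.Entry<Wrapper<T>, Void>> pairs = new ArrayList<>();
    for (T value : values) {
      // The datum travels in the key position; the value slot is a placeholder.
      pairs.add(new SimpleImmutableEntry<Wrapper<T>, Void>(new Wrapper<>(value), null));
    }
    return pairs;
  }

  public static void main(String[] args) {
    for (Map.Entry<Wrapper<String>, Void> pair : asPairs(Arrays.asList("a", "b", "c"))) {
      System.out.println(pair.getKey().datum() + " -> " + pair.getValue());
    }
  }
}
```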

In all cases, input and output paths are set and jobs are submitted as with standard Hadoop jobs:

  • Specify input files with {@link org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.
  • Specify an output directory with {@link org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.
  • Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.

For jobs whose input and output are Avro data files:

  • Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's input and output schemas.
  • Subclass {@link org.apache.avro.mapred.AvroMapper} and specify this as your job's mapper with {@link org.apache.avro.mapred.AvroJob#setMapperClass}.
  • Subclass {@link org.apache.avro.mapred.AvroReducer} and specify this as your job's reducer and perhaps combiner, with {@link org.apache.avro.mapred.AvroJob#setReducerClass} and {@link org.apache.avro.mapred.AvroJob#setCombinerClass}.
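
The steps above might be combined as in the following word-count sketch. It assumes the input files are Avro data files whose values are plain strings; the class names `AvroWordCount`, `TokenMapper` and `SumReducer` and the job name are illustrative only. Because the job has a reducer, the mapper emits {@link org.apache.avro.mapred.Pair} values, and the output schema is a pair schema as well. The sketch needs the Avro mapred and Hadoop libraries on the classpath.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

public class AvroWordCount {

  /** Splits each input string into words and emits a (word, 1) pair per word. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Sums the counts seen for one word; its types also allow use as a combiner. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Builds the job configuration without submitting it. */
  public static JobConf configure(String inputPath, String outputPath) {
    JobConf job = new JobConf();
    job.setJobName("avro-wordcount"); // illustrative name

    // Input values are strings; output values are (string, long) pairs.
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));

    AvroJob.setMapperClass(job, TokenMapper.class);
    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);

    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    return job;
  }

  public static void main(String[] args) throws IOException {
    JobClient.runJob(configure(args[0], args[1]));
  }
}
```

Keeping configuration in its own method, separate from submission, lets the settings be inspected without a running cluster.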

For jobs whose input is an Avro data file and which use an {@link org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro {@link org.apache.hadoop.mapred.Reducer} and whose output is a non-Avro format:

  • Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your job's input schema.
  • Subclass {@link org.apache.avro.mapred.AvroMapper} and specify this as your job's mapper with {@link org.apache.avro.mapred.AvroJob#setMapperClass}.
  • Implement {@link org.apache.hadoop.mapred.Reducer} and specify your job's reducer with {@link org.apache.hadoop.mapred.JobConf#setReducerClass}. The input key and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link org.apache.avro.mapred.AvroValue}.
  • Optionally implement {@link org.apache.hadoop.mapred.Reducer} and specify your job's combiner with {@link org.apache.hadoop.mapred.JobConf#setCombinerClass}. You cannot reuse the Reducer class as the combiner: the combiner's input and output keys must both be {@link org.apache.avro.mapred.AvroKey}, and its input and output values must both be {@link org.apache.avro.mapred.AvroValue}, whereas the reducer's output is non-Avro.
  • Specify your job's output key and value types with {@link org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link org.apache.hadoop.mapred.JobConf#setOutputValueClass}.
  • Specify your job's output format with {@link org.apache.hadoop.mapred.JobConf#setOutputFormat}.
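
A sketch of this mixed configuration follows, again as a word count, this time writing plain text. The class names are illustrative and the input is assumed to be Avro data files of strings. One step here is not in the list above and is an assumption of this sketch: since the job has no Avro output schema, the mapper's pair schema is declared with `AvroJob.setMapOutputSchema` so that the shuffle knows how to serialize keys and values.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AvroToTextWordCount {

  /** Avro mapper: emits a (word, 1) pair for every word in the input string. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Plain Hadoop reducer: reads AvroKey/AvroValue, writes Text/LongWritable. */
  public static class SumReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
        OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      output.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  /** Builds the job configuration without submitting it. */
  public static JobConf configure(String inputPath, String outputPath) {
    JobConf job = new JobConf();
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    // Assumption of this sketch: declare the intermediate pair schema explicitly.
    AvroJob.setMapOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenMapper.class);

    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    return job;
  }

  public static void main(String[] args) throws IOException {
    JobClient.runJob(configure(args[0], args[1]));
  }
}
```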

For jobs whose input is a non-Avro data file and which use a non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer is an {@link org.apache.avro.mapred.AvroReducer} and whose output is an Avro data file:

  • Set your input file format with {@link org.apache.hadoop.mapred.JobConf#setInputFormat}.
  • Implement {@link org.apache.hadoop.mapred.Mapper} and specify your job's mapper with {@link org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link org.apache.avro.mapred.AvroValue}.
  • Subclass {@link org.apache.avro.mapred.AvroReducer} and specify this as your job's reducer and perhaps combiner, with {@link org.apache.avro.mapred.AvroJob#setReducerClass} and {@link org.apache.avro.mapred.AvroJob#setCombinerClass}.
  • Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's output schema.
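
The reverse arrangement might look like the sketch below: a plain Hadoop mapper reads lines of text and an Avro reducer writes (word, count) pairs to an Avro data file. Class names are illustrative, and text input via `TextInputFormat` is an assumption chosen for the example. The output schema is a pair schema, which in this sketch also describes the mapper's key and value.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextToAvroWordCount {

  /** Plain Hadoop mapper: reads text lines, emits AvroKey(word) / AvroValue(1). */
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroKey<Utf8>, AvroValue<Long>> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        output.collect(new AvroKey<Utf8>(new Utf8(tokens.nextToken())),
            new AvroValue<Long>(1L));
      }
    }
  }

  /** Avro reducer: sums the counts for one word and emits a (word, total) pair. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Builds the job configuration without submitting it. */
  public static JobConf configure(String inputPath, String outputPath) {
    JobConf job = new JobConf();
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(TokenMapper.class);

    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));

    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    return job;
  }

  public static void main(String[] args) throws IOException {
    JobClient.runJob(configure(args[0], args[1]));
  }
}
```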

For jobs whose input is a non-Avro data file and which use a non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer, i.e., a map-only job:

  • Set your input file format with {@link org.apache.hadoop.mapred.JobConf#setInputFormat}.
  • Implement {@link org.apache.hadoop.mapred.Mapper} and specify your job's mapper with {@link org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key and value types should be {@link org.apache.avro.mapred.AvroWrapper} and {@link org.apache.hadoop.io.NullWritable}.
  • Call {@link org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.
  • Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's output schema.
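
As an illustration of the map-only case, the sketch below converts lines of text into an Avro data file of strings. The class names and the choice of `TextInputFormat` are assumptions made for the example; each line is wrapped in an {@link org.apache.avro.mapred.AvroWrapper} and paired with the {@link org.apache.hadoop.io.NullWritable} singleton.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextToAvroMapOnly {

  /** Plain Hadoop mapper: wraps each text line as an Avro string datum. */
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroWrapper<Utf8>, NullWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new AvroWrapper<Utf8>(new Utf8(line.toString())), NullWritable.get());
    }
  }

  /** Builds the job configuration without submitting it. */
  public static JobConf configure(String inputPath, String outputPath) {
    JobConf job = new JobConf();
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0); // map-only: mapper output goes straight to the output file

    // Every output datum is a plain Avro string.
    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));

    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    return job;
  }

  public static void main(String[] args) throws IOException {
    JobClient.runJob(configure(args[0], args[1]));
  }
}
```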




