org.apache.avro.mapred.package.html

Run Hadoop MapReduce jobs over
Avro data, with map and reduce functions written in Java.

Avro data files do not contain key/value pairs as expected by Hadoop's MapReduce API, but rather just a sequence of values. Thus we provide here a layer on top of Hadoop's MapReduce API.

In all cases, input and output paths are set and jobs are submitted as with standard Hadoop jobs:

  • Specify input files with {@link org.apache.hadoop.mapred.FileInputFormat#setInputPaths}
  • Specify an output directory with {@link org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}
  • Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}
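The steps above can be sketched as follows; the class name, job name, and argument layout are illustrative only:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SubmitExample.class);
    job.setJobName("avro-example");                          // hypothetical job name
    FileInputFormat.setInputPaths(job, new Path(args[0]));   // input files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    JobClient.runJob(job);                                   // submit and block until done
  }
}
```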

For jobs whose input and output are Avro data files:

  • Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's input and output schemas.
  • Subclass {@link org.apache.avro.mapred.AvroMapper} and specify this as your job's mapper with {@link org.apache.avro.mapred.AvroJob#setMapperClass}.
  • Subclass {@link org.apache.avro.mapred.AvroReducer} and specify this as your job's reducer and perhaps combiner, with {@link org.apache.avro.mapred.AvroJob#setReducerClass} and {@link org.apache.avro.mapred.AvroJob#setCombinerClass}.
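As a minimal sketch of this all-Avro case, here is a word-count job: the input is a file of strings, the output a file of (word, count) pairs. Class names and the choice of word count are illustrative, not part of the API:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

public class AvroWordCount {
  // Each input datum is a string; the mapper emits one (word, 1) pair per token.
  public static class MapImpl extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 text, AvroCollector<Pair<Utf8, Long>> collector,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(text.toString());
      while (tokens.hasMoreTokens())
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
    }
  }

  // Sums the counts for each word.
  public static class ReduceImpl extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts)
        sum += count;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void configure(JobConf job) {
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, MapImpl.class);
    AvroJob.setCombinerClass(job, ReduceImpl.class);  // the reducer doubles as combiner
    AvroJob.setReducerClass(job, ReduceImpl.class);
  }
}
```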

For jobs whose input is an Avro data file and which use an {@link org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro {@link org.apache.hadoop.mapred.Reducer} and whose output is a non-Avro format:

  • Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your job's input schema.
  • Subclass {@link org.apache.avro.mapred.AvroMapper} and specify this as your job's mapper with {@link org.apache.avro.mapred.AvroJob#setMapperClass}.
  • Implement {@link org.apache.hadoop.mapred.Reducer} and specify your job's reducer with {@link org.apache.hadoop.mapred.JobConf#setReducerClass}. The input key and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link org.apache.avro.mapred.AvroValue}.
  • Optionally implement {@link org.apache.hadoop.mapred.Reducer} and specify your job's combiner with {@link org.apache.hadoop.mapred.JobConf#setCombinerClass}. You will be unable to re-use the same Reducer class as the Combiner, as the Combiner will need input and output key to be {@link org.apache.avro.mapred.AvroKey}, and input and output value to be {@link org.apache.avro.mapred.AvroValue}.
  • Specify your job's output key and value types with {@link org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link org.apache.hadoop.mapred.JobConf#setOutputValueClass}.
  • Specify your job's output format with {@link org.apache.hadoop.mapred.JobConf#setOutputFormat}.
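A sketch of this mixed case, again in word-count form: a non-Avro reducer consumes the AvroKey/AvroValue pairs from an AvroMapper and writes plain text. The class names are illustrative, and it is assumed here that the intermediate (word, count) schema is declared with {@link org.apache.avro.mapred.AvroJob#setMapOutputSchema}, since no Avro output schema is set:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AvroInTextOut {
  // Non-Avro reducer: input types are AvroKey/AvroValue, output plain Writables.
  public static class ReduceImpl extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
                       OutputCollector<Text, LongWritable> out,
                       Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext())
        sum += counts.next().datum();
      out.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  public static void configure(JobConf job) {
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    // The map output schema for the intermediate (word, count) pairs:
    AvroJob.setMapOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    // An AvroMapper emitting Pair<Utf8, Long> would be set with AvroJob.setMapperClass.
    job.setReducerClass(ReduceImpl.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
  }
}
```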

For jobs whose input is a non-Avro data file and which use a non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer is an {@link org.apache.avro.mapred.AvroReducer} and whose output is an Avro data file:

  • Set your input file format with {@link org.apache.hadoop.mapred.JobConf#setInputFormat}.
  • Implement {@link org.apache.hadoop.mapred.Mapper} and specify your job's mapper with {@link org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link org.apache.avro.mapred.AvroValue}.
  • Subclass {@link org.apache.avro.mapred.AvroReducer} and specify this as your job's reducer and perhaps combiner, with {@link org.apache.avro.mapred.AvroJob#setReducerClass} and {@link org.apache.avro.mapred.AvroJob#setCombinerClass}.
  • Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's output schema.
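A sketch of this reverse direction, mapping plain text lines into an Avro (word, count) output; class names are illustrative:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextInAvroOut {
  // Non-Avro mapper: reads text lines, emits AvroKey/AvroValue pairs.
  public static class MapImpl extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<AvroKey<Utf8>, AvroValue<Long>> out,
                    Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+"))
        if (!word.isEmpty())
          out.collect(new AvroKey<Utf8>(new Utf8(word)), new AvroValue<Long>(1L));
    }
  }

  // Avro reducer: sums counts and writes (word, count) pairs to an Avro file.
  public static class ReduceImpl extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts)
        sum += count;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(MapImpl.class);
    AvroJob.setReducerClass(job, ReduceImpl.class);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
  }
}
```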

For jobs whose input is a non-Avro data file and which use a non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer, i.e., a map-only job whose output is an Avro data file:

  • Set your input file format with {@link org.apache.hadoop.mapred.JobConf#setInputFormat}.
  • Implement {@link org.apache.hadoop.mapred.Mapper} and specify your job's mapper with {@link org.apache.hadoop.mapred.JobConf#setMapperClass}. The output key and value types should be {@link org.apache.avro.mapred.AvroWrapper} and {@link org.apache.hadoop.io.NullWritable}.
  
  • Call {@link org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.
  • Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your job's output schema.
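A sketch of such a map-only job, turning each text line into one Avro string datum; the class names are illustrative:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MapOnlyToAvro {
  // Map-only job: wraps each line in an AvroWrapper; the value is NullWritable.
  public static class MapImpl extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<AvroWrapper<Utf8>, NullWritable> out,
                    Reporter reporter) throws IOException {
      out.collect(new AvroWrapper<Utf8>(new Utf8(line.toString())),
                  NullWritable.get());
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(MapImpl.class);
    job.setNumReduceTasks(0);  // no reduce phase
    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));
  }
}
```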




