org.apache.hadoop.mapred
A software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.

A Map-Reduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel manner,
followed by the reduce tasks which aggregate their output. Typically both
the input and the output of the job are stored in a
{@link org.apache.hadoop.fs.FileSystem}. The framework takes care of monitoring
tasks and re-executing failed ones. Since the compute nodes and the
storage nodes are usually the same, i.e. Hadoop's Map-Reduce framework and the
Distributed FileSystem run on the same set of nodes, tasks are effectively
scheduled on the nodes where data is already present, resulting in very high
aggregate bandwidth across the cluster.
The Map-Reduce framework operates exclusively on <key, value>
pairs i.e. the input to the job is viewed as a set of <key, value>
pairs and the output as another, possibly different, set of
<key, value> pairs. The keys and values have to
be serializable as {@link org.apache.hadoop.io.Writable}s and additionally the
keys have to be {@link org.apache.hadoop.io.WritableComparable}s in
order to facilitate grouping by the framework.
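Beyond the built-in types such as {@link org.apache.hadoop.io.Text} and
{@link org.apache.hadoop.io.LongWritable}, an application may supply its own key
class. The following is a minimal sketch, not part of this package: the class name
YearMonthKey and its fields are illustrative assumptions. It shows the pieces the
framework relies on: write/readFields for serialization and compareTo for sorting
and grouping of intermediate keys.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type; the fields and their semantics are assumptions
// made only for illustration.
public class YearMonthKey implements WritableComparable<YearMonthKey> {

  private int year;
  private int month;

  public YearMonthKey() {}               // Writables need a no-arg constructor

  public YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  // serialization used when keys are written to and read back by the framework
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(month);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    month = in.readInt();
  }

  // ordering used by the framework to sort and group intermediate keys
  public int compareTo(YearMonthKey other) {
    if (year != other.year) {
      return year < other.year ? -1 : 1;
    }
    if (month != other.month) {
      return month < other.month ? -1 : 1;
    }
    return 0;
  }

  // hashCode() is used by the default partitioner to spread keys over reduces
  public int hashCode() {
    return year * 31 + month;
  }

  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) {
      return false;
    }
    YearMonthKey that = (YearMonthKey) o;
    return year == that.year && month == that.month;
  }
}

Values only need to implement {@link org.apache.hadoop.io.Writable}, i.e.
write and readFields alone.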
Data flow:

  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Applications typically implement
{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
and
{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
methods. The application writer also specifies various facets of the job, such
as the input and output locations and the Partitioner, InputFormat
and OutputFormat implementations to be used, via
a {@link org.apache.hadoop.mapred.JobConf}. The client program,
{@link org.apache.hadoop.mapred.JobClient}, then submits the job to the framework
and optionally monitors it.
The framework spawns one map task per
{@link org.apache.hadoop.mapred.InputSplit} generated by the
{@link org.apache.hadoop.mapred.InputFormat} of the job and calls
{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
with each <key, value> pair read by the
{@link org.apache.hadoop.mapred.RecordReader} from the InputSplit for
the task. The intermediate outputs of the maps are then grouped by key
and optionally aggregated by the combiner. The key space of the intermediate
outputs is partitioned by the {@link org.apache.hadoop.mapred.Partitioner}, where
the number of partitions is exactly the number of reduce tasks for the job.
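By default keys are partitioned by a hash of the key
({@link org.apache.hadoop.mapred.lib.HashPartitioner}); a job can substitute its own
policy. Below is a minimal sketch of a custom Partitioner. The class name
FirstCharPartitioner and the routing rule are illustrative assumptions, not part of
this package; the only requirement is that every <key, value> pair is mapped to a
partition number in the range [0, numPartitions).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner that routes keys by their first character, so that
// keys starting with the same character go to the same reduce task.
public class FirstCharPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Text.charAt returns the Unicode code point at the given position
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

Such a class would be selected for a job via
{@link org.apache.hadoop.mapred.JobConf#setPartitionerClass(Class)}; its type
parameters must match the map output key and value classes.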
The reduce tasks fetch the sorted intermediate outputs of the maps, via HTTP,
merge the <key, value> pairs and call
{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
for each <key, list of values> pair. The output of the reduce tasks is
stored on the FileSystem by the
{@link org.apache.hadoop.mapred.RecordWriter} provided by the
{@link org.apache.hadoop.mapred.OutputFormat} of the job.
Here is an example Map-Reduce application that performs a distributed grep:
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Grep extends Configured implements Tool {

  // map: search each input value for the pattern specified by
  // 'grep.mapper.regex' and emit the matched group selected by
  // 'grep.mapper.regex.group'
  public static class GrepMapper<K>
      extends MapReduceBase implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter)
        throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: count the number of occurrences of each match
  public static class GrepReducer<K> extends MapReduceBase
      implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter)
        throws IOException {
      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // output sum
      output.collect(key, new LongWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    // the same keys that GrepMapper.configure(JobConf) reads
    grepJob.set("grep.mapper.regex", args[2]);
    if (args.length == 4) {
      grepJob.set("grep.mapper.regex.group", args[3]);
    }

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
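A job like this is typically packaged into a jar and launched with the
bin/hadoop jar command, passing the input directory, output directory and
regular expression as arguments. Because the driver goes through
{@link org.apache.hadoop.util.ToolRunner}, generic Hadoop options (for example
-conf or -D key=value) are parsed before the remaining arguments reach run().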
Notice how the data flow of the above grep job is very similar to the
equivalent Unix pipeline:

  cat input/* | grep | sort    | uniq -c > out
      input   | map  | shuffle | reduce  > out
Hadoop Map-Reduce applications need not be written in
Java™ only.
Hadoop Streaming is a utility
which allows users to create and run jobs with any executables (e.g. shell
utilities) as the mapper and/or the reducer.
Hadoop Pipes is a
SWIG-compatible C++ API to implement
Map-Reduce applications (non-JNI™ based).
See Google's original
Map/Reduce paper for background information.
Java and JNI are trademarks or registered trademarks of
Sun Microsystems, Inc. in the United States and other countries.