org.apache.hadoop.mapred
A software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.

A Map-Reduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel manner,
followed by the reduce tasks which aggregate their output. Typically both
the input and the output of the job are stored in a
{@link org.apache.hadoop.fs.FileSystem}. The framework takes care of monitoring
tasks and re-executing failed ones. Since the compute nodes and the
storage nodes are usually the same, i.e. Hadoop's Map-Reduce framework and the
Distributed FileSystem run on the same set of nodes, tasks are effectively
scheduled on the nodes where data is already present, resulting in very high
aggregate bandwidth across the cluster.
The Map-Reduce framework operates exclusively on <key, value>
pairs i.e. the input to the job is viewed as a set of <key, value>
pairs and the output as another, possibly different, set of
<key, value> pairs. The keys and values have to
be serializable as {@link org.apache.hadoop.io.Writable}s and additionally the
keys have to be {@link org.apache.hadoop.io.WritableComparable}s in
order to facilitate grouping by the framework.
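Beyond the built-in types such as {@link org.apache.hadoop.io.Text} and
{@link org.apache.hadoop.io.LongWritable}, an application may supply its own key
class. The following is a minimal sketch, not part of this package: the class name
YearMonthKey and its fields are illustrative assumptions. It shows the pieces the
framework relies on: write/readFields for serialization and compareTo for sorting
and grouping of intermediate keys.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type; the fields and their semantics are assumptions
// made only for illustration.
public class YearMonthKey implements WritableComparable<YearMonthKey> {

  private int year;
  private int month;

  public YearMonthKey() {}               // Writables need a no-arg constructor

  public YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  // serialization used when keys are written to and read back by the framework
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(month);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    month = in.readInt();
  }

  // ordering used by the framework to sort and group intermediate keys
  public int compareTo(YearMonthKey other) {
    if (year != other.year) {
      return year < other.year ? -1 : 1;
    }
    if (month != other.month) {
      return month < other.month ? -1 : 1;
    }
    return 0;
  }

  // hashCode() is used by the default partitioner to spread keys over reduces
  public int hashCode() {
    return year * 31 + month;
  }

  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) {
      return false;
    }
    YearMonthKey that = (YearMonthKey) o;
    return year == that.year && month == that.month;
  }
}

Values only need to implement {@link org.apache.hadoop.io.Writable}, i.e.
write and readFields alone.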
Data flow:

  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Applications typically implement
{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
and
{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
methods. The application writer also specifies various facets of the job, such
as the input and output locations and the Partitioner, InputFormat
and OutputFormat implementations to be used, via
a {@link org.apache.hadoop.mapred.JobConf}. The client program,
{@link org.apache.hadoop.mapred.JobClient}, then submits the job to the framework
and optionally monitors it.
The framework spawns one map task per
{@link org.apache.hadoop.mapred.InputSplit} generated by the
{@link org.apache.hadoop.mapred.InputFormat} of the job and calls
{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
with each <key, value> pair read by the
{@link org.apache.hadoop.mapred.RecordReader} from the InputSplit for
the task. The intermediate outputs of the maps are then grouped by key
and optionally aggregated by the combiner. The key space of the intermediate
outputs is partitioned by the {@link org.apache.hadoop.mapred.Partitioner}, where
the number of partitions is exactly the number of reduce tasks for the job.
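By default keys are partitioned by a hash of the key
({@link org.apache.hadoop.mapred.lib.HashPartitioner}); a job can substitute its own
policy. Below is a minimal sketch of a custom Partitioner. The class name
FirstCharPartitioner and the routing rule are illustrative assumptions, not part of
this package; the only requirement is that every <key, value> pair is mapped to a
partition number in the range [0, numPartitions).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner that routes keys by their first character, so that
// keys starting with the same character go to the same reduce task.
public class FirstCharPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Text.charAt returns the Unicode code point at the given position
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

Such a class would be selected for a job via
{@link org.apache.hadoop.mapred.JobConf#setPartitionerClass(Class)}; its type
parameters must match the map output key and value classes.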
The reduce tasks fetch the sorted intermediate outputs of the maps, via HTTP,
merge the <key, value> pairs and call
{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
for each <key, list of values> pair. The output of the reduce tasks is
stored on the FileSystem by the
{@link org.apache.hadoop.mapred.RecordWriter} provided by the
{@link org.apache.hadoop.mapred.OutputFormat} of the job.
Here is an example Map-Reduce application that performs a distributed grep:
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Grep extends Configured implements Tool {

  // map: search each input value for the pattern specified by
  // 'grep.mapper.regex' and emit the matched group selected by
  // 'grep.mapper.regex.group'
  public static class GrepMapper<K>
      extends MapReduceBase implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter)
        throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: count the number of occurrences of each match
  public static class GrepReducer<K> extends MapReduceBase
      implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter)
        throws IOException {
      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // output sum
      output.collect(key, new LongWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    // the same keys that GrepMapper.configure(JobConf) reads
    grepJob.set("grep.mapper.regex", args[2]);
    if (args.length == 4) {
      grepJob.set("grep.mapper.regex.group", args[3]);
    }

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
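A job like this is typically packaged into a jar and launched with the
bin/hadoop jar command, passing the input directory, output directory and
regular expression as arguments. Because the driver goes through
{@link org.apache.hadoop.util.ToolRunner}, generic Hadoop options (for example
-conf or -D key=value) are parsed before the remaining arguments reach run().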
Notice how the data flow of the above grep job is very similar to the
equivalent Unix pipeline:

  cat input/* | grep | sort    | uniq -c > out
      input   | map  | shuffle | reduce  > out
Hadoop Map-Reduce applications need not be written in
Java™ only.
Hadoop Streaming is a utility
which allows users to create and run jobs with any executables (e.g. shell
utilities) as the mapper and/or the reducer.
Hadoop Pipes is a
SWIG-compatible C++ API to implement
Map-Reduce applications (non-JNI™ based).
See Google's original
Map/Reduce paper for background information.
Java and JNI are trademarks or registered trademarks of
Sun Microsystems, Inc. in the United States and other countries.