/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/**
* Incremental Hadoop jobs and some supporting classes.
*
* <p>
* Jobs within this package form the core of the incremental framework implementation.
* There are two types of incremental jobs: partition-preserving and
* partition-collapsing.
* </p>
*
* <p>
* A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day.
* This is equivalent to running a MapReduce job for each individual day of input data,
* but much more efficient. It compares the input data against the existing output data and only processes
* input data with no corresponding output.
* </p>
*
* <p>
* A partition-collapsing job consumes input data partitioned by day and produces a single output.
* What distinguishes this job from a standard MapReduce job is that it can reuse the previous output.
* This enables it to process data much more efficiently. Rather than consuming all input data on each
* run, it can consume only the new data since the previous run and merges it with the previous output.
* </p>
*
* <p>
* Partition-preserving and partition-collapsing jobs can be created by extending {@link datafu.hourglass.jobs.AbstractPartitionPreservingIncrementalJob}
* and {@link datafu.hourglass.jobs.AbstractPartitionCollapsingIncrementalJob}, respectively, and implementing the necessary methods.
* Alternatively, the concrete classes {@link datafu.hourglass.jobs.PartitionPreservingIncrementalJob} and
* {@link datafu.hourglass.jobs.PartitionCollapsingIncrementalJob} may be used; with these, the implementations
* are provided through setters.
* </p>
*
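* <p>
* As a rough sketch, a concrete partition-collapsing job might be configured through these setters as
* follows. The class name and paths are illustrative assumptions, and the mapper and accumulator
* variables are defined in the later examples.
* </p>
*
* <pre>{@code
* PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(Example.class);
*
* job.setKeySchema(keySchema);                  // schemas are defined in the next example
* job.setIntermediateValueSchema(valueSchema);
* job.setOutputValueSchema(valueSchema);
*
* job.setInputPaths(Arrays.asList(new Path("/data/event")));  // hypothetical input
* job.setOutputPath(new Path("/output/event_count"));         // hypothetical output
* job.setReusePreviousOutput(true);             // merge new data with the previous output
*
* job.setMapper(mapper);                        // see the final example
* job.setReducerAccumulator(accumulator);       // see the final example
* job.setCombinerAccumulator(accumulator);      // the same accumulator can serve as the combiner
* job.setUseCombiner(true);
*
* job.run();
* }</pre>
*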
* <p>
* Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job,
* these schemas must be defined. A key schema and an intermediate value schema specify the output of the
* mapper and combiner, which output key-value pairs. The key schema and an output value schema specify
* the output of the reducer, which outputs a record having key and value fields.
* </p>
*
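* <p>
* For instance, the schemas for a simple per-member event count might be built with the Avro API as
* sketched below; the record and field names are illustrative.
* </p>
*
* <pre>{@code
* String namespace = "com.example";  // hypothetical namespace
*
* Schema keySchema = Schema.createRecord("Key", null, namespace, false);
* keySchema.setFields(Arrays.asList(
*     new Schema.Field("member_id", Schema.create(Schema.Type.LONG), null, null)));
*
* Schema valueSchema = Schema.createRecord("Value", null, namespace, false);
* valueSchema.setFields(Arrays.asList(
*     new Schema.Field("count", Schema.create(Schema.Type.INT), null, null)));
* }</pre>
*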
* <p>
* An incremental job also requires implementations of map and reduce, and optionally combine. The map
* implementation must implement the {@link datafu.hourglass.model.Mapper} interface, which is very similar
* to the standard map interface in Hadoop. The combine and reduce operations are implemented through the
* {@link datafu.hourglass.model.Accumulator} interface. This is similar to the standard reduce in Hadoop;
* however, values are provided one at a time rather than through an iterable list. An accumulator returns
* either a single value or, by returning null, no value at all; it may not return an arbitrary number of
* values for the output. This restriction precludes certain operations, such as flatten, which do not fit
* well within the incremental programming model.
* </p>
*
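* <p>
* A sketch of what these implementations might look like for the per-member count, using the
* illustrative schemas above (in a real job the schemas would typically be re-parsed from strings,
* since the implementations are serialized):
* </p>
*
* <pre>{@code
* Mapper<GenericRecord,GenericRecord,GenericRecord> mapper =
*   new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
*     public void map(GenericRecord input,
*                     KeyValueCollector<GenericRecord,GenericRecord> collector)
*         throws IOException, InterruptedException {
*       GenericRecord key = new GenericData.Record(keySchema);
*       key.put("member_id", input.get("member_id"));
*       GenericRecord value = new GenericData.Record(valueSchema);
*       value.put("count", 1);
*       collector.collect(key, value);  // emit a key-value pair
*     }
*   };
*
* Accumulator<GenericRecord,GenericRecord> accumulator =
*   new Accumulator<GenericRecord,GenericRecord>() {
*     private long count;
*
*     public void accumulate(GenericRecord value) {
*       count += (Integer)value.get("count");  // values arrive one at a time
*     }
*
*     public GenericRecord getFinal() {
*       GenericRecord output = new GenericData.Record(valueSchema);
*       output.put("count", (int)count);
*       return output;  // returning null here would produce no output for the key
*     }
*
*     public void cleanup() {
*       count = 0L;     // reset state before the next key
*     }
*   };
* }</pre>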
*/
package datafu.hourglass.jobs;