/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/**
* Incremental Hadoop jobs and some supporting classes.
*
* <p>
* Jobs within this package form the core of the incremental framework implementation.
* There are two types of incremental jobs: partition-preserving and
* partition-collapsing.
* </p>
*
* <p>
* A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day.
* This is equivalent to running a MapReduce job for each individual day of input data,
* but much more efficient. It compares the input data against the existing output data and only processes
* input data with no corresponding output.
* </p>
*
* <p>
* A partition-collapsing job consumes input data partitioned by day and produces a single output.
* What distinguishes this job from a standard MapReduce job is that it can reuse the previous output.
* This enables it to process data much more efficiently. Rather than consuming all input data on each
* run, it can consume only the new data since the previous run and merges it with the previous output.
* </p>
*
* <p>
* Partition-preserving and partition-collapsing jobs can be created by extending {@link datafu.hourglass.jobs.AbstractPartitionPreservingIncrementalJob}
* and {@link datafu.hourglass.jobs.AbstractPartitionCollapsingIncrementalJob}, respectively, and implementing the necessary methods.
* Alternatively, the concrete classes {@link datafu.hourglass.jobs.PartitionPreservingIncrementalJob} and
* {@link datafu.hourglass.jobs.PartitionCollapsingIncrementalJob} may be used; with these, the implementations
* are provided through setters.
* </p>
*
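* <p>
* As a rough sketch, a concrete partition-collapsing job might be configured through these setters as
* follows. The class name and paths are illustrative assumptions, and the mapper and accumulator
* variables are defined in the later examples.
* </p>
*
* <pre>{@code
* PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(Example.class);
*
* job.setKeySchema(keySchema);                  // schemas are defined in the next example
* job.setIntermediateValueSchema(valueSchema);
* job.setOutputValueSchema(valueSchema);
*
* job.setInputPaths(Arrays.asList(new Path("/data/event")));  // hypothetical input
* job.setOutputPath(new Path("/output/event_count"));         // hypothetical output
* job.setReusePreviousOutput(true);             // merge new data with the previous output
*
* job.setMapper(mapper);                        // see the final example
* job.setReducerAccumulator(accumulator);       // see the final example
* job.setCombinerAccumulator(accumulator);      // the same accumulator can serve as the combiner
* job.setUseCombiner(true);
*
* job.run();
* }</pre>
*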
* <p>
* Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job,
* these schemas must be defined. A key schema and an intermediate value schema specify the output of the
* mapper and combiner, which output key-value pairs. The key schema and an output value schema specify
* the output of the reducer, which outputs a record having key and value fields.
* </p>
*
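* <p>
* For instance, the schemas for a simple per-member event count might be built with the Avro API as
* sketched below; the record and field names are illustrative.
* </p>
*
* <pre>{@code
* String namespace = "com.example";  // hypothetical namespace
*
* Schema keySchema = Schema.createRecord("Key", null, namespace, false);
* keySchema.setFields(Arrays.asList(
*     new Schema.Field("member_id", Schema.create(Schema.Type.LONG), null, null)));
*
* Schema valueSchema = Schema.createRecord("Value", null, namespace, false);
* valueSchema.setFields(Arrays.asList(
*     new Schema.Field("count", Schema.create(Schema.Type.INT), null, null)));
* }</pre>
*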
* <p>
* An incremental job also requires implementations of map and reduce, and optionally combine. The map
* implementation must implement the {@link datafu.hourglass.model.Mapper} interface, which is very similar
* to the standard map interface in Hadoop. The combine and reduce operations are implemented through the
* {@link datafu.hourglass.model.Accumulator} interface. This is similar to the standard reduce in Hadoop;
* however, values are provided one at a time rather than through an iterable list. An accumulator returns
* either a single value or, by returning null, no value at all; it may not return an arbitrary number of
* values for the output. This restriction precludes certain operations, such as flatten, which do not fit
* well within the incremental programming model.
* </p>
*
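* <p>
* A sketch of what these implementations might look like for the per-member count, using the
* illustrative schemas above (in a real job the schemas would typically be re-parsed from strings,
* since the implementations are serialized):
* </p>
*
* <pre>{@code
* Mapper<GenericRecord,GenericRecord,GenericRecord> mapper =
*   new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
*     public void map(GenericRecord input,
*                     KeyValueCollector<GenericRecord,GenericRecord> collector)
*         throws IOException, InterruptedException {
*       GenericRecord key = new GenericData.Record(keySchema);
*       key.put("member_id", input.get("member_id"));
*       GenericRecord value = new GenericData.Record(valueSchema);
*       value.put("count", 1);
*       collector.collect(key, value);  // emit a key-value pair
*     }
*   };
*
* Accumulator<GenericRecord,GenericRecord> accumulator =
*   new Accumulator<GenericRecord,GenericRecord>() {
*     private long count;
*
*     public void accumulate(GenericRecord value) {
*       count += (Integer)value.get("count");  // values arrive one at a time
*     }
*
*     public GenericRecord getFinal() {
*       GenericRecord output = new GenericData.Record(valueSchema);
*       output.put("count", (int)count);
*       return output;  // returning null here would produce no output for the key
*     }
*
*     public void cleanup() {
*       count = 0L;     // reset state before the next key
*     }
*   };
* }</pre>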
*/
package datafu.hourglass.jobs;