/*******************************************************************************
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
******************************************************************************/
package com.google.cloud.dataflow.sdk.io.range;
/**
* A {@code RangeTracker} is a thread-safe helper object for implementing dynamic work rebalancing
* in position-based {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader}
* subclasses.
*
* <h3>Usage of the RangeTracker class hierarchy</h3>
* The abstract {@code RangeTracker} interface should not be used per se - all users should use its
* subclasses directly. We declare it here because all subclasses have roughly the same interface
* and the same properties, to centralize the documentation. Currently we provide one
* implementation - {@link OffsetRangeTracker}.
*
* <h3>Position-based sources</h3>
* A position-based source is one where the source can be described by a range of positions of
* an ordered type and the records returned by the reader can be described by positions of the
* same type.
*
* <p>If a record occupies a range of positions in the source, the most important thing about
* the record is the position where it starts.
*
*
* <p>Defining the semantics of positions for a source is entirely up to the source class; however,
* the chosen definitions have to obey certain properties in order to make it possible to correctly
* split the source into parts, including dynamic splitting. Two main aspects need to be defined:
*
* <ul>
*   <li>How to assign starting positions to records.
*   <li>Which records should be read by a source with a range {@code [A, B)}.
* </ul>
*
* <p>Moreover, reading a range must be efficient, i.e., the performance of reading a range
* should not significantly depend on the location of the range. For example, reading the range
* {@code [A, B)} should not require reading all data before {@code A}.
*
* <p>The sections below explain exactly what properties these definitions must satisfy, and
* how to use a {@code RangeTracker} with a properly defined source.
*
*
* <h3>Properties of position-based sources</h3>
* The main requirement for position-based sources is associativity: reading records from
* {@code [A, B)} and records from {@code [B, C)} should give the same records as reading from
* {@code [A, C)}, where {@code A <= B <= C}. This property ensures that no matter how a range
* of positions is split into arbitrarily many sub-ranges, the total set of records described by
* them stays the same.
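The associativity requirement can be spot-checked with a toy position-based source. The `readRange` helper below is hypothetical - it stands in for any source whose records are identified by their start offsets - and simply returns the record starts falling in {@code [a, b)}:

```java
import java.util.ArrayList;
import java.util.List;

public class AssociativityCheck {
  // Hypothetical source: records are identified solely by their start offsets.
  // "Read [a, b)" returns every record whose start position lies in [a, b).
  static List<Long> readRange(long[] recordStarts, long a, long b) {
    List<Long> result = new ArrayList<>();
    for (long start : recordStarts) {
      if (start >= a && start < b) {
        result.add(start);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    long[] starts = {0, 10, 25, 40, 60};
    // Reading [0, 25) and then [25, 70) must yield the same records
    // as reading [0, 70) directly, no matter where the middle position falls.
    List<Long> combined = new ArrayList<>(readRange(starts, 0, 25));
    combined.addAll(readRange(starts, 25, 70));
    if (!combined.equals(readRange(starts, 0, 70))) {
      throw new AssertionError("associativity violated");
    }
    System.out.println(combined); // [0, 10, 25, 40, 60]
  }
}
```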
*
* <p>The other important property is how the source's range relates to positions of records in
* the source. In many sources each record can be identified by a unique starting position.
* In this case:
*
* <ul>
*   <li>All records returned by a source {@code [A, B)} must have starting positions
*       in this range.
*   <li>All but the last record should end within this range. The last record may or may not
*       extend past the end of the range.
*   <li>Records should not overlap.
* </ul>
*
* <p>Such sources should define "read {@code [A, B)}" as "read from the first record starting at or
* after A, up to but not including the first record starting at or after B".
*
* <p>Some examples of such sources include reading lines or CSV from a text file, reading keys and
* values from a BigTable, etc.
*
*
* <p>The concept of split points allows us to extend the definitions to handle sources
* where some records cannot be identified by a unique starting position.
*
* <p>In all cases, all records returned by a source {@code [A, B)} must start at or after
* {@code A}.
*
* <h3>Split points</h3>
*
* Some sources may have records that are not directly addressable. For example, imagine a file
* format consisting of a sequence of compressed blocks. Each block can be assigned an offset, but
* records within the block cannot be directly addressed without decompressing the block. Let us
* refer to this hypothetical format as CBF (Compressed Blocks Format).
*
* <p>Many such formats can still satisfy the associativity property. For example, in CBF, reading
* {@code [A, B)} can mean "read all the records in all blocks whose starting offset is in
* {@code [A, B)}".
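This block-granular reading rule can be sketched concretely. The layout below is an invented stand-in for CBF - parallel arrays of block start offsets and the records inside each block - and `read` returns all records of all blocks whose start offset lies in {@code [a, b)}:

```java
import java.util.ArrayList;
import java.util.List;

public class CbfRangeRead {
  // Hypothetical compressed-block file: records are only reachable by
  // decompressing their block, so only blocks have addressable offsets.
  static final long[] BLOCK_OFFSETS = {0, 50, 120};
  static final String[][] BLOCK_RECORDS = {{"r0", "r1"}, {"r2", "r3"}, {"r4"}};

  // "Read [a, b)" in CBF: all records of all blocks whose start offset
  // lies in [a, b).
  static List<String> read(long a, long b) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < BLOCK_OFFSETS.length; i++) {
      if (BLOCK_OFFSETS[i] >= a && BLOCK_OFFSETS[i] < b) {
        for (String record : BLOCK_RECORDS[i]) {
          out.add(record);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // [0, 120) covers the blocks at offsets 0 and 50, but not the one at 120.
    System.out.println(read(0, 120)); // [r0, r1, r2, r3]
  }
}
```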
*
* <p>To support such complex formats, we introduce the notion of split points. We say that
* a record is a split point if there exists a position {@code A} such that the record is the first
* one to be returned when reading the range {@code [A, infinity)}. In CBF, the only split points
* would be the first records in each block.
*
* <p>Split points allow us to define the meaning of a record's position and a source's range
* in all cases:
*
* <ul>
*   <li>For a record that is at a split point, its position is defined to be the largest
*       {@code A} such that reading a source with the range {@code [A, infinity)} returns this
*       record.
*   <li>Positions of other records are only required to be non-decreasing.
*   <li>Reading the source {@code [A, B)} must return records starting from the first split point
*       at or after {@code A}, up to but not including the first split point at or after
*       {@code B}. In particular, this means that the first record returned by a source MUST
*       always be a split point.
*   <li>Positions of split points must be unique.
* </ul>
*
* <p>As a result, for any decomposition of the full range of the source into position ranges, the
* total set of records will be the full set of records in the source, and each record
* will be read exactly once.
*
* <h3>Consumed positions</h3>
* As the source is being read, and records read from it are being passed to the downstream
* transforms in the pipeline, we say that positions in the source are being consumed.
* When a reader has read a record (or promised to a caller that a record will be returned),
* positions up to and including the record's start position are considered consumed.
*
* <p>Dynamic splitting can happen only at unconsumed positions. If the reader just
* returned a record at offset 42 in a file, dynamic splitting can happen only at offset 43 or
* beyond, as otherwise that record could be read twice (by the current reader and by a reader
* of the task starting at 43).
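The consumed-position rule can be illustrated with a minimal, single-threaded sketch of an offset-based tracker. This is not the SDK's {@code OffsetRangeTracker} (which is thread-safe and handles more cases); it only encodes the rules from the paragraph above:

```java
public class MiniOffsetTracker {
  // Minimal single-threaded sketch of offset-based tracking; the real
  // OffsetRangeTracker in the SDK is thread-safe and richer.
  private long startOffset;
  private long stopOffset;
  private long lastRecordStart = -1; // no record returned yet

  MiniOffsetTracker(long start, long stop) {
    this.startOffset = start;
    this.stopOffset = stop;
  }

  boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart) {
    if (isAtSplitPoint && (recordStart < startOffset || recordStart >= stopOffset)) {
      return false; // split point outside the current range
    }
    lastRecordStart = recordStart; // positions up to here are now consumed
    return true;
  }

  boolean trySplitAtPosition(long splitOffset) {
    // Splitting is refused before any record has been returned, and at any
    // position that has already been consumed.
    if (lastRecordStart == -1 || splitOffset <= lastRecordStart
        || splitOffset >= stopOffset) {
      return false;
    }
    stopOffset = splitOffset; // this tracker keeps the primary range
    return true;
  }

  public static void main(String[] args) {
    MiniOffsetTracker tracker = new MiniOffsetTracker(0, 100);
    tracker.tryReturnRecordAt(true, 42);                // record at offset 42 consumed
    System.out.println(tracker.trySplitAtPosition(42)); // false: already consumed
    System.out.println(tracker.trySplitAtPosition(43)); // true: still unconsumed
  }
}
```

After the successful split at 43, the current tracker's range is {@code [0, 43)}, and a record starting at 50 would be rejected.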
*
*
* <h3>Example</h3>
* The following example uses an {@link OffsetRangeTracker} to support dynamically splitting
* a source with integer positions (offsets).
* <pre>{@code
* class MyReader implements BoundedReader {
*   private MySource currentSource;
*   private final OffsetRangeTracker tracker;
*   ...
*   MyReader(MySource source) {
*     this.currentSource = source;
*     this.tracker = new OffsetRangeTracker(source.getStartOffset(), source.getEndOffset());
*   }
*   ...
*   boolean start() {
*     ... (general logic for locating the first record) ...
*     if (!tracker.tryReturnRecordAt(true, recordStartOffset)) return false;
*     ... (any logic that depends on the record being returned, e.g. counting returned records)
*     return true;
*   }
*
*   boolean advance() {
*     ... (general logic for locating the next record) ...
*     if (!tracker.tryReturnRecordAt(isAtSplitPoint, recordStartOffset)) return false;
*     ... (any logic that depends on the record being returned, e.g. counting returned records)
*     return true;
*   }
*
*   double getFractionConsumed() {
*     return tracker.getFractionConsumed();
*   }
* }
* }</pre>
*
* <h3>Usage with different models of iteration</h3>
* When using this class to protect a
* {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader}, follow the pattern
* described above.
*
* <p>When using this class to protect iteration in the {@code hasNext()/next()}
* model, consider the record consumed when {@code hasNext()} is about to return true, rather than
* when {@code next()} is called, because {@code hasNext()} returning true is promising the caller
* that {@code next()} will have an element to return - so {@link #trySplitAtPosition} must not
* split the range in a way that would make the record promised by {@code hasNext()} belong to
* a different range.
*
* <p>Also note that implementations of {@code hasNext()} need to ensure
* that they call {@link #tryReturnRecordAt} only once even if {@code hasNext()} is called
* repeatedly, due to the requirement on uniqueness of split point positions.
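Under these constraints, a {@code hasNext()/next()} wrapper might look like the sketch below. The {@code Tracker} interface here is a stand-in for illustration, not the SDK type; the key point is the {@code pending} field, which ensures {@code tryReturnRecordAt} is called at most once per record even when {@code hasNext()} is invoked repeatedly:

```java
import java.util.Iterator;
import java.util.List;

public class TrackedIterator {
  // Stand-in for the tracker API; all records here are treated as split points.
  interface Tracker {
    boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart);
  }

  static class Reader implements Iterator<Long> {
    private final Iterator<Long> recordStarts;
    private final Tracker tracker;
    private Long pending; // record already claimed from the tracker

    Reader(List<Long> starts, Tracker tracker) {
      this.recordStarts = starts.iterator();
      this.tracker = tracker;
    }

    @Override
    public boolean hasNext() {
      if (pending != null) {
        return true; // already claimed; must NOT call tryReturnRecordAt again
      }
      if (!recordStarts.hasNext()) {
        return false;
      }
      long start = recordStarts.next();
      if (!tracker.tryReturnRecordAt(true, start)) {
        return false; // tracker refused: the range ends before this record
      }
      pending = start; // the record is now promised to the caller
      return true;
    }

    @Override
    public Long next() {
      if (!hasNext()) {
        throw new java.util.NoSuchElementException();
      }
      Long result = pending;
      pending = null;
      return result;
    }
  }

  public static void main(String[] args) {
    // Toy tracker: accept only records starting before offset 20.
    Tracker tracker = (splitPoint, start) -> start < 20;
    Reader reader = new Reader(List.of(5L, 12L, 27L), tracker);
    while (reader.hasNext()) {
      System.out.println(reader.next()); // prints 5, then 12
    }
  }
}
```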
*
* @param <PositionT> Type of positions used by the source to define ranges and identify records.
*/
public interface RangeTracker<PositionT> {
/**
* Returns the starting position of the current range, inclusive.
*/
PositionT getStartPosition();
/**
* Returns the ending position of the current range, exclusive.
*/
PositionT getStopPosition();
/**
* Atomically determines whether a record at the given position can be returned and updates
* internal state. In particular:
*
* <ul>
*   <li>If {@code isAtSplitPoint} is {@code true}, and {@code recordStart} is outside the
*       current range, returns {@code false};
*   <li>Otherwise, updates the last-consumed position to {@code recordStart} and returns
*       {@code true}.
* </ul>
*
* <p>This method MUST be called on all split point records. It may be called on every record.
*/
boolean tryReturnRecordAt(boolean isAtSplitPoint, PositionT recordStart);
/**
* Atomically splits the current range [{@link #getStartPosition}, {@link #getStopPosition})
* into a "primary" part [{@link #getStartPosition}, {@code splitPosition})
* and a "residual" part [{@code splitPosition}, {@link #getStopPosition}), assuming the current
* last-consumed position is within [{@link #getStartPosition}, {@code splitPosition})
* (i.e., {@code splitPosition} has not been consumed yet).
*
* <p>Updates the current range to be the primary and returns {@code true}. This means that
* all further calls on the current object will interpret their arguments relative to the
* primary range.
*
* <p>If the split position has already been consumed, or if no {@link #tryReturnRecordAt} call
* was made yet, returns {@code false}. The second condition is to prevent dynamic splitting
* during reader start-up.
*/
boolean trySplitAtPosition(PositionT splitPosition);
/**
* Returns the approximate fraction of positions in the source that have been consumed by
* successful {@link #tryReturnRecordAt} calls, or 0.0 if no such calls have happened.
*/
double getFractionConsumed();
}