/*******************************************************************************
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
******************************************************************************/
package com.google.cloud.dataflow.sdk.io.range;
/**
* A {@code RangeTracker} is a thread-safe helper object for implementing dynamic work rebalancing
* in position-based {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader}
* subclasses.
*
* <h3>Usage of the RangeTracker class hierarchy</h3>
* The abstract {@code RangeTracker} interface should not be used per se - all users should use its
* subclasses directly. We declare it here because all subclasses have roughly the same interface
* and the same properties, to centralize the documentation. Currently we provide one
* implementation - {@link OffsetRangeTracker}.
*
* <h3>Position-based sources</h3>
* A position-based source is one where the source can be described by a range of positions of
* an ordered type and the records returned by the reader can be described by positions of the
* same type.
*
* <p>If a record occupies a range of positions in the source, the most important thing about
* the record is the position where it starts.
*
*
* <p>Defining the semantics of positions for a source is entirely up to the source class; however,
* the chosen definitions have to obey certain properties in order to make it possible to correctly
* split the source into parts, including dynamic splitting. Two main aspects need to be defined:
*
* <ul>
*   <li>How to assign starting positions to records.
*   <li>Which records should be read by a source with a range {@code [A, B)}.
* </ul>
*
* <p>Moreover, reading a range must be efficient, i.e., the performance of reading a range
* should not significantly depend on the location of the range. For example, reading the range
* {@code [A, B)} should not require reading all data before {@code A}.
*
* <p>The sections below explain exactly what properties these definitions must satisfy, and
* how to use a {@code RangeTracker} with a properly defined source.
*
*
* <h3>Properties of position-based sources</h3>
* The main requirement for position-based sources is associativity: reading records from
* {@code [A, B)} and records from {@code [B, C)} should give the same records as reading from
* {@code [A, C)}, where {@code A <= B <= C}. This property ensures that no matter how a range
* of positions is split into arbitrarily many sub-ranges, the total set of records described by
* them stays the same.
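The associativity requirement can be spot-checked with a toy position-based source. The `readRange` helper below is hypothetical - it stands in for any source whose records are identified by their start offsets - and simply returns the record starts falling in {@code [a, b)}:

```java
import java.util.ArrayList;
import java.util.List;

public class AssociativityCheck {
  // Hypothetical source: records are identified solely by their start offsets.
  // "Read [a, b)" returns every record whose start position lies in [a, b).
  static List<Long> readRange(long[] recordStarts, long a, long b) {
    List<Long> result = new ArrayList<>();
    for (long start : recordStarts) {
      if (start >= a && start < b) {
        result.add(start);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    long[] starts = {0, 10, 25, 40, 60};
    // Reading [0, 25) and then [25, 70) must yield the same records
    // as reading [0, 70) directly, no matter where the middle position falls.
    List<Long> combined = new ArrayList<>(readRange(starts, 0, 25));
    combined.addAll(readRange(starts, 25, 70));
    if (!combined.equals(readRange(starts, 0, 70))) {
      throw new AssertionError("associativity violated");
    }
    System.out.println(combined); // [0, 10, 25, 40, 60]
  }
}
```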
*
* <p>The other important property is how the source's range relates to positions of records in
* the source. In many sources each record can be identified by a unique starting position.
* In this case:
*
* <ul>
*   <li>All records returned by a source {@code [A, B)} must have starting positions
*       in this range.
*   <li>All but the last record should end within this range. The last record may or may not
*       extend past the end of the range.
*   <li>Records should not overlap.
* </ul>
*
* <p>Such sources should define "read {@code [A, B)}" as "read from the first record starting at or
* after A, up to but not including the first record starting at or after B".
*
* <p>Some examples of such sources include reading lines or CSV from a text file, reading keys and
* values from a BigTable, etc.
*
*
* <p>The concept of split points allows us to extend the definitions to handle sources
* where some records cannot be identified by a unique starting position.
*
* <p>In all cases, all records returned by a source {@code [A, B)} must start at or after
* {@code A}.
*
* <h3>Split points</h3>
*
* Some sources may have records that are not directly addressable. For example, imagine a file
* format consisting of a sequence of compressed blocks. Each block can be assigned an offset, but
* records within the block cannot be directly addressed without decompressing the block. Let us
* refer to this hypothetical format as CBF (Compressed Blocks Format).
*
* <p>Many such formats can still satisfy the associativity property. For example, in CBF, reading
* {@code [A, B)} can mean "read all the records in all blocks whose starting offset is in
* {@code [A, B)}".
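This block-granular reading rule can be sketched concretely. The layout below is an invented stand-in for CBF - parallel arrays of block start offsets and the records inside each block - and `read` returns all records of all blocks whose start offset lies in {@code [a, b)}:

```java
import java.util.ArrayList;
import java.util.List;

public class CbfRangeRead {
  // Hypothetical compressed-block file: records are only reachable by
  // decompressing their block, so only blocks have addressable offsets.
  static final long[] BLOCK_OFFSETS = {0, 50, 120};
  static final String[][] BLOCK_RECORDS = {{"r0", "r1"}, {"r2", "r3"}, {"r4"}};

  // "Read [a, b)" in CBF: all records of all blocks whose start offset
  // lies in [a, b).
  static List<String> read(long a, long b) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < BLOCK_OFFSETS.length; i++) {
      if (BLOCK_OFFSETS[i] >= a && BLOCK_OFFSETS[i] < b) {
        for (String record : BLOCK_RECORDS[i]) {
          out.add(record);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // [0, 120) covers the blocks at offsets 0 and 50, but not the one at 120.
    System.out.println(read(0, 120)); // [r0, r1, r2, r3]
  }
}
```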
*
* <p>To support such complex formats, we introduce the notion of split points. We say that
* a record is a split point if there exists a position {@code A} such that the record is the first
* one to be returned when reading the range {@code [A, infinity)}. In CBF, the only split points
* would be the first records in each block.
*
* <p>Split points allow us to define the meaning of a record's position and a source's range
* in all cases:
*
* <ul>
*   <li>For a record that is at a split point, its position is defined to be the largest
*       {@code A} such that reading a source with the range {@code [A, infinity)} returns this
*       record.
*   <li>Positions of other records are only required to be non-decreasing.
*   <li>Reading the source {@code [A, B)} must return records starting from the first split point
*       at or after {@code A}, up to but not including the first split point at or after
*       {@code B}. In particular, this means that the first record returned by a source MUST
*       always be a split point.
*   <li>Positions of split points must be unique.
* </ul>
*
* <p>As a result, for any decomposition of the full range of the source into position ranges, the
* total set of records will be the full set of records in the source, and each record
* will be read exactly once.
*
* <h3>Consumed positions</h3>
* As the source is being read, and records read from it are being passed to the downstream
* transforms in the pipeline, we say that positions in the source are being consumed.
* When a reader has read a record (or promised to a caller that a record will be returned),
* positions up to and including the record's start position are considered consumed.
*
* <p>Dynamic splitting can happen only at unconsumed positions. If the reader just
* returned a record at offset 42 in a file, dynamic splitting can happen only at offset 43 or
* beyond, as otherwise that record could be read twice (by the current reader and by a reader
* of the task starting at 43).
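The consumed-position rule can be illustrated with a minimal, single-threaded sketch of an offset-based tracker. This is not the SDK's {@code OffsetRangeTracker} (which is thread-safe and handles more cases); it only encodes the rules from the paragraph above:

```java
public class MiniOffsetTracker {
  // Minimal single-threaded sketch of offset-based tracking; the real
  // OffsetRangeTracker in the SDK is thread-safe and richer.
  private long startOffset;
  private long stopOffset;
  private long lastRecordStart = -1; // no record returned yet

  MiniOffsetTracker(long start, long stop) {
    this.startOffset = start;
    this.stopOffset = stop;
  }

  boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart) {
    if (isAtSplitPoint && (recordStart < startOffset || recordStart >= stopOffset)) {
      return false; // split point outside the current range
    }
    lastRecordStart = recordStart; // positions up to here are now consumed
    return true;
  }

  boolean trySplitAtPosition(long splitOffset) {
    // Splitting is refused before any record has been returned, and at any
    // position that has already been consumed.
    if (lastRecordStart == -1 || splitOffset <= lastRecordStart
        || splitOffset >= stopOffset) {
      return false;
    }
    stopOffset = splitOffset; // this tracker keeps the primary range
    return true;
  }

  public static void main(String[] args) {
    MiniOffsetTracker tracker = new MiniOffsetTracker(0, 100);
    tracker.tryReturnRecordAt(true, 42);                // record at offset 42 consumed
    System.out.println(tracker.trySplitAtPosition(42)); // false: already consumed
    System.out.println(tracker.trySplitAtPosition(43)); // true: still unconsumed
  }
}
```

After the successful split at 43, the current tracker's range is {@code [0, 43)}, and a record starting at 50 would be rejected.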
*
*
* <h3>Example</h3>
* The following example uses an {@link OffsetRangeTracker} to support dynamically splitting
* a source with integer positions (offsets).
* <pre>{@code
* class MyReader implements BoundedReader {
*   private MySource currentSource;
*   private final OffsetRangeTracker tracker;
*   ...
*   MyReader(MySource source) {
*     this.currentSource = source;
*     this.tracker = new OffsetRangeTracker(source.getStartOffset(), source.getEndOffset());
*   }
*   ...
*   boolean start() {
*     ... (general logic for locating the first record) ...
*     if (!tracker.tryReturnRecordAt(true, recordStartOffset)) return false;
*     ... (any logic that depends on the record being returned, e.g. counting returned records)
*     return true;
*   }
*
*   boolean advance() {
*     ... (general logic for locating the next record) ...
*     if (!tracker.tryReturnRecordAt(isAtSplitPoint, recordStartOffset)) return false;
*     ... (any logic that depends on the record being returned, e.g. counting returned records)
*     return true;
*   }
*
*   double getFractionConsumed() {
*     return tracker.getFractionConsumed();
*   }
* }
* }</pre>
*
* <h3>Usage with different models of iteration</h3>
* When using this class to protect a
* {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader}, follow the pattern
* described above.
*
* <p>When using this class to protect iteration in the {@code hasNext()/next()}
* model, consider the record consumed when {@code hasNext()} is about to return true, rather than
* when {@code next()} is called, because {@code hasNext()} returning true is promising the caller
* that {@code next()} will have an element to return - so {@link #trySplitAtPosition} must not
* split the range in a way that would make the record promised by {@code hasNext()} belong to
* a different range.
*
* <p>Also note that implementations of {@code hasNext()} need to ensure
* that they call {@link #tryReturnRecordAt} only once even if {@code hasNext()} is called
* repeatedly, due to the requirement on uniqueness of split point positions.
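Under these constraints, a {@code hasNext()/next()} wrapper might look like the sketch below. The {@code Tracker} interface here is a stand-in for illustration, not the SDK type; the key point is the {@code pending} field, which ensures {@code tryReturnRecordAt} is called at most once per record even when {@code hasNext()} is invoked repeatedly:

```java
import java.util.Iterator;
import java.util.List;

public class TrackedIterator {
  // Stand-in for the tracker API; all records here are treated as split points.
  interface Tracker {
    boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart);
  }

  static class Reader implements Iterator<Long> {
    private final Iterator<Long> recordStarts;
    private final Tracker tracker;
    private Long pending; // record already claimed from the tracker

    Reader(List<Long> starts, Tracker tracker) {
      this.recordStarts = starts.iterator();
      this.tracker = tracker;
    }

    @Override
    public boolean hasNext() {
      if (pending != null) {
        return true; // already claimed; must NOT call tryReturnRecordAt again
      }
      if (!recordStarts.hasNext()) {
        return false;
      }
      long start = recordStarts.next();
      if (!tracker.tryReturnRecordAt(true, start)) {
        return false; // tracker refused: the range ends before this record
      }
      pending = start; // the record is now promised to the caller
      return true;
    }

    @Override
    public Long next() {
      if (!hasNext()) {
        throw new java.util.NoSuchElementException();
      }
      Long result = pending;
      pending = null;
      return result;
    }
  }

  public static void main(String[] args) {
    // Toy tracker: accept only records starting before offset 20.
    Tracker tracker = (splitPoint, start) -> start < 20;
    Reader reader = new Reader(List.of(5L, 12L, 27L), tracker);
    while (reader.hasNext()) {
      System.out.println(reader.next()); // prints 5, then 12
    }
  }
}
```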
*
* @param <PositionT> Type of positions used by the source to define ranges and identify records.
*/
public interface RangeTracker<PositionT> {
/**
* Returns the starting position of the current range, inclusive.
*/
PositionT getStartPosition();
/**
* Returns the ending position of the current range, exclusive.
*/
PositionT getStopPosition();
/**
* Atomically determines whether a record at the given position can be returned and updates
* internal state. In particular:
*
* <ul>
*   <li>If {@code isAtSplitPoint} is {@code true}, and {@code recordStart} is outside the
*       current range, returns {@code false};
*   <li>Otherwise, updates the last-consumed position to {@code recordStart} and returns
*       {@code true}.
* </ul>
*
* <p>This method MUST be called on all split point records. It may be called on every record.
*/
boolean tryReturnRecordAt(boolean isAtSplitPoint, PositionT recordStart);
/**
* Atomically splits the current range [{@link #getStartPosition}, {@link #getStopPosition})
* into a "primary" part [{@link #getStartPosition}, {@code splitPosition})
* and a "residual" part [{@code splitPosition}, {@link #getStopPosition}), assuming the current
* last-consumed position is within [{@link #getStartPosition}, {@code splitPosition})
* (i.e., {@code splitPosition} has not been consumed yet).
*
* <p>Updates the current range to be the primary and returns {@code true}. This means that
* all further calls on the current object will interpret their arguments relative to the
* primary range.
*
* <p>If the split position has already been consumed, or if no {@link #tryReturnRecordAt} call
* was made yet, returns {@code false}. The second condition is to prevent dynamic splitting
* during reader start-up.
*/
boolean trySplitAtPosition(PositionT splitPosition);
/**
* Returns the approximate fraction of positions in the source that have been consumed by
* successful {@link #tryReturnRecordAt} calls, or 0.0 if no such calls have happened.
*/
double getFractionConsumed();
}