com.google.cloud.dataflow.sdk.io.range.RangeTracker

The Google Cloud Dataflow Java SDK provides a simple, Java-based interface for processing virtually any size of data using Google Cloud resources. This artifact includes the entire Dataflow Java SDK.

/*******************************************************************************
 * Copyright (C) 2015 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 ******************************************************************************/

package com.google.cloud.dataflow.sdk.io.range;

A {@code RangeTracker} is a thread-safe helper object for implementing dynamic work rebalancing in position-based {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader} subclasses.

Usage of the RangeTracker class hierarchy

The {@code RangeTracker} interface should not be used per se; all users should use one of its subclasses directly. We declare it here because all subclasses have roughly the same interface and the same properties, which centralizes the documentation. Currently we provide one implementation: {@link OffsetRangeTracker}.

Position-based sources

A position-based source is one where the source can be described by a range of positions of an ordered type, and the records returned by the reader can be described by positions of the same type.

In case a record occupies a range of positions in the source, the most important thing about the record is the position where it starts.

Defining the semantics of positions for a source is entirely up to the source class; however, the chosen definitions have to obey certain properties in order to make it possible to correctly split the source into parts, including dynamic splitting. Two main aspects need to be defined:

  • How to assign starting positions to records.
  • Which records should be read by a source with a range {@code [A, B)}.

Moreover, reading a range must be efficient, i.e., the performance of reading a range should not significantly depend on the location of the range. For example, reading the range {@code [A, B)} should not require reading all data before {@code A}.

The sections below explain exactly what properties these definitions must satisfy, and how to use a {@code RangeTracker} with a properly defined source.

Properties of position-based sources

The main requirement for position-based sources is associativity: reading records from {@code [A, B)} and records from {@code [B, C)} should give the same records as reading from {@code [A, C)}, where {@code A <= B <= C}. This property ensures that no matter how a range of positions is split into arbitrarily many sub-ranges, the total set of records described by them stays the same.
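As a concrete illustration, the associativity property can be checked against a toy source whose records are identified by integer starting offsets. The record offsets and the `read` method below are hypothetical, invented purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class AssociativityDemo {
  // Hypothetical starting offsets of the records in a toy source.
  static final int[] RECORD_STARTS = {3, 7, 12, 20, 25};

  // "Read [a, b)": return the starting offsets of all records that start in [a, b).
  static List<Integer> read(int a, int b) {
    List<Integer> result = new ArrayList<>();
    for (int start : RECORD_STARTS) {
      if (start >= a && start < b) {
        result.add(start);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Associativity: reading [0, 10) then [10, 30) yields the same records as [0, 30).
    List<Integer> combined = new ArrayList<>(read(0, 10));
    combined.addAll(read(10, 30));
    if (!combined.equals(read(0, 30))) {
      throw new AssertionError("associativity violated");
    }
    System.out.println(combined);
  }
}
```

However the full range is decomposed into sub-ranges, concatenating the per-range results reproduces the full record set exactly once.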

The other important property is how the source's range relates to positions of records in the source. In many sources each record can be identified by a unique starting position. In this case:

  • All records returned by a source {@code [A, B)} must have starting positions in this range.
  • All but the last record should end within this range. The last record may or may not extend past the end of the range.
  • Records should not overlap.

Such sources should define "read {@code [A, B)}" as "read from the first record starting at or after A, up to but not including the first record starting at or after B".
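For instance, this reading rule can be sketched for a line-oriented source over an in-memory string standing in for a text file; the data and helper below are illustrative, not SDK code:

```java
import java.util.ArrayList;
import java.util.List;

public class LineRangeDemo {
  // "Read [a, b)": return the lines whose starting offset is in [a, b).
  // Note the last line returned may extend past offset b.
  static List<String> readRange(String data, int a, int b) {
    List<String> lines = new ArrayList<>();
    int offset = 0;
    for (String line : data.split("\n", -1)) {
      if (offset >= a && offset < b) {
        lines.add(line);
      }
      offset += line.length() + 1; // +1 for the newline separator
    }
    return lines;
  }

  public static void main(String[] args) {
    String data = "alpha\nbeta\ngamma"; // lines start at offsets 0, 6, 11
    System.out.println(readRange(data, 0, 6));  // first line only
    System.out.println(readRange(data, 6, 16)); // remaining lines
  }
}
```

Reading [0, 6) and [6, 16) together covers every line exactly once, and neither read needs to scan data outside its own range beyond the final record.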

Some examples of such sources include reading lines or CSV from a text file, reading keys and values from a BigTable, etc.

The concept of split points allows these definitions to be extended to sources where some records cannot be identified by a unique starting position.

In all cases, all records returned by a source {@code [A, B)} must start at or after {@code A}.

Split points


Some sources may have records that are not directly addressable. For example, imagine a file format consisting of a sequence of compressed blocks. Each block can be assigned an offset, but records within the block cannot be directly addressed without decompressing the block. Let us refer to this hypothetical format as CBF (Compressed Blocks Format).

Many such formats can still satisfy the associativity property. For example, in CBF, reading {@code [A, B)} can mean "read all the records in all blocks whose starting offset is in {@code [A, B)}".

To support such complex formats, we introduce the notion of split points. We say that a record is a split point if there exists a position {@code A} such that the record is the first one to be returned when reading the range {@code [A, infinity)}. In CBF, the only split points would be the first records in each block.

Split points allow us to define the meaning of a record's position and a source's range in all cases:

  • For a record that is at a split point, its position is defined to be the largest {@code A} such that reading a source with the range {@code [A, infinity)} returns this record.
  • Positions of other records are only required to be non-decreasing.
  • Reading the source {@code [A, B)} must return records starting from the first split point at or after {@code A}, up to but not including the first split point at or after {@code B}. In particular, this means that the first record returned by a source MUST always be a split point.
  • Positions of split points must be unique.

As a result, for any decomposition of the full range of the source into position ranges, the total set of records will be the full set of records in the source, and each record will be read exactly once.
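A minimal sketch of these rules for the hypothetical CBF format: each block's offset serves as the position of every record in that block, and only a block's first record is a split point. The block layout and reading logic below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPointDemo {
  // Hypothetical CBF layout: block starting offsets and records per block.
  static final int[] BLOCK_OFFSETS = {0, 100, 250};
  static final int[] RECORDS_PER_BLOCK = {2, 3, 1};

  // "Read [a, b)": all records in blocks whose starting offset is in [a, b).
  // Records are reported as "blockOffset/recordIndex"; only index 0 is a split point.
  static List<String> read(int a, int b) {
    List<String> records = new ArrayList<>();
    for (int i = 0; i < BLOCK_OFFSETS.length; i++) {
      if (BLOCK_OFFSETS[i] >= a && BLOCK_OFFSETS[i] < b) {
        for (int r = 0; r < RECORDS_PER_BLOCK[i]; r++) {
          records.add(BLOCK_OFFSETS[i] + "/" + r);
        }
      }
    }
    return records;
  }

  public static void main(String[] args) {
    // Decomposing [0, 300) into [0, 100) and [100, 300) reads each record exactly once.
    List<String> combined = new ArrayList<>(read(0, 100));
    combined.addAll(read(100, 300));
    System.out.println(combined.equals(read(0, 300))); // prints true
  }
}
```

Because every range boundary falls between split points, no block is ever claimed by two sub-ranges, which is exactly the exactly-once guarantee stated above.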

Consumed positions

As the source is being read, and records read from it are being passed to the downstream transforms in the pipeline, we say that positions in the source are being consumed. When a reader has read a record (or promised to a caller that a record will be returned), positions up to and including the record's start position are considered consumed.

Dynamic splitting can happen only at unconsumed positions. If the reader just returned a record at offset 42 in a file, dynamic splitting can happen only at offset 43 or beyond, as otherwise that record could be read twice (by the current reader and by a reader of the task starting at 43).
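The interaction between consumed positions and splitting can be sketched with a simplified, single-threaded stand-in for OffsetRangeTracker. This toy class is not the real SDK implementation; it only mimics the consumed-position rule:

```java
public class ConsumedPositionDemo {
  // A toy offset tracker mimicking the consumed-position rule; not the SDK class.
  static class ToyOffsetTracker {
    private final long start;
    private long stop;
    private long lastRecordStart = -1;
    private boolean started = false;

    ToyOffsetTracker(long start, long stop) {
      this.start = start;
      this.stop = stop;
    }

    boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart) {
      if (isAtSplitPoint && (recordStart < start || recordStart >= stop)) {
        return false; // split point falls outside the current range
      }
      lastRecordStart = recordStart;
      started = true;
      return true;
    }

    boolean trySplitAtPosition(long splitPosition) {
      // Refuse to split before start-up, at a consumed position, or past the range.
      if (!started || splitPosition <= lastRecordStart || splitPosition >= stop) {
        return false;
      }
      stop = splitPosition; // current range becomes the primary [start, splitPosition)
      return true;
    }
  }

  public static void main(String[] args) {
    ToyOffsetTracker tracker = new ToyOffsetTracker(0, 100);
    tracker.tryReturnRecordAt(true, 42);                // record returned at offset 42
    System.out.println(tracker.trySplitAtPosition(40)); // false: 40 is already consumed
    System.out.println(tracker.trySplitAtPosition(50)); // true: 50 is still unconsumed
  }
}
```

After the successful split, the tracker's range is [0, 50), so a later record at offset 60 would be rejected and belongs to the residual range's reader.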

Example

The following example uses an {@link OffsetRangeTracker} to support dynamically splitting a source with integer positions (offsets).

  class MyReader implements BoundedReader {
    private MySource currentSource;
    private final OffsetRangeTracker tracker;
    ...
    MyReader(MySource source) {
      this.currentSource = source;
      this.tracker = new OffsetRangeTracker(source.getStartOffset(), source.getEndOffset());
    }
    ...
    boolean start() {
      ... (general logic for locating the first record) ...
      if (!tracker.tryReturnRecordAt(true, recordStartOffset)) return false;
      ... (any logic that depends on the record being returned, e.g. counting returned records)
      return true;
    }
    boolean advance() {
      ... (general logic for locating the next record) ...
      if (!tracker.tryReturnRecordAt(isAtSplitPoint, recordStartOffset)) return false;
      ... (any logic that depends on the record being returned, e.g. counting returned records)
      return true;
    }

    double getFractionConsumed() {
      return tracker.getFractionConsumed();
    }
  }

Usage with different models of iteration

When using this class to protect a {@link com.google.cloud.dataflow.sdk.io.BoundedSource.BoundedReader}, follow the pattern described above.

When using this class to protect iteration in the {@code hasNext()}/{@code next()} model, consider the record consumed when {@code hasNext()} is about to return true, rather than when {@code next()} is called, because {@code hasNext()} returning true is promising the caller that {@code next()} will have an element to return, so {@link #trySplitAtPosition} must not split the range in a way that would make the record promised by {@code hasNext()} belong to a different range.

Also note that implementations of {@code hasNext()} need to ensure that they call {@link #tryReturnRecordAt} only once even if {@code hasNext()} is called repeatedly, due to the requirement on uniqueness of split point positions.

  /**
   * @param <PositionT> Type of positions used by the source to define ranges and
   *     identify records.
   */
  public interface RangeTracker<PositionT> {
    /** Returns the starting position of the current range, inclusive. */
    PositionT getStartPosition();

    /** Returns the ending position of the current range, exclusive. */
    PositionT getStopPosition();

    /**
     * Atomically determines whether a record at the given position can be returned and
     * updates internal state. In particular:
     *
     * - If {@code isAtSplitPoint} is {@code true}, and {@code recordStart} is outside the
     *   current range, returns {@code false};
     * - Otherwise, updates the last-consumed position to {@code recordStart} and returns
     *   {@code true}.
     *
     * This method MUST be called on all split point records. It may be called on every record.
     */
    boolean tryReturnRecordAt(boolean isAtSplitPoint, PositionT recordStart);

    /**
     * Atomically splits the current range [{@link #getStartPosition}, {@link #getStopPosition})
     * into a "primary" part [{@link #getStartPosition}, {@code splitPosition}) and a "residual"
     * part [{@code splitPosition}, {@link #getStopPosition}), assuming the current
     * last-consumed position is within [{@link #getStartPosition}, {@code splitPosition})
     * (i.e., {@code splitPosition} has not been consumed yet).
     *
     * On success, updates the current range to be the primary and returns {@code true}. This
     * means that all further calls on the current object will interpret their arguments
     * relative to the primary range.
     *
     * If the split position has already been consumed, or if no {@link #tryReturnRecordAt}
     * call was made yet, returns {@code false}. The second condition is to prevent dynamic
     * splitting during reader start-up.
     */
    boolean trySplitAtPosition(PositionT splitPosition);

    /**
     * Returns the approximate fraction of positions in the source that have been consumed by
     * successful {@link #tryReturnRecordAt} calls, or 0.0 if no such calls have happened.
     */
    double getFractionConsumed();
  }
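The hasNext()/next() caveats above can be sketched as follows. The tracker here is a hypothetical stand-in with the same tryReturnRecordAt contract, and the pending-record guard ensures the tracker is consulted only once per record even when hasNext() is called repeatedly:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class TrackedIteratorDemo {
  // Hypothetical minimal tracker; not the SDK implementation.
  static class ToyTracker {
    int calls = 0;
    boolean tryReturnRecordAt(boolean isAtSplitPoint, long recordStart) {
      calls++;
      return true;
    }
  }

  // Wraps a plain iterator of record starting offsets so that each record is
  // "consumed" (reported to the tracker) in hasNext(), not in next(), and only once.
  static class TrackedIterator implements Iterator<Long> {
    private final Iterator<Long> records; // record starting offsets, in order
    private final ToyTracker tracker;
    private Long pending = null;          // record already promised by hasNext()

    TrackedIterator(Iterator<Long> records, ToyTracker tracker) {
      this.records = records;
      this.tracker = tracker;
    }

    @Override
    public boolean hasNext() {
      if (pending != null) {
        return true; // already promised; do not call tryReturnRecordAt again
      }
      if (!records.hasNext()) {
        return false;
      }
      Long candidate = records.next();
      // Consume the position now: hasNext() == true promises next() will return it.
      if (!tracker.tryReturnRecordAt(true, candidate)) {
        return false;
      }
      pending = candidate;
      return true;
    }

    @Override
    public Long next() {
      if (!hasNext()) throw new NoSuchElementException();
      Long result = pending;
      pending = null;
      return result;
    }
  }

  public static void main(String[] args) {
    ToyTracker tracker = new ToyTracker();
    TrackedIterator it =
        new TrackedIterator(java.util.List.of(10L, 20L).iterator(), tracker);
    it.hasNext();
    it.hasNext(); // repeated hasNext() calls...
    it.next();
    System.out.println(tracker.calls); // ...still report the record only once
  }
}
```

Because the position is consumed inside hasNext(), a concurrent trySplitAtPosition can never split the range between the promise and the matching next() call.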




