/*
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
* in compliance with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under the License
* is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
* or implied. See the License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.cloud.dataflow.sdk.io;
import com.google.cloud.dataflow.sdk.annotations.Experimental;
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.values.PCollection;
import java.io.Serializable;
/**
* A {@code Sink} represents a resource that can be written to using the {@link Write} transform.
*
* A parallel write to a {@code Sink} consists of three phases:
*
* - A sequential initialization phase (e.g., creating a temporary output directory, etc.)
* - A parallel write phase where workers write bundles of records
* - A sequential finalization phase (e.g., committing the writes, merging output files,
*   etc.)
*
* The {@link Write} transform can be used in a Dataflow pipeline to perform this write.
* Specifically, a Write transform can be applied to a {@link PCollection} {@code p} by:
*
* {@code p.apply(Write.to(new MySink()));}
*
* Implementing a {@link Sink} and the corresponding write operations requires extending three
* abstract classes; a minimal sketch of such an implementation follows the list below:
*
* - {@link Sink}: an immutable logical description of the location/resource to write to.
* Depending on the type of sink, it may contain fields such as the path to an output directory
* on a filesystem, a database table name, etc. Implementors of {@link Sink} must
* implement two methods: {@link Sink#validate} and {@link Sink#createWriteOperation}.
* {@link Sink#validate Validate} is called by the Write transform at pipeline creation, and should
* validate that the Sink can be written to. The createWriteOperation method is also called at
* pipeline creation, and should return a WriteOperation object that defines how to write to the
* Sink. Note that implementations of Sink must be serializable and Sinks must be immutable.
*
* - {@link WriteOperation}: The WriteOperation implements the initialization and
* finalization phases of a write. Implementors of {@link WriteOperation} must implement
* corresponding {@link WriteOperation#initialize} and {@link WriteOperation#finalize} methods. A
* WriteOperation must also implement {@link WriteOperation#createWriter} that creates Writers,
* {@link WriteOperation#getWriterResultCoder} that returns a {@link Coder} for the result of a
* parallel write, and a {@link WriteOperation#getSink} that returns the Sink that the write
* operation corresponds to. See below for more information about these methods and restrictions on
* their implementation.
*
* - {@link Writer}: A Writer writes a bundle of records. Writer defines four methods:
* {@link Writer#open}, which is called once at the start of writing a bundle; {@link Writer#write},
* which writes a single record from the bundle; {@link Writer#close}, which is called once at the
* end of writing a bundle; and {@link Writer#getWriteOperation}, which returns the write operation
* that the writer belongs to.
*
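* A minimal sketch of a {@code Sink} and its {@code WriteOperation} is shown below (a matching
* {@code Writer} is sketched in the Bundle Ids section). The names {@code MySink},
* {@code MyWriteOperation}, and {@code MyWriter}, the {@code String} element and writer result
* types, and the output-directory field are hypothetical illustrations only; imports and error
* handling are elided:
*
* <pre>{@code
* class MySink extends Sink<String> {
*   final String outputDir;  // immutable description of the resource to write to
*
*   MySink(String outputDir) { this.outputDir = outputDir; }
*
*   public void validate(PipelineOptions options) {
*     Preconditions.checkArgument(!outputDir.isEmpty(), "outputDir must be set");
*   }
*
*   public Sink.WriteOperation<String, String> createWriteOperation(PipelineOptions options) {
*     return new MyWriteOperation(this);
*   }
* }
*
* class MyWriteOperation extends Sink.WriteOperation<String, String> {
*   final MySink sink;
*
*   MyWriteOperation(MySink sink) { this.sink = sink; }
*
*   public void initialize(PipelineOptions options) throws Exception {
*     // e.g., create a temporary directory under sink.outputDir
*   }
*
*   public void finalize(Iterable<String> writerResults, PipelineOptions options)
*       throws Exception {
*     // e.g., move each temporary file named in writerResults to its final location
*   }
*
*   public Sink.Writer<String, String> createWriter(PipelineOptions options) throws Exception {
*     return new MyWriter(this);  // MyWriter is sketched in the Bundle Ids section below
*   }
*
*   public Sink<String> getSink() { return sink; }
*
*   public Coder<String> getWriterResultCoder() { return StringUtf8Coder.of(); }
* }
* }</pre>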
*
* WriteOperation:
* {@link WriteOperation#initialize} and {@link WriteOperation#finalize} are conceptually called
* once: at the beginning and end of a Write transform. However, implementors must ensure that these
* methods are idempotent, as they may be called multiple times on different machines in the case of
* failure/retry or for redundancy.
*
*
* The finalize method of WriteOperation is passed an Iterable of a writer result type. This
* writer result type should encode the result of a write and, in most cases, some encoding of the
* unique bundle id.
*
*
* All implementations of {@link WriteOperation} must be serializable.
*
*
* A WriteOperation may have mutable state. For instance, {@link WriteOperation#initialize} may
* mutate the object state. These mutations will be visible in {@link WriteOperation#createWriter}
* and {@link WriteOperation#finalize} because the object will be serialized after initialize and
* deserialized before these calls. However, it is not serialized again after createWriter is
* called, as createWriter will be called within workers to create Writers for the bundles that are
* distributed to these workers. Therefore, createWriter should not mutate the WriteOperation state (as
* these mutations will not be visible in finalize).
*
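* Continuing the hypothetical sketch above, the fragment below illustrates this lifecycle (the
* {@code tempDir} field and its value are illustrative only):
*
* <pre>{@code
* // Inside MyWriteOperation:
* private String tempDir;
*
* public void initialize(PipelineOptions options) throws Exception {
*   // A mutation here is visible later: the WriteOperation is serialized after initialize
*   // and deserialized before createWriter and finalize are called.
*   tempDir = sink.outputDir + "/temp-" + UUID.randomUUID();
* }
*
* public Sink.Writer<String, String> createWriter(PipelineOptions options) throws Exception {
*   // State may be read here, but must not be modified: the WriteOperation is not
*   // reserialized after createWriter, so a mutation made here is not visible in finalize.
*   return new MyWriter(this);
* }
* }</pre>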
*
* Bundle Ids:
* In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the
* event of failure/retry or for redundancy). However, exactly one of these executions will have its
* result passed to the WriteOperation's finalize method. Each call to {@link Writer#open} is passed
* a unique bundle id when it is called by the Write transform, so even redundant or retried
* bundles will have a unique way of identifying their output.
*
*
* The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness
* guarantee is important; if a bundle is to be output to a file, for example, the name of the file
* must be unique to avoid conflicts with other Writers. The bundle id should be encoded in the
* writer result returned by the Writer and subsequently used by the WriteOperation's finalize
* method to identify the results of successful writes.
*
*
* For example, consider the scenario where a Writer writes files containing serialized records
* and the WriteOperation's finalization step is to merge or rename these output files. In this
* case, a Writer may use its unique id to name its output file (to avoid conflicts) and return the
* name of the file it wrote as its writer result. The WriteOperation will then receive an Iterable
* of output file names that it can then merge or rename using some bundle naming scheme.
*
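* Continuing the hypothetical sketch above, a {@code MyWriter} that writes each record as a line
* of a temporary text file named by the bundle id, and returns that file's name as its writer
* result, might look like the following (the naming scheme and use of local files are
* illustrative only; imports and error handling are elided):
*
* <pre>{@code
* class MyWriter extends Sink.Writer<String, String> {
*   final MyWriteOperation writeOperation;
*   String tempFileName;
*   PrintWriter out;
*
*   MyWriter(MyWriteOperation writeOperation) { this.writeOperation = writeOperation; }
*
*   public void open(String uId) throws Exception {
*     // The unique bundle id keeps this bundle's output from colliding with retried or
*     // redundant executions of the same bundle.
*     tempFileName = writeOperation.sink.outputDir + "/temp-" + uId;
*     out = new PrintWriter(new FileWriter(tempFileName));
*   }
*
*   public void write(String value) throws Exception {
*     out.println(value);
*   }
*
*   public String close() throws Exception {
*     out.close();
*     return tempFileName;  // the writer result identifies this bundle's output
*   }
*
*   public Sink.WriteOperation<String, String> getWriteOperation() { return writeOperation; }
* }
* }</pre>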
*
* Writer Results:
* {@link WriteOperation}s and {@link Writer}s must agree on a writer result type that will be
* returned by a Writer after it writes a bundle. This type can be a client-defined object or an
* existing type; {@link WriteOperation#getWriterResultCoder} should return a {@link Coder} for the
* type.
*
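* For example, if each bundle's writer result is the name of the file it wrote, the writer result
* type is {@code String} and the SDK's {@code StringUtf8Coder} can be used, as in the sketch
* above:
*
* <pre>{@code
* public Coder<String> getWriterResultCoder() {
*   return StringUtf8Coder.of();
* }
* }</pre>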
*
* A note about thread safety: Any use of static members or methods in Writer should be thread
* safe, as different instances of Writer objects may be created in different threads on the same
* worker.
*
* @param <T> the type that will be written to the Sink.
*/
@Experimental(Experimental.Kind.SOURCE_SINK)
public abstract class Sink<T> implements Serializable {
/**
* Ensures that the sink is valid and can be written to before the write operation begins. One
* should use {@link com.google.common.base.Preconditions} to implement this method.
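*
* For example (a hypothetical sketch; {@code outputDir} is an illustrative field, not part of
* this class):
*
* <pre>{@code
* public void validate(PipelineOptions options) {
*   Preconditions.checkNotNull(outputDir, "outputDir must be set");
*   Preconditions.checkArgument(!outputDir.isEmpty(), "outputDir must not be empty");
* }
* }</pre>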
*/
public abstract void validate(PipelineOptions options);
/**
* Returns an instance of a {@link WriteOperation} that can write to this Sink.
*/
public abstract WriteOperation<T, ?> createWriteOperation(PipelineOptions options);
/**
* A {@link WriteOperation} defines the process of a parallel write of objects to a Sink.
*
* The {@code WriteOperation} defines how to perform initialization and finalization of a
* parallel write to a sink as well as how to create a {@link Sink.Writer} object that can write
* a bundle to the sink.
*
*
* Since operations in Dataflow may be run multiple times for redundancy or fault-tolerance,
* the initialization and finalization defined by a WriteOperation must be idempotent.
*
*
* {@code WriteOperation}s may be mutable; a {@code WriteOperation} is serialized after the
* call to {@code initialize} and deserialized before calls to
* {@code createWriter} and {@code finalize}. However, it is not
* reserialized after {@code createWriter}, so {@code createWriter} should not mutate the
* state of the {@code WriteOperation}.
*
*
* See {@link Sink} for more detailed documentation about the process of writing to a Sink.
*
* @param <T> The type of objects to write
* @param <WriteT> The result of a per-bundle write
*/
public abstract static class WriteOperation<T, WriteT> implements Serializable {
/**
* Performs initialization before writing to the sink. Called before writing begins.
*/
public abstract void initialize(PipelineOptions options) throws Exception;
/**
* Given an Iterable of results from bundle writes, performs finalization after writing and
* closes the sink. Called after all bundle writes are complete.
*
* The results that are passed to finalize are those returned by bundles that completed
* successfully. Although bundles may have been run multiple times (for fault-tolerance), only
* one writer result will be passed to finalize for each bundle. An implementation of finalize
* should perform clean up of any failed and successfully retried bundles. Note that these
* failed bundles will not have their writer result passed to finalize, so finalize should be
* capable of locating any temporary/partial output written by failed bundles.
*
*
* A best practice is to make finalize atomic. If this is impossible given the semantics
* of the sink, finalize should be idempotent, as it may be called multiple times in the case of
* failure/retry or for redundancy.
*
*
* Note that the iteration order of the writer results is not guaranteed to be consistent if
* finalize is called multiple times.
*
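* A hypothetical sketch of an idempotent finalize for a file-based sink follows; the helper
* methods {@code finalOutputExistsFor}, {@code renameToFinalLocation}, and
* {@code deleteLeftoverTemporaryFiles} are illustrative only:
*
* <pre>{@code
* public void finalize(Iterable<String> writerResults, PipelineOptions options)
*     throws Exception {
*   for (String tempFileName : writerResults) {
*     // Skip files that a previous call already moved, so that finalize is safe to retry.
*     if (!finalOutputExistsFor(tempFileName)) {
*       renameToFinalLocation(tempFileName);
*     }
*   }
*   // Clean up temporary output left behind by failed or retried bundles.
*   deleteLeftoverTemporaryFiles();
* }
* }</pre>
*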
* @param writerResults an Iterable of results from successful bundle writes.
*/
public abstract void finalize(Iterable<WriteT> writerResults, PipelineOptions options)
throws Exception;
/**
* Creates a new {@link Sink.Writer} to write a bundle of the input to the sink.
*
* The bundle id that the writer will use to uniquely identify its output will be passed to
* {@link Writer#open}.
*
*
* Must not mutate the state of the WriteOperation.
*/
public abstract Writer<T, WriteT> createWriter(PipelineOptions options) throws Exception;
/**
* Returns the Sink that this write operation writes to.
*/
public abstract Sink<T> getSink();
/**
* Returns a coder for the writer result type.
*/
public Coder<WriteT> getWriterResultCoder() {
return null;
}
}
/**
* A Writer writes a bundle of elements from a PCollection to a sink. {@link Writer#open} is
* called before writing begins and {@link Writer#close} is called after all elements in the
* bundle have been written. {@link Writer#write} writes an element to the sink.
*
* Note that any access to static members or methods of a Writer must be thread-safe, as
* multiple instances of a Writer may be instantiated in different threads on the same worker.
*
*
* See {@link Sink} for more detailed documentation about the process of writing to a Sink.
*
* @param <T> The type of object to write
* @param <WriteT> The writer results type (e.g., the bundle's output filename, as String)
*/
public abstract static class Writer<T, WriteT> {
/**
* Performs bundle initialization. For example, creates a temporary file for writing or
* initializes any state that will be used across calls to {@link Writer#write}.
*
* The unique id that is given to open should be used to ensure that the writer's output does
* not interfere with the output of other Writers, as a bundle may be executed many times for
* fault tolerance. See {@link Sink} for more information about bundle ids.
*/
public abstract void open(String uId) throws Exception;
/**
* Called for each value in the bundle.
*/
public abstract void write(T value) throws Exception;
/**
* Finishes writing the bundle. Closes any resources used for writing the bundle.
*
*
* Returns a writer result that will be used in the {@link Sink.WriteOperation}'s
* finalization. The result should contain some way to identify the output of this bundle (using
* the bundle id). {@link WriteOperation#finalize} will use the writer result to identify
* successful writes. See {@link Sink} for more information about bundle ids.
*
* @return the writer result
*/
public abstract WriteT close() throws Exception;
/**
* Returns the write operation this writer belongs to.
*/
public abstract WriteOperation<T, WriteT> getWriteOperation();
}
}