/*
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
* in compliance with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under the License
* is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
* or implied. See the License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.cloud.dataflow.sdk.io;
import com.google.cloud.dataflow.sdk.annotations.Experimental;
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.values.PCollection;
import java.io.Serializable;
/**
* A {@code Sink} represents a resource that can be written to using the {@link Write} transform.
*
* A parallel write to a {@code Sink} consists of three phases:
*
* - A sequential initialization phase (e.g., creating a temporary output directory, etc.)
* - A parallel write phase where workers write bundles of records
* - A sequential finalization phase (e.g., committing the writes, merging output files,
*   etc.)
*
* The {@link Write} transform can be used in a Dataflow pipeline to perform this write.
* Specifically, a Write transform can be applied to a {@link PCollection} {@code p} by:
*
* {@code p.apply(Write.to(new MySink()));}
*
* Implementing a {@link Sink} and the corresponding write operations requires extending three
* abstract classes; a minimal sketch of such an implementation follows the list below:
*
* - {@link Sink}: an immutable logical description of the location/resource to write to.
* Depending on the type of sink, it may contain fields such as the path to an output directory
* on a filesystem, a database table name, etc. Implementors of {@link Sink} must
* implement two methods: {@link Sink#validate} and {@link Sink#createWriteOperation}.
* {@link Sink#validate Validate} is called by the Write transform at pipeline creation, and should
* validate that the Sink can be written to. The createWriteOperation method is also called at
* pipeline creation, and should return a WriteOperation object that defines how to write to the
* Sink. Note that implementations of Sink must be serializable and Sinks must be immutable.
*
* - {@link WriteOperation}: The WriteOperation implements the initialization and
* finalization phases of a write. Implementors of {@link WriteOperation} must implement
* corresponding {@link WriteOperation#initialize} and {@link WriteOperation#finalize} methods. A
* WriteOperation must also implement {@link WriteOperation#createWriter} that creates Writers,
* {@link WriteOperation#getWriterResultCoder} that returns a {@link Coder} for the result of a
* parallel write, and a {@link WriteOperation#getSink} that returns the Sink that the write
* operation corresponds to. See below for more information about these methods and restrictions on
* their implementation.
*
* - {@link Writer}: A Writer writes a bundle of records. Writer defines four methods:
* {@link Writer#open}, which is called once at the start of writing a bundle; {@link Writer#write},
* which writes a single record from the bundle; {@link Writer#close}, which is called once at the
* end of writing a bundle; and {@link Writer#getWriteOperation}, which returns the write operation
* that the writer belongs to.
*
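* A minimal sketch of a {@code Sink} and its {@code WriteOperation} is shown below (a matching
* {@code Writer} is sketched in the Bundle Ids section). The names {@code MySink},
* {@code MyWriteOperation}, and {@code MyWriter}, the {@code String} element and writer result
* types, and the output-directory field are hypothetical illustrations only; imports and error
* handling are elided:
*
* <pre>{@code
* class MySink extends Sink<String> {
*   final String outputDir;  // immutable description of the resource to write to
*
*   MySink(String outputDir) { this.outputDir = outputDir; }
*
*   public void validate(PipelineOptions options) {
*     Preconditions.checkArgument(!outputDir.isEmpty(), "outputDir must be set");
*   }
*
*   public Sink.WriteOperation<String, String> createWriteOperation(PipelineOptions options) {
*     return new MyWriteOperation(this);
*   }
* }
*
* class MyWriteOperation extends Sink.WriteOperation<String, String> {
*   final MySink sink;
*
*   MyWriteOperation(MySink sink) { this.sink = sink; }
*
*   public void initialize(PipelineOptions options) throws Exception {
*     // e.g., create a temporary directory under sink.outputDir
*   }
*
*   public void finalize(Iterable<String> writerResults, PipelineOptions options)
*       throws Exception {
*     // e.g., move each temporary file named in writerResults to its final location
*   }
*
*   public Sink.Writer<String, String> createWriter(PipelineOptions options) throws Exception {
*     return new MyWriter(this);  // MyWriter is sketched in the Bundle Ids section below
*   }
*
*   public Sink<String> getSink() { return sink; }
*
*   public Coder<String> getWriterResultCoder() { return StringUtf8Coder.of(); }
* }
* }</pre>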
*
* WriteOperation:
* {@link WriteOperation#initialize} and {@link WriteOperation#finalize} are conceptually called
* once: at the beginning and end of a Write transform. However, implementors must ensure that these
* methods are idempotent, as they may be called multiple times on different machines in the case of
* failure/retry or for redundancy.
*
*
* The finalize method of WriteOperation is passed an Iterable of a writer result type. This
* writer result type should encode the result of a write and, in most cases, some encoding of the
* unique bundle id.
*
*
* All implementations of {@link WriteOperation} must be serializable.
*
*
* A WriteOperation may have mutable state. For instance, {@link WriteOperation#initialize} may
* mutate the object state. These mutations will be visible in {@link WriteOperation#createWriter}
* and {@link WriteOperation#finalize} because the object will be serialized after initialize and
* deserialized before these calls. However, it is not serialized again after createWriter is
* called, as createWriter will be called within workers to create Writers for the bundles that are
* distributed to these workers. Therefore, createWriter should not mutate the WriteOperation state (as
* these mutations will not be visible in finalize).
*
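* Continuing the hypothetical sketch above, the fragment below illustrates this lifecycle (the
* {@code tempDir} field and its value are illustrative only):
*
* <pre>{@code
* // Inside MyWriteOperation:
* private String tempDir;
*
* public void initialize(PipelineOptions options) throws Exception {
*   // A mutation here is visible later: the WriteOperation is serialized after initialize
*   // and deserialized before createWriter and finalize are called.
*   tempDir = sink.outputDir + "/temp-" + UUID.randomUUID();
* }
*
* public Sink.Writer<String, String> createWriter(PipelineOptions options) throws Exception {
*   // State may be read here, but must not be modified: the WriteOperation is not
*   // reserialized after createWriter, so a mutation made here is not visible in finalize.
*   return new MyWriter(this);
* }
* }</pre>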
*
* Bundle Ids:
* In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the
* event of failure/retry or for redundancy). However, exactly one of these executions will have its
* result passed to the WriteOperation's finalize method. Each call to {@link Writer#open} is passed
* a unique bundle id when it is called by the Write transform, so even redundant or retried
* bundles will have a unique way of identifying their output.
*
*
* The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness
* guarantee is important; if a bundle is to be output to a file, for example, the name of the file
* must be unique to avoid conflicts with other Writers. The bundle id should be encoded in the
* writer result returned by the Writer and subsequently used by the WriteOperation's finalize
* method to identify the results of successful writes.
*
*
* For example, consider the scenario where a Writer writes files containing serialized records
* and the WriteOperation's finalization step is to merge or rename these output files. In this
* case, a Writer may use its unique id to name its output file (to avoid conflicts) and return the
* name of the file it wrote as its writer result. The WriteOperation will then receive an Iterable
* of output file names that it can then merge or rename using some bundle naming scheme.
*
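* Continuing the hypothetical sketch above, a {@code MyWriter} that writes each record as a line
* of a temporary text file named by the bundle id, and returns that file's name as its writer
* result, might look like the following (the naming scheme and use of local files are
* illustrative only; imports and error handling are elided):
*
* <pre>{@code
* class MyWriter extends Sink.Writer<String, String> {
*   final MyWriteOperation writeOperation;
*   String tempFileName;
*   PrintWriter out;
*
*   MyWriter(MyWriteOperation writeOperation) { this.writeOperation = writeOperation; }
*
*   public void open(String uId) throws Exception {
*     // The unique bundle id keeps this bundle's output from colliding with retried or
*     // redundant executions of the same bundle.
*     tempFileName = writeOperation.sink.outputDir + "/temp-" + uId;
*     out = new PrintWriter(new FileWriter(tempFileName));
*   }
*
*   public void write(String value) throws Exception {
*     out.println(value);
*   }
*
*   public String close() throws Exception {
*     out.close();
*     return tempFileName;  // the writer result identifies this bundle's output
*   }
*
*   public Sink.WriteOperation<String, String> getWriteOperation() { return writeOperation; }
* }
* }</pre>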
*
* Writer Results:
* {@link WriteOperation}s and {@link Writer}s must agree on a writer result type that will be
* returned by a Writer after it writes a bundle. This type can be a client-defined object or an
* existing type; {@link WriteOperation#getWriterResultCoder} should return a {@link Coder} for the
* type.
*
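* For example, if each bundle's writer result is the name of the file it wrote, the writer result
* type is {@code String} and the SDK's {@code StringUtf8Coder} can be used, as in the sketch
* above:
*
* <pre>{@code
* public Coder<String> getWriterResultCoder() {
*   return StringUtf8Coder.of();
* }
* }</pre>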
*
* A note about thread safety: Any use of static members or methods in Writer should be thread
* safe, as different instances of Writer objects may be created in different threads on the same
* worker.
*
* @param <T> the type that will be written to the Sink.
*/
@Experimental(Experimental.Kind.SOURCE_SINK)
public abstract class Sink<T> implements Serializable {
/**
* Ensures that the sink is valid and can be written to before the write operation begins. One
* should use {@link com.google.common.base.Preconditions} to implement this method.
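*
* For example (a hypothetical sketch; {@code outputDir} is an illustrative field, not part of
* this class):
*
* <pre>{@code
* public void validate(PipelineOptions options) {
*   Preconditions.checkNotNull(outputDir, "outputDir must be set");
*   Preconditions.checkArgument(!outputDir.isEmpty(), "outputDir must not be empty");
* }
* }</pre>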
*/
public abstract void validate(PipelineOptions options);
/**
* Returns an instance of a {@link WriteOperation} that can write to this Sink.
*/
public abstract WriteOperation<T, ?> createWriteOperation(PipelineOptions options);
/**
* A {@link WriteOperation} defines the process of a parallel write of objects to a Sink.
*
* The {@code WriteOperation} defines how to perform initialization and finalization of a
* parallel write to a sink as well as how to create a {@link Sink.Writer} object that can write
* a bundle to the sink.
*
*
* Since operations in Dataflow may be run multiple times for redundancy or fault-tolerance,
* the initialization and finalization defined by a WriteOperation must be idempotent.
*
*
* {@code WriteOperation}s may be mutable; a {@code WriteOperation} is serialized after the
* call to {@code initialize} and deserialized before calls to
* {@code createWriter} and {@code finalize}. However, it is not
* reserialized after {@code createWriter}, so {@code createWriter} should not mutate the
* state of the {@code WriteOperation}.
*
*
* See {@link Sink} for more detailed documentation about the process of writing to a Sink.
*
* @param <T> The type of objects to write
* @param <WriteT> The result of a per-bundle write
*/
public abstract static class WriteOperation<T, WriteT> implements Serializable {
/**
* Performs initialization before writing to the sink. Called before writing begins.
*/
public abstract void initialize(PipelineOptions options) throws Exception;
/**
* Given an Iterable of results from bundle writes, performs finalization after writing and
* closes the sink. Called after all bundle writes are complete.
*
* The results that are passed to finalize are those returned by bundles that completed
* successfully. Although bundles may have been run multiple times (for fault-tolerance), only
* one writer result will be passed to finalize for each bundle. An implementation of finalize
* should perform clean up of any failed and successfully retried bundles. Note that these
* failed bundles will not have their writer result passed to finalize, so finalize should be
* capable of locating any temporary/partial output written by failed bundles.
*
*
* A best practice is to make finalize atomic. If this is impossible given the semantics
* of the sink, finalize should be idempotent, as it may be called multiple times in the case of
* failure/retry or for redundancy.
*
*
* Note that the iteration order of the writer results is not guaranteed to be consistent if
* finalize is called multiple times.
*
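* A hypothetical sketch of an idempotent finalize for a file-based sink follows; the helper
* methods {@code finalOutputExistsFor}, {@code renameToFinalLocation}, and
* {@code deleteLeftoverTemporaryFiles} are illustrative only:
*
* <pre>{@code
* public void finalize(Iterable<String> writerResults, PipelineOptions options)
*     throws Exception {
*   for (String tempFileName : writerResults) {
*     // Skip files that a previous call already moved, so that finalize is safe to retry.
*     if (!finalOutputExistsFor(tempFileName)) {
*       renameToFinalLocation(tempFileName);
*     }
*   }
*   // Clean up temporary output left behind by failed or retried bundles.
*   deleteLeftoverTemporaryFiles();
* }
* }</pre>
*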
* @param writerResults an Iterable of results from successful bundle writes.
*/
public abstract void finalize(Iterable<WriteT> writerResults, PipelineOptions options)
throws Exception;
/**
* Creates a new {@link Sink.Writer} to write a bundle of the input to the sink.
*
* The bundle id that the writer will use to uniquely identify its output will be passed to
* {@link Writer#open}.
*
*
* Must not mutate the state of the WriteOperation.
*/
public abstract Writer<T, WriteT> createWriter(PipelineOptions options) throws Exception;
/**
* Returns the Sink that this write operation writes to.
*/
public abstract Sink<T> getSink();
/**
* Returns a coder for the writer result type.
*/
public Coder<WriteT> getWriterResultCoder() {
return null;
}
}
/**
* A Writer writes a bundle of elements from a PCollection to a sink. {@link Writer#open} is
* called before writing begins and {@link Writer#close} is called after all elements in the
* bundle have been written. {@link Writer#write} writes an element to the sink.
*
* Note that any access to static members or methods of a Writer must be thread-safe, as
* multiple instances of a Writer may be instantiated in different threads on the same worker.
*
*
* See {@link Sink} for more detailed documentation about the process of writing to a Sink.
*
* @param <T> The type of object to write
* @param <WriteT> The writer results type (e.g., the bundle's output filename, as String)
*/
public abstract static class Writer<T, WriteT> {
/**
* Performs bundle initialization. For example, creates a temporary file for writing or
* initializes any state that will be used across calls to {@link Writer#write}.
*
* The unique id that is given to open should be used to ensure that the writer's output does
* not interfere with the output of other Writers, as a bundle may be executed many times for
* fault tolerance. See {@link Sink} for more information about bundle ids.
*/
public abstract void open(String uId) throws Exception;
/**
* Called for each value in the bundle.
*/
public abstract void write(T value) throws Exception;
/**
* Finishes writing the bundle. Closes any resources used for writing the bundle.
*
*
* Returns a writer result that will be used in the {@link Sink.WriteOperation}'s
* finalization. The result should contain some way to identify the output of this bundle (using
* the bundle id). {@link WriteOperation#finalize} will use the writer result to identify
* successful writes. See {@link Sink} for more information about bundle ids.
*
* @return the writer result
*/
public abstract WriteT close() throws Exception;
/**
* Returns the write operation this writer belongs to.
*/
public abstract WriteOperation<T, WriteT> getWriteOperation();
}
}