
org.apache.spark.sql.streaming.GroupState.scala

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.streaming

import org.apache.spark.annotation.{Evolving, Experimental}
import org.apache.spark.sql.catalyst.plans.logical.LogicalGroupState

/**
 * :: Experimental ::
 *
 * Wrapper class for interacting with per-group state data in `mapGroupsWithState` and
 * `flatMapGroupsWithState` operations on `KeyValueGroupedDataset`.
 *
 * Detailed description of the `[map/flatMap]GroupsWithState` operation
 * --------------------------------------------------------------------
 *
 * Both `mapGroupsWithState` and
 * `flatMapGroupsWithState` in `KeyValueGroupedDataset` will invoke the user-given function on
 * each group (defined by the grouping function in `Dataset.groupByKey()`) while maintaining a
 * user-defined per-group state between invocations. For a static batch Dataset, the function will
 * be invoked once per group. For a streaming Dataset, the function will be invoked for each group
 * repeatedly in every trigger. That is, in every batch of the `StreamingQuery`, the function will
 * be invoked once for each group that has data in the trigger. Furthermore, if timeout is set,
 * then the function will be invoked on timed-out groups (more detail below).
 *
 * The function is invoked with the following parameters.
 *   - The key of the group.
 *   - An iterator containing all the values for this group.
 *   - A user-defined state object set by previous invocations of the given function.
 *
 * In case of a batch Dataset, there is only one invocation and the state object will be empty as
 * there is no prior state. Essentially, for batch Datasets, `[map/flatMap]GroupsWithState` is
 * equivalent to `[map/flatMap]Groups` and any updates to the state and/or timeouts have no
 * effect.
 *
 * The major difference between `mapGroupsWithState` and `flatMapGroupsWithState` is that the
 * former allows the function to return one and only one record, whereas the latter allows the
 * function to return any number of records (including no records). Furthermore, the
 * `flatMapGroupsWithState` is associated with an operation output mode, which can be either
 * `Append` or `Update`. Semantically, this defines whether the output records of one trigger
 * effectively replace the previously output records (from previous triggers) or are appended to
 * the list of previously output records. Essentially, this defines how the Result Table (refer to
 * the semantics in the programming guide) is updated, and allows us to reason about the semantics
 * of later operations.
 *
 * Important points to note about the function (both mapGroupsWithState and
 * flatMapGroupsWithState).
 *   - In a trigger, the function will be called only for the groups present in the batch. Do not
 *     assume that the function will be called in every trigger for every group that has state.
 *   - There is no guaranteed ordering of values in the iterator in the function, neither with
 *     batch, nor with streaming Datasets.
 *   - All the data will be shuffled before applying the function.
 *   - If timeout is set, then the function will also be called with no values. See more details
 *     on `GroupStateTimeout` below.
 *
 * Important points to note about using `GroupState`.
 *   - The value of the state cannot be null. So updating state with null will throw
 *     `IllegalArgumentException`.
 *   - Operations on `GroupState` are not thread-safe. This is to avoid memory barriers.
 *   - If `remove()` is called, then `exists()` will return `false`, `get()` will throw
 *     `NoSuchElementException` and `getOption()` will return `None`.
 *   - After that, if `update(newState)` is called, then `exists()` will again return `true`,
 *     and `get()` and `getOption()` will return the updated value.
 *
 * Important points to note about using `GroupStateTimeout`.
 *   - The timeout type is a global param across all the groups (set as `timeout` param in
 *     `[map|flatMap]GroupsWithState`), but the exact timeout duration/timestamp is configurable
 *     per group by calling `setTimeout...()` in `GroupState`.
 *   - Timeouts can be either based on processing time (i.e.
 *     `GroupStateTimeout.ProcessingTimeTimeout`) or event time (i.e.
 *     `GroupStateTimeout.EventTimeTimeout`).
 *   - With `ProcessingTimeTimeout`, the timeout duration can be set by calling
 *     `GroupState.setTimeoutDuration`. The timeout will occur when the clock has advanced by the
 *     set duration. Guarantees provided by this timeout with a duration of D ms are as follows:
 *     - Timeout will never occur before the clock time has advanced by D ms
 *     - Timeout will occur eventually when there is a trigger in the query (i.e. after D ms). So
 *       there is no strict upper bound on when the timeout would occur. For example, the trigger
 *       interval of the query will affect when the timeout actually occurs. If there is no data
 *       in the stream (for any group) for a while, then there will not be any trigger and timeout
 *       function call will not occur until there is data.
 *     - Since the processing time timeout is based on the clock time, it is affected by the
 *       variations in the system clock (e.g. time zone changes, clock skew).
 *   - With `EventTimeTimeout`, the user also has to specify the event time watermark in the query
 *     using `Dataset.withWatermark()`. With this setting, data that is older than the watermark
 *     is filtered out. The timeout can be set for a group by setting a timeout timestamp
 *     using `GroupState.setTimeoutTimestamp()`, and the timeout would occur when the watermark
 *     advances beyond the set timestamp. You can control the timeout delay by two parameters:
 *     (i) the watermark delay and (ii) an additional duration beyond the timestamp in the event
 *     (which is guaranteed to be newer than the watermark due to the filtering). Guarantees
 *     provided by this timeout are as follows:
 *     - Timeout will never occur before the watermark has exceeded the set timeout.
 *     - Similar to processing time timeouts, there is no strict upper bound on the delay when the
 *       timeout actually occurs. The watermark can advance only when there is data in the stream
 *       and the event time of the data has actually advanced.
 *   - When the timeout occurs for a group, the function is called for that group with no values,
 *     and `GroupState.hasTimedOut()` set to true.
 *   - The timeout is reset every time the function is called on a group, that is, when the group
 *     has new data, or the group has timed out. So the user has to set the timeout duration every
 *     time the function is called, otherwise, there will not be any timeout set.
 *
 * `[map/flatMap]GroupsWithState` can take a user defined initial state as an additional argument.
 * This state will be applied when the first batch of the streaming query is processed. If there
 * are no matching rows in the data for the keys present in the initial state, the state is still
 * applied and the function will be invoked with the values being an empty iterator.
 *
 * Scala example of using GroupState in `mapGroupsWithState`:
 * {{{
 * // A mapping function that maintains an integer state for string keys and returns a string.
 * // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
 * def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {
 *
 *   if (state.hasTimedOut) {                // If called when timing out, remove the state
 *     state.remove()
 *
 *   } else if (state.exists) {              // If state exists, use it for processing
 *     val existingState = state.get         // Get the existing state
 *     val shouldRemove = ...                // Decide whether to remove the state
 *     if (shouldRemove) {
 *       state.remove()                      // Remove the state
 *
 *     } else {
 *       val newState = ...
 *       state.update(newState)              // Set the new state
 *       state.setTimeoutDuration("1 hour")  // Set the timeout
 *     }
 *
 *   } else {
 *     val initialState = ...
 *     state.update(initialState)            // Set the initial state
 *     state.setTimeoutDuration("1 hour")    // Set the timeout
 *   }
 *   ...
 *   // return something
 * }
 *
 * dataset
 *   .groupByKey(...)
 *   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
 * }}}
 *
 * Java example of using `GroupState`:
 * {{{
 * // A mapping function that maintains an integer state for string keys and returns a string.
 * // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
 * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
 *    new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
 *
 *      @Override
 *      public String call(String key, Iterator<Integer> value, GroupState<Integer> state) {
 *        if (state.hasTimedOut()) {            // If called when timing out, remove the state
 *          state.remove();
 *
 *        } else if (state.exists()) {            // If state exists, use it for processing
 *          int existingState = state.get();      // Get the existing state
 *          boolean shouldRemove = ...;           // Decide whether to remove the state
 *          if (shouldRemove) {
 *            state.remove();                     // Remove the state
 *
 *          } else {
 *            int newState = ...;
 *            state.update(newState);             // Set the new state
 *            state.setTimeoutDuration("1 hour"); // Set the timeout
 *          }
 *
 *        } else {
 *          int initialState = ...;
 *          state.update(initialState);           // Set the initial state
 *          state.setTimeoutDuration("1 hour");   // Set the timeout
 *        }
 *        ...
 *         // return something
 *      }
 *    };
 *
 * dataset
 *     .groupByKey(...)
 *     .mapGroupsWithState(
 *         mappingFunction, Encoders.INT(), Encoders.STRING(),
 *         GroupStateTimeout.ProcessingTimeTimeout());
 * }}}
 *
 * @tparam S
 *   User-defined type of the state to be stored for each group. Must be encodable into Spark SQL
 *   types (see `Encoder` for more details).
 * @since 2.2.0
 */
@Experimental
@Evolving
trait GroupState[S] extends LogicalGroupState[S] {

  /** Whether state exists or not. */
  def exists: Boolean

  /** Get the state value if it exists, or throw NoSuchElementException. */
  @throws[NoSuchElementException]("when state does not exist")
  def get: S

  /** Get the state value as a Scala `Option`. */
  def getOption: Option[S]

  /** Update the value of the state. */
  def update(newState: S): Unit

  /** Remove this state. */
  def remove(): Unit

  /**
   * Whether the function has been called because the key has timed out.
   * @note
   *   This can return true only when timeouts are enabled in `[map/flatMap]GroupsWithState`.
   */
  def hasTimedOut: Boolean

  /**
   * Set the timeout duration in ms for this key.
   *
   * @note
   *   [[GroupStateTimeout Processing time timeout]] must be enabled in
   *   `[map/flatMap]GroupsWithState` for calling this method.
   * @note
   *   This method has no effect when used in a batch query.
   */
  @throws[IllegalArgumentException]("if 'durationMs' is not positive")
  @throws[UnsupportedOperationException](
    "if processing time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutDuration(durationMs: Long): Unit

  /**
   * Set the timeout duration for this key as a string. For example, "1 hour", "2 days", etc.
   *
   * @note
   *   [[GroupStateTimeout Processing time timeout]] must be enabled in
   *   `[map/flatMap]GroupsWithState` for calling this method.
   * @note
   *   This method has no effect when used in a batch query.
   */
  @throws[IllegalArgumentException]("if 'duration' is not a valid duration")
  @throws[UnsupportedOperationException](
    "if processing time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutDuration(duration: String): Unit

  /**
   * Set the timeout timestamp for this key as milliseconds in epoch time. This timestamp cannot
   * be older than the current watermark.
   *
   * @note
   *   [[GroupStateTimeout Event time timeout]] must be enabled in `[map/flatMap]GroupsWithState`
   *   for calling this method.
   * @note
   *   This method has no effect when used in a batch query.
   */
  @throws[IllegalArgumentException](
    "if 'timestampMs' is not positive or less than the current watermark in a streaming query")
  @throws[UnsupportedOperationException](
    "if event time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutTimestamp(timestampMs: Long): Unit

  /**
   * Set the timeout timestamp for this key as milliseconds in epoch time and an additional
   * duration as a string (e.g. "1 hour", "2 days", etc.). The final timestamp (including the
   * additional duration) cannot be older than the current watermark.
   *
   * @note
   *   [[GroupStateTimeout Event time timeout]] must be enabled in `[map/flatMap]GroupsWithState`
   *   for calling this method.
   * @note
   *   This method has no side effect when used in a batch query.
   */
  @throws[IllegalArgumentException](
    "if 'additionalDuration' is invalid or the final timeout timestamp is less than " +
      "the current watermark in a streaming query")
  @throws[UnsupportedOperationException](
    "if event time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutTimestamp(timestampMs: Long, additionalDuration: String): Unit

  /**
   * Set the timeout timestamp for this key as a java.sql.Date. This timestamp cannot be older
   * than the current watermark.
   *
   * @note
   *   [[GroupStateTimeout Event time timeout]] must be enabled in `[map/flatMap]GroupsWithState`
   *   for calling this method.
   * @note
   *   This method has no side effect when used in a batch query.
   */
  @throws[UnsupportedOperationException](
    "if event time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutTimestamp(timestamp: java.sql.Date): Unit

  /**
   * Set the timeout timestamp for this key as a java.sql.Date and an additional duration as a
   * string (e.g. "1 hour", "2 days", etc.). The final timestamp (including the additional
   * duration) cannot be older than the current watermark.
   *
   * @note
   *   [[GroupStateTimeout Event time timeout]] must be enabled in `[map/flatMap]GroupsWithState`
   *   for calling this method.
   * @note
   *   This method has no side effect when used in a batch query.
   */
  @throws[IllegalArgumentException]("if 'additionalDuration' is invalid")
  @throws[UnsupportedOperationException](
    "if event time timeout has not been enabled in [map|flatMap]GroupsWithState")
  def setTimeoutTimestamp(timestamp: java.sql.Date, additionalDuration: String): Unit

  /**
   * Get the current event time watermark as milliseconds in epoch time.
   *
   * @note
   *   In a streaming query, this can be called only when watermark is set before calling
   *   `[map/flatMap]GroupsWithState`. In a batch query, this method always returns -1.
   * @note
   *   The watermark gets propagated at the end of each query. As a result, this method will
   *   return 0 (1970-01-01T00:00:00) for the first micro-batch. If you use this value as a part
   *   of the timestamp set in the `setTimeoutTimestamp`, it may lead to the state expiring
   *   immediately in the next micro-batch, once the watermark gets the real value from your data.
   */
  @throws[UnsupportedOperationException](
    "if watermark has not been set before in [map|flatMap]GroupsWithState")
  def getCurrentWatermarkMs(): Long

  /**
   * Get the current processing time as milliseconds in epoch time.
   * @note
   *   In a streaming query, this will return a constant value throughout the duration of a
   *   trigger, even if the trigger is re-executed.
   */
  def getCurrentProcessingTimeMs(): Long
}
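
The state lifecycle and the "never early, possibly late" timeout guarantee described in the Scaladoc above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `GroupStateSketch`, `SketchState`, and `timedOutAt` are hypothetical names introduced here, the clock is passed in explicitly instead of being read from the engine, and none of this is Spark's actual implementation. The strict comparison in `timedOutAt` is likewise an assumption chosen so that a timeout cannot fire before the deadline.

```scala
// Hypothetical model of the documented GroupState contract; NOT Spark's implementation.
object GroupStateSketch {

  /** Stand-in for a per-group state handle. All names here are assumptions. */
  final class SketchState[S] {
    private var value: Option[S] = None
    private var deadlineMs: Option[Long] = None

    /** Whether state currently exists for this group. */
    def exists: Boolean = value.isDefined

    /** Returns the state, or throws NoSuchElementException if it was never set or was removed. */
    def get: S = value.getOrElse(throw new NoSuchElementException("state does not exist"))

    /** Returns the state as an Option; None after remove(). */
    def getOption: Option[S] = value

    /** Sets the state. `require` throws IllegalArgumentException for a null value. */
    def update(newState: S): Unit = {
      require(newState != null, "state cannot be null")
      value = Some(newState)
    }

    /** Clears the state. */
    def remove(): Unit = {
      value = None
    }

    /** Processing-time style timeout: deadline = supplied clock reading + duration. */
    def setTimeoutDuration(nowMs: Long, durationMs: Long): Unit = {
      require(durationMs > 0, "durationMs must be positive")
      deadlineMs = Some(nowMs + durationMs)
    }

    /** The deadline is only checked when a trigger runs, so it can fire late but never early. */
    def timedOutAt(triggerMs: Long): Boolean = deadlineMs.exists(_ < triggerMs)
  }

  def main(args: Array[String]): Unit = {
    val state = new SketchState[Int]
    println(state.exists)            // false
    state.update(5)
    println(state.get)               // 5
    state.remove()
    println(state.getOption)         // None
    state.update(7)
    state.setTimeoutDuration(nowMs = 1000L, durationMs = 500L)
    println(state.timedOutAt(1200L)) // false: the deadline (1500) has not passed
    println(state.timedOutAt(9000L)) // true: a late trigger still observes the timeout
  }
}
```

In a real query the engine supplies the clock (processing time or watermark) and decides when triggers run, which is why the documentation gives no strict upper bound on when a timeout is observed.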



