/*
 * Copyright (c) 2008-2024, Hazelcast, Inc. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * The Pipeline API is Jet's high-level API to build and execute
 * distributed computation jobs. It models the computation using an analogy
 * with a system of interconnected water pipes. The data flows from the
 * pipeline's sources to its sinks. Pipes can bifurcate and merge, but
 * there can't be any closed loops (cycles).
 * 
 * The basic element is a pipeline stage which can be attached to
 * one or more other stages, both in the upstream and the downstream
 * direction. A stage accepts the data coming from its upstream stages,
 * transforms it, and directs the resulting data to its downstream stages.
 *
 * Kinds of transformation performed by pipeline stages
 *
 * Basic
 *
 * Basic transformations have a single upstream stage and statelessly
 * transform individual items in it. Examples are {@code map}, {@code
 * filter}, and {@code flatMap}.
 *
 * Grouping and aggregation
 *
 * The {@code aggregate*()} transformations perform an aggregate operation
 * on a set of items. You can call {@code stage.groupingKey()} to group the
 * items by a key, and Jet will then aggregate each group separately. For
 * stream stages you must specify a {@code stage.window()}, which
 * transforms the infinite stream into a series of finite windows. If you
 * specify more than one input stage for the aggregation (using {@code
 * stage.aggregate2()}, {@code stage.aggregate3()} or {@code
 * stage.aggregateBuilder()}), the data from all streams will be combined
 * into the aggregation result. The {@link
 * com.hazelcast.jet.aggregate.AggregateOperation AggregateOperation} you
 * supply must define a separate {@link
 * com.hazelcast.jet.aggregate.AggregateOperation#accumulateFn accumulate}
 * primitive for each contributing stream. Refer to its Javadoc for further
 * details.
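 *
 * As a rough illustration of keyed, windowed aggregation, the JDK-only
 * sketch below counts items per key within tumbling windows. It does not
 * use the Jet API: the names WindowedCountSketch, Event and countPerWindow
 * are hypothetical, and a finite list stands in for the stream, so it only
 * shows how a window turns timestamps into finite groups.
 *
```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// JDK-only sketch of grouping plus tumbling windows; not Jet's implementation.
final class WindowedCountSketch {

    // Hypothetical event type: a grouping key and an event timestamp.
    static final class Event {
        final String key;
        final long timestamp;

        Event(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Assigns each event to the tumbling window that contains its timestamp
    // and counts events per (window start, key).
    static Map<Long, Map<String, Long>> countPerWindow(List<Event> events, long windowSize) {
        Map<Long, Map<String, Long>> result = new TreeMap<>();
        for (Event e : events) {
            long windowStart = Math.floorDiv(e.timestamp, windowSize) * windowSize;
            result.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.key, 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("a", 0), new Event("a", 5),
                new Event("b", 7), new Event("a", 12));
        // prints {0={a=2, b=1}, 10={a=1}}
        System.out.println(countPerWindow(events, 10));
    }
}
```
 *
 * With a window size of 10, the events at timestamps 0, 5 and 7 fall into
 * the window starting at 0 and the event at 12 falls into the window
 * starting at 10. A streaming engine would emit each window's result as
 * the stream progresses instead of after reading a whole list.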
 *
 * Hash-Join
 *
 * The hash-join is a joining transform designed for the use case of data
 * enrichment with static data. It is an asymmetric join that joins the
 * enriching stage(s) to the primary stage. The enriching stages must be
 * batch stages: they must represent finite datasets. The primary stage
 * may be either a batch or a stream stage.
 * 
 * You must provide a separate pair of functions for each of the enriching
 * stages: one to extract the key from the primary item and one to extract
 * it from the enriching item. For example, you can join a {@code Trade}
 * with a {@code Broker} on {@code trade.getBrokerId() == broker.getId()}
 * and a {@code Product} on {@code trade.getProductId() == product.getId()},
 * and all this can happen in a single hash-join transform.
 *
 * The hash-join transform is optimized for throughput: each cluster
 * member materializes a local copy of all the enriching data, stored in
 * hash tables (hence the name). It consumes the enriching inputs in full
 * before ingesting any data from the primary input.
 *
 * The output of {@code hashJoin} is just like an SQL left outer join:
 * for each primary item there are N output items, one for each matching
 * item in the enriching set. If an enriching set doesn't have a matching
 * item, the output will have a {@code null} instead of the enriching item.
 *
 * If you need an SQL inner join, use the specialised {@code innerHashJoin}
 * transform: for each primary item with at least one match there are N
 * output items, one for each matching item in the enriching set. If an
 * enriching set doesn't have a matching item, the output contains no
 * records for the given primary item. Consequently, the output function's
 * arguments are always non-null.
 *
 * The join also allows duplicate keys on both enriching and primary inputs:
 * the output is a cartesian product of all the matching entries.
 *
 * Example:
 * +------------------------+-----------------+---------------------------+
 * |     Primary input      | Enriching input |          Output           |
 * +------------------------+-----------------+---------------------------+
 * | Trade{ticker=AA,amt=1} | Ticker{id=AA}   | Tuple2{                   |
 * | Trade{ticker=BB,amt=2} | Ticker{id=BB}   |   Trade{ticker=AA,amt=1}, |
 * | Trade{ticker=AA,amt=3} |                 |   Ticker{id=AA}           |
 * |                        |                 | }                         |
 * |                        |                 | Tuple2{                   |
 * |                        |                 |   Trade{ticker=BB,amt=2}, |
 * |                        |                 |   Ticker{id=BB}           |
 * |                        |                 | }                         |
 * |                        |                 | Tuple2{                   |
 * |                        |                 |   Trade{ticker=AA,amt=3}, |
 * |                        |                 |   Ticker{id=AA}           |
 * |                        |                 | }                         |
 * +------------------------+-----------------+---------------------------+
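 *
 * The join semantics described above can be sketched with plain JDK
 * collections. This is an illustration only, not Jet's implementation or
 * API: HashJoinSketch, Trade, buildTable and join are hypothetical names,
 * and a pair with a null value stands in for an unmatched left-outer
 * result.
 *
```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// JDK-only sketch of hash-join semantics; not Jet's implementation.
final class HashJoinSketch {

    // Hypothetical primary item type, mirroring the example table.
    static final class Trade {
        final String ticker;
        final int amt;

        Trade(String ticker, int amt) {
            this.ticker = ticker;
            this.amt = amt;
        }
    }

    // Build phase: consume the enriching input in full and index it by key.
    // Duplicate keys are kept, so one key may map to several items.
    static <E, K> Map<K, List<E>> buildTable(List<E> enriching, Function<E, K> keyFn) {
        Map<K, List<E>> table = new HashMap<>();
        for (E e : enriching) {
            table.computeIfAbsent(keyFn.apply(e), k -> new ArrayList<>()).add(e);
        }
        return table;
    }

    // Probe phase: one output pair per matching enriching item. Without a
    // match, the left outer join emits one pair with a null value and the
    // inner join emits nothing.
    static <P, E, K> List<Map.Entry<P, E>> join(
            List<P> primary, Function<P, K> primaryKeyFn,
            Map<K, List<E>> table, boolean inner) {
        List<Map.Entry<P, E>> out = new ArrayList<>();
        for (P p : primary) {
            List<E> matches = table.get(primaryKeyFn.apply(p));
            if (matches == null) {
                if (!inner) {
                    out.add(new SimpleEntry<>(p, null));
                }
                continue;
            }
            for (E e : matches) {
                out.add(new SimpleEntry<>(p, e));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The trades from the table above, plus one trade without a ticker.
        List<Trade> trades = List.of(
                new Trade("AA", 1), new Trade("BB", 2),
                new Trade("AA", 3), new Trade("CC", 4));
        Map<String, List<String>> tickers = buildTable(List.of("AA", "BB"), id -> id);
        System.out.println(join(trades, t -> t.ticker, tickers, false).size()); // prints 4
        System.out.println(join(trades, t -> t.ticker, tickers, true).size());  // prints 3
    }
}
```
 *
 * The unmatched CC trade appears once, paired with null, in the left
 * outer result and is absent from the inner result. If the enriching input
 * held two items with the key AA, each AA trade would produce two pairs.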
 * 
 *
 * @since Jet 3.0
 */
package com.hazelcast.jet.pipeline;