package org.infinispan;
import java.util.Comparator;
import java.util.Iterator;
import java.util.Set;
import java.util.Spliterator;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collector;
import java.util.stream.Stream;
/**
* A {@link Stream} that has additional operations to monitor or control behavior when used from a {@link Cache}. Note that
* these additional methods may only be used on the CacheStream before any intermediate operation is performed,
* because intermediate operations return a plain {@link Stream}.
*
* Whenever the iterator or spliterator methods are used, the user must close the {@link Stream}
* that the method was invoked on after completing its operation. Failure to do so may cause a thread leak if
* the iterator or spliterator is not fully consumed.
*
* When using a stream that is backed by a distributed cache these operations will be performed using remote
* distribution controlled by the segments that each key maps to. All intermediate operations are lazy, even the
* special cases described in later paragraphs, and are not evaluated until a terminal operation is invoked on
* the stream. Essentially each set of intermediate operations is shipped to each remote node where they are applied
* to a local stream there and finally the terminal operation is completed. If this stream is parallel the processing
* on remote nodes is also done using a parallel stream.
*
* Parallel distribution is enabled by default for all operations except for {@link CacheStream#iterator()} and
* {@link CacheStream#spliterator()}. Please see {@link CacheStream#sequentialDistribution()} and
* {@link CacheStream#parallelDistribution()}. With parallel distribution disabled only a single node will process
* the operation at a time (including the local node).
*
* Rehash awareness is enabled by default for all operations. Any intermediate or terminal operation may be invoked
* multiple times during a rehash and thus you should ensure they are idempotent. This can be problematic for
* {@link CacheStream#forEach(Consumer)} as it may be difficult to implement with such requirements; please see it for
* more information. If you wish to disable rehash aware operations you can do so by calling
* {@link CacheStream#disableRehashAware()}, which should provide better performance for some operations. The
* performance is most affected for the key aware operations {@link CacheStream#iterator()},
* {@link CacheStream#spliterator()} and {@link CacheStream#forEach(Consumer)}. Disabling rehash awareness can cause
* incorrect results if the terminal operation is invoked and a rehash occurs before the operation completes. If
* incorrect results do occur, it is guaranteed that only entries were missed; no entries are
* duplicated.
*
* Any stateful intermediate operation requires pulling all information up to that point to the local node to
* operate properly. Each of these methods may have slightly different behavior, so make sure you check the method you are utilizing.
*
* An example of such an operation is the distinct intermediate operation. Upon calling the terminal operation,
* a remote retrieval operation will be run using all of
* the intermediate operations up to the distinct operation remotely. This retrieval is then used to fuel a local
* stream where all of the remaining intermediate operations are performed and then finally the terminal operation is
* applied as normal. Note that in this case the intermediate iterator still obeys the
* {@link CacheStream#distributedBatchSize(int)} setting irrespective of the terminal operator.
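*
* A minimal usage sketch (assumes an existing {@code Cache<String, String> cache} whose {@code entrySet().stream()}
* returns a CacheStream; the batch size and filter are illustrative only):
* <pre>{@code
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    Iterator<Map.Entry<String, String>> it = stream
*          .distributedBatchSize(500)         // CacheStream methods must be called before intermediate operations
*          .filter(e -> e.getValue() != null) // from here on a plain Stream is returned
*          .iterator();
*    it.forEachRemaining(e -> System.out.println(e.getKey()));
* } // the stream must be closed because iterator() was used
* }</pre>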
*
* @param <R> The type of the stream
* @since 8.0
*/
public interface CacheStream<R> extends Stream<R> {
/**
* This disables sending requests to all other remote nodes in parallel; requests are instead sent to one node at a
* time. This can reduce memory pressure on the originator node at the cost of performance.
* Parallel distribution is enabled by default except for {@link CacheStream#iterator()} and
* {@link CacheStream#spliterator()}.
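* For example (a sketch; assumes an existing {@code Cache<String, String> cache}):
* <pre>{@code
* // Ask one node at a time instead of all nodes in parallel
* long size = cache.keySet().stream()
*       .sequentialDistribution()
*       .count();
* }</pre>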
* @return a stream with parallel distribution disabled
*/
CacheStream<R> sequentialDistribution();
/**
* This enables sending requests to all other remote nodes in parallel when a terminal operator is performed. This
* requires additional overhead as results from the various nodes must be processed concurrently, but it should perform
* faster in the majority of cases.
* Parallel distribution is enabled by default except for {@link CacheStream#iterator()} and
* {@link CacheStream#spliterator()}.
* @return a stream with parallel distribution enabled.
*/
CacheStream<R> parallelDistribution();
/**
* Filters which entries are returned by what segment they are present in. This method can be substantially more
* efficient than using a regular {@link CacheStream#filter(Predicate)} method as this can control what nodes are
* asked for data and what entries are read from the underlying CacheStore if present.
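* For example (a sketch; assumes an existing {@code Cache<String, String> cache}; the segment ids are illustrative):
* <pre>{@code
* Set<Integer> segments = new HashSet<>(Arrays.asList(0, 1, 2));
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    stream.filterKeySegments(segments)
*          .iterator()
*          .forEachRemaining(e -> System.out.println(e.getKey()));
* }
* }</pre>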
* @param segments The segments to use for this stream operation. Any segments not in this set will be ignored.
* @return a stream with the segments filtered.
*/
CacheStream<R> filterKeySegments(Set<Integer> segments);
/**
* Filters which entries are returned by only returning ones that map to one of the given keys. This method will always
* be faster than a regular {@link CacheStream#filter(Predicate)} if any keys must be retrieved remotely or if a
* cache store is in use.
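* For example (a sketch; assumes an existing {@code Cache<String, String> cache}; the key names are illustrative):
* <pre>{@code
* Set<String> keys = new HashSet<>(Arrays.asList("key-1", "key-2"));
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    stream.filterKeys(keys)
*          .iterator()
*          .forEachRemaining(e -> System.out.println(e.getValue()));
* }
* }</pre>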
* @param keys The keys that this stream will only operate on.
* @return a stream with the keys filtered.
*/
CacheStream<R> filterKeys(Set<?> keys);
/**
* Controls how many keys are returned from a remote node when using a stream terminal operation with a distributed
* cache to back this stream. This value is ignored when terminal operators that don't track keys are used. Key
* tracking terminal operators are {@link CacheStream#iterator()}, {@link CacheStream#spliterator()},
* {@link CacheStream#forEach(Consumer)}. Please see those methods for additional information on how this value
* may affect them.
* This value may also be used in the case of a terminal operator that doesn't track keys if an intermediate
* operation is performed that requires bringing keys locally to do computations. Examples of such intermediate
* operations are {@link CacheStream#sorted()}, {@link CacheStream#sorted(Comparator)},
* {@link CacheStream#distinct()}, {@link CacheStream#limit(long)} and {@link CacheStream#skip(long)}.
* This value is always ignored when this stream is backed by a cache that is not distributed as all
* values are already local.
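* For example (a sketch; assumes an existing {@code Cache<String, String> cache}; the batch size is illustrative):
* <pre>{@code
* try (CacheStream<String> stream = cache.keySet().stream()) {
*    Iterator<String> it = stream.distributedBatchSize(100).iterator();
*    while (it.hasNext()) {
*       // keys are fetched from remote nodes in batches of at most 100
*       System.out.println(it.next());
*    }
* }
* }</pre>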
* @param batchSize The size of each batch. This defaults to the state transfer chunk size.
* @return a stream with the batch size updated
*/
CacheStream<R> distributedBatchSize(int batchSize);
/**
* Allows registration of a segment completion listener that is notified when a segment has completed
* processing. If the terminal operator has a short circuit this listener may never be called.
* This method is designed for the sole purpose of use with the {@link CacheStream#iterator()} to allow for
* a user to track completion of segments as they are returned from the iterator. Behavior of other methods
* is not specified. Please see {@link CacheStream#iterator()} for more information.
* Multiple listeners may be registered upon multiple invocations of this method. The ordering of notified
* listeners is not specified.
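* For example (a sketch; assumes an existing {@code Cache<String, String> cache}):
* <pre>{@code
* try (CacheStream<String> stream = cache.keySet().stream()) {
*    Iterator<String> it = stream
*          .segmentCompletionListener(segments -> System.out.println("Completed segments: " + segments))
*          .iterator();
*    it.forEachRemaining(key -> { });
* }
* }</pre>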
* @param listener The listener that will be called back as segments are completed.
* @return a stream with the listener registered.
*/
CacheStream<R> segmentCompletionListener(SegmentCompletionListener listener);
/**
* Disables tracking of rehash events that could occur to the underlying cache. If a rehash event occurs while
* a terminal operation is being performed it is possible for some values that are in the cache to not be found.
* Note that you will never have an entry duplicated when rehash awareness is disabled, only lost values.
* Most terminal operations will run faster with rehash awareness disabled even without a rehash occurring.
* However if a rehash occurs with this disabled be prepared to possibly receive only a subset of values.
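* For example (a sketch; assumes an existing {@code Cache<String, String> cache}):
* <pre>{@code
* // Faster, but may miss entries if a rehash happens while the count is running
* long approximateSize = cache.keySet().stream()
*       .disableRehashAware()
*       .count();
* }</pre>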
* @return a stream with rehash awareness disabled.
*/
CacheStream<R> disableRehashAware();
/**
* Sets how long to wait for a remote operation to respond. This timeout does nothing if the terminal
* operation does not go remote.
* If a timeout does occur then a {@link java.util.concurrent.TimeoutException} is thrown from the terminal
* operation invoking thread or on the next call to the {@link Iterator} or {@link Spliterator}.
* Note that if a rehash occurs this timeout value is reset for the subsequent retry if rehash aware is
* enabled.
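* For example (a sketch; assumes an existing {@code Cache<String, String> cache}; the timeout value is illustrative):
* <pre>{@code
* try (CacheStream<String> stream = cache.keySet().stream()) {
*    // A TimeoutException is thrown if a remote node takes longer than 30 seconds to respond
*    long count = stream.timeout(30, TimeUnit.SECONDS).count();
* }
* }</pre>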
* @param timeout the maximum time to wait
* @param unit the time unit of the timeout argument
* @return a stream with the timeout set
*/
CacheStream<R> timeout(long timeout, TimeUnit unit);
/**
* Functional interface that is used as a callback when segments are completed. Please see
* {@link CacheStream#segmentCompletionListener(SegmentCompletionListener)} for more details.
* @since 8.0
*/
@FunctionalInterface
interface SegmentCompletionListener {
/**
* Method invoked when the segment has been found to be consumed properly by the terminal operation.
* @param segments The segments that were completed
*/
void segmentCompleted(Set<Integer> segments);
}
/**
* {@inheritDoc}
* This operation is performed remotely on the node that is the primary owner for the key tied to the entry(s)
* in this stream.
* NOTE: This method while being rehash aware has the lowest consistency of all of the operators. This
* operation will be performed on every entry at least once in the cluster, as long as the originator doesn't go
* down while it is being performed. This is due to how the distributed action is performed. Essentially the
* {@link CacheStream#distributedBatchSize} value controls how many elements are processed per node at a time
* when rehash is enabled. After those are complete the keys are sent to the originator to confirm that those were
* processed. If that node goes down during/before the response those keys will be processed a second time.
* It is possible to have the cache local to each node injected into this instance if the provided
* Consumer also implements the {@link org.infinispan.stream.CacheAware} interface. That injection will be
* performed before the consumer's accept() method is invoked.
* This method is run distributed by default with a distributed backing cache. However if you wish for this
* operation to run locally you can use {@code stream().iterator().forEachRemaining(action)} for a single
* threaded variant. If you
* wish to have a parallel variant you can use {@link java.util.stream.StreamSupport#stream(Spliterator, boolean)},
* passing in the spliterator from the stream. In either case remember you must close the stream after
* you are done processing the iterator or spliterator.
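* For example, the locally executed variants mentioned above (a sketch; assumes an existing
* {@code Cache<String, String> cache}):
* <pre>{@code
* // Single threaded variant, consumed on the invoking node only
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    stream.iterator().forEachRemaining(e -> System.out.println(e.getKey()));
* }
*
* // Parallel local variant built from the spliterator
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    StreamSupport.stream(stream.spliterator(), true)
*          .forEach(e -> System.out.println(e.getKey()));
* }
* }</pre>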
* @param action a non-interfering action to perform on the elements
*/
@Override
void forEach(Consumer<? super R> action);
/**
* {@inheritDoc}
* Usage of this operator requires closing this stream after you are done with the iterator. The preferred
* usage is to use a try-with-resources block on the stream.
* This method has special usage with the {@link org.infinispan.CacheStream.SegmentCompletionListener} in
* that as entries are retrieved from the next method it will complete segments.
* This method obeys the {@link CacheStream#distributedBatchSize(int)}. Note that when using methods such as
* {@link CacheStream#flatMap(Function)} a given key may map to more than one element,
* so the batch size is not a strict bound on how many entries are returned per batch.
* Note that the {@link Iterator#remove()} method is only supported if no intermediate operations have been
* applied to the stream and this is not a stream created from a {@link Cache#values()} collection.
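* For example (a sketch; assumes an existing {@code Cache<String, String> cache} and that the empty-value check is
* illustrative only):
* <pre>{@code
* try (CacheStream<Map.Entry<String, String>> stream = cache.entrySet().stream()) {
*    Iterator<Map.Entry<String, String>> it = stream.iterator();
*    while (it.hasNext()) {
*       if (it.next().getValue().isEmpty()) {
*          it.remove(); // supported here: no intermediate operations and not a values() stream
*       }
*    }
* }
* }</pre>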
* @return the element iterator for this stream
*/
@Override
Iterator<R> iterator();
/**
* {@inheritDoc}
* Usage of this operator requires closing this stream after you are done with the spliterator. The preferred
* usage is to use a try-with-resources block on the stream.
* @return the element spliterator for this stream
*/
@Override
Spliterator<R> spliterator();
/**
* {@inheritDoc}
* This operation is performed entirely on the local node irrespective of the backing cache. This
* operation will act as an intermediate iterator operation requiring data be brought locally for proper behavior.
* Beware this means it will require having all entries of this cache in memory at one time. This is described in
* more detail in the {@link CacheStream} documentation.
* Any subsequent intermediate operations and the terminal operation are also performed locally.
* @return the new stream
*/
@Override
Stream<R> sorted();
/**
* {@inheritDoc}
* This operation is performed entirely on the local node irrespective of the backing cache. This
* operation will act as an intermediate iterator operation requiring data be brought locally for proper behavior.
* Beware this means it will require having all entries of this cache in memory at one time. This is described in
* more detail in the {@link CacheStream} documentation.
* Any subsequent intermediate operations and the terminal operation are then performed locally.
* @param comparator the comparator to be used for sorting the elements
* @return the new stream
*/
@Override
Stream<R> sorted(Comparator<? super R> comparator);
/**
* {@inheritDoc}
* This intermediate operation will be performed both remotely and locally to reduce how many elements
* are sent back from each node. More specifically this operation is applied remotely on each node to only return
* up to the maxSize value and then the aggregated results are limited once again on the local node.
* This operation will act as an intermediate iterator operation requiring data be brought locally for proper
* behavior. This is described in more detail in the {@link CacheStream} documentation.
* Any subsequent intermediate operations and the terminal operation are then performed locally.
* @param maxSize how many elements to limit this stream to.
* @return the new stream
*/
@Override
Stream<R> limit(long maxSize);
/**
* {@inheritDoc}
* This operation is performed entirely on the local node irrespective of the backing cache. This
* operation will act as an intermediate iterator operation requiring data be brought locally for proper behavior.
* This is described in more detail in the {@link CacheStream} documentation.
* Depending on the terminal operator this may or may not require all entries or a subset after skip is applied
* to be in memory all at once.
* Any subsequent intermediate operations and the terminal operation are then performed locally.
* @param n how many elements to skip from this stream
* @return the new stream
*/
@Override
Stream<R> skip(long n);
/**
* {@inheritDoc}
* This operation will be invoked both remotely and locally when used with a distributed cache backing this stream.
* This operation will act as an intermediate iterator operation requiring data be brought locally for proper
* behavior. This is described in more detail in the {@link CacheStream} documentation.
* This intermediate iterator operation will be performed both locally and remotely, possibly requiring a subset of
* all elements to be in memory at once.
* Any subsequent intermediate operations and the terminal operation are then performed locally.
* @return the new stream
*/
@Override
Stream<R> distinct();
/**
* {@inheritDoc}
* Note when using a distributed backing cache for this stream the collector must be marshallable. This
* prevents direct usage of the {@link java.util.stream.Collectors} class. However you can use the
* {@link org.infinispan.stream.CacheCollectors} static factory methods to create a serializable wrapper, which then
* creates the actual collector lazily after being deserialized. This allows you to use any method from the
* {@link java.util.stream.Collectors} class as you normally would.
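* For example (a sketch; assumes an existing {@code Cache<String, String> cache} and that
* {@code CacheCollectors.serializableCollector} is the factory method used to create the serializable wrapper):
* <pre>{@code
* List<String> values = cache.values().stream()
*       .collect(CacheCollectors.serializableCollector(() -> Collectors.toList()));
* }</pre>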
* @param collector
* @param <R1> collected type
* @param <A> intermediate collected type if applicable
* @return the collected value
* @see org.infinispan.stream.CacheCollectors
*/
@Override
<R1, A> R1 collect(Collector<? super R, A, R1> collector);
}