package net.pincette.mongo.streams;

import static java.time.Instant.now;
import static java.util.Optional.ofNullable;
import static java.util.logging.Level.INFO;
import static java.util.logging.Logger.getLogger;
import static net.pincette.json.JsonUtil.string;
import static net.pincette.rs.Box.box;
import static net.pincette.rs.Mapper.map;
import static net.pincette.rs.Pipe.pipe;
import static net.pincette.util.Collections.map;
import static net.pincette.util.Collections.merge;
import static net.pincette.util.Pair.pair;

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Flow.Processor;
import java.util.logging.Logger;
import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;
import net.pincette.function.SideEffect;
import net.pincette.json.JsonUtil;
import net.pincette.rs.streams.Message;
import net.pincette.util.State;

/**
 * With this class you can build Kafka streams using MongoDB aggregation pipeline descriptions. All
 * Kafka streams are expressed in terms of javax.json.JsonObject, so you need a
 * serialiser/deserialiser for that. A candidate is net.pincette.jes.util.JsonSerde,
 * which uses compressed CBOR. Only pipeline stages that have a meaning for infinite streams are
 * supported. This is the list:
 *
 * <dl>
 *   <dt>$addFields</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$bucket</dt>
 *   <dd>This operator is implemented in terms of the $group and $switch operators, so their
 *     constraints apply here. The extension field _collection is also available.</dd>
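 *   <dd>A minimal sketch of a possible specification, assuming the standard MongoDB $bucket
 *     syntax (the field names, boundaries and collection name are illustrative):
 *   <pre>{@code
 *   { "$bucket": {
 *       "groupBy": "$amount",
 *       "boundaries": [0, 100, 1000],
 *       "default": "other",
 *       "output": { "count": { "$sum": 1 } },
 *       "_collection": "amount-buckets"
 *   } }
 *   }</pre></dd>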
 *   <dt>$count</dt>
 *   <dt>$backTrace</dt>
 *   <dd>This is a debugging aid. It logs the number of backpressure requests. To distinguish
 *     between trace spots, you can set the optional field name.</dd>
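 *   <dd>A minimal sketch (the name value is illustrative):
 *   <pre>{@code
 *   { "$backTrace": { "name": "before-group" } }
 *   }</pre></dd>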
 *   <dt>$delay</dt>
 *   <dd>With this extension operator you can send messages to a Kafka topic with a delay. The
 *     order of the messages is not guaranteed. The operator is an object with two fields. The
 *     duration field is the number of milliseconds the operation is delayed. The topic field is
 *     the Kafka topic to which the message is sent after the delay. Note that a Kafka producer
 *     should be available in the context. Note also that message loss is possible if there is a
 *     failure in the middle of a delay. The main use-case for this operator is retry logic.</dd>
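 *   <dd>A minimal sketch (the duration is in milliseconds; the topic name is illustrative):
 *   <pre>{@code
 *   { "$delay": { "duration": 5000, "topic": "retry-commands" } }
 *   }</pre></dd>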
 *   <dt>$deduplicate</dt>
 *   <dd>With this extension operator you can filter away messages based on an expression, which
 *     should be the value of the expression field. The collection field is the MongoDB
 *     collection that is used for the state. The optional cacheWindow field is the number of
 *     milliseconds messages are kept in a cache for duplicate checking. The default value is
 *     1000.</dd>
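 *   <dd>A minimal sketch (the expression and collection name are illustrative):
 *   <pre>{@code
 *   { "$deduplicate": { "expression": "$_id", "collection": "dedup-state", "cacheWindow": 1000 } }
 *   }</pre></dd>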
 *   <dt>$delete</dt>
 *   <dd>This extension operator has a specification with the mandatory fields from and on. The
 *     former is the name of a MongoDB collection. The latter is either a string or a non-empty
 *     array of strings. It represents fields in the incoming JSON object. The operator deletes
 *     records from the collection for which the given fields have the same values as for the
 *     incoming JSON object. The output of the operator is the incoming JSON object.</dd>
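 *   <dd>A minimal sketch (the collection and field names are illustrative):
 *   <pre>{@code
 *   { "$delete": { "from": "orders", "on": ["customerId", "orderId"] } }
 *   }</pre></dd>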
 *   <dt>$group</dt>
 *   <dd>Because streams are infinite there are a few deviations for this operator. The
 *     accumulation operators $first, $last and $stdDevSamp don't exist. The generated Kafka
 *     stream will also emit a message each time the value of the grouping changes. Therefore,
 *     the $stdDevPop operator represents the running standard deviation. The accumulation
 *     operators support the expressions defined in {@link net.pincette.mongo.Expression}. The
 *     $group stage doesn't use KTables. Instead it uses a backing collection in MongoDB. You can
 *     specify one with the extension property _collection. Otherwise a unique collection name
 *     will be used to create one. The name is prefixed with the application name. A reason to
 *     specify a collection, for example, is when your grouping key is time-based and won't get
 *     any new values after some period. You can then set a TTL index to get rid of old records
 *     quickly.</dd>
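 *   <dd>A minimal sketch with an explicit backing collection (the field and collection names are
 *     illustrative):
 *   <pre>{@code
 *   { "$group": {
 *       "_id": "$customerId",
 *       "total": { "$sum": "$amount" },
 *       "_collection": "totals-per-customer"
 *   } }
 *   }</pre></dd>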
 *   <dt>$http</dt>
 *   <dd>With this extension operator you can "join" a JSON HTTP API with a data stream or cause
 *     side effects to it. The object should at least have the fields url and method, which are
 *     both expressions that should yield a string. The rest of the fields are optional. The
 *     headers field should be an expression that yields an object. Its contents will be added as
 *     HTTP headers. Array values will result in multi-valued headers. The result of the
 *     expression in the body field will be used as the request body. The as field, which should
 *     be a field name, will contain the response body in the message that is forwarded. Without
 *     that field response bodies are ignored. When the Boolean unwind field is set to true and
 *     when the response body contains a JSON array, for each entry in the array a message will
 *     be produced with the array entry in the as field. If the array is empty no messages are
 *     produced at all. HTTP errors are put in the httpError field, which contains the fields
 *     statusCode and body. With the object in the field sslContext you can add client-side
 *     authentication. The keyStore field should refer to a PKCS#12 key store file. The password
 *     field should provide the password for the keys in the key store file.</dd>
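 *   <dd>A minimal sketch of a GET "join" (the URL, field names and use of the standard $concat
 *     expression operator are illustrative):
 *   <pre>{@code
 *   { "$http": {
 *       "url": { "$concat": ["https://api.example.com/customers/", "$customerId"] },
 *       "method": "GET",
 *       "as": "customer"
 *   } }
 *   }</pre></dd>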
 *   <dt>$jslt</dt>
 *   <dd>This extension operator transforms the incoming message with a JSLT script. Its
 *     specification should be a string. If it starts with "resource:/" the script will be loaded
 *     as a class path resource, otherwise it is interpreted as a filename. If the transformation
 *     changes or adds the _id field then that will become the key of the outgoing message.</dd>
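 *   <dd>A minimal sketch (the resource path is illustrative):
 *   <pre>{@code
 *   { "$jslt": "resource:/transform.jslt" }
 *   }</pre></dd>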
 *   <dt>$lookup</dt>
 *   <dd>The extra optional boolean field inner is available to make this pipeline stage behave
 *     like an inner join instead of an outer left join, which is the default. When the other
 *     optional boolean field unwind is set, multiple objects may be returned where the as field
 *     will have a single value instead of an array. In this case the join will always be an
 *     inner join. With the unwind feature you can avoid the accumulation of large arrays in
 *     memory. When both the extra fields connectionString and database are present the query
 *     will go to that database instead of the default one.</dd>
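 *   <dd>A minimal sketch, assuming the standard MongoDB $lookup fields plus the unwind extension
 *     (the collection and field names are illustrative):
 *   <pre>{@code
 *   { "$lookup": {
 *       "from": "customers",
 *       "localField": "customerId",
 *       "foreignField": "_id",
 *       "as": "customer",
 *       "unwind": true
 *   } }
 *   }</pre></dd>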
 *   <dt>$match</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Match}.</dd>
 *   <dt>$merge</dt>
 *   <dd>Pipeline values for the whenMatched field are currently not supported. The into field
 *     can only be the name of a collection. The database is always the one given to the
 *     pipeline. The optional key field accepts an expression, which is applied to the incoming
 *     message. When it is present it will be used as the value for the _id field in the MongoDB
 *     collection. The output of the stream is whatever has been updated to or taken from the
 *     MongoDB collection. The value of the _id field of the incoming message will be kept for
 *     the output message.</dd>
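 *   <dd>A minimal sketch with the key extension (the collection name and key expression are
 *     illustrative):
 *   <pre>{@code
 *   { "$merge": {
 *       "into": "customer-totals",
 *       "on": "_id",
 *       "whenMatched": "replace",
 *       "whenNotMatched": "insert",
 *       "key": "$customerId"
 *   } }
 *   }</pre></dd>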
 *   <dt>$out</dt>
 *   <dd>Because streams are infinite this operator behaves like $merge with the following
 *     settings:
 *   <pre>{@code
 *   {
 *     "into": "<output-collection>",
 *     "on": "_id",
 *     "whenMatched": "replace",
 *     "whenNotMatched": "insert"
 *   }
 *   }</pre></dd>
 *   <dt>$per</dt>
 *   <dd>This extension operator is an object with the mandatory fields amount and as. It
 *     accumulates the amount of messages and produces a message with only the field denoted by
 *     the as field. The field is an array of messages. With the optional timeout field, which is
 *     a number of milliseconds, a batch of messages can be emitted before it is full. In that
 *     case the length of the generated array will vary.</dd>
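 *   <dd>A minimal sketch (the as field name is illustrative):
 *   <pre>{@code
 *   { "$per": { "amount": 10, "as": "batch", "timeout": 5000 } }
 *   }</pre></dd>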
 *   <dt>$probe</dt>
 *   <dd>With this extension operator you can monitor the throughput anywhere in a pipeline. The
 *     specification is an object with the fields name and topic, where topic is the name of a
 *     Kafka topic. It will write messages to that topic with the fields name, minute and count,
 *     where count represents the number of messages it has seen in that minute. Note that if
 *     your pipeline is running on multiple topic partitions you should group the messages on the
 *     specified topic by the name and minute and sum the count. That is because every instance
 *     of the pipeline only sees the messages that pass on the partitions that are assigned to
 *     it.</dd>
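 *   <dd>A minimal sketch (the name and topic values are illustrative):
 *   <pre>{@code
 *   { "$probe": { "name": "after-match", "topic": "pipeline-metrics" } }
 *   }</pre></dd>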
 *   <dt>$project</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$redact</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$replaceRoot</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$replaceWith</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$send</dt>
 *   <dd>With this extension operator you can send a message to a Kafka topic. The operator is an
 *     object with a topic field, which is the Kafka topic to which the message is sent. Note
 *     that a Kafka producer should be available in the context. The main use-case for this
 *     operator is dynamic routing of messages to topics.</dd>
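 *   <dd>A minimal sketch (the topic name is illustrative):
 *   <pre>{@code
 *   { "$send": { "topic": "audit" } }
 *   }</pre></dd>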
 *   <dt>$set</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$setKey</dt>
 *   <dd>With this extension operator you can change the Kafka key of the message without
 *     changing the message itself. The operator expects an expression, the result of which will
 *     be converted to a string. Supports the expressions defined in
 *     {@link net.pincette.mongo.Expression}.</dd>
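 *   <dd>A minimal sketch (the expression is illustrative):
 *   <pre>{@code
 *   { "$setKey": "$customerId" }
 *   }</pre></dd>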
 *   <dt>$throttle</dt>
 *   <dd>With this extension operator you limit the number of messages per second that are let
 *     through. You give it a JSON object with the integer field maxPerSecond.</dd>
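 *   <dd>A minimal sketch:
 *   <pre>{@code
 *   { "$throttle": { "maxPerSecond": 100 } }
 *   }</pre></dd>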
 *   <dt>$trace</dt>
 *   <dd>This extension operator writes all JSON objects that pass through it to the Java logger
 *     "net.pincette.mongo.streams" with level INFO. That is, when the operator doesn't have an
 *     expression (set to null). If you give it an expression, its result will be written to the
 *     logger. This can be used for pipeline debugging. Supports the expressions defined in
 *     {@link net.pincette.mongo.Expression}.</dd>
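 *   <dd>Two minimal sketches, one tracing the whole message and one tracing only an expression
 *     result (the expression is illustrative):
 *   <pre>{@code
 *   { "$trace": null }
 *   { "$trace": "$_id" }
 *   }</pre></dd>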
 *   <dt>$unset</dt>
 *   <dt>$unwind</dt>
 *   <dd>The Boolean extension option newIds will cause UUIDs to be generated for the output
 *     documents when the unwound array is present and not empty.</dd>
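 *   <dd>A minimal sketch, assuming the standard MongoDB $unwind syntax plus the newIds extension
 *     (the path is illustrative):
 *   <pre>{@code
 *   { "$unwind": { "path": "$items", "newIds": true } }
 *   }</pre></dd>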
 * </dl>
 *
 * @author Werner Donné
 * @since 1.0
 * @see net.pincette.mongo.Match
 * @see net.pincette.mongo.Expression
 */
public class Pipeline {
  static final String ADD_FIELDS = "$addFields";
  static final String BACK_TRACE = "$backTrace";
  static final String BUCKET = "$bucket";
  static final String COUNT = "$count";
  static final String DEDUPLICATE = "$deduplicate";
  static final String DELAY = "$delay";
  static final String DELETE = "$delete";
  static final String GROUP = "$group";
  static final String HTTP = "$http";
  static final String JSLT = "$jslt";
  static final String LOOKUP = "$lookup";
  static final String MATCH = "$match";
  static final String MERGE = "$merge";
  static final String OUT = "$out";
  static final String PER = "$per";
  static final String PROBE = "$probe";
  static final String PROJECT = "$project";
  static final String REDACT = "$redact";
  static final String REPLACE_ROOT = "$replaceRoot";
  static final String REPLACE_WITH = "$replaceWith";
  static final String SEND = "$send";
  static final String SET = "$set";
  static final String SET_KEY = "$setKey";
  static final String THROTTLE = "$throttle";
  static final String TRACE = "$trace";
  static final String UNSET = "$unset";
  static final String UNWIND = "$unwind";

  private static final Logger logger = getLogger("net.pincette.mongo.streams");

  private static final Map<String, Stage> stages =
      map(
          pair(ADD_FIELDS, AddFields::stage),
          pair(BACK_TRACE, (ex, ctx) -> BackTrace.stage(ex)),
          pair(BUCKET, Bucket::stage),
          pair(COUNT, Count::stage),
          pair(DEDUPLICATE, Deduplicate::stage),
          pair(DELAY, Delay::stage),
          pair(DELETE, Delete::stage),
          pair(GROUP, Group::stage),
          pair(HTTP, Http::stage),
          pair(JSLT, Jslt::stage),
          pair(LOOKUP, Lookup::stage),
          pair(MATCH, Match::stage),
          pair(MERGE, Merge::stage),
          pair(OUT, Out::stage),
          pair(PER, (ex, ctx) -> Per.stage(ex)),
          pair(PROBE, Probe::stage),
          pair(PROJECT, Project::stage),
          pair(REDACT, Redact::stage),
          pair(REPLACE_ROOT, ReplaceRoot::stage),
          pair(REPLACE_WITH, ReplaceWith::stage),
          pair(SEND, Send::stage),
          pair(SET, AddFields::stage),
          pair(SET_KEY, SetKey::stage),
          pair(THROTTLE, (ex, ctx) -> Throttle.stage(ex)),
          pair(TRACE, Trace::stage),
          pair(UNSET, Unset::stage),
          pair(UNWIND, (ex, ctx) -> Unwind.stage(ex)));

  private Pipeline() {}

  /**
   * Creates a reactive streams processor from an aggregation pipeline. Pipeline stages that are
   * not recognised are ignored.
   *
   * @param pipeline the aggregation pipeline.
   * @param context the context for the pipeline.
   * @return The processor.
   * @since 3.0
   */
  public static Processor<Message<String, JsonObject>, Message<String, JsonObject>> create(
      final JsonArray pipeline, final Context context) {
    final Map<String, Stage> allStages =
        context.stageExtensions != null ? merge(context.stageExtensions, stages) : stages;

    return pipeline.stream()
        .filter(JsonUtil::isObject)
        .map(JsonValue::asJsonObject)
        .map(
            json ->
                name(json)
                    .flatMap(
                        name ->
                            ofNullable(allStages.get(name))
                                .map(
                                    stage ->
                                        (context.trace ? wrapProfile(stage, name) : stage)
                                            .apply(json.getValue("/" + name), context))))
        .filter(Optional::isPresent)
        .map(Optional::get)
        .reduce(
            null,
            (processor, stage) -> processor == null ? stage : box(processor, stage),
            (p1, p2) -> p1);
  }

  private static Processor<Message<String, JsonObject>, Message<String, JsonObject>> markEnd(
      final State<Long> start, final String name, final JsonValue expression) {
    return map(
        v ->
            SideEffect.<Message<String, JsonObject>>run(
                    () ->
                        logger.log(
                            INFO,
                            "{0} with expression {1} took {2}ms",
                            new Object[] {
                              name, string(expression), now().toEpochMilli() - start.get()
                            }))
                .andThenGet(() -> v));
  }

  private static Processor<Message<String, JsonObject>, Message<String, JsonObject>> markStart(
      final State<Long> start) {
    return map(
        v ->
            SideEffect.<Message<String, JsonObject>>run(() -> start.set(now().toEpochMilli()))
                .andThenGet(() -> v));
  }

  private static Optional<String> name(final JsonObject json) {
    return Optional.of(json.keySet())
        .filter(keys -> keys.size() == 1)
        .map(keys -> keys.iterator().next());
  }

  private static Stage wrapProfile(final Stage stage, final String name) {
    final State<Long> start = new State<>();

    return (expression, context) ->
        pipe(markStart(start))
            .then(stage.apply(expression, context))
            .then(markEnd(start, name, expression));
  }
}



