package net.pincette.mongo.streams;

import static java.time.Instant.now;
import static java.util.Optional.ofNullable;
import static java.util.logging.Level.INFO;
import static java.util.logging.Logger.getLogger;
import static net.pincette.json.JsonUtil.string;
import static net.pincette.rs.Box.box;
import static net.pincette.rs.Mapper.map;
import static net.pincette.rs.Pipe.pipe;
import static net.pincette.util.Collections.map;
import static net.pincette.util.Collections.merge;
import static net.pincette.util.Pair.pair;

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Flow.Processor;
import java.util.logging.Logger;
import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;
import net.pincette.function.SideEffect;
import net.pincette.json.JsonUtil;
import net.pincette.rs.streams.Message;
import net.pincette.util.State;

/**
 * With this class you can build Kafka streams using MongoDB aggregation pipeline descriptions. All
 * Kafka streams are expressed in terms of javax.json.JsonObject, so you need a
 * serialiser/deserialiser for that. A candidate is net.pincette.jes.util.JsonSerde,
 * which uses compressed CBOR. Only pipeline stages that have a meaning for infinite streams are
 * supported. This is the list:
 *
 * <dl>
 *   <dt>$addFields</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$bucket</dt>
 *   <dd>This operator is implemented in terms of the $group and $switch operators, so their
 *     constraints apply here. The extension field _collection is also available.</dd>
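 *   <dd>A minimal sketch of a possible specification, assuming the standard MongoDB $bucket
 *     syntax (the field names, boundaries and collection name are illustrative):
 *   <pre>{@code
 *   { "$bucket": {
 *       "groupBy": "$amount",
 *       "boundaries": [0, 100, 1000],
 *       "default": "other",
 *       "output": { "count": { "$sum": 1 } },
 *       "_collection": "amount-buckets"
 *   } }
 *   }</pre></dd>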
 *   <dt>$count</dt>
 *   <dt>$backTrace</dt>
 *   <dd>This is a debugging aid. It logs the number of backpressure requests. To distinguish
 *     between trace spots, you can set the optional field name.</dd>
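 *   <dd>A minimal sketch (the name value is illustrative):
 *   <pre>{@code
 *   { "$backTrace": { "name": "before-group" } }
 *   }</pre></dd>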
 *   <dt>$delay</dt>
 *   <dd>With this extension operator you can send messages to a Kafka topic with a delay. The
 *     order of the messages is not guaranteed. The operator is an object with two fields. The
 *     duration field is the number of milliseconds the operation is delayed. The topic field is
 *     the Kafka topic to which the message is sent after the delay. Note that a Kafka producer
 *     should be available in the context. Note also that message loss is possible if there is a
 *     failure in the middle of a delay. The main use-case for this operator is retry logic.</dd>
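 *   <dd>A minimal sketch (the duration is in milliseconds; the topic name is illustrative):
 *   <pre>{@code
 *   { "$delay": { "duration": 5000, "topic": "retry-commands" } }
 *   }</pre></dd>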
 *   <dt>$deduplicate</dt>
 *   <dd>With this extension operator you can filter away messages based on an expression, which
 *     should be the value of the expression field. The collection field is the MongoDB
 *     collection that is used for the state. The optional cacheWindow field is the number of
 *     milliseconds messages are kept in a cache for duplicate checking. The default value is
 *     1000.</dd>
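 *   <dd>A minimal sketch (the expression and collection name are illustrative):
 *   <pre>{@code
 *   { "$deduplicate": { "expression": "$_id", "collection": "dedup-state", "cacheWindow": 1000 } }
 *   }</pre></dd>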
 *   <dt>$delete</dt>
 *   <dd>This extension operator has a specification with the mandatory fields from and on. The
 *     former is the name of a MongoDB collection. The latter is either a string or a non-empty
 *     array of strings. It represents fields in the incoming JSON object. The operator deletes
 *     records from the collection for which the given fields have the same values as for the
 *     incoming JSON object. The output of the operator is the incoming JSON object.</dd>
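 *   <dd>A minimal sketch (the collection and field names are illustrative):
 *   <pre>{@code
 *   { "$delete": { "from": "orders", "on": ["customerId", "orderId"] } }
 *   }</pre></dd>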
 *   <dt>$group</dt>
 *   <dd>Because streams are infinite there are a few deviations for this operator. The
 *     accumulation operators $first, $last and $stdDevSamp don't exist. The generated Kafka
 *     stream will also emit a message each time the value of the grouping changes. Therefore,
 *     the $stdDevPop operator represents the running standard deviation. The accumulation
 *     operators support the expressions defined in {@link net.pincette.mongo.Expression}. The
 *     $group stage doesn't use KTables. Instead it uses a backing collection in MongoDB. You can
 *     specify one with the extension property _collection. Otherwise a unique collection name
 *     will be used to create one. The name is prefixed with the application name. A reason to
 *     specify a collection, for example, is when your grouping key is time-based and won't get
 *     any new values after some period. You can then set a TTL index to get rid of old records
 *     quickly.</dd>
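 *   <dd>A minimal sketch with an explicit backing collection (the field and collection names are
 *     illustrative):
 *   <pre>{@code
 *   { "$group": {
 *       "_id": "$customerId",
 *       "total": { "$sum": "$amount" },
 *       "_collection": "totals-per-customer"
 *   } }
 *   }</pre></dd>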
 *   <dt>$http</dt>
 *   <dd>With this extension operator you can "join" a JSON HTTP API with a data stream or cause
 *     side effects to it. The object should at least have the fields url and method, which are
 *     both expressions that should yield a string. The rest of the fields are optional. The
 *     headers field should be an expression that yields an object. Its contents will be added as
 *     HTTP headers. Array values will result in multi-valued headers. The result of the
 *     expression in the body field will be used as the request body. The as field, which should
 *     be a field name, will contain the response body in the message that is forwarded. Without
 *     that field response bodies are ignored. When the Boolean unwind field is set to true and
 *     when the response body contains a JSON array, for each entry in the array a message will
 *     be produced with the array entry in the as field. If the array is empty no messages are
 *     produced at all. HTTP errors are put in the httpError field, which contains the fields
 *     statusCode and body. With the object in the field sslContext you can add client-side
 *     authentication. The keyStore field should refer to a PKCS#12 key store file. The password
 *     field should provide the password for the keys in the key store file.</dd>
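 *   <dd>A minimal sketch of a GET "join" (the URL, field names and use of the standard $concat
 *     expression operator are illustrative):
 *   <pre>{@code
 *   { "$http": {
 *       "url": { "$concat": ["https://api.example.com/customers/", "$customerId"] },
 *       "method": "GET",
 *       "as": "customer"
 *   } }
 *   }</pre></dd>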
 *   <dt>$jslt</dt>
 *   <dd>This extension operator transforms the incoming message with a JSLT script. Its
 *     specification should be a string. If it starts with "resource:/" the script will be loaded
 *     as a class path resource, otherwise it is interpreted as a filename. If the transformation
 *     changes or adds the _id field then that will become the key of the outgoing message.</dd>
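 *   <dd>A minimal sketch (the resource path is illustrative):
 *   <pre>{@code
 *   { "$jslt": "resource:/transform.jslt" }
 *   }</pre></dd>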
 *   <dt>$lookup</dt>
 *   <dd>The extra optional boolean field inner is available to make this pipeline stage behave
 *     like an inner join instead of an outer left join, which is the default. When the other
 *     optional boolean field unwind is set, multiple objects may be returned where the as field
 *     will have a single value instead of an array. In this case the join will always be an
 *     inner join. With the unwind feature you can avoid the accumulation of large arrays in
 *     memory. When both the extra fields connectionString and database are present the query
 *     will go to that database instead of the default one.</dd>
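 *   <dd>A minimal sketch, assuming the standard MongoDB $lookup fields plus the unwind extension
 *     (the collection and field names are illustrative):
 *   <pre>{@code
 *   { "$lookup": {
 *       "from": "customers",
 *       "localField": "customerId",
 *       "foreignField": "_id",
 *       "as": "customer",
 *       "unwind": true
 *   } }
 *   }</pre></dd>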
 *   <dt>$match</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Match}.</dd>
 *   <dt>$merge</dt>
 *   <dd>Pipeline values for the whenMatched field are currently not supported. The into field
 *     can only be the name of a collection. The database is always the one given to the
 *     pipeline. The optional key field accepts an expression, which is applied to the incoming
 *     message. When it is present it will be used as the value for the _id field in the MongoDB
 *     collection. The output of the stream is whatever has been updated to or taken from the
 *     MongoDB collection. The value of the _id field of the incoming message will be kept for
 *     the output message.</dd>
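 *   <dd>A minimal sketch with the key extension (the collection name and key expression are
 *     illustrative):
 *   <pre>{@code
 *   { "$merge": {
 *       "into": "customer-totals",
 *       "on": "_id",
 *       "whenMatched": "replace",
 *       "whenNotMatched": "insert",
 *       "key": "$customerId"
 *   } }
 *   }</pre></dd>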
 *   <dt>$out</dt>
 *   <dd>Because streams are infinite this operator behaves like $merge with the following
 *     settings:
 *   <pre>{@code
 *   {
 *     "into": "<output-collection>",
 *     "on": "_id",
 *     "whenMatched": "replace",
 *     "whenNotMatched": "insert"
 *   }
 *   }</pre></dd>
 *   <dt>$per</dt>
 *   <dd>This extension operator is an object with the mandatory fields amount and as. It
 *     accumulates the amount of messages and produces a message with only the field denoted by
 *     the as field. The field is an array of messages. With the optional timeout field, which is
 *     a number of milliseconds, a batch of messages can be emitted before it is full. In that
 *     case the length of the generated array will vary.</dd>
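 *   <dd>A minimal sketch (the as field name is illustrative):
 *   <pre>{@code
 *   { "$per": { "amount": 10, "as": "batch", "timeout": 5000 } }
 *   }</pre></dd>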
 *   <dt>$probe</dt>
 *   <dd>With this extension operator you can monitor the throughput anywhere in a pipeline. The
 *     specification is an object with the fields name and topic, where topic is the name of a
 *     Kafka topic. It will write messages to that topic with the fields name, minute and count,
 *     where count represents the number of messages it has seen in that minute. Note that if
 *     your pipeline is running on multiple topic partitions you should group the messages on the
 *     specified topic by the name and minute and sum the count. That is because every instance
 *     of the pipeline only sees the messages that pass on the partitions that are assigned to
 *     it.</dd>
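 *   <dd>A minimal sketch (the name and topic values are illustrative):
 *   <pre>{@code
 *   { "$probe": { "name": "after-match", "topic": "pipeline-metrics" } }
 *   }</pre></dd>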
 *   <dt>$project</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$redact</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$replaceRoot</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$replaceWith</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$send</dt>
 *   <dd>With this extension operator you can send a message to a Kafka topic. The operator is an
 *     object with a topic field, which is the Kafka topic to which the message is sent. Note
 *     that a Kafka producer should be available in the context. The main use-case for this
 *     operator is dynamic routing of messages to topics.</dd>
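 *   <dd>A minimal sketch (the topic name is illustrative):
 *   <pre>{@code
 *   { "$send": { "topic": "audit" } }
 *   }</pre></dd>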
 *   <dt>$set</dt>
 *   <dd>Supports the expressions defined in {@link net.pincette.mongo.Expression}.</dd>
 *   <dt>$setKey</dt>
 *   <dd>With this extension operator you can change the Kafka key of the message without
 *     changing the message itself. The operator expects an expression, the result of which will
 *     be converted to a string. Supports the expressions defined in
 *     {@link net.pincette.mongo.Expression}.</dd>
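 *   <dd>A minimal sketch (the expression is illustrative):
 *   <pre>{@code
 *   { "$setKey": "$customerId" }
 *   }</pre></dd>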
 *   <dt>$throttle</dt>
 *   <dd>With this extension operator you limit the number of messages per second that are let
 *     through. You give it a JSON object with the integer field maxPerSecond.</dd>
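 *   <dd>A minimal sketch:
 *   <pre>{@code
 *   { "$throttle": { "maxPerSecond": 100 } }
 *   }</pre></dd>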
 *   <dt>$trace</dt>
 *   <dd>This extension operator writes all JSON objects that pass through it to the Java logger
 *     "net.pincette.mongo.streams" with level INFO. That is, when the operator doesn't have an
 *     expression (set to null). If you give it an expression, its result will be written to the
 *     logger. This can be used for pipeline debugging. Supports the expressions defined in
 *     {@link net.pincette.mongo.Expression}.</dd>
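 *   <dd>Two minimal sketches, one tracing the whole message and one tracing only an expression
 *     result (the expression is illustrative):
 *   <pre>{@code
 *   { "$trace": null }
 *   { "$trace": "$_id" }
 *   }</pre></dd>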
 *   <dt>$unset</dt>
 *   <dt>$unwind</dt>
 *   <dd>The Boolean extension option newIds will cause UUIDs to be generated for the output
 *     documents when the unwound array is present and not empty.</dd>
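 *   <dd>A minimal sketch, assuming the standard MongoDB $unwind syntax plus the newIds extension
 *     (the path is illustrative):
 *   <pre>{@code
 *   { "$unwind": { "path": "$items", "newIds": true } }
 *   }</pre></dd>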
 * </dl>
 *
 * @author Werner Donné
 * @since 1.0
 * @see net.pincette.mongo.Match
 * @see net.pincette.mongo.Expression
 */
public class Pipeline {
  static final String ADD_FIELDS = "$addFields";
  static final String BACK_TRACE = "$backTrace";
  static final String BUCKET = "$bucket";
  static final String COUNT = "$count";
  static final String DEDUPLICATE = "$deduplicate";
  static final String DELAY = "$delay";
  static final String DELETE = "$delete";
  static final String GROUP = "$group";
  static final String HTTP = "$http";
  static final String JSLT = "$jslt";
  static final String LOOKUP = "$lookup";
  static final String MATCH = "$match";
  static final String MERGE = "$merge";
  static final String OUT = "$out";
  static final String PER = "$per";
  static final String PROBE = "$probe";
  static final String PROJECT = "$project";
  static final String REDACT = "$redact";
  static final String REPLACE_ROOT = "$replaceRoot";
  static final String REPLACE_WITH = "$replaceWith";
  static final String SEND = "$send";
  static final String SET = "$set";
  static final String SET_KEY = "$setKey";
  static final String THROTTLE = "$throttle";
  static final String TRACE = "$trace";
  static final String UNSET = "$unset";
  static final String UNWIND = "$unwind";

  private static final Logger logger = getLogger("net.pincette.mongo.streams");

  private static final Map<String, Stage> stages =
      map(
          pair(ADD_FIELDS, AddFields::stage),
          pair(BACK_TRACE, (ex, ctx) -> BackTrace.stage(ex)),
          pair(BUCKET, Bucket::stage),
          pair(COUNT, Count::stage),
          pair(DEDUPLICATE, Deduplicate::stage),
          pair(DELAY, Delay::stage),
          pair(DELETE, Delete::stage),
          pair(GROUP, Group::stage),
          pair(HTTP, Http::stage),
          pair(JSLT, Jslt::stage),
          pair(LOOKUP, Lookup::stage),
          pair(MATCH, Match::stage),
          pair(MERGE, Merge::stage),
          pair(OUT, Out::stage),
          pair(PER, (ex, ctx) -> Per.stage(ex)),
          pair(PROBE, Probe::stage),
          pair(PROJECT, Project::stage),
          pair(REDACT, Redact::stage),
          pair(REPLACE_ROOT, ReplaceRoot::stage),
          pair(REPLACE_WITH, ReplaceWith::stage),
          pair(SEND, Send::stage),
          pair(SET, AddFields::stage),
          pair(SET_KEY, SetKey::stage),
          pair(THROTTLE, (ex, ctx) -> Throttle.stage(ex)),
          pair(TRACE, Trace::stage),
          pair(UNSET, Unset::stage),
          pair(UNWIND, (ex, ctx) -> Unwind.stage(ex)));

  private Pipeline() {}

  /**
   * Creates a reactive streams processor from an aggregation pipeline. Pipeline stages that are
   * not recognised are ignored.
   *
   * @param pipeline the aggregation pipeline.
   * @param context the context for the pipeline.
   * @return The processor.
   * @since 3.0
   */
  public static Processor<Message<String, JsonObject>, Message<String, JsonObject>> create(
      final JsonArray pipeline, final Context context) {
    final Map<String, Stage> allStages =
        context.stageExtensions != null ? merge(context.stageExtensions, stages) : stages;

    return pipeline.stream()
        .filter(JsonUtil::isObject)
        .map(JsonValue::asJsonObject)
        .map(
            json ->
                name(json)
                    .flatMap(
                        name ->
                            ofNullable(allStages.get(name))
                                .map(
                                    stage ->
                                        (context.trace ? wrapProfile(stage, name) : stage)
                                            .apply(json.getValue("/" + name), context))))
        .filter(Optional::isPresent)
        .map(Optional::get)
        .reduce(
            null,
            (processor, stage) -> processor == null ? stage : box(processor, stage),
            (p1, p2) -> p1);
  }

  private static Processor<Message<String, JsonObject>, Message<String, JsonObject>> markEnd(
      final State<Long> start, final String name, final JsonValue expression) {
    return map(
        v ->
            SideEffect.<Message<String, JsonObject>>run(
                    () ->
                        logger.log(
                            INFO,
                            "{0} with expression {1} took {2}ms",
                            new Object[] {
                              name, string(expression), now().toEpochMilli() - start.get()
                            }))
                .andThenGet(() -> v));
  }

  private static Processor<Message<String, JsonObject>, Message<String, JsonObject>> markStart(
      final State<Long> start) {
    return map(
        v ->
            SideEffect.<Message<String, JsonObject>>run(() -> start.set(now().toEpochMilli()))
                .andThenGet(() -> v));
  }

  private static Optional<String> name(final JsonObject json) {
    return Optional.of(json.keySet())
        .filter(keys -> keys.size() == 1)
        .map(keys -> keys.iterator().next());
  }

  private static Stage wrapProfile(final Stage stage, final String name) {
    final State<Long> start = new State<>();

    return (expression, context) ->
        pipe(markStart(start))
            .then(stage.apply(expression, context))
            .then(markEnd(start, name, expression));
  }
}



