/**
 * Analyzer for indexing JSON content.
 *
 * Introduction
 *
 * This package extends Lucene's analysis API to provide support for
 * parsing and indexing JSON content. For an introduction to Lucene's analysis
 * API, see the {@link org.apache.lucene.analysis} package documentation.
 *
 *
 * Overview of the API
 *
 * This package contains concrete components
 * ({@link org.apache.lucene.util.Attribute}s,
 * {@link org.apache.lucene.analysis.Tokenizer}s and
 * {@link org.apache.lucene.analysis.TokenFilter}s) for analyzing
 * JSON content.
 * 
 * It also provides a pre-built JSON analyzer,
 * {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer}, that you can use to get
 * started quickly.
 *
 * It also contains a number of
 * {@link com.sindicetech.siren.analysis.NumericAnalyzer}s that are used for
 * supporting datatypes.
 *
 * SIREn's analysis API is divided into several packages:
 *
 *   {@link com.sindicetech.siren.analysis.attributes} contains a number of
 *   {@link org.apache.lucene.util.Attribute}s that are used to add metadata
 *   to a stream of tokens.
 *
 *   {@link com.sindicetech.siren.analysis.filter} contains a number of
 *   {@link org.apache.lucene.analysis.TokenFilter}s that alter incoming tokens.
 * 
 *
 * JSON Analyzer
 *
 * 
 * SIREn provides two different JSON tokenizers to parse and convert JSON data into
 * a node-labelled tree model:
 *
 *   The {@link com.sindicetech.siren.analysis.ExtendedJsonTokenizer} converts JSON data into a tree model.
 *   The {@link com.sindicetech.siren.analysis.ConciseJsonTokenizer} converts JSON data into a concise tree model.
 *
 * The conversion is performed in streaming mode, while the data is being parsed.
 *
 *
 * 
 * The tokenizer traverses the JSON tree depth-first.
 * During the traversal, the tokenizer increments the Dewey code
 * (i.e., node label) whenever an object, an array, a field or a value
 * is encountered. The tokenizer attaches the current node label to every
 * generated token using the
 * {@link com.sindicetech.siren.analysis.attributes.NodeAttribute}.
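 *
 * As a rough illustration of the labelling idea described above, the sketch
 * below assigns Dewey codes to an already parsed JSON object held in nested
 * maps. It is not SIREn's tokenizer: the class name DeweyLabelSketch and its
 * methods are hypothetical, arrays are left out, and the numbering rules
 * (root object unlabelled, each field a child of its object, each value a
 * child of its field) are assumptions chosen so that the value b in
 * { "a" : "b" } receives the label 0.0.
 *

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of SIREn: labels fields and scalar values
// of a parsed JSON object with Dewey codes during a depth-first traversal.
public class DeweyLabelSketch {

  // Returns one "label token" entry per field name and scalar value,
  // in depth-first order.
  static List<String> label(Map<String, Object> root) {
    List<String> out = new ArrayList<>();
    visitObject(root, new ArrayList<>(), out);
    return out;
  }

  static void visitObject(Map<?, ?> object, List<Integer> path, List<String> out) {
    int fieldIndex = 0;
    for (Map.Entry<?, ?> field : object.entrySet()) {
      path.add(fieldIndex++);                  // assumed: field is the n-th child of its object
      out.add(format(path) + " " + field.getKey());
      path.add(0);                             // assumed: value is the first child of its field
      Object value = field.getValue();
      if (value instanceof Map) {
        visitObject((Map<?, ?>) value, path, out);  // nested object: descend before moving on
      } else {
        out.add(format(path) + " " + value);
      }
      path.remove(path.size() - 1);            // leave the value node
      path.remove(path.size() - 1);            // leave the field node
    }
  }

  // Joins the path components with dots, e.g. [0, 0] becomes "0.0".
  static String format(List<Integer> path) {
    StringBuilder sb = new StringBuilder();
    for (int component : path) {
      if (sb.length() > 0) {
        sb.append('.');
      }
      sb.append(component);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, Object> root = new LinkedHashMap<>();
    root.put("a", "b");
    for (String line : label(root)) {
      System.out.println(line);
    }
  }
}
```

 *
 * Under these assumptions, the object { "a" : "b" } yields the label 0 for
 * the field a and 0.0 for the value b. A LinkedHashMap is used so that fields
 * are visited in insertion order, mimicking a streaming parse.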
 * 
 *
 * JSON Datatypes
 *
 * The tokenizer also attaches datatype metadata to every generated token using
 * the {@link com.sindicetech.siren.analysis.attributes.DatatypeAttribute}.
 * A datatype specifies the type of the data a node contains. By default, the
 * tokenizer differentiates five datatypes in the JSON syntax:
 *
 *   {@link com.sindicetech.siren.util.XSDDatatype#XSD_STRING}
 *   {@link com.sindicetech.siren.util.XSDDatatype#XSD_LONG}
 *   {@link com.sindicetech.siren.util.XSDDatatype#XSD_DOUBLE}
 *   {@link com.sindicetech.siren.util.XSDDatatype#XSD_BOOLEAN}
 *   {@link com.sindicetech.siren.util.JSONDatatype#JSON_FIELD}
 * 
 *
 * The datatype metadata is used to perform an appropriate analysis of the
 * content of a node. Such analysis is performed by the
 * {@link com.sindicetech.siren.analysis.filter.DatatypeAnalyzerFilter}. The
 * analysis of each datatype can be configured freely by the user using the
 * method
 * {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer#registerDatatype(char[], org.apache.lucene.analysis.Analyzer)}.
 *
 * Custom Datatypes
 *
 * Custom datatypes can also be used thanks to a specific annotation in the JSON object.
 * The schema of the annotation is the following:
 *
 * {
 *   "_datatype_" : <LABEL>,
 *   "_value_" : <VALUE>
 * }
 *
 * where `<LABEL>` is a string naming the datatype to be assigned to the value,
 * and `<VALUE>` is a string representing the value.
 * 
 *
 * This annotation has no influence on the label of the value node.
 * For example, the label (i.e., 0.0) of the value b below:
 *
 * {
 *   "a" : "b"
 * }
 *
 * is the same as the label of the value b with a custom datatype:
 *
 * {
 *   "a" : {
 *     "_datatype_" : "my datatype",
 *     "_value_" : "b"
 *   }
 * }
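 *
 * One way to picture why the label is unchanged is to treat the annotation
 * object as a transparent wrapper that is unwrapped in place. The sketch
 * below does this on already parsed values. It is an illustration only: the
 * class DatatypeAnnotationSketch, its TypedValue holder, the requirement
 * that the object contains exactly the two annotation keys, and the default
 * datatype name "string" are all assumptions, not SIREn's implementation.
 *

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of SIREn: unwraps a datatype annotation
// object into a (datatype, value) pair that takes the place of the object.
public class DatatypeAnnotationSketch {

  static final String DATATYPE_KEY = "_datatype_";
  static final String VALUE_KEY = "_value_";

  // A value together with the name of the datatype assigned to it.
  static final class TypedValue {
    final String datatype;
    final String value;

    TypedValue(String datatype, String value) {
      this.datatype = datatype;
      this.value = value;
    }
  }

  // If raw is an annotation object, returns its value with the custom
  // datatype; otherwise returns raw with the given default datatype.
  // Either way a single value is produced for the same field, so the
  // position of the value node in the tree does not change.
  static TypedValue resolve(Object raw, String defaultDatatype) {
    if (raw instanceof Map) {
      Map<?, ?> object = (Map<?, ?>) raw;
      // Assumption: an annotation has exactly these two keys.
      if (object.size() == 2
          && object.containsKey(DATATYPE_KEY)
          && object.containsKey(VALUE_KEY)) {
        return new TypedValue(
            String.valueOf(object.get(DATATYPE_KEY)),
            String.valueOf(object.get(VALUE_KEY)));
      }
    }
    return new TypedValue(defaultDatatype, String.valueOf(raw));
  }

  public static void main(String[] args) {
    Map<String, Object> annotated = new LinkedHashMap<>();
    annotated.put(DATATYPE_KEY, "my datatype");
    annotated.put(VALUE_KEY, "b");

    TypedValue plain = resolve("b", "string");
    TypedValue custom = resolve(annotated, "string");
    System.out.println(plain.datatype + " : " + plain.value);
    System.out.println(custom.datatype + " : " + custom.value);
  }
}
```

 *
 * In both calls the resolved value is b. Only the datatype differs, which is
 * the behaviour the two JSON examples above describe.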
 * 
 *
 * Trailing Commas
 *
 * The tokenizer allows trailing commas at the end of an array or an object,
 * although the JSON grammar does not permit them.
 * The reason is that this simplifies the code of the JSON scanner.
 *
 * For example, the following is accepted by our implementation, but not by the grammar:
 *
 * { "a" : "b" , }
 *
 * and
 *
 * { "a" : [ "b" , "c" , ] }
 * 
 *
 * Communication with the Posting Writer
 *
 * Lucene's
 * {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute payload}
 * interface is used by SIREn to encode information such as the node label and
 * the position of the token. This payload is then decoded by the
 * {@link com.sindicetech.siren.index index API} and encoded back into the node-based
 * inverted index data structure.
 *
 */
package com.sindicetech.siren.analysis;