
/**
* Analyzer for indexing JSON content.
*
* Introduction
*
* This package extends Lucene's analysis API to provide support for
* parsing and indexing JSON content. For an introduction to Lucene's analysis
* API, see the {@link org.apache.lucene.analysis} package documentation.
*
*
* Overview of the API
*
* This package contains concrete components
* ({@link org.apache.lucene.util.Attribute}s,
* {@link org.apache.lucene.analysis.Tokenizer}s and
* {@link org.apache.lucene.analysis.TokenFilter}s) for analyzing JSON
* content.
*
* It also provides a pre-built JSON analyzer,
* {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer}, that you can use to get
* started quickly.
*
* It also contains a number of
* {@link com.sindicetech.siren.analysis.NumericAnalyzer}s that are used for
* supporting datatypes.
*
* SIREn's analysis API is divided into several packages:
*
* - {@link com.sindicetech.siren.analysis.attributes} contains a number of
* {@link org.apache.lucene.util.Attribute}s that are used to add metadata
* to a stream of tokens.
*
* - {@link com.sindicetech.siren.analysis.filter} contains a number of
* {@link org.apache.lucene.analysis.TokenFilter}s that alter incoming tokens.
*
*
* JSON Analyzer
*
*
* SIREn provides two JSON tokenizers to parse and convert JSON data into
* a node-labelled tree model:
*
* - The {@link com.sindicetech.siren.analysis.ExtendedJsonTokenizer} converts JSON data into a tree model.
* - The {@link com.sindicetech.siren.analysis.ConciseJsonTokenizer} converts JSON data into a concise tree model.
*
* The conversion is performed in streaming mode, while the JSON input is being parsed.
*
*
*
* The tokenizer traverses the JSON tree depth-first. During the traversal, it
* increments the Dewey code (i.e., the node label) whenever an object, an
* array, a field or a value is encountered. The tokenizer attaches the current
* node label to every token it generates using the
* {@link com.sindicetech.siren.analysis.attributes.NodeAttribute}.
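*
* The labelling can be illustrated with a minimal, self-contained sketch. It is
* not SIREn's tokenizer: the class DeweyLabelSketch and its methods are
* hypothetical, it works on an in-memory tree instead of a token stream, and it
* assumes that array elements are numbered as siblings below their field.
* Nested arrays are not handled.
*
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SIREn: assigns Dewey codes to the fields
// and values of an in-memory JSON-like tree during a depth-first traversal.
final class DeweyLabelSketch {

  // Returns one "<label> <kind> <content>" line per node, in traversal order.
  static List<String> label(Map<String, Object> root) {
    List<String> out = new ArrayList<>();
    visitObject(root, "", out);
    return out;
  }

  // Fields of an object are numbered 0, 1, 2, ... below the object's own label.
  private static void visitObject(Map<String, Object> object, String prefix,
                                  List<String> out) {
    int sibling = 0;
    for (Map.Entry<String, Object> field : object.entrySet()) {
      String fieldLabel = prefix + sibling++;
      out.add(fieldLabel + " field " + field.getKey());
      Object value = field.getValue();
      if (value instanceof List) {
        // Assumption of this sketch: array elements are siblings below the field.
        int element = 0;
        for (Object item : (List<?>) value) {
          visitValue(item, fieldLabel + "." + element++, out);
        }
      } else {
        visitValue(value, fieldLabel + ".0", out);
      }
    }
  }

  @SuppressWarnings("unchecked")
  private static void visitValue(Object value, String label, List<String> out) {
    if (value instanceof Map) {
      // A nested object gets its own label; its fields are numbered below it.
      out.add(label + " object");
      visitObject((Map<String, Object>) value, label + ".", out);
    } else {
      out.add(label + " value " + value);
    }
  }
}
```
*
* With this sketch, the document { "a" : "b" } yields the label 0 for the field
* a and the label 0.0 for the value b.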
*
*
* JSON Datatypes
*
* The tokenizer also attaches datatype metadata to every token it generates using
* the {@link com.sindicetech.siren.analysis.attributes.DatatypeAttribute}.
* A datatype specifies the type of the data a node contains. By default, the
* tokenizer differentiates five datatypes in the JSON syntax:
*
*
* - {@link com.sindicetech.siren.util.XSDDatatype#XSD_STRING}
* - {@link com.sindicetech.siren.util.XSDDatatype#XSD_LONG}
* - {@link com.sindicetech.siren.util.XSDDatatype#XSD_DOUBLE}
* - {@link com.sindicetech.siren.util.XSDDatatype#XSD_BOOLEAN}
* - {@link com.sindicetech.siren.util.JSONDatatype#JSON_FIELD}
*
*
* The datatype metadata is used to perform an appropriate analysis of the
* content of a node. Such analysis is performed by the
* {@link com.sindicetech.siren.analysis.filter.DatatypeAnalyzerFilter}. The
* analysis of each datatype can be configured freely by the user using the
* method
* {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer#registerDatatype(char[], org.apache.lucene.analysis.Analyzer)}.
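*
* The idea of dispatching analysis on the datatype can be sketched without any
* Lucene or SIREn dependency. The class DatatypeRegistrySketch below is
* hypothetical: it replaces Lucene analyzers with plain functions from a string
* to a list of terms, and uses plain strings as datatype names. It only mirrors
* the registration pattern and is not how the filter is implemented.
*
```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for per-datatype analysis; it does not use Lucene's
// Analyzer class or SIREn's DatatypeAnalyzerFilter.
final class DatatypeRegistrySketch {

  private final Map<String, Function<String, List<String>>> analyzers =
      new HashMap<>();

  // Binds a datatype name to the function that analyzes content of that type.
  void registerDatatype(String datatype, Function<String, List<String>> analyzer) {
    analyzers.put(datatype, analyzer);
  }

  // Analyzes a node's content with the function registered for its datatype.
  List<String> analyze(String datatype, String content) {
    Function<String, List<String>> analyzer = analyzers.get(datatype);
    if (analyzer == null) {
      throw new IllegalArgumentException("Unregistered datatype: " + datatype);
    }
    return analyzer.apply(content);
  }
}
```
*
* A caller could, for instance, register a lower-casing whitespace splitter
* under a string datatype and a number-normalizing function under a long
* datatype, so that each node is analyzed according to its own type.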
*
* Custom Datatypes
*
* Custom datatypes can also be used thanks to a specific annotation in the JSON object.
* In the annotation schema shown below, `<LABEL>` is a string naming the datatype
* to assign to the value, and `<VALUE>` is a string representing the value:
*
*
* {
* "_datatype_" : <LABEL>,
* "_value_" : <VALUE>
* }
*
*
* This annotation does not influence the label of the value node.
* For example, the label (i.e., `0.0`) of the value `b` below:
*
* {
* "a" : "b"
* }
*
* is the same as for the value `b` with a custom datatype:
*
* {
* "a" : {
* "_datatype_" : "my datatype",
* "_value_" : "b"
* }
* }
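*
* Application code that prepares documents may want to generate the annotation
* rather than write it by hand. The following is a minimal sketch under that
* assumption; the class DatatypeAnnotationSketch and its annotate method are
* hypothetical and not part of SIREn.
*
```java
// Hypothetical helper, not part of SIREn: wraps a value in the
// "_datatype_" / "_value_" annotation object.
final class DatatypeAnnotationSketch {

  // Builds the annotation object for the given datatype label and value.
  static String annotate(String datatypeLabel, String value) {
    return "{ \"_datatype_\" : " + quote(datatypeLabel)
        + ", \"_value_\" : " + quote(value) + " }";
  }

  // Minimal JSON string quoting: escapes backslashes, double quotes and
  // control characters; everything else is copied unchanged.
  private static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"' || c == '\\') {
        sb.append('\\').append(c);
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.append('"').toString();
  }
}
```
*
* Calling annotate("my datatype", "b") produces the annotation object used as
* the value of the field a in the example above.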
*
*
* Trailing Commas
*
* The tokenizer allows trailing commas at the end of an array or an object,
* although the JSON grammar does not permit them.
* The reason is that this simplifies the code of the JSON scanner.
*
* For example, the following is accepted by our implementation, but not by the grammar:
*
*
* { "a" : "b" , }
*
*
* and
*
*
* { "a" : [ "b" , "c" , ] }
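*
* One way to see why leniency keeps a scanner simple is the sketch below. It is
* not SIREn's JSON scanner: the class LenientArraySketch is hypothetical and
* only reads a flat array of string literals without escape sequences. Because
* the closing bracket is checked at the top of the loop, the loop does not need
* to remember whether the previous token was a comma.
*
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-scanner, not SIREn's JSON scanner: reads a flat array of
// string literals (no escape sequences) and tolerates one trailing comma.
final class LenientArraySketch {

  static List<String> parse(String json) {
    List<String> values = new ArrayList<>();
    int i = expect(json, skipSpaces(json, 0), '[');
    while (true) {
      i = skipSpaces(json, i);
      if (i < json.length() && json.charAt(i) == ']') {
        // Reached for an empty array and after a trailing comma alike.
        return values;
      }
      i = expect(json, i, '"');
      int end = json.indexOf('"', i);
      if (end < 0) {
        throw new IllegalArgumentException("Unterminated string");
      }
      values.add(json.substring(i, end));
      i = skipSpaces(json, end + 1);
      if (i < json.length() && json.charAt(i) == ',') {
        i++; // consume the separator; a value or the closing bracket may follow
      } else {
        expect(json, i, ']');
        return values;
      }
    }
  }

  // Checks that the character at offset i is c and returns the next offset.
  private static int expect(String s, int i, char c) {
    if (i >= s.length() || s.charAt(i) != c) {
      throw new IllegalArgumentException("Expected '" + c + "' at offset " + i);
    }
    return i + 1;
  }

  private static int skipSpaces(String s, int i) {
    while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
      i++;
    }
    return i;
  }
}
```
*
* A strict scanner would need an extra state, or a look-ahead, to reject the
* closing bracket when it directly follows a comma.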
*
*
* Communication with the Posting Writer
*
* Lucene's
* {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute payload}
* interface is used by SIREn to encode information such as the node label and
* the position of the token. This payload is then decoded by the
* {@link com.sindicetech.siren.index index API} and encoded back into the node-based
* inverted index data structure.
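*
* The round trip through a payload can be sketched as follows. The byte layout
* used here (label length, label components, then position, each as a 4-byte
* integer) is an assumption made for illustration only and is not SIREn's
* actual encoding; the class NodePayloadSketch is hypothetical.
*
```java
import java.nio.ByteBuffer;

// Hypothetical codec, not SIREn's: packs a node label and a token position
// into a byte array, as a payload would carry them, and reads them back.
final class NodePayloadSketch {

  // Layout assumed for this sketch: [label length][label components...][position].
  static byte[] encode(int[] nodeLabel, int position) {
    ByteBuffer buffer = ByteBuffer.allocate(4 * (nodeLabel.length + 2));
    buffer.putInt(nodeLabel.length);
    for (int component : nodeLabel) {
      buffer.putInt(component);
    }
    buffer.putInt(position);
    return buffer.array();
  }

  static int[] decodeNodeLabel(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.wrap(payload);
    int[] nodeLabel = new int[buffer.getInt()];
    for (int i = 0; i < nodeLabel.length; i++) {
      nodeLabel[i] = buffer.getInt();
    }
    return nodeLabel;
  }

  static int decodePosition(byte[] payload) {
    // The position is stored in the last four bytes.
    return ByteBuffer.wrap(payload).getInt(payload.length - 4);
  }
}
```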
*
*/
package com.sindicetech.siren.analysis;