
com.sindicetech.siren.analysis.package-info

/**
 * Analyzer for indexing JSON content.
 *
 * 

Introduction

This package extends Lucene's analysis API to provide support for parsing and indexing JSON content. For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation.

Overview of the API

This package contains concrete components ({@link org.apache.lucene.util.Attribute}s, {@link org.apache.lucene.analysis.Tokenizer}s and {@link org.apache.lucene.analysis.TokenFilter}s) for analyzing JSON content.

It also provides a pre-built JSON analyzer, {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer}, that you can use to get started quickly.

It also contains a number of {@link com.sindicetech.siren.analysis.NumericAnalyzer}s that are used to support numeric datatypes.

SIREn's analysis API is divided into several packages:

  • {@link com.sindicetech.siren.analysis.attributes} contains a number of {@link org.apache.lucene.util.Attribute}s that are used to add metadata to a stream of tokens.
  • {@link com.sindicetech.siren.analysis.filter} contains a number of {@link org.apache.lucene.analysis.TokenFilter}s that alter incoming tokens.

JSON Analyzer

SIREn provides two JSON tokenizers to parse and convert JSON data into a node-labelled tree model:

  • The {@link com.sindicetech.siren.analysis.ExtendedJsonTokenizer} converts JSON data into a tree model.
  • The {@link com.sindicetech.siren.analysis.ConciseJsonTokenizer} converts JSON data into a concise tree model.

The conversion is performed in streaming mode during parsing.

The tokenizer traverses the JSON tree depth-first. During the traversal, it increments the Dewey code (i.e., the node label) whenever an object, an array, a field or a value is encountered. It attaches the current node label to every generated token using the {@link com.sindicetech.siren.analysis.attributes.NodeAttribute}.
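
The depth-first labelling described above can be illustrated with a small, self-contained sketch. It does not use SIREn's tokenizers: the class `DeweyLabelSketch`, its method names and the exact numbering rules are assumptions made for illustration, chosen only so that the value `b` in `{ "a" : "b" }` receives the label 0.0, as stated in this documentation. The JSON tree is represented with plain `Map` and `List` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only: the numbering rules below are assumptions,
// not SIREn's actual labelling scheme.
final class DeweyLabelSketch {

    // Returns one "label kind text" line per node, in depth-first order.
    static List<String> label(Map<String, Object> root) {
        List<String> out = new ArrayList<>();
        visitFields(root, "", out);
        return out;
    }

    // Fields of an object become children 0, 1, 2, ... of the object's label.
    private static void visitFields(Map<String, Object> object, String parent, List<String> out) {
        int index = 0;
        for (Map.Entry<String, Object> field : object.entrySet()) {
            String fieldLabel = child(parent, index++);
            out.add(fieldLabel + " field " + field.getKey());
            Object value = field.getValue();
            if (value instanceof List) {
                // Array elements directly under a field are numbered as siblings.
                visitElements((List<?>) value, fieldLabel, out);
            } else {
                visitNode(value, child(fieldLabel, 0), out);
            }
        }
    }

    private static void visitElements(List<?> elements, String parent, List<String> out) {
        int index = 0;
        for (Object element : elements) {
            visitNode(element, child(parent, index++), out);
        }
    }

    @SuppressWarnings("unchecked")
    private static void visitNode(Object node, String label, List<String> out) {
        if (node instanceof Map) {
            out.add(label + " object");
            visitFields((Map<String, Object>) node, label, out);
        } else if (node instanceof List) {
            out.add(label + " array");
            visitElements((List<?>) node, label, out);
        } else {
            out.add(label + " value " + node);
        }
    }

    // Appends one more component to a Dewey code, e.g. "0" + 1 gives "0.1".
    private static String child(String parent, int index) {
        return parent.isEmpty() ? String.valueOf(index) : parent + "." + index;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("a", "b");
        doc.put("c", List.of("d", "e"));
        label(doc).forEach(System.out::println);
    }
}
```

In this sketch each label is built while descending, so a token's label is known as soon as the token is produced, which mirrors the streaming behaviour described above.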


JSON Datatypes

The tokenizer also attaches datatype metadata to every generated token using the {@link com.sindicetech.siren.analysis.attributes.DatatypeAttribute}. A datatype specifies the type of data a node contains. By default, the tokenizer differentiates five datatypes in the JSON syntax:

  • {@link com.sindicetech.siren.util.XSDDatatype#XSD_STRING}
  • {@link com.sindicetech.siren.util.XSDDatatype#XSD_LONG}
  • {@link com.sindicetech.siren.util.XSDDatatype#XSD_DOUBLE}
  • {@link com.sindicetech.siren.util.XSDDatatype#XSD_BOOLEAN}
  • {@link com.sindicetech.siren.util.JSONDatatype#JSON_FIELD}

The datatype metadata is used to perform an appropriate analysis of the content of a node. This analysis is performed by the {@link com.sindicetech.siren.analysis.filter.DatatypeAnalyzerFilter}. The analysis of each datatype can be configured by the user with the method {@link com.sindicetech.siren.analysis.ExtendedJsonAnalyzer#registerDatatype(char[], org.apache.lucene.analysis.Analyzer)}.
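
The idea of dispatching analysis by datatype can be sketched without SIREn or Lucene on the classpath. Everything in the block below is hypothetical: the class `DatatypeDispatchSketch`, the short datatype names such as `xsd:long` (stand-ins for the constants listed above), and the use of a plain `Function` in place of a Lucene `Analyzer`. Throwing on an unregistered datatype is a choice of this sketch, not documented SIREn behaviour.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a per-datatype analyzer registry.
final class DatatypeDispatchSketch {

    // Maps a datatype name to the function that tokenizes values of that type.
    private final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();

    // Registers (or replaces) the analysis applied to one datatype.
    void register(String datatype, Function<String, List<String>> analyzer) {
        analyzers.put(datatype, analyzer);
    }

    // Guesses a datatype from the Java type of an already parsed JSON scalar.
    static String datatypeOf(Object jsonValue) {
        if (jsonValue instanceof Boolean) return "xsd:boolean";
        if (jsonValue instanceof Long || jsonValue instanceof Integer) return "xsd:long";
        if (jsonValue instanceof Double || jsonValue instanceof Float) return "xsd:double";
        return "xsd:string";
    }

    // Applies the analysis registered for the given datatype.
    List<String> analyze(String datatype, String text) {
        Function<String, List<String>> analyzer = analyzers.get(datatype);
        if (analyzer == null) {
            throw new IllegalArgumentException("No analyzer registered for " + datatype);
        }
        return analyzer.apply(text);
    }

    public static void main(String[] args) {
        DatatypeDispatchSketch dispatch = new DatatypeDispatchSketch();
        // Strings are lower-cased and split on whitespace.
        dispatch.register("xsd:string",
            text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        // Numbers are kept as a single token.
        dispatch.register("xsd:long", text -> List.of(text));

        Object value = "Hello World";
        System.out.println(dispatch.analyze(datatypeOf(value), String.valueOf(value)));
    }
}
```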

Custom Datatypes

Custom datatypes can also be used through a specific annotation in the JSON object. The annotation follows the schema below, where `<LABEL>` is a string naming the datatype to assign to the value and `<VALUE>` is a string representing the value:
 * {
 *   "_datatype_" : <LABEL>,
 *   "_value_" : <VALUE>
 * }
 * 
This annotation does not affect the label of the value node. For example, the label (i.e., 0.0) of the value b below:
 * {
 *   "a" : "b"
 * }
 * 
is the same for the value b with a custom datatype:
 * {
 *   "a" : {
 *     "_datatype_" : "my datatype",
 *     "_value_" : "b"
 *   }
 * }
 * 
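
The effect of the annotation can be sketched as an unwrapping step: the annotation object is replaced by its value, and only the datatype changes. The class `DatatypeAnnotationSketch`, the `TypedValue` holder and the default name `xsd:string` are hypothetical, and requiring exactly the two keys `_datatype_` and `_value_` is an assumption of this sketch rather than documented SIREn behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of resolving a datatype annotation.
final class DatatypeAnnotationSketch {

    // Assumed default for plain JSON strings (short name, not a SIREn constant).
    static final String DEFAULT_DATATYPE = "xsd:string";

    // A value together with the datatype that should drive its analysis.
    static final class TypedValue {
        final String datatype;
        final String value;

        TypedValue(String datatype, String value) {
            this.datatype = datatype;
            this.value = value;
        }
    }

    // Unwraps an annotation object; plain strings get the default datatype.
    static TypedValue resolve(Object fieldValue) {
        if (fieldValue instanceof String) {
            return new TypedValue(DEFAULT_DATATYPE, (String) fieldValue);
        }
        if (fieldValue instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) fieldValue;
            if (map.size() == 2 && map.containsKey("_datatype_") && map.containsKey("_value_")) {
                return new TypedValue(String.valueOf(map.get("_datatype_")),
                                      String.valueOf(map.get("_value_")));
            }
        }
        throw new IllegalArgumentException("Not a string or a datatype annotation");
    }

    public static void main(String[] args) {
        Map<String, Object> annotation = new LinkedHashMap<>();
        annotation.put("_datatype_", "my datatype");
        annotation.put("_value_", "b");

        TypedValue plain = resolve("b");
        TypedValue typed = resolve(annotation);
        // Both resolve to the same value; only the datatype differs.
        System.out.println(plain.value + " / " + plain.datatype);
        System.out.println(typed.value + " / " + typed.datatype);
    }
}
```

Because the annotation object is replaced by its value before any node is emitted, no extra tree level is introduced, which is consistent with the label of b being the same in both examples above.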

Trailing Commas

The tokenizer allows trailing commas at the end of an array or an object, although the JSON grammar does not permit them. The reason is that this simplifies the code of the JSON scanner. For example, the following is accepted by our implementation, but not by the grammar:
 * { "a" : "b" , }
 * 
and
 * { "a" : [ "b" , "c" , ] }
 * 
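
To illustrate why tolerating a trailing comma can keep a scanner simple, here is a minimal hand-written parser, not SIREn's scanner. The class `TrailingCommaSketch` is hypothetical and handles only objects, arrays and strings without escape sequences. After each element it accepts either a comma or the closing bracket, so a comma followed directly by the closing bracket needs no special case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mini-parser: objects, arrays and escape-free strings only.
final class TrailingCommaSketch {
    private final String src;
    private int pos = 0;

    private TrailingCommaSketch(String src) { this.src = src; }

    static Object parse(String json) {
        TrailingCommaSketch parser = new TrailingCommaSketch(json);
        Object value = parser.value();
        parser.skipWs();
        if (parser.pos != parser.src.length()) throw parser.error("end of input");
        return value;
    }

    private Object value() {
        skipWs();
        char c = peek();
        if (c == '{') return object();
        if (c == '[') return array();
        if (c == '"') return string();
        throw error("a value");
    }

    private Map<String, Object> object() {
        Map<String, Object> map = new LinkedHashMap<>();
        expect('{');
        skipWs();
        while (peek() != '}') {
            String key = string();
            skipWs();
            expect(':');
            map.put(key, value());
            skipWs();
            // A comma may be followed by another field or directly by '}'.
            if (peek() == ',') { pos++; skipWs(); }
            else if (peek() != '}') throw error("',' or '}'");
        }
        pos++; // consume '}'
        return map;
    }

    private List<Object> array() {
        List<Object> list = new ArrayList<>();
        expect('[');
        skipWs();
        while (peek() != ']') {
            list.add(value());
            skipWs();
            // A comma may be followed by another element or directly by ']'.
            if (peek() == ',') { pos++; skipWs(); }
            else if (peek() != ']') throw error("',' or ']'");
        }
        pos++; // consume ']'
        return list;
    }

    private String string() {
        expect('"');
        int start = pos;
        while (peek() != '"') pos++;
        String text = src.substring(start, pos);
        pos++; // consume the closing quote
        return text;
    }

    private char peek() {
        if (pos >= src.length()) throw error("more input");
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) throw error("'" + c + "'");
        pos++;
    }

    private void skipWs() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String expected) {
        return new IllegalArgumentException("Expected " + expected + " at position " + pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("{ \"a\" : [ \"b\" , \"c\" , ] }"));
    }
}
```

A strict parser would have to remember that a comma was just consumed and reject a closing bracket in that state; this sketch simply omits that check.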

Communication with the Posting Writer

Lucene's {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute payload} interface is used by SIREn to encode information such as the node label and the position of the token. This payload is then decoded by the {@link com.sindicetech.siren.index index API} and encoded back into the node-based inverted index data structure.
 */
package com.sindicetech.siren.analysis;



