
/**
* Programmatic API to search node-based inverted indexes.
*
* Introduction
*
* This package contains the API for building queries to search JSON data
* over node-based inverted indexes. For an introduction to Lucene's
* search API, see the {@link org.apache.lucene.search} package documentation.
*
* Search Basics
*
* In contrast to Lucene's {@link org.apache.lucene.search.Query} API,
* which provides complex querying capabilities to search for documents, SIREn
* provides a {@link com.sindicetech.siren.search.node.NodeQuery} API with
* complex querying capabilities to search for nodes and documents. The
* information retrieved consists not only of the matching documents, but also
* of the matching nodes within these documents.
*
*
*
* SIREn offers a wide variety of
* {@link com.sindicetech.siren.search.node.NodeQuery} implementations. Most of them
* are similar to the ones provided by Lucene's
* {@link org.apache.lucene.search.Query} API. For example, while Lucene
* provides a {@link org.apache.lucene.search.TermQuery} implementation
* to search documents that contain a specific term, SIREn provides a {@link
* com.sindicetech.siren.search.node.NodeTermQuery} implementation to search nodes
* and documents that contain a specific term.
*
*
* Level and Range Constraints
*
* The {@link com.sindicetech.siren.search.node.NodeQuery} provides methods to set
* constraints on the nodes matched by the query. There are two types of
* constraints:
*
* - Level constraint: this constraint will filter out all nodes that do
* not belong to the specified level of the tree.
*
* - Interval constraint: this constraint will filter out all nodes for
* which the last integer of the Dewey code vector is not contained in the
* specified interval.
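*
* The two constraints can be pictured with a small, self-contained sketch.
* It is an illustrative model only and does not use the SIREn API: the class
* and method names are hypothetical, and it assumes that the level of a node
* is the length of its Dewey code vector and that both interval bounds are
* inclusive.
*

```java
// Illustrative model only: these names are hypothetical, not SIREn classes.
final class NodeConstraintSketch {

  // Assumption: the level of a node equals the length of its Dewey code vector.
  static boolean matchesLevel(int[] dewey, int level) {
    return dewey.length == level;
  }

  // Assumption: both bounds of the interval are inclusive.
  static boolean matchesInterval(int[] dewey, int lower, int upper) {
    if (dewey.length == 0) {
      return false; // no last integer to test
    }
    int last = dewey[dewey.length - 1];
    return lower <= last && last <= upper;
  }
}
```

* Under these assumptions, a node with the Dewey code vector {0, 3} passes a
* level constraint of 2 and an interval constraint of [2, 5].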
*
*
* Query Classes
*
* {@link com.sindicetech.siren.search.node.NodeTermQuery}
*
* A {@link com.sindicetech.siren.search.node.NodeTermQuery} matches all the
* nodes that contain the specified {@link org.apache.lucene.index.Term},
* which is a word that occurs in a certain
* {@link org.apache.lucene.document.Field} containing JSON data.
*
* Constructing a {@link com.sindicetech.siren.search.node.NodeTermQuery} is as
* simple as:
*
* NodeTermQuery tq = new NodeTermQuery(new Term("json-field", "term"));
*
*
* In this example, the {@link com.sindicetech.siren.search.node.NodeQuery}
* identifies all {@link org.apache.lucene.document.Document}s that have the
* {@link org.apache.lucene.document.Field} named "json-field"
* where a node contains the word "term".
*
* {@link com.sindicetech.siren.search.node.NodePhraseQuery}
*
* A {@link com.sindicetech.siren.search.node.NodePhraseQuery} matches all the nodes
* containing the specified phrase. A phrase is defined as a sequence of
* {@link org.apache.lucene.index.Term}s.
*
* {@link com.sindicetech.siren.search.node.NodeBooleanQuery}
*
* A {@link com.sindicetech.siren.search.node.NodeBooleanQuery} matches all the
* nodes containing the specified boolean combination of queries.
* A {@link com.sindicetech.siren.search.node.NodeBooleanQuery} contains multiple
* {@link com.sindicetech.siren.search.node.NodeBooleanClause}s, where each clause
* contains a sub-query
* ({@link com.sindicetech.siren.search.node.NodeQuery} instance) and an
* operator (from {@link com.sindicetech.siren.search.node.NodeBooleanClause.Occur})
* describing how that sub-query is combined with the other clauses. The
* semantics of {@link com.sindicetech.siren.search.node.NodeBooleanClause.Occur} are
* identical to those of {@link org.apache.lucene.search.BooleanClause.Occur}.
*
* {@link com.sindicetech.siren.search.node.NodeTermRangeQuery}
*
* A {@link com.sindicetech.siren.search.node.NodeTermRangeQuery} matches all
* nodes containing a term that occurs in the inclusive or exclusive range of a
* lower {@link org.apache.lucene.index.Term Term} and an upper
* {@link org.apache.lucene.index.Term Term} according to
* {@link org.apache.lucene.index.TermsEnum#getComparator TermsEnum.getComparator()}.
* It is not intended for numerical ranges; use
* {@link com.sindicetech.siren.search.node.NodeNumericRangeQuery} instead.
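*
* To see why a term range is unsuitable for numbers, the following sketch
* compares numeric strings lexicographically. It is a simplified model:
* the class name is hypothetical, and String's natural order stands in for
* the term comparator (an assumption that holds for plain ASCII terms).
*

```java
// Illustrative model only: String order stands in for the term comparator.
final class LexicographicRangeSketch {

  // Inclusive range check using lexicographic order.
  static boolean inTermRange(String term, String lower, String upper) {
    return lower.compareTo(term) <= 0 && term.compareTo(upper) <= 0;
  }
}
```

* With this ordering, both "5" and "100" fall inside the term range
* ["10", "90"], although neither number lies between 10 and 90.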
*
* {@link com.sindicetech.siren.search.node.NodeNumericRangeQuery}
*
* A {@link com.sindicetech.siren.search.node.NodeNumericRangeQuery} matches all
* nodes containing a value that occurs in a numeric range. For
* NodeNumericRangeQuery to work, you must index the values using datatypes
* that are configured with the appropriate numeric analyzers
* ({@link com.sindicetech.siren.analysis.NumericAnalyzer}).
*
* {@link com.sindicetech.siren.search.node.NodePrefixQuery},
* {@link com.sindicetech.siren.search.node.NodeWildcardQuery},
* {@link com.sindicetech.siren.search.node.NodeRegexpQuery}
*
* A {@link com.sindicetech.siren.search.node.NodePrefixQuery} matches all nodes
* containing terms that begin with the specified string. A
* {@link com.sindicetech.siren.search.node.NodeWildcardQuery} generalizes this
* by allowing for the use of + (matches 1 or more characters),
* * (matches 0 or more characters) and
* ? (matches exactly one character) wildcards. Note that the
* {@link com.sindicetech.siren.search.node.NodeWildcardQuery} can be quite slow. Also
* note that a {@link com.sindicetech.siren.search.node.NodeWildcardQuery} pattern should
* not start with +, * or ?, as such leading wildcards are extremely slow.
* Some QueryParsers may not allow this by default, but provide a
* setAllowLeadingWildcard method to remove that protection.
* The {@link com.sindicetech.siren.search.node.NodeRegexpQuery} is even more
* general than NodeWildcardQuery, matching all nodes with terms that match a
* regular expression pattern.
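*
* The wildcard semantics described above can be sketched with a naive
* recursive matcher. This is only a model of what the three wildcards mean,
* taken from the description in this section. It says nothing about how
* the query is actually executed, and the class name is hypothetical.
*

```java
// Illustrative model only: a naive matcher for the +, * and ? wildcards.
final class WildcardSketch {

  static boolean matches(String pattern, String term) {
    return matches(pattern, 0, term, 0);
  }

  private static boolean matches(String p, int pi, String t, int ti) {
    if (pi == p.length()) {
      return ti == t.length(); // pattern exhausted: term must be exhausted too
    }
    char c = p.charAt(pi);
    if (c == '*' || c == '+') {
      // '*' may consume zero characters, '+' must consume at least one.
      int first = (c == '*') ? ti : ti + 1;
      for (int k = first; k <= t.length(); k++) {
        if (matches(p, pi + 1, t, k)) {
          return true;
        }
      }
      return false;
    }
    if (ti == t.length()) {
      return false; // a literal or '?' needs one more character
    }
    if (c == '?' || c == t.charAt(ti)) {
      return matches(p, pi + 1, t, ti + 1);
    }
    return false;
  }
}
```

* In this model, the pattern "te*" matches both "te" and "term", while
* "te+" matches "term" but not "te".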
*
* {@link com.sindicetech.siren.search.node.NodeFuzzyQuery}
*
* A {@link com.sindicetech.siren.search.node.NodeFuzzyQuery} matches nodes that
* contain terms similar to the specified term. Similarity is determined using
* Levenshtein (edit) distance.
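*
* As a reminder of what the Levenshtein distance measures, here is a
* minimal dynamic-programming sketch. It is not meant to describe how the
* fuzzy query computes similarity internally, and the class name is
* hypothetical.
*

```java
// Illustrative model only: classic two-row Levenshtein distance.
final class EditDistanceSketch {

  // Minimum number of single-character insertions, deletions and
  // substitutions needed to turn a into b.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
        curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
      }
      int[] tmp = prev; // reuse the two rows instead of allocating new ones
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

* For instance, "term" and "team" are at distance 1, since a single
* substitution turns one into the other.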
*
* {@link com.sindicetech.siren.search.node.TwigQuery}
*
* A {@link com.sindicetech.siren.search.node.TwigQuery} makes it possible to combine
* {@link com.sindicetech.siren.search.node.NodeQuery}s with a Parent-Child or
* Ancestor-Descendant relation. This is the basic building block for
* tree-shaped queries.
*
*
*
* A {@link com.sindicetech.siren.search.node.TwigQuery} is composed of a root and
* of one or more children or descendants:
*
* - The root is a {@link com.sindicetech.siren.search.node.NodeQuery} instance.
* An empty root is treated as a wildcard node query and will match all
* nodes. We call "root nodes" the set of nodes that are retrieved by the
* root query.
*
* - A descendant is a {@link com.sindicetech.siren.search.node.NodeQuery}
* associated with an operator (from
* {@link com.sindicetech.siren.search.node.NodeBooleanClause.Occur}). A
* descendant query will match all the nodes for which there exists a path
* to a root node. A descendant is associated with a node level, which
* corresponds to the relative distance (in terms of levels) from the root.
*
* - A child is a descendant whose level is exactly one greater than the root
* level.
*
*
*
*
* A twig query is always associated with a level. If no level is specified, then
* by default the level is set to 1. When a twig query is used as a child or
* descendant of another twig query, then its level is automatically updated
* according to the level of the parent twig query. For example, consider
* the following instructions:
*
* TwigQuery tw1 = new TwigQuery();
* TwigQuery tw2 = new TwigQuery();
* tw1.addChild(tw2, Occur.MUST);
*
*
* In this example, the first twig query tw1 is defined at the default
* level 1. The second twig query tw2, after the call to
* {@link com.sindicetech.siren.search.node.TwigQuery#addChild(NodeQuery, com.sindicetech.siren.search.node.NodeBooleanClause.Occur)},
* will have its level updated to 2 since it is now a child of a twig query at
* level 1.
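*
* The level update can be modelled with a small stand-alone class. This is a
* sketch of the behaviour described above, not the TwigQuery implementation:
* the class is hypothetical, and propagating the new level to children that
* were attached earlier is an assumption.
*

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: mimics the level bookkeeping of nested twigs.
final class TwigLevelSketch {
  private int level = 1; // default level when none is specified
  private final List<TwigLevelSketch> children = new ArrayList<>();

  int level() {
    return level;
  }

  // Attaching a child places it one level after this twig.
  void addChild(TwigLevelSketch child) {
    children.add(child);
    child.setLevel(level + 1);
  }

  // Assumption: a level change is pushed down to already attached children.
  private void setLevel(int newLevel) {
    level = newLevel;
    for (TwigLevelSketch child : children) {
      child.setLevel(newLevel + 1);
    }
  }
}
```

* In this model, attaching tw2 to tw1 moves tw2 to level 2, and any twig
* already attached to tw2 moves to level 3.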
*
* The Scorer Class
*
* The {@link com.sindicetech.siren.search.node.NodeScorer} abstract class provides
* common scoring functionality for all the node scorer implementations, which
* are at the heart of the SIREn scoring process.
*
*
*
* The implementation of the query processing framework follows a node-at-a-time
* approach, where the query operators (i.e., {@link com.sindicetech.siren.search.node.NodeScorer})
* process one node at a time. The query processing framework has been
* designed for high-efficiency processing:
*
* - All the query operators leverage as much as possible the lazy-loading
* feature of the
* {@link com.sindicetech.siren.index.codecs.siren10.Siren10PostingsReader}. For
* example, there is no concept of next matching document (i.e.,
* {@link org.apache.lucene.search.Scorer#nextDoc()}) in the
* {@link NodeScorer} API, but instead the concept of next candidate
* document (i.e.,
* {@link com.sindicetech.siren.search.node.NodeScorer#nextCandidateDocument()}).
* This enables {@link com.sindicetech.siren.search.node.NodeConjunctionScorer} to
* efficiently iterate over the document identifiers without having to
* decode the node labels until a potential candidate is found.
*
* - The node label array (i.e., {@link org.apache.lucene.util.IntsRef})
* being processed is the same in all the query operators, which means that
* the same array is reused across operators and no new arrays are created
* during the query processing.
*
* - The node label array is itself a slice of the array of the
* uncompressed node block. The node label array is created by sliding a
* window (i.e., {@link org.apache.lucene.util.IntsRef}) over the array of the
* uncompressed node block.
*
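* The candidate-document idea from the first point can be sketched as a
* conjunction over two sorted lists of document identifiers. The classes and
* the skipToCandidate method below are hypothetical and much simpler than
* the actual scorers. The point is only that the loop compares document
* identifiers and never touches node labels before both sides agree.
*

```java
import java.util.Arrays;

// Illustrative model only: conjunction over document identifiers.
final class CandidateConjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // A sorted list of document identifiers with a cursor.
  static final class Postings {
    private final int[] docs;
    private int pos = -1;

    Postings(int... docs) {
      this.docs = docs;
    }

    int doc() {
      if (pos < 0) {
        return -1; // not positioned yet
      }
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Moves the cursor to the first document identifier >= target.
    int skipToCandidate(int target) {
      while (doc() < target) {
        pos++;
      }
      return doc();
    }
  }

  // Returns the next document identifier present in both lists.
  static int nextCandidateDocument(Postings a, Postings b) {
    if (a.doc() == NO_MORE_DOCS) {
      return NO_MORE_DOCS;
    }
    int candidate = a.skipToCandidate(a.doc() + 1);
    while (candidate != NO_MORE_DOCS) {
      int other = b.skipToCandidate(candidate);
      if (other == candidate) {
        return candidate; // only now would node labels need decoding
      }
      if (other == NO_MORE_DOCS) {
        return NO_MORE_DOCS;
      }
      candidate = a.skipToCandidate(other);
    }
    return NO_MORE_DOCS;
  }

  // Collects every candidate document shared by the two lists.
  static int[] allCandidates(Postings a, Postings b) {
    int[] out = new int[Math.min(a.docs.length, b.docs.length)];
    int n = 0;
    int d = nextCandidateDocument(a, b);
    while (d != NO_MORE_DOCS) {
      out[n++] = d;
      d = nextCandidateDocument(a, b);
    }
    return Arrays.copyOf(out, n);
  }
}
```

* With the lists {1, 4, 7, 9} and {2, 4, 9, 12}, the candidates are the
* documents 4 and 9.
*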
*
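* The sliding window from the last two points can be pictured as follows.
* The sketch mirrors the public ints, offset and length fields of
* {@link org.apache.lucene.util.IntsRef} in a hypothetical stand-alone class.
* The way label lengths are supplied to slide is an assumption made for
* illustration.
*

```java
// Illustrative model only: a window over a shared block of node labels.
final class NodeLabelWindowSketch {
  final int[] ints;  // shared backing array (the uncompressed node block)
  int offset = 0;    // start of the current node label
  int length = 0;    // number of integers in the current node label

  NodeLabelWindowSketch(int[] block) {
    this.ints = block; // no copy: the window only points into the block
  }

  // Moves past the current label and exposes the next nextLength integers.
  void slide(int nextLength) {
    offset += length;
    length = nextLength;
  }

  // Reads the i-th integer of the current node label.
  int get(int i) {
    return ints[offset + i];
  }
}
```

* Sliding the window only changes offset and length, so the same backing
* array is visible to every consumer and no per-node array is allocated.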
*/
package com.sindicetech.siren.search.node;