org.apache.lucene.index.package-info Maven / Gradle / Ivy


/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Code to maintain and access indices.
 *
 * <h2>Table Of Contents</h2>
 *
 * <ol>
 *   <li>Index APIs
 *       <ul>
 *         <li>IndexWriter
 *         <li>IndexReader
 *         <li>Segments and docids
 *       </ul>
 *   <li>Field types
 *       <ul>
 *         <li>Postings
 *         <li>Stored Fields
 *         <li>Term Vectors
 *         <li>DocValues
 *         <li>Points
 *         <li>KNN Vectors
 *       </ul>
 *   <li>Postings APIs
 *       <ul>
 *         <li>Terms
 *         <li>Documents
 *         <li>Positions
 *         <li>Impacts
 *       </ul>
 *   <li>Index Statistics
 *       <ul>
 *         <li>Term-level
 *         <li>Field-level
 *         <li>Segment-level
 *         <li>Document-level
 *       </ul>
 * </ol>
 *
 * <h2>Index APIs</h2>
 *
 * <h3>IndexWriter</h3>
 *
 * <p>{@link org.apache.lucene.index.IndexWriter} is used to create an index, and to add, update and
 * delete documents. The IndexWriter class is thread-safe, and enforces a single instance per index.
 * Creating an IndexWriter creates a new index or opens an existing index for writing, in a {@link
 * org.apache.lucene.store.Directory}, depending on the configuration in {@link
 * org.apache.lucene.index.IndexWriterConfig}. A Directory is an abstraction that typically
 * represents a local file-system directory (see various implementations of {@link
 * org.apache.lucene.store.FSDirectory}), but it may also stand for some other storage, such as RAM.
 *
 * <h3>IndexReader</h3>
 *
 * <p>{@link org.apache.lucene.index.IndexReader} is used to read data from the index, and supports
 * searching. Many thread-safe readers may be {@link org.apache.lucene.index.DirectoryReader#open
 * open} concurrently with a single (or no) writer. Each reader maintains a consistent "point in
 * time" view of an index and must be explicitly refreshed (see {@link
 * org.apache.lucene.index.DirectoryReader#openIfChanged(DirectoryReader, IndexWriter)}) in order to
 * incorporate writes that may occur after it is opened.
 *
 * <h3>Segments and docids</h3>
 *
 * <p>Lucene's index is composed of segments, each of which contains a subset of all the documents
 * in the index, and is a complete searchable index in itself, over that subset. As documents are
 * written to the index, new segments are created and flushed to directory storage. Segments are
 * composed of an immutable core and per-commit live documents and doc-value updates. Insertions add
 * new segments. Deletions and doc-value updates in a given segment create a new segment that shares
 * the same core as the previous segment but carries new live docs. Updates are implemented as an
 * atomic insertion and deletion.
 *
 * <p>Over time, the writer merges groups of smaller segments into single larger ones in order to
 * maintain an index that is efficient to search, and to reclaim dead space left behind by deleted
 * (and updated) documents.
 *
 * <p>Each document is identified by a 32-bit number, its "docid," and is composed of a collection
 * of Field values of diverse types (postings, stored fields, term vectors, doc values, points and
 * knn vectors). Docids come in two flavors: global and per-segment. A document's global docid is
 * just the sum of its per-segment docid and that segment's base docid offset. External, high-level
 * APIs only handle global docids, but internal APIs that reference a {@link
 * org.apache.lucene.index.LeafReader}, which is a reader for a single segment, deal in per-segment
 * docids.
 *
 * <p>Docids are assigned sequentially within each segment (starting at 0). Thus the number of
 * documents in a segment is one more than its maximum docid; some may be deleted, but their docids
 * are retained until the segment is merged. When segments merge, their documents are assigned new
 * sequential docids. Accordingly, docid values must always be treated as an internal implementation
 * detail: they must not be exposed as part of an application, nor stored or referenced outside of
 * Lucene's internal APIs.
 *
 * <h2>Field types</h2>
 *
 * <p>Lucene supports a variety of different document field data structures. Lucene's core, the
 * inverted index, is composed of "postings." The postings, with their term dictionary, can be
 * thought of as a map that provides efficient lookup given a {@link org.apache.lucene.index.Term}
 * (roughly, a word or token), to (the ordered list of) {@link org.apache.lucene.document.Document}s
 * containing that Term. Codecs may additionally record {@link
 * org.apache.lucene.index.ImpactsEnum#getImpacts impacts} alongside postings in order to be able to
 * skip over low-scoring documents at search time. Postings do not provide any way of retrieving
 * terms given a document, short of scanning the entire index.
 *
 * <p>Stored fields are essentially the opposite of postings, providing efficient retrieval of field
 * values given a docid. All stored field values for a document are stored together in a block.
 * Different types of stored field provide high-level datatypes such as strings and numbers on top
 * of the underlying bytes. Stored field values are usually retrieved by the searcher using an
 * implementation of {@link org.apache.lucene.index.StoredFieldVisitor}.
 *
 * <p>{@link org.apache.lucene.index.TermVectors} store a per-document inverted index. They are
 * useful for finding similar documents, called MoreLikeThis in Lucene.
 *
 * <p>{@link org.apache.lucene.index.DocValues} fields are what are sometimes referred to as
 * columnar, or column-stride fields, by analogy to relational database terminology, in which
 * documents are considered as rows, and fields, columns. DocValues fields store values per-field: a
 * value for every document is held in a single data structure, providing for rapid, sequential
 * lookup of a field-value given a docid. These fields are used for efficient value-based sorting,
 * for faceting, and sometimes for filtering on the least selective clauses of a query.
 *
 * <p>{@link org.apache.lucene.index.PointValues} represent numeric values using a kd-tree data
 * structure. Efficient 1- and higher dimensional implementations make these the choice for numeric
 * range and interval queries, and geo-spatial queries.
 *
 * <p>{@link org.apache.lucene.index.KnnVectorValues} represent dense numeric vectors whose
 * dimensions may either be bytes or floats. They are indexed in a way that allows searching for
 * nearest neighbors. The vectors are typically produced by a machine-learned model, and used to
 * perform semantic search.
 *
 * <h2>Postings APIs</h2>
 *
 * <h3>Terms</h3>
 *
 * <p>{@link org.apache.lucene.index.Terms} represents the collection of terms within a field,
 * exposes some metadata and statistics, and an API for enumeration.
 *
 * <pre>
 * Terms terms = leafReader.terms("body");
 * // metadata about the field
 * System.out.println("positions? " + terms.hasPositions());
 * System.out.println("offsets?   " + terms.hasOffsets());
 * System.out.println("payloads?  " + terms.hasPayloads());
 * // iterate through terms
 * TermsEnum termsEnum = terms.iterator();
 * BytesRef term = null;
 * while ((term = termsEnum.next()) != null) {
 *   doSomethingWith(term);
 * }
 * </pre>
 *
 * <p>{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list of terms within a
 * field, some statistics about the term, and methods to access the term's documents and positions.
 *
 * <pre>
 * // seek to a specific term
 * boolean found = termsEnum.seekExact(new BytesRef("foobar"));
 * if (found) {
 *   // get the document frequency
 *   System.out.println(termsEnum.docFreq());
 *   // enumerate through documents
 *   PostingsEnum docs = termsEnum.postings(null);
 *   // enumerate through documents and positions
 *   PostingsEnum docsAndPositions = termsEnum.postings(null, PostingsEnum.POSITIONS);
 * }
 * </pre>
 *
 * <h3>Documents</h3>
 *
 * <p>{@link org.apache.lucene.index.PostingsEnum} is an extension of {@link
 * org.apache.lucene.search.DocIdSetIterator} that iterates over the list of documents for a term,
 * along with the term frequency within that document.
 *
 * <pre>
 * // postings for the term the TermsEnum is currently positioned on
 * PostingsEnum docsEnum = termsEnum.postings(null);
 * int docid;
 * while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
 *   System.out.println(docid);
 *   System.out.println(docsEnum.freq());
 * }
 * </pre>
 *
 * <h3>Positions</h3>
 *
 * <p>PostingsEnum also allows iteration of the positions at which a term occurred within the
 * document, and of any additional per-position information (offsets and payload). The information
 * available is controlled by the flags passed to TermsEnum#postings.
 *
 * <pre>
 * int docid;
 * PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS | PostingsEnum.OFFSETS);
 * while ((docid = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
 *   System.out.println(docid);
 *   int freq = postings.freq();
 *   for (int i = 0; i &lt; freq; i++) {
 *     System.out.println(postings.nextPosition());
 *     System.out.println(postings.startOffset());
 *     System.out.println(postings.endOffset());
 *     System.out.println(postings.getPayload());
 *   }
 * }
 * </pre>
 *
 * <h3>Impacts</h3>
 *
 * <p>TermsEnum also allows returning an {@link org.apache.lucene.index.ImpactsEnum}, an extension
 * of PostingsEnum that exposes pareto-optimal tuples of (term frequency, length normalization
 * factor) per block of postings. It is typically used to compute the maximum possible score over
 * these blocks of postings, so that they can be skipped if they cannot possibly produce a
 * competitive hit.
 *
 * <pre>
 * ImpactsEnum impactsEnum = termsEnum.impacts(PostingsEnum.FREQS);
 * int targetDocID = 420;
 * impactsEnum.advanceShallow(targetDocID);
 * // These impacts expose pareto-optimal tuples of (termFreq, lengthNorm) over various ranges of doc IDs.
 * Impacts impacts = impactsEnum.getImpacts();
 * for (int level = 0; level &lt; impacts.numLevels(); level++) {
 *   int docIdUpTo = impacts.getDocIdUpTo(level);
 *   // List of pareto-optimal (termFreq, lengthNorm) tuples between targetDocID inclusive and docIdUpTo inclusive.
 *   List&lt;Impact&gt; perLevelImpacts = impacts.getImpacts(level);
 * }
 * </pre>
 *
 * <h2>Index Statistics</h2>
 *
 * <h3>Term statistics</h3>
 *
 * <ul>
 *   <li>{@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of documents that
 *       contain at least one occurrence of the term. This statistic is always available for an
 *       indexed term. Note that it also counts deleted documents; when segments are merged, the
 *       statistic is updated as those deleted documents are merged away.
 *   <li>{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number of occurrences
 *       of this term across all documents. Like docFreq(), it also counts occurrences that appear
 *       in deleted documents.
 * </ul>
 *
 * <h3>Field statistics</h3>
 *
 * <ul>
 *   <li>{@link org.apache.lucene.index.Terms#size}: Returns the number of unique terms in the
 *       field. This statistic may be unavailable (returns <code>-1</code>) for some Terms
 *       implementations such as {@link org.apache.lucene.index.MultiTerms}, where it cannot be
 *       efficiently computed. Note that this count also includes terms that appear only in deleted
 *       documents; when segments are merged, such terms are merged away and the statistic is then
 *       updated.
 *   <li>{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of documents that
 *       contain at least one occurrence of any term for this field. This can be thought of as a
 *       field-level docFreq(). Like docFreq(), it also counts deleted documents.
 *   <li>{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of postings
 *       (term-document mappings in the inverted index) for the field. This can be thought of as the
 *       sum of {@link org.apache.lucene.index.TermsEnum#docFreq} across all terms in the field, and
 *       like docFreq() it also counts postings that appear in deleted documents.
 *   <li>{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number of tokens for
 *       the field. This can be thought of as the sum of {@link
 *       org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the field, and like
 *       totalTermFreq() it also counts occurrences that appear in deleted documents.
 * </ul>
 *
 * <h3>Segment statistics</h3>
 *
 * <ul>
 *   <li>{@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of documents
 *       (including deleted documents) in the index.
 *   <li>{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number of live documents
 *       (excluding deleted documents) in the index.
 *   <li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the number of deleted
 *       documents in the index.
 * </ul>
 *
 * <h3>Document statistics</h3>
 *
 * <p>Document statistics are available during the indexing process for an indexed field: typically
 * a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some of
 * these values (possibly in a lossy way), into the normalization value for the document in its
 * {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
 *
 * <ul>
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of tokens
 *       for this field in the document. Note that this is just the number of times that {@link
 *       org.apache.lucene.analysis.TokenStream#incrementToken} returned true, and is unrelated to
 *       the values in {@link
 *       org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number of
 *       tokens for this field in the document that had a position increment of zero. This can be
 *       used to compute a document length that discounts artificial tokens such as synonyms.
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated
 *       position value for this field in the document: computed from the values of {@link
 *       org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including {@link
 *       org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued fields.
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total character
 *       offset value for this field in the document: computed from the values of {@link
 *       org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by {@link
 *       org.apache.lucene.analysis.TokenStream#end}, and including {@link
 *       org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued fields.
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number of
 *       unique terms encountered for this field in the document.
 *   <li>{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum
 *       frequency across all unique terms encountered for this field in the document.
 * </ul>
 *
 * <p>Additional user-supplied statistics can be added to the document as DocValues fields and
 * accessed via {@link org.apache.lucene.index.LeafReader#getNumericDocValues}.
 */
package org.apache.lucene.index;
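
The "Segments and docids" section of the Javadoc above states that a global docid is the sum of a per-segment docid and the segment's base offset. The following is a minimal, Lucene-free sketch of that arithmetic, not Lucene's actual implementation: the class `DocIdMath` and its methods are hypothetical names introduced only for illustration, and the sketch assumes every segment holds at least one document.

```java
import java.util.Arrays;

/** Hypothetical model of global vs. per-segment docids; not part of Lucene. */
final class DocIdMath {
    /** docBases[i] is the global docid of the first document in segment i. */
    private final int[] docBases;
    /** Total number of docids across all segments, deleted documents included. */
    private final int maxDoc;

    /** @param segmentMaxDocs number of docids in each segment, in index order */
    DocIdMath(int[] segmentMaxDocs) {
        docBases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            if (segmentMaxDocs[i] <= 0) {
                // Assumption of this sketch: empty segments are not modeled.
                throw new IllegalArgumentException("segment " + i + " must hold at least one document");
            }
            docBases[i] = base;
            base += segmentMaxDocs[i];
        }
        maxDoc = base;
    }

    /** Global docid = segment base offset + per-segment docid (no bounds check on the latter). */
    int toGlobal(int segment, int segmentDocId) {
        return docBases[segment] + segmentDocId;
    }

    /** Returns the index of the segment that owns the given global docid. */
    int segmentOf(int globalDocId) {
        if (globalDocId < 0 || globalDocId >= maxDoc) {
            throw new IndexOutOfBoundsException("docid out of range: " + globalDocId);
        }
        int idx = Arrays.binarySearch(docBases, globalDocId);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the owning segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Converts a global docid back to its per-segment docid. */
    int toSegmentLocal(int globalDocId) {
        return globalDocId - docBases[segmentOf(globalDocId)];
    }

    int maxDoc() {
        return maxDoc;
    }

    public static void main(String[] args) {
        // Three segments with 3, 5 and 2 docids: bases are 0, 3 and 8.
        DocIdMath index = new DocIdMath(new int[] {3, 5, 2});
        System.out.println(index.toGlobal(1, 4));     // 7
        System.out.println(index.segmentOf(7));       // 1
        System.out.println(index.toSegmentLocal(7));  // 4
    }
}
```

Because merges renumber documents, both the bases and the per-segment docids in such a mapping are only valid for one point-in-time view of the index, which is why the Javadoc warns against storing docids outside Lucene.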