Code to maintain and access indices.

Table Of Contents

  1. Postings APIs
  2. Index Statistics

Postings APIs

Fields

{@link org.apache.lucene.index.Fields} is the initial entry point into the postings APIs. It can be obtained in several ways:

// access indexed fields for an index segment
Fields fields = reader.fields();
// access term vector fields for a specified document
Fields termVectorFields = reader.getTermVectors(docid);

Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:

// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}

Terms

{@link org.apache.lucene.index.Terms} represents the collection of terms within a field. It exposes some metadata and statistics, and provides an API for enumeration.

// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
  doSomethingWith(termsEnum.term());
}

{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.

// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  DocsEnum docs = termsEnum.docs(null, null);
  // enumerate through documents and positions
  DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
}

Documents

{@link org.apache.lucene.index.DocsEnum} is an extension of {@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of documents for a term, along with the term frequency within each document.

int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}

Positions

{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of {@link org.apache.lucene.index.DocsEnum} that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).

int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
    System.out.println(docsAndPositionsEnum.nextPosition());
    System.out.println(docsAndPositionsEnum.startOffset());
    System.out.println(docsAndPositionsEnum.endOffset());
    System.out.println(docsAndPositionsEnum.getPayload());
  }
}

Index Statistics

Term statistics

  • {@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it also counts deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
  • {@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) for the field. Like docFreq(), it will also count occurrences that appear in deleted documents.
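
The interaction between these two statistics and deleted documents can be illustrated with a small toy model. This is only a sketch in plain Java, not Lucene code: the class ToyTermStats and all of its methods are hypothetical, and unlike a real merge it does not renumber document ids.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Toy single-field segment: term -> (docId -> frequency). Hypothetical, not a Lucene class. */
final class ToyTermStats {
  private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
  private final BitSet deleted = new BitSet();

  /** Indexes the tokens of one document. */
  void add(int docId, String... tokens) {
    for (String token : tokens) {
      postings.computeIfAbsent(token, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
    }
  }

  /** Marks a document as deleted; the postings themselves are left untouched. */
  void delete(int docId) {
    deleted.set(docId);
  }

  /** Number of documents in the postings for the term, deleted or not. */
  int docFreq(String term) {
    Map<Integer, Integer> docs = postings.get(term);
    return docs == null ? 0 : docs.size();
  }

  /** Number of occurrences of the term across all documents, deleted or not. */
  long totalTermFreq(String term) {
    Map<Integer, Integer> docs = postings.get(term);
    long total = 0;
    if (docs != null) {
      for (int freq : docs.values()) {
        total += freq;
      }
    }
    return total;
  }

  /** Simulates a merge: only postings of live documents are copied (ids are kept for simplicity). */
  ToyTermStats merge() {
    ToyTermStats merged = new ToyTermStats();
    for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
      for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
        if (!deleted.get(p.getKey())) {
          merged.postings.computeIfAbsent(e.getKey(), t -> new TreeMap<>())
              .put(p.getKey(), p.getValue());
        }
      }
    }
    return merged;
  }
}
```

In this model, deleting a document changes neither statistic; only the segment returned by merge() reports the reduced values.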

Field statistics

  • {@link org.apache.lucene.index.Terms#size}: Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
  • {@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a field-level docFreq(). Like docFreq(), it also counts deleted documents.
  • {@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq} across all terms in the field, and like docFreq() it will also count postings that appear in deleted documents.
  • {@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number of tokens for the field. This can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the field, and like totalTermFreq() it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) for the field.
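
To make the relationship between the four field-level statistics concrete, the following sketch computes each of them from a toy in-memory postings map. It uses only the Java standard library; ToyFieldStats and its methods are hypothetical names chosen for illustration and do not exist in Lucene.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model of one field's postings. Hypothetical, not a Lucene class. */
final class ToyFieldStats {
  /** Builds term -> (docId -> frequency) from already-tokenized documents. */
  static Map<String, Map<Integer, Integer>> invert(String[][] docs) {
    Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    for (int docId = 0; docId < docs.length; docId++) {
      for (String token : docs[docId]) {
        postings.computeIfAbsent(token, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
      }
    }
    return postings;
  }

  /** Analogue of Terms#size: number of unique terms in the field. */
  static long size(Map<String, Map<Integer, Integer>> postings) {
    return postings.size();
  }

  /** Analogue of Terms#getDocCount: documents with at least one term in the field. */
  static int docCount(Map<String, Map<Integer, Integer>> postings) {
    Set<Integer> seen = new HashSet<>();
    for (Map<Integer, Integer> docs : postings.values()) {
      seen.addAll(docs.keySet());
    }
    return seen.size();
  }

  /** Analogue of Terms#getSumDocFreq: sum of docFreq over all terms (number of postings). */
  static long sumDocFreq(Map<String, Map<Integer, Integer>> postings) {
    long sum = 0;
    for (Map<Integer, Integer> docs : postings.values()) {
      sum += docs.size();
    }
    return sum;
  }

  /** Analogue of Terms#getSumTotalTermFreq: sum of totalTermFreq over all terms (number of tokens). */
  static long sumTotalTermFreq(Map<String, Map<Integer, Integer>> postings) {
    long sum = 0;
    for (Map<Integer, Integer> docs : postings.values()) {
      for (int freq : docs.values()) {
        sum += freq;
      }
    }
    return sum;
  }
}
```

A document with no tokens in the field contributes to none of these values, which is why the document count for a field can be smaller than the number of documents in the segment.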

Segment statistics

  • {@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of documents (including deleted documents) in the index.
  • {@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number of live documents (excluding deleted documents) in the index.
  • {@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the number of deleted documents in the index.
  • {@link org.apache.lucene.index.Fields#size}: Returns the number of indexed fields.
  • {@link org.apache.lucene.index.Fields#getUniqueTermCount}: Returns the number of indexed terms, that is, the sum of {@link org.apache.lucene.index.Terms#size} across all fields.
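
The three document counts above are tied together by a simple identity: the number of deleted documents is maxDoc minus numDocs. A minimal sketch of that bookkeeping, assuming a live-documents bit set per segment, might look as follows. ToySegmentStats is a hypothetical class written for this illustration and is not part of Lucene.

```java
import java.util.BitSet;

/** Toy segment that tracks live documents in a bit set. Hypothetical, not a Lucene class. */
final class ToySegmentStats {
  private final int maxDoc;
  private final BitSet liveDocs;

  /** Creates a segment holding maxDoc documents, all initially live. */
  ToySegmentStats(int maxDoc) {
    this.maxDoc = maxDoc;
    this.liveDocs = new BitSet(maxDoc);
    this.liveDocs.set(0, maxDoc);
  }

  /** Marks a document as deleted; deleting the same document twice has no further effect. */
  void delete(int docId) {
    liveDocs.clear(docId);
  }

  /** Total number of documents, including deleted ones. */
  int maxDoc() {
    return maxDoc;
  }

  /** Number of live (non-deleted) documents. */
  int numDocs() {
    return liveDocs.cardinality();
  }

  /** Number of deleted documents, derived from the other two counts. */
  int numDeletedDocs() {
    return maxDoc - numDocs();
  }
}
```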

Document statistics

Document statistics are available during the indexing process for an indexed field. Typically, a {@link org.apache.lucene.search.similarities.Similarity} implementation stores some of these values (possibly in a lossy way) in the normalization value for the document, in its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.

  • {@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of tokens for this field in the document. Note that this is just the number of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned true, and is unrelated to the values in {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
  • {@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
  • {@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated position value for this field in the document: computed from the values of {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued fields.
  • {@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total character offset value for this field in the document: computed from the values of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by {@link org.apache.lucene.analysis.TokenStream#end}, and including {@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued fields.
  • {@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number of unique terms encountered for this field in the document.
  • {@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum frequency across all unique terms encountered for this field in the document.
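
As a rough illustration of how several of these per-document values relate to the token stream, the sketch below accumulates them from a sequence of (term, position increment) pairs. It is a simplified stand-in written in plain Java: ToyInvertState and its members are hypothetical, and it deliberately omits position and offset tracking.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy accumulator for per-document field statistics. Hypothetical, not a Lucene class. */
final class ToyInvertState {
  private final Map<String, Integer> termFreqs = new HashMap<>();
  private int length;
  private int numOverlap;
  private int maxTermFrequency;

  /** Records one token together with its position increment. */
  void addToken(String term, int positionIncrement) {
    length++;
    if (positionIncrement == 0) {
      numOverlap++; // token stacked on the previous position, e.g. a synonym
    }
    int freq = termFreqs.merge(term, 1, Integer::sum);
    maxTermFrequency = Math.max(maxTermFrequency, freq);
  }

  /** Number of tokens seen, regardless of their position increments. */
  int length() {
    return length;
  }

  /** Number of tokens that had a position increment of zero. */
  int numOverlap() {
    return numOverlap;
  }

  /** Number of distinct terms seen. */
  int uniqueTermCount() {
    return termFreqs.size();
  }

  /** Highest frequency of any single term. */
  int maxTermFrequency() {
    return maxTermFrequency;
  }

  /** Document length that discounts overlapping tokens such as synonyms. */
  int lengthDiscountingOverlaps() {
    return length - numOverlap;
  }
}
```

For example, feeding it the tokens "quick" (increment 1), "fast" (increment 0), "fox" (increment 1) and "quick" (increment 1) yields a length of four tokens but a discounted length of three, since "fast" occupies the same position as "quick".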

Additional user-supplied statistics can be added to the document as DocValues fields and accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.




