org.apache.lucene.index.package.html





   


Code to maintain and access indices.

Table Of Contents

    

        Postings APIs
            
                Fields
                Terms
                Documents
                Positions
            
        
        Index Statistics
            
                Term-level
                Field-level
                Segment-level
                Document-level
            
        
    


Postings APIs


    Fields


{@link org.apache.lucene.index.Fields} is the initial entry point into the 
postings APIs. It can be obtained in several ways:
// access indexed fields for an index segment
Fields fields = reader.fields();
// access term vector fields for a specified document
Fields fields = reader.getTermVectors(docid);

Fields implements Java's Iterable interface, so it's easy to enumerate the
list of fields:
// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}




    Terms


{@link org.apache.lucene.index.Terms} represents the collection of terms
within a field. It exposes some metadata and statistics,
and provides an API for enumeration.
// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
  doSomethingWith(termsEnum.term());
}

{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list
of terms within a field, some statistics about the term,
and methods to access the term's documents and
positions.
// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
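  // enumerate through documents
  // (null liveDocs: deleted documents are not skipped; null reuse: a new enum is created)
  DocsEnum docsEnum = termsEnum.docs(null, null);
  // enumerate through documents and positions
  // (the result is null if positions were not indexed for the field)
  DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(null, null);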
}




    Documents


{@link org.apache.lucene.index.DocsEnum} is an extension of 
{@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of
documents for a term, along with the term frequency within each document.
int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}
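
The loop above relies on the iterator contract: nextDoc() returns increasing
document ids and then a sentinel once the list is exhausted. The following is a
minimal plain-Java sketch of that contract, not Lucene code. The class ToyDocsEnum
and its constructor are hypothetical names introduced only for illustration; the
sentinel is assumed to be Integer.MAX_VALUE, the value of
DocIdSetIterator.NO_MORE_DOCS.

```java
/** Toy stand-in for a postings iterator; hypothetical, NOT a Lucene class. */
final class ToyDocsEnum {
    // Assumption: mirrors DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds; // sorted document ids containing the term
    private final int[] freqs;  // term frequency within each of those documents
    private int idx = -1;       // -1 means "not positioned yet"

    ToyDocsEnum(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    /** Advances to the next document, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (idx < docIds.length) {
            idx++;
        }
        return idx < docIds.length ? docIds[idx] : NO_MORE_DOCS;
    }

    /** Term frequency in the current document; only valid after a successful nextDoc(). */
    int freq() {
        return freqs[idx];
    }
}
```

As with the real enum, freq() is only meaningful while the iterator is positioned
on a document.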




    Positions


{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of 
{@link org.apache.lucene.index.DocsEnum} that additionally allows iteration
over the positions at which a term occurred within the document, and any additional
per-position information (offsets and payload):
int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
     System.out.println(docsAndPositionsEnum.nextPosition());
     System.out.println(docsAndPositionsEnum.startOffset());
     System.out.println(docsAndPositionsEnum.endOffset());
     System.out.println(docsAndPositionsEnum.getPayload());
  }
}



Index Statistics


    Term statistics


    

       {@link org.apache.lucene.index.TermsEnum#docFreq}: Returns the number of 
           documents that contain at least one occurrence of the term. This statistic 
           is always available for an indexed term. Note that it will also count 
           deleted documents; when segments are merged, the statistic is updated as 
           those deleted documents are merged away.
       
{@link org.apache.lucene.index.TermsEnum#totalTermFreq}: Returns the number 
           of occurrences of this term across all documents. Note that this statistic 
           is unavailable (returns -1) if term frequencies were omitted 
           from the index 
           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
           for the field. Like docFreq(), it will also count occurrences that appear in 
           deleted documents.
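
To make the effect of deletions on these two statistics concrete, here is a
minimal plain-Java sketch. It is a toy model, not Lucene code: the class
ToyTermStats and its methods addDocument, delete and merge are hypothetical names,
and the "merge" simply drops flagged documents, which is an assumption about the
behaviour described above rather than a description of how Lucene merges segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of term-level statistics for a single field; NOT Lucene code. */
final class ToyTermStats {
    // Each inner list holds the tokens of one document.
    private final List<List<String>> docs = new ArrayList<>();
    // Deleted documents are only flagged; their postings remain until merge().
    private final Set<Integer> deleted = new HashSet<>();

    int addDocument(String... tokens) {
        docs.add(Arrays.asList(tokens));
        return docs.size() - 1;
    }

    void delete(int docId) {
        deleted.add(docId);
    }

    /** Documents containing the term at least once, deleted ones included. */
    int docFreq(String term) {
        int n = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    /** Occurrences of the term across all documents, deleted ones included. */
    long totalTermFreq(String term) {
        long n = 0;
        for (List<String> doc : docs) {
            n += Collections.frequency(doc, term);
        }
        return n;
    }

    /** Simulated merge: physically drops flagged documents and renumbers the rest. */
    void merge() {
        List<List<String>> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.contains(i)) {
                live.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }
}
```

In this model, deleting a document leaves docFreq and totalTermFreq unchanged;
both only drop after merge() is called.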
    



    Field statistics


    

       {@link org.apache.lucene.index.Terms#size}: Returns the number of 
           unique terms in the field. This statistic may be unavailable 
           (returns -1) for some Terms implementations such as
           {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently
           computed.  Note that this count also includes terms that appear only
           in deleted documents: when segments are merged such terms are also merged
           away and the statistic is then updated.
       
{@link org.apache.lucene.index.Terms#getDocCount}: Returns the number of
           documents that contain at least one occurrence of any term for this field.
           This can be thought of as a field-level docFreq(). Like docFreq(), it will
           also count deleted documents.
       
{@link org.apache.lucene.index.Terms#getSumDocFreq}: Returns the number of
           postings (term-document mappings in the inverted index) for the field. This
           can be thought of as the sum of {@link org.apache.lucene.index.TermsEnum#docFreq}
           across all terms in the field, and like docFreq() it will also count postings
           that appear in deleted documents.
       
{@link org.apache.lucene.index.Terms#getSumTotalTermFreq}: Returns the number
           of tokens for the field. This can be thought of as the sum of 
           {@link org.apache.lucene.index.TermsEnum#totalTermFreq} across all terms in the
           field, and like totalTermFreq() it will also count occurrences that appear in
           deleted documents, and will be unavailable (returns -1) if term 
           frequencies were omitted from the index 
           ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) 
           for the field.
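
The relationships between the four field-level statistics can be sketched over a
small in-memory postings map. This is a toy model in plain Java, not Lucene code;
ToyFieldStats and its method names are hypothetical and merely echo the statistics
listed above. Deletions and the "unavailable" (-1) cases are deliberately left out.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model of field-level statistics for one field; NOT Lucene code. */
final class ToyFieldStats {
    // term -> (docId -> term frequency within that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds one document; a document may contribute no tokens to this field. */
    void addDocument(String... tokens) {
        int docId = nextDocId++;
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Number of unique terms in the field. */
    long size() {
        return postings.size();
    }

    /** Number of documents with at least one term in the field. */
    int docCount() {
        Set<Integer> seen = new HashSet<>();
        for (Map<Integer, Integer> perDoc : postings.values()) {
            seen.addAll(perDoc.keySet());
        }
        return seen.size();
    }

    /** Number of (term, document) postings: the sum of docFreq over all terms. */
    long sumDocFreq() {
        long sum = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) {
            sum += perDoc.size();
        }
        return sum;
    }

    /** Number of tokens: the sum of totalTermFreq over all terms. */
    long sumTotalTermFreq() {
        long sum = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) {
            for (int freq : perDoc.values()) {
                sum += freq;
            }
        }
        return sum;
    }
}
```

A document that has no tokens in the field is not counted by docCount(), which is
why that value can be smaller than the total number of documents.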
    



    Segment statistics


    

       {@link org.apache.lucene.index.IndexReader#maxDoc}: Returns the number of 
           documents (including deleted documents) in the index. 
       
{@link org.apache.lucene.index.IndexReader#numDocs}: Returns the number 
           of live documents (excluding deleted documents) in the index.
       
{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the
           number of deleted documents in the index.
       
{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed
           fields.
       
{@link org.apache.lucene.index.Fields#getUniqueTermCount}: Returns the number 
           of indexed terms, that is, the sum of {@link org.apache.lucene.index.Terms#size}
           across all fields.
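
The three document counts above are tied together by a simple identity: maxDoc
equals numDocs plus numDeletedDocs. The plain-Java sketch below illustrates this
with a bit set of live documents. It is a toy model, not Lucene code; ToySegment
and its delete method are hypothetical names used only for this example.

```java
import java.util.BitSet;

/** Toy model of segment-level document counts; NOT Lucene code. */
final class ToySegment {
    private final int maxDoc;      // all documents ever added, deleted or not
    private final BitSet liveDocs; // bit set = document is still live

    ToySegment(int maxDoc) {
        this.maxDoc = maxDoc;
        this.liveDocs = new BitSet(maxDoc);
        this.liveDocs.set(0, maxDoc); // every document starts out live
    }

    /** Flags a document as deleted; it still occupies a slot below maxDoc. */
    void delete(int docId) {
        liveDocs.clear(docId);
    }

    int maxDoc() {
        return maxDoc;
    }

    int numDocs() {
        return liveDocs.cardinality();
    }

    int numDeletedDocs() {
        return maxDoc() - numDocs();
    }
}
```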
    



    Document statistics


Document statistics are available during the indexing process for an indexed field. Typically,
a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some
of these values (possibly in a lossy way) in the normalization value for the document in
its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.


    

       {@link org.apache.lucene.index.FieldInvertState#getLength}: Returns the number of 
           tokens for this field in the document. Note that this is just the number
           of times that {@link org.apache.lucene.analysis.TokenStream#incrementToken} returned
           true, and is unrelated to the values in 
           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}.
       
{@link org.apache.lucene.index.FieldInvertState#getNumOverlap}: Returns the number
           of tokens for this field in the document that had a position increment of zero. This
           can be used to compute a document length that discounts artificial tokens
           such as synonyms.
       
{@link org.apache.lucene.index.FieldInvertState#getPosition}: Returns the accumulated
           position value for this field in the document: computed from the values of
           {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} and including
           {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap}s across multivalued
           fields.
       
{@link org.apache.lucene.index.FieldInvertState#getOffset}: Returns the total
           character offset value for this field in the document: computed from the values of
           {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} returned by 
           {@link org.apache.lucene.analysis.TokenStream#end}, and including
           {@link org.apache.lucene.analysis.Analyzer#getOffsetGap}s across multivalued
           fields.
       
{@link org.apache.lucene.index.FieldInvertState#getUniqueTermCount}: Returns the number
           of unique terms encountered for this field in the document.
       
{@link org.apache.lucene.index.FieldInvertState#getMaxTermFrequency}: Returns the maximum
           frequency across all unique terms encountered for this field in the document. 
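
As a rough illustration of how some of these per-document values relate, the
plain-Java sketch below walks a list of tokens with their position increments and
accumulates the length, overlap count, unique term count and maximum term
frequency. It is a simplified model, not Lucene code: ToyInvertState, invert and
discountedLength are hypothetical names, and the position and offset bookkeeping
(including gaps across multivalued fields) is intentionally omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-document, per-field indexing statistics; NOT Lucene code. */
final class ToyInvertState {
    int length;           // number of tokens seen
    int numOverlap;       // tokens whose position increment was zero
    int uniqueTermCount;  // distinct terms seen
    int maxTermFrequency; // highest frequency of any single term

    /** Consumes parallel arrays of terms and their position increments. */
    static ToyInvertState invert(String[] terms, int[] positionIncrements) {
        ToyInvertState state = new ToyInvertState();
        Map<String, Integer> freqs = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            state.length++;
            if (positionIncrements[i] == 0) {
                state.numOverlap++; // e.g. a synonym stacked on the previous token
            }
            int freq = freqs.merge(terms[i], 1, Integer::sum);
            state.maxTermFrequency = Math.max(state.maxTermFrequency, freq);
        }
        state.uniqueTermCount = freqs.size();
        return state;
    }

    /** A document length that discounts overlapping (zero-increment) tokens. */
    int discountedLength() {
        return length - numOverlap;
    }
}
```

For example, the tokens quick, fast (a synonym with increment 0), fox, quick give
a length of 4 with one overlap, so the discounted length is 3.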
    


Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.