org.apache.lucene.search.similarities.package.html


This package contains the various ranking models that can be used in Lucene. The
abstract class {@link org.apache.lucene.search.similarities.Similarity} serves
as the base for ranking functions. For searching, users can employ the models
already implemented or create their own by extending one of the classes in this
package.

Table Of Contents

    

        Summary of the Ranking Methods
        Changing the Similarity
    




Summary of the Ranking Methods

{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene
scoring function. It is based on a highly optimized 
Vector Space Model. For more
information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.

{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized
implementation of the successful Okapi BM25 model.

{@link org.apache.lucene.search.similarities.SimilarityBase} provides a basic
implementation of the Similarity contract and exposes a highly simplified
interface, which makes it an ideal starting point for new ranking functions.
Lucene ships the following methods built on
{@link org.apache.lucene.search.similarities.SimilarityBase}:



  Amati and van Rijsbergen's {@linkplain org.apache.lucene.search.similarities.DFRSimilarity DFR} framework;
  Clinchant and Gaussier's {@linkplain org.apache.lucene.search.similarities.IBSimilarity Information-based models}
    for IR;
  Implementations of two {@linkplain org.apache.lucene.search.similarities.LMSimilarity language models} from
  Zhai and Lafferty's paper.


Since {@link org.apache.lucene.search.similarities.SimilarityBase} is not
optimized to the same extent as
{@link org.apache.lucene.search.similarities.DefaultSimilarity} and
{@link org.apache.lucene.search.similarities.BM25Similarity}, a difference in
performance is to be expected when using the methods listed above. However,
optimizations can always be implemented in subclasses; see
below.


Changing the Similarity

The available Similarities are likely to be sufficient for most
    searching needs.
    Some applications, however, need a customized Similarity implementation. For instance, some
    applications do not need to
    distinguish between shorter and longer documents (see a "fair" similarity).

To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
    searching, and the change must happen before
    either of these actions takes place. Although in theory nothing stops you from changing mid-stream,
    the resulting behavior is not well-defined.


To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (most
    likely by subclassing an existing implementation, be it
    {@link org.apache.lucene.search.similarities.DefaultSimilarity} or a descendant of
    {@link org.apache.lucene.search.similarities.SimilarityBase}), and
    then register the new class by calling
    {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
    before indexing and
    {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
    before searching.


Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}

The easiest way to quickly implement a new ranking method is to extend
{@link org.apache.lucene.search.similarities.SimilarityBase}, which provides
basic implementations of the low-level Similarity methods. Subclasses are only required to
implement the {@link org.apache.lucene.search.similarities.SimilarityBase#score(BasicStats, float, float)}
and {@link org.apache.lucene.search.similarities.SimilarityBase#toString()}
methods.
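
The shape of that two-method contract can be sketched in plain Java. To stay runnable without Lucene on the classpath, the sketch below uses hypothetical stand-ins (`Stats`, `RankingBase`, `RelativeTfIdf`) that only mirror the signatures named above; a real subclass would extend SimilarityBase and receive a BasicStats. The toy formula is an illustration, not one of the shipped models.

```java
public class SimpleRankingSketch {

    /** Hypothetical stand-in for the statistics a scorer receives (not Lucene's BasicStats). */
    static final class Stats {
        final long numberOfDocuments; // documents in the collection
        final long docFreq;           // documents containing the term

        Stats(long numberOfDocuments, long docFreq) {
            this.numberOfDocuments = numberOfDocuments;
            this.docFreq = docFreq;
        }
    }

    /** Hypothetical stand-in for the base class: subclasses supply score() and toString(). */
    abstract static class RankingBase {
        abstract float score(Stats stats, float freq, float docLen);

        @Override
        public abstract String toString();
    }

    /** Toy ranking function: relative term frequency times a smoothed idf. */
    static final class RelativeTfIdf extends RankingBase {
        @Override
        float score(Stats stats, float freq, float docLen) {
            double idf = Math.log(1.0 + (double) stats.numberOfDocuments / (stats.docFreq + 1.0));
            return (float) ((freq / docLen) * idf);
        }

        @Override
        public String toString() {
            return "RelativeTfIdf";
        }
    }

    public static void main(String[] args) {
        RankingBase sim = new RelativeTfIdf();
        Stats stats = new Stats(100, 9);
        // Same term and document length, different within-document frequency.
        System.out.println(sim + " freq=1: " + sim.score(stats, 1f, 10f));
        System.out.println(sim + " freq=2: " + sim.score(stats, 2f, 10f));
    }
}
```

Everything else a ranking function needs (collecting statistics, producing explanations) is what the base class is described as providing, which is why only these two methods appear here.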

Another option is to extend one of the frameworks
based on {@link org.apache.lucene.search.similarities.SimilarityBase}. These
Similarities are implemented modularly, e.g.
{@link org.apache.lucene.search.similarities.DFRSimilarity} delegates
computation of the three parts of its formula to the classes
{@link org.apache.lucene.search.similarities.BasicModel},
{@link org.apache.lucene.search.similarities.AfterEffect} and
{@link org.apache.lucene.search.similarities.Normalization}. Instead of
subclassing the Similarity, one can simply introduce a new basic model and tell
{@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
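
The modular idea can be illustrated with a small standalone sketch. The interfaces `Model`, `Effect` and `Normalizer`, the class `ModularScorer`, and the toy formulas are all hypothetical; Lucene's BasicModel, AfterEffect and Normalization are abstract classes with richer signatures. The sketch also assumes the three parts combine as a product over a length-normalized term frequency.

```java
public class ModularScoringSketch {

    // Hypothetical component interfaces, one per part of the formula.
    interface Model { double score(double tfn); }
    interface Effect { double score(double tfn); }
    interface Normalizer { double tfn(double freq, double docLen, double avgDocLen); }

    // Toy components, chosen only to keep the arithmetic easy to follow.
    static final Normalizer LENGTH_RATIO = (freq, docLen, avgDocLen) -> freq * avgDocLen / docLen;
    static final Effect DAMPING = tfn -> 1.0 / (tfn + 1.0);
    static final Model LINEAR = tfn -> tfn;
    static final Model LOG = tfn -> Math.log1p(tfn);

    /** Scorer that delegates each part of its formula to a pluggable component. */
    static final class ModularScorer {
        private final Model model;
        private final Effect effect;
        private final Normalizer normalizer;

        ModularScorer(Model model, Effect effect, Normalizer normalizer) {
            this.model = model;
            this.effect = effect;
            this.normalizer = normalizer;
        }

        // Assumption of this sketch: the parts combine as a product.
        double score(double freq, double docLen, double avgDocLen) {
            double tfn = normalizer.tfn(freq, docLen, avgDocLen);
            return model.score(tfn) * effect.score(tfn);
        }
    }

    public static void main(String[] args) {
        // Swapping the model requires no change to ModularScorer itself.
        ModularScorer linear = new ModularScorer(LINEAR, DAMPING, LENGTH_RATIO);
        ModularScorer log = new ModularScorer(LOG, DAMPING, LENGTH_RATIO);
        System.out.println("linear model: " + linear.score(3, 20, 10));
        System.out.println("log model:    " + log.score(3, 20, 10));
    }
}
```

The point of the design is that a new basic model is a small, separately testable unit, while the scorer that wires the parts together stays untouched.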

Changing {@linkplain org.apache.lucene.search.similarities.DefaultSimilarity}

    If you are interested in use cases for changing your similarity, see the Lucene users' mailing list thread "Overriding Similarity".
    In summary, here are a few use cases:
    

        SweetSpotSimilarity: The SweetSpotSimilarity in
            org.apache.lucene.misc gives small
            score increases while the frequency grows by a small amount
            and greater increases once you hit the "sweet spot", i.e. the range where
            you consider the frequency of terms to be more significant.
        Overriding tf: In some applications, it doesn't matter what the score of a document is as long as a
            matching term occurs. In these
            cases people have overridden Similarity to return 1 from the tf() method.
        Changing Length Normalization: By overriding
            {@link org.apache.lucene.search.similarities.Similarity#computeNorm(FieldInvertState state)},
            it is possible to discount how the length of a field contributes
            to a score. In {@link org.apache.lucene.search.similarities.DefaultSimilarity},
            lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
            1 / (numTerms in field), all fields will be treated
            "fairly".
    
    In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):
    [One would override the Similarity in] ... any situation where you know more about your data than just
        that
        it's "text" is a situation where it *might* make sense to override your
        Similarity method.
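
The tf and length-normalization use cases listed above come down to simple arithmetic, which the standalone sketch below prints side by side. It does not call Lucene: the class and method names (`SimilarityTweaksDemo`, `sqrtLengthNorm`, `linearLengthNorm`, `flatTf`) are hypothetical, and the sketch ignores boosts and the way norms are stored in the index.

```java
public class SimilarityTweaksDemo {

    // Length norm quoted above for DefaultSimilarity: 1 / (numTerms)^0.5.
    static double sqrtLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Alternative from the text: 1 / numTerms.
    static double linearLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    // "Overriding tf" use case: every match contributes 1, however often the term occurs.
    static double flatTf(double freq) {
        return 1.0;
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {4, 100, 10000}) {
            System.out.printf("numTerms=%d sqrtNorm=%.4f linearNorm=%.4f%n",
                    numTerms, sqrtLengthNorm(numTerms), linearLengthNorm(numTerms));
        }
        System.out.println("flatTf(1)=" + flatTf(1) + " flatTf(50)=" + flatTf(50));
    }
}
```

Running it shows how much more steeply the linear variant discounts long fields than the square-root variant does, which is the effect to weigh before swapping one for the other.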