/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains the various ranking models that can be used in Lucene. The
* abstract class {@link org.apache.lucene.search.similarities.Similarity} serves
* as the base for ranking functions. For searching, users can employ the models
* already implemented or create their own by extending one of the classes in this
* package.
*
* Table Of Contents
*
* - Summary of the Ranking Methods
* - Changing Similarity
*
* Summary of the Ranking Methods
*
* {@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene
* scoring function. It is based on a highly optimized
* Vector Space Model. For more
* information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.
*
*
{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized
* implementation of the successful Okapi BM25 model.
*
*
{@link org.apache.lucene.search.similarities.SimilarityBase} provides a basic
* implementation of the Similarity contract and exposes a highly simplified
* interface, which makes it an ideal starting point for new ranking functions.
* Lucene ships the following methods built on
* {@link org.apache.lucene.search.similarities.SimilarityBase}:
*
*
*
* - Amati and van Rijsbergen's {@linkplain org.apache.lucene.search.similarities.DFRSimilarity DFR} framework;
* - Clinchant and Gaussier's {@linkplain org.apache.lucene.search.similarities.IBSimilarity Information-based models}
* for IR;
* - Implementations of two {@linkplain org.apache.lucene.search.similarities.LMSimilarity language models} from
* Zhai and Lafferty's paper;
* - {@linkplain org.apache.lucene.search.similarities.DFISimilarity Divergence from independence} models as described
* in "IRRA at TREC 2012" (Dinçer).
*
*
*
* Since {@link org.apache.lucene.search.similarities.SimilarityBase} is not
* optimized to the same extent as
* {@link org.apache.lucene.search.similarities.DefaultSimilarity} and
* {@link org.apache.lucene.search.similarities.BM25Similarity}, a difference in
* performance is to be expected when using the methods listed above. However,
* optimizations can always be implemented in subclasses; see
* Changing Similarity below.
*
*
* Changing Similarity
*
* Chances are the available Similarities are sufficient for all
* your searching needs.
* However, in some applications it may be necessary to customize your Similarity implementation. For instance, some
* applications do not need to
* distinguish between shorter and longer documents (see a "fair" similarity).
*
*
* To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
* searching, and the change must happen before
* either of these actions takes place. Although in theory nothing stops you from changing mid-stream,
* what happens then is not well-defined.
*
*
* To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (most likely
* by subclassing an existing implementation, be it
* {@link org.apache.lucene.search.similarities.DefaultSimilarity} or a descendant of
* {@link org.apache.lucene.search.similarities.SimilarityBase}), and
* then register an instance of the new class by calling
* {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
* before indexing and
* {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
* before searching.
*
*
Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}
*
* The easiest way to quickly implement a new ranking method is to extend
* {@link org.apache.lucene.search.similarities.SimilarityBase}, which provides
* basic implementations for the low-level Similarity methods. Subclasses are only required to
* implement the {@link org.apache.lucene.search.similarities.SimilarityBase#score(BasicStats, float, float)}
* and {@link org.apache.lucene.search.similarities.SimilarityBase#toString()}
* methods.
*
*
Another option is to extend one of the frameworks
* based on {@link org.apache.lucene.search.similarities.SimilarityBase}. These
* Similarities are implemented modularly, e.g.
* {@link org.apache.lucene.search.similarities.DFRSimilarity} delegates
* computation of the three parts of its formula to the classes
* {@link org.apache.lucene.search.similarities.BasicModel},
* {@link org.apache.lucene.search.similarities.AfterEffect} and
* {@link org.apache.lucene.search.similarities.Normalization}. Instead of
* subclassing the Similarity, one can simply introduce a new basic model and tell
* {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
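*
* As a rough illustration of this modular design, the stand-alone sketch below composes a score from three
* pluggable parts. All names (ToyNormalization, ToyBasicModel, ToyAfterEffect, ToyModularSimilarity,
* ToyComponents) and all formulas are hypothetical and chosen only for illustration; they are not the Lucene
* classes or their signatures, and the code does not depend on Lucene.
*
```java
// Hypothetical stand-ins that only mirror the three-part shape described above.
// They are NOT Lucene's BasicModel, AfterEffect or Normalization classes,
// and every formula here is a toy chosen for illustration.
interface ToyNormalization {
    double tfn(double freq, double docLen, double avgDocLen);
}

interface ToyBasicModel {
    double score(double tfn, long numDocs, long docFreq);
}

interface ToyAfterEffect {
    double score(double tfn);
}

final class ToyModularSimilarity {
    private final ToyBasicModel basicModel;
    private final ToyAfterEffect afterEffect;
    private final ToyNormalization normalization;

    ToyModularSimilarity(ToyBasicModel basicModel, ToyAfterEffect afterEffect,
                         ToyNormalization normalization) {
        this.basicModel = basicModel;
        this.afterEffect = afterEffect;
        this.normalization = normalization;
    }

    // Normalize the raw frequency once, then multiply the two model parts.
    double score(double freq, double docLen, double avgDocLen, long numDocs, long docFreq) {
        double tfn = normalization.tfn(freq, docLen, avgDocLen);
        return basicModel.score(tfn, numDocs, docFreq) * afterEffect.score(tfn);
    }
}

final class ToyComponents {
    private ToyComponents() {}

    // Scale the frequency by how the document length compares to the average.
    static final ToyNormalization LENGTH_SCALED =
        (freq, docLen, avgDocLen) -> freq * avgDocLen / docLen;

    // Toy basic model: terms found in fewer documents score higher.
    static final ToyBasicModel RARITY =
        (tfn, numDocs, docFreq) -> tfn * Math.log((numDocs + 1.0) / (docFreq + 0.5));

    // A "new" basic model that ignores collection statistics entirely.
    static final ToyBasicModel CONSTANT = (tfn, numDocs, docFreq) -> 1.0;

    // Dampen the gain from each additional occurrence.
    static final ToyAfterEffect DAMPING = tfn -> 1.0 / (tfn + 1.0);
}
```
*
* In this sketch, replacing RARITY with CONSTANT changes the ranking behaviour without touching
* ToyModularSimilarity itself, which is the kind of substitution the paragraph above describes.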
*
*
Changing {@linkplain org.apache.lucene.search.similarities.DefaultSimilarity}
*
* If you are interested in use cases for changing your similarity, see "Overriding Similarity" on the
* Lucene users' mailing list.
* In summary, here are a few use cases:
*
* - The SweetSpotSimilarity in org.apache.lucene.misc gives small
* increases as the frequency increases a small amount,
* and then greater increases when you hit the "sweet spot", i.e. where
* you think the frequency of terms is more significant.
* - Overriding tf: in some applications, it does not matter how often a term occurs in a document, only that a
* matching term occurs. In these
* cases people have overridden Similarity to return 1 from the tf() method.
* - Changing length normalization: by overriding
* {@link org.apache.lucene.search.similarities.Similarity#computeNorm(org.apache.lucene.index.FieldInvertState state)},
* it is possible to discount how the length of a field contributes
* to a score. In {@link org.apache.lucene.search.similarities.DefaultSimilarity},
* lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
* 1 / (numTerms in field), all fields will be treated
* "fairly".
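*
* The arithmetic behind the last two use cases can be sketched in isolation. The class below is a hypothetical,
* stand-alone helper (LengthNormSketch and its method names are illustrative, not part of Lucene); it only
* evaluates the two length-normalization formulas quoted above and a constant tf, and does not show how to
* wire them into a Similarity subclass.
*
```java
// Stand-alone arithmetic only; this class does not extend or call any Lucene type.
final class LengthNormSketch {
    private LengthNormSketch() {}

    // Length norm quoted above for DefaultSimilarity: 1 / (numTerms)^0.5.
    static double sqrtLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Alternative quoted above: 1 / numTerms.
    static double linearLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    // tf override from the text: every matching document gets the same tf of 1,
    // no matter how often the term occurs.
    static double constantTf(double freq) {
        return 1.0;
    }
}
```
*
* For a 4-term field and a 100-term field, the square-root form yields 0.5 and 0.1, while the linear form
* yields 0.25 and 0.01, so the two formulas discount the 25-times-longer field by a factor of 5 and 25
* respectively.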
*
* In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):
* [One would override the Similarity in] ... any situation where you know more about your data than just
* that it's "text" is a situation where it *might* make sense to override your
* Similarity method.
*/
package org.apache.lucene.search.similarities;