org.apache.lucene.search.similarities.package-info

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * This package contains the various ranking models that can be used in Lucene. The
 * abstract class {@link org.apache.lucene.search.similarities.Similarity} serves
 * as the base for ranking functions. For searching, users can employ the models
 * already implemented or create their own by extending one of the classes in this
 * package.
 * 
 * 

Table Of Contents

  1. Summary of the Ranking Methods
  2. Changing the Similarity

Summary of the Ranking Methods

{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized implementation of the successful Okapi BM25 model.

{@link org.apache.lucene.search.similarities.ClassicSimilarity} is the original Lucene scoring function. It is based on the Vector Space Model. For more information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.

{@link org.apache.lucene.search.similarities.SimilarityBase} provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on {@link org.apache.lucene.search.similarities.SimilarityBase}:

  • Amati and Rijsbergen's {@linkplain org.apache.lucene.search.similarities.DFRSimilarity DFR} framework;
  • Clinchant and Gaussier's {@linkplain org.apache.lucene.search.similarities.IBSimilarity Information-based models} for IR;
  • The implementation of two {@linkplain org.apache.lucene.search.similarities.LMSimilarity language models} from Zhai and Lafferty's paper;
  • {@linkplain org.apache.lucene.search.similarities.DFISimilarity Divergence from independence} models as described in "IRRA at TREC 2012" (Dinçer).
Since {@link org.apache.lucene.search.similarities.SimilarityBase} is not optimized to the same extent as {@link org.apache.lucene.search.similarities.ClassicSimilarity} and {@link org.apache.lucene.search.similarities.BM25Similarity}, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see below.

Changing Similarity

Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents and could set BM25's {@link org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b} parameter to {@code 0}.

To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and searching, and the changes must happen before either of these actions takes place. Although in theory nothing stops you from changing mid-stream, the resulting behavior is not well-defined.

To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely you'll want to simply subclass {@link org.apache.lucene.search.similarities.SimilarityBase}), and then register the new class by calling {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)} before indexing and {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)} before searching.
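As a minimal sketch of the registration steps described above, the following program sets the same Similarity on the writer configuration before indexing and on the searcher before searching. The class name SimilaritySetupSketch, the method indexAndSearch, the field name "body" and the sample text are illustrative assumptions; the in-memory ByteBuffersDirectory and StandardAnalyzer are chosen only to keep the example short and assume a recent Lucene release on the classpath.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical example class, not part of Lucene.
final class SimilaritySetupSketch {

  // Indexes one document and runs one query, using the same Similarity for both.
  static int indexAndSearch(Similarity similarity) throws IOException {
    try (Directory dir = new ByteBuffersDirectory()) {
      // Register the Similarity before any document is indexed.
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      config.setSimilarity(similarity);
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        Document doc = new Document();
        doc.add(new TextField("body", "lucene ranking models", Field.Store.NO));
        writer.addDocument(doc);
      }

      // Register the same Similarity before any search is executed.
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(similarity);
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        return hits.scoreDocs.length;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // b = 0 disables length normalization; k1 keeps its default of 1.2.
    Similarity similarity = new BM25Similarity(1.2f, 0f);
    System.out.println("hits: " + indexAndSearch(similarity));
  }
}
```

Passing one Similarity instance to both calls is a simple way to make sure indexing and searching cannot drift apart.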

Tuning {@linkplain org.apache.lucene.search.similarities.BM25Similarity}

{@link org.apache.lucene.search.similarities.BM25Similarity} has two parameters that may be tuned:

  • k1, which calibrates term frequency saturation and must be non-negative. A value of {@code 0} causes term frequency to be ignored entirely, so documents are scored only on the IDF of the matched terms. Higher values of k1 increase the impact of term frequency on the final score. The default value is {@code 1.2}.
  • b, which controls how much document length normalizes term frequency values and must be in {@code [0, 1]}. A value of {@code 0} disables length normalization completely. The default value is {@code 0.75}.
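To make the effect of the two parameters concrete, here is a small sketch of the textbook Okapi BM25 term-frequency factor in plain Java. It is an illustration only: the class Bm25TfSketch and its method are hypothetical, and Lucene's own implementation may differ in constant factors and in how it stores document length.

```java
// Hypothetical helper, not part of Lucene: the textbook Okapi BM25
// term-frequency factor, used only to illustrate the roles of k1 and b.
final class Bm25TfSketch {

  // Computes tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
  static double tfFactor(double tf, double docLen, double avgDocLen, double k1, double b) {
    if (k1 < 0 || b < 0 || b > 1) {
      throw new IllegalArgumentException("k1 must be >= 0 and b must be in [0, 1]");
    }
    double lengthNorm = 1 - b + b * docLen / avgDocLen;
    return tf * (k1 + 1) / (tf + k1 * lengthNorm);
  }

  public static void main(String[] args) {
    double avgLen = 100;

    // k1 = 0: the factor no longer depends on tf, so only IDF would matter.
    System.out.println(tfFactor(1, 100, avgLen, 0.0, 0.75) == tfFactor(50, 100, avgLen, 0.0, 0.75));

    // b = 0: document length has no effect on the factor.
    System.out.println(tfFactor(3, 50, avgLen, 1.2, 0.0) == tfFactor(3, 500, avgLen, 1.2, 0.0));

    // Defaults (k1 = 1.2, b = 0.75): the longer document gets the smaller factor.
    System.out.println(tfFactor(3, 50, avgLen, 1.2, 0.75) > tfFactor(3, 500, avgLen, 1.2, 0.75));

    // Higher k1: going from tf = 1 to tf = 10 changes the factor by a larger ratio.
    double ratioLowK1 = tfFactor(10, 100, avgLen, 1.2, 0.75) / tfFactor(1, 100, avgLen, 1.2, 0.75);
    double ratioHighK1 = tfFactor(10, 100, avgLen, 3.0, 0.75) / tfFactor(1, 100, avgLen, 3.0, 0.75);
    System.out.println(ratioHighK1 > ratioLowK1);
  }
}
```

Each line printed by main is a boolean check of one statement from the list above.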

Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}

The easiest way to quickly implement a new ranking method is to extend {@link org.apache.lucene.search.similarities.SimilarityBase}, which provides basic implementations of the low-level parts of the Similarity contract. Subclasses are only required to implement the {@link org.apache.lucene.search.similarities.SimilarityBase#score(BasicStats, double, double)} and {@link org.apache.lucene.search.similarities.SimilarityBase#toString()} methods.
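As a rough sketch of such a subclass, the class below implements only the two required methods. The class name LogTfIdfSimilarity and its scoring formula are invented for illustration and are not a model shipped with Lucene; the example assumes a Lucene release in which score takes (BasicStats, double, double), as referenced above.

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

// Hypothetical ranking function, not shipped with Lucene.
final class LogTfIdfSimilarity extends SimilarityBase {

  @Override
  protected double score(BasicStats stats, double freq, double docLen) {
    // Dampened term frequency; log(1 + freq) stays non-negative for freq >= 0.
    double tf = Math.log(1 + freq);
    // Smoothed inverse document frequency from collection-level statistics.
    double idf = Math.log(1 + (double) stats.getNumberOfDocuments() / (stats.getDocFreq() + 1));
    // docLen is deliberately unused: this sketch applies no length normalization.
    return stats.getBoost() * tf * idf;
  }

  @Override
  public String toString() {
    return "LogTfIdf";
  }
}
```

An instance of this class can be registered like any other Similarity, through IndexWriterConfig#setSimilarity before indexing and IndexSearcher#setSimilarity before searching.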

Another option is to extend one of the frameworks based on {@link org.apache.lucene.search.similarities.SimilarityBase}. These Similarities are implemented modularly, e.g. {@link org.apache.lucene.search.similarities.DFRSimilarity} delegates computation of the three parts of its formula to the classes {@link org.apache.lucene.search.similarities.BasicModel}, {@link org.apache.lucene.search.similarities.AfterEffect} and {@link org.apache.lucene.search.similarities.Normalization}. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
 */
package org.apache.lucene.search.similarities;




