/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Another highlighter implementation based on term vectors.
*
* Features
*
* - fast for large docs
* - support N-gram fields
* - support phrase-unit highlighting with slops
* - support multi-term queries (wildcard, range, regexp, etc.)
* - fields to be highlighted must be indexed with term vectors that store positions and offsets
* - take query boost and/or IDF weight into account when scoring fragments
* - support colored highlight tags
* - pluggable FragListBuilder / FieldFragList
* - pluggable FragmentsBuilder
*
*
* Algorithm
* To explain the algorithm, let's use the following sample text
* (to be highlighted) and user query:
*
*
*
* Sample Text
* Lucene is a search engine library.
*
*
* User Query
* Lucene^2 OR "search library"~1
*
*
*
* The user query is a BooleanQuery that consists of TermQuery("Lucene")
* with a boost of 2 and PhraseQuery("search library") with a slop of 1.
* For your convenience, here are the offsets and positions of the
* sample text.
*
*
* +--------+-----------------------------------+
* |        |          1111111111222222222233333|
* |  offset|01234567890123456789012345678901234|
* +--------+-----------------------------------+
* |document|Lucene is a search engine library. |
* +--------+-----------------------------------+
* |position|0      1  2 3      4      5        |
* +--------+-----------------------------------+
*
*
* Step 1.
* In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query.
* QueryPhraseMap consists of the following members:
*
* public class QueryPhraseMap {
*   boolean terminal;
*   int slop;     // valid if terminal == true and phraseHighlight == true
*   float boost;  // valid if terminal == true
*   Map<String, QueryPhraseMap> subMap;
* }
*
* Each QueryPhraseMap has a subMap. A key of the subMap is the text of a term
* in the user query, and the value is the subsequent QueryPhraseMap.
* If the query is a single term (not a phrase), the subsequent QueryPhraseMap
* is marked as terminal. If the query is a phrase, the subsequent QueryPhraseMap
* is not terminal, and its own subMap holds the next term text in the phrase.
*
* From the sample user query, the following QueryPhraseMap
* will be generated:
*
*   QueryPhraseMap
*   +--------+-+  +-------+-+
*   |"Lucene"|o+->|boost=2|*|  * : terminal
*   +--------+-+  +-------+-+
*
*   +--------+-+  +---------+-+  +-------+------+-+
*   |"search"|o+->|"library"|o+->|boost=1|slop=1|*|
*   +--------+-+  +---------+-+  +-------+------+-+
*
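* The nested-map structure described in Step 1 can be sketched as follows. This is a
* minimal, hypothetical stand-in (the class PhraseMapSketch and its methods add, lookup
* and sample are invented for illustration and are not part of Lucene); it only shows
* how one node per term is chained and how the last node of a term or phrase becomes terminal.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for QueryPhraseMap; not the Lucene class itself.
final class PhraseMapSketch {
    boolean terminal;
    int slop;
    float boost;
    final Map<String, PhraseMapSketch> subMap = new HashMap<>();

    // Walks (or creates) one node per term, then marks the last node as terminal.
    void add(List<String> terms, int slop, float boost) {
        PhraseMapSketch node = this;
        for (String term : terms) {
            node = node.subMap.computeIfAbsent(term, t -> new PhraseMapSketch());
        }
        node.terminal = true;
        node.slop = slop;
        node.boost = boost;
    }

    // Follows the terms down the map; returns null if the sequence is not in the query.
    PhraseMapSketch lookup(List<String> terms) {
        PhraseMapSketch node = this;
        for (String term : terms) {
            node = node.subMap.get(term);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    // Builds the map for the sample query: Lucene^2 OR "search library"~1
    static PhraseMapSketch sample() {
        PhraseMapSketch root = new PhraseMapSketch();
        root.add(Arrays.asList("Lucene"), 0, 2.0f);
        root.add(Arrays.asList("search", "library"), 1, 1.0f);
        return root;
    }
}
```

* In this sketch, looking up "search" alone yields a non-terminal node, while
* "search" followed by "library" yields a terminal node with slop=1 and boost=1.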
*
* Step 2.
* In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack} from term vector data,
* which must be stored with offsets and positions (see {@link org.apache.lucene.document.FieldType#setStoreTermVectorOffsets(boolean)} and {@link org.apache.lucene.document.FieldType#setStoreTermVectorPositions(boolean)}).
* FieldTermStack keeps only those terms of the field that appear in the user query.
* Therefore, in this sample case, Fast Vector Highlighter generates the following FieldTermStack:
*
*   FieldTermStack
*   +------------------+
*   |"Lucene"(0,6,0)   |
*   +------------------+
*   |"search"(12,18,3) |
*   +------------------+
*   |"library"(26,33,5)|
*   +------------------+
*   where : "termText"(startOffset,endOffset,position)
*
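* The entries of the stack above can be reproduced with a small sketch. Note the
* assumptions: the real highlighter reads offsets and positions from stored term
* vectors, whereas this hypothetical TermStackSketch class simply re-tokenizes the
* sample text on non-alphanumeric characters to obtain the same numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; Lucene reads this data from term vectors instead.
final class TermStackSketch {
    static final class TermInfo {
        final String text;
        final int startOffset;
        final int endOffset;
        final int position;

        TermInfo(String text, int startOffset, int endOffset, int position) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        // Same notation as the diagram: "termText"(startOffset,endOffset,position)
        public String toString() {
            return "\"" + text + "\"(" + startOffset + "," + endOffset + "," + position + ")";
        }
    }

    // Scans the text token by token and keeps only tokens that occur in the query.
    static List<TermInfo> build(String text, Set<String> queryTerms) {
        List<TermInfo> stack = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i);
            if (queryTerms.contains(term)) {
                stack.add(new TermInfo(term, start, i, position));
            }
            position++; // every token advances the position, matched or not
        }
        return stack;
    }
}
```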
* Step 3.
* In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList}
* by reference to QueryPhraseMap and FieldTermStack.
*
*   FieldPhraseList
*   +----------------+-----------------+---+
*   |"Lucene"        |[(0,6)]          |w=2|
*   +----------------+-----------------+---+
*   |"search library"|[(12,18),(26,33)]|w=1|
*   +----------------+-----------------+---+
*
* The type of each entry is WeightedPhraseInfo, which consists of
* an array of term offsets and a weight.
*
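* How the phrase entry passes the slop check can be sketched as below. This is a
* deliberately simplified, hypothetical matcher (PhraseListSketch, Term, Phrase and
* match are invented names): it assumes the phrase terms are adjacent in the term
* stack and accepts a match when the number of skipped positions does not exceed
* the slop. The actual FieldPhraseList logic is more general.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified phrase matcher; not the Lucene implementation.
final class PhraseListSketch {
    // One term occurrence: text(startOffset,endOffset,position).
    static final class Term {
        final String text;
        final int start;
        final int end;
        final int position;

        Term(String text, int start, int end, int position) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    // One matched phrase: its text, the offsets of its terms, and its weight.
    static final class Phrase {
        final String text;
        final List<int[]> offsets;
        final float weight;

        Phrase(String text, List<int[]> offsets, float weight) {
            this.text = text;
            this.offsets = offsets;
            this.weight = weight;
        }
    }

    // Matches queryTerms against consecutive stack entries starting at index 'from'.
    // Returns null when a term differs or the position gaps exceed the slop.
    static Phrase match(List<Term> stack, int from, List<String> queryTerms, int slop, float boost) {
        List<int[]> offsets = new ArrayList<>();
        Term prev = null;
        int gaps = 0;
        for (int q = 0; q < queryTerms.size(); q++) {
            int index = from + q;
            if (index >= stack.size() || !stack.get(index).text.equals(queryTerms.get(q))) {
                return null;
            }
            Term cur = stack.get(index);
            if (prev != null) {
                gaps += cur.position - prev.position - 1; // positions skipped between terms
            }
            offsets.add(new int[] {cur.start, cur.end});
            prev = cur;
        }
        return gaps <= slop ? new Phrase(String.join(" ", queryTerms), offsets, boost) : null;
    }

    // The FieldTermStack of the sample text.
    static List<Term> sampleStack() {
        return Arrays.asList(
            new Term("Lucene", 0, 6, 0),
            new Term("search", 12, 18, 3),
            new Term("library", 26, 33, 5));
    }
}
```

* With the sample stack, "search" (position 3) and "library" (position 5) skip one
* position, so the phrase matches with a slop of 1 but would not match with a slop of 0.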
* Step 4.
* In Step 4, Fast Vector Highlighter creates FieldFragList by reference to
* FieldPhraseList. In this sample case, the following
* FieldFragList will be generated:
*
*   FieldFragList
*   +---------------------------------+
*   |"Lucene"[(0,6)]                  |
*   |"search library"[(12,18),(26,33)]|
*   |totalBoost=3                     |
*   +---------------------------------+
*
*
*
* The calculation for each FieldFragList.WeightedFragInfo.totalBoost (weight)
* depends on the implementation of FieldFragList.add( ... ):
*
* public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
*   float totalBoost = 0;
*   List<SubInfo> subInfos = new ArrayList<SubInfo>();
*   for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
*     subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
*     totalBoost += phraseInfo.getBoost();
*   }
*   getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
* }
*
*
* Which FieldFragList implementation is used is determined in BaseFragListBuilder.createFieldFragList( ... ):
*
* public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
*   return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
* }
*
*
* Currently there are basically two approaches available:
*
* - SimpleFragListBuilder using SimpleFieldFragList: sum-of-boosts approach. The totalBoost is calculated by summing the query boosts per term. By default a term is boosted by 1.0.
* - WeightedFragListBuilder using WeightedFieldFragList: sum-of-distinct-weights approach. The totalBoost is calculated by summing the IDF weights of distinct terms.
*
* Comparison of the two approaches:
*
*
* query = das alte testament (The Old Testament)
*
* Terms in fragment   sum-of-distinct-weights   sum-of-boosts
* das alte testament  5.339621                  3.0
* das alte testament  5.339621                  3.0
* das testament alte  5.339621                  3.0
* das alte testament  5.339621                  3.0
* das testament       2.9455688                 2.0
* das alte            2.4759595                 2.0
* das das das das     1.5015357                 4.0
* das das das         1.3003681                 3.0
* das das             1.061746                  2.0
* alte                1.0                       1.0
* alte                1.0                       1.0
* das                 0.7507678                 1.0
* das                 0.7507678                 1.0
* das                 0.7507678                 1.0
* das                 0.7507678                 1.0
* das                 0.7507678                 1.0
*
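* The numbers in the comparison can be reproduced with the sketch below. Two points are
* assumptions inferred from the table rather than stated in the text: the
* sum-of-distinct-weights column appears to be the sum of distinct term weights multiplied
* by the square root of the number of term occurrences in the fragment, and the weight of
* "testament" (about 1.3320639) is backed out from the "das testament" row. The class and
* method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical scoring sketch that mirrors the comparison table; not Lucene code.
final class FragmentScoreSketch {
    // Sum-of-boosts: every matched term occurrence contributes its query boost.
    static double sumOfBoosts(List<String> terms, double boostPerTerm) {
        return terms.size() * boostPerTerm;
    }

    // Sum-of-distinct-weights: each distinct term counts once; the result is then
    // scaled by sqrt(number of occurrences). The scaling is inferred from the table.
    static double sumOfDistinctWeights(List<String> terms, Map<String, Double> weights) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String term : terms) {
            if (seen.add(term)) {
                total += weights.getOrDefault(term, 0.0);
            }
        }
        return total * Math.sqrt(terms.size());
    }

    // Per-term weights read from (or derived from) the single-term and two-term rows.
    static Map<String, Double> sampleWeights() {
        Map<String, Double> weights = new HashMap<>();
        weights.put("das", 0.7507678);
        weights.put("alte", 1.0);
        weights.put("testament", 1.3320639); // derived from the "das testament" row
        return weights;
    }
}
```

* Under these assumptions a fragment that repeats "das" four times scores 4.0 with
* sum-of-boosts but only about 1.50 with sum-of-distinct-weights, which is why the
* weighted approach ranks fragments covering more distinct query terms higher.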
*
* Step 5.
* In Step 5, by using FieldFragList and the field stored data,
* Fast Vector Highlighter creates highlighted snippets!
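*
* As a rough illustration of this last step, the sketch below wraps the matched offsets
* of the sample text in tags. It is a hypothetical helper (SnippetSketch.highlight is not
* a Lucene method), it assumes the offsets are sorted and do not overlap, and it ignores
* fragment boundaries and scoring, which the real fragments builders handle.

```java
// Hypothetical helper that inserts tags around matched offsets; not Lucene code.
final class SnippetSketch {
    // offsets: sorted, non-overlapping {startOffset, endOffset} pairs into 'text'.
    static String highlight(String text, int[][] offsets, String preTag, String postTag) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] range : offsets) {
            out.append(text, last, range[0]); // unmatched text before the hit
            out.append(preTag).append(text, range[0], range[1]).append(postTag);
            last = range[1];
        }
        out.append(text, last, text.length()); // trailing unmatched text
        return out.toString();
    }
}
```

* For the sample text with offsets (0,6), (12,18) and (26,33) and the tags <b> and </b>,
* this sketch yields: <b>Lucene</b> is a <b>search</b> engine <b>library</b>.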
*/
package org.apache.lucene.search.vectorhighlight;