org.apache.lucene.search.grouping.package-info Maven / Gradle / Ivy
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Grouping.
*
* This module enables search result grouping with Lucene, where hits
* with the same value in the specified single-valued group field are
* grouped together. For example, if you group by the author
* field, then all documents with the same value in the author
* field fall into a single group.
*
*
* Grouping requires a number of inputs:
*
*
* groupField
: this is the field used for grouping.
* For example, if you use the author
field then each
* group has all books by the same author. Documents that don't
* have this field are grouped under a single group with
* a null
group value.
*
* groupSort
: how the groups are sorted. For sorting
* purposes, each group is "represented" by the highest-sorted
* document according to the groupSort
within it. For
* example, if you specify "price" (ascending) then the first group
* is the one with the lowest price book within it. Or if you
* specify relevance group sort, then the first group is the one
* containing the highest scoring book.
*
* topNGroups
: how many top groups to keep. For
* example, 10 means the top 10 groups are computed.
*
* groupOffset
: which "slice" of top groups you want to
* retrieve. For example, 3 means you'll get 7 groups back
* (assuming topNGroups
is 10). This is useful for
* paging, where you might show 5 groups per page.
*
* withinGroupSort
: how the documents within each group
* are sorted. This can be different from the group sort.
*
* maxDocsPerGroup
: how many top documents within each
* group to keep.
*
* withinGroupOffset
: which "slice" of top
* documents you want to retrieve from each group.
*
*
*
* The implementation is two-pass: the first pass ({@link
* org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector})
* gathers the top groups, and the second pass ({@link
* org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector})
* gathers documents within those groups. If the search is costly to
* run you may want to use the {@link
* org.apache.lucene.search.CachingCollector} class, which
* caches hits and can (quickly) replay them for the second pass. This
* way you only run the query once, but you pay a RAM cost to (briefly)
* hold all hits. Results are returned as a {@link
* org.apache.lucene.search.grouping.TopGroups} instance.
*
*
* This module abstracts away what defines group and how it is collected. All grouping collectors
* are abstract and have currently term based implementations. One can implement
* collectors that for example group on multiple fields.
*
*
* Known limitations:
*
* - For the two-pass grouping search, the group field must be a
* indexed as a {@link org.apache.lucene.document.SortedDocValuesField}).
*
- Although Solr support grouping by function and this module has abstraction of what a group is, there are currently only
* implementations for grouping based on terms.
*
- Sharding is not directly supported, though is not too
* difficult, if you can merge the top groups and top documents per
* group yourself.
*
*
* Typical usage for the generic two-pass grouping search looks like this using the grouping convenience utility
* (optionally using caching for the second pass search):
*
*
* GroupingSearch groupingSearch = new GroupingSearch("author");
* groupingSearch.setGroupSort(groupSort);
* groupingSearch.setFillSortFields(fillFields);
*
* if (useCache) {
* // Sets cache in MB
* groupingSearch.setCachingInMB(4.0, true);
* }
*
* if (requiredTotalGroupCount) {
* groupingSearch.setAllGroups(true);
* }
*
* TermQuery query = new TermQuery(new Term("content", searchTerm));
* TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
*
* // Render groupsResult...
* if (requiredTotalGroupCount) {
* int totalGroupCount = result.totalGroupCount;
* }
*
*
* To use the single-pass BlockGroupingCollector
,
* first, at indexing time, you must ensure all docs in each group
* are added as a block, and you have some way to find the last
* document of each group. One simple way to do this is to add a
* marker binary field:
*
*
* // Create Documents from your source:
* List<Document> oneGroup = ...;
*
* Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
* groupEndField.setIndexOptions(IndexOptions.DOCS_ONLY);
* groupEndField.setOmitNorms(true);
* oneGroup.get(oneGroup.size()-1).add(groupEndField);
*
* // You can also use writer.updateDocuments(); just be sure you
* // replace an entire previous doc block with this new one. For
* // example, each group could have a "groupID" field, with the same
* // value for all docs in this group:
* writer.addDocuments(oneGroup);
*
*
* Then, at search time, do this up front:
*
*
* // Set this once in your app & save away for reusing across all queries:
* Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
*
*
* Finally, do this per search:
*
*
* // Per search:
* BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
* s.search(new TermQuery(new Term("content", searchTerm)), c);
* TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);
*
* // Render groupsResult...
*
*
* Or alternatively use the GroupingSearch
convenience utility:
*
*
* // Per search:
* GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs);
* groupingSearch.setGroupSort(groupSort);
* groupingSearch.setIncludeScores(needsScores);
* TermQuery query = new TermQuery(new Term("content", searchTerm));
* TopGroups groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
*
* // Render groupsResult...
*
*
* Note that the groupValue
of each GroupDocs
* will be null
, so if you need to present this value you'll
* have to separately retrieve it (for example using stored
* fields, FieldCache
, etc.).
*
* Another collector is the TermAllGroupHeadsCollector
that can be used to retrieve all most relevant
* documents per group. Also known as group heads. This can be useful in situations when one wants to compute group
* based facets / statistics on the complete query result. The collector can be executed during the first or second
* phase. This collector can also be used with the GroupingSearch
convenience utility, but when if one only
* wants to compute the most relevant documents per group it is better to just use the collector as done here below.
*
*
* AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
* s.search(new TermQuery(new Term("content", searchTerm)), c);
* // Return all group heads as int array
* int[] groupHeadsArray = c.retrieveGroupHeads()
* // Return all group heads as FixedBitSet.
* int maxDoc = s.maxDoc();
* FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc)
*
*
* For each of the above collector types there is also a variant that works with ValueSource
instead of
* of fields. Concretely this means that these variants can work with functions. These variants are slower than
* there term based counter parts. These implementations are located in the
* org.apache.lucene.search.grouping.function
package, but can also be used with the
* GroupingSearch
convenience utility
*
*/
package org.apache.lucene.search.grouping;