/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.solr.legacy;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Objects;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

/**
 * A {@link Query} that matches numeric values within a specified range. To use this, you must first
 * index the numeric values using {@link org.apache.solr.legacy.LegacyIntField}, {@link
 * org.apache.solr.legacy.LegacyFloatField}, {@link org.apache.solr.legacy.LegacyLongField} or
 * {@link org.apache.solr.legacy.LegacyDoubleField} (expert: {@link
 * org.apache.solr.legacy.LegacyNumericTokenStream}). If your terms are instead textual, you should
 * use {@link TermRangeQuery}.
 *
 * <p>You create a new LegacyNumericRangeQuery with the static factory methods, e.g.:
 *
 * <pre>
 * Query q = LegacyNumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);
 * </pre>
 *
 * matches all documents whose float valued "weight" field ranges from 0.03 to 0.10, inclusive.
 *
 * <p>The performance of LegacyNumericRangeQuery is much better than the corresponding {@link
 * TermRangeQuery} because the number of terms that must be searched is usually far fewer, thanks to
 * trie indexing, described below.
 *
 * <p>You can optionally specify a <code>precisionStep</code> when creating this query. This is
 * necessary if you've changed this configuration from its default during indexing (the factory
 * methods without a <code>precisionStep</code> argument use 16 for long/double and 8 for
 * int/float). Lower values consume more disk space but speed up searching. Suitable values are
 * between <b>1</b> and <b>8</b>. A good starting point to test is <b>4</b>. See below for details.
 *
 * <p>This query defaults to {@linkplain MultiTermQuery#CONSTANT_SCORE_REWRITE}. With precision
 * steps of &le;4, this query can be run with one of the BooleanQuery rewrite methods without
 * changing BooleanQuery's default max clause count.
 *
 * <h2>How it works</h2>
 *
 * <p>See the publication about panFMP, where this algorithm was described (referred to as <code>
 * TrieRangeQuery</code>):
 *
 * <blockquote>
 *
 * <strong>Schindler, U, Diepenbroek, M</strong>, 2008. <em>Generic XML-based Framework for Metadata
 * Portals.</em> Computers &amp; Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023
 *
 * </blockquote>
 *
 * <p><em>A quote from this paper:</em> Because Apache Lucene is a full-text search engine and not a
 * conventional database, it cannot handle numerical ranges (e.g., field value is inside user
 * defined bounds, even dates are numerical values). We have developed an extension to Apache Lucene
 * that stores the numerical values in a special string-encoded format with variable precision (all
 * numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable
 * string representations and stored with different precisions (for a more detailed description of
 * how the values are stored, see {@link org.apache.solr.legacy.LegacyNumericUtils})). A range is
 * then divided recursively into multiple intervals for searching: The center of the range is
 * searched only with the lowest possible precision in the <em>trie</em>, while the boundaries are
 * matched more exactly. This reduces the number of terms dramatically.
 *
 * <p>For the variant that stores long values in 8 different precisions (each reduced by 8 bits)
 * that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values
 * in the lowest precision. Overall, a range could consist of a theoretical maximum of <code>
 * 7*255*2 + 255 = 3825</code> distinct terms (when there is a term for every distinct value of an
 * 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct
 * values is used because it would always be possible to reduce the full 256 values to one term with
 * degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000
 * metadata records and a uniform value distribution).
 *
 * <h2>Precision Step</h2>
 *
 * <p>You can choose any <code>precisionStep</code> when encoding values. Lower step values mean
 * more precisions and so more terms in index (and index gets larger). The number of indexed terms
 * per value is (those are generated by {@link org.apache.solr.legacy.LegacyNumericTokenStream}):
 *
 * <pre>
 *   indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
 * </pre>
 *
 * <p>As the lower precision terms are shared by many values, the additional terms only slightly
 * grow the term dictionary (approx. 7% for <code>precisionStep=4</code>), but have a larger impact
 * on the postings (the postings file will have more entries, as every document is linked to <code>
 * indexedTermsPerValue</code> terms instead of one). The formula to estimate the growth of the term
 * dictionary in comparison to one term per value:
 *
 * <pre>
 *   \mathrm{termDictOverhead} =
 *     \sum\limits_{i=0}^{\mathrm{indexedTermsPerValue}-1} \frac{1}{2^{\mathrm{precisionStep}\cdot i}}
 * </pre>
 *
 * <p>On the other hand, if the <code>precisionStep</code> is smaller, the maximum number of terms
 * to match reduces, which optimizes query speed. The formula to calculate the maximum number of
 * terms that will be visited while executing the query is:
 *
 * <pre>
 *   \mathrm{maxQueryTerms} =
 *     \left[ \left( \mathrm{indexedTermsPerValue} - 1 \right)
 *       \cdot \left( 2^\mathrm{precisionStep} - 1 \right) \cdot 2 \right]
 *     + \left( 2^\mathrm{precisionStep} - 1 \right)
 * </pre>
 *
 * <p>For longs stored using a precision step of 4, <code>maxQueryTerms = 15*15*2 + 15 = 465</code>,
 * and for a precision step of 2, <code>maxQueryTerms = 31*3*2 + 3 = 189</code>. But the faster
 * search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal
 * <code>precisionStep</code> value can only be found out by testing. <b>Important:</b> You can
 * index with a lower precision step value and test search speed using a multiple of the original
 * step value.
 *
 * <p>Good values for <code>precisionStep</code> depend on usage and data type:
 *
 * <ul>
 *   <li>When no <code>precisionStep</code> is given, the factory methods use {@link
 *       org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT} (16) for long/double
 *       and {@link org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT_32} (8) for
 *       int/float.
 *   <li>Ideal value in most cases for <em>64 bit</em> data types <em>(long, double)</em> is
 *       <b>6</b> or <b>8</b>.
 *   <li>Ideal value in most cases for <em>32 bit</em> data types <em>(int, float)</em> is
 *       <b>4</b>.
 *   <li>For low cardinality fields larger precision steps are good. If the cardinality is &lt; 100,
 *       it is fair to use {@link Integer#MAX_VALUE} (see below).
 *   <li>Steps <b>&ge;64</b> for <em>long/double</em> and <b>&ge;32</b> for <em>int/float</em>
 *       produce one token per value in the index and querying is as slow as a conventional {@link
 *       TermRangeQuery}. But it can be used to produce fields that are solely used for sorting (in
 *       this case simply use {@link Integer#MAX_VALUE} as <code>precisionStep</code>). Using {@link
 *       org.apache.solr.legacy.LegacyIntField}, {@link org.apache.solr.legacy.LegacyLongField},
 *       {@link org.apache.solr.legacy.LegacyFloatField} or {@link
 *       org.apache.solr.legacy.LegacyDoubleField} for sorting is ideal, because building the field
 *       cache is much faster than with text-only numbers. These fields have one term per value and
 *       therefore also work with term enumeration for building distinct lists (e.g. facets /
 *       preselected values to search for). Sorting is also possible with range query optimized
 *       fields using one of the above <code>precisionSteps</code>.
 * </ul>
 *
 * <p>Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed
 * that {@link TermRangeQuery} in boolean rewrite mode (with raised {@link BooleanQuery} clause
 * count) took about 30-40 secs to complete, {@link TermRangeQuery} in constant score filter rewrite
 * mode took 5 secs and executing this class took &lt;100ms to complete (on an Opteron64 machine,
 * Java 1.5, 8 bit precision step). This query type was developed for a geographic portal, where the
 * performance for e.g. bounding boxes or exact date/time stamps is important.
 *
 * @deprecated Instead index with {@link IntPoint}, {@link LongPoint}, {@link FloatPoint}, {@link
 *     DoublePoint}, and create range queries with {@link IntPoint#newRangeQuery(String, int, int)
 *     IntPoint.newRangeQuery()}, {@link LongPoint#newRangeQuery(String, long, long)
 *     LongPoint.newRangeQuery()}, {@link FloatPoint#newRangeQuery(String, float, float)
 *     FloatPoint.newRangeQuery()}, {@link DoublePoint#newRangeQuery(String, double, double)
 *     DoublePoint.newRangeQuery()} respectively. See {@link PointValues} for background information
 *     on Points.
 * @since 2.9
 */
@Deprecated
public final class LegacyNumericRangeQuery<T extends Number> extends MultiTermQuery {

  private LegacyNumericRangeQuery(
      final String field,
      final int precisionStep,
      final LegacyNumericType dataType,
      T min,
      T max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    super(field, MultiTermQuery.CONSTANT_SCORE_REWRITE);
    if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >=1");
    this.precisionStep = precisionStep;
    this.dataType = Objects.requireNonNull(dataType, "LegacyNumericType must not be null");
    this.min = min;
    this.max = max;
    this.minInclusive = minInclusive;
    this.maxInclusive = maxInclusive;
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>long</code>
   * range using the given <code>precisionStep</code>. You can have half-open ranges (which are in
   * fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to <code>null</code>.
   * By setting inclusive to false, it will match all documents excluding the bounds; with
   * inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Long> newLongRange(
      final String field,
      final int precisionStep,
      Long min,
      Long max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field, precisionStep, LegacyNumericType.LONG, min, max, minInclusive, maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>long</code>
   * range using the default <code>precisionStep</code> {@link
   * org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT} (16). You can have half-open
   * ranges (which are in fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to
   * <code>null</code>. By setting inclusive to false, it will match all documents excluding the
   * bounds; with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Long> newLongRange(
      final String field,
      Long min,
      Long max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field,
        LegacyNumericUtils.PRECISION_STEP_DEFAULT,
        LegacyNumericType.LONG,
        min,
        max,
        minInclusive,
        maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries an <code>int</code>
   * range using the given <code>precisionStep</code>. You can have half-open ranges (which are in
   * fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to <code>null</code>. By
   * setting inclusive to false, it will match all documents excluding the bounds; with inclusive
   * on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Integer> newIntRange(
      final String field,
      final int precisionStep,
      Integer min,
      Integer max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field, precisionStep, LegacyNumericType.INT, min, max, minInclusive, maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries an <code>int</code>
   * range using the default <code>precisionStep</code> {@link
   * org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT_32} (8). You can have
   * half-open ranges (which are in fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max
   * value to <code>null</code>. By setting inclusive to false, it will match all documents
   * excluding the bounds; with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Integer> newIntRange(
      final String field,
      Integer min,
      Integer max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field,
        LegacyNumericUtils.PRECISION_STEP_DEFAULT_32,
        LegacyNumericType.INT,
        min,
        max,
        minInclusive,
        maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>double</code>
   * range using the given <code>precisionStep</code>. You can have half-open ranges (which are in
   * fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to <code>null</code>.
   * {@link Double#NaN} will never match a half-open range; to hit {@code NaN} use a query with
   * {@code min == max == Double.NaN}. By setting inclusive to false, it will match all documents
   * excluding the bounds; with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Double> newDoubleRange(
      final String field,
      final int precisionStep,
      Double min,
      Double max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field, precisionStep, LegacyNumericType.DOUBLE, min, max, minInclusive, maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>double</code>
   * range using the default <code>precisionStep</code> {@link
   * org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT} (16). You can have half-open
   * ranges (which are in fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to
   * <code>null</code>. {@link Double#NaN} will never match a half-open range; to hit {@code NaN}
   * use a query with {@code min == max == Double.NaN}. By setting inclusive to false, it will match
   * all documents excluding the bounds; with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Double> newDoubleRange(
      final String field,
      Double min,
      Double max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field,
        LegacyNumericUtils.PRECISION_STEP_DEFAULT,
        LegacyNumericType.DOUBLE,
        min,
        max,
        minInclusive,
        maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>float</code>
   * range using the given <code>precisionStep</code>. You can have half-open ranges (which are in
   * fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max value to <code>null</code>.
   * {@link Float#NaN} will never match a half-open range; to hit {@code NaN} use a query with
   * {@code min == max == Float.NaN}. By setting inclusive to false, it will match all documents
   * excluding the bounds; with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Float> newFloatRange(
      final String field,
      final int precisionStep,
      Float min,
      Float max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field, precisionStep, LegacyNumericType.FLOAT, min, max, minInclusive, maxInclusive);
  }

  /**
   * Factory that creates a <code>LegacyNumericRangeQuery</code>, that queries a <code>float</code>
   * range using the default <code>precisionStep</code> {@link
   * org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT_32} (8). You can have
   * half-open ranges (which are in fact &lt;/&le; or &gt;/&ge; queries) by setting the min or max
   * value to <code>null</code>. {@link Float#NaN} will never match a half-open range; to hit
   * {@code NaN} use a query with {@code min == max == Float.NaN}. By setting inclusive to false,
   * it will match all documents excluding the bounds; with inclusive on, the boundaries are hits,
   * too.
   */
  public static LegacyNumericRangeQuery<Float> newFloatRange(
      final String field,
      Float min,
      Float max,
      final boolean minInclusive,
      final boolean maxInclusive) {
    return new LegacyNumericRangeQuery<>(
        field,
        LegacyNumericUtils.PRECISION_STEP_DEFAULT_32,
        LegacyNumericType.FLOAT,
        min,
        max,
        minInclusive,
        maxInclusive);
  }

  @Override
  @SuppressWarnings("unchecked")
  protected TermsEnum getTermsEnum(final Terms terms, AttributeSource atts) throws IOException {
    // very strange: java.lang.Number itself is not Comparable, but all subclasses used here are
    if (min != null && max != null && ((Comparable<T>) min).compareTo(max) > 0) {
      return TermsEnum.EMPTY;
    }
    return new NumericRangeTermsEnum(terms.iterator());
  }

  /** Returns <code>true</code> if the lower endpoint is inclusive */
  public boolean includesMin() {
    return minInclusive;
  }

  /** Returns <code>true</code> if the upper endpoint is inclusive */
  public boolean includesMax() {
    return maxInclusive;
  }

  /** Returns the lower value of this range query */
  public T getMin() {
    return min;
  }

  /** Returns the upper value of this range query */
  public T getMax() {
    return max;
  }

  /** Returns the precision step. */
  public int getPrecisionStep() {
    return precisionStep;
  }

  @Override
  public String toString(final String field) {
    final StringBuilder sb = new StringBuilder();
    if (!getField().equals(field)) sb.append(getField()).append(':');
    return sb.append(minInclusive ? '[' : '{')
        .append((min == null) ? "*" : min.toString())
        .append(" TO ")
        .append((max == null) ? "*" : max.toString())
        .append(maxInclusive ? ']' : '}')
        .toString();
  }

  @Override
  public final boolean equals(final Object o) {
    if (o == this) return true;
    if (!super.equals(o)) return false;
    if (o instanceof LegacyNumericRangeQuery) {
      final LegacyNumericRangeQuery<?> q = (LegacyNumericRangeQuery<?>) o;
      return Objects.equals(q.min, min)
          && Objects.equals(q.max, max)
          && minInclusive == q.minInclusive
          && maxInclusive == q.maxInclusive
          && precisionStep == q.precisionStep;
    }
    return false;
  }

  @Override
  public int hashCode() {
    int hash = super.hashCode();
    hash = 31 * hash + precisionStep;
    hash = 31 * hash + Objects.hashCode(min);
    hash = 31 * hash + Objects.hashCode(max);
    hash = 31 * hash + Boolean.hashCode(minInclusive);
    hash = 31 * hash + Boolean.hashCode(maxInclusive);
    return hash;
  }

  // members (package private, to be also fast accessible by NumericRangeTermEnum)
  final int precisionStep;
  final LegacyNumericType dataType;
  final T min, max;
  final boolean minInclusive, maxInclusive;

  // used to handle float/double infinity correctly
  static final long LONG_NEGATIVE_INFINITY =
      NumericUtils.doubleToSortableLong(Double.NEGATIVE_INFINITY);
  static final long LONG_POSITIVE_INFINITY =
      NumericUtils.doubleToSortableLong(Double.POSITIVE_INFINITY);
  static final int INT_NEGATIVE_INFINITY = NumericUtils.floatToSortableInt(Float.NEGATIVE_INFINITY);
  static final int INT_POSITIVE_INFINITY = NumericUtils.floatToSortableInt(Float.POSITIVE_INFINITY);

  /**
   * Subclass of FilteredTermsEnum for enumerating all terms that match the sub-ranges for trie
   * range queries, using flex API.
   *
   * <p>WARNING: This term enumeration is not guaranteed to be always ordered by {@link
   * Term#compareTo}. The ordering depends on how {@link
   * org.apache.solr.legacy.LegacyNumericUtils#splitLongRange} and {@link
   * org.apache.solr.legacy.LegacyNumericUtils#splitIntRange} generate the sub-ranges. For {@link
   * MultiTermQuery} ordering is not relevant.
   */
  private final class NumericRangeTermsEnum extends FilteredTermsEnum {

    private BytesRef currentLowerBound, currentUpperBound;

    private final ArrayDeque<BytesRef> rangeBounds = new ArrayDeque<>();

    NumericRangeTermsEnum(final TermsEnum tenum) {
      super(tenum);
      switch (dataType) {
        case LONG:
        case DOUBLE:
          {
            // lower
            long minBound;
            if (dataType == LegacyNumericType.LONG) {
              minBound = (min == null) ? Long.MIN_VALUE : min.longValue();
            } else {
              assert dataType == LegacyNumericType.DOUBLE;
              minBound =
                  (min == null)
                      ? LONG_NEGATIVE_INFINITY
                      : NumericUtils.doubleToSortableLong(min.doubleValue());
            }
            if (!minInclusive && min != null) {
              if (minBound == Long.MAX_VALUE) break;
              minBound++;
            }

            // upper
            long maxBound;
            if (dataType == LegacyNumericType.LONG) {
              maxBound = (max == null) ? Long.MAX_VALUE : max.longValue();
            } else {
              assert dataType == LegacyNumericType.DOUBLE;
              maxBound =
                  (max == null)
                      ? LONG_POSITIVE_INFINITY
                      : NumericUtils.doubleToSortableLong(max.doubleValue());
            }
            if (!maxInclusive && max != null) {
              if (maxBound == Long.MIN_VALUE) break;
              maxBound--;
            }

            LegacyNumericUtils.splitLongRange(
                new LegacyNumericUtils.LongRangeBuilder() {
                  @Override
                  public final void addRange(BytesRef minPrefixCoded, BytesRef maxPrefixCoded) {
                    rangeBounds.add(minPrefixCoded);
                    rangeBounds.add(maxPrefixCoded);
                  }
                },
                precisionStep,
                minBound,
                maxBound);
            break;
          }

        case INT:
        case FLOAT:
          {
            // lower
            int minBound;
            if (dataType == LegacyNumericType.INT) {
              minBound = (min == null) ? Integer.MIN_VALUE : min.intValue();
            } else {
              assert dataType == LegacyNumericType.FLOAT;
              minBound =
                  (min == null)
                      ? INT_NEGATIVE_INFINITY
                      : NumericUtils.floatToSortableInt(min.floatValue());
            }
            if (!minInclusive && min != null) {
              if (minBound == Integer.MAX_VALUE) break;
              minBound++;
            }

            // upper
            int maxBound;
            if (dataType == LegacyNumericType.INT) {
              maxBound = (max == null) ? Integer.MAX_VALUE : max.intValue();
            } else {
              assert dataType == LegacyNumericType.FLOAT;
              maxBound =
                  (max == null)
                      ? INT_POSITIVE_INFINITY
                      : NumericUtils.floatToSortableInt(max.floatValue());
            }
            if (!maxInclusive && max != null) {
              if (maxBound == Integer.MIN_VALUE) break;
              maxBound--;
            }

            LegacyNumericUtils.splitIntRange(
                new LegacyNumericUtils.IntRangeBuilder() {
                  @Override
                  public final void addRange(BytesRef minPrefixCoded, BytesRef maxPrefixCoded) {
                    rangeBounds.add(minPrefixCoded);
                    rangeBounds.add(maxPrefixCoded);
                  }
                },
                precisionStep,
                minBound,
                maxBound);
            break;
          }

        default:
          // should never happen
          throw new IllegalArgumentException("Invalid LegacyNumericType");
      }
    }

    private void nextRange() {
      assert rangeBounds.size() % 2 == 0;

      currentLowerBound = rangeBounds.removeFirst();
      assert currentUpperBound == null || currentUpperBound.compareTo(currentLowerBound) <= 0
          : "The current upper bound must be <= the new lower bound";

      currentUpperBound = rangeBounds.removeFirst();
    }

    @Override
    protected final BytesRef nextSeekTerm(BytesRef term) {
      while (rangeBounds.size() >= 2) {
        nextRange();

        // if the new upper bound is before the term parameter, the sub-range is never a hit
        if (term != null && term.compareTo(currentUpperBound) > 0) continue;
        // never seek backwards, so use current term if lower bound is smaller
        return (term != null && term.compareTo(currentLowerBound) > 0) ? term : currentLowerBound;
      }

      // no more sub-range enums available
      assert rangeBounds.isEmpty();
      currentLowerBound = currentUpperBound = null;
      return null;
    }

    @Override
    protected final AcceptStatus accept(BytesRef term) {
      while (currentUpperBound == null || term.compareTo(currentUpperBound) > 0) {
        if (rangeBounds.isEmpty()) return AcceptStatus.END;
        // peek next sub-range, only seek if the current term is smaller than next lower bound
        if (term.compareTo(rangeBounds.getFirst()) < 0) return AcceptStatus.NO_AND_SEEK;
        // step forward to next range without seeking, as next lower range bound is less or equal
        // current term
        nextRange();
      }
      return AcceptStatus.YES;
    }
  }

  @Override
  public void visit(QueryVisitor visitor) {
    visitor.visitLeaf(this);
  }
}
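The term-count formulas in the class Javadoc (indexedTermsPerValue, termDictOverhead, maxQueryTerms) are easy to check numerically. The following is a minimal stand-alone sketch; the class name `TrieTermMath` and its method names are made up for illustration and are not part of Lucene or Solr. It assumes `precisionStep` is between 1 and 62 so that the shifts do not overflow.

```java
// Sketch only: a stand-alone calculator for the formulas quoted in the class Javadoc.
// TrieTermMath and its methods are illustrative names, not Lucene/Solr API.
final class TrieTermMath {

  // indexedTermsPerValue = ceil(bitsPerValue / precisionStep), done in integer arithmetic
  static int indexedTermsPerValue(int bitsPerValue, int precisionStep) {
    return (bitsPerValue + precisionStep - 1) / precisionStep;
  }

  // termDictOverhead = sum over i in [0, indexedTermsPerValue) of 1 / 2^(precisionStep * i)
  static double termDictOverhead(int bitsPerValue, int precisionStep) {
    int terms = indexedTermsPerValue(bitsPerValue, precisionStep);
    double sum = 0.0;
    for (int i = 0; i < terms; i++) {
      sum += Math.pow(2.0, -((double) precisionStep * i));
    }
    return sum;
  }

  // maxQueryTerms = (indexedTermsPerValue - 1) * (2^precisionStep - 1) * 2 + (2^precisionStep - 1)
  static long maxQueryTerms(int bitsPerValue, int precisionStep) {
    long perLevel = (1L << precisionStep) - 1;
    return (indexedTermsPerValue(bitsPerValue, precisionStep) - 1) * perLevel * 2 + perLevel;
  }

  public static void main(String[] args) {
    System.out.println(maxQueryTerms(64, 4)); // 465, as stated in the Javadoc
    System.out.println(maxQueryTerms(64, 2)); // 189, as stated in the Javadoc
    System.out.println(maxQueryTerms(64, 8)); // 3825, the 8-precision example from the paper
    System.out.println(termDictOverhead(64, 4)); // roughly 1.067, i.e. about 7% growth
  }
}
```

The three integer results reproduce the worked numbers in the Javadoc, which is a quick way to sanity-check a candidate `precisionStep` before indexing with it.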

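The recursive splitting described in the quoted paper (coarse terms for the center of the range, finer terms at the boundaries) can be sketched without any Lucene dependency. The code below is a simplified illustration modelled on the idea behind `LegacyNumericUtils.splitLongRange`; it is not that method. It assumes non-negative bounds that fit in `valSize` = 32 bits, omits the overflow checks a full implementation needs, and records plain numeric bounds instead of prefix-coded `BytesRef` terms. `TrieRangeSplitSketch` and `SubRange` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: simplified trie range splitting for non-negative values below 2^32.
final class TrieRangeSplitSketch {

  // One sub-range: every value in [min, max] is matched by terms that drop the lowest `shift` bits.
  static final class SubRange {
    final long min;
    final long max;
    final int shift;

    SubRange(long min, long max, int shift) {
      this.min = min;
      this.max = max;
      this.shift = shift;
    }

    // Number of full-precision values this sub-range covers.
    long coveredValues() {
      return max - min + 1;
    }

    // Upper bound on the number of index terms visited at this precision.
    long prefixTerms() {
      return (max >>> shift) - (min >>> shift) + 1;
    }
  }

  static List<SubRange> split(int valSize, int precisionStep, long minBound, long maxBound) {
    List<SubRange> out = new ArrayList<>();
    if (minBound > maxBound) return out;
    for (int shift = 0; ; shift += precisionStep) {
      long diff = 1L << (shift + precisionStep);
      long mask = ((1L << precisionStep) - 1L) << shift;
      // Does the lower/upper end stick out of a block of the next coarser precision?
      boolean hasLower = (minBound & mask) != 0L;
      boolean hasUpper = (maxBound & mask) != mask;
      long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
      long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
      if (shift + precisionStep >= valSize || nextMin > nextMax) {
        // Coarsest usable precision reached: emit what is left and stop.
        out.add(range(minBound, maxBound, shift));
        break;
      }
      if (hasLower) out.add(range(minBound, minBound | mask, shift));
      if (hasUpper) out.add(range(maxBound & ~mask, maxBound, shift));
      minBound = nextMin;
      maxBound = nextMax;
    }
    return out;
  }

  // Fill in the bits that were shifted away so max is the last covered full-precision value.
  private static SubRange range(long min, long max, int shift) {
    return new SubRange(min, max | ((1L << shift) - 1L), shift);
  }

  public static void main(String[] args) {
    long values = 0;
    long terms = 0;
    for (SubRange r : split(32, 4, 0x12, 0x345)) {
      System.out.printf("shift=%d [0x%x, 0x%x]%n", r.shift, r.min, r.max);
      values += r.coveredValues();
      terms += r.prefixTerms();
    }
    System.out.println(values + " values via at most " + terms + " terms");
  }
}
```

For the range 0x12 to 0x345 with a precision step of 4, the sketch produces five sub-ranges that together cover all 820 values while touching at most 40 terms. The sub-ranges are emitted boundary-first rather than in sorted order, which is in line with the WARNING on `NumericRangeTermsEnum` about term ordering.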

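The double and float branches of `NumericRangeTermsEnum` turn an exclusive bound into an inclusive one with `minBound++` or `maxBound--` after converting the value to a sortable integer. That only works because the encoding preserves numeric order and maps adjacent doubles to adjacent longs. The sketch below re-implements the bit trick for illustration only; in real code use `org.apache.lucene.util.NumericUtils.doubleToSortableLong`. The class name `SortableDoubleSketch` is hypothetical.

```java
// Sketch only: an illustrative re-implementation of a sortable-long encoding for doubles.
// Production code should call org.apache.lucene.util.NumericUtils instead.
final class SortableDoubleSketch {

  // Flip the 63 value bits of negative doubles so that signed long order equals double order.
  static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
  }

  public static void main(String[] args) {
    // Order is preserved across the sign boundary.
    System.out.println(doubleToSortableLong(-1.0) < doubleToSortableLong(1.0));
    // The next representable double is exactly one step away, so "exclusive" becomes "+1".
    System.out.println(
        doubleToSortableLong(Math.nextUp(1.0)) == doubleToSortableLong(1.0) + 1);
    // NaN sorts above positive infinity, so a half-open range never reaches it.
    System.out.println(
        doubleToSortableLong(Double.NaN) > doubleToSortableLong(Double.POSITIVE_INFINITY));
  }
}
```

The last check is the reason the factory Javadoc says `NaN` never matches a half-open range: a `null` bound is replaced by the encoded infinity, and the canonical `NaN` encodes to a larger value than positive infinity.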