/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.solr.legacy;


import java.io.IOException;
import java.util.LinkedList;
import java.util.Objects;

import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

/**
 * A {@link Query} that matches numeric values within a
 * specified range.  To use this, you must first index the
 * numeric values using {@link org.apache.solr.legacy.LegacyIntField}, {@link
 * org.apache.solr.legacy.LegacyFloatField}, {@link org.apache.solr.legacy.LegacyLongField} or {@link org.apache.solr.legacy.LegacyDoubleField} (expert: {@link
 * org.apache.solr.legacy.LegacyNumericTokenStream}).  If your terms are instead textual,
 * you should use {@link TermRangeQuery}.
 *
 * You create a new LegacyNumericRangeQuery with the static
 * factory methods, e.g.:
 *
 *   Query q = LegacyNumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);
 *
 * This matches all documents whose float-valued "weight" field
 * ranges from 0.03 to 0.10, inclusive.
 *
 * The performance of LegacyNumericRangeQuery is much better
 * than the corresponding {@link TermRangeQuery} because the
 * number of terms that must be searched is usually far
 * fewer, thanks to trie indexing, described below.
 *
 * You can optionally specify a precisionStep
 * when creating this query.  This is necessary if you've
 * changed this configuration from its default (16 for long/double,
 * 8 for int/float) during indexing.  Lower values consume more disk
 * space but speed up searching.  Suitable values are between 1 and
 * 8. A good starting point to test is 4.  See below for
 * details.
 *
 * This query defaults to {@linkplain
 * MultiTermQuery#CONSTANT_SCORE_REWRITE}.
 * With precision steps of ≤4, this query can be run with
 * one of the BooleanQuery rewrite methods without changing
 * BooleanQuery's default max clause count.
 *
 * How it works
 *
 * See the publication about panFMP,
 * where this algorithm was described (referred to as TrieRangeQuery):
 *
 * Schindler, U, Diepenbroek, M, 2008.
 * Generic XML-based Framework for Metadata Portals.
 * Computers & Geosciences 34 (12), 1947-1955.
 * doi:10.1016/j.cageo.2008.02.023
 *
 * A quote from this paper: Because Apache Lucene is a full-text
 * search engine and not a conventional database, it cannot handle numerical ranges
 * (e.g., field value is inside user defined bounds, even dates are numerical values).
 * We have developed an extension to Apache Lucene that stores
 * the numerical values in a special string-encoded format with variable precision
 * (all numerical values like doubles, longs, floats, and ints are converted to
 * lexicographically sortable string representations and stored with different precisions;
 * for a more detailed description of how the values are stored,
 * see {@link org.apache.solr.legacy.LegacyNumericUtils}). A range is then divided recursively into multiple intervals for searching:
 * The center of the range is searched only with the lowest possible precision in the trie,
 * while the boundaries are matched more exactly. This reduces the number of terms dramatically.
 *
 * For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that
 * uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the
 * lowest precision. Overall, a range could consist of a theoretical maximum of
 * 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an
 * 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used
 * because it would always be possible to reduce the full 256 values to one term with degraded precision).
 * In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records
 * and a uniform value distribution).
 *
 * Precision Step
 *
 * You can choose any precisionStep when encoding values.
 * Lower step values mean more precisions and so more terms in the index (and the index gets larger). The number
 * of indexed terms per value (generated by {@link org.apache.solr.legacy.LegacyNumericTokenStream}) is:
 *
 *   indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
 *
 * As the lower precision terms are shared by many values, the additional terms only
 * slightly grow the term dictionary (approx. 7% for precisionStep=4), but have a larger
 * impact on the postings (the postings file will have more entries, as every document is linked to
 * indexedTermsPerValue terms instead of one). The formula to estimate the growth
 * of the term dictionary in comparison to one term per value:
 *
 *   termDictOverhead = sum for i = 0 .. indexedTermsPerValue-1 of 1 / 2^(precisionStep * i)
 *
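 * On the other hand, if the precisionStep is smaller, the maximum number of terms to match reduces,
 * which optimizes query speed. The formula to calculate the maximum number of terms that will be visited while
 * executing the query is:
 *
 *   maxQueryTerms = (indexedTermsPerValue - 1) * (2^precisionStep - 1) * 2 + (2^lowestPrecision - 1)
 *
 * where lowestPrecision is the number of bits of the lowest-precision term.
 *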
 * For longs stored using a precision step of 4, maxQueryTerms = 15*15*2 + 15 = 465, and for a precision
 * step of 2, maxQueryTerms = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking
 * in the term enum of the index. Because of this, the ideal precisionStep value can only
 * be found out by testing. Important: You can index with a lower precision step value and test search speed
 * using a multiple of the original step value.
 *
 * Good values for precisionStep depend on usage and data type:
 *
 *  The default is 16 for 64 bit data types (long, double) and 8 for 32 bit data types (int, float);
 *  it is used when no precisionStep is given.
 *  Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.
 *  Ideal value in most cases for 32 bit data types (int, float) is 4.
 *  For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is
 *  fair to use {@link Integer#MAX_VALUE} (see below).
 *  Steps ≥64 for long/double and ≥32 for int/float produce one token
 *  per value in the index and querying is as slow as a conventional {@link TermRangeQuery}. But they can be used
 *  to produce fields that are solely used for sorting (in this case simply use {@link Integer#MAX_VALUE} as
 *  precisionStep). Using {@link org.apache.solr.legacy.LegacyIntField},
 *  {@link org.apache.solr.legacy.LegacyLongField}, {@link org.apache.solr.legacy.LegacyFloatField} or {@link org.apache.solr.legacy.LegacyDoubleField} for sorting
 *  is ideal, because building the field cache is much faster than with text-only numbers.
 *  These fields have one term per value and therefore also work with term enumeration for building distinct lists
 *  (e.g. facets / preselected values to search for).
 *  Sorting is also possible with range query optimized fields using one of the above precisionSteps.
 *
 *
 * Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed
 * that {@link TermRangeQuery} in boolean rewrite mode (with raised {@link BooleanQuery} clause count)
 * took about 30-40 secs to complete, {@link TermRangeQuery} in constant score filter rewrite mode took 5 secs
 * and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit
 * precision step). This query type was developed for a geographic portal, where the performance for
 * e.g. bounding boxes or exact date/time stamps is important.
 *
 * @deprecated Instead index with {@link IntPoint}, {@link LongPoint}, {@link FloatPoint}, {@link DoublePoint}, and
 *             create range queries with {@link IntPoint#newRangeQuery(String, int, int) IntPoint.newRangeQuery()},
 *             {@link LongPoint#newRangeQuery(String, long, long) LongPoint.newRangeQuery()},
 *             {@link FloatPoint#newRangeQuery(String, float, float) FloatPoint.newRangeQuery()},
 *             {@link DoublePoint#newRangeQuery(String, double, double) DoublePoint.newRangeQuery()} respectively.
 *             See {@link PointValues} for background information on Points.
 *
 * @since 2.9
 **/

@Deprecated
public final class LegacyNumericRangeQuery<T extends Number> extends MultiTermQuery {

  private LegacyNumericRangeQuery(final String field, final int precisionStep, final LegacyNumericType dataType,
                                  T min, T max, final boolean minInclusive, final boolean maxInclusive) {
    super(field);
    if (precisionStep < 1)
      throw new IllegalArgumentException("precisionStep must be >=1");
    this.precisionStep = precisionStep;
    this.dataType = Objects.requireNonNull(dataType, "LegacyNumericType must not be null");
    this.min = min;
    this.max = max;
    this.minInclusive = minInclusive;
    this.maxInclusive = maxInclusive;
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a long
   * range using the given precisionStep.
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null. By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Long> newLongRange(final String field, final int precisionStep,
    Long min, Long max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, precisionStep, LegacyNumericType.LONG, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a long
   * range using the default precisionStep {@link org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT} (16).
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null. By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Long> newLongRange(final String field,
    Long min, Long max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, LegacyNumericUtils.PRECISION_STEP_DEFAULT, LegacyNumericType.LONG, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries an int
   * range using the given precisionStep.
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null. By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Integer> newIntRange(final String field, final int precisionStep,
    Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, precisionStep, LegacyNumericType.INT, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries an int
   * range using the default precisionStep {@link org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT_32} (8).
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null. By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Integer> newIntRange(final String field,
    Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, LegacyNumericUtils.PRECISION_STEP_DEFAULT_32, LegacyNumericType.INT, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a double
   * range using the given precisionStep.
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null.
   * {@link Double#NaN} will never match a half-open range, to hit {@code NaN} use a query
   * with {@code min == max == Double.NaN}.  By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Double> newDoubleRange(final String field, final int precisionStep,
    Double min, Double max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, precisionStep, LegacyNumericType.DOUBLE, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a double
   * range using the default precisionStep {@link org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT} (16).
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null.
   * {@link Double#NaN} will never match a half-open range, to hit {@code NaN} use a query
   * with {@code min == max == Double.NaN}.  By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Double> newDoubleRange(final String field,
    Double min, Double max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, LegacyNumericUtils.PRECISION_STEP_DEFAULT, LegacyNumericType.DOUBLE, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a float
   * range using the given precisionStep.
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null.
   * {@link Float#NaN} will never match a half-open range, to hit {@code NaN} use a query
   * with {@code min == max == Float.NaN}.  By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Float> newFloatRange(final String field, final int precisionStep,
    Float min, Float max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, precisionStep, LegacyNumericType.FLOAT, min, max, minInclusive, maxInclusive);
  }
  
  /**
   * Factory that creates a LegacyNumericRangeQuery, that queries a float
   * range using the default precisionStep {@link org.apache.solr.legacy.LegacyNumericUtils#PRECISION_STEP_DEFAULT_32} (8).
   * You can have half-open ranges (which are in fact </≤ or >/≥ queries)
   * by setting the min or max value to null.
   * {@link Float#NaN} will never match a half-open range, to hit {@code NaN} use a query
   * with {@code min == max == Float.NaN}.  By setting inclusive to false, it will
   * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
   */
  public static LegacyNumericRangeQuery<Float> newFloatRange(final String field,
    Float min, Float max, final boolean minInclusive, final boolean maxInclusive
  ) {
    return new LegacyNumericRangeQuery<>(field, LegacyNumericUtils.PRECISION_STEP_DEFAULT_32, LegacyNumericType.FLOAT, min, max, minInclusive, maxInclusive);
  }

  @Override @SuppressWarnings("unchecked")
  protected TermsEnum getTermsEnum(final Terms terms, AttributeSource atts) throws IOException {
    // very strange: java.lang.Number itself is not Comparable, but all subclasses used here are
    if (min != null && max != null && ((Comparable<T>) min).compareTo(max) > 0) {
      return TermsEnum.EMPTY;
    }
    return new NumericRangeTermsEnum(terms.iterator());
  }

  /** Returns true if the lower endpoint is inclusive */
  public boolean includesMin() { return minInclusive; }
  
  /** Returns true if the upper endpoint is inclusive */
  public boolean includesMax() { return maxInclusive; }

  /** Returns the lower value of this range query */
  public T getMin() { return min; }

  /** Returns the upper value of this range query */
  public T getMax() { return max; }
  
  /** Returns the precision step. */
  public int getPrecisionStep() { return precisionStep; }
  
  @Override
  public String toString(final String field) {
    final StringBuilder sb = new StringBuilder();
    if (!getField().equals(field)) sb.append(getField()).append(':');
    return sb.append(minInclusive ? '[' : '{')
      .append((min == null) ? "*" : min.toString())
      .append(" TO ")
      .append((max == null) ? "*" : max.toString())
      .append(maxInclusive ? ']' : '}')
      .toString();
  }

  @Override
  @SuppressWarnings({"unchecked","rawtypes"})
  public final boolean equals(final Object o) {
    if (o==this) return true;
    if (!super.equals(o))
      return false;
    if (o instanceof LegacyNumericRangeQuery) {
      final LegacyNumericRangeQuery q=(LegacyNumericRangeQuery)o;
      return (
        (q.min == null ? min == null : q.min.equals(min)) &&
        (q.max == null ? max == null : q.max.equals(max)) &&
        minInclusive == q.minInclusive &&
        maxInclusive == q.maxInclusive &&
        precisionStep == q.precisionStep
      );
    }
    return false;
  }

  @Override
  public final int hashCode() {
    int hash = super.hashCode();
    hash = 31 * hash + precisionStep;
    hash = 31 * hash + Objects.hashCode(min);
    hash = 31 * hash + Objects.hashCode(max);
    hash = 31 * hash + Objects.hashCode(minInclusive);
    hash = 31 * hash + Objects.hashCode(maxInclusive);
    return hash;
  }

  // members (package private, to also be quickly accessible by NumericRangeTermsEnum)
  final int precisionStep;
  final LegacyNumericType dataType;
  final T min, max;
  final boolean minInclusive,maxInclusive;

  // used to handle float/double infinity correctly
  static final long LONG_NEGATIVE_INFINITY =
    NumericUtils.doubleToSortableLong(Double.NEGATIVE_INFINITY);
  static final long LONG_POSITIVE_INFINITY =
    NumericUtils.doubleToSortableLong(Double.POSITIVE_INFINITY);
  static final int INT_NEGATIVE_INFINITY =
    NumericUtils.floatToSortableInt(Float.NEGATIVE_INFINITY);
  static final int INT_POSITIVE_INFINITY =
    NumericUtils.floatToSortableInt(Float.POSITIVE_INFINITY);

  /**
   * Subclass of FilteredTermsEnum for enumerating all terms that match the
   * sub-ranges for trie range queries, using flex API.
   * 
   * WARNING: This term enumeration is not guaranteed to be always ordered by
   * {@link Term#compareTo}.
   * The ordering depends on how {@link org.apache.solr.legacy.LegacyNumericUtils#splitLongRange} and
   * {@link org.apache.solr.legacy.LegacyNumericUtils#splitIntRange} generate the sub-ranges. For
   * {@link MultiTermQuery} ordering is not relevant.
   */
  private final class NumericRangeTermsEnum extends FilteredTermsEnum {

    private BytesRef currentLowerBound, currentUpperBound;

    private final LinkedList<BytesRef> rangeBounds = new LinkedList<>();

    NumericRangeTermsEnum(final TermsEnum tenum) {
      super(tenum);
      switch (dataType) {
        case LONG:
        case DOUBLE: {
          // lower
          long minBound;
          if (dataType == LegacyNumericType.LONG) {
            minBound = (min == null) ? Long.MIN_VALUE : min.longValue();
          } else {
            assert dataType == LegacyNumericType.DOUBLE;
            minBound = (min == null) ? LONG_NEGATIVE_INFINITY
              : NumericUtils.doubleToSortableLong(min.doubleValue());
          }
          if (!minInclusive && min != null) {
            if (minBound == Long.MAX_VALUE) break;
            minBound++;
          }
          
          // upper
          long maxBound;
          if (dataType == LegacyNumericType.LONG) {
            maxBound = (max == null) ? Long.MAX_VALUE : max.longValue();
          } else {
            assert dataType == LegacyNumericType.DOUBLE;
            maxBound = (max == null) ? LONG_POSITIVE_INFINITY
              : NumericUtils.doubleToSortableLong(max.doubleValue());
          }
          if (!maxInclusive && max != null) {
            if (maxBound == Long.MIN_VALUE) break;
            maxBound--;
          }
          
          LegacyNumericUtils.splitLongRange(new LegacyNumericUtils.LongRangeBuilder() {
            @Override
            public final void addRange(BytesRef minPrefixCoded, BytesRef maxPrefixCoded) {
              rangeBounds.add(minPrefixCoded);
              rangeBounds.add(maxPrefixCoded);
            }
          }, precisionStep, minBound, maxBound);
          break;
        }
          
        case INT:
        case FLOAT: {
          // lower
          int minBound;
          if (dataType == LegacyNumericType.INT) {
            minBound = (min == null) ? Integer.MIN_VALUE : min.intValue();
          } else {
            assert dataType == LegacyNumericType.FLOAT;
            minBound = (min == null) ? INT_NEGATIVE_INFINITY
              : NumericUtils.floatToSortableInt(min.floatValue());
          }
          if (!minInclusive && min != null) {
            if (minBound == Integer.MAX_VALUE) break;
            minBound++;
          }
          
          // upper
          int maxBound;
          if (dataType == LegacyNumericType.INT) {
            maxBound = (max == null) ? Integer.MAX_VALUE : max.intValue();
          } else {
            assert dataType == LegacyNumericType.FLOAT;
            maxBound = (max == null) ? INT_POSITIVE_INFINITY
              : NumericUtils.floatToSortableInt(max.floatValue());
          }
          if (!maxInclusive && max != null) {
            if (maxBound == Integer.MIN_VALUE) break;
            maxBound--;
          }
          
          LegacyNumericUtils.splitIntRange(new LegacyNumericUtils.IntRangeBuilder() {
            @Override
            public final void addRange(BytesRef minPrefixCoded, BytesRef maxPrefixCoded) {
              rangeBounds.add(minPrefixCoded);
              rangeBounds.add(maxPrefixCoded);
            }
          }, precisionStep, minBound, maxBound);
          break;
        }
          
        default:
          // should never happen
          throw new IllegalArgumentException("Invalid LegacyNumericType");
      }
    }
    
    private void nextRange() {
      assert rangeBounds.size() % 2 == 0;

      currentLowerBound = rangeBounds.removeFirst();
      assert currentUpperBound == null || currentUpperBound.compareTo(currentLowerBound) <= 0 :
        "The current upper bound must be <= the new lower bound";
      
      currentUpperBound = rangeBounds.removeFirst();
    }
    
    @Override
    protected final BytesRef nextSeekTerm(BytesRef term) {
      while (rangeBounds.size() >= 2) {
        nextRange();
        
        // if the new upper bound is before the term parameter, the sub-range is never a hit
        if (term != null && term.compareTo(currentUpperBound) > 0)
          continue;
        // never seek backwards, so use current term if lower bound is smaller
        return (term != null && term.compareTo(currentLowerBound) > 0) ?
          term : currentLowerBound;
      }
      
      // no more sub-range enums available
      assert rangeBounds.isEmpty();
      currentLowerBound = currentUpperBound = null;
      return null;
    }
    
    @Override
    protected final AcceptStatus accept(BytesRef term) {
      while (currentUpperBound == null || term.compareTo(currentUpperBound) > 0) {
        if (rangeBounds.isEmpty())
          return AcceptStatus.END;
        // peek next sub-range, only seek if the current term is smaller than next lower bound
        if (term.compareTo(rangeBounds.getFirst()) < 0)
          return AcceptStatus.NO_AND_SEEK;
        // step forward to next range without seeking, as the next lower range bound is less than or equal to the current term
        nextRange();
      }
      return AcceptStatus.YES;
    }

  }

  @Override
  public void visit(QueryVisitor visitor) {
    visitor.visitLeaf(this);
  }
  
}
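
The precision-step arithmetic in the class Javadoc can be checked with a small standalone sketch. It is not part of Solr or Lucene: the class name `PrecisionStepMath` and its methods are hypothetical, and the `maxQueryTerms` computation is a generalisation of the worked examples in the Javadoc (15*15*2 + 15 = 465, 31*3*2 + 3 = 189, 7*255*2 + 255 = 3825). The handling of a `bitsPerValue` that is not a multiple of `precisionStep` is an assumption.

```java
// Hypothetical helper, not part of Solr/Lucene: reproduces the Javadoc arithmetic.
final class PrecisionStepMath {
  private PrecisionStepMath() {}

  // Assumption: restrict inputs so every intermediate value fits in a long.
  private static void checkArgs(int bitsPerValue, int precisionStep) {
    if (bitsPerValue != 32 && bitsPerValue != 64) {
      throw new IllegalArgumentException("bitsPerValue must be 32 or 64");
    }
    if (precisionStep < 1 || precisionStep > 32) {
      throw new IllegalArgumentException("precisionStep must be in [1, 32]");
    }
  }

  /** indexedTermsPerValue = ceil(bitsPerValue / precisionStep), using integer math. */
  static int indexedTermsPerValue(int bitsPerValue, int precisionStep) {
    checkArgs(bitsPerValue, precisionStep);
    return (bitsPerValue + precisionStep - 1) / precisionStep;
  }

  /**
   * Upper bound on visited terms: two boundary runs of (2^precisionStep - 1) terms
   * on every level except the lowest precision, plus one run on the lowest precision.
   */
  static long maxQueryTerms(int bitsPerValue, int precisionStep) {
    int terms = indexedTermsPerValue(bitsPerValue, precisionStep);
    // Assumption: the lowest-precision term keeps whatever bits remain.
    int lowestPrecisionBits = bitsPerValue - (terms - 1) * precisionStep;
    long perLevel = (1L << precisionStep) - 1;
    long lowest = (1L << lowestPrecisionBits) - 1;
    return (terms - 1) * perLevel * 2 + lowest;
  }

  public static void main(String[] args) {
    System.out.println(maxQueryTerms(64, 4)); // prints 465
    System.out.println(maxQueryTerms(64, 2)); // prints 189
    System.out.println(maxQueryTerms(64, 8)); // prints 3825
  }
}
```

The numbers only bound the worst case; as the Javadoc notes, a smaller step also means more seeking in the term enum, so the best value still has to be found by testing.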