org.apache.lucene.queryparser.classic.package-info Maven / Gradle / Ivy
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* A simple query parser implemented with JavaCC.
*
* Note that JavaCC defines lots of public classes, methods and fields that do not need to be
* public. These clutter the documentation. Sorry.
*
*
Note that because JavaCC defines a class named Token
,
* org.apache.lucene.analysis.Token
must always be fully qualified in source code in this
* package.
*
*
NOTE: {@link org.apache.lucene.queryparser.flexible.standard} has an alternative
* queryparser that matches the syntax of this one, but is more modular, enabling substantial
* customization to how a query is created.
*
*
Query Parser Syntax
*
*
*
*
* - Overview
*
- Terms
*
- Fields
*
- Term Modifiers
*
*
- Boolean Operators
*
*
- Grouping
*
- Field Grouping
*
- Escaping Special Characters
*
*
*
*
* Overview
*
*
*
* Although Lucene provides the ability to create your own queries through its API, it also
* provides a rich query language through the Query Parser, a lexer which interprets a string into a
* Lucene Query using JavaCC.
*
*
Generally, the query parser syntax may change from release to release. This page describes the
* syntax as of the current release. If you are using a different version of Lucene, please consult
* the copy of docs/queryparsersyntax.html that was distributed with
* the version you are using.
*
*
Before choosing to use the provided Query Parser, please consider the following:
*
*
* - If you are programmatically generating a query string and then parsing it with the query
* parser then you should seriously consider building your queries directly with the query
* API. In other words, the query parser is designed for human-entered text, not for
* program-generated text.
*
- Untokenized fields are best added directly to queries, and not through the query parser. If
* a field's values are generated programmatically by the application, then so should query
* clauses for this field. An analyzer, which the query parser uses, is designed to convert
* human-entered text to terms. Program-generated values, like dates, keywords, etc., should
* be consistently program-generated.
*
- In a query form, fields which are general text should use the query parser. All others,
* such as date ranges, keywords, etc. are better added directly through the query API. A
* field with a limit set of values, that can be specified with a pull-down menu should not be
* added to a query string which is subsequently parsed, but rather added as a TermQuery
* clause.
*
*
*
*
* Terms
*
*
*
* A query is broken up into terms and operators. There are two types of terms: Single Terms and
* Phrases.
*
*
A Single Term is a single word such as "test" or "hello".
*
*
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
*
*
Multiple terms can be combined together with Boolean operators to form a more complex query
* (see below).
*
*
Note: The analyzer used to create the index will be used on the terms and phrases in the query
* string. So it is important to choose an analyzer that will not interfere with the terms used in
* the query string.
*
* Fields
*
*
*
* Lucene supports fielded data. When performing a search you can either specify a field, or use
* the default field. The field names and default field is implementation specific.
*
*
You can search any field by typing the field name followed by a colon ":" and then the term
* you are looking for.
*
*
As an example, let's assume a Lucene index contains two fields, title and text and text is the
* default field. If you want to find the document entitled "The Right Way" which contains the text
* "don't go this way", you can enter:
*
*
title:"The Right Way" AND text:go
*
* or
*
*
title:"The Right Way" AND go
*
* Since text is the default field, the field indicator is not required.
*
*
Note: The field is only valid for the term that it directly precedes, so the query
*
*
title:The Right Way
*
* Will only find "The" in the title field. It will find "Right" and "Way" in the default field
* (in this case the text field).
*
* Term Modifiers
*
*
*
* Lucene supports modifying query terms to provide a wide range of searching options.
*
*
Wildcard Searches
*
* Lucene supports single and multiple character wildcard searches within single terms (not
* within phrase queries).
*
*
To perform a single character wildcard search use the "?" symbol.
*
*
To perform a multiple character wildcard search use the "*" symbol.
*
*
The single character wildcard search looks for terms that match that with the single character
* replaced. For example, to search for "text" or "test" you can use the search:
*
*
te?t
*
* Multiple character wildcard searches looks for 0 or more characters. For example, to search
* for test, tests or tester, you can use the search:
*
*
test*
*
* You can also use the wildcard searches in the middle of a term.
*
*
te*t
*
* Note: You cannot use a * or ? symbol as the first character of a search.
*
*
Regular Expression Searches
*
* Lucene supports regular expression searches matching a pattern between forward slashes "/".
* The syntax may change across releases, but the current supported syntax is documented in the
* {@link org.apache.lucene.util.automaton.RegExp RegExp} class. For example to find documents
* containing "moat" or "boat":
*
*
/[mb]oat/
*
*
*
* Fuzzy Searches
*
* Lucene supports fuzzy searches based on Damerau-Levenshtein Distance. To do a fuzzy search use
* the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar
* in spelling to "roam" use the fuzzy search:
*
*
roam~
*
* This search will find terms like foam and roams.
*
*
An additional (optional) parameter can specify the maximum number of edits allowed. The value
* is between 0 and 2, For example:
*
*
roam~1
*
* The default that is used if the parameter is not given is 2 edit distances.
*
*
Previously, a floating point value was allowed here. This syntax is considered deprecated and
* will be removed in Lucene 5.0
*
*
Proximity Searches
*
* Lucene supports finding words are a within a specific distance away. To do a proximity search
* use the tilde, "~", symbol at the end of a Phrase. For example to search for a "apache" and
* "jakarta" within 10 words of each other in a document use the search:
*
*
"jakarta apache"~10
*
*
*
* Range Searches
*
* Range Queries allow one to match documents whose field(s) values are between the lower and
* upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the
* upper and lower bounds. Sorting is done lexicographically.
*
*
mod_date:[20020101 TO 20030101]
*
* This will find documents whose mod_date fields have values between 20020101 and 20030101,
* inclusive. Note that Range Queries are not reserved for date fields. You could also use range
* queries with non-date fields:
*
*
title:{Aida TO Carmen}
*
* This will find all documents whose titles are between Aida and Carmen, but not including Aida
* and Carmen.
*
*
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by
* curly brackets.
*
*
Boosting a Term
*
* Lucene provides the relevance level of matching documents based on the terms found. To boost a
* term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are
* searching. The higher the boost factor, the more relevant the term will be.
*
*
Boosting allows you to control the relevance of a document by boosting its term. For example,
* if you are searching for
*
*
jakarta apache
*
* and you want the term "jakarta" to be more relevant boost it using the ^ symbol along with the
* boost factor next to the term. You would type:
*
*
jakarta^4 apache
*
* This will make documents with the term jakarta appear more relevant. You can also boost Phrase
* Terms as in the example:
*
*
"jakarta apache"^4 "Apache Lucene"
*
* By default, the boost factor is 1. Although the boost factor must be positive, it can be less
* than 1 (e.g. 0.2)
*
* Boolean Operators
*
*
*
* Boolean operators allow terms to be combined through logic operators. Lucene supports AND,
* "+", OR, NOT and "-" as Boolean operators(Note: Boolean operators must be ALL CAPS).
*
*
OR
*
* The OR operator is the default conjunction operator. This means that if there is no Boolean
* operator between two terms, the OR operator is used. The OR operator links two terms and finds a
* matching document if either of the terms exist in a document. This is equivalent to a union using
* sets. The symbol || can be used in place of the word OR.
*
*
To search for documents that contain either "jakarta apache" or just "jakarta" use the query:
*
*
"jakarta apache" jakarta
*
* or
*
*
"jakarta apache" OR jakarta
*
*
*
* AND
*
* The AND operator matches documents where both terms exist anywhere in the text of a single
* document. This is equivalent to an intersection using sets. The symbol && can be used in
* place of the word AND.
*
*
To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:
*
*
"jakarta apache" AND "Apache Lucene"
*
*
*
* +
*
* The "+" or required operator requires that the term after the "+" symbol exist somewhere in a
* the field of a single document.
*
*
To search for documents that must contain "jakarta" and may contain "lucene" use the query:
*
*
+jakarta lucene
*
*
*
* NOT
*
* The NOT operator excludes documents that contain the term after NOT. This is equivalent to a
* difference using sets. The symbol ! can be used in place of the word NOT.
*
*
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
*
*
"jakarta apache" NOT "Apache Lucene"
*
* Note: The NOT operator cannot be used with just one term. For example, the following search
* will return no results:
*
*
NOT "jakarta apache"
*
*
*
* -
*
* The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.
*
*
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
*
*
"jakarta apache" -"Apache Lucene"
*
*
*
* Grouping
*
*
*
* Lucene supports using parentheses to group clauses to form sub queries. This can be very
* useful if you want to control the boolean logic for a query.
*
*
To search for either "jakarta" or "apache" and "website" use the query:
*
*
(jakarta OR apache) AND website
*
* This eliminates any confusion and makes sure you that website must exist and either term
* jakarta or apache may exist.
*
* Field Grouping
*
*
*
* Lucene supports using parentheses to group multiple clauses to a single field.
*
*
To search for a title that contains both the word "return" and the phrase "pink panther" use
* the query:
*
*
title:(+return +"pink panther")
*
*
*
* Escaping Special Characters
*
*
*
* Lucene supports escaping special characters that are part of the query syntax. The current
* list special characters are
*
*
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
*
*
To escape these character use the \ before the character. For example to search for (1+1):2
* use the query:
*
*
\(1\+1\)\:2
*
*
*/
package org.apache.lucene.queryparser.classic;