MG4J Help
Searching with MG4J
Basic syntax
Basic MG4J syntax is pretty straightforward.
If you type some words MG4J will search for them in all indexed fields
(unless you force a single field—read below).
All words must be present in at least one of
the fields of a document for it to be listed in the output.
First of all, if you are accessing multiple fields (e.g., “title”
and “body”) you can
ask for a subquery (e.g., a specific term) to appear in just one index by prefixing the subquery
with the index name, followed by a colon. Thus, title:web
searches for “web” just in titles, and nowhere else.
There is a stemming operator * that searches for all
terms starting with a given prefix: page* will search for
“page” as well as “pages”, “paged” and so on (there is a limit on
the amount of stemming you can do, however).
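As a rough illustration of how a prefix such as page* can be turned into a list of concrete terms, here is a minimal Python sketch over a sorted term list. It is a toy model, not MG4J code: the helper name expand_prefix, the limit parameter, and the choice to raise an error when the limit is exceeded are all assumptions made for this example.

```python
# Toy model of prefix expansion; this is NOT MG4J's implementation.
from bisect import bisect_left

def expand_prefix(sorted_terms, prefix, limit=3):
    """Return the terms starting with `prefix` (hypothetical helper).

    Raises ValueError if more than `limit` terms match; how a real engine
    reacts to the limit is an assumption here.
    """
    start = bisect_left(sorted_terms, prefix)  # first candidate position
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        if len(matches) == limit:
            raise ValueError(f"prefix {prefix!r} expands to more than {limit} terms")
        matches.append(term)
    return matches

terms = sorted(["automatic", "page", "paged", "pages", "paging", "web"])
print(expand_prefix(terms, "page"))  # ['page', 'paged', 'pages']
```

With this term list, expanding the shorter prefix "pag" would also reach "paging" and exceed the limit of three, which the sketch reports as an error.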
You can create more complex queries using the AND operator
(a.k.a. &, or just juxtaposition of terms) or using the OR operator (a.k.a. |).
Thus, page | pages will search for any of the words “page” and
“pages” in all indices, whereas title:page text:pages
searches for “page” in titles and
“pages” in the description text. You can also
surround lists of words with double quotes, and words will be searched for in that sequence:
"one page" will search exactly for the two-word sequence “one page”. Inside
quotes, $ means “any word”, so "one $ page" will match “one nice page”,
“one ugly page”, and so on.
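The behaviour of quoted sequences and of the $ placeholder can be sketched in a few lines of Python. This is only a toy model on a list of words, assuming whitespace tokenization; the function name phrase_positions is hypothetical and does not come from MG4J.

```python
# Toy model of phrase matching with the "$" placeholder; not MG4J code.
def phrase_positions(tokens, pattern):
    """Start positions where `pattern` matches consecutive words.

    A "$" in the pattern matches any single word.
    """
    n = len(pattern)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(p == "$" or p == t for p, t in zip(pattern, tokens[i:i + n]))
    ]

tokens = "this is one nice page and one ugly page".split()
print(phrase_positions(tokens, ["one", "$", "page"]))  # [2, 6]
print(phrase_positions(tokens, ["one", "page"]))       # []
```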
If you do not want some query to be satisfied, just add an exclamation mark
(or the keyword NOT) in front of it. ! pages finds documents where the word “pages” does not appear.
page* ! automatic searches for “page”, “pages”, “paged”, etc. but
just where “automatic” does not appear.
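At the level of whole documents, AND, OR and NOT behave like set intersection, union and complement. The following Python sketch shows this on three tiny documents. It assumes a single field and whitespace tokenization, and the helper having is a hypothetical name; MG4J itself works on indices, not on raw strings.

```python
# Toy document-level model of AND, OR and NOT; not MG4J code.
docs = {
    0: "one nice page about automatic indexing",
    1: "two pages of notes",
    2: "a paged list",
}
all_ids = set(docs)

def having(word):
    """Ids of the documents containing `word` (hypothetical helper)."""
    return {d for d, text in docs.items() if word in text.split()}

page_or_pages = having("page") | having("pages")      # page | pages
pages_and_notes = having("pages") & having("notes")   # pages notes
not_pages = all_ids - having("pages")                 # ! pages

# page* ! automatic, with the prefix expanded by hand for this example
page_star = having("page") | having("pages") | having("paged")
page_not_automatic = page_star - having("automatic")

print(page_or_pages, pages_and_notes, not_pages, page_not_automatic)
```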
You can always use parentheses to group parts of a query correctly. For instance,
if you want to search for “page” or “pages” in titles, just use title:(page OR pages).
A more sophisticated modifier is proximity restriction: if you surround a query
with ()~k, where k is an integer, only documents
satisfying the query within regions of text of at most k words will be returned.
If the query is a conjunction (AND) of several
terms, the result is clear, but MG4J can make sense of it for any query. So
title:(one page | two pages)~5 will search for documents in whose
title either “one” and “page” or “two” and
“pages” appear within a region of at most five words (i.e., with at most three words in between).
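The proximity example can be checked with a small Python sketch that slides a window over the words of a title. It is a toy model under stated assumptions: following the worked example in the text, a window of exactly k words counts as a match, and the function name within is hypothetical.

```python
# Toy model of proximity restriction on a conjunction of terms; not MG4J code.
def within(tokens, terms, k):
    """True if some window of at most k consecutive words contains every term."""
    needed = set(terms)
    for start in range(len(tokens)):
        if needed <= set(tokens[start:start + k]):
            return True
    return False

title = "one very nice short page".split()

# title:(one page | two pages)~5, with the disjunction spelled out by hand
matches = within(title, ["one", "page"], 5) or within(title, ["two", "pages"], 5)
print(matches)                            # True: three words lie in between
print(within(title, ["one", "page"], 4))  # False: the window is too narrow
```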
Advanced features
Using ordered conjunction, denoted by <, you can search for several terms in a specified order.
So line < break will search for “line” and “break”, but only in this order.
A document containing just one occurrence of “break” followed by one occurrence of “line” won't match.
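For two single terms, ordered conjunction amounts to asking whether some occurrence of the first term comes before some occurrence of the second. A minimal Python sketch of that check follows; the function name ordered is hypothetical and the code is a toy model, not MG4J's implementation.

```python
# Toy model of ordered conjunction (line < break) on two terms; not MG4J code.
def ordered(tokens, first, second):
    """True if some occurrence of `first` precedes some occurrence of `second`."""
    if first not in tokens:
        return False
    earliest = tokens.index(first)  # earliest occurrence gives the most room
    return second in tokens[earliest + 1:]

print(ordered("insert a line break here".split(), "line", "break"))  # True
print(ordered("break the line".split(), "line", "break"))            # False
```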
There is a query expansion operator + whose meaning in terms of retrieved documents is exactly
equivalent to disjunction (OR). However, for scoring purposes all terms in a query expansion are treated
as if they were identical (in terms of frequency) to the first term. This behaviour makes the scoring more precise, as
it prevents very rare variants of an expanded common term from getting too high a score.
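To see why borrowing the frequency of the first term matters, consider a scorer that weights terms by inverse document frequency. The Python sketch below is an illustration only: the IDF formula, the collection size, and the document frequencies are assumptions made up for the example, and MG4J's actual scorers may work differently.

```python
# Toy comparison of OR and + under an assumed IDF weighting; not MG4J code.
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency (assumed scoring scheme for this example)."""
    return math.log(num_docs / doc_freq)

num_docs = 1000                          # hypothetical collection size
doc_freq = {"page": 200, "pagelet": 2}   # hypothetical document frequencies
variants = ["page", "pagelet"]

# page | pagelet: every term is weighted by its own frequency
or_weights = {t: idf(num_docs, doc_freq[t]) for t in variants}

# page + pagelet: every variant borrows the frequency of the first term
first = variants[0]
expansion_weights = {t: idf(num_docs, doc_freq[first]) for t in variants}

print(or_weights["pagelet"] > or_weights["page"])                # True
print(expansion_weights["pagelet"] == expansion_weights["page"])  # True
```

Under plain disjunction the rare variant dominates the score, whereas under expansion both variants carry the same weight.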
To each query, MG4J associates a set of regions of text that satisfy the query. For instance, in the
simple case of a conjunction of terms, this set consists of the minimal regions of text
containing all terms. Brouwerian difference, denoted by -, makes it possible to delete from
the set of regions associated to the minuend any region containing some of the regions associated to the subtrahend. If the resulting
set is empty, the document won't match. For instance, searching for "romeo $ juliet" - and will
return all documents containing “romeo something juliet”, where however something cannot be “and”.
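The romeo and juliet example can be reproduced with a toy model in which a region is a pair of inclusive word positions. The Python sketch below is written under that assumption; the helper names are hypothetical and the code does not reflect how MG4J computes regions internally.

```python
# Toy model of Brouwerian difference on regions (start, end); not MG4J code.
def phrase_regions(tokens, pattern):
    """Regions matched by a quoted sequence; "$" matches any word."""
    n = len(pattern)
    return [
        (i, i + n - 1)
        for i in range(len(tokens) - n + 1)
        if all(p == "$" or p == t for p, t in zip(pattern, tokens[i:i + n]))
    ]

def term_regions(tokens, term):
    """One-word regions for each occurrence of `term`."""
    return [(i, i) for i, t in enumerate(tokens) if t == term]

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def brouwerian_difference(minuend, subtrahend):
    """Keep the minuend regions that contain no subtrahend region."""
    return [r for r in minuend if not any(contains(r, s) for s in subtrahend)]

tokens = "romeo and juliet then romeo loves juliet".split()
minuend = phrase_regions(tokens, ["romeo", "$", "juliet"])  # [(0, 2), (4, 6)]
subtrahend = term_regions(tokens, "and")                    # [(1, 1)]
print(brouwerian_difference(minuend, subtrahend))           # [(4, 6)]
```

A document containing only “romeo and juliet” would be left with an empty set of regions, and so would not match.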
Complex queries
As we already discussed, MG4J associates regions of text to each query. These regions are subsequently used to give
meaning to composed queries. Thus, whenever for simplicity we used “term” in the above explanations, we could have always
used “subquery”. Some examples follow:
- "one page" < "two pages" will search for the exact sequence “one page” followed
(possibly with words in between) by the exact sequence “two pages”;
- (we | you) < "go home" will search for the term “we” or the term “you” followed (possibly with words in between) by
the exact sequence “go home”;
- "(we | you) (go < home)" will search for the term “we” or the term “you” immediately followed by the term “go”
followed (possibly with words in between) by the term “home”;
- (cheap "computer appliances")~5 will search for the term “cheap” and the exact sequence “computer appliances”, but
they must appear with at most two words in between;
- "romeo $ juliet" will search for “romeo something juliet”.
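The idea that operators act on the regions of their subqueries, rather than on single terms, can be sketched for the second example in the list above. The Python code below is a toy model with several simplifications: regions are pairs of inclusive word positions, all combinations are returned rather than only minimal regions, and every helper name is hypothetical.

```python
# Toy model of composing subqueries through their regions; not MG4J code.
def term_regions(tokens, term):
    """One-word regions for each occurrence of `term`."""
    return [(i, i) for i, t in enumerate(tokens) if t == term]

def phrase_regions(tokens, words):
    """Regions where `words` appear consecutively."""
    n = len(words)
    return [(i, i + n - 1) for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def union(a, b):
    """Regions of a disjunction: those of either operand."""
    return sorted(set(a) | set(b))

def followed_by(a, b):
    """Spans from a region of `a` to a later, non-overlapping region of `b`."""
    return [(x[0], y[1]) for x in a for y in b if x[1] < y[0]]

tokens = "should we really go home now".split()

# (we | you) < "go home"
we_or_you = union(term_regions(tokens, "we"), term_regions(tokens, "you"))
go_home = phrase_regions(tokens, ["go", "home"])
print(followed_by(we_or_you, go_home))  # [(1, 4)]
```

If the words appear in the opposite order, as in “go home we said”, the result is empty and the document does not match.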
For the most complete and up-to-date information on MG4J query syntax, please refer to the
Javadoc documentation.