MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java.





MG4J Help



Searching with MG4J

Basic syntax

Basic MG4J syntax is pretty straightforward. If you type some words, MG4J will search for them in all indexed fields (unless you restrict the query to a single field; see below). All words must be present in at least one of the fields of a document for it to be listed in the output.

If you are searching multiple fields (e.g., “title” and “body”), you can require a subquery (e.g., a specific term) to match in just one index by prefixing the subquery with the index name, followed by a colon. Thus, title:web searches for “web” just in titles, and nowhere else.

There is a stemming operator * that searches for all terms starting with a given prefix: page* will search for “page” as well as “pages”, “paged” and so on (there is a limit on the amount of stemming you can do, however).
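The prefix expansion described above can be pictured with a small sketch. This is not MG4J code: the class name PrefixExpansion, the method expand, the sample dictionary and the limit parameter are all hypothetical, and the sketch only assumes that the index keeps its terms in sorted order, so that all terms sharing a prefix are contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixExpansion {
    /**
     * Returns, in sorted order, at most `limit` dictionary terms starting with `prefix`.
     * Hypothetical helper: it only illustrates the idea of bounded prefix expansion.
     */
    public static List<String> expand(TreeSet<String> dictionary, String prefix, int limit) {
        List<String> result = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms follow contiguously.
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || result.size() >= limit) break;
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary =
            new TreeSet<>(List.of("pace", "page", "paged", "pages", "pagoda", "paint"));
        System.out.println(expand(dictionary, "page", 10)); // [page, paged, pages]
        System.out.println(expand(dictionary, "page", 2));  // [page, paged]
    }
}
```

With a small limit the expansion is simply cut off, which is one plausible reading of the limit mentioned above; how MG4J itself reacts when the limit is exceeded is not stated here.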

You can create more complex queries using the AND operator (a.k.a. &, or just juxtaposition of terms) or using the OR operator (a.k.a. |). Thus, page | pages will search for any of the words “page” and “pages” in all indices, whereas title:page text:pages searches for “page” in titles and “pages” in the description text. You can also surround lists of words with double quotes, and words will be searched for in that sequence: "one page" will search exactly for the two-word sequence “one page”. Inside quotes, $ means “any word”, so "one $ page" will match “one nice page”, “one ugly page”, and so on.
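To make the quoted-phrase semantics concrete, here is a minimal sketch of phrase matching over a tokenized text, where $ stands for any single word. The class PhraseMatch and the method containsPhrase are hypothetical names for illustration only; MG4J evaluates phrases on its index rather than by scanning tokens like this.

```java
import java.util.List;

public class PhraseMatch {
    /**
     * True if `pattern` occurs as a run of consecutive tokens,
     * with "$" in the pattern matching any single token.
     */
    public static boolean containsPhrase(List<String> tokens, List<String> pattern) {
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int i = 0; i < pattern.size(); i++) {
                String p = pattern.get(i);
                if (!p.equals("$") && !p.equals(tokens.get(start + i))) {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> text = List.of("this", "is", "one", "nice", "page");
        System.out.println(containsPhrase(text, List.of("one", "$", "page"))); // true
        System.out.println(containsPhrase(text, List.of("one", "page")));      // false
    }
}
```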

If you do not want some query to be satisfied, just add an exclamation mark (or the keyword NOT) in front of it. ! pages finds documents where the word “pages” does not appear. page* ! automatic searches for “page”, “pages”, “paged”, etc., but only where “automatic” does not appear.

You can always use parentheses to group parts of a query correctly. For instance, if you want to search for “page” or “pages” in titles, just use title:(page OR pages).

A more sophisticated modifier is proximity restriction: if you surround a query with ()~k, where k is an integer, only documents satisfying the query within a region of text at most k words long will be returned. When the query combines several terms with AND, the result is clear, but MG4J can make sense of it for any query. So title:(one page | two pages)~5 will search for documents in whose title either “one” and “page” or “two” and “pages” appear within a distance of at most five (i.e., with at most three words in the middle).
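For the simple case of a conjunction of terms, the proximity restriction can be sketched as a window check over a tokenized text. The names ProximityCheck and withinWindow are hypothetical, and the sketch assumes that ~k accepts regions spanning at most k words, which is what the example with three words in the middle suggests; it does not cover arbitrary subqueries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProximityCheck {
    /** True if some run of at most k consecutive tokens contains every required term. */
    public static boolean withinWindow(List<String> tokens, Set<String> terms, int k) {
        for (int start = 0; start < tokens.size(); start++) {
            int end = Math.min(tokens.size(), start + k);
            // Collect the distinct tokens of the window [start, end).
            Set<String> seen = new HashSet<>(tokens.subList(start, end));
            if (seen.containsAll(terms)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = Set.of("one", "page");
        // Three words between "one" and "page": the region spans five words.
        System.out.println(withinWindow(List.of("one", "very", "nice", "long", "page"), terms, 5)); // true
        // Four words in between: the region spans six words.
        System.out.println(withinWindow(List.of("one", "very", "nice", "and", "long", "page"), terms, 5)); // false
    }
}
```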

Advanced features

Using ordered conjunction, denoted by <, you can search for several terms in a specified order. So line < break will search for “line” and “break”, but only in this order. A document containing just one occurrence of “break” followed by one occurrence of “line” won't match.
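The ordering requirement can be illustrated on plain terms with a short sketch. OrderedConjunction and occursInOrder are hypothetical names, and the code only handles single terms over a token list, not the general subqueries MG4J supports.

```java
import java.util.List;

public class OrderedConjunction {
    /** True if the terms occur in the given order, possibly with other tokens in between. */
    public static boolean occursInOrder(List<String> tokens, List<String> terms) {
        int next = 0; // index of the next term still to be found
        for (String token : tokens) {
            if (next < terms.size() && token.equals(terms.get(next))) next++;
        }
        return next == terms.size();
    }

    public static void main(String[] args) {
        List<String> query = List.of("line", "break");
        System.out.println(occursInOrder(List.of("a", "line", "without", "break"), query)); // true
        System.out.println(occursInOrder(List.of("break", "the", "line"), query));          // false
    }
}
```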

There is a query expansion operator + whose meaning in terms of retrieved documents is exactly equivalent to disjunction (OR). However, for scoring purposes all terms in a query expansion are treated as if they were identical (in terms of frequency) to the first term. This behaviour makes the scoring more precise, as it prevents very rare variants of an expanded common term from getting too high a score.

To each query, MG4J associates a set of regions of text that satisfy the query. For instance, in the simple case of a conjunction of terms, this set is composed of the minimal regions of text containing all terms. Brouwerian difference, denoted by -, makes it possible to delete from the set of regions associated with the minuend any region containing some of the regions associated with the subtrahend. If the resulting set is empty, the document won't match. For instance, searching for "romeo $ juliet" - and will return all documents containing “romeo something juliet”, where, however, something cannot be “and”.
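A minimal sketch of the difference operation on regions, assuming each region is represented as a pair of inclusive word positions {left, right}. The class BrouwerianDifference, the method difference and this array representation are illustrative choices, not MG4J's own data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BrouwerianDifference {
    /**
     * Keeps the minuend regions that contain no subtrahend region.
     * Each region is an int[]{left, right} with inclusive word positions.
     */
    public static List<int[]> difference(List<int[]> minuend, List<int[]> subtrahend) {
        List<int[]> kept = new ArrayList<>();
        for (int[] m : minuend) {
            boolean containsSubtrahend = false;
            for (int[] s : subtrahend) {
                if (m[0] <= s[0] && s[1] <= m[1]) { // s lies entirely inside m
                    containsSubtrahend = true;
                    break;
                }
            }
            if (!containsSubtrahend) kept.add(m);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Text: romeo(0) and(1) juliet(2) met(3) romeo(4) loves(5) juliet(6)
        List<int[]> romeoAnyJuliet = List.of(new int[]{0, 2}, new int[]{4, 6});
        List<int[]> and = List.of(new int[]{1, 1});
        for (int[] region : difference(romeoAnyJuliet, and)) {
            System.out.println(Arrays.toString(region)); // [4, 6]
        }
    }
}
```

Only the region covering “romeo loves juliet” survives, so this sample text would still match; had the list come back empty, the document would be rejected.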

Complex queries

As we already discussed, MG4J associates regions of text with each query. These regions are subsequently used to give meaning to composed queries. Thus, wherever we used “term” in the above explanations for simplicity, we could always have used “subquery”. Some examples follow:

  • "one page" < "two pages" will search for the exact sequence “one page” followed (possibly with words in between) by the exact sequence “two pages”;
  • (we | you) < "go home" will search for the term “we” or the term “you” followed (possibly with words in between) by the exact sequence “go home”;
  • "(we | you) (go < home)" will search for the term “we” or the term “you” immediately followed by the term “go” followed (possibly with words in between) by the term “home”;
  • (cheap "computer appliances")~5 will search for the term “cheap” and the exact sequence “computer appliances”, but they must appear with at most two words in between;
  • "romeo $ juliet" will search for “romeo something juliet”.

For the most complete and up-to-date information on MG4J query syntax, please refer to the Javadoc documentation.




