src.overview.html Maven / Gradle / Ivy

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java.
There is a newer version: 5.2.2


  
    MG4J: Managing Gigabytes for Java
  

  
    MG4J (Managing Gigabytes for Java) is a free full-text search engine
    for large document collections written in Java.

    
MG4J is distributed under the GNU Lesser General Public License.
	

	
Warning
	
	MG4J 5.0 brings several new features, but also source and binary incompatibilities with
  previous releases.
  	

 MG4J is no longer based on gap-based indices. Classical interleaved indices
  are used for incremental index construction, and high-performance indices
  are still supported for historical reasons, but all new indices are by
  default built using the new {@linkplain it.unimi.di.mg4j.index.QuasiSuccinctIndexWriter quasi-succinct format}, 
  which brings unprecedented performance and improves compression.
  	
The package prefix of MG4J is now it.unimi.di.*, following the change of
  name of our department; the new prefix also eases the transition and makes coexistence with previous versions possible.
	
{@link it.unimi.di.mg4j.search.DocumentIterator#nextDocument()}
	now returns {@link it.unimi.di.mg4j.search.DocumentIterator#END_OF_LIST} instead of -1 to
  denote list exhaustion.
	
{@link it.unimi.di.mg4j.search.DocumentIterator} is now strictly lazy; in
  particular, it does not implement {@link java.util.Iterator}. Please replace
  	calls to hasNext() with a check that
  {@link it.unimi.di.mg4j.search.DocumentIterator#nextDocument()} != {@link it.unimi.di.mg4j.search.DocumentIterator#END_OF_LIST}, or check
  whether the semantics of {@link it.unimi.di.mg4j.search.DocumentIterator#mayHaveNext()} suit you.

 
The plethora of methods that accessed the positions of a term in an
  {@link it.unimi.di.mg4j.index.IndexIterator} have been replaced by the single lazy {@link it.unimi.di.mg4j.index.IndexIterator#nextPosition()} call,
  which returns {@link it.unimi.di.mg4j.index.IndexIterator#END_OF_POSITIONS} when the positions are
  exhausted. Some static methods in {@link it.unimi.di.mg4j.index.IndexIterators} should help with the
  transition.
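
The sentinel-based iteration described in the last two items can be sketched as follows. This is a self-contained illustration that does not use MG4J classes: ToyPostings, LazyIterationSketch and the values of the two constants are assumptions made for this example only. In real code the loop would run over a DocumentIterator or IndexIterator and compare against the library's own END_OF_LIST and END_OF_POSITIONS.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of sentinel-terminated lazy iteration; not part of MG4J. */
final class LazyIterationSketch {
    // Stand-in sentinels for this sketch only (assumed values). Real code should
    // use DocumentIterator.END_OF_LIST and IndexIterator.END_OF_POSITIONS.
    static final int END_OF_LIST = Integer.MAX_VALUE;
    static final int END_OF_POSITIONS = Integer.MAX_VALUE;

    /** Hypothetical posting list: documents[i] contains the term at positions[i]. */
    static final class ToyPostings {
        private final int[] documents;
        private final int[][] positions;
        private int current = -1; // index of the current document, -1 before the first call
        private int nextPos;      // index of the next position to return

        ToyPostings(int[] documents, int[][] positions) {
            this.documents = documents;
            this.positions = positions;
        }

        /** Advances to the next document, or returns END_OF_LIST when exhausted. */
        int nextDocument() {
            if (current + 1 >= documents.length) {
                current = documents.length; // remain exhausted on further calls
                return END_OF_LIST;
            }
            current++;
            nextPos = 0;
            return documents[current];
        }

        /** Returns the next position in the current document, or END_OF_POSITIONS. */
        int nextPosition() {
            if (current < 0 || current >= documents.length)
                throw new IllegalStateException("Not positioned on a document");
            int[] p = positions[current];
            return nextPos < p.length ? p[nextPos++] : END_OF_POSITIONS;
        }
    }

    /** Walks the whole list with no hasNext(): each loop tests the returned value. */
    static List<String> dump(ToyPostings postings) {
        List<String> out = new ArrayList<>();
        int document;
        while ((document = postings.nextDocument()) != END_OF_LIST) {
            StringBuilder line = new StringBuilder().append(document).append(':');
            int position;
            while ((position = postings.nextPosition()) != END_OF_POSITIONS)
                line.append(' ').append(position);
            out.add(line.toString());
        }
        return out;
    }
}
```

For documents 3 and 7 with positions {0, 5} and {2}, dump returns the two strings "3: 0 5" and "7: 2". The point of the pattern is that exhaustion is discovered by the same call that advances the iterator, so no look-ahead is needed.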
	


	 Roadmap

	 MG4J is vast. Some of its components are the result of long-term research efforts, and
	 are not easy to describe in full detail. Here we give a roadmap to the documentation,
	 so that you do not have to wander recklessly through dozens of package descriptions.

	 First of all, MG4J comes with a manual that describes how to build
	 indices, and how to access them from the command line or from the web. It is a good idea
	 to start from the manual, build and play with a few indices, and then come back to package documentation,
	 as the latter often refers to artifacts created by index construction.

	 If you want to interface MG4J with your own data, you must read 
	 the package documentation of {@link it.unimi.di.mg4j.document}, which describes document
	 sequences, collections and factories.

	 If you want to load and query an index, you must read 
	 the package documentation of {@link it.unimi.di.mg4j.index}, which describes indices and
	 index readers. The package also contains the documentation about
	 {@linkplain it.unimi.di.mg4j.index.TermProcessor term processors}, which transform terms
	 before they are actually indexed; they are fundamental for customising the indexing process.
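
To give an idea of what a term processor does, here is a minimal sketch that downcases terms and drops those containing non-alphanumeric characters. ToyTermProcessor is a hypothetical stand-in defined for this example and working on StringBuilder; it is not the actual {@link it.unimi.di.mg4j.index.TermProcessor} interface, whose exact methods and argument types should be taken from its own documentation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of term processing before indexing; not part of MG4J. */
final class TermProcessorSketch {
    /** Hypothetical stand-in for a term processor (an assumption of this sketch). */
    interface ToyTermProcessor {
        /** Rewrites the term in place; returns false if the term must not be indexed. */
        boolean processTerm(StringBuilder term);
    }

    /** Downcases alphanumeric terms and rejects everything else. */
    static final class DowncaseAlnum implements ToyTermProcessor {
        @Override
        public boolean processTerm(StringBuilder term) {
            if (term.length() == 0) return false;
            for (int i = 0; i < term.length(); i++) {
                char c = term.charAt(i);
                // A rejected term may be left partially downcased; callers discard it anyway.
                if (!Character.isLetterOrDigit(c)) return false;
                term.setCharAt(i, Character.toLowerCase(c));
            }
            return true;
        }
    }

    /** Applies the processor to each token and keeps only the accepted terms. */
    static List<String> process(List<String> tokens, ToyTermProcessor processor) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder term = new StringBuilder(token);
            if (processor.processTerm(term)) kept.add(term.toString());
        }
        return kept;
    }
}
```

Applied to the tokens "Managing", "GIGABYTES", "--" and "Java6", the sketch keeps "managing", "gigabytes" and "java6". The same processor must be used when querying, otherwise query terms will not match the terms stored in the index.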

	 If you want to have a look at your index, the package
	 {@link it.unimi.di.mg4j.query} contains many useful classes that can help. In particular,
	 a simple {@linkplain it.unimi.di.mg4j.query.Query command-line tool} lets you query an index using a standard syntax. The
	 tool also makes it possible to query the index using a browser (if you plan on using the command line
	 frequently, we suggest a utility such as rlwrap
	 to provide command-line history and editing).
	 
	 
In a real application, you might want to customise the index querying process. First
	 of all, you must decide which syntax you want to use. A good starting point is described
	 in the package {@link it.unimi.di.mg4j.query.parser}, which contains a simple parser generated
	 with JavaCC. The parser generates an abstract query
	 described by a composite object whose description is given in {@link it.unimi.di.mg4j.query.nodes}. The
	 query can then be turned into a {@link it.unimi.di.mg4j.search.DocumentIterator}, which will return
	 the documents matching the query and also the document intervals satisfying the query: the 
	 minimal-interval semantics
	 used by MG4J is described in detail in {@link it.unimi.di.mg4j.search}, which also contains
	 a description of the syntax used by the
	 {@linkplain it.unimi.di.mg4j.query.Query command-line tool}.
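
As a toy illustration of the idea behind minimal intervals, consider the conjunction of two terms inside a single document. An interval satisfies the query if it contains at least one occurrence of each term, and it is minimal if no proper subinterval does. The sketch below is self-contained and makes simplifying assumptions: both position lists are strictly increasing and share no position, and the class and method names are invented for this example. The authoritative definition remains the one in {@link it.unimi.di.mg4j.search}.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of minimal intervals for a two-term conjunction; not part of MG4J. */
final class MinimalIntervalSketch {
    /**
     * Returns the minimal intervals {left, right} that contain at least one
     * position from a and one from b. Both arrays must be strictly increasing
     * and disjoint (assumptions of this sketch).
     */
    static List<int[]> minimalIntervals(int[] a, int[] b) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        int previous = 0;
        boolean previousFromA = false;
        boolean hasPrevious = false;
        // Merge the two lists; every adjacent pair in the merged order that
        // comes from different lists delimits exactly one minimal interval.
        while (i < a.length || j < b.length) {
            boolean fromA = j == b.length || (i < a.length && a[i] < b[j]);
            int current = fromA ? a[i++] : b[j++];
            if (hasPrevious && fromA != previousFromA)
                result.add(new int[] { previous, current });
            previous = current;
            previousFromA = fromA;
            hasPrevious = true;
        }
        return result;
    }
}
```

With positions {1, 8} for the first term and {3, 4, 10} for the second, the sketch returns [1, 3], [4, 8] and [8, 10]. The interval [3, 8] also contains both terms, but it is not minimal because it contains [4, 8].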
	 
	 Once a document iterator returning the matching documents is available, it is usually necessary
	 to rank the documents. MG4J provides an abstract notion of {@link it.unimi.di.mg4j.search.score.Scorer}
	 and provides several examples. Scoring is a very sophisticated issue, and a lot of research has
	 been devoted to this subject. MG4J provides implementations of some state-of-the-art scorers
	 such as {@linkplain it.unimi.di.mg4j.search.score.BM25Scorer BM25}, and also new scorers based
	 on minimal-interval semantics such as {@link it.unimi.di.mg4j.search.score.VignaScorer}.
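
For readers unfamiliar with BM25, the following sketch computes the textbook per-term BM25 contribution. It is not the code of {@link it.unimi.di.mg4j.search.score.BM25Scorer}, which may use a different inverse-document-frequency variant and different default parameters. The class name, method name and parameter names are assumptions made for this example.

```java
/** Textbook BM25 term score; an illustration, not MG4J's implementation. */
final class Bm25Sketch {
    /**
     * Computes idf * f * (k1 + 1) / (f + k1 * (1 - b + b * length / averageLength)),
     * with the classic Robertson/Sparck Jones idf = ln((N - n + 0.5) / (n + 0.5)).
     *
     * @param frequency         occurrences of the term in the document (f)
     * @param documentLength    length of the document in words
     * @param averageLength     average document length in the collection
     * @param numberOfDocuments number of documents in the collection (N)
     * @param documentFrequency number of documents containing the term (n)
     * @param k1                term-frequency saturation parameter
     * @param b                 length-normalisation parameter, between 0 and 1
     */
    static double termScore(int frequency, int documentLength, double averageLength,
            long numberOfDocuments, long documentFrequency, double k1, double b) {
        // Note: this idf variant becomes negative when the term occurs in more
        // than half of the documents; practical scorers usually clamp it.
        double idf = Math.log((numberOfDocuments - documentFrequency + 0.5)
                / (documentFrequency + 0.5));
        double normalisation = k1 * (1 - b + b * documentLength / averageLength);
        return idf * frequency * (k1 + 1) / (frequency + normalisation);
    }
}
```

When the term occurs once in a document of average length, the frequency factor equals 1 and the score reduces to the idf; for a collection of 3 documents of which 1 contains the term, that is ln(5/3). The score grows with the frequency, but saturates, and decreases as the document gets longer than average.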
	 
	 All these pieces come together in the {@link it.unimi.di.mg4j.query.QueryEngine}, which takes one 
	 or more queries, scores their results using one or more scorers, and returns only a certain part of
	 the results themselves, decorated with suitably selected intervals that can be used to
	 generate snippets. The query engine has several tunable parameters, so you can adapt it to your application.
	 We suggest that you play with the {@linkplain it.unimi.di.mg4j.query.Query command-line tool} and
	 the associated web interface to become familiar with the inner workings of the query engine.

    Package Dependencies

    MG4J requires Java ≥6 and relies on the DSI utilities and two packages providing high-performance containers and
    algorithms, that is, fastutil 6.4 or greater, and Sux4J. 
    Command-line parsing and support requires JSAP. Factories and collections use
    pdfbox and a Javamail
    implementation. The HTTP interface uses the Jetty 6 HTTP server,
    velocity, 
    velocity-tools and the servlet APIs.
    MG4J also uses a number of useful libraries from the Jakarta commons project, 
    including collections,
    lang,
    configuration and
    io.
    All logging is performed using log4j.
    Compiling MG4J requires javacc and jars from Tika (and related dependencies).