
MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. The big version is a fork of the original MG4J that can handle more than 2^31 terms and documents.



  
MG4J (big): Managing Gigabytes for Java

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. The big version is a fork of the original MG4J that can handle more than 2^31 terms and documents.

MG4J is distributed under the GNU Lesser General Public License.

Why Java?

Writing code in Java that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take the following points into consideration:

  • Improvements in JVMs make low-level code written in Java faster and faster; often, the performance penalty with respect to an equivalent C/C++ application is relatively small.
  • Compression techniques can be mixed in several different ways, and an object-oriented language makes it very easy to play with different implementations of the same interface.
  • Most of the time, particularly in real-world applications, you will need to rewrite all or part of the code. In this case, a clean implementation in an object-oriented language is certainly a better learning tool than a C implementation.
  • Usually, you need very fast lookups, but you can relax during index construction. Since MG4J writes completely documented bit streams, it is very easy to read its output from optimised C code.
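To give a flavour of the bit rolling mentioned above, here is a minimal sketch of a textbook Elias gamma code, a classic instantaneous code for positive integers. The class `GammaSketch` and its methods are hypothetical names used for illustration only; bits are kept in a string for readability, and the exact codes and bit layout written by MG4J are those described in its own documentation, not necessarily the ones shown here.

```java
// Illustrative only: a textbook Elias gamma code, not MG4J's bit-stream classes.
final class GammaSketch {
    /** Encodes a positive integer as a string of '0'/'1' characters. */
    static String encode(int x) {
        if (x < 1) throw new IllegalArgumentException("x must be positive");
        int msb = 31 - Integer.numberOfLeadingZeros(x); // floor(log2(x))
        StringBuilder bits = new StringBuilder();
        // Unary prefix: msb zeros announce how many bits follow the leading one.
        for (int i = 0; i < msb; i++) bits.append('0');
        // Binary part: the msb + 1 significant bits of x, most significant first.
        for (int i = msb; i >= 0; i--) bits.append(((x >>> i) & 1) == 0 ? '0' : '1');
        return bits.toString();
    }

    /** Decodes a single gamma-coded integer found at the start of the string. */
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;
        int x = 0;
        for (int i = zeros; i <= 2 * zeros; i++) x = (x << 1) | (bits.charAt(i) - '0');
        return x;
    }

    public static void main(String[] args) {
        System.out.println(encode(5));          // prints 00101
        System.out.println(decode(encode(5)));  // prints 5
    }
}
```

Because the format is fully determined by a few lines like these, a reader written in another language only has to mirror the same bit order.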

Roadmap

MG4J is vast. Some of its components are the result of long-term research efforts, and are not easy to describe in full detail. Here we give a roadmap to the documentation, so that you do not have to wander recklessly through dozens of package descriptions.

First of all, MG4J comes with a manual that describes how to build indices, and how to access them from the command line or from the web. It is a good idea to start from the manual, build and play with a few indices, and then come back to package documentation, as the latter often refers to artifacts created by index construction.

If you want to interface MG4J with your own data, you must read the package documentation of {@link it.unimi.dsi.big.mg4j.document}, which describes document sequences, collections and factories.

If you want to load and query an index, you must read the package documentation of {@link it.unimi.dsi.big.mg4j.index}, which describes indices and index readers. The package also contains the documentation about {@linkplain it.unimi.dsi.big.mg4j.index.TermProcessor term processors}, which transform terms before they are actually indexed; they are fundamental to customising the indexing process.
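The idea of a term processor can be illustrated with a small self-contained sketch. The interface `SimpleTermProcessor` below is a hypothetical stand-in written for this example; it is not the actual {@link it.unimi.dsi.big.mg4j.index.TermProcessor} interface, whose methods and argument types should be taken from its own documentation. The sketch assumes a processor rewrites a term in place and reports whether the term should be indexed at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: a simplified stand-in for the notion of a term processor.
final class TermProcessorSketch {
    /** Hypothetical interface: rewrites the term in place, returns false to drop it. */
    interface SimpleTermProcessor {
        boolean processTerm(StringBuilder term);
    }

    /** Downcases every term and drops empty ones. */
    static final SimpleTermProcessor DOWNCASE = term -> {
        String lower = term.toString().toLowerCase(Locale.ROOT);
        term.setLength(0);
        term.append(lower);
        return term.length() > 0;
    };

    /** Applies a processor to raw terms and returns those that would be indexed. */
    static List<String> process(List<String> rawTerms, SimpleTermProcessor processor) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawTerms) {
            StringBuilder term = new StringBuilder(raw);
            if (processor.processTerm(term)) kept.add(term.toString());
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [managing, gigabytes]
        System.out.println(process(Arrays.asList("Managing", "GIGABYTES", ""), DOWNCASE));
    }
}
```

Whatever transformation is applied at indexing time must also be applied to query terms, otherwise a query for "Managing" would never match the indexed "managing".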

If you want to have a look at your index, the package {@link it.unimi.dsi.big.mg4j.query} contains many useful classes that can help. In particular, a simple {@linkplain it.unimi.dsi.big.mg4j.query.Query command-line tool} lets you query an index using a standard syntax. The tool also makes it possible to query the index using a browser. If you plan on using the command line frequently, we suggest a utility such as rlwrap to provide command-line history and editing.

In a real application, you might want to customise the index querying process. First of all, you must decide which syntax you want to use. A good starting point is described in the package {@link it.unimi.dsi.big.mg4j.query.parser}, which contains a simple parser generated with JavaCC. The parser generates an abstract query described by a composite object whose description is given in {@link it.unimi.dsi.big.mg4j.query.nodes}. The query can then be turned into a {@link it.unimi.dsi.big.mg4j.search.DocumentIterator}, which will return the documents matching the query and also the document intervals satisfying the query: the minimal-interval semantics used by MG4J is described in detail in {@link it.unimi.dsi.big.mg4j.search}, which also contains a description of the syntax used by the {@linkplain it.unimi.dsi.big.mg4j.query.Query command-line tool}.
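As a rough illustration of what minimal intervals are, the following sketch computes them for a conjunction of distinct terms inside a single document, given the sorted positions of each term. An interval is reported only if it contains every term and no smaller interval inside it does. The class `MinimalIntervalSketch` is hypothetical and uses a plain sliding window; it is not the algorithm implemented by MG4J's iterators, which handle a much richer set of operators.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: minimal intervals for an AND of distinct terms in one document.
final class MinimalIntervalSketch {
    /**
     * @param positions positions[t] holds the sorted, nonnegative positions of term t.
     * @return the minimal intervals {left, right} (inclusive) containing all terms.
     */
    static List<int[]> minimalIntervals(int[][] positions) {
        // Merge all occurrences into one list of {position, term}, sorted by position.
        List<int[]> events = new ArrayList<>();
        for (int t = 0; t < positions.length; t++)
            for (int p : positions[t]) events.add(new int[] { p, t });
        events.sort(Comparator.comparingInt((int[] e) -> e[0]));

        int[] count = new int[positions.length]; // occurrences of each term in the window
        int covered = 0, left = 0, lastLeft = -1;
        List<int[]> result = new ArrayList<>();
        for (int right = 0; right < events.size(); right++) {
            if (count[events.get(right)[1]]++ == 0) covered++;
            // Shrink from the left while the leftmost term occurs again in the window.
            while (count[events.get(left)[1]] > 1) {
                count[events.get(left)[1]]--;
                left++;
            }
            int l = events.get(left)[0];
            // A window is minimal only if its left end moved past the previous result.
            if (covered == positions.length && l > lastLeft) {
                result.add(new int[] { l, events.get(right)[0] });
                lastLeft = l;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Term 0 occurs at positions 0 and 5, term 1 at positions 2 and 9.
        for (int[] i : minimalIntervals(new int[][] { { 0, 5 }, { 2, 9 } }))
            System.out.println("[" + i[0] + ".." + i[1] + "]");
    }
}
```

For the input in `main` the intervals are [0..2], [2..5] and [5..9]; an interval such as [0..5] is not reported because it contains the smaller interval [0..2].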

Once a document iterator returning the matching documents is available, it is usually necessary to rank the documents. MG4J provides an abstract notion of {@link it.unimi.dsi.big.mg4j.search.score.Scorer} and provides several examples. Scoring is a very sophisticated issue, and a lot of research has been devoted to this subject. MG4J provides implementations of some state-of-the-art scorers such as {@linkplain it.unimi.dsi.big.mg4j.search.score.BM25Scorer BM25}, and also new scorers based on minimal-interval semantics such as {@link it.unimi.dsi.big.mg4j.search.score.VignaScorer}.
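For readers unfamiliar with BM25, the contribution of a single query term to the score of a document can be sketched as follows. This is one common textbook formulation, with the usual free parameters k1 and b and a smoothed inverse document frequency; the class `Bm25Sketch`, its method names and the choice of IDF variant are assumptions made for this example, and the exact formula and defaults of {@link it.unimi.dsi.big.mg4j.search.score.BM25Scorer} are those given in its own documentation.

```java
// Illustrative only: one common formulation of the BM25 term weight.
final class Bm25Sketch {
    /** Smoothed inverse document frequency (always positive in this variant). */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Contribution of one term to the score of one document.
     *
     * @param tf           occurrences of the term in the document
     * @param docLength    length of the document, in words
     * @param avgDocLength average document length in the collection
     * @param numDocs      number of documents in the collection
     * @param docFreq      number of documents containing the term
     * @param k1           term-frequency saturation parameter (often around 1.2)
     * @param b            length-normalisation parameter in [0..1] (often 0.75)
     */
    static double termScore(int tf, long docLength, double avgDocLength,
                            long numDocs, long docFreq, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLength / avgDocLength);
        return idf(numDocs, docFreq) * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // The same term frequency counts for less in a longer document.
        System.out.println(termScore(3, 100, 100.0, 1_000_000, 1_000, 1.2, 0.75));
        System.out.println(termScore(3, 400, 100.0, 1_000_000, 1_000, 1.2, 0.75));
    }
}
```

The score of a document for a multi-term query is then the sum of the contributions of the query terms.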

All these pieces come together in the {@link it.unimi.dsi.big.mg4j.query.QueryEngine}, which takes one or more queries, scores their results using one or more scorers, and returns only a certain part of the results themselves, decorated with suitably selected intervals that can be used to generate snippets. The query engine has several tunable parameters, so you can adapt it to your application. We suggest that you play with the {@linkplain it.unimi.dsi.big.mg4j.query.Query command-line tool} and the associated web interface to become familiar with the query-engine inner workings.
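The step of returning only part of the scored results can be pictured as a top-k selection. The sketch below keeps the k highest-scored documents using a bounded min-heap; the names `TopKSketch`, `Scored` and `topK` are hypothetical, document pointers are assumed to be `long` values, and nothing here is meant to describe how the query engine is actually implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: keeping the k best-scored documents with a bounded heap.
final class TopKSketch {
    /** A document pointer paired with its score. */
    static final class Scored {
        final long document;
        final double score;

        Scored(long document, double score) {
            this.document = document;
            this.score = score;
        }
    }

    /** Returns at most k results, highest score first. */
    static List<Scored> topK(long[] documents, double[] scores, int k) {
        Comparator<Scored> byScore = Comparator.comparingDouble((Scored s) -> s.score);
        // Min-heap on score: the head is always the weakest result kept so far.
        PriorityQueue<Scored> heap = new PriorityQueue<>(byScore);
        for (int i = 0; i < documents.length; i++) {
            heap.add(new Scored(documents[i], scores[i]));
            if (heap.size() > k) heap.poll(); // discard the current lowest score
        }
        List<Scored> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }

    public static void main(String[] args) {
        long[] docs = { 10, 11, 12, 13 };
        double[] scores = { 0.2, 0.9, 0.5, 0.7 };
        for (Scored s : topK(docs, scores, 2))
            System.out.println(s.document + " " + s.score);
    }
}
```

With the sample data in `main`, documents 11 and 13 are returned, in that order. The heap never holds more than k + 1 entries, so memory use does not depend on the number of matching documents.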

Package Dependencies

MG4J (big) requires Java 6 or later and relies on the DSI utilities and on two packages providing high-performance containers and algorithms, namely fastutil 6.4 or greater and Sux4J. Command-line parsing and support require JSAP. Factories and collections use pdfbox and a Javamail implementation. The HTTP interface uses the Jetty 6 HTTP server, velocity, velocity-tools and the servlet APIs. MG4J also uses a number of useful libraries from the Jakarta Commons project, including collections, lang, configuration and io. All logging is performed using log4j. Compiling MG4J requires javacc and jars from Tika (and related dependencies).




