MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. The big version is a fork of the original MG4J that can handle more than 2^31 terms and documents.


  
    MG4J: Managing Gigabytes for Java
  

  

    Index generation and access.

    
This package contains the classes that handle index generation and
    access.  The interval iterators defined in {@link it.unimi.dsi.big.mg4j.search}
	build upon the classes of this package to answer queries using interval semantics,
	but it is also possible to access an index directly.
	
	
You can easily build indices using the tools in {@link it.unimi.dsi.big.mg4j.tool}. Once an index
	has been built, it can be opened using an {@link it.unimi.dsi.big.mg4j.index.Index} object, which
	gathers metadata that is necessary to access the index. You do not create an  {@link it.unimi.dsi.big.mg4j.index.Index}
	with a constructor: rather, you use the static factory {@link  
	it.unimi.dsi.big.mg4j.index.Index#getInstance(CharSequence)} (or one of its variants) to create an instance.
	This is necessary so that different kinds of indices can be treated transparently: for example, the factory
	may return a {@link it.unimi.dsi.big.mg4j.index.cluster.IndexCluster} if the index is actually a cluster,
	but you do not need to know that.
	
	
From an {@link it.unimi.dsi.big.mg4j.index.Index}
	you can obtain an {@link it.unimi.dsi.big.mg4j.index.IndexReader}, which lets you
	scan the index sequentially or access it randomly. In turn, from an {@link it.unimi.dsi.big.mg4j.index.IndexReader}
	you can obtain an {@link  it.unimi.dsi.big.mg4j.index.IndexIterator}
	that returns the documents containing a given term and the positions of the term within each document.

	
But there is more: an {@link  it.unimi.dsi.big.mg4j.index.IndexIterator}
		is a kind of {@link  it.unimi.dsi.big.mg4j.search.DocumentIterator}, and 
		{@link  it.unimi.dsi.big.mg4j.search.DocumentIterator}s can be combined in several ways
	using the classes of the package {@link it.unimi.dsi.big.mg4j.search}: for instance, you can combine
	document iterators using AND/OR. Note that you can combine document iterators on different
	indices, but of course the operation is meaningful only if the two indices contain different information
	about the same document collection (e.g., title and main text).
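	
As a rough illustration of what an AND of two document iterators computes, the following self-contained sketch intersects two strictly increasing lists of document pointers. It is a toy model, not MG4J code: the class <code>AndSketch</code>, the method <code>and</code> and the sample data are hypothetical, and the real classes in {@link it.unimi.dsi.big.mg4j.search} work lazily on iterators rather than on arrays.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an AND over two document iterators (hypothetical, not MG4J code). */
final class AndSketch {
    /** Returns the document pointers that appear in both strictly increasing arrays. */
    static List<Long> and(long[] a, long[] b) {
        List<Long> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;        // the first list is behind: advance it
            else if (a[i] > b[j]) j++;   // the second list is behind: advance it
            else {                       // both lists agree on a document
                result.add(a[i]);
                i++;
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] title = {1, 4, 7, 9};  // assumed matches in an index of titles
        long[] text = {2, 4, 9, 12};  // assumed matches in an index of main text
        System.out.println(and(title, text)); // prints [4, 9]
    }
}
```

The intersection is meaningful here only because both arrays are assumed to number the documents of the same collection in the same way, which mirrors the remark above about combining different indices.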
	
	
More importantly, if the index is full text (the default), for each document containing the term you can get
		interval iterators that return intervals representing extents of text satisfying the query: for 
		instance, in the case of an AND of two terms, each interval will contain both terms.
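		
To make the idea of intervals for an AND of two terms concrete, here is a minimal sketch that, given the positions of two terms within one document, lists the intervals that contain both terms and cannot be shrunk further. Restricting the output to such minimal intervals is an assumption of this sketch, and the class <code>IntervalSketch</code> and its method are hypothetical names, not part of this package.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy computation of minimal intervals containing two terms (hypothetical, not MG4J code). */
final class IntervalSketch {
    /**
     * Given the strictly increasing positions of two distinct terms in one document,
     * returns the intervals {left, right} that contain both terms and have no proper
     * sub-interval containing both. These are the pairs of consecutive occurrences
     * (in merged order) that belong to different terms.
     */
    static List<int[]> minimalIntervals(int[] a, int[] b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        boolean hasPrev = false, prevIsA = false;
        int prevPos = 0;
        while (i < a.length || j < b.length) {
            // Take the smaller of the two next positions (merge step).
            boolean takeA = j == b.length || (i < a.length && a[i] < b[j]);
            int pos = takeA ? a[i++] : b[j++];
            if (hasPrev && prevIsA != takeA) out.add(new int[] {prevPos, pos});
            hasPrev = true;
            prevIsA = takeA;
            prevPos = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] first = {3, 10, 20};  // assumed positions of the first term
        int[] second = {5, 6, 25};  // assumed positions of the second term
        // Prints [3..5], [6..10] and [20..25], one per line.
        for (int[] iv : minimalIntervals(first, second)) {
            System.out.println("[" + iv[0] + ".." + iv[1] + "]");
        }
    }
}
```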
		
		
    
Structure of an inverted index
    
    An inverted index is made of a sequence of inverted lists (one inverted
    list for each term). Inverted lists are made of document records: each
    document record contains information about the occurrences of the term
    within a certain document.

    
More precisely, each inverted list starts with a suitably encoded
    integer, called the frequency, which is the number of document
    records that will follow (i.e., the number of documents in which the term
    appears). After that, there are exactly as many document records as the
    frequency.

    
Each document record consists of two parts:
    

      a suitably encoded integer, called the (document) pointer,
      which identifies the document within the collection;

      
a (possibly empty) sequence of bits, called the data; the
      data have no special structure per se: the only assumption is that they
      are a self-delimiting bit sequence (i.e., one knows when the sequence is
      over).
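
The layout just described (a frequency followed by that many document records) can be sketched with a toy bit-string encoder. The sketch makes several assumptions that the text above does not: integers are written with an Elias gamma code (storing x as x + 1 so that zero can be represented), the data part of every record is empty, and bits are kept in a <code>String</code> for readability. The class <code>InvertedListSketch</code> and its methods are hypothetical names, and the actual classes of this package support other codings.

```java
/** Toy model of an inverted list with empty document data (hypothetical, not MG4J code). */
final class InvertedListSketch {
    /** Appends the Elias gamma code of x >= 0 (stored as x + 1) to the bit string. */
    static void writeGamma(StringBuilder bits, long x) {
        String binary = Long.toBinaryString(x + 1);
        for (int i = 1; i < binary.length(); i++) bits.append('0'); // length prefix, in unary
        bits.append(binary);
    }

    /** Reads one gamma-coded integer starting at pos[0] and advances pos[0] past it. */
    static long readGamma(String bits, int[] pos) {
        int zeros = 0;
        while (bits.charAt(pos[0]) == '0') {
            zeros++;
            pos[0]++;
        }
        long value = Long.parseLong(bits.substring(pos[0], pos[0] + zeros + 1), 2);
        pos[0] += zeros + 1;
        return value - 1;
    }

    /** Writes the frequency, then one record (pointer only, empty data) per document. */
    static String write(long[] pointers) {
        StringBuilder bits = new StringBuilder();
        writeGamma(bits, pointers.length);           // the frequency
        for (long p : pointers) writeGamma(bits, p); // the document pointers
        return bits.toString();
    }

    /** Reads the frequency, then exactly that many document pointers. */
    static long[] read(String bits) {
        int[] pos = {0};
        int frequency = (int) readGamma(bits, pos);
        long[] pointers = new long[frequency];
        for (int k = 0; k < frequency; k++) pointers[k] = readGamma(bits, pos);
        return pointers;
    }

    public static void main(String[] args) {
        String bits = write(new long[] {0, 3, 8});
        System.out.println(bits + " -> " + java.util.Arrays.toString(read(bits)));
    }
}
```

Because each gamma code announces its own length, the reader never needs an external delimiter. This is the same self-delimiting property required of the data part of a record.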
    

    As a basic and fundamental implementation, the classes of this package provide methods
    that write and read document data in a default form. In this default
    structure, the data of each document record are a suitable coding of a (strictly
    increasing) sequence of integers that correspond to the positions
    where the term occurs within the document. The length of the sequence
    (i.e., the number of positions at which the term appears) is called the
    count (it is also common to call it “within-document frequency”, but we find this
    usage confusing).
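
One possible reading of the default form can be sketched as follows: the count is stored first, followed by the positions written as gaps from the previous position. Both the gap encoding and the flat <code>int[]</code> representation are assumptions of this sketch (the text only says that the coding is "suitable"), and <code>PositionsSketch</code> is a hypothetical name.

```java
import java.util.Arrays;

/** Toy model of default document data: a count followed by position gaps (hypothetical). */
final class PositionsSketch {
    /** Encodes strictly increasing positions as {count, first position, gap, gap, ...}. */
    static int[] encode(int[] positions) {
        int[] data = new int[positions.length + 1];
        data[0] = positions.length; // the count
        int prev = 0;
        for (int k = 0; k < positions.length; k++) {
            data[k + 1] = positions[k] - prev; // distance from the previous position
            prev = positions[k];
        }
        return data;
    }

    /** Rebuilds the positions by reading the count and summing the gaps. */
    static int[] decode(int[] data) {
        int count = data[0];
        int[] positions = new int[count];
        int prev = 0;
        for (int k = 0; k < count; k++) {
            prev += data[k + 1];
            positions[k] = prev;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] positions = {2, 5, 6, 40}; // assumed positions of a term in one document
        System.out.println(Arrays.toString(encode(positions))); // prints [4, 2, 3, 1, 34]
    }
}
```

Storing the count first makes the data self-delimiting, since a reader knows how many gaps to consume. Because the positions are strictly increasing, every gap after the first is at least one and usually small, so it suits a variable-length integer code.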