
org.terrier.indexing package
Provides classes and interfaces related to the indexing of documents.
There are three main abstract concepts that are related to the code
of this package.
The first is the concept of a Collection of documents. This can be
a standard TREC test collection, or a connection to a database from
which the documents are extracted.
The second abstraction is the concept of a Document. An implementation
of a collection should iterate through the documents in the collection
and return them one at a time. Each document encapsulates the parser
required to extract the information to index. Implementations of documents
are provided for TREC documents, PDF documents and standard Microsoft Office
formats, such as MS Word, MS PowerPoint and MS Excel.
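
The relationship between a collection and its documents can be illustrated
with a minimal sketch. The names below (SketchCollection, SketchDocument,
PlainTextDocument, InMemoryCollection, readAll) are hypothetical and are not
the interfaces of this package; they only mirror the idea that a collection
hands out one document at a time and that each document owns its own parsing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

public class CollectionSketch {

    /** Hypothetical stand-in for a document: it owns the parsing of its content. */
    interface SketchDocument {
        /** Returns the next term, or null when the document is exhausted. */
        String nextTerm();
    }

    /** Hypothetical stand-in for a collection: hands out one document at a time. */
    interface SketchCollection {
        /** Returns the next document, or null when the collection is exhausted. */
        SketchDocument nextDocument();
    }

    /** Plain-text document whose "parser" lower-cases and splits on non-alphanumerics. */
    static final class PlainTextDocument implements SketchDocument {
        private final Iterator<String> terms;

        PlainTextDocument(String text) {
            List<String> parsed = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    parsed.add(token);
                }
            }
            this.terms = parsed.iterator();
        }

        @Override
        public String nextTerm() {
            return terms.hasNext() ? terms.next() : null;
        }
    }

    /** Collection backed by an in-memory list of raw texts. */
    static final class InMemoryCollection implements SketchCollection {
        private final Iterator<String> texts;

        InMemoryCollection(List<String> texts) {
            this.texts = texts.iterator();
        }

        @Override
        public SketchDocument nextDocument() {
            return texts.hasNext() ? new PlainTextDocument(texts.next()) : null;
        }
    }

    /** Drains a collection and returns the terms of each document, in order. */
    static List<List<String>> readAll(SketchCollection collection) {
        List<List<String>> result = new ArrayList<>();
        SketchDocument doc;
        while ((doc = collection.nextDocument()) != null) {
            List<String> terms = new ArrayList<>();
            String term;
            while ((term = doc.nextTerm()) != null) {
                terms.add(term);
            }
            result.add(terms);
        }
        return result;
    }

    public static void main(String[] args) {
        SketchCollection collection = new InMemoryCollection(
                List.of("Terrier indexes documents.", "Documents contain terms"));
        System.out.println(readAll(collection));
    }
}
```

Because the consumer only sees the two interfaces, a different source (for
example a database-backed collection) or a different format (for example a
PDF parser) could be swapped in without changing the loop in readAll.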
The third abstraction is the Indexer, the process that
iterates through the documents of a collection and creates the
necessary data structures. There are several implemented indexers:
- BasicIndexer - indexes a Collection without recording position information.
A DirectIndex is also built.
- BlockIndexer - as BasicIndexer, but also records position information.
- BasicSinglePassIndexer - creates an index without building a DirectIndex. This approach is inherently more scalable than BasicIndexer.
- BlockSinglePassIndexer - as BasicSinglePassIndexer, but also records position information.
- Hadoop_BasicSinglePassIndexer - a distributed single-pass indexer that makes use of a Hadoop MapReduce cluster.
- Hadoop_BlockSinglePassIndexer - as Hadoop_BasicSinglePassIndexer, but also records position information.
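
The difference between indexing with and without position information can
be sketched as follows. This is an illustration only: the class PostingSketch
and its methods indexFrequencies and indexPositions are hypothetical, use
plain in-memory maps, and do not reflect the data structures or APIs of the
indexers listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingSketch {

    /** Term -> (document id -> term frequency). No positions are kept. */
    static Map<String, Map<Integer, Integer>> indexFrequencies(List<List<String>> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                // Count one more occurrence of this term in this document.
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    /** Term -> (document id -> positions at which the term occurs). */
    static Map<String, Map<Integer, List<Integer>>> indexPositions(List<List<String>> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            List<String> terms = docs.get(docId);
            for (int pos = 0; pos < terms.size(); pos++) {
                // Record the zero-based offset of this occurrence.
                index.computeIfAbsent(terms.get(pos), t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("a", "b", "a"),
                List.of("b"));
        System.out.println(indexFrequencies(docs));
        System.out.println(indexPositions(docs));
    }
}
```

The positional variant stores strictly more data per posting, which is the
trade-off to weigh when choosing between an indexer that records positions
and one that does not: positions enable phrase and proximity matching at the
cost of a larger index.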