
edu.umd.cloud9.collection.trec.package.html


Provides classes for working with the TREC collection (particularly disks 4 and 5). TREC disks 4 and 5 represent one of the standard collections used in information retrieval research. There are two common "views" of the collection:

  • TREC disks 4 and 5, minus CR (Congressional Record): a total of 528,155 documents. A complete listing of all files that comprise this configuration.
  • TREC disks 4 and 5, minus CR (Congressional Record) and FR (Federal Register): a total of 472,525 documents. A complete listing of all files that comprise this configuration.

Here are the two steps for preparing the collection for processing with Hadoop:

  1. The distribution of the collection consists of many individual small files (listed above). Since Hadoop works better with large files, it is advisable to cat the individual files together (e.g., with a simple Perl script).
  2. Since many information retrieval algorithms require a sequential numbering of documents, it is necessary to build a mapping between docids (e.g., LA123190-0134) and docnos (sequentially-numbered ints). The class NumberTrecDocuments accomplishes this. Here is a sample invocation:

     hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
       /umd/collections/trec/trec4-5_noCRFR.xml \
       /user/jimmylin/trec-docid-tmp \
       /user/jimmylin/docno.mapping 100
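
The two preparation steps can be sketched locally in plain Python. This is a minimal illustration, not the implementation of NumberTrecDocuments: it assumes each document carries its docid in a single-line <DOCNO> tag, and it assumes docnos are assigned from 1 in sorted docid order. The function names, file names, and sample documents are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: each document stores its docid in a single-line <DOCNO> tag.
DOCNO_TAG = re.compile(r"<DOCNO>\s*(\S+?)\s*</DOCNO>")


def concatenate(parts, combined):
    """Step 1: write many small files into one large file, in sorted path order."""
    with open(combined, "wb") as out:
        for part in sorted(parts):
            data = Path(part).read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep adjacent documents on separate lines


def number_docids(combined):
    """Step 2: map each docid to a docno (1, 2, 3, ...).

    Assumption: docnos follow sorted docid order. The real tool may differ.
    """
    docids = set()
    with open(combined, encoding="latin-1") as f:
        for line in f:
            docids.update(DOCNO_TAG.findall(line))
    return {docid: docno for docno, docid in enumerate(sorted(docids), start=1)}


# Tiny made-up sample standing in for the real distribution files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "part-b").write_text(
        "<DOC>\n<DOCNO> LA123190-0134 </DOCNO>\n<TEXT>\nsecond file\n</TEXT>\n</DOC>\n"
    )
    (root / "part-a").write_text(
        "<DOC>\n<DOCNO> FT911-3 </DOCNO>\n<TEXT>\nfirst file\n</TEXT>\n</DOC>"
    )
    combined = root / "combined.xml"
    concatenate([root / "part-b", root / "part-a"], combined)
    mapping = number_docids(combined)

for docid, docno in mapping.items():
    print(f"{docid}\t{docno}")
```

For the full collection, the Hadoop job above remains the appropriate tool; a single-process script like this is only useful for checking a small sample.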
    

After the corpus has been prepared, it is ready for processing with Hadoop. The class DemoCountTrecDocuments is a simple demo program that counts all documents in the collection. It provides a skeleton for MapReduce programs that process the collection. Here is a sample invocation:

    hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
      /umd/collections/trec/trec4-5_noCRFR.xml \
      /user/jimmylin/count-tmp \
      /user/jimmylin/docno.mapping 100

The output key-value pairs of this sample program are the docid-to-docno mappings.
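
To make the shape of that computation concrete, here is a single-process Python sketch of the same idea: read each document, emit a (docid, docno) pair, and count the documents seen. It is not the code of DemoCountTrecDocuments and does not use Hadoop. It assumes <DOC> and </DOC> each sit on their own line and that the docid appears in a <DOCNO> tag; the function names, the sample text, and the mapping values are hypothetical.

```python
import re
from io import StringIO

# Assumption: the docid appears in a <DOCNO> tag inside each <DOC> record.
DOCNO_TAG = re.compile(r"<DOCNO>\s*(\S+?)\s*</DOCNO>")


def read_documents(lines):
    """Yield the raw text of each <DOC>...</DOC> record from an iterable of lines."""
    record = None
    for line in lines:
        if line.strip() == "<DOC>":
            record = [line]  # start a new record
        elif record is not None:
            record.append(line)
            if line.strip() == "</DOC>":
                yield "".join(record)
                record = None


def count_documents(lines, docno_of):
    """Return the (docid, docno) pairs in input order, plus the document count."""
    pairs = []
    for doc in read_documents(lines):
        match = DOCNO_TAG.search(doc)
        if match is None:
            continue  # skip records that carry no docid
        docid = match.group(1)
        pairs.append((docid, docno_of[docid]))  # KeyError if docid is unmapped
    return pairs, len(pairs)


# Made-up sample input and mapping, for illustration only.
sample = (
    "<DOC>\n<DOCNO> LA123190-0134 </DOCNO>\n<TEXT>\nfirst\n</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> FT911-3 </DOCNO>\n<TEXT>\nsecond\n</TEXT>\n</DOC>\n"
)
docno_of = {"FT911-3": 1, "LA123190-0134": 2}

pairs, total = count_documents(StringIO(sample), docno_of)
for docid, docno in pairs:
    print(f"{docid}\t{docno}")
print("documents:", total)
```

In a MapReduce setting the per-document work would live in the mapper and the count would typically be kept in a job counter; the loop here simply collapses both into one pass.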




