
edu.umd.cloud9.collection.trec.package.html
University of Maryland's Hadoop Library
Provides classes for working with the TREC collection (particularly
disks 4 and 5). TREC disks 4 and 5 represent one of the standard
collections used in information retrieval research. There are two
common "views" of the collection:
- TREC disks 4 and 5, minus CR (Congressional Record): a total of
528,155 documents. The original documentation links to a complete
listing of all files that comprise this configuration.
- TREC disks 4 and 5, minus CR (Congressional Record) and FR
(Federal Register): a total of 472,525 documents. The original
documentation links to a complete listing of all files that comprise
this configuration.
Here are the two steps for preparing the collection for processing
with Hadoop:
- The distribution of the collection consists of many individual
small files (listed above). Since Hadoop works better with large
files, it is advisable to cat the individual files together (e.g.,
with a simple Perl script).
- Since many information retrieval algorithms require a sequential
numbering of documents, it is necessary to build a mapping between
docids (e.g., LA123190-0134) and docnos (sequentially-numbered ints).
The NumberTrecDocuments class accomplishes this. Here is a sample
invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
/umd/collections/trec/trec4-5_noCRFR.xml \
/user/jimmylin/trec-docid-tmp \
/user/jimmylin/docno.mapping 100
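The two preparation steps can also be illustrated locally, outside of Hadoop. The following is a minimal sketch in Python, not the Cloud9 implementation: the function names concatenate and number_documents are hypothetical, and it assumes that docnos are assigned in sorted docid order starting from 1, which may differ from what NumberTrecDocuments actually does. It relies only on the standard TREC convention that each document carries its docid inside a <DOCNO> tag.

```python
import re
from pathlib import Path

# TREC documents carry their identifier in a <DOCNO> tag, e.g.
# <DOCNO> LA123190-0134 </DOCNO>
DOCNO_TAG = re.compile(r"<DOCNO>\s*([^\s<]+)\s*</DOCNO>")


def concatenate(parts, combined):
    """Step 1: append many small files to one large file, in sorted path order."""
    with open(combined, "wb") as out:
        for part in sorted(parts):
            data = Path(part).read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep adjacent documents on separate lines


def number_documents(combined):
    """Step 2: map each docid to a sequential docno.

    Assumption (not confirmed by the source): docnos follow sorted
    docid order and start at 1.
    """
    docids = set()
    with open(combined, encoding="latin-1") as f:
        for line in f:
            match = DOCNO_TAG.search(line)
            if match:
                docids.add(match.group(1))
    return {docid: docno for docno, docid in enumerate(sorted(docids), start=1)}
```

Calling concatenate on the list of distribution files and then number_documents on the combined file yields a dictionary from docid strings to ints. For the full collection the Hadoop job above is the appropriate tool; this sketch is only meant to make the docid and docno relationship concrete.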
After the corpus has been prepared, it is ready for processing with
Hadoop. The DemoCountTrecDocuments class is a simple demo program that
counts all documents in the collection. It provides a skeleton for
MapReduce programs that process the collection. Here is a sample
invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
/umd/collections/trec/trec4-5_noCRFR.xml \
/user/jimmylin/count-tmp \
/user/jimmylin/docno.mapping 100
The output key-value pairs in this sample program are the
docid-to-docno mappings.
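
To make the shape of such a job concrete, here is a small local simulation in Python of a map-only pass that counts documents and emits (docid, docno) pairs. It is a sketch under stated assumptions, not the code of DemoCountTrecDocuments: the names map_document and run_job are hypothetical, the counter names are invented for illustration, and the docno mapping is passed in as a plain dictionary rather than read from a mapping file.

```python
import re
from collections import Counter

# Each TREC document is delimited by <DOC> ... </DOC> and carries
# its docid in a <DOCNO> tag.
DOC_BLOCK = re.compile(r"<DOC>.*?</DOC>", re.DOTALL)
DOCNO_TAG = re.compile(r"<DOCNO>\s*([^\s<]+)\s*</DOCNO>")


def map_document(doc_text, docno_mapping, counters):
    """Map step: count one document and emit its (docid, docno) pair."""
    match = DOCNO_TAG.search(doc_text)
    if match is None:
        counters["malformed"] += 1  # hypothetical counter name
        return
    docid = match.group(1)
    counters["documents"] += 1  # hypothetical counter name
    yield docid, docno_mapping[docid]


def run_job(collection_text, docno_mapping):
    """Apply the map step to every <DOC> block; no reduce step is used."""
    counters = Counter()
    output = []
    for block in DOC_BLOCK.finditer(collection_text):
        output.extend(map_document(block.group(0), docno_mapping, counters))
    return output, counters
```

A program that does real work on the collection would replace the body of map_document with its own per-document logic while keeping the same iteration over documents.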