edu.stanford.nlp.ie.package.html



  

This package and its subpackages implement various approaches to information extraction.
Some examples of use appear later in this description.
At the moment, five types of information extraction are supported
      (some of which have internal variants):


Regular expression based matching: These extractors are hand-written
      and match whatever the regular expression matches.
Conditional Random Field classifier: A sequence tagger based on
a CRF model that can be used for NER tagging and other sequence labeling tasks.
Conditional Markov Model classifier: A classifier based on
a CMM that can be used for NER tagging and other labeling tasks.
Hidden Markov model based extractors:  These can be either single
	field extractors or two-level HMMs in which the individual
	component models and the way they are glued together are trained
	separately.  These models are trained automatically, but require tagged
	training data.
Description extractor: This does higher-level NLP analysis of
	sentences (using a POS tagger and chunker) to find sentences
	that describe an object.  This might be a biography of a person,
	or a description of an animal.  This module is fixed: there is
	nothing to write or train (unless one wants to start to change
	its internal behavior).
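
To make the first approach in the list above concrete, here is a minimal
sketch of a hand-written regular expression extractor.  It uses only
java.util.regex and none of the classes in this package; the class name
RegexPhoneExtractor and the US-style phone number pattern are assumptions
made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example class; not part of edu.stanford.nlp.ie. */
class RegexPhoneExtractor {
  // Assumed pattern: US-style numbers such as 650-723-2300 or (650) 723-2300.
  private static final Pattern PHONE =
      Pattern.compile("(?:\\(\\d{3}\\)\\s?|\\d{3}[-.])\\d{3}[-.]\\d{4}");

  /** Returns every substring of text matched by the pattern, in order. */
  static List<String> extract(String text) {
    List<String> found = new ArrayList<>();
    Matcher m = PHONE.matcher(text);
    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    System.out.println(extract("Call (650) 723-2300 or 650-725-1234 for details."));
  }
}
```

As the list says, such an extractor matches exactly what the expression
matches, so its quality depends entirely on how carefully the pattern is
written.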


There are some demonstrations here that you can run (several
    other classes also have main() methods that exhibit their
    functionality):


NERGUI is a simple GUI front-end to the NER tagging
	components.
crf/NERGUI is a simple GUI front-end to the CRF-based NER tagging
	components.  This version only supports the CRF-based NER tagger.
demo/NERDemo is a simple class exemplifying the programmatic use
of the CRF-based NER tagger.


Usage examples

0. Setup: For all of these examples except 3., you need to be
connected to the Internet, and the application's web search module
must be able to connect to search engines.  The web search
functionality is provided by the supplied edu.stanford.nlp.web
package.  How web search works is controlled
by a websearch.init file in your current directory (if
      none is present, you will get search results from AltaVista).  If
      you are registered to use the Google API, you should probably edit
      this file so that web queries can be sent to Google using its SOAP
      interface.  Even if not, you can specify additional or different
      search engines to access in websearch.init.
      A copy of this file is supplied in the distribution.  The
DescExtractor in 4. also requires another init file so that
it can use the included part-of-speech tagger.

1. Corporate Contact Information.  This illustrates simple information
extraction from a web page.
Use the included
      ExtractDemo.bat, or run by hand:
java edu.stanford.nlp.ie.ExtractDemo


Select as Extractor Directory the folder:
serialized-extractors/companycontact
Select as an Ontology the one in
serialized-extractors/companycontact/Corporation-Information.kaon

Enter Corporation as the Concept to extract.
You can then do various searches:

You can enter a URL, click Extract, and look at the results:

http://www.ziatech.com/
http://www.cs.stanford.edu/
http://www.ananova.com/business/story/sm_635565.html

The components will work reasonably well on clean-ish text pages like
    these.  They work even better on text such as newswire or press
releases, as one can demonstrate either over the web or using the
    command line extractor.
You can also search for a term and get extractions from the top
search hits by entering the term in the "Search for words" box and
	    pressing "Extract":

Audiovox Corporation

Extraction is done over a number of pages from a search engine, and the
	    results from each are shown.  Typically some of these pages
	    will have suitable content to extract, and some just won't.



2. Corporate Contact Information merged.  This illustrates the addition
of information merger across web pages.  Use the included
      MergeExtractDemo.bat, or equivalently run:
java edu.stanford.nlp.ie.ExtractDemo -m

The ExtractDemo screen is similar, but adds a button to
    Select a Merger.


Select an Extractor Directory and Ontology as
    above.
Click on "Select Merger" and then navigate to
serialized-extractors/mergers and Select the file
unscoredmerger.obj.
Enter the concept "Corporation" as before.
One can now extract as above, by URL or by word search, but a Merger is only
	appropriate for a word search with multiple results.   Try entering
	the following in Search for words:

Audiovox Corporation

and press "Extract".  Results gradually appear.  After all results have
	been processed (this may take a few seconds), a merged result holding
	the best extracted information is produced and displayed as
	the first of the results.  "Merged Instance" will appear on the
	bottom line corresponding to it, rather than a URL.
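
The idea of merging extractions from several pages can be sketched as
follows.  This is not the algorithm stored in unscoredmerger.obj, whose
behavior is not described here; it is one simple, hypothetical strategy
(a per-field majority vote), and the class name MajorityVoteMerger and the
field names "name" and "phone" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of merging; not the merger shipped with ExtractDemo. */
class MajorityVoteMerger {
  /**
   * Merges per-page extractions (field name -> value) into one record by
   * keeping, for each field, the value that occurs on the most pages.
   * Ties are broken in favor of the value seen first.
   */
  static Map<String, String> merge(List<Map<String, String>> pages) {
    // field -> (value -> count); insertion order gives stable tie-breaking
    Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    for (Map<String, String> page : pages) {
      for (Map.Entry<String, String> e : page.entrySet()) {
        counts.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
              .merge(e.getValue(), 1, Integer::sum);
      }
    }
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Integer>> field : counts.entrySet()) {
      String best = null;
      int bestCount = 0;
      for (Map.Entry<String, Integer> v : field.getValue().entrySet()) {
        if (v.getValue() > bestCount) { // strict '>' keeps the earliest value on ties
          best = v.getKey();
          bestCount = v.getValue();
        }
      }
      merged.put(field.getKey(), best);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Map<String, String>> pages = new ArrayList<>();
    Map<String, String> first = new LinkedHashMap<>();
    first.put("name", "Audiovox Corporation");
    first.put("phone", "111");
    Map<String, String> second = new LinkedHashMap<>();
    second.put("name", "Audiovox Corp.");
    pages.add(first);
    pages.add(second);
    pages.add(first);
    System.out.println(merge(pages));
  }
}
```

A vote like this only helps when several pages yield values for the same
field, which matches the remark above that a Merger is only appropriate
to a word search with multiple results.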



3. Company names via direct use of an HMM information extractor.
One can also train, load, and use HMM information extractors directly,
	  without using any of the RDF-based KAON framework
(http://kaon.semanticweb.org/) used by ExtractDemo.


The class edu.stanford.nlp.ie.hmm.Tester illustrates the use
	  of a pretrained HMM on data via the command line interface:

cd serialized-extractors/companycontact/
java edu.stanford.nlp.ie.hmm.Tester cisco.txt company company-name.hmm
java edu.stanford.nlp.ie.hmm.Tester EarningsReports.txt company company-name.hmm
java edu.stanford.nlp.ie.hmm.Tester companytest.txt company company-name.hmm

  
The first shows the HMM running on a file that contains a single
	document with no markup.  The second shows a Corpus of several
	documents separated by ENDOFDOC, which serves as the document delimiter
	inside a Corpus.  This second use of Tester normally expects
an annotated corpus against which it can score its answers.
Here, the corpus is unannotated, and so some of the output is
	not meaningful, but it shows what is selected as the company name
	for each document (it's mostly correct...).
The final example shows it running on a corpus that does have answers
marked in it.  It does the testing with the XML elements stripped, but
	then uses them to evaluate correctness.
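
To illustrate the corpus layout just described, here is a minimal sketch
that splits one corpus file's text into documents at ENDOFDOC.  It is not
the package's own Corpus reader: the class name CorpusSplitter is
hypothetical, and the sketch assumes the delimiter stands alone on its
own line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the Corpus class used by Tester. */
class CorpusSplitter {
  static final String DELIMITER = "ENDOFDOC";

  /** Splits corpus text into documents at lines consisting only of ENDOFDOC. */
  static List<String> split(String corpus) {
    List<String> docs = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (String line : corpus.split("\\R")) {
      if (line.trim().equals(DELIMITER)) {
        addIfNonEmpty(docs, current);
        current.setLength(0); // start collecting the next document
      } else {
        current.append(line).append('\n');
      }
    }
    addIfNonEmpty(docs, current); // the last document may lack a trailing delimiter
    return docs;
  }

  private static void addIfNonEmpty(List<String> docs, StringBuilder sb) {
    String doc = sb.toString().trim();
    if (!doc.isEmpty()) {
      docs.add(doc);
    }
  }

  public static void main(String[] args) {
    String corpus = "Cisco reported earnings.\nENDOFDOC\nAudiovox shares rose.\nENDOFDOC\n";
    System.out.println(split(corpus).size() + " documents");
  }
}
```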
  

To train one's own HMM, one needs data in which one or
	    more fields are annotated in the style of an XML
	    element, with all the documents in one file, separated by
	    lines with ENDOFDOC on them.  Then one can
	    train (and then test) as follows.   Training an HMM
	    (optimizing all its probabilities) takes a long time:
	    it depends on the speed of the computer, but expect 10 minutes or
	so to adjust probabilities for a fixed structure, and often
	hours if one additionally attempts structure learning.

cd edu/stanford/nlp/ie/training/
java -server edu.stanford.nlp.ie.hmm.Trainer companydata.txt company mycompany.hmm
java edu.stanford.nlp.ie.hmm.HMMSingleFieldExtractor Company mycompany.hmm mycompany.obj
java edu.stanford.nlp.ie.hmm.Tester testdoc.txt company mycompany.hmm

The third line converts a serialized HMM into the serialized objects used
	    in ExtractDemo.  Note that company
	    in the second line must match the element name in the
	    marked-up data that you will train on, while
	    Company in the third line must match the
	    relation name in the ontology over which you will extract with
	    mycompany.obj.  These two names need not be the
	    same.  The last line then runs the trained HMM on a file.
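
The marked-up data format can be sketched as follows.  The text above says
only that fields are annotated "in the style of an XML element", so this
sketch assumes plain, attribute-free tags such as
<company>Cisco Systems</company>; the class name AnnotatedFieldReader is
hypothetical and the code is not what Trainer or Tester actually runs.  It
shows the two things an evaluator needs: the annotated answers, and the
text with the annotations stripped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for XML-style field annotations; not part of the package. */
class AnnotatedFieldReader {
  /** Returns the text inside each <field>...</field> element, in document order. */
  static List<String> goldValues(String doc, String field) {
    Pattern p = Pattern.compile(
        "<" + Pattern.quote(field) + ">(.*?)</" + Pattern.quote(field) + ">",
        Pattern.DOTALL);
    List<String> values = new ArrayList<>();
    Matcher m = p.matcher(doc);
    while (m.find()) {
      values.add(m.group(1).trim());
    }
    return values;
  }

  /** Removes the opening and closing tags of the field, keeping the enclosed text. */
  static String stripTags(String doc, String field) {
    return doc.replaceAll("</?" + Pattern.quote(field) + ">", "");
  }

  public static void main(String[] args) {
    String doc = "<company>Cisco Systems</company> said it will buy <company>Acme</company>.";
    System.out.println(goldValues(doc, "company"));
    System.out.println(stripTags(doc, "company"));
  }
}
```

Here the field argument plays the role of company on the Trainer and Tester
command lines: it must match the element name used in the data.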



4. Extraction of descriptions (such as biographical information about
	  a person or a description of an animal).
This extracts such descriptions
from a web page.  This component uses a POS tagger, and reads the
	  path to it from the file
	  descextractor.init in the current directory.  So,
	  you should be in the root directory of the current archive,
	  which has such a file.  Double click on the included
      MergeExtractDemo.bat in that directory, or equivalently
	  run by hand:
java edu.stanford.nlp.ie.ExtractDemo -m


Select as Extractor Directory the folder:
serialized-extractors/description
Select as an Ontology the one in
serialized-extractors/description/Entity-NameDescription.kaon

Click on "Select Merger" and then navigate to
serialized-extractors/mergers and Select the file
unscoredmerger.obj.
Enter Entity as the Concept to extract.
You can then do various searches for people or animals by entering
	    words in the "Search for words" box and pressing Extract:

Gareth Evans
Tawny Frogmouth
Christopher Manning
Joshua Nkomo

The first search will be slower than subsequent searches, as it takes a
	    while to load the part-of-speech tagger.