edu.nyu.jet.ne.DocumentToSentenceIterator

Information extraction is the process of identifying specified classes of entities, relations, and events in natural language text, creating structured data from unstructured input. JET, the Java Extraction Toolkit, developed at New York University over the past fifteen years, provides a rich set of tools for research and education in information extraction from English text. These include standard language processing tools such as a tokenizer, sentence segmenter, part-of-speech tagger, name tagger, regular-expression pattern matcher, and dependency parser. Also provided are relation and event extractors based on the specifications of the U.S. Government's ACE (Automatic Content Extraction) program. The toolkit is provided under an Apache 2.0 license.

// -*- tab-width: 4 -*-
package edu.nyu.jet.ne;

import java.util.Iterator;
import java.util.List;

import edu.nyu.jet.tipster.Annotation;
import edu.nyu.jet.tipster.Document;
import edu.umass.cs.mallet.base.types.Instance;
import edu.umass.cs.mallet.base.pipe.iterator.AbstractPipeInputIterator;

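/**
 * Pipe input iterator that walks the "sentence" annotations of a JET
 * Document and wraps each sentence's span in a MALLET Instance named
 * "sentence" followed by a running index.
 */
public class DocumentToSentenceIterator extends AbstractPipeInputIterator {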
	private static final String SENTENCE = "sentence";

	private Document doc;

	private List<Annotation> sentences;

	private Iterator<Annotation> sentenceIter;

	private int index;

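	/**
	 * @param doc             document whose sentence annotations are iterated
	 * @param textSegmentName not used by this implementation
	 * @param firstIndex      number appended to the name of the first instance
	 */
	public DocumentToSentenceIterator(Document doc, String textSegmentName,
			int firstIndex) {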
		this.doc = doc;

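		// Assumption: annotationsOfType may return null when the document has
		// no annotations of this type, so fall back to an empty list.
		sentences = doc.annotationsOfType(SENTENCE);
		if (sentences == null) {
			sentences = new java.util.ArrayList<Annotation>();
		}
		sentenceIter = sentences.iterator();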
		this.index = firstIndex;
	}

	/** Same as the three-argument constructor, numbering instances from 1. */
	public DocumentToSentenceIterator(Document doc, String textSegmentName) {
		this(doc, textSegmentName, 1);
	}

	@Override
	public boolean hasNext() {
		return sentenceIter.hasNext();
	}

	@Override
	public Instance nextInstance() {
		Annotation sentence = sentenceIter.next();
		// data = sentence span, target = none, name = "sentence<index>", source = doc
		Instance carrier = new Instance(sentence.span(), null, "sentence" + index, doc);
		index++;
		return carrier;
	}
}
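
The iteration and naming pattern used by the class above can be tried out without JET or MALLET on the classpath. The following is only a minimal sketch under stated assumptions: `Span`, `Carrier`, `SpanIterator`, and `names` are hypothetical stand-ins invented for this illustration (they are not JET or MALLET types), and the span offsets are made up. It shows how a wrapped iterator hands out one carrier per sentence and numbers them from a caller-supplied first index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SentenceIterationSketch {

    /** Hypothetical stand-in for a sentence span: character offsets into the text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for an instance carrier: a payload plus a name. */
    static final class Carrier {
        final Span data;
        final String name;

        Carrier(Span data, String name) {
            this.data = data;
            this.name = name;
        }
    }

    /** Wraps a list of spans and names each carrier "sentence" + running index. */
    static final class SpanIterator implements Iterator<Carrier> {
        private final Iterator<Span> spans;
        private int index;

        SpanIterator(List<Span> spans, int firstIndex) {
            this.spans = spans.iterator();
            this.index = firstIndex;
        }

        @Override
        public boolean hasNext() {
            return spans.hasNext();
        }

        @Override
        public Carrier next() {
            Span span = spans.next();
            Carrier carrier = new Carrier(span, "sentence" + index);
            index++;
            return carrier;
        }
    }

    /** Collects the carrier names produced for the given spans. */
    static List<String> names(List<Span> spans, int firstIndex) {
        List<String> result = new ArrayList<>();
        Iterator<Carrier> it = new SpanIterator(spans, firstIndex);
        while (it.hasNext()) {
            result.add(it.next().name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up sentence spans, numbered from 1 as in the two-argument constructor.
        List<Span> spans = Arrays.asList(new Span(0, 12), new Span(13, 30));
        System.out.println(names(spans, 1)); // prints [sentence1, sentence2]
    }
}
```

Passing a different first index only shifts the numbering, which is presumably what allows instance names to continue across several documents; that reading is an inference from the constructor signature, not something the source states.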