edu.nyu.jet.ne.DocumentToSentenceIterator

Information extraction is the process of identifying specified classes of entities, relations, and events in natural language text – creating structured data from unstructured input. JET, the Java Extraction Toolkit, developed at New York University over the past fifteen years, provides a rich set of tools for research and education in information extraction from English text. These include standard language processing tools such as a tokenizer, sentence segmenter, part-of-speech tagger, name tagger, regular-expression pattern matcher, and dependency parser. Also provided are relation and event extractors based on the specifications of the U.S. Government's ACE [Automatic Content Extraction] program. The program is provided under an Apache 2.0 license.

// -*- tab-width: 4 -*-
package edu.nyu.jet.ne;

import java.util.Iterator;
import java.util.List;

import edu.nyu.jet.tipster.Annotation;
import edu.nyu.jet.tipster.Document;
import edu.umass.cs.mallet.base.types.Instance;
import edu.umass.cs.mallet.base.pipe.iterator.AbstractPipeInputIterator;

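/**
 * Feeds the sentences of a JET Document into a MALLET pipe, producing one
 * Instance per "sentence" annotation.
 */
public class DocumentToSentenceIterator extends AbstractPipeInputIterator {
	/** Annotation type that marks sentences in the document. */
	private static final String SENTENCE = "sentence";

	private final Document doc;

	private List<Annotation> sentences;

	private final Iterator<Annotation> sentenceIter;

	/** Running number appended to the name of each instance. */
	private int index;

	/**
	 * @param doc             document that already carries "sentence" annotations
	 * @param textSegmentName not used by this class
	 * @param firstIndex      number used in the name of the first instance
	 */
	public DocumentToSentenceIterator(Document doc, String textSegmentName,
			int firstIndex) {
		this.doc = doc;

		sentences = doc.annotationsOfType(SENTENCE);
		// Defensive: annotationsOfType may return null when the document has
		// no annotations of this type; treat that as an empty sentence list.
		if (sentences == null) {
			sentences = java.util.Collections.emptyList();
		}
		sentenceIter = sentences.iterator();
		this.index = firstIndex;
	}

	/** Same as the three-argument constructor, numbering instances from 1. */
	public DocumentToSentenceIterator(Document doc, String textSegmentName) {
		this(doc, textSegmentName, 1);
	}

	@Override
	public boolean hasNext() {
		return sentenceIter.hasNext();
	}

	/**
	 * Wraps the span of the next sentence in an Instance: data is the span,
	 * target is null, name is "sentence" + index, and source is the document.
	 */
	@Override
	public Instance nextInstance() {
		Annotation sentence = sentenceIter.next();
		Instance carrier = new Instance(sentence.span(), null, SENTENCE + index, doc);
		index++;
		return carrier;
	}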
}
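
The class above is an adapter: it walks a list of sentence spans and gives each one a name built from a running index. The sketch below illustrates only that naming pattern and runs without JET or MALLET on the classpath. Span, NamedInstance, SpanToInstanceIterator and collectNames are hypothetical stand-ins invented for this example. They are not part of either library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SentenceIteratorSketch {
    /** Hypothetical stand-in for a JET span: start and end character offsets. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for a MALLET Instance: a data object plus a name. */
    static final class NamedInstance {
        final Span data;
        final String name;

        NamedInstance(Span data, String name) {
            this.data = data;
            this.name = name;
        }
    }

    /** Wraps an iterator of spans and names each item "sentence" + running index. */
    static final class SpanToInstanceIterator implements Iterator<NamedInstance> {
        private final Iterator<Span> spans;
        private int index;

        SpanToInstanceIterator(List<Span> spans, int firstIndex) {
            this.spans = spans.iterator();
            this.index = firstIndex;
        }

        @Override
        public boolean hasNext() {
            return spans.hasNext();
        }

        @Override
        public NamedInstance next() {
            NamedInstance instance = new NamedInstance(spans.next(), "sentence" + index);
            index++;
            return instance;
        }
    }

    /** Drains the iterator and returns the generated instance names in order. */
    static List<String> collectNames(List<Span> spans, int firstIndex) {
        List<String> names = new ArrayList<>();
        Iterator<NamedInstance> it = new SpanToInstanceIterator(spans, firstIndex);
        while (it.hasNext()) {
            names.add(it.next().name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(new Span(0, 12), new Span(12, 30));
        System.out.println(collectNames(spans, 1)); // prints [sentence1, sentence2]
    }
}
```

Starting the index at a caller-supplied value, as the three-argument constructor does, would let several documents be fed in sequence without reusing instance names.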