de.tudarmstadt.ukp.jwktl.parser.WiktionaryEntryParser

JWKTL (Java Wiktionary Library) is a Java-based API that enables efficient and structured access to the information encoded in the English and German Wiktionary editions, including sense definitions, part-of-speech tags, etymology, example sentences, translations, semantic relations, and many other types of lexical information.
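
The class comment in the source listing describes the parser as a finite state machine that offers each line to a set of block handlers. The following is a minimal, self-contained sketch of that dispatch idea, not JWKTL's implementation: the names BlockDispatchSketch, BlockHandler, HeadingBlock, canHandle, processBody and parse are hypothetical, and the rule that a block ends at the next line starting with "==" is an assumption made only for illustration. The real IBlockHandler contract may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockDispatchSketch {

    /** Hypothetical handler contract; not the actual JWKTL IBlockHandler. */
    interface BlockHandler {
        /** Asked while no block is active: does this line open my block? */
        boolean canHandle(String line);
        /** Asked while my block is active: does this line still belong to it? */
        boolean processBody(String line);
        String name();
    }

    /** Assumed rule: a block opens at a fixed heading and ends at the next "==" line. */
    static final class HeadingBlock implements BlockHandler {
        private final String heading;
        HeadingBlock(String heading) { this.heading = heading; }
        public boolean canHandle(String line) { return line.trim().equals(heading); }
        public boolean processBody(String line) { return !line.startsWith("=="); }
        public String name() { return heading; }
    }

    private enum Status { IN_HEAD, IN_BODY }

    /** Runs the two-state loop and returns a trace of what happened. */
    static List<String> parse(String text, List<? extends BlockHandler> handlers) {
        List<String> events = new ArrayList<>();
        Status status = Status.IN_HEAD;
        BlockHandler current = null;
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // IN_BODY: the active handler decides whether its block continues.
            if (status == Status.IN_BODY) {
                if (current.processBody(line)) {
                    events.add(current.name() + ": " + line);
                    continue;
                }
                events.add("end " + current.name());
                status = Status.IN_HEAD;
                current = null;
            }
            // IN_HEAD: offer the line to every handler; the first taker wins.
            for (BlockHandler handler : handlers) {
                if (handler.canHandle(line)) {
                    current = handler;
                    status = Status.IN_BODY;
                    events.add("begin " + handler.name());
                    break;
                }
            }
        }
        if (current != null) {
            events.add("end " + current.name()); // close the last open block
        }
        return events;
    }

    public static void main(String[] args) {
        List<BlockHandler> handlers = Arrays.<BlockHandler>asList(
                new HeadingBlock("==Noun=="), new HeadingBlock("==Etymology=="));
        String page = "==Noun==\n# a sense\n==Etymology==\nFrom Latin.";
        for (String event : parse(page, handlers)) {
            System.out.println(event);
        }
    }
}
```

Keeping the "which handler?" decision separate from the "is the block finished?" decision is what lets a language-specific subclass swap in its own handler list without touching the loop.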

/*******************************************************************************
 * Copyright 2013
 * Ubiquitous Knowledge Processing (UKP) Lab
 * Technische Universität Darmstadt
 * 
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * 
 *   http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 ******************************************************************************/
package de.tudarmstadt.ukp.jwktl.parser;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedList;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Pattern;

import de.tudarmstadt.ukp.jwktl.api.IWiktionaryEntry;
import de.tudarmstadt.ukp.jwktl.api.IWiktionarySense;
import de.tudarmstadt.ukp.jwktl.api.entry.WiktionaryPage;
import de.tudarmstadt.ukp.jwktl.api.util.ILanguage;
import de.tudarmstadt.ukp.jwktl.parser.util.IBlockHandler;
import de.tudarmstadt.ukp.jwktl.parser.util.ParsingContext;

/**
 * Base implementation for parsing the textual contents of an article page in
 * order to construct {@link IWiktionaryEntry} and {@link IWiktionarySense}
 * instances. The parser is a finite state machine that asks each of its
 * block handlers whether it wants to process the current line of text. The
 * handler that accepts the line then processes the subsequent lines until
 * its block is complete; the next line is again offered to the block
 * handlers in order to select the handler for the following block. Since
 * there are large differences between the individual Wiktionary language
 * editions, there should be one subclass of this parser for each language
 * edition, which takes care of the language-specific adaptation and the
 * selection of the block handlers used.
 * @author Christian M. Meyer
 * @author Christof Müller
 */
public abstract class WiktionaryEntryParser implements IWiktionaryEntryParser {

	private enum ParseStatus {
		IN_BODY,
		IN_HEAD
	}
	
	private static final Logger logger = Logger.getLogger(WiktionaryEntryParser.class.getName());
	
	protected static final Pattern COMMENT_PATTERN = Pattern.compile("\\