/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * A filter that decomposes compound words you find in many Germanic languages into the word parts.
 * This example shows what it does:
 *
 * Input token stream:
 *
 *   Rindfleischüberwachungsgesetz Drahtschere abba
 *
 * Output token stream:
 *
 *   (Rindfleischüberwachungsgesetz,0,29)
 *   (Rind,0,4,posIncr=0)
 *   (fleisch,4,11,posIncr=0)
 *   (überwachung,11,22,posIncr=0)
 *   (gesetz,23,29,posIncr=0)
 *   (Drahtschere,30,41)
 *   (Draht,30,35,posIncr=0)
 *   (schere,35,41,posIncr=0)
 *   (abba,42,46)
 *
 * The input token is always preserved, and the filters do not alter the case of word parts. Two
 * variants of the filter are available:
 *
 *   • HyphenationCompoundWordTokenFilter: uses a hyphenation-grammar-based approach to find
 *     potential word parts of a given word.
 *   • DictionaryCompoundWordTokenFilter: uses a brute-force, dictionary-only approach to find
 *     the word parts of a given word.
 *

Compound word token filters


HyphenationCompoundWordTokenFilter

 * The {@link org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
 * HyphenationCompoundWordTokenFilter} uses hyphenation grammars to find potential subwords that
 * are worth checking against the dictionary. It can also be used without a dictionary, but it
 * then produces a lot of "nonword" tokens. The quality of the output tokens depends directly on
 * the quality of the grammar file you use. For languages like German, the available grammar
 * files are quite good.
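The candidate search described above can be illustrated with a small standalone sketch. It does not use Lucene: the class name `HyphenationCandidates`, the method `candidates`, and the break points for "Drahtschere" are all assumptions made for illustration, and a real hyphenation grammar may place the breaks differently. The sketch only shows the idea that every substring between two break points is a candidate, and that a dictionary, when present, filters out the nonwords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the Lucene implementation.
final class HyphenationCandidates {

    // Collects every substring that starts and ends on a break point.
    // A null dictionary keeps all candidates, including nonwords.
    static List<String> candidates(String word, int[] breaks, Set<String> lowerCaseDict) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < breaks.length; i++) {
            for (int j = i + 1; j < breaks.length; j++) {
                String part = word.substring(breaks[i], breaks[j]);
                // Compare case-insensitively, but emit the part in its original case.
                if (lowerCaseDict == null
                        || lowerCaseDict.contains(part.toLowerCase(Locale.ROOT))) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical break points (Draht|sche|re), including both word ends.
        int[] breaks = {0, 5, 9, 11};
        // Without a dictionary: [Draht, Drahtsche, Drahtschere, sche, schere, re]
        System.out.println(candidates("Drahtschere", breaks, null));
        // With a dictionary: [Draht, schere]
        System.out.println(candidates("Drahtschere", breaks, Set.of("draht", "schere")));
    }
}
```

Because only substrings between break points are tested, the number of dictionary lookups stays small, which is consistent with this variant being the faster one.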

Grammar file

 * Unfortunately, we cannot bundle the hyphenation grammar files with Lucene because they do not
 * use an ASF-compatible license (they use the LaTeX Project Public License instead). You can find
 * the XML-based grammar files at the Objects For Formatting Objects (OFFO) SourceForge project
 * (direct link to download the pattern files:
 * http://downloads.sourceforge.net/offo/offo-hyphenation.zip). The files you need are in the
 * subfolder offo-hyphenation/hyph/.
 *
 * Credits for the hyphenation code go to the Apache FOP project.

DictionaryCompoundWordTokenFilter

 * The {@link org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
 * DictionaryCompoundWordTokenFilter} uses a dictionary-only approach to find subwords in a compound
 * word. It is much slower than the filter that uses hyphenation grammars. Because it is much
 * simpler in design, it is a good starting point for checking whether your dictionary is good.
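The brute-force, dictionary-only idea can be sketched without Lucene as follows. The class name `BruteForceDecompounder` and the subword bounds of 2 and 15 characters are assumptions of this sketch, and the real filter applies further rules (such as a minimum word size) that are left out here. Offsets are relative to the start of the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the Lucene implementation.
final class BruteForceDecompounder {
    // Assumed subword bounds for this sketch.
    static final int MIN_SUBWORD = 2;
    static final int MAX_SUBWORD = 15;

    // Tries every start offset and every allowed length, and returns each
    // dictionary hit as "(part,start,end)" in the original case.
    static List<String> decompose(String word, Set<String> lowerCaseDict) {
        List<String> parts = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + MIN_SUBWORD <= word.length(); start++) {
            for (int len = MIN_SUBWORD;
                    len <= MAX_SUBWORD && start + len <= word.length(); len++) {
                int end = start + len;
                if (lowerCaseDict.contains(lower.substring(start, end))) {
                    parts.add("(" + word.substring(start, end) + "," + start + "," + end + ")");
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("rind", "fleisch", "draht", "schere", "gesetz", "aufgabe");
        System.out.println(decompose("Drahtschere", dict)); // [(Draht,0,5), (schere,5,11)]
        System.out.println(decompose("abba", dict));        // []
    }
}
```

Every start offset is combined with every allowed length, so the number of dictionary lookups grows with the word length times the subword range. That cost is why this variant is the slower of the two.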

Dictionary

 * The output quality of both token filters depends directly on the quality of the dictionary
 * you use. Dictionaries are language dependent, of course. Always use a dictionary that fits
 * the text you want to index: if you index medical text, for example, use a dictionary that
 * contains medical words. A good starting point for general text is the set of dictionaries
 * listed on the OpenOffice dictionaries Wiki.
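As a rough sketch of turning such a dictionary into a word set, the following assumes the common Hunspell-style `.dic` layout used by OpenOffice dictionaries: an optional entry count on the first line, then one word per line, optionally followed by `/` and affix flags. That layout, the class name `DicWordList`, and the lowercasing step are assumptions of this sketch; check the actual files you download.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; the file layout is an assumption.
final class DicWordList {

    // Extracts lowercase words from the lines of a .dic-style word list.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            // Skip blank lines and a leading numeric entry count.
            if (line.isEmpty() || (i == 0 && line.matches("\\d+"))) {
                continue;
            }
            // Drop affix flags such as "/S" after the word.
            int slash = line.indexOf('/');
            String word = slash >= 0 ? line.substring(0, slash) : line;
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [rind, fleisch, draht]
        System.out.println(parse(List.of("3", "Rind/S", "Fleisch", "Draht/EPS")));
    }
}
```

In practice the lines would come from something like `java.nio.file.Files.readAllLines`, using the character encoding the dictionary declares.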

Which variant should I use?

 * This decision matrix should help you:
 *
 * Comparison of dictionary-based and hyphenation-based decompounding
 *
 *   Token filter                       | Output quality                                     | Performance
 *   HyphenationCompoundWordTokenFilter | good if grammar file is good, acceptable otherwise | fast
 *   DictionaryCompoundWordTokenFilter  | good                                               | slow
 *

Examples

 *   // Written against the newer analysis API (Tokenizer.setReader, CharArraySet dictionaries).
 *   public void testHyphenationCompoundWordsDE() throws Exception {
 *     // ignoreCase = true, so "Fleisch" also matches the lowercase word part "fleisch"
 *     CharArraySet dict = new CharArraySet(Arrays.asList("Rind", "Fleisch", "Draht", "Schere",
 *         "Gesetz", "Aufgabe", "Überwachung"), true);
 *
 *     Reader reader = new FileReader("de_DR.xml");
 *
 *     HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
 *         .getHyphenationTree(new InputSource(reader));
 *
 *     WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
 *     tokenizer.setReader(new StringReader(
 *         "Rindfleischüberwachungsgesetz Drahtschere abba"));
 *
 *     HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
 *         tokenizer, hyphenator, dict,
 *         CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
 *         CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
 *         CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false);
 *
 *     CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
 *     tf.reset(); // must be called before the first incrementToken()
 *     while (tf.incrementToken()) {
 *        System.out.println(t);
 *     }
 *     tf.end();
 *     tf.close();
 *   }
 *
 *   public void testHyphenationCompoundWordsWithoutDictionaryDE() throws Exception {
 *     Reader reader = new FileReader("de_DR.xml");
 *
 *     HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
 *         .getHyphenationTree(new InputSource(reader));
 *
 *     WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
 *     tokenizer.setReader(new StringReader(
 *         "Rindfleischüberwachungsgesetz Drahtschere abba"));
 *
 *     // No dictionary: every part between hyphenation points is emitted
 *     HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
 *         tokenizer, hyphenator);
 *
 *     CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
 *     tf.reset(); // must be called before the first incrementToken()
 *     while (tf.incrementToken()) {
 *        System.out.println(t);
 *     }
 *     tf.end();
 *     tf.close();
 *   }
 *
 *   public void testDumbCompoundWordsSE() throws Exception {
 *     // ignoreCase = true, so "Motor" also matches the lowercase word part "motor"
 *     CharArraySet dict = new CharArraySet(Arrays.asList("Bil", "Dörr", "Motor", "Tak", "Borr",
 *         "Slag", "Hammar", "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll",
 *         "Sko", "Vind", "Rute", "Torkare", "Blad"), true);
 *
 *     WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
 *     tokenizer.setReader(
 *         new StringReader(
 *             "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba"));
 *
 *     DictionaryCompoundWordTokenFilter tf = new DictionaryCompoundWordTokenFilter(
 *         tokenizer, dict);
 *     CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
 *     tf.reset(); // must be called before the first incrementToken()
 *     while (tf.incrementToken()) {
 *        System.out.println(t);
 *     }
 *     tf.end();
 *     tf.close();
 *   }
 */
package org.apache.lucene.analysis.compound;



