/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* A filter that decomposes compound words you find in many Germanic
* languages into the word parts. This example shows what it does:
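*
* <table>
* <tr><th>Input token stream</th></tr>
* <tr><td>Rindfleischüberwachungsgesetz Drahtschere abba</td></tr>
* </table>
*
* <table>
* <tr><th>Output token stream</th></tr>
* <tr><td>(Rindfleischüberwachungsgesetz,0,29)</td></tr>
* <tr><td>(Rind,0,4,posIncr=0)</td></tr>
* <tr><td>(fleisch,4,11,posIncr=0)</td></tr>
* <tr><td>(überwachung,11,22,posIncr=0)</td></tr>
* <tr><td>(gesetz,23,29,posIncr=0)</td></tr>
* <tr><td>(Drahtschere,30,41)</td></tr>
* <tr><td>(Draht,30,35,posIncr=0)</td></tr>
* <tr><td>(schere,35,41,posIncr=0)</td></tr>
* <tr><td>(abba,42,46)</td></tr>
* </table>
*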
* The input token is always preserved and the filters do not alter the case of word parts. There are two variants of the
* filter available:
*
* <ul>
* <li>HyphenationCompoundWordTokenFilter: it uses a
* hyphenation grammar based approach to find potential word parts of a
* given word.</li>
* <li>DictionaryCompoundWordTokenFilter: it uses a
* brute-force dictionary-only based approach to find the word parts of a given
* word.</li>
* </ul>
*
*
* <h2>Compound word token filters</h2>
*
* <h3>HyphenationCompoundWordTokenFilter</h3>
* The {@link
* org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
* HyphenationCompoundWordTokenFilter} uses hyphenation grammars to find
* potential subwords that are worth checking against the dictionary. It can
* also be used without a dictionary, but it then produces many "nonword" tokens.
* The quality of the output tokens is directly connected to the quality of the
* grammar file you use. For languages like German, the available grammar files
* are quite good.
*
* <h3>Grammar file</h3>
* Unfortunately, we cannot bundle the hyphenation grammar files with Lucene
* because they do not use an ASF-compatible license (they use the LaTeX
* Project Public License instead). You can find the XML-based grammar
* files at the Objects For Formatting Objects (OFFO) SourceForge project
* (direct link to download the pattern files:
* http://downloads.sourceforge.net/offo/offo-hyphenation.zip). The files you
* need are in the subfolder offo-hyphenation/hyph/.
*
* Credits for the hyphenation code go to the Apache FOP project.
*
*
* <h3>DictionaryCompoundWordTokenFilter</h3>
* The {@link
* org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
* DictionaryCompoundWordTokenFilter} uses a dictionary-only approach to
* find subwords in a compound word. It is much slower than the filter that
* uses hyphenation grammars. Because it is much simpler in design, it is a
* good first step for checking whether your dictionary is adequate.
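*
* The brute-force idea can be sketched in plain Java without any Lucene
* classes. This is only an illustration, not the filter's actual
* implementation: the class name BruteForceDecompounder, the method
* decompose, the subword size limits 2 and 15, and the lowercase dictionary
* lookup are all assumptions made for this sketch. It tests every substring
* within the size limits against the dictionary and reports matches in the
* same (text,start,end,posIncr=0) notation as the example output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; this class is hypothetical and not part of Lucene.
class BruteForceDecompounder {
    // Assumed subword size limits, chosen for this example.
    static final int MIN_SUBWORD = 2;
    static final int MAX_SUBWORD = 15;

    // Returns the original token followed by every dictionary subword found in it.
    // startOffset is the position of the word in the input text.
    // The dictionary is expected to contain lowercase entries.
    static List<String> decompose(String word, int startOffset, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        // The input token is always preserved.
        tokens.add("(" + word + "," + startOffset + "," + (startOffset + word.length()) + ")");
        // Try every start position and every allowed subword length.
        for (int i = 0; i + MIN_SUBWORD <= word.length(); i++) {
            for (int len = MIN_SUBWORD; len <= MAX_SUBWORD && i + len <= word.length(); len++) {
                String part = word.substring(i, i + len);
                // Lookup ignores case, but the emitted part keeps its original case.
                if (dictionary.contains(part.toLowerCase(Locale.ROOT))) {
                    tokens.add("(" + part + "," + (startOffset + i) + ","
                            + (startOffset + i + len) + ",posIncr=0)");
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("draht", "schere");
        for (String token : decompose("Drahtschere", 30, dictionary)) {
            System.out.println(token);
        }
    }
}
```

* With the word Drahtschere at offset 30 and a dictionary containing draht
* and schere, the sketch yields the original token plus the two subword
* tokens listed for Drahtschere in the example output. The nested loops show
* why this approach is slow: every position is tested against every allowed
* subword length.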
*
*
* <h3>Dictionary</h3>
* The output quality of both token filters is directly connected to the
* quality of the dictionary you use. Dictionaries are, of course, language
* dependent. Always use a dictionary that fits the text you want to index:
* if you index medical text, for example, use a dictionary that contains
* medical words. A good starting point for general text is the set of
* dictionaries on the OpenOffice dictionaries Wiki.
*
*
* <h3>Which variant should I use?</h3>
* This decision matrix should help you:
*
*
*
Token filter
*
Output quality
*
Performance
*
*
*
HyphenationCompoundWordTokenFilter
*
good if grammar file is good – acceptable otherwise