org.apache.lucene.analysis.standard package.html
Fast, general-purpose grammar-based tokenizers.
The org.apache.lucene.analysis.standard package contains three
fast grammar-based tokenizers constructed with JFlex:
- {@link org.apache.lucene.analysis.standard.StandardTokenizer}:
as of Lucene 3.1, implements the Word Break rules from the Unicode Text
Segmentation algorithm, as specified in
Unicode Standard Annex #29.
Unlike UAX29URLEmailTokenizer, this tokenizer does not keep URLs and
email addresses as single tokens; it splits them into tokens
according to the UAX#29 word break rules.
{@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} includes
{@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer},
{@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
{@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
When the Version specified in the constructor is lower than 3.1,
the {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}
implementation is invoked.
- {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}:
this class was formerly (prior to Lucene 3.1) named StandardTokenizer.
(Its tokenization rules are not based on the Unicode Text Segmentation
algorithm.)
{@link org.apache.lucene.analysis.standard.ClassicAnalyzer ClassicAnalyzer} includes
{@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer},
{@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
{@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
- {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer}:
implements the Word Break rules from the Unicode Text Segmentation
algorithm, as specified in
Unicode Standard Annex #29.
In addition, URLs and email addresses are recognized according to the
relevant RFCs and emitted as single tokens.
{@link org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer UAX29URLEmailAnalyzer} includes
{@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer},
{@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
{@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
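The shape of the analyzer chains listed above (tokenize, then lowercase, then drop stop words) can be sketched without Lucene on the classpath. The example below is an illustration under stated assumptions, not Lucene's implementation: it substitutes the JDK's java.text.BreakIterator for the JFlex-generated grammar, the class name AnalyzerChainSketch and the STOP_WORDS set are hypothetical, and the StandardFilter stage is omitted. It also suggests why a tokenizer driven only by word break rules does not return an email address as one token.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerChainSketch {

    // Hypothetical stop-word list for illustration; not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "to", "and"));

    // Stand-in tokenizer: JDK word boundaries (an assumption, not the JFlex grammar).
    // Segments made only of whitespace or punctuation are discarded.
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Chain: tokenizer -> lowercase stage -> stop-word stage.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The '@' sign is a word boundary here, so the address is split into parts.
        System.out.println(analyze("Send the Report to user@example.com"));
    }
}
```

The exact segments produced for the address depend on the JDK's boundary rules, which may differ in detail from Lucene's grammar; a tokenizer in the style of UAX29URLEmailTokenizer would need extra rules that match whole URLs and addresses before word breaking is applied.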