org.apache.lucene.analysis.standard.package.html

Fast, general-purpose grammar-based tokenizers.

The org.apache.lucene.analysis.standard package contains three
    fast grammar-based tokenizers constructed with JFlex:

    {@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer}:
        as of Lucene 3.1, implements the Word Break rules from the Unicode Text 
        Segmentation algorithm, as specified in 
        Unicode Standard Annex #29.
        Unlike UAX29URLEmailTokenizer, this tokenizer does not emit URLs and
        email addresses as single tokens; it splits them into tokens
        according to the UAX#29 word break rules.
        

        {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} includes
        {@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer},
        {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter}, 
        {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
        and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
        When the Version specified in the constructor is earlier than 
        3.1, the {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}
        implementation is used instead.
    {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}:
        prior to Lucene 3.1, this class was named StandardTokenizer.
        Its tokenization rules are not based on the Unicode Text
        Segmentation algorithm.
        {@link org.apache.lucene.analysis.standard.ClassicAnalyzer ClassicAnalyzer} includes
        {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer},
        {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter}, 
        {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
        and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
    
    {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer}:
        implements the Word Break rules from the Unicode Text Segmentation
        algorithm, as specified in 
        Unicode Standard Annex #29.
        In addition, URLs and email addresses are recognized according to
        the relevant RFCs and emitted as single tokens.
        

        {@link org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer UAX29URLEmailAnalyzer} includes
        {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer},
        {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter},
        {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter}
        and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
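
The analyzer chains described above (tokenizer, then lower-casing, then stop-word removal) can be illustrated with a small plain-Java sketch. It does not use the Lucene API: the class AnalyzerChainSketch, the method analyze, the two regular expressions and the stop-word list are all hypothetical, and the patterns are only a rough approximation of the JFlex grammars, UAX#29 and the relevant RFCs. The keepEmailsWhole flag loosely mimics the difference between StandardTokenizer and UAX29URLEmailTokenizer for email addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only; not Lucene code and not the Lucene API.
class AnalyzerChainSketch {

    // Assumption: a "word" is a run of letters/digits, optionally joined by
    // single '.' or '\'' characters. This is a crude stand-in for the
    // UAX#29 word break rules.
    private static final String WORD_RE =
            "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";

    // Assumption: a very loose email shape, far simpler than the RFCs.
    private static final String EMAIL_RE =
            "[\\p{L}\\p{N}._-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+";

    private static final Pattern WORD = Pattern.compile(WORD_RE);
    private static final Pattern EMAIL_OR_WORD =
            Pattern.compile(EMAIL_RE + "|" + WORD_RE);

    // Hypothetical stop-word list, not Lucene's default set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "at"));

    static List<String> analyze(String text, boolean keepEmailsWhole) {
        // Step 1: tokenize with the chosen grammar.
        Pattern grammar = keepEmailsWhole ? EMAIL_OR_WORD : WORD;
        Matcher m = grammar.matcher(text);
        List<String> terms = new ArrayList<>();
        while (m.find()) {
            // Step 2: lower-case each token.
            String term = m.group().toLowerCase(Locale.ROOT);
            // Step 3: drop stop words.
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "Mail the report to Bob@Example.com";
        // prints [mail, report, bob, example.com]
        System.out.println(analyze(text, false));
        // prints [mail, report, bob@example.com]
        System.out.println(analyze(text, true));
    }
}
```

With keepEmailsWhole set to false the address is split at the '@', while with true it survives as one term; in both cases the later lower-casing and stop-word steps are identical, which mirrors how the analyzers in this package differ only in their tokenizer.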