
org.apache.lucene.analysis.standard (package.html)

Fast, general-purpose grammar-based tokenizers.

The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:

  • {@link org.apache.lucene.analysis.standard.StandardTokenizer}: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, StandardTokenizer does not keep URLs and email addresses as single tokens; it splits them into tokens according to the UAX#29 word break rules (see the comparison sketch after this list).
    {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer} includes {@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer}, {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter}, {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter} and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}; a usage sketch follows this list. When the Version specified in the constructor is lower than 3.1, the {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer} implementation is invoked.
  • {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) {@link org.apache.lucene.analysis.standard.ClassicAnalyzer ClassicAnalyzer} includes {@link org.apache.lucene.analysis.standard.ClassicTokenizer ClassicTokenizer}, {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter}, {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter} and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
  • {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer}: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. In addition, URLs and email addresses are recognized according to the relevant RFCs and kept as single tokens.
    {@link org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer UAX29URLEmailAnalyzer} includes {@link org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer}, {@link org.apache.lucene.analysis.standard.StandardFilter StandardFilter}, {@link org.apache.lucene.analysis.core.LowerCaseFilter LowerCaseFilter} and {@link org.apache.lucene.analysis.core.StopFilter StopFilter}.
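
As a quick illustration of how these pieces fit together, here is a minimal sketch that runs StandardAnalyzer over a short string and prints the resulting terms. It assumes a reasonably recent Lucene release (5.0 or later), where the analyzer no longer takes a Version argument; on 3.x/4.x you would pass one to the constructor. The class name, field name "body" and the sample text are illustrative only.

<pre>
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // StandardAnalyzer chains the components listed above:
    // StandardTokenizer plus lower-casing and stop-word filtering.
    Analyzer analyzer = new StandardAnalyzer();
    String text = "The Quick Brown Fox jumped over the lazy dog's bones.";
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                       // required before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();                         // record end-of-stream state (e.g. final offset)
    }
    analyzer.close();
  }
}
</pre>

The reset()/incrementToken()/end() sequence is the standard TokenStream consumption workflow; the try-with-resources block takes care of close().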
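
To make the URL and email contrast concrete, the sketch below feeds the same string through StandardTokenizer and UAX29URLEmailTokenizer and prints the tokens each one produces. It again assumes Lucene 5.0 or later, where a tokenizer is created with a no-argument constructor and receives its input via setReader(); the class name and sample text are illustrative only.

<pre>
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class UrlEmailTokenizationDemo {

  // Print every token the given tokenizer produces for the given text.
  static void dump(Tokenizer tokenizer, String text) throws IOException {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    tokenizer.end();
    tokenizer.close();
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    String text = "Mail dev@lucene.apache.org or see https://lucene.apache.org/core";

    // StandardTokenizer splits the address and URL at UAX#29 word-break boundaries.
    dump(new StandardTokenizer(), text);

    // UAX29URLEmailTokenizer keeps the address and URL as single tokens.
    dump(new UAX29URLEmailTokenizer(), text);
  }
}
</pre>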



