Apache Lucene API
Apache Lucene is a high-performance, full-featured text search engine library.
Here is a simple example of how to use Lucene for indexing and searching (using JUnit
to check that the results are what we expect):
Analyzer analyzer = new StandardAnalyzer();
Path indexPath = Files.createTempDirectory("tempIndex");
Directory directory = FSDirectory.open(indexPath);
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
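// Add a single document with one stored, analyzed text field: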
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
IOUtils.rm(indexPath);
The Lucene API is divided into several packages:
- {@link org.apache.lucene.analysis}
defines an abstract {@link org.apache.lucene.analysis.Analyzer Analyzer}
API for converting text from a {@link java.io.Reader}
into a {@link org.apache.lucene.analysis.TokenStream TokenStream},
an enumeration of token {@link org.apache.lucene.util.Attribute Attribute}s.
A TokenStream can be composed by applying {@link org.apache.lucene.analysis.TokenFilter TokenFilter}s
to the output of a {@link org.apache.lucene.analysis.Tokenizer Tokenizer}.
Tokenizers and TokenFilters are strung together and applied with an {@link org.apache.lucene.analysis.Analyzer Analyzer}.
analyzers-common provides a number of Analyzer implementations, including
StopAnalyzer
and the grammar-based StandardAnalyzer. (A short sketch of composing and consuming a TokenStream follows this package list.)
- {@link org.apache.lucene.codecs}
provides an abstraction over the encoding and decoding of the inverted index structure,
as well as different implementations that can be chosen depending upon application needs.
- {@link org.apache.lucene.document}
provides a simple {@link org.apache.lucene.document.Document Document}
class. A Document is simply a set of named {@link org.apache.lucene.document.Field Field}s,
whose values may be strings or instances of {@link java.io.Reader}.
- {@link org.apache.lucene.index}
provides two primary classes: {@link org.apache.lucene.index.IndexWriter IndexWriter},
which creates and adds documents to indices; and {@link org.apache.lucene.index.IndexReader IndexReader},
which accesses the data in the index.
- {@link org.apache.lucene.search}
provides data structures to represent queries (e.g., {@link org.apache.lucene.search.TermQuery TermQuery}
for individual words, {@link org.apache.lucene.search.PhraseQuery PhraseQuery}
for phrases, and {@link org.apache.lucene.search.BooleanQuery BooleanQuery}
for boolean combinations of queries) and the {@link org.apache.lucene.search.IndexSearcher IndexSearcher},
which turns queries into {@link org.apache.lucene.search.TopDocs TopDocs}.
A number of QueryParsers are provided for producing
query structures from strings or XML. (A sketch of building these query objects directly also follows this package list.)
- {@link org.apache.lucene.store}
defines an abstract class for storing persistent data, the {@link org.apache.lucene.store.Directory Directory},
which is a collection of named files written by an {@link org.apache.lucene.store.IndexOutput IndexOutput}
and read by an {@link org.apache.lucene.store.IndexInput IndexInput}.
Multiple implementations are provided, but {@link org.apache.lucene.store.FSDirectory FSDirectory} is generally
recommended as it tries to use operating system disk buffer caches efficiently.
- {@link org.apache.lucene.util}
contains a few handy data structures and utility classes, e.g., {@link org.apache.lucene.util.FixedBitSet FixedBitSet}
and {@link org.apache.lucene.util.PriorityQueue PriorityQueue}.
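To make the analysis chain concrete, here is a minimal sketch (in the same snippet style as the example above, not one of the shipped demos) of an Analyzer built by hand from a {@link org.apache.lucene.analysis.standard.StandardTokenizer StandardTokenizer} and a {@link org.apache.lucene.analysis.LowerCaseFilter LowerCaseFilter}, and of walking the resulting TokenStream through its {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute CharTermAttribute}:
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // The Tokenizer produces the raw token stream ...
    Tokenizer source = new StandardTokenizer();
    // ... and TokenFilters rewrite it; here each token is lowercased.
    TokenStream result = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, result);
  }
};
// Each call to incrementToken() advances to the next token, whose text
// is exposed through the CharTermAttribute.
try (TokenStream stream = analyzer.tokenStream("fieldname", "This is the text to be indexed.")) {
  CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
  stream.reset();
  while (stream.incrementToken()) {
    System.out.println(term.toString());   // this, is, the, text, ...
  }
  stream.end();
}
analyzer.close();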
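The query classes can also be instantiated directly rather than through a QueryParser. A minimal sketch, reusing the "fieldname" field from the example above ({@link org.apache.lucene.index.Term Term} lives in the index package, the query classes in the search package):
// A single-word query:
Query word = new TermQuery(new Term("fieldname", "text"));
// A phrase query over consecutive terms in the same field:
Query phrase = new PhraseQuery("fieldname", "to", "be", "indexed");
// A boolean combination in which both clauses are required,
// the programmatic equivalent of "+" in the query syntax:
Query both = new BooleanQuery.Builder()
    .add(word, BooleanClause.Occur.MUST)
    .add(phrase, BooleanClause.Occur.MUST)
    .build();
Any of these Query objects can be passed to {@link org.apache.lucene.search.IndexSearcher#search(org.apache.lucene.search.Query, int) search()} exactly like the parsed query in the example above.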
To use Lucene, an application should:
- Create {@link org.apache.lucene.document.Document Document}s by adding {@link org.apache.lucene.document.Field Field}s;
- Create an {@link org.apache.lucene.index.IndexWriter IndexWriter} and add documents to it with {@link org.apache.lucene.index.IndexWriter#addDocument(Iterable) addDocument()};
- Call QueryParser.parse() to build a query from a string; and
- Create an {@link org.apache.lucene.search.IndexSearcher IndexSearcher} and pass the query to its {@link org.apache.lucene.search.IndexSearcher#search(org.apache.lucene.search.Query, int) search()} method.
Some simple examples of code which does this are:
- IndexFiles.java creates an index for all the files contained in a directory.
- SearchFiles.java prompts for queries and searches an index.
To demonstrate these, try something like:
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
[ ... ]
> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
[ ... thirty-four documents contain the word "chowder" ... ]
Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
[ ... two documents contain the phrase "clam chowder"
and the word "manhattan" ... ]
[ Note: "+" and "-" are canonical, but "AND", "OR"
and "NOT" may be used. ]