
org.apache.lucene.document.package-info Maven / Gradle / Ivy
Show all versions of org.apache.servicemix.bundles.lucene
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* The logical representation of a {@link org.apache.lucene.document.Document} for indexing and
* searching.
*
* The document package provides the user level logical representation of content to be indexed
* and searched. The package also provides utilities for working with {@link
* org.apache.lucene.document.Document}s and {@link org.apache.lucene.index.IndexableField}s.
*
*
Document and IndexableField
*
* A {@link org.apache.lucene.document.Document} is a collection of {@link
* org.apache.lucene.index.IndexableField}s. A {@link org.apache.lucene.index.IndexableField} is a
* logical representation of a user's content that needs to be indexed or stored. {@link
* org.apache.lucene.index.IndexableField}s have a number of properties that tell Lucene how to
* treat the content (like indexed, tokenized, stored, etc.) See the {@link
* org.apache.lucene.document.Field} implementation of {@link
* org.apache.lucene.index.IndexableField} for specifics on these properties.
*
*
Note: it is common to refer to {@link org.apache.lucene.document.Document}s having {@link
* org.apache.lucene.document.Field}s, even though technically they have {@link
* org.apache.lucene.index.IndexableField}s.
*
*
Working with Documents
*
* First and foremost, a {@link org.apache.lucene.document.Document} is something created by the
* user application. It is your job to create Documents based on the content of the files you are
* working with in your application (Word, txt, PDF, Excel or any other format.) How this is done is
* completely up to you. That being said, there are many tools available in other projects that can
* make the process of taking a file and converting it into a Lucene {@link
* org.apache.lucene.document.Document}.
*
*
How to index ...
*
* Strings
*
* {@link org.apache.lucene.document.TextField} allows indexing tokens from a String so that one
* can perform full-text search on it. The way that the input is tokenized depends on the {@link
* org.apache.lucene.analysis.Analyzer} that is configured on the {@link
* org.apache.lucene.index.IndexWriterConfig}. TextField can also be optionally stored.
*
*
{@link org.apache.lucene.document.KeywordField} indexes whole values as a single term so that
* one can perform exact search on it. It also records doc values to enable sorting or faceting on
* this field. Finally, it also supports optionally storing the value.
*
*
If faceting or sorting are not required, {@link org.apache.lucene.document.StringField} is a
* variant of {@link org.apache.lucene.document.KeywordField} that does not index doc values.
*
*
Numbers
*
* If a numeric field represents an identifier rather than a quantity and is more commonly
* searched on single values than on ranges of values, it is generally recommended to index its
* string representation via {@link org.apache.lucene.document.KeywordField} (or {@link
* org.apache.lucene.document.StringField} if doc values are not necessary).
*
*
{@link org.apache.lucene.document.LongField}, {@link org.apache.lucene.document.IntField},
* {@link org.apache.lucene.document.DoubleField} and {@link org.apache.lucene.document.FloatField}
* index values in a points index for efficient range queries, and also create doc-values for these
* fields for efficient sorting and faceting.
*
*
If the field is aimed at being used to tune the score, {@link
* org.apache.lucene.document.FeatureField} helps internally store numeric data as term frequencies
* in a way that makes it efficient to influence scoring at search time.
*
*
Other types of structured data
*
* It is recommended to index dates as a {@link org.apache.lucene.document.LongField} that stores
* the number of milliseconds since Epoch.
*
*
IP fields can be indexed via {@link org.apache.lucene.document.InetAddressPoint} in addition
* to a {@link org.apache.lucene.document.SortedDocValuesField} (if the field is single-valued) or
* {@link org.apache.lucene.document.SortedSetDocValuesField} that stores the result of {@link
* org.apache.lucene.document.InetAddressPoint#encode}.
*
*
Dense numeric vectors
*
* Dense numeric vectors can be indexed with {@link
* org.apache.lucene.document.KnnFloatVectorField} if its dimensions are floating-point numbers or
* {@link org.apache.lucene.document.KnnByteVectorField} if its dimensions are bytes. This allows
* searching for nearest neighbors at search time.
*
*
Sparse numeric vectors
*
* To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension
* of the sparse vector should be indexed as a {@link org.apache.lucene.document.FeatureField}.
* Queries can then be constructed as a {@link org.apache.lucene.search.BooleanQuery} with {@link
* org.apache.lucene.document.FeatureField#newLinearQuery(String, String, float) linear queries} as
* {@link org.apache.lucene.search.BooleanClause.Occur#SHOULD} clauses.
*/
package org.apache.lucene.document;