/*******************************************************************************
 * Copyright (c) 2015 Eclipse RDF4J contributors, Aduna, and others.
 *
 * All rights reserved. This program and the accompanying materials
 * are made available under the terms of the Eclipse Distribution License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/org/documents/edl-v10.php.
 *
 * SPDX-License-Identifier: BSD-3-Clause
 *******************************************************************************/
package org.eclipse.rdf4j.sail.lucene;

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.commons.lang3.math.NumberUtils;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Resource;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.Value;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.query.algebra.evaluation.federation.FederatedServiceResolver;
import org.eclipse.rdf4j.query.algebra.evaluation.function.TupleFunctionRegistry;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.federation.SPARQLServiceResolver;
import org.eclipse.rdf4j.sail.NotifyingSailConnection;
import org.eclipse.rdf4j.sail.SailException;
import org.eclipse.rdf4j.sail.evaluation.TupleFunctionEvaluationMode;
import org.eclipse.rdf4j.sail.helpers.NotifyingSailWrapper;
import org.eclipse.rdf4j.sail.lucene.util.SearchIndexUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * A LuceneSail wraps an arbitrary existing Sail and extends it with support for full-text search on all Literals.
 * 

Setting up a LuceneSail

LuceneSail works in two modes: storing its data in a directory on the hard disk, or in a RAMDirectory in RAM (which is discarded when the program ends). Example with storage in a folder:
 * // create a sesame memory sail
 * MemoryStore memoryStore = new MemoryStore();
 *
 * // create a lucenesail to wrap the memorystore
 * LuceneSail lucenesail = new LuceneSail();
 * // set this parameter to store the lucene index on disk
 * lucenesail.setParameter(LuceneSail.LUCENE_DIR_KEY, "./data/mydirectory");
 *
 * // wrap memorystore in a lucenesail
 * lucenesail.setBaseSail(memoryStore);
 *
 * // create a Repository to access the sails
 * SailRepository repository = new SailRepository(lucenesail);
 * repository.init();
 * 
Example with storage in a RAM directory:

 * // create a sesame memory sail
 * MemoryStore memoryStore = new MemoryStore();
 *
 * // create a lucenesail to wrap the memorystore
 * LuceneSail lucenesail = new LuceneSail();
 * // set this parameter to let the lucene index store its data in ram
 * lucenesail.setParameter(LuceneSail.LUCENE_RAMDIR_KEY, "true");
 *
 * // wrap memorystore in a lucenesail
 * lucenesail.setBaseSail(memoryStore);
 *
 * // create a Repository to access the sails
 * SailRepository repository = new SailRepository(lucenesail);
 * 

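How a "lucenedir" value ends up as an absolute path can be sketched in isolation. The helper below mirrors the path handling that this class performs in setDataDir; the class and method names are hypothetical, and it is a minimal sketch rather than the actual implementation.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-alone helper mirroring the path handling in LuceneSail#setDataDir.
class LuceneDirSketch {

    // Resolves the configured "lucenedir" value (or ".index" when it is unset)
    // against the absolute data directory.
    static String resolveLuceneDir(File dataDir, String configuredLuceneDir) {
        String dirName = (configuredLuceneDir != null) ? configuredLuceneDir : ".index";
        Path luceneDir = Paths.get(dirName, "");
        // Path#resolve returns its argument unchanged when that argument is already absolute.
        return dataDir.getAbsoluteFile().toPath().resolve(luceneDir).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolveLuceneDir(new File("repo"), null));
        System.out.println(resolveLuceneDir(new File("repo"), "./data/mydirectory"));
    }
}
```

A relative value is therefore placed under the data directory, while an absolute value is kept as given.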
Asking full-text queries

Text queries are expressed using the virtual properties of the LuceneSail.

In SPARQL:

{@code
 * SELECT ?subject ?score ?snippet ?resource
 * WHERE {
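 *   ?subject <http://www.openrdf.org/contrib/lucenesail#matches> [
 *      a <http://www.openrdf.org/contrib/lucenesail#LuceneQuery> ;
 *      <http://www.openrdf.org/contrib/lucenesail#query> "my Lucene query" ;
 *      <http://www.openrdf.org/contrib/lucenesail#score> ?score ;
 *      <http://www.openrdf.org/contrib/lucenesail#snippet> ?snippet ;
 *      <http://www.openrdf.org/contrib/lucenesail#resource> ?resource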
 *   ]
 * }
 * }
When defining queries, the properties type and query are mandatory, as is the matches relation. If any of these is missing, the query will not be executed as expected. The failure behavior is configurable: setting the Sail property "incompletequeryfail" to true throws a SailException when such patterns are found. This is the default, to help find inaccurate queries. Set it to false to have warnings logged instead. Multiple queries can be issued to the sail; their results will be integrated. Note that you cannot use the same variable for multiple text queries. If you want to combine text searches, use Lucene's query syntax.

Fields are stored/indexed

All fields are stored and indexed. The "text" fields (gathering all literals) have to be stored because, when a new literal is added to a document, the previous texts need to be copied from the existing document to the new document. This does not work when they are only "indexed". Fields that are not stored cannot be retrieved using full-text querying.

Deleting a Lucene index

At the moment, deleting the Lucene index can be done in two ways:
  • Delete the folder where the data is stored while the application is not running.
  • Call the repository's {@link org.eclipse.rdf4j.repository.RepositoryConnection#clear(org.eclipse.rdf4j.model.Resource[])} method with no arguments, i.e. clear(). This will delete the index.

Handling of Contexts

Each Lucene document contains a field for every context ID that contributed to the document. NULL contexts are marked using the String {@link org.eclipse.rdf4j.sail.lucene.SearchFields#CONTEXT_NULL} ("null") and stored in the Lucene field {@link org.eclipse.rdf4j.sail.lucene.SearchFields#CONTEXT_FIELD_NAME} ("context"). This means that when adding or appending to a document, all additional context URIs are added to the document. When deleting individual triples, the context is ignored.

In clear(Resource ...) we query all Lucene documents that were possibly created by the given context(s). Given a document D to which contexts C(1-n) contributed, let D' be the new document after clear():
  • If there is only one C, then D can be safely removed and there is no D'. (I hope this is the standard case, as in ontologies, where all triples about a resource are in one document.)
  • If there are multiple C, remember the URI of D, delete D, and query (s, p, o, ?) from the underlying store after committing the operation. This returns the literals of D'; add D' as a new document.

This will probably be both fast in the common case and capable enough in the multiple-C case.

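The clear() strategy described in this section can be illustrated with a toy in-memory model. Everything in the sketch below is hypothetical (the class, its fields, and the use of plain strings for subjects, texts and contexts); it does not use the LuceneIndex API and only shows the single-context versus multiple-context branching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: one "document" per subject, holding its literal texts and the
// contexts that contributed to it. Not the real LuceneIndex.
class ContextClearSketch {

    static final class Quad {
        final String subject;
        final String text;
        final String context;

        Quad(String subject, String text, String context) {
            this.subject = subject;
            this.text = text;
            this.context = context;
        }
    }

    static final class Doc {
        final Set<String> contexts = new HashSet<>();
        final List<String> texts = new ArrayList<>();
    }

    final List<Quad> store = new ArrayList<>();   // stands in for the wrapped Sail
    final Map<String, Doc> index = new HashMap<>(); // stands in for the Lucene index

    void add(String subject, String text, String context) {
        store.add(new Quad(subject, text, context));
        Doc doc = index.computeIfAbsent(subject, s -> new Doc());
        doc.contexts.add(context);
        doc.texts.add(text);
    }

    void clear(String context) {
        // The underlying store drops the statements of the cleared context first.
        store.removeIf(q -> q.context.equals(context));

        List<String> toRebuild = new ArrayList<>();
        Iterator<Map.Entry<String, Doc>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Doc> entry = it.next();
            Doc doc = entry.getValue();
            if (!doc.contexts.contains(context)) {
                continue; // document was not touched by this context
            }
            if (doc.contexts.size() > 1) {
                toRebuild.add(entry.getKey()); // multiple-C case: remember the subject of D
            }
            it.remove(); // D is deleted in both cases
        }

        // Multiple-C case: re-read the remaining statements and add D' as a new document.
        for (String subject : toRebuild) {
            for (Quad q : store) {
                if (q.subject.equals(subject)) {
                    Doc rebuilt = index.computeIfAbsent(subject, s -> new Doc());
                    rebuilt.contexts.add(q.context);
                    rebuilt.texts.add(q.text);
                }
            }
        }
    }
}
```

In this model a document fed by a single context simply disappears, while a document fed by several contexts is rebuilt from whatever the store still holds for its subject.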
Defining the indexed Fields

The property {@link #INDEXEDFIELDS} configures which fields to index and lets you project one property onto another. Syntax:
 * # only index label and comment
 * index.1=http://www.w3.org/2000/01/rdf-schema#label
 * index.2=http://www.w3.org/2000/01/rdf-schema#comment
 * # project http://xmlns.com/foaf/0.1/name to rdfs:label
 * http\://xmlns.com/foaf/0.1/name=http\://www.w3.org/2000/01/rdf-schema#label
 * 

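The indexedfields value is read with java.util.Properties, so the usual escaping rules apply (hence the escaped colons in the projection key). The sketch below follows the same split that init() in this class performs, with keys starting with "index." naming indexed fields and every other key being a projection. The class name is hypothetical and plain strings stand in for IRIs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, not part of LuceneSail: parses an "indexedfields" value
// into the set of indexed predicates and the predicate projections.
class IndexedFieldsSketch {

    final Set<String> indexedFields = new HashSet<>();
    final Map<String, String> projections = new HashMap<>();

    static IndexedFieldsSketch parse(String indexedFieldsValue) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(indexedFieldsValue)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        IndexedFieldsSketch result = new IndexedFieldsSketch();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("index.")) {
                // index.N=<predicate>: only the value matters
                result.indexedFields.add(props.getProperty(key));
            } else {
                // <predicate>=<predicate>: project the key onto the value
                result.projections.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String config = "# only index label and comment\n"
                + "index.1=http://www.w3.org/2000/01/rdf-schema#label\n"
                + "index.2=http://www.w3.org/2000/01/rdf-schema#comment\n"
                + "http\\://xmlns.com/foaf/0.1/name=http\\://www.w3.org/2000/01/rdf-schema#label\n";
        IndexedFieldsSketch parsed = parse(config);
        System.out.println(parsed.indexedFields);
        System.out.println(parsed.projections);
    }
}
```

Without the backslash, the first colon in a key would be taken as the key/value separator, which is why the projection line escapes it.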
Set and select Lucene sail by id

The property {@link #INDEX_ID} configures the id of the index and filters out every request that lacks a matching search:indexid predicate. A request would look like:
{@code
 * ?subj search:matches [
 * 	      search:indexid my:lucene_index_id;
 * 	      search:query "search terms...";
 * 	      search:property my:property;
 * 	      search:score ?score;
 * 	      search:snippet ?snippet ] .
 * }
If a LuceneSail is using another LuceneSail as a base sail, the evaluation mode should be set to {@link TupleFunctionEvaluationMode#NATIVE}.

Defining the indexed Types/Languages

The properties {@link #INDEXEDTYPES} and {@link #INDEXEDLANG} configure which fields to index by their type or language. {@link #INDEXEDTYPES} syntax:
 * # only index object of rdf:type ex:mytype1, rdf:type ex:mytype2 or ex:mytypedef ex:mytype3
 * http\://www.w3.org/1999/02/22-rdf-syntax-ns#type=http://example.org/mytype1 http://example.org/mytype2
 * http\://example.org/mytypedef=http://example.org/mytype3
 * 
{@link #INDEXEDLANG} syntax:

 * # syntax to index only French(fr) and English(en) literals
 * fr en
 * 

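The two syntaxes above can be read with standard library calls. The sketch below is an assumption about how such values could be parsed (the parsing code of the index implementation is not shown in this file): indexedtypes is treated as a properties text whose values are space-separated type IRIs, and indexedlang as a whitespace-separated list of language tags. All names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper illustrating the "indexedtypes" and "indexedlang" syntaxes.
class IndexedTypesLangSketch {

    // Splits a whitespace-separated list into an ordered set, ignoring blank input.
    static Set<String> splitOnWhitespace(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Maps each typing predicate to the set of types whose subjects are indexed.
    static Map<String, Set<String>> parseIndexedTypes(String value) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(value)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Set<String>> result = new HashMap<>();
        for (String predicate : props.stringPropertyNames()) {
            result.put(predicate, splitOnWhitespace(props.getProperty(predicate)));
        }
        return result;
    }

    // Returns the language tags to index, e.g. "fr en".
    static Set<String> parseIndexedLang(String value) {
        return splitOnWhitespace(value);
    }
}
```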
Datatypes

Datatypes are ignored in the LuceneSail.
 */
public class LuceneSail extends NotifyingSailWrapper {

	/*
	 * FIXME: Add a proper reference to the ISWC paper in the Javadoc.
	 * Gunnar: only when/if the paper is accepted.
	 * Enrico: paper was rejected.
	 * Leo: We need to resubmit it.
	 *
	 * FIXME: Add settings that instruct a LuceneSailConnection or LuceneIndex which properties are to be handled in
	 * which way. This is conceptually similar to Lucene's Field types: should properties be stored in the wrapped Sail
	 * (enabling retrieval through RDF queries), indexed in the LuceneIndex (enabling full-text search using Lucene
	 * queries embedded in RDF graph queries) or both?
	 * Gunnar and Leo: we had this in the old version, we might add it later.
	 * Enrico: in beagle we set the default setting to index AND store a field, so that when you extend the ontology
	 * you can be sure it is indexed and stored by the lucenesail without touching it. For certain (very rare)
	 * predicates (like the full text of the resource) we then explicitly turned off the store option. That would be a
	 * desired behaviour. In the old version an RDF file was used, but it should be done differently, that is too
	 * hard-coded! Can't that information be stored in the wrapped sail itself? Annotate a predicate with the proper
	 * lucene values (store / index / storeAndIndex), if nothing is given, take the default, and read this on starting
	 * the lucenesail.
	 * Leo: ok, default = index and store, agreed.
	 * Leo: about configuration: RDF config is agreed, if passed as file, inside the wrapped sail, or in an extra sail
	 * should all be possible.
	 */

	/*
	 * FIXME: This code can only handle RDF queries containing a single "Lucene expression" (i.e. a combination of
	 * matches, query and optionally other predicates from the LuceneSail's namespace), the other expressions are
	 * ignored. Extending this to support an arbitrary number of search expressions is theoretically possible but
	 * easier said than done, especially because of the number of different cases that need to be handled: variable
	 * subject vs. specified subject, expressions operating on the same subject vs. expressions operating on different
	 * subjects, etc.
	 * Gunnar: I would restrict this to one. Enrico might have other requirements?
	 * Enrico: we need 1) an arbitrary number of lucene expressions and 2) an arbitrary combination with ordinary
	 * structured queries (see lucenesail paper, fig. 1 on page 6).
	 * Leo: combining lucene query with normal query is required, having multiple lucene queries in one SPARQL query
	 * is a good idea, which should be doable. Lower priority.
	 *
	 * FIXME: We should escape those chars in predicates/field names that have a special meaning in Lucene's query
	 * syntax, using ":" in a field name might lead to problems (it will when you start to query on these fields).
	 * Enrico: yes, we escaped those : successfully with a simple \, the only difficulty was to figure out how many \
	 * are needed (how often they get unescaped until they arrive at Lucene).
	 * Leo noticed this. Gunnar asks: Does lucene not have an escape syntax?
	 *
	 * FIXME: The getScore method is a convenient and efficient way of testing whether a given document matches a
	 * query, as it adds the document URI to the Lucene query instead of firing the query and looping over the result
	 * set. The problem with this method is that I am not sure whether adding the URI to the Lucene query will lead to
	 * a different score for that document. For most applications this is probably not a problem as you either will use
	 * the search method with the scores reported to its listener, or the getScore method, but not both. The order of
	 * matching documents will probably be the same when sorting on score (field is indexed without normalization +
	 * only unique values). Still, it is counterintuitive when a particular document is returned with a given score and
	 * a getScore for that same URI gives a different score.
	 *
	 * FIXME: the code is very much NOT thread-safe, especially when you are changing the index and querying it with
	 * LuceneSailConnection at the same time: the IndexReaders/Searchers are closed after each statement addition or
	 * removal but they must also remain open while we are looping over search results. Also, internal document numbers
	 * are used in the communication between LuceneIndex and LuceneSailConnection, which is not a good idea. Some
	 * mechanism has to be introduced to support external querying while the index is being modified (basically: make
	 * sure that a single search process keeps using the same IndexSearcher).
	 * Gunnar and Leo: we are not sure if the original lucenesail was 100% threadsafe, but at least it had
	 * "synchronized" everywhere :)
	 * http://gnowsis.opendfki.de/repos/gnowsis/trunk/lucenesail/src/java/org/openrdf/sesame/sailimpl/lucenesail/LuceneIndex.java
	 * This might be a big issue in Nepomuk...
	 * Enrico: do we have multiple threads? do we need separate threads?
	 * Leo: we have separate threads, but we don't care much for now.
	 */

	private static final Logger logger = LoggerFactory.getLogger(LuceneSail.class);

	/**
	 * Set the parameter "reindexQuery=" to configure the statements to index over. Default value is "SELECT ?s ?p ?o ?c
	 * WHERE {{?s ?p ?o} UNION {GRAPH ?c {?s ?p ?o.}}} ORDER BY ?s" . NB: the query must contain the bindings ?s, ?p, ?o
	 * and ?c and must be ordered by ?s.
	 */
	public static final String REINDEX_QUERY_KEY = "reindexQuery";

	/**
	 * Set the parameter "indexedfields=..." to configure a selection of fields to index, and projections of properties.
	 * Only the configured fields will be indexed. A property P projected to Q will cause the index to contain Q instead
	 * of P, when triples with P were indexed.
	 * Syntax of indexedfields: see above.
	 */
	public static final String INDEXEDFIELDS = "indexedfields";

	/**
	 * Set the parameter "indexedtypes=..." to configure a selection of field types to index. Only the fields with the
	 * specified type will be indexed. Syntax of indexedtypes: see above.
	 */
	public static final String INDEXEDTYPES = "indexedtypes";

	/**
	 * Set the parameter "indexedlang=..." to configure a selection of field languages to index. Only the fields with
	 * the specified language will be indexed. Syntax of indexedlang: see above.
	 */
	public static final String INDEXEDLANG = "indexedlang";

	/**
	 * See {@link org.eclipse.rdf4j.sail.lucene.TypeBacktraceMode}
	 */
	public static final String INDEX_TYPE_BACKTRACE_MODE = "indexBacktraceMode";

	/**
	 * Set the key "lucenedir=<path>" as sail parameter to configure the Lucene Directory on the filesystem where
	 * to store the lucene index.
	 */
	public static final String LUCENE_DIR_KEY = "lucenedir";

	/**
	 * The default directory of the Lucene index files. The value is always relative to the {@code dataDir} location,
	 * which acts as its parent directory.
	 */
	public static final String DEFAULT_LUCENE_DIR = ".index";

	/**
	 * Set the key "useramdir=true" as sail parameter to let the LuceneSail store its Lucene index in RAM. This is not
	 * intended for production environments.
	 */
	public static final String LUCENE_RAMDIR_KEY = "useramdir";

	/**
	 * Set the key "defaultNumDocs=<n>" as sail parameter to limit the maximum number of documents to return from
	 * a search query. The default is to return all documents. NB: this may involve extra cost for some SearchIndex
	 * implementations as they may have to determine this number.
	 */
	public static final String DEFAULT_NUM_DOCS_KEY = "defaultNumDocs";

	/**
	 * Set the key "maxDocuments=<n>" as sail parameter to limit the maximum number of documents the user can
	 * request at a time from a search query. The default is the value of the {@link #DEFAULT_NUM_DOCS_KEY}
	 * parameter.
	 */
	public static final String MAX_DOCUMENTS_KEY = "maxDocuments";

	/**
	 * Set this key to configure which fields contain WKT and should be spatially indexed. The value should be a
	 * space-separated list of URIs. Default is http://www.opengis.net/ont/geosparql#asWKT.
	 */
	public static final String WKT_FIELDS = "wktFields";

	/**
	 * Set this key to configure the SearchIndex class implementation. Default is {@link #DEFAULT_INDEX_CLASS}.
	 */
	public static final String INDEX_CLASS_KEY = "index";

	/**
	 * Set this key to configure the filtering of queries. If this parameter is set, the match object should contain
	 * the search:indexid parameter, see the syntax above.
	 */
	public static final String INDEX_ID = "indexid";

	public static final String DEFAULT_INDEX_CLASS = "org.eclipse.rdf4j.sail.lucene.impl.LuceneIndex";

	/**
	 * Set this key as sail parameter to configure the Lucene analyzer class implementation to use for text analysis.
	 */
	public static final String ANALYZER_CLASS_KEY = "analyzer";

	/**
	 * Set this key as sail parameter to configure the Lucene analyzer class implementation used for query analysis. In
	 * most cases this should be set to the same value as {@link #ANALYZER_CLASS_KEY}
	 */
	public static final String QUERY_ANALYZER_CLASS_KEY = "queryAnalyzer";

	/**
	 * Set this key as sail parameter to configure {@link org.apache.lucene.search.similarities.Similarity} class
	 * implementation to use for text analysis.
	 */
	public static final String SIMILARITY_CLASS_KEY = "similarity";

	/**
	 * Set this key as sail parameter to influence whether incomplete queries are treated as failure (malformed queries)
	 * or whether they are ignored. Set to either "true" or "false". When omitted in the properties, true is the default
	 * (failure on incomplete queries). See {@link #isIncompleteQueryFails()}
	 */
	public static final String INCOMPLETE_QUERY_FAIL_KEY = "incompletequeryfail";

	/**
	 * See {@link TupleFunctionEvaluationMode}.
	 */
	public static final String EVALUATION_MODE_KEY = "evaluationMode";

	/**
	 * Set this key as sail parameter to influence the fuzzy prefix length.
	 */
	public static final String FUZZY_PREFIX_LENGTH_KEY = "fuzzyPrefixLength";

	/**
	 * The LuceneIndex holding the indexed literals.
	 */
	private volatile SearchIndex luceneIndex;

	protected final Properties parameters = new Properties();

	private volatile String reindexQuery = "SELECT ?s ?p ?o ?c WHERE {{?s ?p ?o} UNION {GRAPH ?c {?s ?p ?o.}}} ORDER BY ?s";

	private volatile boolean incompleteQueryFails = true;

	private volatile TupleFunctionEvaluationMode evaluationMode = TupleFunctionEvaluationMode.TRIPLE_SOURCE;

	private volatile TypeBacktraceMode indexBacktraceMode = TypeBacktraceMode.DEFAULT_TYPE_BACKTRACE_MODE;

	private TupleFunctionRegistry tupleFunctionRegistry = TupleFunctionRegistry.getInstance();

	private FederatedServiceResolver serviceResolver = new SPARQLServiceResolver();

	private Set<IRI> indexedFields;

	private Map<IRI, IRI> indexedFieldsMapping;

	private IRI indexId = null;

	private IndexableStatementFilter filter = null;

	private final AtomicBoolean closed = new AtomicBoolean(false);

	public void setLuceneIndex(SearchIndex luceneIndex) {
		this.luceneIndex = luceneIndex;
	}

	public SearchIndex getLuceneIndex() {
		return luceneIndex;
	}

	@Override
	public NotifyingSailConnection getConnection() throws SailException {
		if (!closed.get()) {
			return new LuceneSailConnection(super.getConnection(), luceneIndex, this);
		} else {
			throw new SailException("Sail is shut down or not initialized");
		}
	}

	@Override
	public void shutDown() throws SailException {
		if (closed.compareAndSet(false, true)) {
			logger.debug("LuceneSail shutdown");
			try {
				SearchIndex toShutDownLuceneIndex = luceneIndex;
				luceneIndex = null;
				if (toShutDownLuceneIndex != null) {
					toShutDownLuceneIndex.shutDown();
				}
			} catch (IOException e) {
				throw new SailException(e);
			} finally {
				// ensure that super is also invoked when the LuceneIndex causes an
				// IOException
				super.shutDown();
			}
		}
	}

	@Override
	public void setDataDir(File dataDir) {
		Path luceneDir = Paths.get(parameters.getProperty(LuceneSail.LUCENE_DIR_KEY, DEFAULT_LUCENE_DIR), "");
		String luceneDirAbsolute = dataDir.getAbsoluteFile().toPath().resolve(luceneDir).toString();
		this.setParameter(LuceneSail.LUCENE_DIR_KEY, luceneDirAbsolute);
		logger.debug("Absolute path to lucene index dir: {}", luceneDirAbsolute);
		this.getBaseSail().setDataDir(dataDir);
	}

	@Override
	public void init() throws SailException {
		super.init();
		if (parameters.containsKey(INDEXEDFIELDS)) {
			String indexedfieldsString = parameters.getProperty(INDEXEDFIELDS);
			Properties prop = new Properties();
			try (Reader reader = new StringReader(indexedfieldsString)) {
				prop.load(reader);
			} catch (IOException e) {
				throw new SailException("Could not read " + INDEXEDFIELDS + ": " + indexedfieldsString, e);
			}
			ValueFactory vf = getValueFactory();
			indexedFields = new HashSet<>();
			indexedFieldsMapping = new HashMap<>();
			for (Object key : prop.keySet()) {
				String keyStr = key.toString();
				if (keyStr.startsWith("index.")) {
					// index.N=<predicate>: restrict indexing to this predicate
					indexedFields.add(vf.createIRI(prop.getProperty(keyStr)));
				} else {
					// <predicate>=<predicate>: project the key onto the value
					indexedFieldsMapping.put(vf.createIRI(keyStr), vf.createIRI(prop.getProperty(keyStr)));
				}
			}
		}

		if (parameters.containsKey(INDEX_ID)) {
			indexId = getValueFactory().createIRI(parameters.getProperty(INDEX_ID));
		}

		try {
			if (parameters.containsKey(REINDEX_QUERY_KEY)) {
				setReindexQuery(parameters.getProperty(REINDEX_QUERY_KEY));
			}
			if (parameters.containsKey(INCOMPLETE_QUERY_FAIL_KEY)) {
				setIncompleteQueryFails(Boolean.parseBoolean(parameters.getProperty(INCOMPLETE_QUERY_FAIL_KEY)));
			}
			if (parameters.containsKey(EVALUATION_MODE_KEY)) {
				setEvaluationMode(TupleFunctionEvaluationMode.valueOf(parameters.getProperty(EVALUATION_MODE_KEY)));
			}
			if (parameters.containsKey(FUZZY_PREFIX_LENGTH_KEY)) {
				setFuzzyPrefixLength(NumberUtils.toInt(parameters.getProperty(FUZZY_PREFIX_LENGTH_KEY), 0));
			}
			if (luceneIndex == null) {
				initializeLuceneIndex();
			}
		} catch (Exception e) {
			throw new SailException("Could not initialize LuceneSail: " + e.getMessage(), e);
		}
	}

	/**
	 * The method is relocated to {@link SearchIndexUtils#createSearchIndex(java.util.Properties) }.
	 *
	 * @param parameters
	 * @return search index
	 * @throws Exception
	 * @deprecated
	 */
	@Deprecated
	protected static SearchIndex createSearchIndex(Properties parameters) throws Exception {
		return SearchIndexUtils.createSearchIndex(parameters);
	}

	protected void initializeLuceneIndex() throws Exception {
		SearchIndex index = SearchIndexUtils.createSearchIndex(parameters);
		setLuceneIndex(index);
	}

	public void setParameter(String key, String value) {
		parameters.setProperty(key, value);
	}

	public String getParameter(String key) {
		return parameters.getProperty(key);
	}

	public Set<String> getParameterNames() {
		return parameters.stringPropertyNames();
	}

	/**
	 * See REINDEX_QUERY_KEY parameter.
	 */
	public String getReindexQuery() {
		return reindexQuery;
	}

	/**
	 * See REINDEX_QUERY_KEY parameter.
	 */
	public void setReindexQuery(String query) {
		this.setParameter(REINDEX_QUERY_KEY, query);
		this.reindexQuery = query;
	}

	/**
	 * When this is true, incomplete queries will trigger a SailException. You can set this value either using
	 * {@link #setIncompleteQueryFails(boolean)} or using the parameter "incompletequeryfail"
	 *
	 * @return Returns the incompleteQueryFails.
	 */
	public boolean isIncompleteQueryFails() {
		return incompleteQueryFails;
	}

	/**
	 * Set this to true, so that incomplete queries will trigger a SailException. Otherwise, incomplete queries will be
	 * logged with level WARN. Default is true. You can set this value also using the parameter "incompletequeryfail".
	 *
	 * @param incompleteQueryFails true or false
	 */
	public void setIncompleteQueryFails(boolean incompleteQueryFails) {
		this.setParameter(INCOMPLETE_QUERY_FAIL_KEY, Boolean.toString(incompleteQueryFails));
		this.incompleteQueryFails = incompleteQueryFails;
	}

	/**
	 * See EVALUATION_MODE_KEY parameter.
	 */
	public TupleFunctionEvaluationMode getEvaluationMode() {
		return evaluationMode;
	}

	/**
	 * See EVALUATION_MODE_KEY parameter.
	 */
	public void setEvaluationMode(TupleFunctionEvaluationMode mode) {
		Objects.requireNonNull(mode);
		this.setParameter(EVALUATION_MODE_KEY, mode.name());
		this.evaluationMode = mode;
	}

	/**
	 * See {@link #INDEX_TYPE_BACKTRACE_MODE} parameter.
	 */
	public TypeBacktraceMode getIndexBacktraceMode() {
		return indexBacktraceMode;
	}

	/**
	 * See {@link #INDEX_TYPE_BACKTRACE_MODE} parameter.
	 */
	public void setIndexBacktraceMode(TypeBacktraceMode mode) {
		Objects.requireNonNull(mode);
		this.setParameter(INDEX_TYPE_BACKTRACE_MODE, mode.name());
		this.indexBacktraceMode = mode;
	}

	public void setFuzzyPrefixLength(int fuzzyPrefixLength) {
		setParameter(FUZZY_PREFIX_LENGTH_KEY, String.valueOf(fuzzyPrefixLength));
	}

	public TupleFunctionRegistry getTupleFunctionRegistry() {
		return tupleFunctionRegistry;
	}

	public void setTupleFunctionRegistry(TupleFunctionRegistry registry) {
		this.tupleFunctionRegistry = registry;
	}

	public FederatedServiceResolver getFederatedServiceResolver() {
		return serviceResolver;
	}

	@Override
	public void setFederatedServiceResolver(FederatedServiceResolver resolver) {
		serviceResolver = resolver;
		super.setFederatedServiceResolver(resolver);
	}

	/**
	 * Starts a reindexation process of the whole sail. Basically, this will delete and add all data again, a
	 * long-lasting process.
	 *
	 * @throws SailException If the Sail could not be reindexed
	 */
	public void reindex() throws SailException {
		try {
			// clear
			logger.info("Reindexing sail: clearing...");
			luceneIndex.clear();
			logger.info("Reindexing sail: adding...");

			try {
				luceneIndex.begin();

				// iterate
				SailRepository repo = new SailRepository(new NotifyingSailWrapper(getBaseSail()) {

					@Override
					public void init() {
						// don't re-initialize the Sail when we initialize the repo
					}

					@Override
					public void shutDown() {
						// don't shutdown the underlying sail
						// when we shutdown the repo.
					}
				});
				try (SailRepositoryConnection connection = repo.getConnection()) {
					TupleQuery query = connection.prepareTupleQuery(QueryLanguage.SPARQL, reindexQuery);
					try (TupleQueryResult res = query.evaluate()) {
						Resource current = null;
						ValueFactory vf = getValueFactory();
						List<Statement> statements = new ArrayList<>();
						while (res.hasNext()) {
							BindingSet set = res.next();
							Resource r = (Resource) set.getValue("s");
							IRI p = (IRI) set.getValue("p");
							Value o = set.getValue("o");
							Resource c = (Resource) set.getValue("c");
							if (current == null) {
								current = r;
							} else if (!current.equals(r)) {
								// subject changed: results are ordered by ?s, so the
								// previous resource is complete
								logger.debug("reindexing resource {}", current);
								// commit
								luceneIndex.addDocuments(current, statements);

								// re-init
								current = r;
								statements.clear();
							}
							statements.add(vf.createStatement(r, p, o, c));
						}

						// make sure to index statements for last resource
						if (current != null && !statements.isEmpty()) {
							logger.debug("reindexing resource {}", current);
							// commit
							luceneIndex.addDocuments(current, statements);
						}
					}
				} finally {
					repo.shutDown();
				}
				// commit the changes
				luceneIndex.commit();

				logger.info("Reindexing sail: done.");
			} catch (Exception e) {
				logger.error("Rolling back", e);
				luceneIndex.rollback();
				throw e;
			}
		} catch (Exception e) {
			throw new SailException("Could not reindex LuceneSail: " + e.getMessage(), e);
		}
	}

	/**
	 * Sets a filter which determines whether a statement should be considered for indexing when performing complete
	 * reindexing.
	 */
	public void registerStatementFilter(IndexableStatementFilter filter) {
		this.filter = filter;
	}

	protected boolean acceptStatementToIndex(Statement s) {
		IndexableStatementFilter nextFilter = filter;
		return (nextFilter != null) ? nextFilter.accept(s) : true;
	}

	public Statement mapStatement(Statement statement) {
		IRI p = statement.getPredicate();
		boolean predicateChanged = false;
		Map<IRI, IRI> nextIndexedFieldsMapping = indexedFieldsMapping;
		if (nextIndexedFieldsMapping != null) {
			IRI res = nextIndexedFieldsMapping.get(p);
			if (res != null) {
				p = res;
				predicateChanged = true;
			}
		}
		Set<IRI> nextIndexedFields = indexedFields;
		if (nextIndexedFields != null && !nextIndexedFields.contains(p)) {
			return null;
		}

		if (predicateChanged) {
			return getValueFactory().createStatement(statement.getSubject(), p, statement.getObject(),
					statement.getContext());
		} else {
			return statement;
		}
	}

	protected Collection<SearchQueryInterpreter> getSearchQueryInterpreters() {
		return Arrays.asList(new QuerySpecBuilder(incompleteQueryFails, indexId),
				new DistanceQuerySpecBuilder(luceneIndex), new GeoRelationQuerySpecBuilder(luceneIndex));
	}
}

/*
 * ********************************************************************* BELOW FIXMES are assumed to be fixed or an
 * agreement was reached. They can be removed in Oct 2007.
 */

/*
 * FIXME: The LuceneSail does not alter the datadir (i.e., passes it as-is to the wrapped Sail) and requires you to
 * specify a LuceneIndex. This means more work on the side of the integrator but allows for fine-grained control over
 * the type of storage used by the LuceneIndex: file-based, memory-based, db-based, etc. An alternative method is to
 * give the wrapped Sail a subdir in the datadir and let the LuceneSail take care of creating the LuceneIndex and
 * associated index dir. This gives the LuceneSail/Index more freedom in how it organizes data, e.g. when one wants to
 * store non-committed information in a temporary index without having to use the system's tmp dir. Which method is to
 * be preferred or whether both approaches can be combined has yet to be determined.
 * Gunnar and Leo: Added a sail-parameter, the initialize method will create the luceneindex with sensible defaults if
 * not set.
 * Enrico: sounds good!
FIXME: In light of all the issues mentioned in LuceneIndex and given the fact that in most applications, * integrators are able to provide statements in a more structured manner that randomly sorted triples, it may be a good * idea to provide some extension points that allow integrators to "do their own thing". In a way this is already * possible, as they are able to set the LuceneIndex. More sophisticated ways are e.g. an API for updating all * statements with the same subject at once. Gunnar and Leo: Proper transaction handling in LuceneIndex shoudl be all we * need, or? FIXME: The SailConnectionListener wraps IOExceptions in RuntimeException so that they can be rethrown. This * is a temporary fix until we have decided on the design of the SailConnectionListener API; it may even be extended to * allow throwing of SailExceptions. FIXME: Investigate whether LuceneSailConnection.clear should address the * LuceneIndex directly with a clear command, whether removed statements are reported already through the * SailConnectionListener, or whether the latter API will be extended with a separate clear event. FIXME: Gunnar and * Leo: Why isn't this implemented as a simple connectionwrapper? The connection-wrapper already forwards all calls, we * can just override methods where lucene interaction is needed, or? Do we gain anything by doing it as a listener? * Chris: it's been a while but I think this has to do with the SailConnection.clear accepting a number of contexts. As * context info is not stored in the Lucene index, we have no idea which info to remove. *If* removed statements are * reported to SailConnectionListeners (talk to Arjohn about this), we can use this event to update the index. On the * other hand, if we go with Leo's approach of storing multiple context IDs in a single Document (see LuceneIndex), this * may become a non-issue. Leo: Then I would implement LuceneSailConnection and do it with the multiple contexts. 
FIXME: * should we use the wrapped Sail's ValueFactory when creating Literals and URIs? Gunnar and Leo: sure, no other * solution. Enrico: yes! FIXME: Lucene's query parsing may result in a TooManyClauses Exception, e.g. when a wildcard * query matches more than 1024 query terms in the index. This default threshold of max. 1024 terms is configurable * through BooleanQuery.setMaxClauseCount but this may lead to very large memory usage (potentially OutOfMemoryErrors) * and is also global for all Lucene indices running in the same JVM. Perhaps a modified QueryParser is a solution, e.g. * by skipping term 1025 and beyond in order to approximate the query result? Leo: This only applies when we have no * "all" field. FIXME: All Literal properties of a Resource are both stored separately as separate Fields, as well as * concatenated and indexed as a single field. By *indexing* the former fields as well, we would be able to easily * support searching for specific predicates, besides only for entire Resources. We may even need this to support * returning snippets, or else we have no idea which property the query matched with. Cons: indexing these fields will * increase index size and decrease upload performance. Also, this way of searching for a specific predicate is a bit * strange for RDF, as the predicate restriction is part of the Lucene query string instead of the RDF graph query. * Gunnar and Leo: index all fields! For proper individual ranking indexing each fields is important. Enrico: yes, index * all fields (not only THE ALL field), we need it! Agreement: we index all fields, later make it configurable FIXME: It * may seem logical at first to set IndexWriter's auto-commit (available in Lucene 2.2) to false when adding triples, as * this could be useful for implementing Sesame's transactions: just commit the IndexWriter whenever the SailConnection * is committed. 
 * The main problem with this approach is that you are not able to search for Documents that have not been
 * committed yet, which is needed in order to update them with new properties for that subject. Consequently,
 * LuceneIndex's operation is very slow: each change on the IndexWriter is immediately flushed (resulting in
 * disk I/O when using an FSDirectory) and a new IndexReader is created for every added triple, which does some
 * non-trivial initialization. Alternative strategies:
 *
 * (1) Don't write Documents right away to the IndexWriter but cache them in main memory and only add them when
 * a commit on the LuceneIndex is issued by the LuceneSailConnection. Potential risk for out-of-memory errors
 * because you have no idea how much memory this is using.
 *
 * (2) Different mechanism but conceptually similar: buffer statements to add and process them in order of
 * subject when a commit is issued or the cache overflows, so that you only need to fetch the Document for that
 * subject once. The size of the cache can be approximated fairly well by looking at the sizes of the strings in
 * the statements.
 *
 * Gunnar and Leo: We had (1) in Gnowsis, and we never ran out of memory :) at least not for this reason ...
 * (2) is harder to implement; we suggest doing (1) and replacing it with (2) when it becomes a problem? Gunnar
 * and Leo will do (1) in the next few days.
 *
 * Enrico: we also suggest using (1), just keep the Lucene doc until the transaction is committed so you can
 * continue filling the doc and don't need to get it back from the index.
 *
 * Chris: (1) works for applications like Gnowsis and AutoFocus which probably do a commit after processing
 * every crawled resource; the number of statements in a transaction is then very small. Note however that
 * uploading a large RDF file to a Repository (also a common Sesame use case) is a single transaction, that's
 * where I expect you can easily get into trouble.
 *
 * Leo: ok, with bigger transactions there is trouble, which we leave to fix once the trouble arises.
 *
 * Chris (in skype-chat): (1) is ok for now, go for it.
 *
 * When statements arrive more or less in order of subject and we tune the caching a bit (e.g. by each time only
 * processing half of the cache and selecting those statements whose subjects we haven't seen in a while), this
 * delayed processing strategy may in some scenarios even lead to the optimal case where Documents are retrieved
 * and/or written at most once. Changing the index because of cache overflow still breaks SailConnection's
 * contract though: the index should only be altered in a permanent way when the SailConnection gets a commit.
 *
 * At first I thought that a triple-centric Document setup (each triple has its own Document) would solve all
 * this, as opposed to the current Resource-centric setup (all properties with the same subject in a single
 * Document). However, (1) you still need to check the index in order to prevent adding duplicates, which cannot
 * be done on uncommitted Documents (perhaps SailConnectionListener can tell us when a really new triple is
 * added? But even then: this probably works for quads, not for triples). Also, (2) when you *are* storing quads
 * (assuming this leads to a context field in the Document), the deletion of a statement no longer simply maps
 * to an IndexWriter.deleteDocuments(Term) invocation, so you need to query again to see which Documents need to
 * be deleted.
 *
 * FIXME: Right now, all literals are stored and indexed, datatypes are ignored. Should we process some
 * datatypes differently? Does it make sense to index booleans, numbers, etc.?
 *
 * Enrico: we don't use datatype and language for querying anyway, so this does not affect us.
 *
 * Agreement: Datatypes are ignored.
 *
 * FIXME: The context of triples is completely ignored at this moment. Perhaps this can simply be solved by
 * giving each Document a context ID besides the Resource ID?
 *
 * Leo (#1): yes, and multiple contextIDs, to state all contexts that contributed to the doc (see below, #2).
 *
 * FIXME: The clear(Resource...) is not implemented as we do not deal with contexts in this LuceneSail
 * implementation and thus do not know which triples to remove. This is problematic when people do a clear with
 * a specific context on a LuceneSail, as the LuceneIndex will then still keep legacy triples around. Only a
 * global clear can be implemented, but not a clear on a specific context. To me this strongly suggests that we
 * add a separate Document for each (Resource, context) pair, even though the objections raised in the paper
 * (troubles with creating scores) are reasonable, because otherwise we are not able to create a proper Sail
 * implementation. This only adds to the issue we realized before with ignoring context, namely that full-text
 * queries cannot be restricted to properties in a certain context.
 *
 * Leo (#2): An optimized approach would be to add multiple contextIDs, to state all contexts that contributed
 * to the doc (see above, #1). This means that when adding/appending to a document, all additional context URIs
 * are added to the document. When deleting individual triples, the context is ignored. In clear(Resource ...)
 * we make a query on all Lucene Documents that were possibly created by these context(s). Given a document D
 * that contexts C(1-n) contributed to, D' is the new document after clear().
 *
 * - If there is only one C then D can be safely removed. There is no D' (I hope this is the standard case: like
 * in ontologies, where all triples about a resource are in one document).
 *
 * - If there are multiple C, remember the URI of D, delete D, and query (s,p,o, ?) from the underlying store
 * after committing the operation. This returns the literals of D'; add D' as a new document.
 *
 * This will probably be both fast in the common case and capable enough in the multiple-C case. Any objections?
 * Gunnar? Enrico?
 *
 * Enrico: we don't query contexts at all, so the score is better this way than with (resource, context) paired
 * documents. So this looks like a working solution that keeps the Lucene index valid.
 */



