
src.it.unimi.dsi.big.mg4j.index.TermProcessor


MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. The big version is a fork of the original MG4J that can handle more than 2^31 terms and documents.

package it.unimi.dsi.big.mg4j.index;

/*		 
 * MG4J: Managing Gigabytes for Java (big)
 *
 * Copyright (C) 2005-2011 Sebastiano Vigna 
 *
 *  This library is free software; you can redistribute it and/or modify it
 *  under the terms of the GNU Lesser General Public License as published by the Free
 *  Software Foundation; either version 3 of the License, or (at your option)
 *  any later version.
 *
 *  This library is distributed in the hope that it will be useful, but
 *  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 *  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 *  for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program; if not, see .
 *
 */

import it.unimi.dsi.lang.FlyweightPrototype;
import it.unimi.dsi.lang.MutableString;

import java.io.Serializable;

/** A term processor, implementing term/prefix transformation and possibly term/prefix filtering.
 *
 * <p>Index construction sometimes requires modifying
 * the given terms: downcasing, stemming, and so on. The same
 * transformation must be applied to terms in a query. This
 * interface provides a uniform way to perform arbitrary term
 * transformations.
 *
 * <p>Index construction also requires term filtering:
 * {@link #processTerm(MutableString)} may
 * return <code>false</code>, indicating that the term should not
 * be processed at all (e.g., because it is a stopword).
 *
 * <p>Additionally, the method {@link #processPrefix(MutableString)} may
 * analogously process a prefix (used for prefix queries).
 *
 * <p>Implementations are encouraged to expose a singleton, when
 * possible, by means of the static factory method <code>getInstance()</code>.
 *
 * <p>Warning: implementations of this class are not required
 * to be thread-safe, but they provide {@link it.unimi.dsi.lang.FlyweightPrototype flyweight copies}.
 * The {@link #copy()} method is strengthened so as to return an instance of this class.
 *
 * <p>This interface was originally suggested by Fabien Campagne.
 */
public interface TermProcessor extends Serializable, FlyweightPrototype<TermProcessor> {

	/** Processes the given term, leaving the result in the same mutable string.
	 *
	 * @param term a mutable string containing the term to be processed,
	 * or <code>null</code>.
	 * @return true if the term is not <code>null</code> and should be indexed, false otherwise.
	 */
	public boolean processTerm( MutableString term );

	/** Processes the given prefix, leaving the result in the same mutable string.
	 *
	 * <p>This method is not used during the indexing phase, but rather at query
	 * time. If the user wants to specify a prefix query, it is sometimes necessary
	 * to transform the prefix
	 * (e.g., {@linkplain DowncaseTermProcessor#processPrefix(MutableString) downcasing it}).
	 *
	 * <p>This method is unlikely to return false, as it is usually not
	 * possible to foresee which strings are prefixes of indexable words. If no natural
	 * transformation applies, this method should leave its argument unchanged.
	 *
	 * @param prefix a mutable string containing a prefix to be processed,
	 * or <code>null</code>.
	 * @return true if the prefix is not <code>null</code> and there might be an indexed
	 * word starting with <code>prefix</code>, false otherwise.
	 */
	public boolean processPrefix( MutableString prefix );

	public TermProcessor copy();
}
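The contract described above (transform in place, return false to filter a term out, and provide flyweight copies) can be illustrated with a minimal, self-contained sketch. It does not use the MG4J classes: `SimpleTermProcessor` is a hypothetical stand-in for `TermProcessor`, `StringBuilder` stands in for `MutableString`, and `DowncaseStopwordProcessor` is an invented name for illustration only, not a class shipped with the library.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TermProcessorSketch {

	/** Hypothetical stand-in for TermProcessor; StringBuilder replaces MutableString. */
	interface SimpleTermProcessor {
		boolean processTerm(StringBuilder term);
		boolean processPrefix(StringBuilder prefix);
		SimpleTermProcessor copy();
	}

	/** Downcases terms in place and filters out stopwords (illustrative only). */
	static final class DowncaseStopwordProcessor implements SimpleTermProcessor {
		private final Set<String> stopwords; // immutable, so it can be shared by copies

		DowncaseStopwordProcessor(Set<String> stopwords) {
			this.stopwords = Collections.unmodifiableSet(new HashSet<>(stopwords));
		}

		/** Replaces the content of the buffer with its lower-case form. */
		private static void downcase(StringBuilder s) {
			String lower = s.toString().toLowerCase(Locale.ROOT);
			s.setLength(0);
			s.append(lower);
		}

		@Override
		public boolean processTerm(StringBuilder term) {
			if (term == null) return false;
			downcase(term);
			// Returning false tells the caller not to index this term.
			return !stopwords.contains(term.toString());
		}

		@Override
		public boolean processPrefix(StringBuilder prefix) {
			if (prefix == null) return false;
			downcase(prefix);
			// We cannot know which words start with this prefix, so never filter.
			return true;
		}

		/** Return type strengthened to the concrete class, as in the interface above. */
		@Override
		public DowncaseStopwordProcessor copy() {
			return new DowncaseStopwordProcessor(stopwords);
		}
	}

	public static void main(String[] args) {
		SimpleTermProcessor p =
			new DowncaseStopwordProcessor(new HashSet<>(Arrays.asList("the", "a")));

		StringBuilder term = new StringBuilder("Index");
		System.out.println(p.processTerm(term) + " " + term);   // true index

		StringBuilder stop = new StringBuilder("The");
		System.out.println(p.processTerm(stop) + " " + stop);   // false the

		StringBuilder prefix = new StringBuilder("InD");
		System.out.println(p.copy().processPrefix(prefix) + " " + prefix); // true ind
	}
}
```

Because the same transformation is applied at indexing and at query time, a query for "Index" and a document containing "index" end up with the same term. In a multithreaded setting, each thread would call `copy()` once and use its own instance rather than sharing one.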




