All Downloads are FREE. Search and download functionalities are using the official Maven repository.

src.it.unimi.dsi.big.mg4j.query.nodes.package.html Maven / Gradle / Ivy

Go to download

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. The big version is a fork of the original MG4J that can handle more than 2^31 terms and documents.

The newest version!


  
    MG4J: Managing Gigabytes for Java
  

  

    

Composite representation for queries

This package contains the classes that represent queries as an abstract syntax tree, or, in design-pattern jargon, as a composite.

Warning: Before reading this overview, the reader is encouraged to consult the first part of the {@link it.unimi.dsi.big.mg4j.search} package documentation, where the notion of query is briefly presented.

The basic idea is to perform a complete decoupling of the query construction and resolution into three levels:

  • String level: the query is a string, written in a suitable language where, for example, every operator has a certain syntax, uses a certain keyword etc.
  • Tree level: the query is an abstract tree, that contains one internal node for each operator, but that has nothing to do with the actual way the query was written; for example, one may want to give the user the possibility to write the query by using a visual editor, in which case there will be some way to obtain the tree representation of the query without having any string-level representation for it.
  • Iterator level: the document iterator that lazily solves the query over a certain index or set of indices; this is the semantic counterpart of a query, and it is discussed at large in the overview of the {@link it.unimi.dsi.big.mg4j.search} package.

This package overview shows how to pass from the string to the tree level using the built-in {@link it.unimi.dsi.big.mg4j.query.parser.QueryParser}, and explains what is needed to implement the tree level, as well as the basic instruments needed to convert the tree level into the iterator level.

In MG4J, a query parser is an implementation of {@link it.unimi.dsi.big.mg4j.query.parser.QueryParser}: essentially, an object with a method that transforms strings in {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Query queries}. You can use any parser you like, or build your queries programmatically.

The parser provided with MG4J and whose syntax is described in the manual is currently generated using JavaCC: unfortunately, this tool produces some public classes that somehow clutter the package documentation. All the parsing logic is contained in the {@link it.unimi.dsi.big.mg4j.query.parser.SimpleParser} class, which is generated by the JavaCC source SimpleParser.jj.

Building a query tree out of a query string

MG4J allows one to perform queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails. You might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source.

To get a query tree from a string representation, you must first build a {@link it.unimi.dsi.big.mg4j.query.parser.QueryParser} object. For instance, if you use the built-in {@link it.unimi.dsi.big.mg4j.query.parser.SimpleParser} you have to provide a map whose keys are index names (called index aliases) and with each index alias mapped to the corresponding {@link it.unimi.dsi.big.mg4j.index.Index}. Moreover, one of the aliases is taken to be the default index alias used for the queries.

In our example, we will assume that we have three index aliases (say, subject, from and body), and that subject is the default index alias. The parser would then be created as follows (here, we are stipulating that index with alias x has basename mail-x):

		Object2ReferenceMap<String,Index> indexAlias2Index = new Object2ReferenceOpenHashMap<String,Index>();
		Index defaultIndex;
		indexAlias2Index.put( "subject", defaultIndex = Index.getInstance( "mail-subject" ) );
		indexAlias2Index.put( "from", Index.getInstance( "mail-from" ) );
		indexAlias2Index.put( "body", Index.getInstance( "mail-body" ) );
		
		QueryParser parser = new QueryParser( indexAlias2Index.keySet(), "subject" );
		
after which you can use it for example as follows:
			Query query = parser.parse( "meeting AND body:(schedule OR urgent)" );
			DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( indexAlias2Index, defaultIndex, 0 );
			DocumentIterator it = query.accept( visitor );
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}
		

Structure of a {@link it.unimi.dsi.big.mg4j.query.nodes.Query}

As explained above, in this package a {@link it.unimi.dsi.big.mg4j.query.nodes.Query} is simply an abstract tree: its leaves will correspond to ground queries, whereas internal nodes correspond to query operators.

Ground queries can be {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Term term queries}, {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Prefix prefix queries}, {@linkplain it.unimi.dsi.big.mg4j.query.nodes.MultiTerm multiterm queries} (the latter usually generated by some query-expansion mechanism), {@link it.unimi.dsi.big.mg4j.query.nodes.True} and {@link it.unimi.dsi.big.mg4j.query.nodes.False}. Any other query is either a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Composite composite query} (e.g., a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.And conjunctive query}, a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Or disjunctive query} etc.), that is, a query composed by other subqueries, or it is obtained by some other operator (e.g., a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.LowPass low-pass query} etc.).

The {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Weight} class allows one to assign a weight to a query (it wraps a query and assigns a weight with it; queries that are not assigned a weight explicitly have a default weight of 1). A special treatment is reserved to weighted terms (that is, to nodes that are made by applying a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Weight} to a {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Term}: they are considered essentially as variants of terms, and so they can be used to build a multiterm query.

Every query has a method that {@linkplain it.unimi.dsi.big.mg4j.query.nodes.Query#accept(it.unimi.dsi.big.mg4j.query.nodes.QueryBuilderVisitor) accepts} a {@link it.unimi.dsi.big.mg4j.query.nodes.QueryBuilderVisitor} object, through which you can visit the query tree. More precisely, when the accept method is invoked on a query, a recursive visit of the query tree is performed: at every node n of type Q (Q being any class implementing {@link it.unimi.dsi.big.mg4j.query.nodes.Query}), the following steps are performed:

  • the visitor's visitPre(Q n) method is called; if the method returns false, the visit is interrupted; otherwise…
  • the subtrees are recursively visited, obtaining an array of results T[] x; if at a certain point during this visit some call decides that the visit should be interrupted, we stop; otherwise…
  • the visitor's visitPost(Q n, T[] x) method is called, and the result (of type T) is returned.

To make the idea of a visitor easier to understand, consider the following simple example of a visitor:

    	        static class PrinterVisitor extends AbstractQueryBuilderVisitor<String> {
                private static void appendArray( MutableString res, String t[], char sep ) {
                        for ( int i = 0; i < t.length - 1; i++ ) res.append( t[ i ] + sep );
                        res.append( t[ t.length - 1 ] );
                }
                public String[] newArray( int len ) { return new String[ len ]; }
                public String visitPost( OrderedAnd node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OAND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Or node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OR(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( And node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "AND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Consecutive node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "\"" );
                        appendArray( res, t, ' ' );
                        res.append( "\"" );
                        return res.toString();
                }
                public String visitPost( Not node, String t ) {
                        return "NOT(" + t + ")";
                }
                public String visitPost( LowPass node, String t ) {
                        return t + "~" + node.k;
                }
                public String visitPost( Select node, String t ) {
                        return node.index + ":" + t;
                }
                public String visitPost( Weight node, String t ) {
                		return t + "{" + node.weight + "}";
                }
                public String visit( Term node ) { return node.term.toString(); }
                public String visit( Prefix node ) { return node.prefix + "*"; }
                public String visit( MultiTerm node ) { return node.term.toString(); }
        }
   	

Now, you can pass an instance of this visitor to a query, and as a result get a string (linear) representation of the query itself. (Of course, there is a simple way to get the same result, that is, calling the toString() method directly, but for this example is easy enough for illustrative purposes.)

Here is an example of how the above visitor can be used:

          public static void main( String arg[] ) throws Exception {
                Query qa = new Term( "a" );
                Query qb = new Prefix( "b" );
                Query qc = new Term( "c" );
                Query qd = new Term( "d" );

                Query q = new LowPass( new And( qa, new Select( "another_index", qb ), new Or( qc, qd ) ), 20 );

                System.out.println( q.accept( new PrinterVisitor() ) );
        }
  		
that produces AND(a,another_index:b*,OR(c,d))~20.

Building a document iterator out of a query

The interface {@link it.unimi.dsi.big.mg4j.query.nodes.Query} and its implementations provide the basic classes for the tree-level representation of a query. Building an iterator out of it can be thought of as a process of recursive instantiation of a query tree into an iterator tree; this is actually a sort of a copy visit, and it is indeed implemented as a special kind of visitor: the {@link it.unimi.dsi.big.mg4j.search.DocumentIteratorBuilderVisitor}. You can create one such visitor and use it to visit a query tree, obtaining as a result a document iterator that you can use to get the results.

To be more precise, let us recall the example presented in the last part of the {@linkplain it.unimi.dsi.big.mg4j.search overview of the search package}:

			Index subjectIndex = Index.getInstance( "mail-subject" );
			DocumentIterator it = AndDocumentIterator.getInstance( 
				subjectIndex.documents( "meeting" ), 
				subjectIndex.documents( "schedule" ), 
				subjectIndex.documents( "monday" ) 
			);
			while ( it.hasNext() ) {
					System.out.println( "Document #: " + it.nextDocument() );
					System.out.print( "\tPositions:" );
					for ( Interval interval: it )
						System.out.print( " " + interval );
					System.out.println();
			}
		

Here is an equivalent piece of code that uses the idea of decoupling the tree level from the iterator level.

			Index subjectIndex = Index.getInstance( "mail-subject" );
			Query q = new And( new Term( "meeting" ), new Term( "schedule" ), new Term( "monday" ) );
			DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( null, subjectIndex, 0 );
			DocumentIterator it = q.accept( visitor );
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}
		




© 2015 - 2025 Weber Informatics LLC | Privacy Policy