MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java.
5.1.1 -> 5.2
- End-of-the-line release. With this release, the big release becomes
the official release of MG4J. For some time, this version will be fixed
in case of bugs.
5.1 -> 5.1.1
- Fixed very subtle bug in documents returned from HtmlDocumentFactory.
Unparsed documents coming from streaming sources would have accessed the
data source during finalization due to toString() returning the document
title. This was causing random read errors, say, from WArc streams, if
a document was not closed properly. Added blurb to AbstractDocument that
warns about this issue.
- Fixed a bug in dynamic class naming ("Payload " was used instead of
"Payload"). Thanks to Dmitri Portnov for fixing this bug.
- Switched to SLF4J for logging.
5.0 -> 5.1
- A small revolution is taking place in MG4J: now most classes handling
indices have an IOFactory parameter that makes it possible to open files
in alternative filesystems, such as HDFS. Beware--the feature is very
pervasive and there might be missing spots. Thanks to Tim Potter for
useful discussions and for testing this new feature.
- InputStreamDocumentSequence was not behaving correctly in case of
keyboard input (two EOFs were necessary).
- The Maven artifacts did not contain the Velocity templates. Thanks
to Andrew MacKinlay for reporting this issue.
4.0.4 -> 5.0
- WARNING: this release has source and binary incompatibilities with
previous releases. Watch out.
- nextDocument() now returns DocumentIterator.END_OF_LIST instead of -1 to
denote list exhaustion. To avoid confusion and ease the transition, the
package prefix of MG4J is now it.unimi.di.*, following the change of
name of our department.
- it.unimi.di.mg4j.search.DocumentIterator is now strictly lazy; in
particular, it does not implement java.util.Iterator. Please replace
calls to DocumentIterator.hasNext() with a check to
DocumentIterator.nextDocument() != DocumentIterator.END_OF_LIST, or try
whether the semantics of DocumentIterator.mayHaveNext() suits you. The
change aligns the behaviour of the two versions of MG4J.
- The plethora of methods that accessed the positions of a term in an
IndexIterator have been replaced by the single lazy nextPosition() call,
which returns IndexIterator.END_OF_POSITIONS when the positions are
exhausted. Some static methods in IndexIterators should help with the
transition.
- MG4J is no longer based on gap-based indices. Classical interleaved indices
are used for incremental index construction and high-performance indices
are still supported for historical reasons, but all new indices are by
default built using the new quasi-succinct format.
- DiskBasedIndex.getInstance() now returns an Index instead of a
BitStreamIndex. Old code should check with a reflective call whether the
result is a BitStreamIndex and act accordingly, as now it might be a
QuasiSuccinctIndex, too.
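The replacement for hasNext() described above can be sketched as follows.
This is a minimal, self-contained illustration and not MG4J code: the
LazyDocIterator interface, the over() helper and the local END_OF_LIST
constant (set here to Integer.MAX_VALUE) are assumptions made for the
example, standing in for DocumentIterator and DocumentIterator.END_OF_LIST.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for a strictly lazy document iterator; NOT the MG4J interface. */
final class LazyIterationSketch {
    /** Hypothetical sentinel playing the role of DocumentIterator.END_OF_LIST. */
    static final int END_OF_LIST = Integer.MAX_VALUE;

    /** Minimal lazy iterator: there is no hasNext(), only nextDocument(). */
    interface LazyDocIterator {
        int nextDocument();
    }

    /** Toy implementation backed by an increasing list of document pointers. */
    static LazyDocIterator over(final int... pointers) {
        return new LazyDocIterator() {
            private int next = 0;

            @Override
            public int nextDocument() {
                // Once the list is exhausted, keep returning the sentinel.
                return next < pointers.length ? pointers[next++] : END_OF_LIST;
            }
        };
    }

    /** Drains an iterator using the sentinel check that replaces hasNext(). */
    static List<Integer> drain(LazyDocIterator iterator) {
        List<Integer> result = new ArrayList<>();
        int document;
        while ((document = iterator.nextDocument()) != END_OF_LIST) result.add(document);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(drain(over(2, 5, 9))); // prints [2, 5, 9]
    }
}
```

The loop never asks whether a next element exists; it just advances and
compares against the sentinel, which is what makes strict laziness possible.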
4.0.1 -> 4.0.4
- Fixed SimpleParser.parse(MutableString), which was throwing a
NullPointerException.
- Now DocumentRankScorer can load score files of any type.
4.0 -> 4.0.1
- We now force the number of documents of a virtual index to be equal
to that specified by the resolver. Collections in which the last few
documents were not referred to would have generated virtual indices with
fewer documents than the standard ones.
- DocumentSequenceImmutableGraph is now part of MG4J. Building graphs
out of web documents should be quite easy.
- Fixed a small bug in the equals method of Term.
- Fixed bug in the equals and hashCode methods of Select (before, only
Index was taken into account, and not the actual subquery).
- Fixed several small inconsistencies in the Scorer hierarchy.
- Added the SubsetDocumentSequence class to extract a subset of documents
from a given sequence.
- The default target for skipping structures is now 1%.
- Now ConsecutiveDocumentIterator has specialized code for non-gapped
phrases containing just terms.
- Now Combine loads sizes when compressing positions using interpolative coding.
- We now use bare-bones heaps and array priority queues to increase speed.
- BM25Scorer and BM25FScorer have significantly faster ranking logic.
3.0.1 -> 4.0
- WARNING: This release has minor binary incompatibilities with previous
releases, mainly due to the move from the interface
it.unimi.dsi.util.LongBigList to the now standard
it.unimi.dsi.fastutil.longs.LongBigList. It is part of a parallel release
of fastutil, the DSI Utilities, Sux4J, MG4J, WebGraph, etc. that were
all modified to fit the new interface, and that prepare the way for our
"big" versions, that is, supporting >2^31 entries in arrays (simulated),
elements in lists, terms, documents, nodes, etc. Please read our (short)
"Moving Java to Big Data" document (JavaBig.pdf) for details.
- We now require Java 6.
- WARNING: document iterators will return FALSE, instead of TRUE, for
indices for which there are no intervals. The actual intervals returned
(if there are any) have not changed, but the placeholder role of TRUE
has been taken by FALSE.
- WARNING: The semantics of TermCollectionVisitor.prepare(ReferenceSet)
has changed slightly.
- PdfDocumentFactory has been removed. Please use the new Tika-based
factory for PDF parsing.
- Backport from the big version of DocumentIterator.END_OF_LIST as a
substitute for Integer.MAX_VALUE. Please use it in new code--it also
makes transitions to the big version easier.
- Refined semantics for DocumentIterator, with new streamlined
implementations based on AbstractDocumentIterator.
- A long-standing bug in skipTo() has been fixed thanks to a very detailed
and replicable bug report by Soumen Chakrabarti. If the last posting
in a list had an ordinal position that was an exact multiple of the
quantum, skipping beyond the pointer contained in the posting would
have erroneously returned the last pointer instead of Integer.MAX_VALUE.
- A few serious bugs in the alignment operator have been fixed. The
operator is also significantly faster.
- A new set of classes interfaces with Apache's Tika to provide parsing
of Office, RTF, etc. files.
- Many fixes to the remote classes (still experimental!).
- IdentityDocumentFactory was not using the FIELDNAME property.
- The toString() method of LowPass was erroneously printing "<" instead of
"~".
- Query is now serializable and AbstractCompositeQuery exposes the component
queries.
- The range operator for payloads was broken.
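The skipTo() contract involved in the fix above can be illustrated with a
toy posting list. This is only a sketch under stated assumptions, not the
MG4J index reader: the SkipToSketch class, its array-backed layout, the
0-based reading of "ordinal position" and the local END_OF_LIST constant
are all made up for the example. It shows the expected result in the edge
case where the last posting sits exactly on a quantum boundary.

```java
/** Toy posting list with quantum-based skips; an illustrative sketch, NOT MG4J's reader. */
final class SkipToSketch {
    /** Local stand-in for the exhaustion sentinel (an assumption of this sketch). */
    static final int END_OF_LIST = Integer.MAX_VALUE;

    private final int[] pointers; // strictly increasing document pointers
    private final int quantum;    // one skip entry every 'quantum' postings
    private int current = 0;      // array index of the current posting

    SkipToSketch(int[] pointers, int quantum) {
        if (quantum <= 0) throw new IllegalArgumentException("quantum must be positive");
        this.pointers = pointers.clone();
        this.quantum = quantum;
    }

    /** Returns the first pointer greater than or equal to target, or END_OF_LIST. */
    int skipTo(int target) {
        int i = current;
        // Coarse phase: jump to the start of the next quantum while it is still below target.
        while (true) {
            int nextBlock = (i / quantum + 1) * quantum;
            if (nextBlock < pointers.length && pointers[nextBlock] < target) i = nextBlock;
            else break;
        }
        // Fine phase: linear scan inside the block, possibly running past the last posting.
        while (i < pointers.length && pointers[i] < target) i++;
        current = i;
        return i < pointers.length ? pointers[i] : END_OF_LIST;
    }

    public static void main(String[] args) {
        // The last posting (index 4) is an exact multiple of the quantum (4).
        SkipToSketch list = new SkipToSketch(new int[] {1, 3, 5, 7, 9}, 4);
        System.out.println(list.skipTo(6));                 // prints 7
        System.out.println(list.skipTo(10) == END_OF_LIST); // prints true
    }
}
```

Skipping beyond the pointer of the last posting must report exhaustion
rather than the last pointer, which is the behaviour the second call checks.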
3.0 -> 3.0.1
- MG4J is now distributed under the GNU Lesser General Public License 3.
- MG4J is no longer dependent on COLT or jal, but it requires at least
fastutil 6.
- Memory usage during indexing is a bit less tight, due to the new linear
probing hash maps in fastutil 6 which could use more RAM.
- SKEWED_GOLOMB is no longer supported for writing.
- When loading offsets in memory, the bit stream used to read them
was not properly closed.
- Fixed bug that would cause an error when creating a single empty
batch under Windows.
- Fixed bug that was preventing payload indices from working correctly
(thanks to Polina Morozova for finding and fixing this bug).
- Combine and subclasses now will work even if the occurrences field
of the component indices is not set (thanks to Soumen Chakrabarti
for reporting this bug).
- Fixed bug in AlignDocumentIterator that was causing random
IllegalStateExceptions (thanks to Roi Blanco for reporting this bug).
2.1.3 -> 3.0
- WARNING: Massive revamp of the DocumentIteratorVisitor subsystem. Now
such visitors can return data, much like a QueryIteratorBuildervisitor.
It also has a special visit method for MultiTermIndexIterators. You'll
have to adapt your previous implementations.
- WARNING: QueryParser instances are required to provide a parse(MutableString)
method and two new escape methods that can be used to turn a string into
a text token. This feature is fundamental for automatic query generation
(thanks to Hugo Zaragoza for pointing out this problem).
- WARNING: To make a few things easier, we now have explicit document
iterators representing true and false. Their construction requires a
reference index (contrary to what was happening with
DocumentIterators.EMPTY_ITERATOR), so the getInstance() methods of most
document iterators had to be updated, and DocumentIteratorVisitor
instances need to implement two new visit() methods. The iterators are
generated by the tokens #TRUE and #FALSE.
- WARNING: Indexing of virtual fields uses much less memory, but batches
now have a different content: they represent actual positions in the
final virtual document. Sizes of each batch represent the known size of
a virtual document at the moment the batch was written. With this change,
Paste no longer requires more memory than Concatenate.
- WARNING: A new RemappingDocumentIterator class makes it possible to
mix results from different indices with positional operators. Since
there is a new Remap query node, all DocumentVisitors will have
to be updated.
- WARNING: All deprecated classes have been removed.
- WARNING: The -B option of IndexBuilder is now aligned to Scan--it
specifies the basename of a collection to be built at indexing
time. It used to be the size of the Combine buffer.
- New classes for efficient document collection construction at
indexing time. The architecture is now also very open--you can
plug in your own builders.
- Completely restructured size handling for Combine and subclasses.
Unless you use Golomb coding, you will not need to load sizes.
This is true even of batches of virtual fields, as Paste now
by default does not renumber positions, but rather expects them
to be already renumbered. The old behaviour can be obtained
via a flag.
- We moved to Jetty 6. Also, a few problems with Velocity not finding
templates have been fixed.
- New, more intelligent memory handling that should be able to avoid
completely out-of-memory errors. There is also a limit on the
number of terms per batch that should help with garbage collection.
- Fixed a bug in collection creation: we used to provide the original
factory, but this is wrong as we might not be indexing all fields. Now
we generate a suitable factory that contains only the indexed fields.
- New important feature: high-performance indices may now have variable
quanta depending on the list frequency and density. Indices now sport a
.posnumbits file that records how many bits are used to store positions.
It is used as a basic statistic to compute the correct quantum. You
can ask for a percentage of the index to be used for skip towers, and
the right quantum for each list will be computed for you. The process
is quite empirical, so always look into .stats files to check that
you are actually using no more than the percentage requested. In general,
old indices will have to be rebuilt before being able to Combine them
into an index with variable quanta, but for high-performance indices
the tool ComputePosNumBitsPositions can be used to add the missing
file.
- Memory mapping of indices now uses the new multiplexed approach
implemented in ByteBufferInputStream. This means that we can
map into memory essentially every index. Thanks to Valentin Tablan
and Ian Roberts for suggesting this approach.
- Now we feature an implementation of the state-of-the-art BM25F ranking
function.
- ZipDocumentCollection.getInstance() makes it possible to reliably
load ZipDocumentCollection instances even if they are not
in the current directory.
- New UTF-8 nice mathematical symbols for conjunction, disjunction, TRUE
and FALSE.
- Fixed problem with too many connections open when using
JdbcDocumentCollection.
- A new SUCCINCTSIZES URI key makes it possible to ask for loading sizes
into an Elias-Fano compressed list. This will slow down access by
two orders of magnitude, but it can be very useful when pasting large
indices, as pasting needs to load a large amount of size data.
- EmptyIndexIterator instances are no longer Index-based singletons. This
change was necessary to make it possible to run ranking algorithms that
need to set the weight or id even of empty iterators. This should
cause no problem.
- All document iterators now have a settable weight. The weight can
be expressed in standard syntax using braces. Note that weights
per se have no meaning--it is up to the scorers to use them.
- Now the metadata-only option of Combine and its implementations generates
the file of frequencies. This is very useful as it makes it possible to
compute the term frequencies for the virtual documents obtained by
concatenating all fields--something that is necessary for the correct
computation of BM25F.
- Fixed a bug in the grammar: queries such as "(a))" would have been
parsed as "(a)" because of a lack of check for EOF (thanks to
Hugo Zaragoza for reporting this bug).
- The parser will now accept Unicode characters 0x2227 and 0x2228
(the standard mathematical symbols for conjunction and disjunction)
for AND and OR, respectively.
- Following some testing on TREC GOV2, the defaults for MAXPREANCHOR and
MAXPOSTANCHOR in HtmlDocumentFactory have been reduced to 8 and 4,
respectively.
- Fixed old bug in SemiExternalGammaList; readBits(0) was not called
after numLongs estimation, leading to EOFExceptions.
- Document pointers can now be coded in unary.
- Fixed bad bug in PartitionLexically: for high-performance indices,
the positions of the last term were not being written.
- HttpFileServer has a settable port.
- New Scorer.getWeights() method to get weights.
- Fixed a bug in TfIdf scorer that would have caused NaNs.
- Query accepts a newline-separated list of titles, besides
the usual serialised object.
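The grammar fix for queries such as "(a))" boils down to checking for end
of input after the top-level rule has matched. The sketch below shows the
idea on a deliberately tiny grammar; it is not the MG4J query parser (which
is generated by JavaCC), and the EofCheckSketch class and its grammar are
assumptions made purely for illustration.

```java
/** Tiny recursive-descent parser showing an explicit end-of-input check; NOT MG4J's parser. */
final class EofCheckSketch {
    private final String input;
    private int pos = 0;

    private EofCheckSketch(String input) {
        this.input = input;
    }

    /**
     * Parses expr := term | '(' expr ')', where term := letter+, and then
     * requires that the whole input has been consumed.
     */
    static String parse(String query) {
        EofCheckSketch parser = new EofCheckSketch(query);
        String result = parser.expr();
        // Without this check, "(a))" would be silently accepted as "(a)".
        if (parser.pos != query.length())
            throw new IllegalArgumentException(
                "Unexpected '" + query.charAt(parser.pos) + "' at offset " + parser.pos);
        return result;
    }

    private String expr() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++; // consume '('
            String inner = expr();
            if (pos >= input.length() || input.charAt(pos) != ')')
                throw new IllegalArgumentException("Missing ')' at offset " + pos);
            pos++; // consume ')'
            return "(" + inner + ")";
        }
        int start = pos;
        while (pos < input.length() && Character.isLetter(input.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected a term at offset " + pos);
        return input.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("(a)")); // prints (a)
        try {
            parse("(a))");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```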
2.1.2 -> 2.1.3
- URLMPHVirtualDocumentResolver required a sorted list, even if this was
not in the class specification. Now you can choose between a sorted list
(with reduced space occupancy) or a generic list (thanks to Nuno
Cardoso for reporting this bug).
- Fixed problem with VelocityViewServlet (getTemplate() must not be
invoked statically on Velocity for things to work properly; thanks to
Valentin Villenave for reporting a problem with the Lilypond Snippet
Repository which led me to fix this bug).
2.1.1 -> 2.1.2
- AlignDocumentIterator (syntax: ^) makes it possible to align
document/interval iterators from different indices. Using this
feature MG4J can easily support queries based on semantic tagging.
- Fixed another bug in Snowball stemmers: calling processTerm() with
a null argument would have caused an exception.
- Now Scan and IndexBuilder accept parseable objects as sequences.
The same happens for the WORDREADER property of some factories,
making it possible to create a moderately command-line-configurable
FastBufferedReader as WordReader.
- UNICODE_INPUT is now set in SimpleParser.jj, making it possible to
write wild Unicode queries again.
- QueryServlet now forces UTF-8 for output.
- We now distribute the javacc-generated files for easier installation.
- More liberal Velocity template-resolution setup, now documented in
the HttpQueryServer Javadoc.
- The --skips command-line option is gone. --no-skips disables skips
for interleaved indices only. By default, all indices have skips
that use about 2% of the index size.
- Fixed bad integer overflow bug when using large heights.
- New -i option for URLMPHVirtualDocumentResolver, mimicking
the same option in Sux4J's functions.
2.1 -> 2.1.1
- Major fix: the Snowball stemmers would generate empty strings, and
Combine would choke (generating empty indices) on empty strings.
- Removed obsolete PorterStemmerTermProcessor.
2.0.1 -> 2.1
- WARNING: Most utility classes have been moved to dsiutils. Old versions
are still here and deprecated, but you'll have some problems when importing
this version. Always check which version you're using!
- WARNING: TermMap has been replaced by StringMap (in dsiutils). PrefixMap
exists, but the dsiutils signature is completely different from the old one.
- Lots of stemmers coming from Snowball. We actually made some improvements
to the Java Snowball compiler to get this working at a reasonable speed.
- New (somewhat experimental) feature: you can get the terms that caused
an interval to be emitted.
- Sequential scan was not working for high-performance indices if positions
were not read. The problem was evident when combining high-performance indices
specifying -cPOSITIONS:NONE.
- Fixed a couple of NullPointerExceptions in index construction (thanks to
Marko Srdanovic for reporting these bugs).
- Fixed missing call to super.close() in AbstractIndexClusterIndexReader that
was causing spurious warnings.
- Now Query has multiplex on by default.
- Fixed bug in MutableString.subSequence() (thanks to Espen Amble Kolstad
for reporting this bug). MutableString is now in dsiutils.
- New QueryExpander interface for modifying queries between parsing and
actual resolution. It can be used, for instance, to do term expansion.
A simple abstract implementation (AbstractTermExpander) is provided
for term expansion. Also, an implementation that multiplexes terms
over indices (MultiIndexTermExpander) is provided.
- New allLines() method in LineIterator. LineIterator is now in dsiutils.
2.0 -> 2.0.1
- Can you believe that? Fast.leastSignificantBit() under very peculiar
circumstances was returning random data, but apparently this was causing
no harm. I don't wanna know.
- Better memory handling: buffer reallocation logic in index construction
could cause out-of-memory errors. Now we retry a small reallocation after
dumping the content in a temporary file, and record the event so the
Scan process can dump the current batch.
- Fixed old minor bug in Combine: term files and global-counts files were
not closed, leading to bizarre and spurious too-many-open-files errors.
- Fixed derelativisation when using FileSystemItem.
1.1.3 -> 2.0
- METAWARNING: This release has so many changes and so many new features
that we strongly suggest reading carefully all the information below
and the manual.
- WARNING: there are performance improvements due to fixed-point
computation of Golomb moduli (yes, it *really* slows down things), but
unfortunately all indices have to be rebuilt.
- WARNING: virtual fields have changed in a completely incompatible
way, and the same happened to AnchorExtractor. This was necessary
to finally get rid of problems with System.identityHashCode()
(see below).
- WARNING: BitStreamIndexIterator will now throw an UnsupportedOperationException
when positions or intervals are retrieved on an index without positions.
Previously, getting positions would have produced the same effect, but
getting intervals would have returned TRUE. This was causing a very confusing
behaviour with ordered AND, consecutivity, etc., as they were returning
false positives.
- WARNING: a great deal of work has gone into making all relevant iterators
fully lazy. Please use DocumentIterator.nextDocument() and
IntervalIterator.nextInterval(), after reading the related Javadoc
documentation. The change has produced significant performance
improvements.
- WARNING: IOExceptions are now rethrown by most index-access methods.
Previously, they would have been caught and wrapped into
RuntimeException, but this behaviour was slightly slowing down methods
called very often like nextDocument().
- WARNING: The old sequential reading methods (e.g., readDocumentPointer())
are no longer available (I guess nobody was using them anyway). They are
replaced by an IndexReader.nextIterator() method that returns an index
iterator on the term after the current one, until exhaustion.
- WARNING: Quanta are now restricted to powers of two.
- Completely new kind of index (high-performance). It uses the Lucene idea
of keeping positions in a separate file, and enriches it with MG4J skip
structures. It is now the default index type.
- Completely rewritten index reading. Now a Ruby script generates different
readers for different combinations of flags, significantly increasing
performance due to the reduced logic overhead. A generic class is always
available, but for production sites wired index readers are the right
choice. The wired, faster class is fetched automagically by reflection
if available.
- Completely new, memory-adaptive index construction strategy. Just specify
a number of *documents* per batch and let MG4J do the rest. Please read the
Scan class documentation.
- New payload-based indices. Now it is possible to index dates, integers,
or any other payload. By default we supply range queries.
- Significant improvements in performance. System.identityHashCode() turned
out to be *deadly* slow, so we dropped reference-based open hash maps and
started using brute-force array maps (you need fastutil >= 5.0.7)
whenever we have to manipulate very small sets. The gains are
surprising, in particular for queries containing frequent terms.
- Even more improvement due to parallel reimplementation of all operators for
the special case in which all document iterators are index iterators. In
this case all intervals have length 1 and can be retrieved eagerly. In some
cases performance is almost doubled.
- New low-level coded-integer skipping methods have further increased performance
in certain situations (e.g., phrasal queries containing stopwords).
- Now we use precomputed bit codes for 65536 words, uniformly. This
requires 4MiB of memory just for precomputed words, but it almost doubles
decoding speed (as the logic is much, much simpler).
- New bulk reading methods for integers in gamma, shifted gamma and delta
coding. They make readDocumentPositions() several times faster as most
decodings do not require a method call.
- Many fixes to the code involving generics.
- Fixed stupid bugs in PartitionLexically.
- Moved sizes into Index (from BitStreamIndex) and added new SIZES property
that makes it possible to specify a global sizes file. This way, it is
possible to use BM25 on clusters.
- Major fixes to documentally clustered document iterators.
- Fixed subtle semantic issue in LowPassDocumentIterator: TRUE iterators
now make the iterator valid.
- Fixed subtle semantic issue in subclasses of AbstractOrderedIntervalIterator:
now TRUE subiterators are considered as always matching (so the actual
interval matching is performed just on non-TRUE iterators).
- Fixed bug in ScoreDocumentBoundedSizeQueue that was causing enqueuing
of documents with score equal to the minimum.
- Improved implementation of MinimalPerfectHash. By fixing deterministically
the perfect hash functions we reduce to virtually zero the trials during
the construction (thanks to [email protected] for suggesting the idea).
- Fixed old copy-and-paste bug in non-scored requests to QueryEngine: offset
was not used at all (but I guess nobody was using that method anyway).
- Completely new support for query expansion. A MultiTermIndexIterator
behaves in all respects like an IndexIterator, but it's actually built
by merging the index iterators of several terms. The "frequency" is
settable so as to solve term-dependency problems in IDF-based ranking schemes.
For debugging purposes, + can be used (instead of |) to cause the
construction of a MultiTermIndexIterator.
- Brouwerian difference is now supported. It kills all intervals of the minuend
that appear in the subtrahend. It can be used to search for terms while
forcing the context in which they are found *not* to contain some other
terms, or more generally a query. It can also be used to modify index granularity
by subtracting 2-element intervals that cross section boundaries.
- ConsecutiveDocumentIterator now supports gaps that can be used to match arbitrary
words. This is particularly useful to perform phrasal queries in indices where
some terms have not been indexed. Gaps are specifiable using $ instead
of a term in the built-in parser.
- New methods to access the front of a subclass of AbstractUnionDocumentIterator,
that is, the indices of the component iterators positioned on the current document.
They are used by all union-based iterators, providing significant performance
improvements on large unions.
- New metadata-only mode for Combine and related subclasses. Mainly useful for getting
the global sizes, terms, etc. of a cluster.
- The array-writing methods of OutputBitStream now take a long for the
bit length/offset, and correspondingly return a long. The old methods are
still present, but they are deprecated (just to avoid proliferation).
- Deprecated all minimal perfect hashing constructors using the platform default
encoding. They are just an endless cause of problems. There are now constructors
with just a filename and an encoding (which can be null to mean the platform
encoding, but you have to explicitly ask for it).
- Now all TermMap implementations have a constructor accepting an Iterable.
- New constructors and main method options for minimal perfect hash tables, prefix
dictionaries and front-coded lists that support reading gzip'd files.
- Query provides a clearer selection between *no interval selection* and
*no intervals*.
- Fixed bug in ImmutableBinaryTrie: prefixes of the first binary string
would have generated an empty approximated interval (instead of [0]).
- Fixed bug in writeShiftedGamma()/readShiftedGamma(), and modified test.bsh
so that it detects the bug.
- The SPIRE 2006 algorithms are by now obsolete--we have new, provably optimally
lazy algorithms. The code reflects this.
- Lots, lots, lots of unit tests.
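The Brouwerian difference mentioned above can be sketched on plain lists
of intervals. The sketch assumes the containment reading of the operator,
that is, an interval of the minuend is killed when it contains some
interval of the subtrahend; this reading, the Interval class and the
eager list-based computation are assumptions of the example, whereas the
real operators work lazily on interval iterators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Eager, list-based sketch of Brouwerian difference on intervals; NOT MG4J's lazy operator. */
final class BrouwerianDifferenceSketch {
    /** Closed interval [left..right] of term positions (hypothetical helper class). */
    static final class Interval {
        final int left;
        final int right;

        Interval(int left, int right) {
            if (left > right) throw new IllegalArgumentException("left > right");
            this.left = left;
            this.right = right;
        }

        boolean contains(Interval other) {
            return left <= other.left && other.right <= right;
        }

        @Override
        public String toString() {
            return "[" + left + ".." + right + "]";
        }
    }

    /** Keeps the intervals of the minuend that contain no interval of the subtrahend. */
    static List<Interval> difference(List<Interval> minuend, List<Interval> subtrahend) {
        List<Interval> kept = new ArrayList<>();
        for (Interval m : minuend) {
            boolean killed = false;
            for (Interval s : subtrahend) {
                if (m.contains(s)) {
                    killed = true;
                    break;
                }
            }
            if (!killed) kept.add(m);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Suppose a section boundary lies between positions 3 and 4: subtracting the
        // 2-element interval [3..4] removes every match that crosses the boundary.
        List<Interval> matches = Arrays.asList(
            new Interval(1, 2), new Interval(2, 5), new Interval(6, 7));
        List<Interval> crossing = Arrays.asList(new Interval(3, 4));
        System.out.println(difference(matches, crossing)); // prints [[1..2], [6..7]]
    }
}
```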
1.1.2 -> 1.1.3
- New score(digits) method for ResultItem for easier display.
- Now JdbcDocumentCollection works with factories featuring more than one field.
- Reintroduced the JavaBeans Activation Framework in dependencies.
- Fixed lack of calls to close() in some document factories, generating
spurious warnings.
- Fixed static fields in QueryServlet.
1.1.1 -> 1.1.2
- Fixed default values of K_1 and B in BM25 scorer following
Büttcher & Clarke's paper.
- Fixed interval methods for nonsense calls on the empty interval.
- Dumped jline--we now suggest using rlwrap.
- More sensible hash for intervals. As a consequence, the serialUID
had to be bumped.
- Fixed serious bug in OrderedAndDocumentIterator that was dropping
several correct intervals (thanks to Fabien Campagne for finding
this bug).
- Fixed very old bug in InputBitStream.read(byte[], int)--reads of
full length would have caused an ArrayIndexOutOfBoundsException
(thanks to Kevin Dorff for finding this bug).
- OrDocumentIterator was using an indirect queue instead of a
semi-indirect queue, maybe for historical reasons.
- Complete rewrite of interval operators due to new algorithms, to
be included in the revised SPIRE 2006 paper. On TREC data this led
to an average 3% increase in speed. Now the algorithms used by MG4J
are provably optimally lazy.
- The BulletParser now accepts element-type names with dashes, etc., and
moreover parses correctly explicit CDATA sections (thanks to Kevin Dorff
for finding these bugs).
- Support for unsigning signed minimal perfect hash maps.
- New Shift-Add-Xor-based signed minimal perfect hashes (even with long
signatures). Moreover, now all signed hashes have a main() method
generating by default instances of that hash.
- Massive speed improvements in OutputBitStream: finally we write
precomputed words for small integers, analogously to what happens
in InputBitStream.
1.1 -> 1.1.1
- Better loading of InputBitStream data, working also with multiple class
loaders, and serialisability of SelectedInterval (fixed by the Twease
people).
- AbstractAggregator was not setting up the equalisation factors when
equalisation was not required, resulting in divisions by 0.
- CountScorer is now a DelegatingScorer (as it should have always been).
- The empty-constructor interval selector wasn't really letting out *all*
intervals--overlapping intervals would have been discarded.
- Fixed a *very old* bug in the computation of minimal-interval semantics.
Now the code is fully aligned with our SPIRE 2006 paper.
1.0.2 -> 1.1
- IMPORTANT: IndexWriter.close() no longer saves properties
automagically--you have to fetch them with IndexWriter.properties().
- Java 5 only.
- Probably the largest rewrite and extension in the history of MG4J. Too
many changes, fixes and optimisations to be described here. Almost nothing
is backward-compatible.
- We are starting to distribute unit tests with each release. We have
actually many more tests, but they are not cast inside JUnit and rather
undocumented. You are welcome to donate unit tests.
1.0.1 -> 1.0.2
- Fixed bug in InputStreamDocumentCollection: the document index (and thus
the title) was never incremented.
- New parsing factory for the BulletParser: you decide how to parse your names (an
idea by Fabien Campagne).
- Now we use 1.26n integers to minimally hash n words. 1.25n is in fact the
threshold--you need something larger than that. The change should be fully
backward-compatible.
- Now FileLinesCollection returns a Closeable FileLinesIterator.
- BloomFilter does not implement any longer the nonsensical size() method. add()
is more efficient and does not return a value.
1.0.0 -> 1.0.1
- Fixed bug in Paste if the sizes of size lists differ (now we extend to zero).
- The "field" property was not propagated by Combine.
- A missing throws clause in AbstractDocumentCollection's implementation of iterator()
was making it impossible to throw exceptions in implementing subclasses.
- New, efficient single-query iterator for JdbcDocumentCollection.
- NULLs do not generate null pointer exceptions any longer in JDBC document collections.
They're converted to empty input streams.
0.9.2 -> 1.0.0
- Too much to be written.
0.9.1 -> 0.9.2
- IMPORTANT: To avoid clashes with List, the get() method of TermMap
has been changed to getTerm(). We're sorry for this inconvenience.
- Now we support prefixes by means of a PrefixMap. There are easy
(ternary search trees) and very sophisticated (semi-external tries)
implementations. If you have a PrefixMap you can search for things
like "foo*" (meaning "starts with foo"), provided that the terms
starting with "foo" do not exceed a constant defined in QueryParser.
- Interval has new methods that compare to points.
- Fixed stupid bug in ClarkCormack scorer: we were comparing the document
indices, not the scores. Ouch.
- Fixed ScoredDocumentBoundedSizeQueue: now stability is forced by making
the order an actual order (not a preorder) so it is possible to get the
k-th to (k+j)-th ranked documents in a consistent way. The new version
is, unfortunately, completely incompatible with the old one.
- New CachingDocumentIterator: it decorates a DocumentIterator so that
you can get several times its interval iterators.
- FastBufferedOutputStream was NOT flushing. The flush() method was inherited,
but of course that didn't work. FastBuffered{In,Out}putStream are
now deprecated as they have been moved to fastutil.
- OrDocumentIterator would have caused IllegalStateException in some circumstances
(the array of underlying iterators was assumed to have null'd position for
empty component iterators, but this wasn't happening).
- InputBitStream is a boolean iterator, and OutputBitStream has a method
accepting a boolean iterator. This opens a new world of possibilities 8^).
- New replace() and delete() methods in MutableString for handling more
easily deletions or substitution of a class of characters.
- readLine() no longer empties its argument on end of file.
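The point about ScoredDocumentBoundedSizeQueue, an actual order instead of
a preorder, can be illustrated with a comparator that breaks score ties.
This is a minimal sketch and not the MG4J class: the Scored type, the
ranked() helper and the choice of the document index as tie-breaker are
assumptions made for the example. With a total order, the k-th to
(k+j)-th ranked documents no longer depend on the order of arrival.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Sketch of ranking with a total order on (score, document); NOT the MG4J queue. */
final class StableRankingSketch {
    /** Hypothetical scored-document pair. */
    static final class Scored {
        final int document;
        final double score;

        Scored(int document, double score) {
            this.document = document;
            this.score = score;
        }
    }

    /** Higher scores first; equal scores are ordered by increasing document index. */
    static final Comparator<Scored> RANK_ORDER = (a, b) -> {
        int byScore = Double.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.document, b.document);
    };

    /** Returns the documents ranked from position 'from' (inclusive) to 'to' (exclusive). */
    static List<Integer> ranked(List<Scored> items, int from, int to) {
        List<Scored> sorted = new ArrayList<>(items);
        sorted.sort(RANK_ORDER);
        List<Integer> documents = new ArrayList<>();
        for (int i = from; i < Math.min(to, sorted.size()); i++) documents.add(sorted.get(i).document);
        return documents;
    }

    public static void main(String[] args) {
        List<Scored> items = Arrays.asList(
            new Scored(7, 1.0), new Scored(3, 1.0), new Scored(5, 2.0), new Scored(1, 0.5));
        List<Scored> reversed = new ArrayList<>(items);
        Collections.reverse(reversed);
        // Documents 3 and 7 tie on score, yet both input orders give the same slice.
        System.out.println(ranked(items, 1, 3));    // prints [3, 7]
        System.out.println(ranked(reversed, 1, 3)); // prints [3, 7]
    }
}
```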
0.9.0 -> 0.9.1
- A couple of missing methods in it.unimi.dsi.mg4j.util.Fast were necessary for
WebGraph.
0.8.2 -> 0.9.0
- IMPORTANT: Int2IntArrayMap and Int2LongArrayMap no longer exist: offsets and
sizes now are type-specific array lists (and they can be easily generated using
fastutil wrappers).
- Fixed stupid, stupid bug in state handling in IndexReader. Sequential
reads of an entire index would have thrown an IllegalStateException.
- Changed a few array static methods to faster fastutil counterparts.
- Fixed small glitch in lastIndexOf() semantics--searches from negative
offset of the empty string would have returned 0 instead of -1.
- Bunch of new methods in MutableString ((last)indexOfAnyBut, (co)span).
- Golomb read/write methods now support modulus 0, 0 being the only valid
argument (and the result returned upon reads).
- Minimal perfect hashes support lists with fewer than 16 elements, by
storing them transparently in a vector.
- Signed hashes have an incompatible format (sorry).
- Minimal perfect hashes support optimal weight length computation for
sorted term collections.
- New left/right trim methods. Moreover, trim methods preserve looseness
and compactness.
- Literally zillions of new features, everywhere.
- Experimental support for multi-index minimal-interval semantics and
skipping towers.
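
The lastIndexOf() fix above concerns backward searches for the empty string from a negative offset, which should return -1. A minimal sketch of those semantics on plain CharSequence arguments, written to agree with java.lang.String.lastIndexOf(String, int); the class below is hypothetical and is not the MutableString implementation:

```java
final class LastIndexOfSketch {
    /** Backward search starting at offset from; a negative offset never matches. */
    static int lastIndexOf(CharSequence text, CharSequence pattern, int from) {
        if (from < 0) return -1; // also covers the empty pattern
        int start = Math.min(from, text.length() - pattern.length());
        for (int i = start; i >= 0; i--) {
            if (matchesAt(text, pattern, i)) return i;
        }
        return -1;
    }

    /** Assumes i + pattern.length() <= text.length(). */
    private static boolean matchesAt(CharSequence text, CharSequence pattern, int i) {
        for (int j = 0; j < pattern.length(); j++) {
            if (text.charAt(i + j) != pattern.charAt(j)) return false;
        }
        return true;
    }
}
```

With these semantics, searching "abc" for the empty string gives -1 from offset -1 and 0 from offset 0.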
0.8.1 -> 0.8.2
- New methods for starting and stopping a progress meter with messages.
- New FastMultiByteArrayInputStream class that can hold 256 PiB (256 PiB = 2^28
GiB) and expose them as a repositionable stream.
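
Java arrays are indexed by int, so holding more bytes than a single array allows requires spreading them over several arrays and addressing them with a long position. A minimal sketch of that general technique, assuming power-of-two slices so that the slice and the offset can be extracted with a shift and a mask; the class name and the slice size are illustrative assumptions and do not describe the actual layout of FastMultiByteArrayInputStream:

```java
final class MultiArraySketch {
    private final int sliceBits;   // log2 of the slice size (assumed parameter)
    private final long sliceMask;
    private final long length;
    private final byte[][] slices;

    MultiArraySketch(long length, int sliceBits) {
        if (length < 0 || sliceBits < 1 || sliceBits > 30) throw new IllegalArgumentException();
        this.length = length;
        this.sliceBits = sliceBits;
        long sliceSize = 1L << sliceBits;
        this.sliceMask = sliceSize - 1;
        int n = (int) ((length + sliceSize - 1) >>> sliceBits);
        slices = new byte[n][];
        for (int i = 0; i < n; i++) {
            long remaining = length - ((long) i << sliceBits);
            slices[i] = new byte[(int) Math.min(sliceSize, remaining)]; // last slice may be shorter
        }
    }

    long length() { return length; }

    int sliceCount() { return slices.length; }

    /** High bits of the position select the slice, low bits the offset within it. */
    byte get(long pos) {
        return slices[(int) (pos >>> sliceBits)][(int) (pos & sliceMask)];
    }

    void set(long pos, byte value) {
        slices[(int) (pos >>> sliceBits)][(int) (pos & sliceMask)] = value;
    }
}
```

For example, with 16-byte slices (sliceBits = 4) a 40-byte sequence uses three slices, and position 17 lands at offset 1 of the second slice.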
0.8 -> 0.8.1
- Modified imports and class name for compliance with fastutil 3.0.
- Relicensed under the LGPL.
0.7.1 -> 0.8
- New NullInputStream to support new InputBitStream direct array wrapping.
- position() in InputBitStream will always work if the new position is
within the current buffer.
- Removed unused buffer in InputBitStream, and made unget buffer allocation
on-demand.
- Eliminated finalizers from streams.
- New debugging class.
- Fixed bug in position().
- Now the ProgressMeter gives items/s at the first printout.
- New methods for variable-length nibble coding.
- New methods for zeta coding (a new code!).
- Fixed erroneous serialisation of CRC32SignedMinimalPerfectHash.
- Completely renewed hashing scheme for minimal perfect hashing: supports
the empty string and it is faster to compute. Moreover, MinimalPerfectHash
now has an offline builder that never actually loads the words in RAM,
thus making it possible to hash very large sets (albeit slowly), and checks
a suitable system property to provide optional verbose logging. The
serialisation, unfortunately, is incompatible.
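
The variable-length nibble coding mentioned above can be illustrated as follows. This is a sketch of one formulation, assumed here because the changelog does not spell out the bit layout: the binary representation is padded on the left with zeroes to a multiple of three bits and split into 3-bit blocks, and each block is prefixed by a flag bit that is 0 for all blocks except the last one. Bits are represented as '0'/'1' characters for readability; the class is hypothetical and is not the MG4J bit-stream API:

```java
final class NibbleCodingSketch {
    /** Encodes a natural number, four characters ('0'/'1') per block. */
    static String encode(long x) {
        if (x < 0) throw new IllegalArgumentException("natural numbers only");
        int bits = 64 - Long.numberOfLeadingZeros(x); // 0 when x == 0
        int blocks = Math.max(1, (bits + 2) / 3);     // zero still needs one block
        StringBuilder out = new StringBuilder(blocks * 4);
        for (int i = blocks - 1; i >= 0; i--) {
            out.append(i == 0 ? '1' : '0');           // flag: 1 marks the last block
            int payload = (int) ((x >>> (3 * i)) & 7);
            for (int b = 2; b >= 0; b--) out.append((payload >>> b) & 1);
        }
        return out.toString();
    }

    /** Decodes the first codeword found at the start of code. */
    static long decode(String code) {
        long value = 0;
        for (int pos = 0; pos + 4 <= code.length(); pos += 4) {
            for (int b = 1; b <= 3; b++) {
                value = (value << 1) | (code.charAt(pos + b) - '0');
            }
            if (code.charAt(pos) == '1') return value; // last block reached
        }
        throw new IllegalArgumentException("missing terminating block");
    }
}
```

Under this layout, 5 is written as 1101 and 8 as 0001 1000, so small numbers cost a single nibble.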
0.7 -> 0.7.1
- Removed experimental classes.
- Fixed two bad bugs introduced in 0.7 during optimisation.
0.6 -> 0.7
- IMPORTANT: MG4J now uses the new fastutil package name (i.e., no
more fastutil). If you use parts of MG4J that require fastutil, you
should upgrade.
- New replace() methods for MutableString that entirely replace
the string content. New copy() method to obtain easily a compact
copy of a mutable string.
- New RepositionableStream interface to mark streams that can
be repositioned by bit streams.
- New FastByteArrayInputStream to read memory blocks as bit
streams.
- New unsynchronised FastBufferedReader.
- ProgressMeter count value has now setters and getters.
- Programmable meter quantum for FirstPass.
- Some optimisations.
0.5 -> 0.6
- IMPORTANT: streams and self-delimiting string formats are not
binary compatible with previous versions. Please read the docs.
- Fixed bug with serialisation of empty strings and set serialVersionUID.
- Too many additions to be described in a file, but in short: optimised
indexOf() family of methods, flexible index construction, QuickSearch
fast searches.
0.4 -> 0.5
- MutableString has now a more coherent policy for compactness and looseness.
They are preserved by all operations.
- StringBuffer-specific methods have been killed to reduce code duplication.
You need to recompile so that Java uses the alternative CharSequence-specific
methods.
- Several new methods such as startsWith, endsWith etc.
0.3 -> 0.4
- IMPORTANT: the hash computation functions in
MinimalPerfectHash have been changed. Please regenerate your maps.
- MinimalPerfectHash has been reimplemented to use CharSequence, so
it is more general. Moreover, we have a new SignedMinimalPerfectHash
that can be used to avoid false positives.
- New replace methods in mutable strings.
- Now we try to return a reference to this in all mutable string methods.
- Various fixes to documentation.
0.2 -> 0.3
- Introduced new class MutableString.
0.1 -> 0.2
- By mistake writeLongDelta() was really called writeDelta().