
org.apache.mahout.math.map.package-info


High performance scientific and technical computing data structures and methods, mostly based on CERN's Colt Java API

There is a newer version: 0.13.0
/**
 * 
 * 
 * Automatically growing and shrinking maps holding objects or primitive
 * data types such as int, double, etc. Currently all maps are
 * based upon hashing.
 * 

1. Overview

The map package offers flexible object-oriented abstractions modelling automatically resizing maps. It is designed to be scalable in terms of performance and memory requirements.

Features include:

  • Maps operating on objects as well as all primitive data types such as int, double, etc.
  • Compact representations
  • Support for quick access to associations
  • A number of general purpose map operations

File-based I/O can be achieved through the standard Java built-in serialization mechanism. All classes implement the {@link java.io.Serializable} interface. However, the toolkit is entirely decoupled from advanced I/O. It provides data structures and algorithms only.

This toolkit borrows some terminology from the Javasoft Collections framework written by Josh Bloch and introduced in JDK 1.2.

2. Introduction

A map is an associative container that manages a set of (key,value) pairs. It is useful for implementing a collection of one-to-one mappings. A (key,value) pair is called an association. A value can be looked up via its key. Associations can quickly be set, removed and retrieved. They are stored in a hashing structure based on the hash code of their keys, which is obtained by using a hash function.

A map can, for example, contain Name-->Location associations like {("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")} used in address books, or Index-->Value mappings like {(0, 100), (3, 1000), (100000, 70)} representing sparse lists or matrices. For example, this could mean that at index 0 we have a value of 100, at index 3 we have a value of 1000, at index 100000 we have a value of 70, and at all other indexes we have a value of, say, zero. Another example is a map of IP addresses to domain names (DNS). Maps can also be useful to represent multi sets, that is, sets where elements can occur more than once. For multi sets one would have Value-->Frequency mappings like {(100, 1), (50, 1000), (101, 3)}, meaning element 100 occurs 1 time, element 50 occurs 1000 times, and element 101 occurs 3 times. Further, maps can also manage ObjectIdentifier-->Object mappings like {(12, obj1), (7, obj2), (10000, obj3), (9, obj4)} used in Object Databases.
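As a rough illustration of the Value-->Frequency and Index-->Value ideas above, the following sketch uses java.util.HashMap as a stand-in rather than the primitive maps of this package. The class name MapExamples and its two helper methods are hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.Map;

class MapExamples {
    // Builds a multi set as Value-->Frequency associations.
    static Map<Integer, Integer> frequencies(int[] elements) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int e : elements) {
            freq.merge(e, 1, Integer::sum); // absent keys start at 1, present keys are incremented
        }
        return freq;
    }

    // Reads a sparse list stored as Index-->Value; indexes without an association read as zero.
    static double sparseGet(Map<Integer, Double> sparse, int index) {
        return sparse.getOrDefault(index, 0.0);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> freq = frequencies(new int[] {101, 100, 101, 101});
        System.out.println("frequency of 101 = " + freq.get(101));

        Map<Integer, Double> sparse = new HashMap<>();
        sparse.put(0, 100.0);
        sparse.put(3, 1000.0);
        sparse.put(100000, 70.0);
        System.out.println("value at index 4 = " + sparseGet(sparse, 4));
    }
}
```

The boxed java.util maps are used here only because they need no extra dependency; the primitive maps described in this package avoid the boxing.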

A map cannot contain two or more equal keys; a key can map to at most one value. However, more than one key can map to identical values. For primitive data types "equality" of keys is defined as identity (operator ==). For maps using Object keys, the meaning of "equality" can be specified by the user upon instance construction. It can either be defined to be identity (operator ==) or to be given by the method {@link java.lang.Object#equals(Object)}. Associations of kind (AnyType,Object) can be of the form (AnyKey,null), i.e. values can be null.
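The difference between the two notions of key equality can be sketched with two standard library maps, used here as stand-ins: java.util.HashMap compares keys with equals, while java.util.IdentityHashMap compares them with ==. How this package's own constructors select the equality mode is not shown; the class KeyEquality is hypothetical.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

class KeyEquality {
    // Inserts two keys that are equal by equals() but are distinct objects,
    // then reports how many associations the given map ends up holding.
    static int distinctKeys(Map<String, String> map) {
        String a = new String("Pete");
        String b = new String("Pete"); // a.equals(b) is true, a == b is false
        map.put(a, "Geneva");
        map.put(b, "Paris");
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("equals-based keys:   " + distinctKeys(new HashMap<>()));
        System.out.println("identity-based keys: " + distinctKeys(new IdentityHashMap<>()));
    }
}
```

Under equals-based equality the second put overwrites the first association; under identity-based equality both associations are kept.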

The classes of this package make no guarantees as to the order of the elements returned by iterators; in particular, they do not guarantee that the order will remain constant over time.

Copying

Any map can be copied. A copy is equal to the original but entirely independent of it, so changes in the copy are not reflected in the original, and vice versa.

3. Package organization

For most primitive data types and for objects there exists a separate map version. All versions are the same, except that they operate on different data types. Colt includes two kinds of map implementations, tagged Chained and Open. Note: Chained is no longer included. Wherever it is mentioned, it is of historic interest only.

  • Chained uses extendible separate chaining with chains holding unsorted dynamically linked collision lists.
  • Open uses extendible open addressing with double hashing.
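To make "open addressing with double hashing" more concrete, here is a minimal sketch for (int-->double) associations. It is not this package's implementation: it has a fixed capacity, no removal and no resizing, and the class DoubleHashingSketch with all of its members is hypothetical. The capacity is assumed to be a prime of at least 3, so that every probe stride visits all slots.

```java
class DoubleHashingSketch {
    private final int[] keys;
    private final double[] values;
    private final boolean[] used;
    private int size;

    // primeCapacity is assumed to be a prime >= 3; this is not checked.
    DoubleHashingSketch(int primeCapacity) {
        keys = new int[primeCapacity];
        values = new double[primeCapacity];
        used = new boolean[primeCapacity];
    }

    // Returns the slot that holds key, or the free slot where key would be stored,
    // or -1 if the table is full and key is absent.
    private int slotOf(int key) {
        int capacity = keys.length;
        int hash = key & 0x7FFFFFFF;           // force a non-negative hash
        int index = hash % capacity;           // first hash: starting slot
        int step = 1 + hash % (capacity - 2);  // second hash: probe stride in 1..capacity-2
        for (int probes = 0; probes < capacity; probes++) {
            if (!used[index] || keys[index] == key) {
                return index;
            }
            index = (index + step) % capacity; // collision: jump by the key-specific stride
        }
        return -1;
    }

    void put(int key, double value) {
        int slot = slotOf(key);
        if (slot < 0) {
            throw new IllegalStateException("table full; a real map would grow here");
        }
        if (!used[slot]) {
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    boolean containsKey(int key) {
        int slot = slotOf(key);
        return slot >= 0 && used[slot];
    }

    // Returns 0.0 for absent keys, mirroring the example output in section 4.
    double get(int key) {
        int slot = slotOf(key);
        return (slot >= 0 && used[slot]) ? values[slot] : 0.0;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        DoubleHashingSketch map = new DoubleHashingSketch(11);
        map.put(0, 100.0);
        map.put(3, 1000.0);
        map.put(11, 7.0); // 11 % 11 == 0 collides with key 0 and is placed by probing
        System.out.println("get(11)=" + map.get(11) + ", size=" + map.size());
    }
}
```

Because keys and values live in parallel primitive arrays, no entry objects are created, which is the space advantage the notes in section 5 attribute to open addressing.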

Class naming follows the schema <Implementation><KeyType><ValueType>HashMap. For example, a {@link org.apache.mahout.math.map.OpenIntDoubleHashMap} holds (int-->double) associations and is implemented with open addressing. A {@link org.apache.mahout.math.map.OpenIntObjectHashMap} holds (int-->Object) associations and is implemented with open addressing.

The classes for maps of a given (key,value) type are derived from a common abstract base class tagged Abstract<KeyType><ValueType>Map. For example, all maps operating on (int-->double) associations are derived from {@link org.apache.mahout.math.map.AbstractIntDoubleMap}, which in turn is derived from an abstract base class tying together all maps regardless of association type, {@link org.apache.mahout.math.set.AbstractSet}. The abstract base classes provide skeleton implementations for all but a few methods. Experimental layouts (such as chaining, open addressing, extensible hashing, red-black-trees, etc.) can easily be implemented and inherit a rich set of functionality. Have a look at the javadoc tree view to get the broad picture.

4. Example usage

 * // requires org.apache.mahout.math.map.AbstractIntDoubleMap and
 * // org.apache.mahout.math.map.OpenIntDoubleHashMap on the classpath;
 * // System.out is used so the snippet does not depend on a logger named "log"
 * int[]    keys   = {0    , 3     , 100000, 9   };
 * double[] values = {100.0, 1000.0, 70.0  , 71.0};
 * AbstractIntDoubleMap map = new OpenIntDoubleHashMap();
 * // add several associations
 * for (int i=0; i < keys.length; i++) map.put(keys[i], values[i]);
 * System.out.println("map="+map);
 * System.out.println("size="+map.size());
 * System.out.println(map.containsKey(3));
 * System.out.println("get(3)="+map.get(3));
 * System.out.println(map.containsKey(4));
 * System.out.println("get(4)="+map.get(4));
 * System.out.println(map.containsValue(71.0));
 * System.out.println("keyOf(71.0)="+map.keyOf(71.0));
 * // remove one association
 * map.removeKey(3);
 * System.out.println("\nmap="+map);
 * System.out.println(map.containsKey(3));
 * System.out.println("get(3)="+map.get(3));
 * System.out.println(map.containsValue(1000.0));
 * System.out.println("keyOf(1000.0)="+map.keyOf(1000.0));
 * // clear
 * map.clear();
 * System.out.println("\nmap="+map);
 * System.out.println("size="+map.size());
yields the following output:
 * map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0]
 * size=4
 * true
 * get(3)=1000.0
 * false
 * get(4)=0.0
 * true
 * keyOf(71.0)=9
 * map=[0->100.0, 9->71.0, 100000->70.0]
 * false
 * get(3)=0.0
 * false
 * keyOf(1000.0)=-2147483648
 * map=[]
 * size=0
 * 

5. Notes

Note that implementations are not synchronized.

Choosing efficient parameters for hash maps is not always easy. However, since parameters determine efficiency and memory requirements, here is a quick guide on how to choose them. If your use case does not heavily operate on hash maps but uses them just because they provide convenient functionality, you can safely skip this section. For those of you who care, read on.

There are three parameters that can be customized upon map construction: initialCapacity, minLoadFactor and maxLoadFactor. The more memory one can afford, the faster a hash map. The hash map's capacity is the maximum number of associations that can be added without needing to allocate new internal memory. A larger capacity means faster adding, searching and removing. The initialCapacity corresponds to the capacity used upon instance construction.

The loadFactor of a hash map measures the degree of "fullness". It is given by the number of associations (size()) divided by the hash map capacity (0.0 <= loadFactor <= 1.0). The more associations are added, the larger the loadFactor and the more hash map performance degrades. Therefore, when the loadFactor exceeds a customizable threshold (maxLoadFactor), the hash map is automatically grown. In such a way performance degradation can be avoided. Similarly, when the loadFactor falls below a customizable threshold (minLoadFactor), the hash map is automatically shrunk. In such a way excessive memory consumption can be avoided. Automatic resizing (both growing and shrinking) obeys the following invariant:

capacity * minLoadFactor <= size() <= capacity * maxLoadFactor

The term capacity * minLoadFactor is called the low water mark, and capacity * maxLoadFactor is called the high water mark. In other words, the number of associations may vary within the water mark constraints. When it goes out of range, the map is automatically resized and memory consumption changes proportionally.
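The water mark invariant above can be sketched as a small decision function. This is an illustration only: the class WaterMarks, its method names, and the truncating rounding of the water marks are assumptions made for this example, not the package's actual resizing code.

```java
class WaterMarks {
    // Low water mark: below this many associations the map would shrink.
    static int lowWaterMark(int capacity, double minLoadFactor) {
        return (int) (capacity * minLoadFactor);
    }

    // High water mark: above this many associations the map would grow.
    static int highWaterMark(int capacity, double maxLoadFactor) {
        return (int) (capacity * maxLoadFactor);
    }

    // Decides what a resizing map would do for the given size.
    static String action(int size, int capacity, double minLoadFactor, double maxLoadFactor) {
        if (size > highWaterMark(capacity, maxLoadFactor)) {
            return "grow";
        }
        if (size < lowWaterMark(capacity, minLoadFactor)) {
            return "shrink";
        }
        return "keep";
    }

    public static void main(String[] args) {
        int capacity = 1000;
        double minLoadFactor = 0.25;
        double maxLoadFactor = 0.5;
        for (int size : new int[] {100, 300, 600}) {
            System.out.println("size=" + size + " -> "
                    + action(size, capacity, minLoadFactor, maxLoadFactor));
        }
    }
}
```

With capacity 1000, minLoadFactor 0.25 and maxLoadFactor 0.5, the water marks are 250 and 500, so any size from 250 to 500 leaves the capacity unchanged.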

  • To tune for memory at the expense of performance, increase both minLoadFactor and maxLoadFactor.
  • To tune for performance at the expense of memory, decrease both minLoadFactor and maxLoadFactor. As a special case, set minLoadFactor=0 to avoid any automatic shrinking.

Resizing large hash maps can be time consuming, O(size()), and should be avoided if possible (maintaining primes is not the reason). Unnecessary growing operations can be avoided if the number of associations is known before they are added, or can be estimated. In such a case good parameters are as follows:

For chaining:
Set the initialCapacity = 1.4*expectedSize or greater.
Set the maxLoadFactor = 0.8 or greater.

For open addressing:
Set the initialCapacity = 2*expectedSize or greater. Alternatively call ensureCapacity(...).
Set the maxLoadFactor = 0.5.
Never set maxLoadFactor > 0.55; open addressing exponentially slows down beyond that point.

In this way the hash map will never need to grow and still stay fast. It is never a good idea to set maxLoadFactor < 0.1, because the hash map would grow too often. If it is entirely unknown how many associations the application will use, the default constructor should be used. The map will grow and shrink as needed.
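The sizing rules above amount to simple arithmetic, sketched below. The class CapacityPlanner and its methods are hypothetical helpers for this example; the resulting numbers would then be passed as initialCapacity and maxLoadFactor when constructing a map.

```java
class CapacityPlanner {
    // Open addressing: initialCapacity = 2 * expectedSize.
    static int openAddressingCapacity(int expectedSize) {
        return Math.multiplyExact(2, expectedSize); // throws on int overflow instead of wrapping
    }

    // Chaining (historic): initialCapacity = 1.4 * expectedSize, rounded up.
    // Integer arithmetic is used to avoid floating point rounding surprises.
    static int chainingCapacity(int expectedSize) {
        return (int) ((expectedSize * 14L + 9) / 10);
    }

    // True if expectedSize associations fit under the high water mark,
    // that is, the map would not need to grow while they are added.
    static boolean fitsWithoutGrowing(int expectedSize, int capacity, double maxLoadFactor) {
        return expectedSize <= capacity * maxLoadFactor;
    }

    public static void main(String[] args) {
        int expectedSize = 1000;
        int capacity = openAddressingCapacity(expectedSize);
        System.out.println("open addressing capacity = " + capacity);
        System.out.println("fits at maxLoadFactor 0.5 = "
                + fitsWithoutGrowing(expectedSize, capacity, 0.5));
        System.out.println("chaining capacity = " + chainingCapacity(expectedSize));
    }
}
```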

Comparison of chaining and open addressing

Chaining is faster than open addressing, when assuming unconstrained memory consumption. Open addressing is more space efficient than chaining, because it does not create entry objects but uses primitive arrays which are considerably smaller. Entry objects consume significant amounts of memory compared to the information they actually hold. Open addressing also poses no problems to the garbage collector. In contrast, chaining can create millions of entry objects which are linked; a nightmare for any garbage collector. In addition, entry object creation is a bit slow.
Therefore, with the same amount of memory, or even less memory, hash maps with larger capacity can be maintained under open addressing, which yields smaller loadFactors, which in turn keeps performance competitive with chaining. In our benchmarks, using significantly less memory, open addressing usually is not more than 1.2-1.5 times slower than chaining.

Further readings:
Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd ed.
Griswold W., Townsend G., The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon, Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993).
Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988.

Performance:

Time complexity:
The classes offer expected time complexity O(1) (i.e. constant time) for the basic operations put, get, removeKey, containsKey and size, assuming the hash function disperses the elements properly among the buckets. Otherwise, pathological cases, although highly improbable, can occur, degrading performance to O(N) in the worst case. Operations containsValue and keyOf are O(N).

Memory requirements for open addressing:
worst case: memory [bytes] = (1/minLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value)).
best case: memory [bytes] = (1/maxLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value)).
Where sizeOf(int) = 4, sizeOf(double) = 8, sizeOf(Object) = 4, etc. Thus, an OpenIntIntHashMap with minLoadFactor=0.25 and maxLoadFactor=0.5 and 1000000 associations uses between 17 MB and 34 MB. The same map with 1000 associations uses between 17 and 34 KB.
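The two memory formulas above can be checked with a short calculation. The class OpenAddressingMemory and its method are hypothetical; the byte sizes are the ones stated above, and the megabyte figures are assumed to be counted in units of 1024*1024 bytes, which is what reproduces the quoted 17 and 34.

```java
class OpenAddressingMemory {
    // memory [bytes] = (1/loadFactor) * size * (1 + sizeOf(key) + sizeOf(value))
    static long bytes(long size, double loadFactor, int keyBytes, int valueBytes) {
        return Math.round(size / loadFactor * (1 + keyBytes + valueBytes));
    }

    public static void main(String[] args) {
        long size = 1_000_000;
        int intBytes = 4; // sizeOf(int) as stated in the notes above

        long best = bytes(size, 0.5, intBytes, intBytes);   // map is at maxLoadFactor
        long worst = bytes(size, 0.25, intBytes, intBytes); // map is at minLoadFactor

        double mb = 1024.0 * 1024.0;
        System.out.println(String.format("best case:  %d bytes (%.1f MB)", best, best / mb));
        System.out.println(String.format("worst case: %d bytes (%.1f MB)", worst, worst / mb));
    }
}
```

For an int-->int map with 1000000 associations this gives 18000000 bytes in the best case and 36000000 bytes in the worst case.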

 */
package org.apache.mahout.math.map;




