h.jvptree.0.1.source-code.overview.html Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of jvptree Show documentation
A generic vp-tree implementation in Java
There is a newer version: 0.3.0



    
        Jvptree overview
    

    
        jvptree

        Jvptree is a generic vantage-point tree implementation that allows for quick (O(log(n))) searches for the nearest neighbors to a given point. Vantage-point trees are binary space partitioning trees that partition points according to their distance from each node's "vantage point." Points that are closer than a chosen threshold go into one child node, while points that are farther away go into the other. Vantage point trees operate on any metric space.

        Major concepts

        The main thing vantage-point trees do is partitioning points into groups that are closer or farther than a given distance threshold. To do that, a vp-tree needs to be able to figure out how far apart any two points are and also decide what to use as a distance threshold. At a minimum, you'll need to provide a distance function that can calculate the distance between points. You may optionally specify a threshold selection strategy; if you don't, a reasonable default will be used.

        Distance functions

        You must always specify a distance function when creating a vp-tree. Distance functions take two points as arguments and must satisfy the requirements of a metric space, namely:

        
            d(x, y) >= 0
            d(x, y) = 0 if and only if x == y
            d(x, y) == d(y, x)
            d(x, z) <= d(x, y) + d(y, z)
        

        Threshold selection strategies

        You may optionally specify a strategy for choosing a distance threshold for partitioning. By default, jvptree will use sampling median strategy, where it will take the median distance from a small subset of the points to partition. Jvptree also includes a threshold selection strategy that takes the median of all points to be partitioned; this is slower, but may result in a more balanced tree. Most users will not need to specify a threshold selection strategy.

        Node capacity

        Additionally, you may specify a desired capacity for the tree's leaf nodes. It's worth mentioning early that you almost certainly do not need to worry about this; a reasonable default (32 points) will be used, and most users won't realize significant performance gains by tuning it.

        Still, for those in need, you may choose a desired capacity for leaf nodes in a vp-tree. At one extreme, leaf nodes may contain only a single point. This means that searches will have to traverse more nodes, but once a leaf node is reached, fewer points will need to be searched to find nearest neighbors.

        Using a larger node capacity will result in a "flatter" tree, and fewer nodes will need to be traversed when searching, but more nodes will need to be tested once a search reaches a leaf node. Larger node capacities also lead to less memory overhead because there are fewer nodes in the tree.

        As a general rule of thumb, node capacities should be on the same order of magnitude as your typical search result size. The idea is that if a search reaches a leaf node, most of the points in the node will wind up in the collection of nearest neighbors (i.e. they all would have had to been checked anyhow) and few other nodes will have to be visited to gather any remaining neighbors.

        License

        Jvptree is available to the public under the MIT License.