hep.aida.tfloat.bin.QuantileFloatBin1D Maven / Gradle / Ivy
Show all versions of parallelcolt Show documentation
package hep.aida.tfloat.bin;
import cern.colt.list.tfloat.FloatArrayList;
import cern.jet.random.tfloat.engine.FloatRandomEngine;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinder;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinderFactory;
/**
* 1-dimensional non-rebinnable bin holding float elements with
* scalable quantile operations defined upon; Using little main memory, quickly
* computes approximate quantiles over very large data sequences with and even
* without a-priori knowledge of the number of elements to be filled;
* Conceptually a strongly lossily compressed multiset (or bag); Guarantees to
* respect the worst case approximation error specified upon instance
* construction. First see the package
* summary and javadoc tree view to get the
* broad picture.
*
* Motivation and Problem: Intended to help scale applications requiring
* quantile computation. Quantile computation on very large data sequences is
* problematic, for the following reasons: Computing quantiles requires sorting
* the data sequence. To sort a data sequence the entire data sequence needs to
* be available. Thus, data cannot be thrown away during filling (as done by
* static bins like {@link StaticFloatBin1D} and {@link MightyStaticFloatBin1D}
* ). It needs to be kept, either in main memory or on disk. There is often not
* enough main memory available. Thus, during filling data needs to be streamed
* onto disk. Sorting disk resident data is prohibitively time consuming. As a
* consequence, traditional methods either need very large memories (like
* {@link DynamicFloatBin1D}) or time consuming disk based sorting.
*
* This class proposes to efficiently solve the problem, at the expense of
* producing approximate rather than exact results. It can deal with infinitely
* many elements without resorting to disk. The main memory requirements are
* smaller than for any other known approximate technique by an order of
* magnitude. They get even smaller if an upper limit on the maximum number of
* elements ever to be added is known a-priori.
*
* Approximation error: The approximation guarantees are parametrizable
* and explicit but probabilistic, and apply for arbitrary value distributions
* and arrival distributions of the data sequence. In other words, this class
* guarantees to respect the worst case approximation error specified upon
* instance construction to a certain probability. Of course, if it is specified
* that the approximation error should not exceed some number very close
* to zero, this class will potentially consume just as much memory as any of
* the traditional exact techniques would do. However, for errors larger than
* 10-5, its memory requirements are modest, as shown by the table
* below.
*
* Main memory requirements: Given in megabytes, assuming a single
* element (float) takes 8 byte. The number of elements required is
* then MB*1024*1024/8.
*
* Parameters:
*
* - epsilon - the maximum allowed approximation error on quantiles; in
* [0.0,1.0]. To get exact rather than approximate quantiles, set
* epsilon=0.0;
*
*
- delta - the probability allowed that the approximation error fails
* to be smaller than epsilon; in [0.0,1.0]. To avoid probabilistic
* answers, set delta=0.0. For example, delta = 0.0001 is
* equivalent to a confidence of 99.99%.
*
*
- quantiles - the number of quantiles to be computed; in
* [0,Integer.MAX_VALUE].
*
*
- is N known? - specifies whether the exact size of the dataset over
* which quantiles are to be computed is known.
*
*
- Nmax - the exact dataset size, if known. Otherwise, an upper
* limit on the dataset size. If no upper limit is known set to infinity (
* Long.MAX_VALUE).
*
* Nmax=inf - we are sure that exactly (known) or less than
* (unknown) infinity elements will be added.
* Nmax=106 - we are sure that exactly (known) or
* less than (unknown) 106 elements will be added.
* Nmax=107 - we are sure that exactly (known) or
* less than (unknown) 107 elements will be added.
* Nmax=108 - we are sure that exactly (known) or
* less than (unknown) 108 elements will be added.
*
*
*
*
*
*
* Required main memory [MB]
*
*
* #quantiles
* epsilon
* delta
*
* N
* unknown
*
*
* N known
*
*
*
* Nmax=inf
* Nmax=106
* Nmax=107
* Nmax=108
* Nmax=inf
* Nmax=106
* Nmax=107
* Nmax=108
*
*
*
*
*
*
*
* any
* 0
* any
* infinity
*
* 7.6
* 76
* 762
* infinity
*
* 7.6
* 76
* 762
*
*
* any
* 10 -1
* 0
* infinity
* 0.003
* 0.005
* 0.006
* 0.03
* 0.003
* 0.005
* 0.006
*
*
* 10 -2
* 0.02
* 0.03
* 0.05
* 0.31
* 0.02
* 0.03
* 0.05
*
*
* 10 -3
* 0.12
* 0.2
* 0.3
* 2.7
* 0.12
* 0.2
* 0.3
*
*
* 10 -4
* 0.6
* 1.2
* 2.1
* 26.9
* 0.6
* 1.2
* 2.1
*
*
* 10 -5
* 2.5
* 6.4
* 11.6
* 205
* 2.5
* 6.4
* 11.6
*
*
* 10 -6
* 7.6
* 25.4
* 63.6
* 1758
* 7.6
* 25.4
* 63.6
*
*
*
*
*
*
*
* 100
*
* 10 -2
* 10
* -1
* 0.033
* 0.021
* 0.03
* 0.03
* 0.020
* 0.020
* 0.020
* 0.020
*
*
* 10
* -5
* 0.038
* 0.021
* 0.03
* 0.04
* 0.024
* 0.020
* 0.020
* 0.020
*
*
*
* 10 -3
* 10 -1
* 0.48
* 0.12
* 0.2
* 0.3
* 0.32
* 0.12
* 0.2
* 0.3
*
*
* 10 -5
* 0.54
* 0.12
* 0.2
* 0.3
* 0.37
* 0.12
* 0.2
* 0.3
*
*
*
* 10 -4
* 10 -1
* 6.6
* 0.6
* 1.2
* 2.1
* 4.6
* 0.6
* 1.2
* 2.1
*
*
* 10 -5
* 7.2
* 0.6
* 1.2
* 2.1
* 5.2
* 0.6
* 1.2
* 2.1
*
*
*
* 10 -5
* 10 -1
* 86
* 2.5
* 6.4
* 11.6
* 63
* 2.5
* 6.4
* 11.6
*
*
* 10 -5
* 94
* 2.5
* 6.4
* 11.6
* 70
* 2.5
* 6.4
* 11.6
*
*
*
*
*
*
*
*
* 10000
*
* 10 -2
* 10 -1
* 0.04
* 0.02
* 0.03
* 0.04
* 0.02
* 0.02
* 0.02
* 0.02
*
*
* 10 -5
* 0.04
* 0.02
* 0.03
* 0.04
* 0.03
* 0.02
* 0.03
* 0.03
*
*
*
* 10 -3
* 10 -1
* 0.52
* 0.12
* 0.21
* 0.3
* 0.35
* 0.12
* 0.21
* 0.3
*
*
* 10 -5
* 0.56
* 0.12
* 0.21
* 0.3
* 0.38
* 0.12
* 0.21
* 0.3
*
*
*
* 10 -4
* 10 -1
* 7.0
* 0.64
* 1.2
* 2.1
* 5.0
* 0.64
* 1.2
* 2.1
*
*
* 10 -5
* 7.5
* 0.64
* 1.2
* 2.1
* 5.4
* 0.64
* 1.2
* 2.1
*
*
*
* 10 -5
* 10 -1
* 90
* 2.5
* 6.4
* 11.6
* 67
* 2.5
* 6.4
* 11.6
*
*
* 10 -5
* 96
* 2.5
* 6.4
* 11.6
* 71
* 2.5
* 6.4
* 11.6
*
*
*
*
*
*
*
*
* #quantiles
*
* epsilon
*
* delta
* Nmax=inf
* Nmax=106
* Nmax=107
* Nmax=108
* Nmax=inf
* Nmax=106
* Nmax=107
* Nmax=108
*
*
* N
* unknown
* N known
*
*
*
*
* Required main memory [MB]
*
*
*
*
*
* Implementation:
*
* After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Random
* Sampling Techniques for Space Efficient Online Computation of Order
* Statistics of Large Datasets. Proc. of the 1999 ACM SIGMOD Int. Conf. on
* Management of Data, Paper available
* here.
*
* and
*
* Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate
* Medians and other Quantiles in One Pass and with Limited Memory, Proc. of the
* 1998 ACM SIGMOD Int. Conf. on Management of Data, Paper available
* here.
*
* The broad picture is as follows. Two concepts are used: Shrinking and
* Sampling. Shrinking takes a data sequence, sorts it and produces a
* shrinked data sequence by picking every k-th element and throwing away all
* the rest. The shrinked data sequence is an approximation to the original data
* sequence.
*
* Imagine a large data sequence (residing on disk or being generated in memory
* on the fly) and a main memory block of n=b*k elements (
* b is the number of buffers, k is the number of elements per
* buffer). Fill elements from the data sequence into the block until it is full
* or the data sequence is exhausted. When the block (or a subset of buffers) is
* full and the data sequence is not exhausted, apply shrinking to lossily
* compress a number of buffers into one single buffer. Repeat these steps until
* all elements of the data sequence have been consumed. Now the block is a
* shrinked approximation of the original data sequence. Treating it as if it
* would be the original data sequence, we can determine quantiles in main
* memory.
*
* Now, the whole thing boils down to the question of: Can we choose b
* and k (the number of buffers and the buffer size) such that
* b*k is minimized, yet quantiles determined upon the block are
* guaranteed to be away from the true quantiles no more than some
* epsilon? It turns out, we can. It also turns out that the required
* main memory block size n=b*k is usually moderate (see the table
* above).
*
* The theme can be combined with random sampling to further reduce main memory
* requirements, at the expense of probabilistic guarantees. Sampling filters
* the data sequence and feeds only selected elements to the algorithm outlined
* above. Sampling is turned on or off, depending on the parametrization.
*
* This quick overview does not go into important details, such as assigning
* proper weights to buffers, how to choose subsets of buffers to shrink,
* etc. For more information consult the papers cited above.
*
*
* Time Performance:
*
*
Pentium Pro 200 Mhz, SunJDK 1.2.2, NT, java -classic,
* filling 10 4 elements at a time, reading 100 percentiles at a
* time,
* hasSumOfLogarithms()=false, hasSumOfInversions()=false,
* getMaxOrderForSumOfPowers()=2
*
*
*
* Performance
*
*
*
* Quantiles
*
* Epsilon
*
* Delta
*
* Filling
* [#elements/sec]
*
* Quantile computation
* [#quantiles/sec]
*
*
* N unknown,
* Nmax=inf
* N known,
* Nmax=107
* N unknown,
* Nmax=inf
* N known,
* Nmax=107
*
*
*
*
*
*
*
*
* 104
* 10 -1
* 10
* -1
*
*
* 1600000
*
*
* 1300000
* 250000
* 130000
*
*
* 10 -2
* 360000
* 1200000
* 50000
* 20000
*
*
* 10 -3
* 150000
* 200000
* 3600
* 3000
*
*
* 10 -4
* 120000
* 170000
* 80
* 1000
*
*
*
*
* @see cern.jet.stat.tfloat.quantile
* @author [email protected]
* @version 0.9, 03-Jul-99
*/
public class QuantileFloatBin1D extends MightyStaticFloatBin1D {
/**
*
*/
private static final long serialVersionUID = 1L;
protected FloatQuantileFinder finder = null;
/**
* Not public; for use by subclasses only! Constructs and returns an empty
* bin.
*/
protected QuantileFloatBin1D() {
super(false, false, 2);
}
/**
* Equivalent to
* new QuantileBin1D(false, Long.MAX_VALUE, epsilon, 0.001, 10000, new cern.jet.random.engine.DRand(new java.util.Date())
* .
*/
public QuantileFloatBin1D(float epsilon) {
this(false, Long.MAX_VALUE, epsilon, 0.001f, 10000, new cern.jet.random.tfloat.engine.FRand(
new java.util.Date()));
}
/**
* Equivalent to
* new QuantileBin1D(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2)
* .
*/
public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles,
FloatRandomEngine randomGenerator) {
this(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2);
}
/**
* Constructs and returns an empty bin that, under the given constraints,
* minimizes the amount of memory needed.
*
* Some applications exactly know in advance over how many elements
* quantiles are to be computed. Provided with such information the main
* memory requirements of this class are small. Other applications don't
* know in advance over how many elements quantiles are to be computed.
* However, some of them can give an upper limit, which will reduce main
* memory requirements. For example, if elements are selected from a
* database and filled into histograms, it is usually not known in advance
* how many elements are being filled, but one may know that at most
* S elements, the number of elements in the database, are filled.
* A third type of application knowns nothing at all about the number of
* elements to be filled; from zero to infinitely many elements may actually
* be filled. This method efficiently supports all three types of
* applications.
*
* @param known_N
* specifies whether the number of elements over which quantiles
* are to be computed is known or not.
*
* @param N
* if known_N==true, the number of elements over which
* quantiles are to be computed. if known_N==false, the
* upper limit on the number of elements over which quantiles are
* to be computed. In other words, the maximum number of elements
* ever to be added. If such an upper limit is a-priori unknown,
* then set N = Long.MAX_VALUE.
*
* @param epsilon
* the approximation error which is guaranteed not to be exceeded
* (e.g. 0.001) (0 <= epsilon <= 1). To
* get exact rather than approximate quantiles, set
* epsilon=0.0;
*
* @param delta
* the allowed probability that the actual approximation error
* exceeds epsilon (e.g. 0.0001) (0 <= delta <=
* 1). To avoid probabilistic answers, set delta=0.0.
* For example, delta = 0.0001 is equivalent to a
* confidence of 99.99%.
*
* @param quantiles
* the number of quantiles to be computed (e.g. 100) (
* quantiles >= 1). If unknown in advance, set this
* number large, e.g. quantiles >= 10000.
*
* @param randomGenerator
* a uniform random number generator. Set this parameter to
* null to use a default generator seeded with the
* current time.
*
* The next three parameters specify additional capabilities
* unrelated to quantile computation. They are identical to the
* one's defined in the constructor of the parent class
* {@link MightyStaticFloatBin1D}.
*
* @param hasSumOfLogarithms
* Tells whether {@link #sumOfLogarithms()} can return meaningful
* results. Set this parameter to false if measures of
* sum of logarithms, geometric mean and product are not
* required.
*
* @param hasSumOfInversions
* Tells whether {@link #sumOfInversions()} can return meaningful
* results. Set this parameter to false if measures of
* sum of inversions, harmonic mean and sumOfPowers(-1) are not
* required.
*
* @param maxOrderForSumOfPowers
* The maximum order k for which
* {@link #sumOfPowers(int)} can return meaningful results. Set
* this parameter to at least 3 if the skew is required, to at
* least 4 if the kurtosis is required. In general, if moments
* are required set this parameter at least as large as the
* largest required moment. This method always substitutes
* Math.max(2,maxOrderForSumOfPowers) for the parameter
* passed in. Thus, sumOfPowers(0..2) always returns
* meaningful results.
*/
public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles,
FloatRandomEngine randomGenerator, boolean hasSumOfLogarithms, boolean hasSumOfInversions,
int maxOrderForSumOfPowers) {
super(hasSumOfLogarithms, hasSumOfInversions, maxOrderForSumOfPowers);
this.finder = FloatQuantileFinderFactory.newFloatQuantileFinder(known_N, N, epsilon, delta, quantiles,
randomGenerator);
this.clear();
}
/**
* Adds the part of the specified list between indexes from
* (inclusive) and to (inclusive) to the receiver.
*
* @param list
* the list of which elements shall be added.
* @param from
* the index of the first element to be added (inclusive).
* @param to
* the index of the last element to be added (inclusive).
* @throws IndexOutOfBoundsException
* if
* list.size()>0 && (from<0 || from>to || to>=list.size())
* .
*/
public synchronized void addAllOfFromTo(FloatArrayList list, int from, int to) {
super.addAllOfFromTo(list, from, to);
if (this.finder != null)
this.finder.addAllOfFromTo(list, from, to);
}
/**
* Removes all elements from the receiver. The receiver will be empty after
* this call returns.
*/
public synchronized void clear() {
super.clear();
if (this.finder != null)
this.finder.clear();
}
/**
* Returns a deep copy of the receiver.
*
* @return a deep copy of the receiver.
*/
public synchronized Object clone() {
QuantileFloatBin1D clone = (QuantileFloatBin1D) super.clone();
if (this.finder != null)
clone.finder = (FloatQuantileFinder) clone.finder.clone();
return clone;
}
/**
* Computes the deviations from the receiver's measures to another bin's
* measures.
*
* @param other
* the other bin to compare with
* @return a summary of the deviations.
*/
public String compareWith(AbstractFloatBin1D other) {
StringBuffer buf = new StringBuffer(super.compareWith(other));
if (other instanceof QuantileFloatBin1D) {
QuantileFloatBin1D q = (QuantileFloatBin1D) other;
buf.append("25%, 50% and 75% Quantiles: " + relError(quantile(0.25f), q.quantile(0.25f)) + ", "
+ relError(quantile(0.5f), q.quantile(0.5f)) + ", " + relError(quantile(0.75f), q.quantile(0.75f)));
buf.append("\nquantileInverse(mean): " + relError(quantileInverse(mean()), q.quantileInverse(q.mean()))
+ " %");
buf.append("\n");
}
return buf.toString();
}
/**
* Returns the median.
*/
public float median() {
return quantile(0.5f);
}
/**
* Computes and returns the phi-quantile.
*
* @param phi
* the percentage for which the quantile is to be computed. phi
* must be in the interval (0.0,1.0].
* @return the phi quantile element.
*/
public synchronized float quantile(float phi) {
return quantiles(new FloatArrayList(new float[] { phi })).get(0);
}
/**
* Returns how many percent of the elements contained in the receiver are
* <= element. Does linear interpolation if the element is not
* contained but lies in between two contained elements.
*
* @param element
* the element to search for.
* @return the percentage phi of elements <= element (
* 0.0 <= phi <=1.0).
*/
public synchronized float quantileInverse(float element) {
return finder.phi(element);
}
/**
* Returns the quantiles of the specified percentages. For implementation
* reasons considerably more efficient than calling {@link #quantile(float)}
* various times.
*
* @param phis
* the percentages for which quantiles are to be computed. Each
* percentage must be in the interval (0.0,1.0].
* percentages must be sorted ascending.
* @return the quantiles.
*/
public synchronized FloatArrayList quantiles(cern.colt.list.tfloat.FloatArrayList phis) {
return finder.quantileElements(phis);
}
/**
* Returns how many elements are contained in the range
* [minElement,maxElement]. Does linear interpolation if one or
* both of the parameter elements are not contained. Returns exact or
* approximate results, depending on the parametrization of this class or
* subclasses.
*
* @param minElement
* the minimum element to search for.
* @param maxElement
* the maximum element to search for.
* @return the number of elements in the range.
*/
public int sizeOfRange(float minElement, float maxElement) {
return Math.round(size() * (quantileInverse(maxElement) - quantileInverse(minElement)));
}
/**
* Divides (rebins) a copy of the receiver at the given percentage
* boundaries into bins and returns these bins, such that each bin
* approximately reflects the data elements of its range.
*
* The receiver is not physically rebinned (divided); it stays unaffected by
* this operation. The returned bins are such that if one would have
* filled elements into multiple bins instead of one single all encompassing
* bin only, those multiple bins would have approximately the same
* statistics measures as the one's returned by this method.
*
* The split(...) methods are particularly well suited for
* real-time interactive rebinning (the famous "scrolling slider" effect).
*
* Passing equi-distant percentages like
* (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) into this method will yield bins
* of an equi-depth histogram, i.e. a histogram with bin boundaries
* adjusted such that each bin contains the same number of elements, in this
* case 20% each. Equi-depth histograms can be useful if, for example, not
* enough properties of the data to be captured are known a-priori to be
* able to define reasonable bin boundaries (partitions). For example, when
* guesses about minimas and maximas are strongly unreliable. Or when
* chances are that by focussing too much on one particular area other
* important areas and characters of a data set may be missed.
*
* Implementation:
*
* The receiver is divided into s = percentages.size()-1 intervals
* (bins). For each interval I, its minimum and maximum elements
* are determined based upon quantile computation. Further, each interval
* I is split into k equi-percent-distant subintervals
* (sub-bins). In other words, an interval is split into subintervals such
* that each subinterval contains the same number of elements.
*
* For each subinterval S, its minimum and maximum are determined,
* again, based upon quantile computation. They yield an approximate
* arithmetic mean am = (min+max)/2 of the subinterval. A
* subinterval is treated as if it would contain only elements equal to the
* mean am. Thus, if the subinterval contains, say, n
* elements, it is assumed to consist of n mean elements
* (am,am,...,am). A subinterval's sum of elements, sum of squared
* elements, sum of inversions, etc. are then approximated using such a
* sequence of mean elements.
*
* Finally, the statistics measures of an interval I are computed
* by summing up (integrating) the measures of its subintervals.
*
* Accuracy:
*
* Depending on the accuracy of quantile computation and the number of
* subintervals per interval (the resolution). Objects of this class compute
* exact or approximate quantiles, depending on the parameters used upon
* instance construction. Objects of subclasses may always compute
* exact quantiles, as is the case for {@link DynamicFloatBin1D}. Most
* importantly for this class QuantileBin1D, a reasonably small
* epsilon (e.g. 0.01, perhaps 0.001) should be used upon instance
* construction. The confidence parameter delta is less important,
* you may find delta=0.00001 appropriate.
* The larger the resolution, the smaller the approximation error, up to
* some limit. Integrating over only a few subintervals per interval will
* yield very crude approximations. If the resolution is set to a reasonably
* large number, say 10..100, more small subintervals are integrated,
* resulting in more accurate results.
* Note that for good accuracy, the number of quantiles computable with the
* given approximation guarantees should upon instance construction be
* specified, so as to satisfy
*
* quantiles > resolution * (percentages.size()-1)
*
*
* Example:
*
* resolution=2, percentList = (0.0, 0.1, 0.2, 0.5, 0.9, 1.0) means
* the receiver is to be split into 5 bins:
*
* - bin 0 ranges from [0%..10%) and holds the smallest 10% of
* the sorted elements.
*
- bin 1 ranges from [10%..20%) and holds the next smallest 10%
* of the sorted elements.
*
- bin 2 ranges from [20%..50%) and holds the next smallest 30%
* of the sorted elements.
*
- bin 3 ranges from [50%..90%) and holds the next smallest 40%
* of the sorted elements.
*
- bin 4 ranges from [90%..100%) and holds the largest 10% of
* the sorted elements.
*
*
* The statistics measures for each bin are to be computed at a resolution
* of 2 subbins per bin. Thus, the statistics measures of a bin are the
* integrated measures over 2 subbins, each containing the same amount of
* elements:
*
* - bin 0 has a subbin ranging from [ 0%.. 5%) and a subbin
* ranging from [ 5%..10%).
*
- bin 1 has a subbin ranging from [10%..15%) and a subbin
* ranging from [15%..20%).
*
- bin 2 has a subbin ranging from [20%..35%) and a subbin
* ranging from [35%..50%).
*
- bin 3 has a subbin ranging from [50%..70%) and a subbin
* ranging from [70%..90%).
*
- bin 4 has a subbin ranging from [90%..95%) and a subbin
* ranging from [95%..100%).
*
*
* Lets concentrate on the subbins of bin 0.
*
* - Assume the subbin A=[0%..5%) has a minimum of 300
* and a maximum of 350 (0% of all elements are less than 300, 5%
* of all elements are less than 350).
*
- Assume the subbin B=[5%..10%) has a minimum of 350
* and a maximum of 550 (5% of all elements are less than 350, 10%
* of all elements are less than 550).
*
*
* Assume the entire data set consists of N=100 elements.
*
* - Then subbin A has an approximate mean of 300+350 / 2 = 325,
* a size of N*(5%-0%) = 100*5% = 5 elements, an approximate sum of
* 325 * 100*5% = 1625, an approximate sum of squares of
* 3252 * 100*5% = 528125, an approximate sum of
* inversions of (1.0/325) * 100*5% = 0.015, etc.
*
- Analogously, subbin B has an approximate mean of
* 350+550 / 2 = 450, a size of N*(10%-5%) = 100*5% = 5
* elements, an approximate sum of 450 * 100*5% = 2250, an
* approximate sum of squares of 4502 * 100*5% = 1012500
* , an approximate sum of inversions of (1.0/450) * 100*5% = 0.01,
* etc.
*
*
* Finally, the statistics measures of bin 0 are computed by summing up
* (integrating) the measures of its subintervals: Bin 0 has a size of
* N*(10%-0%)=10 elements (we knew that already), sum of
* 1625+2250=3875, sum of squares of
* 528125+1012500=1540625, sum of inversions of
* 0.015+0.01=0.025, etc. From these follow other measures such as
* mean=3875/10=387.5, rms = sqrt(1540625 / 10)=392.5, etc. The
* other bins are computes analogously.
*
* @param percentages
* the percentage boundaries at which the receiver shall be
* split.
* @param k
* the desired number of subintervals per interval.
*/
public synchronized MightyStaticFloatBin1D[] splitApproximately(FloatArrayList percentages, int k) {
/*
* percentages = [p0, p1, p2, ..., p(size-2), p(size-1)] defines bins
* [p0,p1), [p1,p2), ..., [p(size-2),p(size-1)) each bin is divided into
* k equi-percent-distant sub bins (subintervals). e.g. k = 2 means
* "compute" with a resolution (accuracy) of 2 subbins (subintervals)
* per bin,
*
* percentages = [0.1, 0.2, 0.3, ..., 0.9, 1.0] means bin 0 holds the
* first 0.1-0.0=10% of the sorted elements, bin 1 holds the next
* 0.2-0.1=10% of the sorted elements, ...
*
* bins = [0.1, 0.2), [0.2, 0.3), ..., [0.9, 1.0) subBins = [0.1, 0.15,
* 0.2, 0.25, 0.3, ....]
*
* [0.1, 0.15) [0.15, 0.2) [0.3, 0.35) [0.35, 0.4)
*
* [0.2, 0.25) [0.25, 0.3)
*
*/
int percentSize = percentages.size();
if (k < 1 || percentSize < 2)
throw new IllegalArgumentException();
float[] percent = percentages.elements();
int noOfBins = percentSize - 1;
// construct subintervals
float[] subBins = new float[1 + k * (percentSize - 1)];
subBins[0] = percent[0];
int c = 1;
for (int i = 0; i < noOfBins; i++) {
float step = (percent[i + 1] - percent[i]) / k;
for (int j = 1; j <= k; j++) {
subBins[c++] = percent[i] + j * step;
}
}
// compute quantile elements;
float[] quantiles = quantiles(new FloatArrayList(subBins)).elements();
// collect summary statistics for each bin.
// one bin's statistics are the integrated statistics of its
// subintervals.
MightyStaticFloatBin1D[] splitBins = new MightyStaticFloatBin1D[noOfBins];
int maxOrderForSumOfPowers = getMaxOrderForSumOfPowers();
maxOrderForSumOfPowers = Math.min(10, maxOrderForSumOfPowers); // don't
// compute
// tons
// of
// measures
int dataSize = this.size();
c = 0;
for (int i = 0; i < noOfBins; i++) { // for each bin
float step = (percent[i + 1] - percent[i]) / k;
float binSum = 0;
float binSumOfSquares = 0;
float binSumOfLogarithms = 0;
float binSumOfInversions = 0;
float[] binSumOfPowers = null;
if (maxOrderForSumOfPowers > 2) {
binSumOfPowers = new float[maxOrderForSumOfPowers - 2];
}
float binMin = quantiles[c++];
float safe_min = binMin;
float subIntervalSize = dataSize * step;
for (int j = 1; j <= k; j++) { // integrate all subintervals
float binMax = quantiles[c++];
float binMean = (binMin + binMax) / 2;
binSum += binMean * subIntervalSize;
binSumOfSquares += binMean * binMean * subIntervalSize;
if (this.hasSumOfLogarithms) {
binSumOfLogarithms += (Math.log(binMean)) * subIntervalSize;
}
if (this.hasSumOfInversions) {
binSumOfInversions += (1 / binMean) * subIntervalSize;
}
if (maxOrderForSumOfPowers >= 3)
binSumOfPowers[0] += binMean * binMean * binMean * subIntervalSize;
if (maxOrderForSumOfPowers >= 4)
binSumOfPowers[1] += binMean * binMean * binMean * binMean * subIntervalSize;
for (int p = 5; p <= maxOrderForSumOfPowers; p++) {
binSumOfPowers[p - 3] += Math.pow(binMean, p) * subIntervalSize;
}
binMin = binMax;
}
c--;
// example: bin(0) contains (0.2-0.1) == 10% of all elements
int binSize = Math.round((percent[i + 1] - percent[i]) * dataSize);
float binMax = binMin;
binMin = safe_min;
// fill statistics
splitBins[i] = new MightyStaticFloatBin1D(this.hasSumOfLogarithms, this.hasSumOfInversions,
maxOrderForSumOfPowers);
if (binSize > 0) {
splitBins[i].size = binSize;
splitBins[i].min = binMin;
splitBins[i].max = binMax;
splitBins[i].sum = binSum;
splitBins[i].sum_xx = binSumOfSquares;
splitBins[i].sumOfLogarithms = binSumOfLogarithms;
splitBins[i].sumOfInversions = binSumOfInversions;
splitBins[i].sumOfPowers = binSumOfPowers;
}
/*
* float binMean = binSum / binSize;
* System.out.println("size="+binSize);
* System.out.println("min="+binMin);
* System.out.println("max="+binMax);
* System.out.println("mean="+binMean);
* System.out.println("sum_x="+binSum);
* System.out.println("sum_xx="+binSumOfSquares);
* System.out.println("rms="+Math.sqrt(binSumOfSquares / binSize));
* System.out.println();
*/
}
return splitBins;
}
/**
* Divides (rebins) a copy of the receiver at the given interval
* boundaries into bins and returns these bins, such that each bin
* approximately reflects the data elements of its range.
*
* For each interval boundary of the axis (including -infinity and
* +infinity), computes the percentage (quantile inverse) of elements less
* than the boundary. Then lets
* {@link #splitApproximately(FloatArrayList,int)} do the real work.
*
* @param axis
* an axis defining interval boundaries.
* @param k
* the desired number of subintervals per interval.
*/
public synchronized MightyStaticFloatBin1D[] splitApproximately(hep.aida.tfloat.FloatIAxis axis, int k) {
FloatArrayList percentages = new FloatArrayList(new hep.aida.tfloat.ref.FloatConverter().edges(axis));
percentages.beforeInsert(0, Float.NEGATIVE_INFINITY);
percentages.add(Float.POSITIVE_INFINITY);
for (int i = percentages.size(); --i >= 0;) {
percentages.set(i, quantileInverse(percentages.get(i)));
}
return splitApproximately(percentages, k);
}
/**
* Returns a String representation of the receiver.
*/
public synchronized String toString() {
StringBuffer buf = new StringBuffer(super.toString());
buf.append("25%, 50%, 75% Quantiles: " + quantile(0.25f) + ", " + quantile(0.5f) + ", " + quantile(0.75f));
// buf.append("10%, 25%, 50%, 75%, 90% Quantiles: "+quantile(0.1) + ",
// "+ quantile(0.25) + ", "+ quantile(0.5) + ", " + quantile(0.75) + ",
// " + quantile(0.9));
buf.append("\nquantileInverse(median): " + quantileInverse(median()));
buf.append("\n");
return buf.toString();
}
}