All Downloads are FREE. Search and download functionalities are using the official Maven repository.

hep.aida.tfloat.bin.QuantileFloatBin1D Maven / Gradle / Ivy

Go to download

Parallel Colt is a multithreaded version of Colt - a library for high performance scientific computing in Java. It contains efficient algorithms for data analysis, linear algebra, multi-dimensional arrays, Fourier transforms, statistics and histogramming.

The newest version!
package hep.aida.tfloat.bin;

import cern.colt.list.tfloat.FloatArrayList;
import cern.jet.random.tfloat.engine.FloatRandomEngine;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinder;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinderFactory;

/**
 * 1-dimensional non-rebinnable bin holding float elements with
 * scalable quantile operations defined upon; Using little main memory, quickly
 * computes approximate quantiles over very large data sequences with and even
 * without a-priori knowledge of the number of elements to be filled;
 * Conceptually a strongly lossily compressed multiset (or bag); Guarantees to
 * respect the worst case approximation error specified upon instance
 * construction. First see the package
 * summary and javadoc tree view to get the
 * broad picture.
 * 

* Motivation and Problem: Intended to help scale applications requiring * quantile computation. Quantile computation on very large data sequences is * problematic, for the following reasons: Computing quantiles requires sorting * the data sequence. To sort a data sequence the entire data sequence needs to * be available. Thus, data cannot be thrown away during filling (as done by * static bins like {@link StaticFloatBin1D} and {@link MightyStaticFloatBin1D} * ). It needs to be kept, either in main memory or on disk. There is often not * enough main memory available. Thus, during filling data needs to be streamed * onto disk. Sorting disk resident data is prohibitively time consuming. As a * consequence, traditional methods either need very large memories (like * {@link DynamicFloatBin1D}) or time consuming disk based sorting. *

* This class proposes to efficiently solve the problem, at the expense of * producing approximate rather than exact results. It can deal with infinitely * many elements without resorting to disk. The main memory requirements are * smaller than for any other known approximate technique by an order of * magnitude. They get even smaller if an upper limit on the maximum number of * elements ever to be added is known a-priori. *

* Approximation error: The approximation guarantees are parametrizable * and explicit but probabilistic, and apply for arbitrary value distributions * and arrival distributions of the data sequence. In other words, this class * guarantees to respect the worst case approximation error specified upon * instance construction to a certain probability. Of course, if it is specified * that the approximation error should not exceed some number very close * to zero, this class will potentially consume just as much memory as any of * the traditional exact techniques would do. However, for errors larger than * 10-5, its memory requirements are modest, as shown by the table * below. *

* Main memory requirements: Given in megabytes, assuming a single * element (float) takes 8 byte. The number of elements required is * then MB*1024*1024/8. *

* Parameters: *

    *
  • epsilon - the maximum allowed approximation error on quantiles; in * [0.0,1.0]. To get exact rather than approximate quantiles, set * epsilon=0.0; * *
  • delta - the probability allowed that the approximation error fails * to be smaller than epsilon; in [0.0,1.0]. To avoid probabilistic * answers, set delta=0.0. For example, delta = 0.0001 is * equivalent to a confidence of 99.99%. * *
  • quantiles - the number of quantiles to be computed; in * [0,Integer.MAX_VALUE]. * *
  • is N known? - specifies whether the exact size of the dataset over * which quantiles are to be computed is known. * *
  • Nmax - the exact dataset size, if known. Otherwise, an upper * limit on the dataset size. If no upper limit is known set to infinity ( * Long.MAX_VALUE). *
* Nmax=inf - we are sure that exactly (known) or less than * (unknown) infinity elements will be added.
* Nmax=106 - we are sure that exactly (known) or * less than (unknown) 106 elements will be added.
* Nmax=107 - we are sure that exactly (known) or * less than (unknown) 107 elements will be added.
* Nmax=108 - we are sure that exactly (known) or * less than (unknown) 108 elements will be added. *

* * * *

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Required main memory [MB]
#quantiles
epsilon
delta N * unknown * N known *
Nmax=infNmax=106Nmax=107Nmax=108Nmax=infNmax=106Nmax=107Nmax=108
  
any
0
anyinfinity * 7.676762infinity * 7.676762
any
10 -1
0infinity0.0030.0050.0060.030.0030.0050.006
10 -2
0.020.030.050.310.020.030.05
10 -3
0.120.20.32.70.120.20.3
10 -4
0.61.22.126.90.61.22.1
10 -5
2.56.411.62052.56.411.6
10 -6
7.625.463.617587.625.463.6
  
100 *
10 -2
10 * -10.0330.0210.030.030.0200.0200.0200.020
10 * -50.0380.0210.030.040.0240.0200.0200.020
*
10 -3
10 -10.480.120.20.30.320.120.20.3
10 -50.540.120.20.30.370.120.20.3
*
10 -4
10 -16.60.61.22.14.60.61.22.1
10 -57.20.61.22.15.20.61.22.1
*
10 -5
10 -1862.56.411.6632.56.411.6
10 -5942.56.411.6702.56.411.6
  
* 10000 *
10 -2
10 -10.040.020.030.040.020.020.020.02
10 -50.040.020.030.040.030.020.030.03
*
10 -3
10 -10.520.120.210.30.350.120.210.3
10 -50.560.120.210.30.380.120.210.3
*
10 -4
10 -17.00.641.22.15.00.641.22.1
10 -57.50.641.22.15.40.641.22.1
*
10 -5
10 -1902.56.411.6672.56.411.6
10 -5962.56.411.6712.56.411.6
   
* #quantiles * epsilon * deltaNmax=infNmax=106Nmax=107Nmax=108Nmax=infNmax=106Nmax=107Nmax=108
N * unknownN known *
* Required main memory [MB]
* * *

* Implementation: *

* After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Random * Sampling Techniques for Space Efficient Online Computation of Order * Statistics of Large Datasets. Proc. of the 1999 ACM SIGMOD Int. Conf. on * Management of Data, Paper available * here. *

* and *

* Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate * Medians and other Quantiles in One Pass and with Limited Memory, Proc. of the * 1998 ACM SIGMOD Int. Conf. on Management of Data, Paper available * here. *

* The broad picture is as follows. Two concepts are used: Shrinking and * Sampling. Shrinking takes a data sequence, sorts it and produces a * shrinked data sequence by picking every k-th element and throwing away all * the rest. The shrinked data sequence is an approximation to the original data * sequence. *

* Imagine a large data sequence (residing on disk or being generated in memory * on the fly) and a main memory block of n=b*k elements ( * b is the number of buffers, k is the number of elements per * buffer). Fill elements from the data sequence into the block until it is full * or the data sequence is exhausted. When the block (or a subset of buffers) is * full and the data sequence is not exhausted, apply shrinking to lossily * compress a number of buffers into one single buffer. Repeat these steps until * all elements of the data sequence have been consumed. Now the block is a * shrinked approximation of the original data sequence. Treating it as if it * would be the original data sequence, we can determine quantiles in main * memory. *

* Now, the whole thing boils down to the question of: Can we choose b * and k (the number of buffers and the buffer size) such that * b*k is minimized, yet quantiles determined upon the block are * guaranteed to be away from the true quantiles no more than some * epsilon? It turns out, we can. It also turns out that the required * main memory block size n=b*k is usually moderate (see the table * above). *

* The theme can be combined with random sampling to further reduce main memory * requirements, at the expense of probabilistic guarantees. Sampling filters * the data sequence and feeds only selected elements to the algorithm outlined * above. Sampling is turned on or off, depending on the parametrization. *

* This quick overview does not go into important details, such as assigning * proper weights to buffers, how to choose subsets of buffers to shrink, * etc. For more information consult the papers cited above. * *

* Time Performance: *

*

Pentium Pro 200 Mhz, SunJDK 1.2.2, NT, java -classic,
* filling 10 4 elements at a time, reading 100 percentiles at a * time,
* hasSumOfLogarithms()=false, hasSumOfInversions()=false, * getMaxOrderForSumOfPowers()=2
*
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Performance
* Quantiles * Epsilon * Delta Filling
* [#elements/sec]
 Quantile computation
* [#quantiles/sec]
N unknown,
* Nmax=inf
N known,
* Nmax=107
N unknown,
* Nmax=inf
N known,
* Nmax=107
   
* 10410 -110 * -1 *

* 1600000 *

*
1300000250000130000
10 -236000012000005000020000
10 -315000020000036003000
10 -4120000170000801000
*
* * @see cern.jet.stat.tfloat.quantile * @author [email protected] * @version 0.9, 03-Jul-99 */ public class QuantileFloatBin1D extends MightyStaticFloatBin1D { /** * */ private static final long serialVersionUID = 1L; protected FloatQuantileFinder finder = null; /** * Not public; for use by subclasses only! Constructs and returns an empty * bin. */ protected QuantileFloatBin1D() { super(false, false, 2); } /** * Equivalent to * new QuantileBin1D(false, Long.MAX_VALUE, epsilon, 0.001, 10000, new cern.jet.random.engine.DRand(new java.util.Date()) * . */ public QuantileFloatBin1D(float epsilon) { this(false, Long.MAX_VALUE, epsilon, 0.001f, 10000, new cern.jet.random.tfloat.engine.FRand( new java.util.Date())); } /** * Equivalent to * new QuantileBin1D(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2) * . */ public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles, FloatRandomEngine randomGenerator) { this(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2); } /** * Constructs and returns an empty bin that, under the given constraints, * minimizes the amount of memory needed. * * Some applications exactly know in advance over how many elements * quantiles are to be computed. Provided with such information the main * memory requirements of this class are small. Other applications don't * know in advance over how many elements quantiles are to be computed. * However, some of them can give an upper limit, which will reduce main * memory requirements. For example, if elements are selected from a * database and filled into histograms, it is usually not known in advance * how many elements are being filled, but one may know that at most * S elements, the number of elements in the database, are filled. * A third type of application knowns nothing at all about the number of * elements to be filled; from zero to infinitely many elements may actually * be filled. This method efficiently supports all three types of * applications. * * @param known_N * specifies whether the number of elements over which quantiles * are to be computed is known or not. *

* @param N * if known_N==true, the number of elements over which * quantiles are to be computed. if known_N==false, the * upper limit on the number of elements over which quantiles are * to be computed. In other words, the maximum number of elements * ever to be added. If such an upper limit is a-priori unknown, * then set N = Long.MAX_VALUE. *

* @param epsilon * the approximation error which is guaranteed not to be exceeded * (e.g. 0.001) (0 <= epsilon <= 1). To * get exact rather than approximate quantiles, set * epsilon=0.0; *

* @param delta * the allowed probability that the actual approximation error * exceeds epsilon (e.g. 0.0001) (0 <= delta <= * 1). To avoid probabilistic answers, set delta=0.0. * For example, delta = 0.0001 is equivalent to a * confidence of 99.99%. *

* @param quantiles * the number of quantiles to be computed (e.g. 100) ( * quantiles >= 1). If unknown in advance, set this * number large, e.g. quantiles >= 10000. *

* @param randomGenerator * a uniform random number generator. Set this parameter to * null to use a default generator seeded with the * current time. *

* The next three parameters specify additional capabilities * unrelated to quantile computation. They are identical to the * one's defined in the constructor of the parent class * {@link MightyStaticFloatBin1D}. *

* @param hasSumOfLogarithms * Tells whether {@link #sumOfLogarithms()} can return meaningful * results. Set this parameter to false if measures of * sum of logarithms, geometric mean and product are not * required. *

* @param hasSumOfInversions * Tells whether {@link #sumOfInversions()} can return meaningful * results. Set this parameter to false if measures of * sum of inversions, harmonic mean and sumOfPowers(-1) are not * required. *

* @param maxOrderForSumOfPowers * The maximum order k for which * {@link #sumOfPowers(int)} can return meaningful results. Set * this parameter to at least 3 if the skew is required, to at * least 4 if the kurtosis is required. In general, if moments * are required set this parameter at least as large as the * largest required moment. This method always substitutes * Math.max(2,maxOrderForSumOfPowers) for the parameter * passed in. Thus, sumOfPowers(0..2) always returns * meaningful results. */ public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles, FloatRandomEngine randomGenerator, boolean hasSumOfLogarithms, boolean hasSumOfInversions, int maxOrderForSumOfPowers) { super(hasSumOfLogarithms, hasSumOfInversions, maxOrderForSumOfPowers); this.finder = FloatQuantileFinderFactory.newFloatQuantileFinder(known_N, N, epsilon, delta, quantiles, randomGenerator); this.clear(); } /** * Adds the part of the specified list between indexes from * (inclusive) and to (inclusive) to the receiver. * * @param list * the list of which elements shall be added. * @param from * the index of the first element to be added (inclusive). * @param to * the index of the last element to be added (inclusive). * @throws IndexOutOfBoundsException * if * list.size()>0 && (from<0 || from>to || to>=list.size()) * . */ public synchronized void addAllOfFromTo(FloatArrayList list, int from, int to) { super.addAllOfFromTo(list, from, to); if (this.finder != null) this.finder.addAllOfFromTo(list, from, to); } /** * Removes all elements from the receiver. The receiver will be empty after * this call returns. */ public synchronized void clear() { super.clear(); if (this.finder != null) this.finder.clear(); } /** * Returns a deep copy of the receiver. * * @return a deep copy of the receiver. */ public synchronized Object clone() { QuantileFloatBin1D clone = (QuantileFloatBin1D) super.clone(); if (this.finder != null) clone.finder = (FloatQuantileFinder) clone.finder.clone(); return clone; } /** * Computes the deviations from the receiver's measures to another bin's * measures. * * @param other * the other bin to compare with * @return a summary of the deviations. */ public String compareWith(AbstractFloatBin1D other) { StringBuffer buf = new StringBuffer(super.compareWith(other)); if (other instanceof QuantileFloatBin1D) { QuantileFloatBin1D q = (QuantileFloatBin1D) other; buf.append("25%, 50% and 75% Quantiles: " + relError(quantile(0.25f), q.quantile(0.25f)) + ", " + relError(quantile(0.5f), q.quantile(0.5f)) + ", " + relError(quantile(0.75f), q.quantile(0.75f))); buf.append("\nquantileInverse(mean): " + relError(quantileInverse(mean()), q.quantileInverse(q.mean())) + " %"); buf.append("\n"); } return buf.toString(); } /** * Returns the median. */ public float median() { return quantile(0.5f); } /** * Computes and returns the phi-quantile. * * @param phi * the percentage for which the quantile is to be computed. phi * must be in the interval (0.0,1.0]. * @return the phi quantile element. */ public synchronized float quantile(float phi) { return quantiles(new FloatArrayList(new float[] { phi })).get(0); } /** * Returns how many percent of the elements contained in the receiver are * <= element. Does linear interpolation if the element is not * contained but lies in between two contained elements. * * @param element * the element to search for. * @return the percentage phi of elements <= element ( * 0.0 <= phi <=1.0). */ public synchronized float quantileInverse(float element) { return finder.phi(element); } /** * Returns the quantiles of the specified percentages. For implementation * reasons considerably more efficient than calling {@link #quantile(float)} * various times. * * @param phis * the percentages for which quantiles are to be computed. Each * percentage must be in the interval (0.0,1.0]. * percentages must be sorted ascending. * @return the quantiles. */ public synchronized FloatArrayList quantiles(cern.colt.list.tfloat.FloatArrayList phis) { return finder.quantileElements(phis); } /** * Returns how many elements are contained in the range * [minElement,maxElement]. Does linear interpolation if one or * both of the parameter elements are not contained. Returns exact or * approximate results, depending on the parametrization of this class or * subclasses. * * @param minElement * the minimum element to search for. * @param maxElement * the maximum element to search for. * @return the number of elements in the range. */ public int sizeOfRange(float minElement, float maxElement) { return Math.round(size() * (quantileInverse(maxElement) - quantileInverse(minElement))); } /** * Divides (rebins) a copy of the receiver at the given percentage * boundaries into bins and returns these bins, such that each bin * approximately reflects the data elements of its range. * * The receiver is not physically rebinned (divided); it stays unaffected by * this operation. The returned bins are such that if one would have * filled elements into multiple bins instead of one single all encompassing * bin only, those multiple bins would have approximately the same * statistics measures as the one's returned by this method. *

* The split(...) methods are particularly well suited for * real-time interactive rebinning (the famous "scrolling slider" effect). *

* Passing equi-distant percentages like * (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) into this method will yield bins * of an equi-depth histogram, i.e. a histogram with bin boundaries * adjusted such that each bin contains the same number of elements, in this * case 20% each. Equi-depth histograms can be useful if, for example, not * enough properties of the data to be captured are known a-priori to be * able to define reasonable bin boundaries (partitions). For example, when * guesses about minimas and maximas are strongly unreliable. Or when * chances are that by focussing too much on one particular area other * important areas and characters of a data set may be missed. *

* Implementation: *

* The receiver is divided into s = percentages.size()-1 intervals * (bins). For each interval I, its minimum and maximum elements * are determined based upon quantile computation. Further, each interval * I is split into k equi-percent-distant subintervals * (sub-bins). In other words, an interval is split into subintervals such * that each subinterval contains the same number of elements. *

* For each subinterval S, its minimum and maximum are determined, * again, based upon quantile computation. They yield an approximate * arithmetic mean am = (min+max)/2 of the subinterval. A * subinterval is treated as if it would contain only elements equal to the * mean am. Thus, if the subinterval contains, say, n * elements, it is assumed to consist of n mean elements * (am,am,...,am). A subinterval's sum of elements, sum of squared * elements, sum of inversions, etc. are then approximated using such a * sequence of mean elements. *

* Finally, the statistics measures of an interval I are computed * by summing up (integrating) the measures of its subintervals. *

* Accuracy: *

* Depending on the accuracy of quantile computation and the number of * subintervals per interval (the resolution). Objects of this class compute * exact or approximate quantiles, depending on the parameters used upon * instance construction. Objects of subclasses may always compute * exact quantiles, as is the case for {@link DynamicFloatBin1D}. Most * importantly for this class QuantileBin1D, a reasonably small * epsilon (e.g. 0.01, perhaps 0.001) should be used upon instance * construction. The confidence parameter delta is less important, * you may find delta=0.00001 appropriate.
* The larger the resolution, the smaller the approximation error, up to * some limit. Integrating over only a few subintervals per interval will * yield very crude approximations. If the resolution is set to a reasonably * large number, say 10..100, more small subintervals are integrated, * resulting in more accurate results.
* Note that for good accuracy, the number of quantiles computable with the * given approximation guarantees should upon instance construction be * specified, so as to satisfy *

* quantiles > resolution * (percentages.size()-1) *

*

* Example: *

* resolution=2, percentList = (0.0, 0.1, 0.2, 0.5, 0.9, 1.0) means * the receiver is to be split into 5 bins:
*

    *
  • bin 0 ranges from [0%..10%) and holds the smallest 10% of * the sorted elements. *
  • bin 1 ranges from [10%..20%) and holds the next smallest 10% * of the sorted elements. *
  • bin 2 ranges from [20%..50%) and holds the next smallest 30% * of the sorted elements. *
  • bin 3 ranges from [50%..90%) and holds the next smallest 40% * of the sorted elements. *
  • bin 4 ranges from [90%..100%) and holds the largest 10% of * the sorted elements. *
*

* The statistics measures for each bin are to be computed at a resolution * of 2 subbins per bin. Thus, the statistics measures of a bin are the * integrated measures over 2 subbins, each containing the same amount of * elements: *

    *
  • bin 0 has a subbin ranging from [ 0%.. 5%) and a subbin * ranging from [ 5%..10%). *
  • bin 1 has a subbin ranging from [10%..15%) and a subbin * ranging from [15%..20%). *
  • bin 2 has a subbin ranging from [20%..35%) and a subbin * ranging from [35%..50%). *
  • bin 3 has a subbin ranging from [50%..70%) and a subbin * ranging from [70%..90%). *
  • bin 4 has a subbin ranging from [90%..95%) and a subbin * ranging from [95%..100%). *
*

* Lets concentrate on the subbins of bin 0. *

    *
  • Assume the subbin A=[0%..5%) has a minimum of 300 * and a maximum of 350 (0% of all elements are less than 300, 5% * of all elements are less than 350). *
  • Assume the subbin B=[5%..10%) has a minimum of 350 * and a maximum of 550 (5% of all elements are less than 350, 10% * of all elements are less than 550). *
*

* Assume the entire data set consists of N=100 elements. *

    *
  • Then subbin A has an approximate mean of 300+350 / 2 = 325, * a size of N*(5%-0%) = 100*5% = 5 elements, an approximate sum of * 325 * 100*5% = 1625, an approximate sum of squares of * 3252 * 100*5% = 528125, an approximate sum of * inversions of (1.0/325) * 100*5% = 0.015, etc. *
  • Analogously, subbin B has an approximate mean of * 350+550 / 2 = 450, a size of N*(10%-5%) = 100*5% = 5 * elements, an approximate sum of 450 * 100*5% = 2250, an * approximate sum of squares of 4502 * 100*5% = 1012500 * , an approximate sum of inversions of (1.0/450) * 100*5% = 0.01, * etc. *
*

* Finally, the statistics measures of bin 0 are computed by summing up * (integrating) the measures of its subintervals: Bin 0 has a size of * N*(10%-0%)=10 elements (we knew that already), sum of * 1625+2250=3875, sum of squares of * 528125+1012500=1540625, sum of inversions of * 0.015+0.01=0.025, etc. From these follow other measures such as * mean=3875/10=387.5, rms = sqrt(1540625 / 10)=392.5, etc. The * other bins are computes analogously. * * @param percentages * the percentage boundaries at which the receiver shall be * split. * @param k * the desired number of subintervals per interval. */ public synchronized MightyStaticFloatBin1D[] splitApproximately(FloatArrayList percentages, int k) { /* * percentages = [p0, p1, p2, ..., p(size-2), p(size-1)] defines bins * [p0,p1), [p1,p2), ..., [p(size-2),p(size-1)) each bin is divided into * k equi-percent-distant sub bins (subintervals). e.g. k = 2 means * "compute" with a resolution (accuracy) of 2 subbins (subintervals) * per bin, * * percentages = [0.1, 0.2, 0.3, ..., 0.9, 1.0] means bin 0 holds the * first 0.1-0.0=10% of the sorted elements, bin 1 holds the next * 0.2-0.1=10% of the sorted elements, ... * * bins = [0.1, 0.2), [0.2, 0.3), ..., [0.9, 1.0) subBins = [0.1, 0.15, * 0.2, 0.25, 0.3, ....] * * [0.1, 0.15) [0.15, 0.2) [0.3, 0.35) [0.35, 0.4) * * [0.2, 0.25) [0.25, 0.3) * */ int percentSize = percentages.size(); if (k < 1 || percentSize < 2) throw new IllegalArgumentException(); float[] percent = percentages.elements(); int noOfBins = percentSize - 1; // construct subintervals float[] subBins = new float[1 + k * (percentSize - 1)]; subBins[0] = percent[0]; int c = 1; for (int i = 0; i < noOfBins; i++) { float step = (percent[i + 1] - percent[i]) / k; for (int j = 1; j <= k; j++) { subBins[c++] = percent[i] + j * step; } } // compute quantile elements; float[] quantiles = quantiles(new FloatArrayList(subBins)).elements(); // collect summary statistics for each bin. // one bin's statistics are the integrated statistics of its // subintervals. MightyStaticFloatBin1D[] splitBins = new MightyStaticFloatBin1D[noOfBins]; int maxOrderForSumOfPowers = getMaxOrderForSumOfPowers(); maxOrderForSumOfPowers = Math.min(10, maxOrderForSumOfPowers); // don't // compute // tons // of // measures int dataSize = this.size(); c = 0; for (int i = 0; i < noOfBins; i++) { // for each bin float step = (percent[i + 1] - percent[i]) / k; float binSum = 0; float binSumOfSquares = 0; float binSumOfLogarithms = 0; float binSumOfInversions = 0; float[] binSumOfPowers = null; if (maxOrderForSumOfPowers > 2) { binSumOfPowers = new float[maxOrderForSumOfPowers - 2]; } float binMin = quantiles[c++]; float safe_min = binMin; float subIntervalSize = dataSize * step; for (int j = 1; j <= k; j++) { // integrate all subintervals float binMax = quantiles[c++]; float binMean = (binMin + binMax) / 2; binSum += binMean * subIntervalSize; binSumOfSquares += binMean * binMean * subIntervalSize; if (this.hasSumOfLogarithms) { binSumOfLogarithms += (Math.log(binMean)) * subIntervalSize; } if (this.hasSumOfInversions) { binSumOfInversions += (1 / binMean) * subIntervalSize; } if (maxOrderForSumOfPowers >= 3) binSumOfPowers[0] += binMean * binMean * binMean * subIntervalSize; if (maxOrderForSumOfPowers >= 4) binSumOfPowers[1] += binMean * binMean * binMean * binMean * subIntervalSize; for (int p = 5; p <= maxOrderForSumOfPowers; p++) { binSumOfPowers[p - 3] += Math.pow(binMean, p) * subIntervalSize; } binMin = binMax; } c--; // example: bin(0) contains (0.2-0.1) == 10% of all elements int binSize = Math.round((percent[i + 1] - percent[i]) * dataSize); float binMax = binMin; binMin = safe_min; // fill statistics splitBins[i] = new MightyStaticFloatBin1D(this.hasSumOfLogarithms, this.hasSumOfInversions, maxOrderForSumOfPowers); if (binSize > 0) { splitBins[i].size = binSize; splitBins[i].min = binMin; splitBins[i].max = binMax; splitBins[i].sum = binSum; splitBins[i].sum_xx = binSumOfSquares; splitBins[i].sumOfLogarithms = binSumOfLogarithms; splitBins[i].sumOfInversions = binSumOfInversions; splitBins[i].sumOfPowers = binSumOfPowers; } /* * float binMean = binSum / binSize; * System.out.println("size="+binSize); * System.out.println("min="+binMin); * System.out.println("max="+binMax); * System.out.println("mean="+binMean); * System.out.println("sum_x="+binSum); * System.out.println("sum_xx="+binSumOfSquares); * System.out.println("rms="+Math.sqrt(binSumOfSquares / binSize)); * System.out.println(); */ } return splitBins; } /** * Divides (rebins) a copy of the receiver at the given interval * boundaries into bins and returns these bins, such that each bin * approximately reflects the data elements of its range. * * For each interval boundary of the axis (including -infinity and * +infinity), computes the percentage (quantile inverse) of elements less * than the boundary. Then lets * {@link #splitApproximately(FloatArrayList,int)} do the real work. * * @param axis * an axis defining interval boundaries. * @param k * the desired number of subintervals per interval. */ public synchronized MightyStaticFloatBin1D[] splitApproximately(hep.aida.tfloat.FloatIAxis axis, int k) { FloatArrayList percentages = new FloatArrayList(new hep.aida.tfloat.ref.FloatConverter().edges(axis)); percentages.beforeInsert(0, Float.NEGATIVE_INFINITY); percentages.add(Float.POSITIVE_INFINITY); for (int i = percentages.size(); --i >= 0;) { percentages.set(i, quantileInverse(percentages.get(i))); } return splitApproximately(percentages, k); } /** * Returns a String representation of the receiver. */ public synchronized String toString() { StringBuffer buf = new StringBuffer(super.toString()); buf.append("25%, 50%, 75% Quantiles: " + quantile(0.25f) + ", " + quantile(0.5f) + ", " + quantile(0.75f)); // buf.append("10%, 25%, 50%, 75%, 90% Quantiles: "+quantile(0.1) + ", // "+ quantile(0.25) + ", "+ quantile(0.5) + ", " + quantile(0.75) + ", // " + quantile(0.9)); buf.append("\nquantileInverse(median): " + quantileInverse(median())); buf.append("\n"); return buf.toString(); } }





© 2015 - 2025 Weber Informatics LLC | Privacy Policy