
package water.fvec;
import water.*;
import water.parser.BufferedString;
/** A compression scheme, over a chunk of data - a single array of bytes.
* Chunks are mapped many-to-1 to a {@link Vec}. The actual vector
* header info is in the Vec - which contains info to find all the bytes of
* the distributed vector. Subclasses of this abstract class implement
* (possibly empty) compression schemes.
*
* Chunks are collections of elements, and support an array-like API.
* Chunks are subsets of a Vec; while the elements in a Vec are numbered
* starting at 0, any given Chunk has some (probably non-zero) starting row,
* and a length which is smaller than the whole Vec. Chunks are limited to a
* single Java byte array in a single JVM heap, and only an int's worth of
* elements. Chunks support both the notions of a global row-number and a
* chunk-local numbering. The global row-number calls are variants of {@code
* at} and {@code set}. If the row is outside the current Chunk's range, the
* data will be loaded by fetching from the correct Chunk. This probably
* involves some network traffic, and if all rows are loaded then the entire
* dataset will be pulled local (possibly triggering an OutOfMemory).
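*
* <p>A minimal, self-contained sketch of the global-to-local row mapping
* described above. The class {@code RowMappingSketch} and its members are
* hypothetical and not part of H2O; it models only a starting row and a
* length, and assumes that a row outside the range is rejected rather than
* fetched from another Chunk:

```java
// Hypothetical stand-in for a chunk: only the starting row and length are modelled.
final class RowMappingSketch {
  final long start; // global row number of this chunk's first element
  final int len;    // number of rows held by this chunk

  RowMappingSketch(long start, int len) {
    this.start = start;
    this.len = len;
  }

  // True if the given global row falls inside this chunk's range.
  boolean contains(long globalRow) {
    long x = globalRow - start;
    return 0 <= x && x < len;
  }

  // Convert a global row number into a chunk-relative index.
  int toLocal(long globalRow) {
    if (!contains(globalRow))
      throw new ArrayIndexOutOfBoundsException(start + " <= " + globalRow + " < " + (start + len));
    return (int) (globalRow - start);
  }
}
```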
*
* <p>The chunk-local numbering supports the common {@code for} loop iterator
* pattern, using the chunk-relative {@code atd}, {@code at8} and {@code set}
* calls, and is faster than the global row-numbering for tight loops (because
* it avoids some range checks):
* <pre>
* for( int row=0; row < chunk._len; row++ )
*   ...chunk.atd(row)...
* </pre>
*
* The array-like API allows loading and storing elements in and out of
* Chunks. When loading, values are decompressed. When storing, an attempt
* to compress back into the actual underlying Chunk subclass is made; if this
* fails the Chunk is "inflated" into a {@link NewChunk}, and the store
* completed there. Later the NewChunk will be compressed (probably into a
* different underlying Chunk subclass) and put back in the K/V store under
* the same Key - effectively replacing the original Chunk; this is done when
* {@link #close} is called, and is taken care of by the standard {@link
* MRTask} calls.
*
* <p>Chunk updates are not multi-thread safe; the caller must do correct
* synchronization. This is already handled by the Map/Reduce ({@link MRTask})
* framework. Chunk updates are not visible cross-cluster until {@link
* #close} is called; again this is handled by MRTask directly.
*
* <p>In addition to normal load and store operations, Chunks support the
* notion of a missing element via the {@code isNA()} calls, and a "next
* non-zero" notion for rapidly iterating over sparse data.
*
* <h2>Data Types</h2>
*
* <p>Chunks hold Java primitive values, timestamps, UUIDs, or Strings. All
* the Chunks in a Vec hold the same type. Most of the types are compressed.
* Integer types (boolean, byte, short, int, long) are always lossless. Float
* and Double types might lose 1 or 2 ulps in the compression. Time data is
* held as milliseconds since the Unix Epoch. UUIDs are held as 128-bit
* integers (a pair of Java longs). Strings are compressed in various obvious
* ways. Sparse data is held... sparsely; e.g. loading data in SVMLight
* format will not "blow up" the in-memory representation. Categoricals/factors
* are held as small integers, with a shared String lookup table on the side.
*
* <p>Chunks support the notion of missing data. Missing float and
* double data is always treated as a NaN, both when read and when written.
* There is no equivalent of NaN for integer data; reading a missing integer
* value is a coding error and will be flagged. If you are working with
* integer data with missing elements, you must first check for a missing
* value before loading it:
* <pre>
* if( !chk.isNA(row) ) ...chk.at8(row)....
* </pre>
*
* The same holds true for the other non-real types (timestamps, UUIDs,
* Strings, or categoricals); they must be checked for missing before being used.
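*
* <p>The check-before-load rule can be sketched in isolation as follows. The
* class {@code NaCheckSketch} is hypothetical and not part of H2O; it keeps
* missing flags in a plain {@code boolean[]} and assumes that reading a
* missing integer is flagged with an {@code IllegalArgumentException}:

```java
// Hypothetical integer column with an explicit per-row missing flag.
final class NaCheckSketch {
  private final long[] vals;
  private final boolean[] missing;

  NaCheckSketch(long[] vals, boolean[] missing) {
    this.vals = vals;
    this.missing = missing;
  }

  int len() { return vals.length; }

  // Missing-value status of a row.
  boolean isNA(int row) { return missing[row]; }

  // Load a long; there is no integer NaN, so a missing value is an error (assumed exception type).
  long at8(int row) {
    if (missing[row])
      throw new IllegalArgumentException("at8 on a missing value, row " + row);
    return vals[row];
  }

  // Sum the present values, checking for missing before every load.
  static long sumPresent(NaCheckSketch chk) {
    long sum = 0;
    for (int row = 0; row < chk.len(); row++)
      if (!chk.isNA(row)) sum += chk.at8(row);
    return sum;
  }
}
```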
*
* <h2>Performance Concerns</h2>
*
* <p>The standard {@code for} loop mentioned above is the fastest way to
* access data; definitely faster (and less error prone) than iterating over
* global row numbers. Iterating over a single Chunk is nearly always
* memory-bandwidth bound. Often code will iterate over a number of Chunks
* aligned together (the common use-case of looking at whole rows of a
* dataset). Again, typically such a code pattern is memory-bandwidth bound,
* although the X86 will stop being able to prefetch well beyond 100 or 200
* Chunks.
*
* <p>Note that Chunk alignment is guaranteed within all the Vecs of a Frame:
* Same-numbered Chunks of different Vecs will have the same global
* row numbering and the same length, enabling a particularly simple and
* efficient way to iterate over all rows.
*
* <p>This example computes the squared Euclidean distance between each row
* (all columns except the last) and a given point, and stores that squared
* distance back in the last column. Note that due to "NaN poisoning", if any
* row element is missing, the entire distance calculated will be NaN.
*
* <pre>{@code
* final double[] _point;                             // The given point
* public void map( Chunk[] chks ) {                  // Map over a set of same-numbered Chunks
*   for( int row=0; row < chks[0]._len; row++ ) {    // For all rows
*     double dist=0;                                 // Squared distance
*     for( int col=0; col < chks.length-1; col++ ) { // For all cols, except the last output col
*       double d = chks[col].atd(row) - _point[col]; // Distance along this dimension
*       dist += d*d;                                 // Sum-squared-distance
*     }
*     chks[chks.length-1].set( row, dist );          // Store back the distance in the last col
*   }
* }}</pre>
*/
public abstract class Chunk extends Iced {
public Chunk() {}
private Chunk(byte [] bytes) {_mem = bytes;initFromBytes();}
/**
* Sparse bulk interface: stream through the compressed values and extract them into the given arrays.
* @param vals holds extracted values, length must be >= {@code sparseLenZero()}
* @param ids holds extracted chunk-relative row ids, length must be >= {@code sparseLenZero()}
* @return number of extracted (non-zero) elements, equal to {@code sparseLenZero()}
*/
public int asSparseDoubles(double[] vals, int[] ids){return asSparseDoubles(vals,ids,Double.NaN);}
public int asSparseDoubles(double [] vals, int [] ids, double NA) {
if(vals.length < sparseLenZero())
throw new IllegalArgumentException();
getDoubles(vals,0,_len,NA); // pass the caller-supplied NA substitute through
for(int i = 0; i < _len; ++i) ids[i] = i;
return len();
}
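/**
* Dense bulk interface, fetch values from the given range.
* @param vals destination array; row {@code i} is written to {@code vals[i - from]}
* @param from first chunk-relative row to fetch (inclusive)
* @param to end of the chunk-relative row range (exclusive)
* @return {@code vals}
*/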
public double [] getDoubles(double[] vals, int from, int to){ return getDoubles(vals,from,to, Double.NaN);}
public double [] getDoubles(double [] vals, int from, int to, double NA){
for(int i = from; i < to; ++i) {
vals[i - from] = atd(i);
if(Double.isNaN(vals[i-from]))
vals[i - from] = NA;
}
return vals;
}
public int [] getIntegers(int [] vals, int from, int to, int NA){
for(int i = from; i < to; ++i) {
double d = atd(i);
if(Double.isNaN(d))
vals[i - from] = NA; // index relative to 'from', matching getDoubles
else {
vals[i - from] = (int)d;
if(vals[i - from] != d) throw new IllegalArgumentException("Calling getIntegers on non-integer column");
}
}
return vals;
}
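/**
* Dense bulk interface, fetch values from the given ids.
* @param vals destination array, length must be >= {@code ids.length}
* @param ids chunk-relative row ids to fetch
* @return {@code vals}, where {@code vals[j]} holds the value at row {@code ids[j]}
*/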
public double[] getDoubles(double [] vals, int [] ids){
int j = 0;
for(int i:ids) vals[j++] = atd(i);
return vals;
}
/** Global starting row for this local Chunk; a read-only field. */
transient long _start = -1;
/** Global starting row for this local Chunk */
public final long start() { return _start; }
/** Global index of this chunk filled during chunk load */
transient int _cidx = -1;
/** Number of rows in this Chunk; publicly a read-only field. Odd API
* design choice: public, not-final, read-only, NO-ACCESSOR.
*
* <p>NO-ACCESSOR: This is a high-performance field, and must have a known
* zero-cost cost-model; accessors hide that cost model, and make it
* not-obvious whether a loop will be properly optimized or not.
*
* <p>not-final: set in various deserializers.
*
* <p>Proper usage: read the field, probably in a hot loop.
* <pre>
* for( int row=0; row < chunk._len; row++ )
*   ...chunk.atd(row)...
* </pre>
**/
public transient int _len;
/** Internal set of _len. Used by lots of subclasses. Not a publicly visible API. */
int set_len(int len) { return _len = len; }
/** Read-only length of chunk (number of rows). */
public int len() { return _len; }
/** Normally==null, changed if chunk is written to. Not a publicly readable or writable field. */
private transient Chunk _chk2;
/** Exposed for internal testing only. Not a publicly visible API. */
public Chunk chk2() { return _chk2; }
/** Owning Vec; a read-only field */
transient Vec _vec;
/** Owning Vec */
public Vec vec() { return _vec; }
/** Set the owning Vec */
public void setVec(Vec vec) { _vec = vec; }
/** Set the start */
public void setStart(long start) { _start = start; }
/** The Big Data. Frequently set in the subclasses, but not otherwise a publicly writable field. */
byte[] _mem;
/** Short-cut to the embedded big-data memory. Generally not useful for
* public consumption, since the data remains compressed and holding on to a
* pointer to this array defeats the user-mode spill-to-disk. */
public byte[] getBytes() { return _mem; }
public void setBytes(byte[] mem) { _mem = mem; }
/** Used by a ParseExceptionTest to break the Chunk invariants and trigger an
* NPE. Not intended for public use. */
public final void crushBytes() { _mem=null; }
final long at8_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at8((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Load a {@code double} value using absolute row numbers. Returns
* Double.NaN if value is missing.
*
* This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #atd} since it range-checks within a chunk.
* @return double value at the given row, or NaN if the value is missing */
final double at_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return atd((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Missing value status.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #isNA} since it range-checks within a chunk.
* @return true if the value is missing */
final boolean isNA_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return isNA((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Low half of a 128-bit UUID, or throws if the value is missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #at16l} since it range-checks within a chunk.
* @return Low half of a 128-bit UUID, or throws if the value is missing. */
final long at16l_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at16l((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** High half of a 128-bit UUID, or throws if the value is missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #at16h} since it range-checks within a chunk.
* @return High half of a 128-bit UUID, or throws if the value is missing. */
final long at16h_abs(long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return at16h((int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** String value using absolute row numbers, or null if missing.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects).
*
* <p>Slightly slower than {@link #atStr} since it range-checks within a chunk.
* @return String value using absolute row numbers, or null if missing. */
final BufferedString atStr_abs(BufferedString bStr, long i) {
long x = i - (_start>0 ? _start : 0);
if( 0 <= x && x < _len) return atStr(bStr, (int) x);
throw new ArrayIndexOutOfBoundsException(""+_start+" <= "+i+" < "+(_start+ _len));
}
/** Load a {@code double} value using chunk-relative row numbers. Returns Double.NaN
* if value is missing.
* @return double value at the given row, or NaN if the value is missing */
public final double atd(int i) { return _chk2 == null ? atd_impl(i) : _chk2. atd_impl(i); }
/** Load a {@code long} value using chunk-relative row numbers. Floating
* point values are silently rounded to a long. Throws if the value is
* missing.
* @return long value at the given row, or throw if the value is missing */
public final long at8(int i) { return _chk2 == null ? at8_impl(i) : _chk2. at8_impl(i); }
/** Missing value status using chunk-relative row numbers.
*
* @return true if the value is missing */
public final boolean isNA(int i) { return _chk2 == null ?isNA_impl(i) : _chk2.isNA_impl(i); }
/** Low half of a 128-bit UUID, or throws if the value is missing.
*
* @return Low half of a 128-bit UUID, or throws if the value is missing. */
public final long at16l(int i) { return _chk2 == null ? at16l_impl(i) : _chk2.at16l_impl(i); }
/** High half of a 128-bit UUID, or throws if the value is missing.
*
* @return High half of a 128-bit UUID, or throws if the value is missing. */
public final long at16h(int i) { return _chk2 == null ? at16h_impl(i) : _chk2.at16h_impl(i); }
/** String value using chunk-relative row numbers, or null if missing.
*
* @return String value or null if missing. */
public final BufferedString atStr(BufferedString bStr, int i) { return _chk2 == null ? atStr_impl(bStr, i) : _chk2.atStr_impl(bStr, i); }
/** Write a {@code long} using absolute row numbers. There is no way to
* write a missing value with this call. Under rare circumstances this can
* throw: if the long does not fit in a double (value is larger magnitude
* than 2^52), AND float values are stored in Vector. In this case, there
* is no common compatible data representation.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs(long i, long l) { long x = i-_start; if (0 <= x && x < _len) set((int) x, l); else _vec.set(i,l); }
/** Write a {@code double} using absolute row numbers; NaN will be treated as
* a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs(long i, double d) { long x = i-_start; if (0 <= x && x < _len) set((int) x, d); else _vec.set(i,d); }
/** Write a {@code float} using absolute row numbers; NaN will be treated as
* a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void set_abs( long i, float f) { long x = i-_start; if (0 <= x && x < _len) set((int) x, f); else _vec.set(i,f); }
/** Set the element as missing, using absolute row numbers.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
final void setNA_abs(long i) { long x = i-_start; if (0 <= x && x < _len) setNA((int) x); else _vec.setNA(i); }
/** Set a {@code String}, using absolute row numbers.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
*
* <p>This version uses absolute element numbers, but must convert them to
* chunk-relative indices - requiring a load from an aliasing local var,
* leading to lower quality JIT'd code (similar issue to using iterator
* objects). */
public final void set_abs(long i, String str) { long x = i-_start; if (0 <= x && x < _len) set((int) x, str); else _vec.set(i,str); }
public boolean hasFloat(){return true;}
public boolean hasNA(){return true;}
/** Replace all rows with this new chunk */
public void replaceAll( Chunk replacement ) {
assert _len == replacement._len;
_vec.preWriting(); // One-shot writing-init
_chk2 = replacement;
assert _chk2._chk2 == null; // Replacement has NOT been written into
}
public Chunk deepCopy() {
Chunk c2 = (Chunk)clone();
c2._vec=null;
c2._start=-1;
c2._cidx=-1;
c2._mem = _mem.clone();
return c2;
}
private void setWrite() {
if( _chk2 != null ) return; // Already setWrite
assert !(this instanceof NewChunk) : "Cannot direct-write into a NewChunk, only append";
_vec.preWriting(); // One-shot writing-init
_chk2 = (Chunk)clone(); // Flag this chunk as having been written into
assert _chk2._chk2 == null; // Clone has NOT been written into
}
/** Write a {@code long} with chunk-relative indexing. There is no way to
* write a missing value with this call. Under rare circumstances this can
* throw: if the long does not fit in a double (value is larger magnitude
* than 2^52), AND float values are stored in Vector. In this case, there
* is no common compatible data representation.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final long set(int idx, long l) {
setWrite();
if( _chk2.set_impl(idx,l) ) return l;
(_chk2 = inflate_impl(new NewChunk(this))).set_impl(idx,l);
return l;
}
/** Write a {@code double} with chunk-relative indexing. NaN will be treated
* as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final double set(int idx, double d) {
setWrite();
if( _chk2.set_impl(idx,d) ) return d;
(_chk2 = inflate_impl(new NewChunk(this))).set_impl(idx,d);
return d;
}
/** Write a {@code float} with chunk-relative indexing. NaN will be treated
* as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final float set(int idx, float f) {
setWrite();
if( _chk2.set_impl(idx,f) ) return f;
(_chk2 = inflate_impl(new NewChunk(this))).set_impl(idx,f);
return f;
}
/** Set a value as missing.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return {@code true}
*/
public final boolean setNA(int idx) {
setWrite();
if( _chk2.setNA_impl(idx) ) return true;
(_chk2 = inflate_impl(new NewChunk(this))).setNA_impl(idx);
return true;
}
/** Write a {@code String} with chunk-relative indexing. {@code null} will
* be treated as a missing value.
*
* <p>As with all the {@code set} calls, if the value written does not fit
* in the current compression scheme, the Chunk will be inflated into a
* NewChunk and the value written there. Later, the NewChunk will be
* compressed (after a {@link #close} call) and written back to the DKV;
* i.e., there is some interesting cost if Chunk compression-types need to
* change.
* @return the set value
*/
public final String set(int idx, String str) {
setWrite();
if( _chk2.set_impl(idx,str) ) return str;
(_chk2 = inflate_impl(new NewChunk(this))).set_impl(idx,str);
return str;
}
/** After writing we must call close() to register the bulk changes. If a
* NewChunk was needed, it will be compressed into some other kind of Chunk.
* The resulting Chunk (either a modified self, or a compressed NewChunk)
* will be written to the DKV. Only after that {@code DKV.put} completes
* will all readers of this Chunk witness the changes.
* @return the passed-in {@link Futures}, for flow-coding.
*/
public Futures close( int cidx, Futures fs ) {
if( this instanceof NewChunk ) _chk2 = this;
if( _chk2 == null ) return fs; // No change?
if( _chk2 instanceof NewChunk ) _chk2 = ((NewChunk)_chk2).new_close();
DKV.put(_vec.chunkKey(cidx),_chk2,fs,true); // Write updated chunk back into K/V
return fs;
}
/** @return Chunk index */
public int cidx() {
assert _cidx != -1 : "Chunk idx was not properly loaded!";
return _cidx;
}
/** Chunk-specific readers. Not a public API */
abstract double atd_impl(int idx);
abstract long at8_impl(int idx);
abstract boolean isNA_impl(int idx);
long at16l_impl(int idx) { throw new IllegalArgumentException("Not a UUID"); }
long at16h_impl(int idx) { throw new IllegalArgumentException("Not a UUID"); }
BufferedString atStr_impl(BufferedString bStr, int idx) { throw new IllegalArgumentException("Not a String"); }
/** Chunk-specific writer. Returns false if the value does not fit in the
* current compression scheme. */
abstract boolean set_impl (int idx, long l );
abstract boolean set_impl (int idx, double d );
abstract boolean set_impl (int idx, float f );
abstract boolean setNA_impl(int idx);
boolean set_impl (int idx, String str) { throw new IllegalArgumentException("Not a String"); }
//Zero sparse methods:
/** Sparse Chunks have a significant number of zeros, and support for
* skipping over large runs of zeros in a row.
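*
* <p>A rough, self-contained sketch of how skipping over zeros might be used.
* The class {@code SparseSketch} is hypothetical and not part of H2O. It
* assumes that {@code nextNZ(rid)} returns the next stored row after
* {@code rid}, or the length when none remain, which is consistent with the
* dense default of {@code rid + 1}:

```java
// Hypothetical sparse column: only the non-zero rows are stored.
final class SparseSketch {
  final int len;               // logical number of rows
  private final int[] ids;     // sorted chunk-relative ids of the non-zero rows
  private final double[] vals; // values of those rows, parallel to ids

  SparseSketch(int len, int[] ids, double[] vals) {
    this.len = len;
    this.ids = ids;
    this.vals = vals;
  }

  // Next stored row strictly after rid, or len if none remain; pass -1 to start.
  // A linear scan keeps the sketch short; a real structure would track a position.
  int nextNZ(int rid) {
    for (int id : ids)
      if (id > rid) return id;
    return len;
  }

  // Load a value; rows that are not stored read as zero.
  double atd(int row) {
    int k = java.util.Arrays.binarySearch(ids, row);
    return k >= 0 ? vals[k] : 0.0;
  }

  // Sum the column while visiting only the stored rows.
  static double sum(SparseSketch c) {
    double s = 0;
    for (int i = c.nextNZ(-1); i < c.len; i = c.nextNZ(i))
      s += c.atd(i);
    return s;
  }
}
```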
* @return true if this Chunk is sparse. */
public boolean isSparseZero() {return false;}
/** Sparse Chunks have a significant number of zeros, and support for
* skipping over large runs of zeros in a row.
* @return At least as large as the count of non-zeros, but may be significantly smaller than the {@link #_len} */
public int sparseLenZero() {return _len;}
public int nextNZ(int rid){ return rid + 1;}
/**
* Get indices of non-zero values stored in this chunk.
* @param res output array, filled with the chunk-relative indices of the non-zero values
* @return number of indices written to {@code res} */
public int nonzeros(int [] res) {
int k = 0;
for( int i = 0; i < _len; ++i)
if(atd(i) != 0)
res[k++] = i;
return k;
}
//NA sparse methods:
/** Sparse Chunks have a significant number of NAs, and support for
* skipping over large runs of NAs in a row.
* @return true if this Chunk is sparseNA. */
public boolean isSparseNA() {return false;}
/** Sparse Chunks have a significant number of NAs, and support for
* skipping over large runs of NAs in a row.
* @return At least as large as the count of non-NAs, but may be significantly smaller than the {@link #_len} */
public int sparseLenNA() {return _len;}
// Next non-NA. Analogous to nextNZ()
public int nextNNA(int rid){ return rid + 1;}
/** Get chunk-relative indices of values (non-NAs for NA-sparse chunks, all for
* dense chunks) stored in this chunk. For dense chunks, this will contain
* indices of all the rows in this chunk.
* @param res output array, filled with the chunk-relative indices
* @return number of indices written to {@code res} */
public int nonnas(int [] res) {
for( int i = 0; i < _len; ++i) res[i] = i;
return _len;
}
/** Report the Chunk min-value (excluding NAs), or NaN if unknown. Actual
* min can be higher than reported. Used to short-cut RollupStats for
* constant and boolean chunks. */
double min() { return Double.NaN; }
/** Report the Chunk max-value (excluding NAs), or NaN if unknown. Actual
* max can be lower than reported. Used to short-cut RollupStats for
* constant and boolean chunks. */
double max() { return Double.NaN; }
public NewChunk inflate(){
return inflate_impl(new NewChunk(this));
}
/** Chunk-specific bulk inflater back to NewChunk. Used when writing into a
* chunk and written value is out-of-range for an update-in-place operation.
* Bulk copy from the compressed form into the nc._ls8 array. */
public abstract NewChunk inflate_impl(NewChunk nc);
/** Return the next Chunk, or null if at end. Mostly useful for parsers or
* optimized stencil calculations that want to "roll off the end" of a
* Chunk, but in a highly optimized way. */
public Chunk nextChunk( ) { return _vec.nextChunk(this); }
/** @return String version of a Chunk, currently just the class name */
@Override public String toString() { return getClass().getSimpleName(); }
/** In memory size in bytes of the compressed Chunk plus embedded array. */
public long byteSize() {
long s= _mem == null ? 0 : _mem.length;
s += (2+5)*8 + 12; // 2 hdr words, 5 other words, @8bytes each, plus mem array hdr
if( _chk2 != null ) s += _chk2.byteSize();
return s;
}
/** Custom serializers implemented by Chunk subclasses: the _mem field
* contains ALL the fields already. */
public final AutoBuffer write_impl(AutoBuffer bb) {return bb.putA1(_mem);}
@Override
public final byte [] asBytes(){return _mem;}
@Override
public final Chunk reloadFromBytes(byte [] ary){
_mem = ary;
initFromBytes();
return this;
}
protected abstract void initFromBytes();
public final Chunk read_impl(AutoBuffer ab){
_mem = ab.getA1();
initFromBytes();
return this;
}
// /** Custom deserializers, implemented by Chunk subclasses: the _mem field
// * contains ALL the fields already. Init _start to -1, so we know we have
// * not filled in other fields. Leave _vec and _chk2 null, leave _len
// * unknown. */
// abstract public Chunk read_impl( AutoBuffer ab );
// -----------------
// Support for fixed-width format printing
// private String pformat () { return pformat0(); }
// private int pformat__len { return pformat_len0(); }
/** Fixed-width format printing support. Filled in by the subclasses. */
public byte precision() { return -1; } // Digits after the decimal, or -1 for "all"
// protected String pformat0() {
// long min = (long)_vec.min();
// if( min < 0 ) return "% "+pformat_len0()+"d";
// return "%"+pformat_len0()+"d";
// }
// protected int pformat_len0() {
// int len=0;
// long min = (long)_vec.min();
// if( min < 0 ) len++;
// long max = Math.max(Math.abs(min),Math.abs((long)_vec.max()));
// throw H2O.unimpl();
// //for( int i=1; i