
smile.data.package-info

/*
 * Copyright (c) 2010-2021 Haifeng Li. All rights reserved.
 *
 * Smile is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Smile is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Smile.  If not, see <https://www.gnu.org/licenses/>.
 */

/**
 * Data and attribute encapsulation classes. A data set is a collection of
 * datum objects, which are usually defined by attribute-value pairs. A datum
 * object can be very sparse, in which case it is stored as a list to save
 * space. A datum object may have an associated class label (for
 * classification) or a real-valued response value (for regression).
 * Optionally, a datum object or attribute may have a (positive) weight,
 * whose meaning depends on the application. However, most machine learning
 * methods cannot use this extra weight information. Generally speaking,
 * there are two major types of attributes:
 * <dl>
 * <dt>Qualitative variables:</dt>
 * <dd>The data values are non-numeric categories. Examples: blood type, gender.</dd>
 * <dt>Quantitative variables:</dt>
 * <dd>The data values are counts or numerical measurements. A quantitative
 * variable can be either discrete, such as the number of students receiving
 * an 'A' in a class, or continuous, such as GPA or salary.</dd>
 * </dl>
 * Another way of classifying data is by measurement scale. In statistics,
 * there are four commonly used measurement scales:
 * <dl>
 * <dt>Nominal data:</dt>
 * <dd>Data values are non-numeric group labels. For example, a gender variable
 * can be coded as male = 0 and female = 1.</dd>
 * <dt>Ordinal data:</dt>
 * <dd>Data values are categorical and may be ranked in some numerically
 * meaningful way. For example, "strongly disagree" to "strongly agree" may be
 * coded as 1 to 5.</dd>
 * <dt>Continuous data:</dt>
 * <dd><ul>
 * <li>Interval data:
 * data values range over a real interval, which can be as large as
 * from negative infinity to positive infinity. The difference between two
 * values is meaningful, but the ratio of two values is not.
 * Examples: temperature, IQ.</li>
 * <li>Ratio data:
 * both the difference and the ratio of two values are meaningful.
 * Examples: salary, weight.</li>
 * </ul></dd>
 * </dl>
 * <p>
 * Besides the semantics of data, it is also very important to pay attention
 * to the memory efficiency of data. In Java's object model, every field of
 * an object is either a primitive data type, such as byte or int, or a
 * reference (pointer) to an object. Arrays of primitive data types, such as
 * char[], are also objects. One disadvantage of this model is that, when you
 * follow object-oriented design practices, data types are often composed of
 * many different types in order to encapsulate both state and behavior. As a
 * result, one data type can represent an entire tree of objects, which
 * incurs a high cost in both memory overhead and data locality.
 * <p>
 * The cost of having many objects is that each object in a JVM must have some
 * metadata associated with it, for example the java.lang.Class value
 * that represents the type of the object, or the length of an array object.
 * The most common approach is to place this metadata at the start of the
 * object, creating an object header.
 * <p>
 * For a large or complex object, the size of the header is relatively
 * insignificant. For a small object, however, the size of the header can
 * become significant. For byte[1], 64 bits of metadata are often required
 * for a single 8-bit value. Additionally, the JVM is likely to add at least
 * 3 bytes of padding to ensure that the subsequent object in the heap
 * starts on an aligned address. The total extra memory requirement for
 * 8 bits of data is therefore 88 bits (64 bits of header plus 24 bits of
 * padding). Every object has a similar associated overhead, so the more
 * objects you have, the greater the effect on system resources.
 * <p>
 * The structure of Java arrays can exaggerate this overhead. Consider
 * an array of Complex objects. Each instance of the Complex class has
 * two double values, of 64 bits each, plus the object header. Assuming
 * that the header is just the class reference, and occupies only 32 bits,
 * each Complex instance is 16 bytes of data and 4 bytes of extra overhead.
 * An array of 10 Complex objects consists of the header (class + length
 * = 8 bytes), plus 10 object references (assuming 4 bytes each = 40 bytes).
 * If each element of the array contains a unique Complex object, the total
 * is 160 bytes of data, but 88 bytes of additional overhead
 * (8 + 40 + 10 * 4).
 * <p>
 * The data locality of a tree of objects also has a large impact on
 * computational efficiency. Modern hardware relies heavily on caching and
 * prefetching to provide efficient access. Caching exploits the observation
 * that memory that was recently accessed is likely to be accessed again soon,
 * so keeping the most recently accessed data in very fast memory usually
 * results in the best performance. Data is cached in small blocks, which
 * are known as cache lines, to exploit another observation: data that is
 * stored in sequence is often accessed in sequence. Code that accesses
 * array[i] often proceeds to access array[i+1].
 * <p>
 * When a data structure is composed of many different objects, an operation
 * on the information might need to access several objects to locate the
 * actual data. However, a tree of related objects cannot be guaranteed
 * to be close enough in memory to appear in the same block of cached memory.
 * Some JVM configurations attempt to keep related objects close to
 * each other in memory, but this is not always possible.
 * Even when the JVM can place objects next to each other, the space
 * that is required by the object header lies between the objects, possibly
 * reducing the benefit.
 *
 * @author Haifeng Li
 */
package smile.data;
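
The overhead arithmetic in the package comment can be reproduced with a short sketch. The class `OverheadSketch` and its methods are hypothetical and are not part of Smile. The sizes are the simplified figures the comment assumes (a 4-byte class reference, a 4-byte array length, 4-byte object references), not measurements from any real JVM. For comparison, the sketch also counts the overhead of holding the same values in two parallel `double[]` arrays, which is an illustrative alternative layout rather than something the comment prescribes.

```java
final class OverheadSketch {
    // Simplified sizes taken from the comment's assumptions (not measured on a real JVM).
    static final int CLASS_REF_BYTES = 4;
    static final int ARRAY_LENGTH_BYTES = 4;
    static final int REFERENCE_BYTES = 4;
    static final int DOUBLE_BYTES = 8;

    /** Payload bytes of n complex values, each holding two doubles. */
    static int dataBytes(int n) {
        return n * 2 * DOUBLE_BYTES;
    }

    /** Overhead bytes of a Complex[n] in which every element is a distinct object. */
    static int arrayOfObjectsOverhead(int n) {
        int arrayHeader = CLASS_REF_BYTES + ARRAY_LENGTH_BYTES; // class + length
        int references = n * REFERENCE_BYTES;                   // one reference per element
        int objectHeaders = n * CLASS_REF_BYTES;                // one header per Complex instance
        return arrayHeader + references + objectHeaders;
    }

    /** Overhead bytes of two parallel double[] arrays (real and imaginary parts). */
    static int parallelArraysOverhead() {
        return 2 * (CLASS_REF_BYTES + ARRAY_LENGTH_BYTES);
    }

    /** Extra bits for byte[1]: a 64-bit header plus 3 bytes of padding. */
    static int singleByteArrayOverheadBits() {
        return 64 + 3 * 8;
    }

    public static void main(String[] args) {
        int n = 10;
        System.out.println("data bytes:               " + dataBytes(n));
        System.out.println("array-of-objects extra:   " + arrayOfObjectsOverhead(n));
        System.out.println("parallel-arrays extra:    " + parallelArraysOverhead());
        System.out.println("byte[1] extra bits:       " + singleByteArrayOverheadBits());
    }
}
```

Under these assumptions, ten values carry 160 bytes of data and 88 bytes of overhead as an array of objects, while the two primitive arrays pay only for their own two headers. Real JVMs use larger headers and alignment padding, so actual figures differ.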




