smile.data.package-info


smile-base

There is a newer version: 4.2.0

/*
 * Copyright (c) 2010-2021 Haifeng Li. All rights reserved.
 *
 * Smile is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Smile is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Smile. If not, see .
 */

/**
 * Data and attribute encapsulation classes. A data set is a collection of
 * datum objects, which are usually defined by attribute-value pairs. A datum
 * object can be very sparse, in which case it is stored as a list to save
 * space. A datum object may have an associated class label (for
 * classification) or a real-valued response value (for regression).
 * Optionally, a datum object or attribute may have a (positive) weight value,
 * whose meaning depends on the application. However, most machine learning
 * methods are not able to utilize this extra weight information.
 * <p>
 * There are, generally speaking, two major types of attributes:
 * <dl>
 * <dt>Qualitative variables:</dt>
 * <dd>The data values are non-numeric categories. Examples: blood type,
 * gender.</dd>
 * <dt>Quantitative variables:</dt>
 * <dd>The data values are counts or numerical measurements. A quantitative
 * variable can be either discrete, such as the number of students receiving
 * an 'A' in a class, or continuous, such as GPA or salary.</dd>
 * </dl>
 * Another way of classifying data is by measurement scale. In statistics,
 * there are four generally used measurement scales:
 * <dl>
 * <dt>Nominal data:</dt>
 * <dd>Data values are non-numeric group labels. For example, a gender
 * variable can be encoded as male = 0 and female = 1.</dd>
 * <dt>Ordinal data:</dt>
 * <dd>Data values are categorical and may be ranked in some numerically
 * meaningful way. For example, "strongly disagree" to "strongly agree" may
 * be encoded as 1 to 5.</dd>
 * <dt>Continuous data:</dt>
 * <dd>
 * <dl>
 * <dt>Interval data:</dt>
 * <dd>Data values range over a real interval, which can be as large as
 * from negative infinity to positive infinity. The difference between two
 * values is meaningful, but the ratio of two values is not. Examples:
 * temperature, IQ.</dd>
 * <dt>Ratio data:</dt>
 * <dd>Both the difference and the ratio of two values are meaningful.
 * Examples: salary, weight.</dd>
 * </dl>
 * </dd>
 * </dl>
 * Besides the semantics of data, it is also very important to pay attention
 * to the memory efficiency of data. In the Java memory model, every field of
 * an object is either a primitive data type, such as byte or int, or a
 * reference (pointer) to an object. Arrays of primitive data types, such as
 * {@code char[]}, are also objects. One disadvantage of this model is that,
 * when you follow object-oriented design practices, data types are often
 * composed of many different types in order to encapsulate both state and
 * behavior. As a result, one data type can represent an entire tree of
 * objects, which carries a high cost in memory overhead and poor locality.
 * <p>
 * The cost of having many objects is that each object in a JVM must have some
 * metadata associated with it, for example the {@code java.lang.Class} value
 * that represents the type of the object, or the length of an array object.
 * The most common approach is to place this metadata at the start of the
 * object, creating an object header.
 * <p>
 * For a large or complex object, the size of the header is relatively
 * insignificant. For a small object, however, the size of the header can
 * become significant. For {@code byte[1]}, 64 bits of metadata are often
 * required for a single 8-bit value. Additionally, the JVM is likely to add
 * at least 3 bytes (24 bits) of padding to ensure that the subsequent object
 * in the heap starts on an aligned address. The total extra memory
 * requirement for 8 bits of data is therefore 64 + 24 = 88 bits.
 * Every object has a similar associated overhead, so the more objects you
 * have, the greater the effect on system resources.
 * <p>
 * The structure of Java arrays can exaggerate this overhead. Consider
 * an array of Complex objects. Each instance of the Complex class has
 * two double values, of 64 bits each, plus the object header. Assuming
 * that the header is just the class reference and occupies only 32 bits,
 * each Complex instance is 16 bytes of data and 4 bytes of extra overhead.
 * An array of 10 Complex objects consists of the array header (class + length
 * = 8 bytes) plus 10 object references (assuming 4 bytes each = 40 bytes).
 * If each element of the array refers to a unique Complex object, the total
 * is 160 bytes of data but 8 + 40 + 40 = 88 bytes of additional overhead.
 * <p>
 * The data locality of a tree of objects also has a huge impact on compute
 * efficiency. Modern hardware relies heavily on caching and prefetching
 * to provide efficient access. Caching exploits the observation that
 * memory that was recently accessed is likely to be accessed again soon,
 * so keeping the most recently accessed data in very fast memory usually
 * results in the best performance. Data is cached in small blocks, which
 * are known as cache lines, to exploit another observation: data that is
 * stored in sequence is often accessed in sequence. Code that accesses
 * {@code array[i]} often proceeds to access {@code array[i+1]}.
 * <p>
 * When a data structure is composed of many different objects, an operation
 * on the information might need to access several objects to locate the
 * actual data. However, a tree of related objects cannot be guaranteed
 * to be close enough in memory to appear in the same block of cached memory.
 * Some JVM configurations attempt to keep related objects close to
 * each other in memory, but this is not always possible.
 * Even when the JVM can place objects next to each other, the space
 * that is required by the object headers lies between the objects, possibly
 * reducing the benefit.
 *
 * @author Haifeng Li
 */
package smile.data;
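
The overhead arithmetic in the package documentation can be checked with a small sketch. The class below is hypothetical and not part of Smile: the names `ComplexLayoutSketch` and `ComplexColumns` are invented for illustration, and the header and reference sizes are the simplified assumptions stated in the documentation (4-byte object header, 8-byte array header, 4-byte references), not measurements of any particular JVM. It also shows one common way to avoid per-object headers and improve locality, which is to store the real and imaginary parts in two primitive arrays.

```java
/** Hypothetical sketch; names and sizes are assumptions, not Smile API. */
public class ComplexLayoutSketch {
    // Simplified sizes assumed by the package documentation (not measured).
    static final int OBJECT_HEADER_BYTES = 4; // header is only a 32-bit class reference
    static final int ARRAY_HEADER_BYTES = 8;  // class reference + length
    static final int REFERENCE_BYTES = 4;     // one array slot pointing to an object

    /** Bytes of actual data for n complex numbers (two doubles each). */
    static long payloadBytes(int n) {
        return 2L * n * Double.BYTES;
    }

    /** Overhead of a Complex[n] whose elements are n distinct objects. */
    static long objectArrayOverhead(int n) {
        return ARRAY_HEADER_BYTES
             + (long) n * REFERENCE_BYTES       // references stored in the array
             + (long) n * OBJECT_HEADER_BYTES;  // one header per Complex object
    }

    /** Overhead of a columnar layout: just the headers of two double[] arrays. */
    static long columnarOverhead() {
        return 2L * ARRAY_HEADER_BYTES;
    }

    /** Columnar storage: element i is (re[i], im[i]), contiguous in memory. */
    static final class ComplexColumns {
        final double[] re;
        final double[] im;

        ComplexColumns(int n) {
            re = new double[n];
            im = new double[n];
        }

        void set(int i, double real, double imag) {
            re[i] = real;
            im[i] = imag;
        }

        /** Sequential scan over both arrays; no pointer chasing per element. */
        double sumOfSquaredMagnitudes() {
            double sum = 0.0;
            for (int i = 0; i < re.length; i++) {
                sum += re[i] * re[i] + im[i] * im[i];
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        int n = 10;
        System.out.println("data bytes:            " + payloadBytes(n));        // 160
        System.out.println("Complex[] overhead:    " + objectArrayOverhead(n)); // 88
        System.out.println("two double[] overhead: " + columnarOverhead());     // 16

        ComplexColumns z = new ComplexColumns(n);
        z.set(0, 3.0, 4.0);
        System.out.println("sum |z|^2: " + z.sumOfSquaredMagnitudes());         // 25.0
    }
}
```

Under these assumptions the object-per-element layout spends 88 bytes of overhead on 160 bytes of data, while the columnar layout spends 16. Real JVMs typically use larger headers and alignment padding, so actual figures will differ.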