Data Loader


The data loader is launched from the File menu of a Data box. From this File menu, select Load Data...; you will be asked to select a data file, and once you do, the data loader dialog will be displayed.

The data loader is capable of loading tabular and covariance data files, with variations of each. A tabular data file stores a rectangular mixed continuous/discrete dataset, with optional variable definitions and knowledge specification. A covariance data file stores a lower-triangular covariance matrix, with optional knowledge specification.

Figure 1. Flat mixed data file: (a) the data file; (b) the data loader dialog when multiple data sets are selected.

It is perhaps easiest to see how data loading works for tabular data files by looking at some examples. Figure 1(a) shows a simple mixed data file; in this file, X and Y are continuous, Z is ambiguously discrete, and W is unambiguously discrete. To load this simple file, we use the following parameter settings:

  • File type = Tabular Data
  • Delimiter = Whitespace
  • Variable names in first row of data? Yes.
  • Case ID's provided? No.
  • Comment marker = //
  • Quote character = "
  • Missing value marker = *
  • Integer columns with up to 10 values are discrete.

Note that some of these parameters do not even apply to this file; we just accept the defaults. The delimiter is the character pattern used to separate tokens on a single line; in this case, any whitespace is sufficient. In this file, variable names do appear in the first row of the data; they are X, Y, Z, and W. It is not necessary to include variable names, but if they are not included, this checkbox should be unchecked. Case IDs are identifiers for each row in the data set; in this file they are not included. A comment line is a line beginning with the comment marker; this file contains no comment lines. A quote character is either ' or " and is used in pairs to surround quoted text; this file contains no quoted text. A missing value marker stands in for missing values in the data set. This file contains one missing value, which is indicated using an asterisk (*), so that option is selected. Finally, integer columns, like W, can be interpreted either as continuous or as discrete. The final parameter specifies a cutoff: an integer column is considered discrete only if its number of distinct values does not exceed the cutoff. In this case, the cutoff is set to 10, so when W is read in, it will be considered discrete.
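The effect of these settings can be sketched in Python. This is only an illustration of the rules described above, not the data loader's actual implementation: the function names (load_tabular, infer_type) and the sample values are hypothetical, the sample is not the file shown in Figure 1, and quoting and case IDs are ignored.

```python
def is_int(token):
    """Return True if the token parses as an integer."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def is_float(token):
    """Return True if the token parses as a floating-point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def infer_type(tokens, max_discrete):
    """Guess a column type from its non-missing tokens (assumed rule)."""
    if all(is_int(t) for t in tokens):
        # Integer columns are discrete only up to the cutoff.
        return "discrete" if len(set(tokens)) <= max_discrete else "continuous"
    if all(is_float(t) for t in tokens):
        return "continuous"
    return "discrete"  # non-numeric tokens are treated as category labels


def load_tabular(text, comment="//", missing="*", max_discrete=10):
    """Parse whitespace-delimited text whose first row holds variable names."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith(comment)]
    names, records = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(names):
        raw = [record[i] for record in records]
        present = [v for v in raw if v != missing]
        columns[name] = {
            "values": [None if v == missing else v for v in raw],
            "type": infer_type(present, max_discrete),
        }
    return columns


# Invented sample data with one comment line and one missing value.
SAMPLE = """// invented sample
X Y W
1.5 2.0 0
0.3 * 1
2.2 1.1 0
"""

columns = load_tabular(SAMPLE)
for name, column in columns.items():
    print(name, column["type"], column["values"])
```

Lowering max_discrete below the number of distinct values in an integer column would make the sketch classify that column as continuous instead.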

Once the data loading parameters are correct, the data is loaded by pressing Load, at which point focus shifts from the File tab to the Loading Log tab. Information about the loading process is displayed in the Loading Log tab, along with information about read errors.

Figure 1(b) shows how the data loading dialog looks if multiple data sets are selected. Note that in this case, to the left of the Load button, new buttons are added, Previous and Next, to navigate through the data sets, together with an indication (1 / 10) as to which data set is currently in focus. To load a series of data sets, click Load, then Next, then Load, then Next, and so on, until all of the data sets have been loaded. As each data set is loaded, an asterisk appears next to its progress indicator--thus, *1 / 10. One can tell that all data sets have been loaded by clicking back and forth through the data sets using the Previous and Next buttons and making sure all progress indicators have an asterisk beside them. When finished loading all data sets, click "Save."

Figure 2 shows an example of a tabular data file with optional variable definitions and knowledge specification. Comments are included in Figure 2, using the indicated comment marker, which must appear at the beginning of each comment line.

Figure 2. A more complex tabular data example.

It is important to understand why a variable might be defined in a /variables section and why knowledge might be given in a /knowledge section, as these are optional sections of the file. Note that if a /variables section is provided, the data must be prefaced by a line containing only /data, to indicate where the data itself begins.

The reason to define a variable in the /variables section is that the definition guessed at by the data loader is incorrect. One may want to specify explicitly that an integral variable is continuous, for instance, if one needs a variable V1 with a small number of values to be discrete but another variable V2 with a larger number of values to be continuous. Likewise, one may wish to gain fine control over the precise definition of a discrete variable, adding categories that do not appear in the relevant data column. An example of a variable being specified as continuous is W in Figure 2; an example of a variable being specified as discrete in a particular way is Z in Figure 2.

The reason for including knowledge (of forbidden and required edges) in a data file is that several operations may need to be performed on the data set using common knowledge, so it is a matter of convenience to associate the knowledge with the data file itself, rather than redefining or copying the knowledge each time it is used. Knowledge is specified in a /knowledge section, which may optionally contain any of the following three sections: addtemporal, forbiddirect, and requiredirect. The addtemporal section specifies temporal tiers. It consists of a list of rows, each beginning with the number of a tier and followed by a list of variables for that tier. Edges from nodes in higher-numbered tiers to nodes in lower-numbered tiers will be forbidden. The forbiddirect section consists of a list of rows with two variables each. The edge from the first variable to the second variable is explicitly forbidden. The requiredirect section consists of a list of rows with two variables each. The edge from the first variable to the second variable is explicitly required.
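To make the tier rule concrete, here is a minimal Python sketch that turns the three knowledge sections into sets of forbidden and required edges. It is an illustration under stated assumptions, not the data loader's own code: the function name parse_knowledge is hypothetical, tokens are assumed to be whitespace-separated, each subsection is assumed to start with a line containing only its name, and the sample rows are invented.

```python
from itertools import product

SUBSECTIONS = ("addtemporal", "forbiddirect", "requiredirect")


def parse_knowledge(lines):
    """Return (forbidden, required) edge sets from knowledge-section lines."""
    tiers, forbidden, required = {}, set(), set()
    current = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if len(tokens) == 1 and tokens[0] in SUBSECTIONS:
            current = tokens[0]  # a subsection header line
        elif current == "addtemporal":
            # Row format: tier number followed by the variables in that tier.
            tiers.setdefault(int(tokens[0]), []).extend(tokens[1:])
        elif current == "forbiddirect":
            forbidden.add((tokens[0], tokens[1]))
        elif current == "requiredirect":
            required.add((tokens[0], tokens[1]))
    # Forbid every edge from a higher-numbered tier to a lower-numbered tier.
    for high, low in product(tiers, repeat=2):
        if high > low:
            forbidden.update(product(tiers[high], tiers[low]))
    return forbidden, required


# Invented example: X and Y are in tier 1, Z is in tier 2.
example = [
    "addtemporal",
    "1 X Y",
    "2 Z",
    "forbiddirect",
    "X Y",
    "requiredirect",
    "Y Z",
]
forbidden, required = parse_knowledge(example)
print(sorted(forbidden))
print(sorted(required))
```

In this example the tiers forbid Z → X and Z → Y, the forbiddirect row adds X → Y, and the requiredirect row requires Y → Z.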

An example of a covariance data file being loaded is given in Figure 3. Note the form of the covariance file. It begins with an optional line containing only "/covariance". This is followed by the sample size, in a single line. The next line contains the list of variable names. Then, the lower triangle of the covariance matrix is specified. Finally, and optionally, knowledge may be specified.
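The layout described above can be read with a short Python sketch. It is an illustration only, with several assumptions not confirmed by the text: the function name parse_covariance is hypothetical, values are whitespace-separated, row i of the lower triangle holds i + 1 values (that is, the diagonal is included), and any trailing knowledge section is ignored. The sample values are invented and are not those of Figure 3.

```python
def parse_covariance(text):
    """Return (sample_size, names, full symmetric matrix) from file text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] == "/covariance":  # the header line is optional
        lines = lines[1:]
    sample_size = int(lines[0])
    names = lines[1].split()
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row = [float(token) for token in lines[2 + i].split()]
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} should have {i + 1} values")
        for j, value in enumerate(row):
            # Mirror each lower-triangle entry into the upper triangle.
            matrix[i][j] = matrix[j][i] = value
    return sample_size, names, matrix


# Invented sample file contents.
SAMPLE_COV = """/covariance
100
X Y Z
1.0
0.5 2.0
0.2 0.3 1.5
"""

sample_size, names, matrix = parse_covariance(SAMPLE_COV)
print(sample_size, names)
for row in matrix:
    print(row)
```

Storing only the lower triangle is enough because a covariance matrix is symmetric, so the sketch fills in the upper triangle by mirroring.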

Figure 3. An example covariance file.




