Data Loader
The data loader is launched from the File menu of a Data box. From this
File menu, select Load Data...; you will be asked to select a data
file, and once you do, the data loader dialog will be displayed.
The data loader is capable of loading tabular
and covariance data files, with variations of each.
A tabular data file stores a rectangular mixed
continuous/discrete dataset, with optional variable definitions and
knowledge specification. A covariance data file
stores a lower-triangular covariance matrix, with optional knowledge
specification.
Figure 1. Flat mixed data file, panels (a) and (b).
It is perhaps easiest to see how data loading works for tabular data
files by looking at some examples. Figure 1(a) shows a simple mixed
data file; in this file, X and Y are continuous, Z is ambiguously discrete,
and W is unambiguously discrete. To load this simple file, we use the
following parameter settings:
- File type = Tabular Data
- Delimiter = Whitespace
- Variable names in first row of data? Yes.
- Case ID's provided? No.
- Comment marker = //
- Quote character = "
- Missing value marker = *
- Integer columns with up to 10 values are discrete.
Note that some of these
parameters do not even apply to this file; we just accept the defaults.
The delimiter is the character pattern used to separate tokens on a
single line; in this case, any whitespace is sufficient. In this file,
variable names do appear in the first row of the data; they are X, Y,
Z, and W. It is not necessary to include variable names, but if they
are not included, this checkbox should be unchecked. Case IDs are
identifiers for each row in the data set; in this file they are not
included. A comment line is a line beginning with the comment
marker. This file contains no comment lines. A quote
character is either ' or " and is used in pairs to surround quoted
text; this file contains no quoted text. A missing
value marker stands in for missing values in the data set.
This file contains one missing value, indicated by an asterisk (*),
so that option is selected. Finally, integer columns, like
W, can be interpreted either as continuous or as discrete. The final
parameter specifies a cutoff for the number of distinct values in a
column if it is to be considered discrete; in this case, the cutoff is
set to 10. So when W is read in, it will be considered discrete.
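The way these parameters interact can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the data loader's actual implementation: the function names are hypothetical, quote handling and case IDs are left out, and treating a non-numeric column as discrete is an assumption.

```python
def _parses(token, cast):
    """Return True if cast(token) succeeds (cast is int or float)."""
    try:
        cast(token)
        return True
    except ValueError:
        return False


def classify(values, max_discrete=10):
    """Guess a column type from its string tokens (None marks a missing value)."""
    present = [v for v in values if v is not None]
    if all(_parses(v, int) for v in present):
        # Integer column: discrete only if it has at most max_discrete distinct values.
        return "discrete" if len(set(present)) <= max_discrete else "continuous"
    if all(_parses(v, float) for v in present):
        return "continuous"
    # Assumption: anything non-numeric is treated as discrete.
    return "discrete"


def load_tabular(text, comment_marker="//", missing_marker="*", max_discrete=10):
    """Parse whitespace-delimited text whose first non-comment row holds variable names."""
    rows = [
        line.split()  # no argument: split on any run of whitespace
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_marker)
    ]
    names, records = rows[0], rows[1:]
    columns = {}
    for j, name in enumerate(names):
        raw = [None if rec[j] == missing_marker else rec[j] for rec in records]
        columns[name] = (classify(raw, max_discrete), raw)
    return columns
```

Calling load_tabular on a small file with the default settings returns, for each variable name, the guessed type and the column's tokens, with None standing in for each missing value marker. Lowering max_discrete below the number of distinct values in an integer column makes that column continuous instead.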
Once the data loading
parameters are correct, the data is loaded by pressing Load, at
which point focus shifts from the File tab to the Loading
Log tab. Information about the loading process is displayed in the
Loading Log tab, along with information about read errors.
Figure 1(b) shows how the data loading dialog looks if
multiple data sets are selected. Note that in this case, to the left
of the Load button, new buttons are added, Previous and Next, to
navigate through the data sets, together with an indication (1 / 10)
as to which data set is currently in focus. To load a series of data
sets, click Load, then Next, then Load, then Next, and so on, until
all of the data sets have been loaded. As each data set is loaded,
an asterisk appears next to its progress indicator--thus, *1 / 10.
One can tell that all data sets have been loaded by clicking back
and forth through the data sets using the Previous and Next buttons
and making sure all progress indicators have an asterisk beside them.
When finished loading all data sets, click Save.
Figure 2 shows an example
of a tabular data file with optional variable definitions and knowledge
specification. Comments are included in Figure 2, using the indicated
comment marker, which must appear at the beginning of each comment line.
Figure 2. A more complex tabular data example.
It is important to
understand why a variable might be defined in a /variables section and
why knowledge might be given in a /knowledge section, as these are
optional sections of the file. Note that if a /variables section is
provided, the data must be prefaced by a line containing only data,
to indicate where the data itself begins.
The reason to define a
variable in the /variables section is that the definition guessed
at by the data loader is incorrect. One may want to specify
explicitly that an integer-valued variable is continuous, for instance, if
one needs to have a variable V1 with a small number of values be
discrete but another variable V2 with a larger number of values be
continuous. Likewise, one may wish to gain some fine control over the
precise definition of a discrete variable, adding in categories
that do not appear in the relevant data column. An example of a
variable being specified as continuous is W in Figure 2; an example of
a variable being specified as discrete in a particular way is Z in
Figure 2.
The reason for including
knowledge (of forbidden and required edges) in a data file is that
several operations may need to be performed on the data set using
common knowledge, so it is a matter of convenience to associate the
knowledge with the data file itself, rather than redefining or copying
the knowledge each time it is used. Knowledge is specified in a
/knowledge section, which may optionally contain any of the following
three sections: addtemporal, forbiddirect, and
requiredirect. The addtemporal section specifies temporal
tiers. A list of rows is specified, each beginning with the number of a
tier and followed by a list of variables for that tier. Edges from
nodes in higher-numbered tiers to nodes in lower-numbered tiers will be
forbidden. The forbiddirect section consists of a list of rows
with two variables each. The edge from the first variable to the
second variable is explicitly forbidden. The requiredirect
section consists of a list of rows with two variables each. The
edge from the first variable to the second variable is explicitly
required.
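To make the meaning of the three subsections concrete, here is a minimal Python sketch that reads the rows following a /knowledge marker and derives the edges forbidden by the temporal tiers. The function names are hypothetical, and the sketch assumes whitespace-separated rows and that no variable shares a name with a subsection keyword. It illustrates the rules above and is not the loader's own code.

```python
def parse_knowledge(lines):
    """Split knowledge rows into tiers, forbidden edges, and required edges."""
    tiers, forbidden, required = {}, set(), set()
    section = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] in ("addtemporal", "forbiddirect", "requiredirect"):
            section = tokens[0]
        elif section == "addtemporal":
            # Row format: tier number followed by the variables in that tier.
            tiers.setdefault(int(tokens[0]), []).extend(tokens[1:])
        elif section == "forbiddirect":
            forbidden.add((tokens[0], tokens[1]))  # first -> second is forbidden
        elif section == "requiredirect":
            required.add((tokens[0], tokens[1]))   # first -> second is required
    return tiers, forbidden, required


def tier_forbidden_edges(tiers):
    """Edges from higher-numbered tiers to lower-numbered tiers are forbidden."""
    return {
        (a, b)
        for hi, hi_vars in tiers.items()
        for lo, lo_vars in tiers.items()
        if hi > lo
        for a in hi_vars
        for b in lo_vars
    }
```

For example, with X and Y in tier 1 and Z in tier 2, tier_forbidden_edges yields the edges Z to X and Z to Y, while edges within a tier or from tier 1 to tier 2 remain allowed.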
An example of a covariance
data file being loaded is given in Figure 3. Note the form of the
covariance file. It begins with an optional line containing only
"/covariance". This is followed by the sample size, in a single
line. The next line contains the list of variable names. Then, the
lower triangle of the covariance matrix is specified. Finally, and
optionally, knowledge may be specified.
Figure 3. An example covariance file.
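The covariance layout described above can be read with a short Python sketch like the following. It is illustrative only: parse_covariance is a hypothetical name, and the sketch assumes whitespace-separated values, that row i of the lower triangle holds i + 1 entries including the diagonal, and that any trailing knowledge section is handled separately.

```python
def parse_covariance(text):
    """Return (sample_size, variable_names, full symmetric matrix)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] == "/covariance":
        lines = lines[1:]  # the marker line is optional
    sample_size = int(lines[0])
    names = lines[1].split()
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        values = [float(tok) for tok in lines[2 + i].split()]
        if len(values) != i + 1:
            raise ValueError(f"row {i + 1} should have {i + 1} entries")
        for j, value in enumerate(values):
            # Mirror each lower-triangle entry into the upper triangle.
            matrix[i][j] = value
            matrix[j][i] = value
    return sample_size, names, matrix
```

Only the lower triangle needs to be stored in the file because a covariance matrix is symmetric; the sketch rebuilds the full matrix by mirroring each entry across the diagonal.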