Defining Discrete Variables
Discrete variables in Tetrad may be described as follows.
1. They are assumed to be nominal--that is, the order of categories doesn't matter for searches and estimations.
2. When deciding whether two variables with the same name are equal, their categories are identified by name.
3. When data is sent to algorithms, categories are identified by index only.
Some comments. For point (1), assuming that all discrete variables are nominal is clearly a simplification, and
in some cases it loses information: if you knew that the categories of some variable carried ordinal
information, you might be able to use tests of conditional independence that took advantage of it.
For reasons of speed and flexibility, we've stayed with the nominal independence tests.
For point (2), the problem is that a variable "X" can be defined in two different boxes--say, two different
Bayes Parametric Model boxes or a Bayes Parametric Model box and a Data box. It's possible that the two
variables have the same number of categories (in fact, when doing estimations, this is desirable) but that in the
one case the categories are <High, Medium, Low> while in the other case the categories are <Low, Medium,
High>. In this case, the mapping of categories should be High-->High, Medium-->Medium, Low-->Low and not
High-->Low, Medium-->Medium, Low-->High. That is, the categories should be identified by name.
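As an illustration of matching categories by name, here is a minimal Java sketch. It is not Tetrad's actual code: the class CategoryRemap and the method indexMapByName are hypothetical names introduced only for this example. Given two orderings of the same categories, it computes, for each index in the first ordering, the index of the category with the same name in the second.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Tetrad API.
class CategoryRemap {

    // Returns an array where map[i] is the index in 'target' of the
    // category that sits at index i in 'source', matched by name.
    static int[] indexMapByName(List<String> source, List<String> target) {
        int[] map = new int[source.size()];
        for (int i = 0; i < source.size(); i++) {
            int j = target.indexOf(source.get(i));
            if (j < 0) {
                throw new IllegalArgumentException(
                        "No category named " + source.get(i) + " in target");
            }
            map[i] = j;
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("High", "Medium", "Low");
        List<String> second = Arrays.asList("Low", "Medium", "High");
        // High (0) maps to 2, Medium (1) to 1, Low (2) to 0.
        System.out.println(Arrays.toString(indexMapByName(first, second))); // prints [2, 1, 0]
    }
}
```

Matching by position instead would pair index 0 with index 0, which is exactly the High-->Low mistake described above.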
However, as regards point (3), it is extremely inefficient, especially in Java, to force algorithms over discrete
variables to deal with names of categories; algorithms need to deal with indices of categories. So when
a column of data for variable X, with categories <High, Medium, Low>, is sent to an estimator, the estimator
only knows that there are three categories for X, at indices 0, 1, and 2, respectively. It doesn't know about the
names of the categories.
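To make the index-only view concrete, the following Java sketch encodes a column of category names as integer indices. The names ColumnEncoder and encode are assumptions made for this example and do not come from Tetrad; the point is only that everything downstream of the encoding step sees integers, not strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Tetrad API.
class ColumnEncoder {

    // Converts each value in 'column' to its index in 'categories'.
    static int[] encode(List<String> categories, List<String> column) {
        int[] encoded = new int[column.size()];
        for (int row = 0; row < column.size(); row++) {
            int index = categories.indexOf(column.get(row));
            if (index < 0) {
                throw new IllegalArgumentException(
                        "Unknown category: " + column.get(row));
            }
            encoded[row] = index;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("High", "Medium", "Low");
        List<String> column = Arrays.asList("Low", "High", "High", "Medium");
        // An algorithm would receive only these indices plus the count 3.
        System.out.println(Arrays.toString(encode(categories, column))); // prints [2, 0, 0, 1]
    }
}
```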
Points (2) and (3) are reconciled in Tetrad using a "bulletin board" system. The first time a list of
categories is encountered, it is posted on a "bulletin board." After that, if that same list of categories
is encountered again, but in some permuted order, the version from the "bulletin board" is retrieved and
used instead. So any particular list of categories can only appear in one order in Tetrad. (This does not imply that
the variables are ordinal; algorithms still interpret these variables as nominal, in that they employ statistical
tests that do not take advantage of ordinality.)
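One way such a "bulletin board" could work is sketched below in Java. This is an assumption-laden illustration, not Tetrad's implementation: the class CategoryBulletinBoard and its method canonical are hypothetical, and the sketch assumes category names within a list are distinct. The board is keyed by the set of category names, so any permutation of a list that has already been posted resolves to the order that was posted first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration only; not Tetrad's actual class.
class CategoryBulletinBoard {

    // Key: the set of category names (order-insensitive).
    // Value: the order in which that set was first posted.
    private final Map<Set<String>, List<String>> posted = new HashMap<>();

    // Returns the first-posted ordering for this set of categories,
    // posting the given ordering if the set has not been seen before.
    List<String> canonical(List<String> categories) {
        Set<String> key = new HashSet<>(categories);
        return posted.computeIfAbsent(key,
                k -> Collections.unmodifiableList(new ArrayList<>(categories)));
    }

    public static void main(String[] args) {
        CategoryBulletinBoard board = new CategoryBulletinBoard();
        System.out.println(board.canonical(Arrays.asList("Low", "High"))); // prints [Low, High]
        System.out.println(board.canonical(Arrays.asList("High", "Low"))); // prints [Low, High]
    }
}
```

Using a set as the key is a design choice for this sketch: two lists count as "the same list in permuted order" exactly when they contain the same names, regardless of position.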
You can see the effects of the "bulletin board" system, for example, in the following situations:
- If you've specified a Bayes Parametric Model and then read data in from a file for the same variables, the order
of the categories for the data will be the same as the order of categories in your Bayes PM. Estimations that take
Bayes PMs and discrete data sets as parents will work smoothly.
- If you create a Bayes PM with variable X, with categories <Low, High>, and later create another Bayes PM
with variable X with categories <High, Low>, the order of categories in the second case will be adjusted
to <Low, High>.