


Defining Discrete Variables

Discrete variables in Tetrad may be described as follows.

  1. They are assumed to be nominal; that is, the order of categories doesn't matter for searches and estimations.
  2. When deciding whether two variables with the same name are equal, their categories are identified by name.
  3. When sending data to algorithms, categories are identified by index only.

Some comments. For point (1), assuming that all discrete variables are nominal is clearly a simplification, and in some cases it leads to a loss of information: if you knew that the categories of some variable carried ordinal information, you might be able to use tests of conditional independence that take advantage of it. For reasons of speed and flexibility, we've stayed with the nominal independence tests.

For point (2), the problem is that a variable "X" can be defined in two different boxes, say, two different Bayes Parametric Model boxes, or a Bayes Parametric Model box and a Data box. It's possible that the two variables have the same number of categories (in fact, when doing estimations, this is desirable) but that in one case the categories are <High, Medium, Low> while in the other they are <Low, Medium, High>. In this case, the mapping of categories should be High-->High, Medium-->Medium, Low-->Low and not High-->Low, Medium-->Medium, Low-->High. That is, the categories should be identified by name.
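
The name-based mapping described above can be sketched in a few lines of Java. This is only an illustration: the class CategoryRemap and the method indexMapByName are hypothetical names, not part of Tetrad's API, and the sketch assumes both variables carry exactly the same category names.

```java
import java.util.List;

/** Hypothetical helper for illustration; not a Tetrad class. */
final class CategoryRemap {

    /**
     * For each category index of the source variable, returns the index of
     * the category with the same name in the target variable.
     */
    static int[] indexMapByName(List<String> source, List<String> target) {
        if (source.size() != target.size()) {
            throw new IllegalArgumentException("Category counts differ");
        }
        int[] map = new int[source.size()];
        for (int i = 0; i < source.size(); i++) {
            // Match by name, never by position.
            int j = target.indexOf(source.get(i));
            if (j < 0) {
                throw new IllegalArgumentException(
                        "No category named " + source.get(i));
            }
            map[i] = j;
        }
        return map;
    }
}
```

With source categories <High, Medium, Low> and target categories <Low, Medium, High>, the returned map is [2, 1, 0]: High at index 0 maps to High at index 2, and so on, rather than position 0 mapping to position 0.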

However, as regards point (3), it is extremely inefficient, especially in Java, to force algorithms over discrete variables to deal with names of categories; algorithms need to deal with indices of categories. So when a column of data for variable X, with categories <High, Medium, Low>, is sent to an estimator, the estimator only knows that there are three categories for X, at indices 0, 1, and 2, respectively. It doesn't know the names of the categories.
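
To make the index-only view concrete, here is a minimal sketch of turning a column of category names into the integer codes an algorithm would see. CategoryEncoder and encode are hypothetical names chosen for this example; Tetrad's own data classes may store and convert values differently.

```java
import java.util.List;

/** Hypothetical helper for illustration; not a Tetrad class. */
final class CategoryEncoder {

    /**
     * Replaces each category name in the column with its index in the
     * variable's category list, so downstream code only handles ints.
     */
    static int[] encode(List<String> categories, List<String> column) {
        int[] codes = new int[column.size()];
        for (int row = 0; row < column.size(); row++) {
            int index = categories.indexOf(column.get(row));
            if (index < 0) {
                throw new IllegalArgumentException(
                        "Unknown category: " + column.get(row));
            }
            codes[row] = index;
        }
        return codes;
    }
}
```

For categories <High, Medium, Low>, the column Low, High, High, Medium is encoded as 2, 0, 0, 1. An estimator receiving those codes needs only the number of categories (three), not their names.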

Points (2) and (3) are reconciled in Tetrad using a "bulletin board" system. The first time a list of categories is encountered, it is posted on a "bulletin board." After that, if that same list of categories is encountered again, but in some permuted order, the version from the "bulletin board" is retrieved and used instead. So any particular list of categories can only appear in one order in Tetrad. (This does not imply that the variables are ordinal; algorithms still interpret these variables as nominal, in that they employ statistical tests that do not take advantage of ordinality.)
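
One way to picture the "bulletin board" is as a map from the unordered set of category names to the first ordering in which that set was seen. The sketch below is written under that assumption; the class CategoryBoard and its canonical method are hypothetical and are not Tetrad's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a "bulletin board"; not a Tetrad class. */
final class CategoryBoard {

    // Key: the category names as an unordered set.
    // Value: the ordering that was posted first for that set.
    private final Map<Set<String>, List<String>> posted = new HashMap<>();

    /**
     * Returns the posted ordering for this set of categories, posting the
     * given ordering if the set has not been seen before.
     */
    List<String> canonical(List<String> categories) {
        Set<String> key = new HashSet<>(categories);
        if (key.size() != categories.size()) {
            throw new IllegalArgumentException("Duplicate category names");
        }
        return posted.computeIfAbsent(key,
                k -> Collections.unmodifiableList(new ArrayList<>(categories)));
    }
}
```

Asking such a board for <Low, High> and then for <High, Low> returns <Low, High> both times, so any code that later converts names to indices sees one consistent ordering.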

You can see the effects of the "bulletin board" system, for example, in the following situations:

  • If you've specified a Bayes Parametric Model and then read data in from a file for the same variables, the order of the categories for the data will be the same as the order of categories in your Bayes PM. Estimations that take Bayes PMs and discrete data sets as parents will work smoothly.
  • If you create a Bayes PM with variable X, with categories <Low, High>, and later create another Bayes PM with variable X with categories <High, Low>, the order of categories in the second case will be adjusted to <Low, High>.

 




