            Defining Discrete Variables 
        
    


Discrete variables in Tetrad may be described as follows. 

    (1) They are assumed to be nominal--that is, the order of categories doesn't matter for searches and estimations.
    
    (2) When deciding whether two variables with the same name are equal, their categories are identified by name.
    (3) When sending data to algorithms, categories are identified by index only.

Some comments. For point (1), assuming that all discrete variables are nominal is clearly a simplification, and in
    some cases it leads to a loss of information: if you knew that the categories for some variable carried
    ordinal information, you might be able to use tests of conditional independence that take advantage of this
    information. For reasons of speed and flexibility, we've stayed with the nominal independence tests.
For point (2), the problem is that a variable "X" can be defined in two different boxes--say, two different
    Bayes Parametric Model boxes or a Bayes Parametric Model box and a Data box. It's possible that the two
    variables have the same number of categories (in fact, when doing estimations, this is desirable) but that in the
    one case the categories are <High, Medium, Low> while in the other case the categories are <Low, Medium,
    High>. In this case, the mapping of categories should be High-->High, Medium-->Medium, Low-->Low and not
    High-->Low, Medium-->Medium, Low-->High. That is, the categories should be identified by name.
However, as regards point (3), it is extremely inefficient, especially in Java, to force algorithms over discrete
    variables to deal with names of categories; algorithms need to deal with indices of categories. So when
    sending a column of data with variable X, with categories <High, Medium, Low> to an estimator, the estimator
    only knows that there are three categories for X, at indices 0, 1, and 2, respectively. It doesn't know about the
    names of the categories.
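The relationship between the by-name mapping of point (2) and the index-only view of point (3) can be sketched as follows. This is a minimal illustration, not Tetrad's actual code: the class `CategoryRemap` and its methods `indexMap` and `recode` are hypothetical names, and the sketch assumes that category names are unique within each list.

```java
import java.util.List;

/** Hypothetical helper, not part of Tetrad: recodes category indices by matching names. */
public class CategoryRemap {

    /** For each index in {@code from}, finds the index of the same-named category in {@code to}. */
    public static int[] indexMap(List<String> from, List<String> to) {
        if (from.size() != to.size()) {
            throw new IllegalArgumentException("Category lists differ in size");
        }
        int[] map = new int[from.size()];
        for (int i = 0; i < from.size(); i++) {
            int j = to.indexOf(from.get(i));
            if (j < 0) {
                throw new IllegalArgumentException("No category named " + from.get(i));
            }
            map[i] = j; // High-->High, Medium-->Medium, Low-->Low, whatever the positions
        }
        return map;
    }

    /** Rewrites a column of category indices using the map; names are no longer needed here. */
    public static int[] recode(int[] column, int[] map) {
        int[] out = new int[column.length];
        for (int row = 0; row < column.length; row++) {
            out[row] = map[column[row]];
        }
        return out;
    }
}
```

In this sketch the names are consulted once, when the map is built. After that, a column coded against <High, Medium, Low> can be recoded against <Low, Medium, High> using integer lookups only, which is the form an estimator would work with.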
Points (2) and (3) are reconciled in Tetrad using a "bulletin board" system. The first time a list of
    categories is encountered, it is posted on a "bulletin board." After that, if that same list of categories
    is encountered again, but in some permuted order, the version from the "bulletin board" is retrieved and
    used instead. So any particular list of categories can only appear in one order in Tetrad. (This does not imply that
    the variables are ordinal; algorithms still interpret these variables as nominal, in that they employ statistical
    tests that do not take advantage of ordinality.)
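One way to picture the "bulletin board" is as a lookup table keyed by the unordered set of category names. The sketch below is an assumption about how such a table could be built, not a description of Tetrad's implementation; the class `CategoryBoard` and its method `canonical` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a "bulletin board" for category lists; not Tetrad's actual class. */
public class CategoryBoard {

    // Keyed by the unordered set of names, so every permutation of a list shares one entry.
    private final Map<Set<String>, List<String>> board = new HashMap<>();

    /** Returns the ordering that was posted first for this set of category names. */
    public List<String> canonical(List<String> categories) {
        Set<String> key = new HashSet<>(categories);
        if (key.size() != categories.size()) {
            throw new IllegalArgumentException("Duplicate category names: " + categories);
        }
        // First encounter: post an unmodifiable copy. Later encounters: retrieve the posted copy.
        return board.computeIfAbsent(key,
                k -> Collections.unmodifiableList(new ArrayList<>(categories)));
    }
}
```

With this sketch, posting <Low, High> and then asking for <High, Low> returns <Low, High> both times, so a given list of categories appears in only one order.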
You can see the effects of the "bulletin board" system, for example, in the following situations:

    If you've specified a Bayes Parametric Model and then read data in from a file for the same variables, the order
        of the categories for the data will be the same as the order of categories in your Bayes PM. Estimations that
        take Bayes PMs and discrete data sets as parents will work smoothly.
    
    If you create a Bayes PM with variable X, with categories <Low, High>, and later create another Bayes PM
        with variable X with categories <High, Low>, the order of categories in the second case will be adjusted
        to <Low, High>.