



    
        
            Handling Missing Data 
        
    


Missing data is represented in both the tabular and covariance data editors by an asterisk ("*"). There are
    three ways to create missing data values in a data set:

    In the Data Editor, replace the relevant entries with asterisks, by hand.
    When loading data, load data that contains missing values.
    From the Tools menu in the Data Editor, select "Inject Missing Values Randomly."

The first way explicitly marks a particular datum as missing. The second preserves the missing values already
    present in a data file. The third inserts missing values at random into a data set, usually to test how an
    algorithm performs when data are missing.
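As an illustration of the third option, here is a minimal Python sketch of random injection. It is not Tetrad's implementation: the function name inject_missing, the per-cell probability parameter prob, and the seed argument are assumptions made for this example, and the manual does not say how the Tools menu item chooses which cells to blank.

```python
import random

MISSING = "*"  # the marker the data editors use for a missing value


def inject_missing(rows, prob, seed=None):
    """Return a copy of rows in which each cell is independently
    replaced by MISSING with probability prob (an assumed scheme)."""
    rng = random.Random(seed)
    return [
        [MISSING if rng.random() < prob else cell for cell in row]
        for row in rows
    ]


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noisy = inject_missing(data, prob=0.3, seed=0)  # original data is untouched
```

Fixing the seed makes a test run repeatable, which helps when comparing how several algorithms degrade on the same damaged data set.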
Usually it is a good idea to remove or impute missing data before handing it over to an algorithm. With tabular data,
    one can remove cases containing missing values from a data set by selecting Tools-->Missing
    Values-->Remove Cases with Missing Values. Selecting instead Tools-->Missing Values-->Replace Missing Values
    with Column Mode replaces each missing value with the mode of its column; this works for both continuous and
    discrete data. Tools-->Missing Values-->Replace Missing Values with Column Mean does the same using the
    column mean, but operates only on continuous columns. Finally, Tools-->Missing Values-->Replace Missing Values
    with Extra Category works on discrete variables only; it adds an extra category to each variable to mark where
    the missing values were.
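The four operations above can be sketched in a few lines of Python. This is an illustration of the ideas only, not Tetrad code: the helper names and the "Missing" category label are hypothetical, and each imputation assumes the column has at least one observed value.

```python
from collections import Counter
from statistics import mean

MISSING = "*"  # the marker the data editors use for a missing value


def remove_cases(rows):
    """Drop every case (row) that contains at least one missing value."""
    return [row for row in rows if MISSING not in row]


def fill_column(rows, j, value):
    """Return a copy of rows with the missing entries of column j set to value."""
    return [
        [value if (k == j and cell == MISSING) else cell
         for k, cell in enumerate(row)]
        for row in rows
    ]


def observed(rows, j):
    """The non-missing values of column j."""
    return [row[j] for row in rows if row[j] != MISSING]


def replace_with_mode(rows, j):
    # Works for discrete or continuous columns.
    mode = Counter(observed(rows, j)).most_common(1)[0][0]
    return fill_column(rows, j, mode)


def replace_with_mean(rows, j):
    # Only meaningful for continuous columns.
    return fill_column(rows, j, mean(observed(rows, j)))


def replace_with_extra_category(rows, j, label="Missing"):
    # Discrete columns only; "Missing" is a hypothetical label.
    return fill_column(rows, j, label)


rows = [["a", 1.0], ["*", 2.0], ["a", "*"], ["b", 6.0]]
complete = remove_cases(rows)          # keeps the first and last case
by_mode = replace_with_mode(rows, 0)   # "*" in column 0 becomes "a"
by_mean = replace_with_mean(rows, 1)   # "*" in column 1 becomes 3.0
```

Removing cases discards information in the other columns of each dropped row, whereas the three replacement options keep every case at the cost of inventing a value.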
Tools-->Missing Values-->Replace Missing Values with Regression Predictions imputes missing values using a
    regression model, as follows. First, in a copy of the data set, missing values are imputed using column means.
    Then, for each missing value, the column containing it is regressed onto the remaining columns in the data set,
    and the missing value is replaced by the regression's prediction for that case, computed from the values in the
    data set copy.
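The two-step procedure can be sketched as follows in plain Python. This is a reading of the description above rather than Tetrad's actual code: it assumes all columns are continuous, uses None as the missing marker, fits an ordinary least-squares model with an intercept on the mean-filled copy, and uses hypothetical helper names (solve, fit_ols, impute_by_regression).

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination
    with partial pivoting (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(xs, ys):
    """Least-squares coefficients, intercept first, via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


def impute_by_regression(rows):
    """rows: lists of floats, with None marking a missing value."""
    n_cols = len(rows[0])
    # Step 1: build a copy in which missing values are filled with column means.
    means = []
    for j in range(n_cols):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs))
    filled = [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]
    # Step 2: regress each incomplete column on the others and predict.
    result = [r[:] for r in rows]
    for j in range(n_cols):
        if all(r[j] is not None for r in rows):
            continue
        others = [[v for k, v in enumerate(r) if k != j] for r in filled]
        beta = fit_ols(others, [r[j] for r in filled])
        for i, r in enumerate(rows):
            if r[j] is None:
                result[i][j] = beta[0] + sum(
                    b * v for b, v in zip(beta[1:], others[i]))
    return result


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
imputed = impute_by_regression(rows)
```

In this small example the imputed value is about 5.2 rather than 8, because under the assumption made here the mean-filled row also takes part in the fit and pulls the slope down.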
If an algorithm is handed data with missing values and has no sensible way to deal with them, a message is
    posted asking the user to remove or impute the missing data first. Some algorithm configurations do have a
    sensible way to deal with missing data, in which case the data is used as is. For instance, any algorithm that
    takes a Chi Square or G Square independence test as its oracle can sensibly ignore the rows in which any of the
    variables being compared has a missing value. For fine control over the handling of missing data, however, it is
    best to impute the missing data first.
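To illustrate how a test can ignore incomplete rows, here is a minimal Python sketch of a chi-square statistic for two discrete variables that skips any case in which either variable is missing. It is not Tetrad's test implementation; the function name chi_square_statistic is hypothetical, and the sketch covers only the unconditional case.

```python
from collections import Counter

MISSING = "*"  # the marker the data editors use for a missing value


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic for two discrete columns, using
    only the cases where both values are observed.
    Returns (statistic, number_of_cases_used)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x != MISSING and y != MISSING]
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    stat = 0.0
    for x in x_counts:
        for y in y_counts:
            expected = x_counts[x] * y_counts[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat, n


xs = ["a", "a", "b", "b", "*", "a"]
ys = ["0", "0", "1", "1", "1", "*"]
stat, used = chi_square_statistic(xs, ys)  # the last two cases are skipped
```

Because rows are dropped per test, different tests in the same search may end up using different subsets of the cases, which is one reason imputing first gives finer control.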
Algorithms that can handle missing data include:

    Any search algorithm using Chi Square or G Square as a conditional independence oracle.
    ML Bayes Updater
    Dirichlet Bayes Updater
    EM Bayes Updater

Any other algorithm that takes data as an input will display an error message if the data contains missing
    values.