Handling Missing Data

Missing data is represented in both the tabular and covariance data editors by an asterisk ("*"). There are three ways to create missing data values in a data set:

  • In the Data Editor, replace the relevant entries with asterisks, by hand.
  • When loading data, load data that contains missing values.
  • From the Tools menu in the Data Editor, select "Inject Missing Values Randomly."

The first way explicitly declares that a particular datum is missing. The second way reflects missing data as represented in a data file. The third way automatically adds missing data at random in a data file, usually as a test of how an algorithm performs under conditions of missing data.
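The second and third ways can be sketched in a few lines of Python. This is only an illustration: the tab delimiter, the function names `parse_row` and `inject_missing`, and the rule of blanking each cell independently with a fixed probability are all assumptions, since the text above does not say how the Data Editor chooses which values to remove.

```python
import random

MISSING = "*"  # the marker the editors display for a missing value


def parse_row(line, delimiter="\t"):
    """Split one line of a data file, keeping "*" as the missing marker.

    Hypothetical helper: the delimiter and the float conversion are assumptions.
    """
    tokens = [t.strip() for t in line.split(delimiter)]
    return [MISSING if t == MISSING else float(t) for t in tokens]


def inject_missing(rows, prob, seed=None):
    """Return a copy of rows in which each cell is replaced by "*"
    independently with probability prob (assumed selection rule)."""
    rng = random.Random(seed)
    return [[MISSING if rng.random() < prob else value for value in row]
            for row in rows]


if __name__ == "__main__":
    data = [parse_row("1.5\t*\t3.0"), parse_row("2.0\t4.0\t6.0")]
    print(inject_missing(data, prob=0.3, seed=1))
```

Passing a fixed `seed` makes the injected pattern reproducible, which is convenient when comparing how several algorithms cope with the same missing cells.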

It is usually a good idea to remove or impute missing data before handing it over to an algorithm. With tabular data, one can remove cases containing missing values from a data set by selecting Tools-->Missing Values-->Remove Cases with Missing Values. Selecting Tools-->Missing Values-->Replace Missing Values with Column Mode instead replaces each missing value with the mode of its column; this works for both continuous and discrete data. Tools-->Missing Values-->Replace Missing Values with Column Mean works similarly, but operates only on continuous columns. Finally, Tools-->Missing Values-->Replace Missing Values with Extra Category works on discrete variables only and adds an extra category to each variable to mark where the missing values were.
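To make these four options concrete, here is a minimal Python sketch of what each one does to the data. It is not the program's own code: the function names and the category label "Missing" are hypothetical, and rows and columns are plain Python lists with "*" marking a missing value.

```python
from statistics import mean, mode

MISSING = "*"  # the marker the editors display for a missing value


def remove_cases_with_missing(rows):
    """Keep only the cases (rows) that contain no missing value."""
    return [row for row in rows if MISSING not in row]


def replace_with_mode(column):
    """Fill missing entries with the most common observed value.
    Works for discrete or continuous columns."""
    fill = mode([v for v in column if v != MISSING])
    return [fill if v == MISSING else v for v in column]


def replace_with_mean(column):
    """Fill missing entries with the mean of the observed values.
    Only meaningful for continuous columns."""
    fill = mean([v for v in column if v != MISSING])
    return [fill if v == MISSING else v for v in column]


def replace_with_extra_category(column, label="Missing"):
    """Fill missing entries of a discrete column with a new category.
    The label is a hypothetical choice."""
    return [label if v == MISSING else v for v in column]


if __name__ == "__main__":
    print(remove_cases_with_missing([[1.0, "a"], ["*", "b"], [3.0, "*"]]))
    print(replace_with_mean([1.0, "*", 3.0]))
    print(replace_with_mode(["a", "b", "a", "*"]))
    print(replace_with_extra_category(["a", "*"]))
```

Removing cases shrinks the sample, while the three replacement options keep every case but put a substituted value where the asterisk was.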

Tools-->Missing Values-->Replace Missing Values with Regression Predictions imputes missing values using a regression model, as follows. First, a copy of the data set is made in which missing values are imputed using column means. Then, for each missing value, the column containing it is regressed onto the remaining columns in the data set. The regression's predicted value for that case, computed from the values in the data set copy, replaces the missing value.
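The procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not the program's implementation: the regression is ordinary least squares with an intercept, it is fitted on the mean-imputed copy (the description leaves open which rows are used for fitting), and all function names are hypothetical.

```python
MISSING = "*"  # the marker the editors display for a missing value


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(xs, ys):
    """Least-squares coefficients, intercept first, via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def impute_by_regression(rows):
    ncols = len(rows[0])
    # Step 1: copy of the data with missing values replaced by column means.
    means = []
    for j in range(ncols):
        observed = [r[j] for r in rows if r[j] != MISSING]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if v == MISSING else v for j, v in enumerate(r)]
              for r in rows]
    # Step 2: regress each incomplete column on the remaining columns
    # (fitted on the copy, an assumption) and predict from the copy's values.
    result = [r[:] for r in rows]
    for j in range(ncols):
        if all(r[j] != MISSING for r in rows):
            continue
        others = [c for c in range(ncols) if c != j]
        beta = fit_ols([[f[c] for c in others] for f in filled],
                       [f[j] for f in filled])
        for i, r in enumerate(rows):
            if r[j] == MISSING:
                result[i][j] = beta[0] + sum(
                    b * filled[i][c] for b, c in zip(beta[1:], others))
    return result


if __name__ == "__main__":
    print(impute_by_regression([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [2.0, "*"]]))
```

The sketch assumes continuous columns that are not perfectly collinear; with collinear predictors the normal equations are singular and `solve` would divide by zero.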

If an algorithm is handed data with missing values and has no sensible way to deal with them, a message is posted asking the user to remove or impute the missing data first. Some algorithm configurations do have a sensible way to deal with missing data, in which case the data is used as is. For instance, any algorithm taking a Chi Square or G Square independence test as its oracle can sensibly ignore rows in which any of the variables being compared has a missing value. For fine control over the handling of missing data, however, it is best to impute missing data first.
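As an illustration of how such a test can ignore incomplete rows, the following Python sketch computes a Pearson chi-square statistic for two discrete variables, using only the cases in which both are observed. It is a simplified, unconditional version written for this page, not the program's own test; the function name is hypothetical and the p-value computation is omitted.

```python
from collections import Counter

MISSING = "*"  # the marker the editors display for a missing value


def chi_square_ignoring_missing(xs, ys):
    """Return (statistic, degrees of freedom, cases used) for two
    discrete variables, skipping cases where either value is missing."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x != MISSING and y != MISSING]
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # expected count under independence
            stat += (joint[(x, y)] - expected) ** 2 / expected
    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return stat, dof, n


if __name__ == "__main__":
    xs = ["a", "a", "b", "b", "*", "a"]
    ys = ["u", "v", "u", "v", "u", "*"]
    print(chi_square_ignoring_missing(xs, ys))
```

Because rows are dropped separately for each pair of variables being compared, different tests in the same search may be based on different numbers of cases, which is one reason imputing first gives finer control.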

Algorithms that can handle missing data include:

  • Any search algorithm using Chi Square or G Square as a conditional independence oracle.
  • ML Bayes Updater
  • Dirichlet Bayes Updater
  • EM Bayes Updater

Any other algorithm that takes data as an input will display an error message if the data contains missing values.




