www.flow.packs.examples.GLM_Example.flow Maven / Gradle / Ivy

Go to download
{
  "version": "1.0.0",
  "cells": [
    {
      "type": "md",
      "input": "# GLM Tutorial\n\nThe purpose of this tutorial is to walk new users through Generalized Linear Analysis (GLM) using H2O Flow.\n\nThose who have never used H2O before should refer to Using Flow - H2O's Web UI for additional instructions on how to run H2O Flow.\n\n**Note**: GLM in the current version of H2O may provide slightly different coefficient values when applying an L1 penalty in comparison with previous versions of H2O.\n\n### Using GLM\nUse GLM when the variable of interest relates to predictions or inferences about a rate, an event, or a continuous measurement or for questions about how a set of environmental conditions influence the dependent variable.\n\nHere are some examples:\n\n- \"What attributes determine which customers will purchase, and which will not?\"\n- \"Given a set of specific manufacturing conditions, how many units produced will fail?\"\n- \"How many customers will contact help support in a given time frame?\"\n\n### Getting Started\nThis tutorial uses a publicly available data set that can be found at the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/.\n\nThe original data are the Abalone data, available from UCI Machine Learning Repository. They are composed of 4177 observations on 9 attributes. All attributes are real valued, and continuous, except for Sex and Rings, found in columns 1 and 9 respectively.\nSex is categorical with 3 levels (male, female, and infant), and Rings is an integer valued count.\n\nIf you don't have any data of your own to work with, you can find some example datasets at https://archive.ics.uci.edu/ml/index.php.\n\nThe goal of this example is to use the Abalone dataset to predict the number of rings (C9), based on the other attributes:\n - sex (C1)\n - length (C2)\n - diameter (C3)\n - height (C4)\n - whole weight (C5)\n - shucked weight (C6)\n - viscera weight (C7)\n - shell weight (C8)\n\n#### Importing Data\nBefore creating a model, import data into H2O:\n\n0. Click the **Assist Me!** button (the last button in the row of buttons below the menus).\n  ![Assist Me](https://raw.githubusercontent.com/h2oai/h2o-3/master/h2o-docs/src/product/flow/images/Flow_AssistMeButton.png)\n0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. For this example, the dataset is available at https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/abalone.csv.gz. \n0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. "
    },
    {
      "type": "cs",
      "input": "assist"
    },
    {
      "type": "cs",
      "input": "importFiles [ \"https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/abalone.csv.gz\" ]"
    },
    {
      "type": "md",
      "input": "### Parsing Data\nNow, parse the imported data: \n\n0. Click the **Parse these files...** button. \n\n  **Note**: The default options typically do not need to be changed unless the data does not parse correctly. \n\n0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). \n0. If the data uses a separator, select it from the drop-down **Separator** list. \n0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. \n0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. \n0. To delete the imported dataset after parsing, check the **Delete on done** checkbox. \n\n  **NOTE**: In general, we recommend enabling this option. Retaining data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O.\n\n0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button.  \n\n  **NOTE**: Make sure the parse is complete by confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse.\n"
    },
    {
      "type": "cs",
      "input": "setupParse paths: [ \"https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/abalone.csv.gz\" ]"
    },
    {
      "type": "cs",
      "input": "parseFiles\n  paths: [\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/abalone.csv.gz\"]\n  destination_frame: \"abalone1.hex\"\n  parse_type: \"CSV\"\n  separator: 32\n  number_columns: 9\n  single_quotes: false\n  column_names: [\"C1\",\"C2\",\"C3\",\"C4\",\"C5\",\"C6\",\"C7\",\"C8\",\"C9\"]\n  column_types: [\"Enum\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\"]\n  delete_on_done: true\n  check_header: 1\n  chunk_size: 4194304"
    },
    {
      "type": "md",
      "input": "### Building a Model\n\n0. Once data are parsed, click the **View** button, then click the **Build Model** button. \n0. Select `Generalized Linear Model` from the drop-down **Select an algorithm** menu, then click the **Build model** button.  \n0. If the parsed Abalone .hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. \n0. To generate a scoring history, select `abalone1.hex` from the **Validation_frame** drop-down list. \n0. In the **Response** field, select the column associated with the number of rings variable (`C9`).\n0. From the drop-down **Family** menu, select `gaussian`. \n0. Enter `0.3` in the **Alpha** field. The alpha parameter is the mixing parameter for the L1 and L2 penalty.\n0. Enter `.002` in the **Lambda** field. \n0. Check the **Lambda_search** checkbox. \n0. Click the **Build Model** button."
    },
    {
      "type": "cs",
      "input": "getFrameSummary \"abalone1.hex\""
    },
    {
      "type": "cs",
      "input": "assist buildModel, null, training_frame: \"abalone1.hex\""
    },
    {
      "type": "cs",
      "input": "buildModel 'glm', {\"model_id\":\"glm-97176b71-c4dc-4d3a-bf36-53b5a4e4ab03\",\"training_frame\":\"abalone1.hex\",\"validation_frame\":\"abalone1.hex\",\"nfolds\":0,\"response_column\":\"C9\",\"ignored_columns\":[],\"ignore_const_cols\":true,\"family\":\"gaussian\",\"solver\":\"IRLSM\",\"alpha\":[0.3],\"lambda\":[0.002],\"lambda_search\":true,\"standardize\":true,\"non_negative\":false,\"score_each_iteration\":false,\"max_iterations\":-1,\"link\":\"family_default\",\"intercept\":true,\"objective_epsilon\":0.00001,\"beta_epsilon\":0.0001,\"gradient_epsilon\":0.0001,\"prior\":-1,\"max_active_predictors\":-1}"
    },
    {
      "type": "md",
      "input": "### GLM Results\n\nTo view the results, click the **View** button. The GLM output includes coefficients (as well as normalized coefficients when standardization is requested). The output also includes: \n\n- Model parameters (hidden)\n- (If standardization is enabled) A bar chart representing the standardized coefficient magnitudes (blue for negative, orange for positive)\n- A graph of the scoring history (objective vs. iteration) \n- Output (model category, validation metrics, and standardized coefficients magnitude)\n- GLM model summary (family, link, regularization, number of total predictors, number of active predictors, number of iterations, training frame)\n-  Scoring history in tabular form (timestamp, duration, iteration, log likelihood, objective)\n-  Training metrics (model, model checksum, frame, frame checksum, description, model category, scoring time, predictions, MSE, r2, residual deviance, null deviance, AIC, null degrees of freedom, residual degrees of freedom) \n-  Coefficients\n-  (If standardization is enabled) Standardized coefficient magnitudes\n-  Preview POJO\n\nIn the results below, the model shows that larger shucked weight (C6) values correspond to fewer rings (C9). \n\nTo view more details, click the **Inspect** button. \n\nTo view the coefficient values, click the **output - Coefficients** link. \n"
    },
    {
      "type": "cs",
      "input": "getModel \"glm-97176b71-c4dc-4d3a-bf36-53b5a4e4ab03\""
    },
    {
      "type": "cs",
      "input": "inspect getModel \"glm-97176b71-c4dc-4d3a-bf36-53b5a4e4ab03\""
    },
    {
      "type": "cs",
      "input": "grid inspect \"output - Coefficients\", getModel \"glm-97176b71-c4dc-4d3a-bf36-53b5a4e4ab03\""
    }
  ]
}