www.flow.packs.examples.Rulefit_Example.flow Maven / Gradle / Ivy

Go to download
{
  "version": "1.0.0",
  "cells": [
    {
      "type": "md",
      "input": "## Rulefit Tutorial\n\nThis tutorial describes how to create a Rulefit model using H2O Flow.\n\nThose who have never used H2O before should refer to Using Flow - H2O's Web UI for additional instructions on how to run H2O Flow.\n\n### H2O Rulefit algorithm\n\nRulefit algorithm combines tree ensembles and linear models to take advantage of both methods: a tree ensemble accuracy and a linear model interpretability. The general algorithm fits a tree ensebmle to the data, builds a rule ensemble by traversing each tree, evaluates the rules on the data to build a rule feature set and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set.\n\nFor more information, refer to: http://statweb.stanford.edu/~jhf/ftp/RuleFit.pdf by Jerome H. Friedman and Bogden E. Popescu.\n\nWe will train a rulefit model to predict the rules defining whether or not someone will survive:\n\n#### Importing Data\nBefore creating a model, import data into H2O:\n\n0. Click the **Assist Me!** button (the last button in the row of buttons below the menus).\n  ![Assist Me](https://raw.githubusercontent.com/h2oai/h2o-3/master/h2o-docs/src/product/flow/images/Flow_AssistMeButton.png) \n0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. For this example, the dataset is available at https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv. \n0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. "
    },
    {
      "type": "cs",
      "input": "assist"
    },
    {
      "type": "cs",
      "input": "importFiles [ \"https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/ad.data.gz\" ]"
    },
    {
      "type": "md",
      "input": "### Parsing Data\nNow, parse the imported data: \n\n0. Click the **Parse these files...** button. \n\n  **Note**: The default options typically do not need to be changed unless the data does not parse correctly. \n0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, CSV, or SVMLight). \n0. If the data uses a separator, select it from the drop-down **Separator** list. \n0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the **Auto** radio button. \n0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. \n0. To delete the imported dataset after parsing, check the **Delete on done** checkbox. \n\n  **NOTE**: In general, we recommend enabling this option. Retaining data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O.\n0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button.  \n\n  **NOTE**: Make sure the parse is complete by confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse.\n"
    },
    {
      "type": "cs",
      "input": "setupParse paths: [ \"https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv\" ]"
    },
    {
      "type": "cs",
      "input": "parseFiles\n  paths: [\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv\"]\n  destination_frame: \"titanic.hex\"\n  parse_type: \"CSV\"\n  separator: 44\n  number_columns: 14\n  single_quotes: false\n  column_names: [\"pclass\",\"survived\",\"name\",\"sex\",\"age\",\"sibsp\",\"parch\",\"ticket\",\"fare\",\"cabin\",\"embarked\",\"boat\",\"body\",\"home.dest\"]\n  column_types: [\"Enum\",\"Enum\",\"String\",\"Enum\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Enum\",\"Enum\",\"Numeric\",\"Numeric\",\"Enum\"]\n  delete_on_done: true\n  check_header: 1\n  chunk_size: 4194304"
    },
    {
      "type": "md",
      "input": "### Building a Model\n\n0. Once data are parsed, click the **View** button, then click the **Build Model** button. \n0. Select `Rulefit` from the drop-down **Select an algorithm** menu, then click the **Build model** button.  \n0. If the parsed titanic.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. \n0. In the **seed** drop-down list, specify `1234` for this example. \n0. From the **response_column** drop-down list, select `survived`. \n0. In the **ignored_columns** field, select `name`, `ticket`, `fare`, `cabin`,`embarked`,`boat`,`body`,`home.dest`. \n0. In the **algorithm** field, set whether algorithm will use DRF or GBM to fit a tree enseble. For this example, enter `DRF`. \n0. In the **min_rule_length** and **max_rule_length** fields, set interval of tree enseble depths to be fitted. The bigger this interval is, the more tree ensembles will be fitted (1 per each depth) and the bigger the rule feature set will be. For this exampe, enter `1`,`10`.\n0. In the **max_num_rules** field, the maximum number of rules to return can be set. For this exampe, enter `100`.\n0. In the **model_type** field, the type of base learners in the enseble can be set. For this exampe, enter `RULES`.\n0. In the **rule_generation_ntrees** field, the number of trees for tree enseble can be set. For this exampe, enter `50`.\n0. Click the **Build Model** button. "
    },
    {
      "type": "cs",
      "input": "assist buildModel, null, training_frame: \"titanic.hex\""
    },
    {
      "type": "cs",
      "input": "buildModel 'rulefit', {\"model_id\":\"rulefit-6ed1698c-5e92-4874-a215-65045dfb5798\",\"training_frame\":\"titanic.hex\",\"seed\":1234,\"response_column\":\"survived\",\"ignored_columns\":[\"name\",\"ticket\",\"cabin\",\"embarked\",\"boat\",\"body\",\"home.dest\"],\"algorithm\":\"DRF\",\"min_rule_length\":1,\"max_rule_length\":10,\"max_num_rules\":100,\"model_type\":\"RULES\",\"rule_generation_ntrees\":50,\"distribution\":\"AUTO\"}"
    },
    {
      "type": "md",
      "input": "### Rulefit Output\n\nTo view the output, click the **View** button. The output for the Rulefit model includes:\n\n- model parameters\n- rule importences in tabular form\n- training and validation metrics of the underlying linear model\n\nIn the output - rule importance dropdown there us a table with several rules that can be recapped as:\n\n#### Higgest Likelihood of Survival:\n1. women in class 1 or 2 with 3 siblings/spouses aboard or less\n2. women in class 1 or 2\n\n#### Lowest Likelihood of Survival:\n1. male in class 2 or 3 of age >= 9.5\n2. male of age >= 13.4\n3. male with no parents/children aboard, fare < 26$ and of age >= 15\n4. male of age >= 9\n\nNote: The rules are additive. That means that if a passenger is described by multiple rules, their probability is added together from those rules."
    },
    {
      "type": "md",
      "input": "### Rulefit Output\n\nTo view the output, click the **View** button. The output for the Rulefit model includes:\n\n- model parameters\n- rule importences in tabular form\n- training and validation metrics of the underlying linear model\n\nIn the output - rule importance dropdown there us a table with several rules that can be recapped as:\n\n#### Higgest Likelihood of Survival:\n1. women in class 1 or 2 with 3 siblings/spouses aboard or less\n2. women in class 1 or 2\n\n#### Lowest Likelihood of Survival:\n1. man in class 2 or 3 of age >= 9.5\n2. man of age >= 13.4\n3. man with no parents/children aboard, fare < 26$ and of age >= 15\n4. man of age >= 9\n\nNote: The rules are additive. That means that if a passenger is described by multiple rules, their probability is added together from those rules."
    },
    {
      "type": "cs",
      "input": "grid inspect \"output - Rule Importance\", getModel \"rulefit-6ed1698c-5e92-4874-a215-65045dfb5798\""
    },
    {
      "type": "md",
      "input": "### Rulefit Predict\n\nTo generate a prediction, click the Predict button in the Model cell and select the test file from the drop-down Frame list, then click the Predict button below the drop-down Frame list."
    },
    {
      "type": "cs",
      "input": "predict model: \"rulefit-6ed1698c-5e92-4874-a215-65045dfb5798\""
    }
  ]
}