
www.flow.packs.examples.GBM_Example.flow

{
  "version": "1.0.0",
  "cells": [
    {
      "type": "md",
      "input": "# GBM Tutorial\n\nThis tutorial walks new users through a GBM analysis in H2O Flow. \n\nThose who have never used H2O before should refer to Using Flow - H2O's Web UI for additional instructions on how to run H2O Flow.\n\n## Getting Started\n\nThis tutorial uses a publicly available dataset that can be found at:\nhttp://archive.ics.uci.edu/ml/datasets/Arrhythmia.\n\n\nThe original data are the Arrhythmia dataset made available by the UCI\nMachine Learning Repository. The dataset consists of\n452 observations and 279 attributes.\n\nIf you don't have any data of your own to work with, you can find some example datasets at https://archive.ics.uci.edu/ml/index.php.\n\n### Importing Data\nBefore creating a model, import data into H2O:\n\n0. Click the **Assist Me!** button (the last button in the row of buttons below the menus).\n  ![Assist Me](https://raw.githubusercontent.com/h2oai/h2o-3/master/h2o-docs/src/product/flow/images/Flow_AssistMeButton.png) \n0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. For this example, the file path is http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/arrhythmia.csv.gz. \n0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. "
    },
    {
      "type": "cs",
      "input": "assist"
    },
    {
      "type": "cs",
      "input": "importFiles [\"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/arrhythmia.csv.gz\"]"
    },
    {
      "type": "md",
      "input": "### Parsing Data\nNow, parse the imported data: \n\n0. Click the **Parse these files...** button. \n  \n   **Note**: The default options typically do not need to be changed unless the data does not parse correctly. \n0. From the drop-down **Parser** list, select the file type of the dataset (Auto, XLS, CSV, or SVMLight). \n0. If the data uses a separator, select it from the drop-down **Separator** list. \n0. If the first row of the data is a column header, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine whether the first row of the dataset contains column names or data. \n0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. \n0. To delete the imported dataset after the parse is complete, check the **Delete on done** checkbox. \n  \n   **Note**: In general, we recommend enabling this option. Retaining the imported data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O.\n0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button.  \n\n   **Note**: Make sure the parse is complete by clicking the **View Job** button and confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse."
    },
    {
      "type": "cs",
      "input": "setupParse paths: [\"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/arrhythmia.csv.gz\"]"
    },
    {
      "type": "cs",
      "input": "parseFiles\n  paths: [\"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/arrhythmia.csv.gz\"]\n  destination_frame: \"arrhythmia.hex\"\n  parse_type: \"CSV\"\n  separator: 44\n  number_columns: 280\n  single_quotes: false\n  column_names: null\n  column_types: [\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\
",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\"]\n  delete_on_done: true\n  check_header: -1\n  chunk_size: 4194304"
    },
    {
      "type": "md",
      "input": "### Building a Model\n\n0. Once the data are parsed, click the **View** button, then click the **Build Model** button. \n0. Select `Gradient Boosting Machine` from the drop-down **Select an algorithm** menu, then click the **Build model** button. \n0. If the parsed `arrhythmia.hex` frame is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. \n0. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, do not select any columns. \n0. From the drop-down **Response** list, select column 1 (`C1`).  \n0. In the **Ntrees** field, specify the number of trees to build (for this example, `20`). \n0. In the **Max_depth** field, specify the maximum number of edges between the top node and the furthest node as a stopping criterion (for this example, use the default value of `5`). \n0. In the **Min_rows** field, specify the minimum number of observations (rows) to include in any terminal node as a stopping criterion (for this example, `25`). \n0. In the **Nbins** field, specify the number of bins to use for data splitting (for this example, use the default value of `20`). The split points are evaluated at the boundaries of each of these bins. As the value of **Nbins** increases, the algorithm approximates more closely the evaluation of each individual observation as a split point. The cost of this refinement is an increase in computational time.  \n0. In the **Learn_rate** field, specify the tuning parameter (also known as shrinkage) that slows the convergence of the algorithm to a solution, which helps prevent overfitting. For this example, enter `0.3`. \n0. Click the **Build Model** button. "
    },
    {
      "type": "cs",
      "input": "assist buildModel, null, training_frame: \"arrhythmia.hex\""
    },
    {
      "type": "cs",
      "input": "buildModel 'gbm', {\"model_id\":\"gbm-51b9780b-70d0-40d0-9b5a-c723a3f358c1\",\"training_frame\":\"arrhythmia.hex\",\"score_each_iteration\":false,\"response_column\":\"C1\",\"ntrees\":\"20\",\"max_depth\":5,\"min_rows\":\"25\",\"nbins\":20,\"learn_rate\":\"0.3\",\"distribution\":\"AUTO\",\"balance_classes\":false,\"max_confusion_matrix_size\":20,\"class_sampling_factors\":[],\"max_after_balance_size\":5,\"seed\":0}"
    },
    {
      "type": "md",
      "input": "### Viewing GBM Results\n\nTo view the results, click the **View** button. The output for GBM includes the following: \n\n- Model parameters (hidden)\n- A graph of the scoring history (training MSE vs. number of trees)\n- A graph of the variable importances\n- Output (model category, validation metrics, initf)\n- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)\n- Scoring history in tabular format\n- Training metrics (model name, model checksum name, frame name, description, model category, duration in ms, scoring time, predictions, MSE, R2)\n- Variable importances in tabular format\n- Preview POJO\n"
    },
    {
      "type": "cs",
      "input": "getModel \"gbm-51b9780b-70d0-40d0-9b5a-c723a3f358c1\""
    },
    {
      "type": "md",
      "input": "### Viewing Predictions\n\nTo view predictions, click the **Predict** button. From the drop-down **Frame** list, select the arrhythmia.hex file and click the **Predict** button. "
    },
    {
      "type": "cs",
      "input": "predict model: \"gbm-51b9780b-70d0-40d0-9b5a-c723a3f358c1\""
    },
    {
      "type": "cs",
      "input": "predict model: \"gbm-51b9780b-70d0-40d0-9b5a-c723a3f358c1\", frame: \"arrhythmia.hex\", predictions_frame: \"prediction-9d6f23f3-45c2-4e1f-a48e-393b1b7de6db\""
    },
    {
      "type": "cs",
      "input": "getFrame \"prediction-9d6f23f3-45c2-4e1f-a48e-393b1b7de6db\""
    }
  ]
}



