All Downloads are FREE. Search and download functionalities are using the official Maven repository.

www.flow.packs.examples.K-Means_Example.flow Maven / Gradle / Ivy

{
  "version": "1.0.0",
  "cells": [
    {
      "type": "md",
      "input": "# K Means Tutorial\n\nThis tutorial describes how to perform a K-Means analysis. By the end of this tutorial the user should know how to specify, run, and interpret a K-means model in H2O using Flow.\n\nThose who have never used H2O before should refer to Using Flow - H2O's Web UI for additional instructions on how to run H2O Flow.\n\nIn the latest version of H2O, the K-means algorithm has a \"k-modes\" function that allows you to use mixed categorical and real-valued data. By using dissimilarity measures to handle categoricals, replacing cluster means with cluster modes, and using a frequency-based method to update modes in the clustering process to minimize the clustering costs, the k-modes algorithm is scalable in both the number of clusters and the number of records. The k-modes method is used anytime categorical data is present. \n\nFor more information, refer to \"A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining\" and \"Extensions to the k-Means Algorithm for Clustering Large Data Sets with Catgorical Values\" by Zhexue Huang.\n\n### Getting Started\n\nThis tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds.\n\nThe data are composed of 210 observations, 7 attributes, and an a priori grouping assignment. All data are positively valued and continuous. \n\nIf you don't have any data of your own to work with, you can find some example datasets at https://archive.ics.uci.edu/ml/index.php.\n\n#### Importing Data\nBefore creating a model, import data into H2O:\n\n0. Click the **Assist Me!** button (the last button in the row of buttons below the menus).\n  ![Assist Me](https://raw.githubusercontent.com/h2oai/h2o-3/master/h2o-docs/src/product/flow/images/Flow_AssistMeButton.png)  \n0. Click the **importFiles** link and enter the file path to the dataset in the **Search** entry field. \n0. Click the **Add all** link to add the file to the import queue, then click the **Import** button. \n"
    },
    {
      "type": "cs",
      "input": "assist"
    },
    {
      "type": "cs",
      "input": "importFiles"
    },
    {
      "type": "cs",
      "input": "importFiles [ \"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt\" ]"
    },
    {
      "type": "md",
      "input": "### Parsing Data\nNow, parse the imported data: \n\n0. Click the **Parse these files...** button. \n**Note**: The default options typically do not need to be changed unless the data does not parse correctly. \n0. From the drop-down **Parser** list, select the file type of the data set (Auto, XLS, XLSX, CSV, or SVMLight). \n0. If the data uses a separator, select it from the drop-down **Separator** list. \n0. If the data uses a column header as the first row, select the **First row contains column names** radio button. If the first row contains data, select the **First row contains data** radio button. You can also select the **Auto** radio button to have H2O automatically determine if the first row of the dataset contains the column names or data. \n0. If the data uses apostrophes ( `'` - also known as single quotes), check the **Enable single quotes as a field quotation character** checkbox. \n0. To delete the imported dataset after the parse is complete, check the **Delete on done** checkbox. \n\n   **NOTE**: In general, we recommend enabling this option. Retaining data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O.\n\n0. Review the data in the **Edit Column Names and Types** section, then click the **Parse** button.  \n\n  **NOTE**: Make sure the parse is complete by confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse.\n"
    },
    {
      "type": "cs",
      "input": "setupParse paths: [ \"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt\" ]"
    },
    {
      "type": "cs",
      "input": "parseFiles\n  paths: [\"http://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt\"]\n  destination_frame: \"seeds_dataset.hex\"\n  parse_type: \"CSV\"\n  separator: 9\n  number_columns: 8\n  single_quotes: false\n  column_types: [\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\",\"Numeric\"]\n  delete_on_done: true\n  check_header: -1\n  chunk_size: 4194304"
    },
    {
      "type": "md",
      "input": "### Building a Model\n\n0. Once data are parsed, click the **View** button, then click the **Build Model** button. \n0. Select `K-means` from the drop-down **Select an algorithm** menu, then click the **Build model** button. \n0. If the parsed arrhythmia.hex file is not already listed in the **Training_frame** drop-down list, select it. Otherwise, continue to the next step. \n0. From the **Ignored_columns** section, select the columns to ignore in the *Available* area to move them to the *Selected* area. For this example, select column 8 (the a priori known clusters for this dataset). \n0. In the **K** field, specify the number of clusters. For this example, enter `3`.  \n0. In the **Max_iterations** field, specify the maximum number of iterations. For this example, enter `100`. \n0. From the drop-down **Init** menu, select the initialization mode. For this example, select **PlusPlus**. \n   - Random initialization randomly samples the `k`-specified value of the rows of the training data as cluster centers. \n   - PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen. \n   - Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance. \n   - User initialization requires the corresponding **User_points** parameter. To define a specific initialization point, select the imported dataset .hex file from the drop-down **User_points** list, then select **User** from the drop-down **Init** list.\n   \n     **Note**: The user-specified points dataset must have the same number of columns as the training dataset.  \n\n0. Uncheck the **Standardize** checkbox to disable column standardization. \n0. Click the **Build Model** button. \n"
    },
    {
      "type": "cs",
      "input": "getFrameSummary \"seeds_dataset.hex\""
    },
    {
      "type": "cs",
      "input": "assist buildModel, null, training_frame: \"seeds_dataset.hex\""
    },
    {
      "type": "cs",
      "input": "buildModel 'kmeans', {\"model_id\":\"kmeans-d1cc0a90-3a54-4eac-8b06-1d66f1f57741\",\"training_frame\":\"seeds_dataset.hex\",\"ignored_columns\":[\"C8\"],\"k\":\"3\",\"max_iterations\":\"100\",\"init\":\"PlusPlus\",\"standardize\":false,\"seed\":1425597869002366000}"
    },
    {
      "type": "md",
      "input": "### K-Means Output\n\nOutput is a matrix of the cluster assignments, and the\ncoordinates of the cluster centers in terms of the originally\nchosen attributes. Your cluster centers may differ slightly.\nK-Means randomly chooses starting points and converges on\noptimal centroids. The cluster number is arbitrary, and should\nbe thought of as a factor.\n\nBy default, the following output displays:\n\n- Model parameters (hidden)\n- A graph of the scoring history (number of iterations vs. average within the cluster's sum of squares) \n- Output (model category, validation metrics if applicable, and centers std)\n- Model Summary (number of clusters, number of categorical columns, number of iterations, avg. within sum of squares, avg. sum of squares, avg. between the sum of squares)\n- Scoring history (number of iterations, avg. change of standardized centroids, avg. within cluster sum of squares)\n- Training metrics (model name, checksum name, frame name, frame checksum name, description if applicable, model category, duration in ms, scoring time, predictions, MSE, avg. within sum of squares, avg. between sum of squares)\n- Centroid statistics (centroid number, size, within sum of squares)\n- Cluster means (centroid number, column)\n- Preview POJO"
    },
    {
      "type": "cs",
      "input": "getModel \"kmeans-d1cc0a90-3a54-4eac-8b06-1d66f1f57741\""
    },
    {
      "type": "md",
      "input": "### Making Predictions\n\nTo make a prediction based on the model, click the **Predict** button in the **Model** cell. Select a data frame to use for the prediction, then click the **Predict** button. "
    },
    {
      "type": "cs",
      "input": "predict model: \"kmeans-d1cc0a90-3a54-4eac-8b06-1d66f1f57741\""
    },
    {
      "type": "cs",
      "input": "predict model: \"kmeans-d1cc0a90-3a54-4eac-8b06-1d66f1f57741\", frame: \"seeds_dataset.hex\", predictions_frame: \"prediction-2e8d0460-8bc0-4dc1-9cad-df0a3a9d2eb1\""
    }
  ]
}




© 2015 - 2024 Weber Informatics LLC | Privacy Policy