co.cask.cdap.examples.wordcount.package-info Maven / Gradle / Ivy

Go to download
/*
 * Copyright © 2014 Cask Data, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
/**
 * This package contains the WordCount sample Application that counts words and tracks the associations between words.
 * This is a slightly modified version of the classic WordCount example. 
 * 
 * This WordCount Application consists of:
 * 
 * 1. A Stream named wordStream that receives strings of words to be counted.
 * 
 * 2. A Flow named WordCounter processes the strings from the Stream and calculates the word counts
 *    and other word statistics using four Flowlets:
 *    - The splitter splits the input string into words and aggregates and persists global statistics;
 *    - The counter takes words as inputs and calculates and persists per-word statistics;
 *    - The unique Flowlet calculates the unique number of words seen;
 *    - The associator stores word associations between all of the words in each input string.
 *
 * 3. A Service named ``RetrieveCounts`` that serves read requests for calculated statistics,
 *    word counts and associations. It exposes these endpoints:
 *    - ``/stats`` returns the total number of words, the number of unique words, and the average word length;
 *    - ``/count/{word}`` returns the word count of a specified word and its word associations,
 *      up to the specified limit or a pre-set limit of ten if not specified;
 *    - ``/assoc/{word1}/{word2}`` returns the top associated words (those with the highest counts).
 *
 * 4. Four Datasets used by the Flow and Service to model, store, and serve the data:
 *    - A core Table named wordStats to track global word statistics;
 *    - A system KeyValueTable Dataset named wordCounts counts the occurrences of each word;
 *    - A custom UniqueCountTable Dataset named uniqueCount determines and counts the number of unique words seen;
 *    - A custom AssociationTable Dataset named wordAssocs tracks associations between words.
 */
package co.cask.cdap.examples.wordcount;