
## Summary
This example uses pre-trained GloVe embeddings to convert words to vectors,
and then uses those vectors to train a text classification model on the 20 Newsgroup dataset,
which has 20 different categories. The model can achieve around 90% accuracy after 2 epochs of training.
(The approach was first described in: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

## Data
* Embedding: 100-dimensional pre-trained GloVe embeddings of 400k words, trained on a 2014 dump of English Wikipedia.
* Training data: the "20 Newsgroup dataset", which contains 20 categories and 19997 texts in total.

## Steps to run this example:
1.  Download the [pre-trained GloVe word embeddings](http://nlp.stanford.edu/data/glove.6B.zip)

    ```shell
    wget http://nlp.stanford.edu/data/glove.6B.zip
    unzip -q glove.6B.zip -d glove.6B
    ```

2.  Download the [20 Newsgroup dataset](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html) to use as the training data

    ```shell
    wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
    tar zxf 20news-18828.tar.gz
    ```

3.  Put both directories under BASE_DIR. The final structure should look like this:

    ```
    [~/textclassification]$ tree . -L 1
    .
    ├── 20news-18828
    └── glove.6B
    ```
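
    Before launching the job, it can help to confirm that the layout matches. The following is a minimal sketch, not part of the example itself: the helper name `check_layout` is hypothetical, and it assumes the 100-dimensional vectors are in `glove.6B/glove.6B.100d.txt`, which is the file name used inside `glove.6B.zip`.

    ```shell
    # Hypothetical helper: verify that a base directory contains the expected data.
    check_layout() {
        base_dir="$1"
        status=0
        # The extracted newsgroup archive should be a directory.
        if [ ! -d "${base_dir}/20news-18828" ]; then
            echo "missing directory: ${base_dir}/20news-18828" >&2
            status=1
        fi
        # Assumption: the example reads the 100-dimensional GloVe file.
        if [ ! -f "${base_dir}/glove.6B/glove.6B.100d.txt" ]; then
            echo "missing file: ${base_dir}/glove.6B/glove.6B.100d.txt" >&2
            status=1
        fi
        return "${status}"
    }
    ```

    Calling `check_layout "${PWD}"` from the data directory returns a non-zero status and names whatever is missing on stderr.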

4.  Run the commands:
    * Spark local:
      * Execute:

        ```shell
        BASE_DIR=${PWD} # directory that contains the data
        spark-submit --master "local[physical_core_number]" --driver-memory 20g                     \
                   --class com.intel.analytics.bigdl.example.textclassification.TextClassifier \
                   bigdl-VERSION-jar-with-dependencies.jar --batchSize 128              \
                   --baseDir ${BASE_DIR} --partitionNum 4
        ```

    * Spark cluster:
      * Standalone execute:

        ```shell
        MASTER=xxx.xxx.xxx.xxx:xxxx
        BASE_DIR=${PWD} # directory that contains the data
        spark-submit --master ${MASTER} --driver-memory 20g --executor-memory 20g  \
                   --total-executor-cores 32 --executor-cores 8                                \
                   --class com.intel.analytics.bigdl.example.textclassification.TextClassifier \
                   bigdl-VERSION-jar-with-dependencies.jar --batchSize 128              \
                   --baseDir ${BASE_DIR} --partitionNum 32
        ```
      * Yarn client execute:

        ```shell
        BASE_DIR=${PWD} # directory that contains the data
        spark-submit --master yarn --driver-memory 20g --executor-memory 20g  \
                   --num-executors 4 --executor-cores 8                               \
                   --class com.intel.analytics.bigdl.example.textclassification.TextClassifier \
                   bigdl-VERSION-jar-with-dependencies.jar --batchSize 128              \
                   --baseDir ${BASE_DIR} --partitionNum 32
        ```

      * NOTE: The total batch size is 128, and the batch size per node is 128/nodeNum.
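
        As a quick illustration of that arithmetic, here is a small sketch. The node count of 4 is an assumed value chosen for the example, not something the commands above fix, and the variable names are hypothetical.

        ```shell
        # Total batch size, as passed via --batchSize in the commands above.
        BATCH_SIZE=128
        # Assumed number of nodes, for illustration only.
        NODE_NUM=4
        # Integer division gives the batch size each node handles.
        PER_NODE_BATCH=$((BATCH_SIZE / NODE_NUM))
        echo "batch per node: ${PER_NODE_BATCH}"
        ```

        With these values each node handles 32 samples of every 128-sample batch.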

5. Verify:
   * Search for the accuracy in the log:
   ```
   [Epoch 1 0/15964][Iteration 1][Wall Clock 0.0s]

   top1 accuracy is Accuracy(correct: 14749, count: 15964, accuracy: 0.9238912553244801)
   ```