com.johnsnowlabs.nlp.annotators.DocumentNormalizer.scala

/*
 * Copyright 2017-2022 John Snow Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.johnsnowlabs.nlp.annotators

import com.johnsnowlabs.nlp.AnnotatorType.DOCUMENT
import com.johnsnowlabs.nlp.{Annotation, AnnotatorModel, AnnotatorType, HasSimpleAnnotate}
import org.apache.spark.ml.param.{BooleanParam, Param, StringArrayParam}
import org.apache.spark.ml.util.{DefaultParamsReadable, Identifiable}

import java.nio.charset.{Charset, StandardCharsets}
import scala.collection.mutable.ListBuffer
import scala.util.matching.Regex
import scala.util.matching.Regex.Match
import scala.util.{Failure, Success, Try}
import scala.xml.XML

/** Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents,
  * from document type columns into Sentence. Removes all unwanted characters from the text that
  * match one or more input regex patterns. Character removal can follow a specific policy, and
  * lowercase normalization can optionally be applied.
  *
  * For extended examples of usage, see the
  * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-normalizer/document_normalizer_notebook.ipynb Examples]].
  *
  * ==Example==
  * {{{
  * import spark.implicits._
  * import com.johnsnowlabs.nlp.DocumentAssembler
  * import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
  * import org.apache.spark.ml.Pipeline
  *
  * val documentAssembler = new DocumentAssembler()
  *   .setInputCol("text")
  *   .setOutputCol("document")
  *
  * val cleanUpPatterns = Array("<[^>]*>")
  *
  * val documentNormalizer = new DocumentNormalizer()
  *   .setInputCols("document")
  *   .setOutputCol("normalizedDocument")
  *   .setAction("clean")
  *   .setPatterns(cleanUpPatterns)
  *   .setReplacement(" ")
  *   .setPolicy("pretty_all")
  *   .setLowercase(true)
  *
  * val pipeline = new Pipeline().setStages(Array(
  *   documentAssembler,
  *   documentNormalizer
  * ))
  *
  * val text =
  *   """
  * 
  *   THE WORLD'S LARGEST WEB DEVELOPER SITE
  *   THE WORLD'S LARGEST WEB DEVELOPER SITE
  *   Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..
  * 
  *
  *