All Downloads are FREE. Search and download functionalities are using the official Maven repository.

stopwords.stop_words.txt Maven / Gradle / Ivy

There is a newer version: 0.8.2
Show newest version
#
#  "Commons Clause" License, https://commonsclause.com/
#
#  The Software is provided to you by the Licensor under the License,
#  as defined below, subject to the following condition.
#
#  Without limiting other conditions in the License, the grant of rights
#  under the License will not include, and the License does not grant to
#  you, the right to Sell the Software.
#
#  For purposes of the foregoing, "Sell" means practicing any or all of
#  the rights granted to you under the License to provide to third parties,
#  for a fee or other consideration (including without limitation fees for
#  hosting or consulting/support services related to the Software), a
#  product or service whose value derives, entirely or substantially, from
#  the functionality of the Software. Any license notice or attribution
#  required by the License must also include this Commons Clause License
#  Condition notice.
#
#  Software:    NLPCraft
#  License:     Apache 2.0, https://www.apache.org/licenses/LICENSE-2.0
#  Licensor:    Copyright (C) 2018 DataLingvo, Inc. https://www.datalingvo.com
#
#      _   ____      ______           ______
#     / | / / /___  / ____/________ _/ __/ /_
#    /  |/ / / __ \/ /   / ___/ __ `/ /_/ __/
#   / /|  / / /_/ / /___/ /  / /_/ / __/ /_
#  /_/ |_/_/ .___/\____/_/   \__,_/_/  \__/
#         /_/
#

# Basic predefined stop-words.
#
# Configuration contains:
# - Words (processed as stem)
# - Words with POSes list (processed as lemma)
# - Words with wildcard, symbol `*` (processed as lemma)
#
# Words and POSes can me marked as excluded (symbol `~` before word)
# Word can be marked as case sensitive (symbol `@` before word)
#
# Restrictions:
# - POSes list cannot be defined for multiple words.
# - Only one wildcard can be defined in the word.
# - Wildcard cannot be applied to chunks of words.
# - Only one case sensitive flag can be defined in the word.
#
# Examples:
# ========
# decent                - Includes word 'decent'.
# *ent                  - Includes all words ending with 'ent'.
# *ent | NN             - Includes all words with POS NN ending with 'ent'.
# *ent | ~NN ~JJ        - Includes all words beside POSes NN and JJ ending with 'ent'.
# ~dif*ly | JJ JJR JJS  - Excludes all JJ/JJR/JJS words starting with 'diff' and ending with 'ly'.
# ~may | MD             - Excludes 'may' MD.
# * | MD                - All words with MD POS.
# ~@US                  - US is not stop word (exception).
#
# Invalid syntax examples:
# ========================
# te*ni*                    - Too many wildcards
# tech* pers*               - Too many wildcards.
# @Technical @Personal      - Too many case sensitive flags.
# @Technical Personal | JJ  - POSes cannot be defined for chunks of words.
#

# POSes list.
* | UH
* | ,
* | POS
* | :
* | .
* | --
* | MD
* | EX
* | DT

# POSES list exceptions.
~may

# Postfixes list.
*ent | ~NN ~NNS ~NNP ~NNPS
*ant | ~NN ~NNS ~NNP ~NNPS
*ive | ~NN ~NNS ~NNP ~NNPS ~CD
*ly | ~NN ~NNS ~NNP ~NNPS
*ry | ~NN ~NNS ~NNP ~NNPS
*ial | ~NN ~NNS ~NNP ~NNPS
*able | ~NN ~NNS ~NNP ~NNPS
*able | ~NN ~NNS ~NNP ~NNPS
*ible | ~NN ~NNS ~NNP ~NNPS
*less | ~NN ~NNS ~NNP ~NNPS

# Postfixes list exceptions.
~less
~monthly
~daily
~weekly
~quarterly
~yearly
~badly
~poorly
~different

# Words of concrete POSes.
key | JJ JJR JJS
vital | JJ JJR JJS
critical | JJ JJR JJS
pressing | JJ JJR JJS
paramount | JJ JJR JJS
high-priority | JJ JJR JJS
must-have | JJ JJR JJS

# Words of any POSes.
a
an
avg
average
the
etc
fair
approximate
decent
generous
good
ok
okay
so
please
well
objective
reasonable
unbiased
sincere
trustworthy
civil
candid
honest
impartial
legitimate
straightforward
moderate
subjective
partial
rough
fuzzy
now
all right
let
website
web-site
web site
hey
datalingvo
datalingo
data lingvo
data lingo
dl
lol
lulz
omg
omfg
of the essence
gr8
lmao
wtf
xoxo
j/k
jk
fyi
imho
imo
btw
fwiw
thx
wth
afaik
abt
afaic
aka
a.k.a.
awol
b2b
b2c
byod
ciao
cmon
eta
huh
nsfw
otoh
plz
pls
rotfl
tgif
zzzz
zzz

# GEO abbreviations exceptions.
# Cities.
~la
~sf
~kc
~hk

# States.
~al
~ak
~az
~ar
~ca
~co
~ct
~de
~fl
~ga
~hi
~id
~il
~in
~ia
~ks
~ky
~la
~me
~md
~ma
~mi
~mn
~ms
~mo
~mt
~ne
~nv
~nh
~nj
~nm
~ny
~nc
~nd
~oh
~ok
~or
~pa
~ri
~sc
~sd
~tn
~tx
~ut
~vt
~va
~wa
~wv
~wi
~wy

# Upper case exceptions.
~@US




© 2015 - 2025 Weber Informatics LLC | Privacy Policy