The 12dicts Word Lists
Introduction
12dicts is a collection of English word lists. It differs in several important
ways from most of the other free word lists you can download.
- The 12dicts lists are oriented towards common words. If you're looking for
myriads of archaic, scientific or computer jargon words, you should look elsewhere.
- The 12dicts lists have been rigorously checked for errors. (This is not to
say that they are error-free, merely that enough care has been taken that errors
are rather infrequent.)
- 12dicts contains a variety of lists, of different sizes and characteristics.
One size does not fit all. Because each list has different characteristics, I do
not recommend combining them, except as noted below.
Originally, 12dicts was composed of lists derived from a specific set of 12 source
dictionaries. In addition to these "classic" lists, 12dicts now includes lists derived
from other sources. It would perhaps be appropriate to rename 12dicts to something
more generic, such as BAWL (Beale's Assorted Word Lists), but I have not done so in
order to preserve continuity.
A quick summary of the 12dicts lists and their characteristics is as follows:
List             3esl   6of12  2of12  2of4brif  5desk  2of12inf
Size             21877  32153  41236  60387     61406  81520
Abbreviations    Y      Y      N      N         N      N
Acronyms         Y      Y      N      N         Y      N
British English  N      N      N      Y         N      N
Hyphenations     Y      Y      Y      N         N      N
Inflections      N      N      N      Y         N      Y
Names            Y      Y      N      N         Y      N
Phrases          Y      Y      N      N         N      N
The remainder of this document is organized as follows:
- This release
- The classic 12dicts lists
- The 3esl list
- The 2of4brif list
- The 5desk list
- How 12dicts came to be
- Conclusions
This release
This is release 4.0 of 12dicts, released Jan. 18, 2003.
It differs from previous versions by containing three additional lists
which are not derived from the "classic" 12dicts sources. Changes to
the classic lists are limited to error corrections.
The classic 12dicts lists
The 12dicts project began as the n-dicts project, n being a variable whose
value finally stabilized as 12. The purpose of the project was to create a
list of words approximating the common core of the vocabulary of American
English.
The methodology of the project was to record and correlate the words
listed in a number of small dictionaries. The number of dictionaries
so recorded is now 12, comprising 8 ESL (English as a Second Language)
dictionaries and 4 "desk dictionaries". The dictionaries chosen
vary widely by publisher, by style, by completeness and by depth.
In this version of 12dicts, all of them are dictionaries of American
English (three from British publishers). The smallest of them contains
about 20,000 entries, and the largest 46,000. (All totaled, there are
about 75,000 entries, many of which appear in only a single dictionary.)
All but two of them were published in the last seven years.
The 6of12 and 2of12 lists
I initially tried two different ways of winnowing the 12dicts data to
produce lists of common words. Both produced interesting results.
One list, the 6of12 list, contains all words and phrases
listed in 6 of the 12 dictionaries. One way of describing this list
is that it contains those words and phrases which a (seeming) majority
of lexicographers believe are relevant to people learning English,
and/or to everyday usage. This list contains about 32,000 words and
phrases. The other list, the 2of12 list, is more inclusive in that it
includes words listed in as few as two of the source dictionaries, but
less inclusive in that it excludes items of various sorts, including
multiword phrases, proper names and abbreviations. This list contains
about 41,000 words. It is perhaps more suitable for use in areas
like spell checking or word games than the 6of12 list. (Honesty
compels me to admit that neither of these lists is, by itself, a good
choice for spell checking, due to the absence of inflections, proper
names, Roman numerals, etc.)
A third list, 2of12inf.txt, developed later, is of a rather different
character, and is discussed in a later section.
A more precise description of the criteria by which the above lists
were composed is as follows:
6of12 list word selection
1. The 6of12 list contains all non-excluded words and phrases which
appear in 6 or more of the source dictionaries.
2. Prefixes and suffixes are excluded. Abbreviations are included;
however, if they are entirely lower-case and alphabetic, they are
terminated with a colon (":") so they can be easily distinguished
from regular words.
3. Inflections of included words are not themselves included unless
they are separately defined or irregular.
4. It sometimes occurs that a word is listed in several forms (e.g.,
with and without hyphenation) in 6 or more dictionaries, even though
no single form is so listed. In this case, if one spelling is clearly
more accepted, this spelling and this spelling only is listed. If all
spellings seem equally accepted, one spelling has been selected
arbitrarily for inclusion.
5. The 6of12 list contains a significant number of words which do not
meet either criterion 1 or 4 above. These words, sometimes called
"signature words", are discussed below. All of these words are
listed in at least one of the source dictionaries.
6. In addition to the ":" suffix discussed above, other special
suffix characters are used to mark words with certain characteristics,
as discussed below.
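The basic counting rule (a word qualifies once enough source dictionaries list it) can be sketched in a few lines of Python. This is only an illustration, not the process actually used to build the lists: the function name `select_common` and the toy word sets are invented here, and the exclusions, spelling reconciliation and signature words described in this section are all ignored.

```python
from collections import Counter


def select_common(dictionaries, min_sources):
    """Return the words listed in at least `min_sources` of the word sets."""
    counts = Counter()
    for words in dictionaries:
        # set() ensures a dictionary is counted at most once per word
        counts.update(set(words))
    return sorted(w for w, n in counts.items() if n >= min_sources)


# Toy stand-ins for source dictionaries (hypothetical data)
sources = [
    {"apple", "banana", "cherry"},
    {"apple", "banana"},
    {"apple", "durian"},
]

common = select_common(sources, 2)
print(common)  # ['apple', 'banana']
```

With twelve real word sets, `min_sources=6` would correspond to the 6of12 threshold and `min_sources=2` to the 2of12 threshold.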
2of12 list word selection
1. The 2of12 list contains all non-excluded words which appear in at
least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases, and
abbreviations, as well as prefixes and suffixes. It does not
exclude hyphenated words or contractions. If a word occurs in
both a hyphenated and an unhyphenated form, the unhyphenated
form is listed, even if the hyphenated form is generally
preferred.
3. The list excludes spellings which are considered (by a majority
of the dictionaries listing it) to be non-American usage. It
also excludes secondary spellings which are mentioned by fewer
than four of the source dictionaries.
4. Inflections of included words are not themselves included unless
they are separately defined, or irregular.
5. Several of the source dictionaries include listings for obscure
currencies, such as ringgit, khoum and ngwee.
I was unable to regard such words as part of the English "core vocabulary",
and so I required citation in over a third of the dictionaries for
inclusion of monetary units. A side-effect was the elimination
of the word lepton, which, in addition to its use in particle
physics, is also .01 Greek drachmas.
6. This list also includes a small number of signature words, as
discussed below.
Signature words
As indicated, both lists have been augmented with words (and, in the
case of the 6of12 list, phrases) which fail to meet the formal
requirements for inclusion. In the case of the 6of12 list, 1024
words were added (about 3 % of the total). These are all words which,
in the judgment of the compiler, are as familiar as many of the words
which met the criteria for inclusion. Examples of some of the sorts
of words which were added are:
- Words of the same category as other included words. An example is
the astrological sign Cancer, which alone of all the
astrological signs fails to appear in 6 or more of the dictionaries.
Similarly added were the omitted holidays Thanksgiving and
Christmas Eve.
- Vulgarities, sexual terms and insults. Some such words were
already included, but most of the source dictionaries were quite
squeamish about them. These words are very widely known indeed;
I hold that any list of "common" words which does not include the
infamous f-word is simply discredited thereby. Some may feel that
it would have been better to leave some or all of these terms
unmentioned. Nevertheless, the expression of blasphemy,
unwarranted contempt and perverse lust, whether in words or in
deeds, is a very human trait. Suppressing the evidence of these
aspects of the human condition in our language makes no more sense
than excluding leprosy, gangrene and dementia,
no matter how unpleasant they may be to contemplate.
- Conventional conversational phrases so common as to be practically
invisible to native speakers. Examples are thank you, good
night, uh-huh, of course and gesundheit.
- Sports terminology, especially for football and baseball. (If I,
who am practically sports-blind, noticed this deficiency, it must
be of major proportions indeed.)
Note that the signature words in the 6of12 list can be identified via
the suffix character "+", and eliminated if desired.
A much smaller set of words (49) was added to the 2of12 list. These
were of two sorts:
- Signature words from the 6of12 list which were not already present
in the 2of12 list, and which are not excluded due to being
abbreviations, phrases, etc.
- Inflections of irregular verbs not explicitly mentioned in 2
source dictionaries, such as outfought and reheard.
Annotations
Some of the 6of12 list entries are annotated with a suffix character,
giving additional information about the associated word. The
annotations can be easily removed with an editor or script if
they are unwanted.
These annotations are:
: The word is an otherwise unmarked abbreviation. This suffix
may appear in combination with another suffix.
& The word is primarily a non-American usage.
# The word is generally held to be a variant or less preferred
form of another word.
< This form of a word is held to be the primary form by fewer
dictionaries than some other form of the word.
^ This form of the word was selected arbitrarily from a set of
variants, none of which was clearly preferred.
= Roughly, this indicates a "second class" word, as described
below.
+ The word is a signature word.
The reasons a word might be marked with the = annotation are:
- The word is an inflection which was defined in the same
entry as the base word.
- The word is a derived word (-ly, -ness or
-er/or) which was not defined in a separate entry.
- The word appeared in a list of undefined words with a
common prefix, such as un- or re-.
The words in the 2of12 list are not annotated.
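Removing the annotations, or dropping the signature words marked with "+", might look like the following Python sketch. It assumes one entry per line with the suffix characters attached directly to the end of the entry, which is how this document describes them but has not been checked against the files here. The names `split_entry` and `plain_words` are hypothetical, and the sample entries are made up for illustration.

```python
# Suffix characters listed in the Annotations section above
ANNOTATION_CHARS = ":&#<^=+"


def split_entry(line):
    """Split one 6of12 entry into (word, annotation suffix)."""
    entry = line.strip()
    # rstrip removes any run of trailing annotation characters,
    # so combined suffixes such as ":&" are handled as well
    core = entry.rstrip(ANNOTATION_CHARS)
    return core.rstrip(), entry[len(core):]


def plain_words(lines, drop_signature=False):
    """Return unannotated words, optionally skipping signature words."""
    result = []
    for line in lines:
        word, flags = split_entry(line)
        if not word:
            continue  # skip blank lines
        if drop_signature and "+" in flags:
            continue
        result.append(word)
    return result


# Made-up sample entries, not quoted from the real list
sample = ["abbey", "etc:", "thank you+", "grey&"]
print(plain_words(sample))
print(plain_words(sample, drop_signature=True))
```

Keeping the suffix as a second value, rather than discarding it, makes it possible to filter on the other annotations in the same way.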
The 2of12inf list
The 2of12inf list is of a rather different character from the two
original "classic" lists. Conceptually,
it is simple. It consists of all the words in the 2of12 list, plus
their inflections, amounting to about 81,000 words. This list may
be more useful than the other lists for applications like word games.
It was created to help Kevin Atkinson in his Aspell and SCOWL projects
(for which, follow this link).
Unlike the 6of12 and
2of12 lists, this list is not based exclusively on the contents of my
12 source dictionaries, and for this reason it has, I feel, less
authority than the other classic 12dicts lists. It also probably has a
significantly higher error rate than the other lists, for reasons
explained below.
The criteria defining the 2of12inf list are as follows:
1. The 2of12inf list contains all non-excluded words which appear in
at least 2 of the source dictionaries.
2. This list excludes capitalized words, multiword phrases,
abbreviations, contractions, hyphenated words and single-letter
words, as well as prefixes and suffixes.
3. The list does not exclude secondary spellings, non-American usages
or monetary units.
4. The list includes inflections of all included words. Any
inflection mentioned or clearly implied by any of the source
dictionaries is included (i.e., two citations are not required).
Additionally, some inflections have been added from other sources.
5. Plurals of "uncountable" nouns were included, annotated with the
"%" suffix character. See below for an extended discussion of
the inclusion of these words.
6. Signature words from the other lists, plus their inflections, were
added. No other signature words were added.
Though the 2of12inf list still consists mostly of very common words,
criteria 3 through 5 above cause the 2of12inf list to contain a greater
proportion of unfamiliar and unusual words than the other classic
12dicts lists.
The 2of12inf list was not derived directly from the 12 source
dictionaries. The starting point was a subset of Kevin Atkinson's
AGID list, a list of words, parts of speech and inflections derived
from public-domain sources, notably Moby Words and WordNet. (See the
file agid.txt in the 12dicts archive, which is a copy of the AGID "readme",
for more information on the antecedents of AGID.) 2of12inf was created
by a process of editing the AGID subset to remove spurious entries and
those which reflected a more esoteric English vocabulary than the other
12dicts lists, and to add inflections which AGID failed to identify.
This process required significantly less effort than would have been
needed to derive the list directly from the source dictionaries.
Unfortunately, a side effect of the process is that the result is
likely to be somewhat less reliable than the other 12dicts lists.
In particular, Moby Words is notoriously unreliable, and I find it
unlikely that I have successfully identified all the spurious
inflections its use has introduced. It is my hope in the future to
release another edition of 2of12inf which is not derived from AGID,
and therefore not "infected" by Moby Words.
After the first version of the 2of12inf list was released, I replaced
one of the source dictionaries, officially an international dictionary
but in actuality rather British in its orientation, with a more
American dictionary by the same publisher. It was not practical
(nor necessarily desirable) for me to go through the list removing
inflections endorsed only by the superseded dictionary. For this
reason, the 2of12inf list has a slightly more international character
than the other 12dicts lists. It is not altogether clear that this
is a bad thing.
Selection of inflections
Ideally, the 2of12inf list would contain only inflections listed in
one of the 12dicts source dictionaries. This proved not to be
practical. The reason for this has to do with the nature of these
sources, which are mostly ESL dictionaries. An ESL dictionary might
well list the word esophagus, but, because an English learner is
unlikely to need to talk about this organ in the plural, it will
probably not bother to list the plural form esophagi. For words of
this sort, I therefore needed to obtain their inflections from other
sources. Obviously, the decisions on when to include additional
inflections were judgment calls, as were the choices of which
inflections to add.
Adjectival inflections (comparatives and superlatives) proved to be
an especially annoying problem. Only 2 of my 12 source dictionaries
provided remotely reliable information of this sort. In fact, such
information is sparse and inconsistent in most dictionaries of any
size. I relied on a small set of additional dictionaries for this
information, which was mostly disjoint from the sources for plurals
and verb forms. Several of these sources were Scrabble(r)-related,
and therefore inclined to include forms of little plausibility such
as iller/illest or fertiler/fertilest.
Accordingly, I ended up rejecting some of the documented inflections on
grounds of implausibility. I have no doubt that, in the process, I made
a number of errors of both inclusion and exclusion and, in any case, many
of the forms listed have no connection with any of the 12dicts source
dictionaries.
One additional problem in the creation of the 2of12inf list was that
of "uncountable" nouns and their plurals. Some English dictionaries,
especially ESL dictionaries, as well as other linguistic sources
attest to the existence of nouns which cannot be counted, or used in
the plural. Examples of such nouns include mud, rayon, oregano,
chess, fairness, wisdom, aluminum, training, materialism
and chickenpox. This is an entirely commonsense notion, but a
difficulty is the fact that the boundary between the countable and the
uncountable is extremely vague and ill-defined. For example, the word
coffee is ordinarily uncountable, but not when ordering in a
restaurant; likewise, the word symmetry is uncountable except in
physics or math.
In general, it is possible to contrive a context where use of the
plural of any noun whatsoever is reasonable.
An alternate position, therefore, is that in fact no nouns are
uncountable, and that any noun which is not already plural possesses
a plural. This position is especially useful in the context of word
games, where words such as zeals and anthraxes
may produce large scores. For this reason, the official Scrabble
dictionaries list words such as thens, onces and
mankinds, which most people find
rather implausible. The fact that the 2of12inf list might well be
useful in gaming contexts, together with the fact that the boundary
between countable and uncountable nouns is so ill-defined, served as
a powerful argument for inclusion of all plural forms, whether
commonly used or not, while its derivation from ESL sources argued
for including only the plurals of countable nouns, however
distinguished.
In the end, I was unable to resolve this dilemma, and adopted a
compromise. The 2of12inf list includes all plurals, but with the
plurals of uncountable nouns marked, making it easy to remove them
if they are not wanted. That left the issue of how to establish
countability. Six of my source dictionaries included information
on countability, which was adequate to decide the status of most of
the included nouns. As for the rest, as usual, I used my best
judgment. I will confess to occasionally overriding the source
dictionaries when I believed they were clearly incorrect. (For
instance, I chose not to mark the word hatreds as an
uncountable plural, in defiance of the opinion of all my sources,
on the grounds that it has been used in too many news stories from
Bosnia to be considered unusual.) It is interesting to note that
most of the plurals I added from auxiliary sources were of words
considered uncountable.
The difficulties listed above, and the fact that I was forced to
exercise personal judgment frequently in creating it, emphasize a
fundamental difference between this list and the other classic 12dicts
lists. I have tried to make the 6of12 and 2of12 lists reflect only the
source dictionaries, and to keep my own judgments and opinions out of
the picture (except for my addition of signature words). This has
proved impossible to achieve for the 2of12inf list, which accordingly
represents a less authoritative and more arbitrary collection.
Additionally, the 2of12inf list has undergone less proofreading and
validation than the other lists, and I suspect the error rate is
considerably higher than the idealistic goal of 0.02 % I advocate
elsewhere in this document. Nevertheless, I hope it may prove to be
of some use and interest.
I wish to offer my special thanks to Kevin Atkinson, for supplying me
with the AGID list, and for encouraging me to add the inflections. Of
course, any errors that remain in the 2of12inf list are my own
responsibility, and should not be blamed on Kevin, AGID, or even on
Moby.
The 3esl list
The 3esl list represents another attempt to produce an English "core
vocabulary" list. It is about 2/3 of the size of the 6of12 list,
which it resembles in terms of the sorts of words included.
The 3esl list is a far more subjective list than any of the classic
12dicts lists. It was compiled from 3 small ESL dictionaries, using
the same criteria for eligibility as the 6of12 list. I started with
a list composed of all words from the smallest of the 3 sources, plus
all words contained in both of the others. This list was then edited
in the following ways:
- I removed alternate spellings for included words, such as grey
and off-stage. I also removed very similar synonyms for the
same concept, for instance, removing cable television as a
duplicate of cable TV.
- I added one form of each word which would have been included if
the sources had agreed on spelling, such as shortchange and
back seat.
- I removed some words which were present in the smallest of the
sources but seemed too esoteric, such as the symbols of chemical
elements. I did this only for words which were not present in the
other sources.
- I added some words which were present in only one of the two
larger sources, but which seemed appropriate to add. These words
were frequently of the sort added to the 6of12 list as signature
words, as well as some inflections that often function as words
with meanings of their own, such as comforting and
notes.
All of these changes were quite subjective in nature, and quite
numerous. Probably more than 10 % of the candidate words were added
or removed in this way. For this reason, it is pointless to speak
of signature words for this list; the composition of the list is too
arbitrary for the term to make any sense. (I will note that the list
is still not entirely arbitrary, as I added only words found in
some form in one of the sources, and removed no words present in two
of the sources other than duplicates. Thus, words like front
page were not added, no matter how familiar, and words such
as lugubrious were not removed, despite clearly not being
part of any "core vocabulary".)
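The mechanical part of this procedure reduces to a few set operations, sketched below in Python. The function name `partition_3esl_style` and the toy word sets are invented for illustration. The sketch only identifies which words were eligible for each kind of edit; the actual choices were subjective, as described above, and cannot be reproduced by code.

```python
def partition_3esl_style(smallest, second, third):
    """Return (candidates, removable, addable) for three source word sets."""
    smallest, second, third = set(smallest), set(second), set(third)
    # Starting list: everything in the smallest source,
    # plus everything found in both of the larger ones
    candidates = smallest | (second & third)
    # Only words unique to the smallest source were open to removal
    removable = smallest - (second | third)
    # Words found in exactly one of the larger sources could be added
    addable = (second ^ third) - smallest
    return candidates, removable, addable


# Toy data (hypothetical, not taken from the real sources)
smallest = {"cat", "dog", "xe"}
second = {"cat", "comforting", "house"}
third = {"dog", "house", "notes"}

candidates, removable, addable = partition_3esl_style(smallest, second, third)
print(sorted(candidates))  # ['cat', 'dog', 'house', 'xe']
print(sorted(removable))   # ['xe']
print(sorted(addable))     # ['comforting', 'notes']
```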
Like the 6of12 list, the 3esl list marks lower-case abbreviations
with a ":" suffix, to prevent them from being mistaken for regular
English words.
One final note on this list. The 3esl list contains about 1500 words
not present in the 6of12 list. Because these two lists have the same
rules for the kinds of words included, one could easily combine
the two to produce a slightly larger list including a number of words
whose omission from 6of12 is rather surprising. Be warned that in a
few cases, the spelling chosen for words with multiple spellings is
different in the two lists, and I would recommend that the duplicates
be removed. (I'll be happy to provide a list of the duplicates if
anyone wants one.)
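A combined list, together with a report of likely spelling duplicates, could be produced along the lines of this Python sketch. It rests on assumptions: annotation suffixes are attached directly to the entries, and two entries count as duplicates when they differ only in hyphens or spaces. That heuristic would catch a pair like back seat and backseat but not one like grey and gray. The function names are hypothetical and the sample entries are made up. Note that stripping the suffixes also discards the ":" abbreviation marker.

```python
import re
from collections import defaultdict

# Suffix characters used by the 6of12 list; 3esl uses only ":"
ANNOTATION_CHARS = ":&#<^=+"


def bare(entry):
    """Strip surrounding whitespace and trailing annotation characters."""
    return entry.strip().rstrip(ANNOTATION_CHARS).rstrip()


def spelling_key(word):
    """Ignore hyphens and spaces when comparing spellings."""
    return re.sub(r"[-\s]", "", word)


def merge_lists(esl_entries, six_entries):
    """Return (merged words, groups of entries that look like duplicates)."""
    merged = {bare(e) for e in esl_entries} | {bare(e) for e in six_entries}
    merged.discard("")  # ignore blank lines
    groups = defaultdict(set)
    for word in merged:
        groups[spelling_key(word)].add(word)
    clashes = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return sorted(merged), clashes


# Made-up sample entries, not quoted from the real lists
merged, clashes = merge_lists(
    ["back seat", "cable TV", "etc:"],
    ["backseat", "cable TV", "abbey+"],
)
print(merged)
print(clashes)  # [['back seat', 'backseat']]
```

Each reported group still needs a human decision about which spelling to keep.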
The 2of4brif list
All of the classic 12dicts lists are unabashedly oriented towards
American English. I've received a few expressions of interest in a
British English list. The result is the 2of4brif list. This list
was compiled from 4 large "international" ESL dictionaries, published
by British publishers. To this American, they are more British than
they are international; quite possibly, they seem more American than
international to British readers. It is interesting to note that,
although there were only a third as many sources for this list as for
the 12dicts lists, these dictionaries resembled each other far more
closely than their American counterparts, which could mean that the
2of4brif list is as good an approximation of a "core" British English
vocabulary as the 6of12 list is for American English. (Or, alternately,
it may simply mean that my choice of sources was too narrow.)
The criteria for inclusion in this list were basically those of the
2of12inf list. In particular, inflections are included for all words,
but hyphenated words, contractions, phrases, proper names and
abbreviations are all excluded. One important difference between
the two is the way in which inflections were determined for inclusion.
The 2of12inf list includes some inflections found in one (or even none)
of its sources. Further, as discussed in detail above,
it includes plurals for words which are not normally
considered to have plurals. The 2of4brif list differs in both of
these regards. It includes only inflections endorsed by two or more
of the sources, specifically excluding any plural forms for nouns
listed as uncountable.
The 2of4brif list includes no signature words as such. I made a small
number of adjustments for consistency, such as making sure that
-ise and -ize spellings were equally
represented, and adding plurals for ordinal numbers. (Why
fourteenth would be defined as a fraction, but not
seventeenth, I must simply regard as a mystery.) These
edits were so few, and so clearly harmless, that I have not marked them.
Prospective users of the 2of4brif list should realize that it was
compiled by an American. If my sources contained any glaring errors
(and most dictionaries have a few), I might well have failed to notice
them, and so perpetuated them in the list. The fact that two citations were
required is some protection against such an event, but no guarantee.
As the 2of4brif list is very similar in makeup to the 2of12inf list,
a user who wants a larger, more international list than either could
reasonably merge the two. If you do this, you should remove the
unusual plurals (marked with a "%") from the 2of12inf list in the
process, for consistency.
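Such a merge might be sketched in Python as follows. The sketch assumes one entry per line in both lists, with "%" as the only annotation, attached to the end of the entry in 2of12inf. The function name `merge_inflected` is hypothetical and the sample entries are made up.

```python
def merge_inflected(brif_lines, inf_lines):
    """Merge 2of4brif and 2of12inf entries, dropping '%'-marked plurals."""
    brif = {line.strip() for line in brif_lines if line.strip()}
    inf = set()
    for line in inf_lines:
        entry = line.strip()
        if not entry or entry.endswith("%"):
            continue  # skip blanks and plurals of uncountable nouns
        inf.add(entry)
    return sorted(brif | inf)


# Made-up sample entries, not quoted from the real lists
british = ["colour", "colours", "mud"]
american = ["color", "colors", "mud", "muds%"]
print(merge_inflected(british, american))
# ['color', 'colors', 'colour', 'colours', 'mud']
```

Dropping the marked entries outright, rather than stripping the "%" and keeping them, is what makes the result consistent with 2of4brif, which lists no plurals for uncountable nouns.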
The 5desk list
I created the 5desk list in an attempt to do a better /usr/dict/words
(about which I offer many harsh criticisms elsewhere in this document).
The sorts of words admitted are the same sorts that /usr/dict/words
contains. Though somewhat larger in size than most versions of
/usr/dict/words, this is still a short word list, striving for inclusion
of words one is likely to encounter rather than the complete jargon of
every possible scientific, artistic or occult endeavor.
5desk was assembled primarily from five "desk dictionaries". It
was augmented by words from five minor sources, including a "vocabulary
builder" and a collection of proper names. The list excludes
prefixes, suffixes, phrases, hyphenated words, contractions and most
abbreviations and acronyms. There was no requirement for multiple
listings; all qualifying words from each of the sources were included.
Inflections of included words were not included themselves except when
irregular, or separately defined. Variant and non-American spellings
were not excluded, and no signature words were added.
Words commonly considered to be abbreviations/acronyms were included
if they contained at least one upper case character, and were defined
with an explicit part of speech. This excluded items like Mr and
Feb, which are abbreviations in the classic sense, but allowed words
like DNA and ATM, which are used far more frequently than that
which they abbreviate. While there is a trend in modern dictionaries
to list such words as nouns (or occasionally verbs, adverbs, etc.),
it is a trend in progress, and rather inconsistently applied. For
this reason, the set of such words in the 5desk list is somewhat
incoherent, including SPCA but not PETA,
AIDS but not SIDS, KGB but
not CIA, and PDQ but not ASAP.
One class of commonly-used words is regrettably absent from the 5desk
list, because I was unable to find a satisfactory source for them.
This is the class of commercial names such as Exxon, Tylenol,
Pepsi and Chevy. This is probably forgivable,
as this class of names is as ephemeral and transitory as teenage slang.
The one-time household words Kool, Ovaltine, Philco and
Ipana serve now only as answers to trivia questions,
with modern wonders like Starbucks, Google, Ritalin
and TiVo taking their place on the tongues of the trendy.
The 5desk list has clearly moved beyond any "core vocabulary" concept.
It includes quite esoteric words (ogee, pleonastic), very
uncommon spellings (thiamine, yuppy), and obscure geographical
and historical names (Paricutin, Nevelson). Like
/usr/dict/words, it is frequently inconsistent and arbitrary, but I
hope at the least I have avoided including spelling errors, and
overlooking the stuff of everyday conversation. Perhaps it will be
useful as a compromise between basic lists such as 3esl, and truly
massive lists like Mendel Cooper's ENABLE.
How 12dicts came to be
It may have occurred to some to wonder about how something like the
n-dicts project came to be (though I assume that anyone who bothers
to download this archive must already have some idea that such a
project could be of interest).
Some years ago, there was a post to the sci.crypt Usenet newsgroup,
on the subject of creating PGP passphrases using randomly selected
entries from a supplied list of very short words. (If this sounds
interesting, follow this link for an expanded version of the post.)
The word list,
which was extracted from /usr/dict/words on some UNIX system, seemed
to me ill-suited to its intended purpose. It included arcane acronyms
(bstj, fmc), misspellings (diety, ouvre) and
words of amazing obscurity (bhoy, kombu). I decided I
could do better (and eventually did).
This caused me to start downloading English word lists, of which there
are many, from the Internet. I was not impressed by the overall
quality of these lists, and the few which were high-quality were
all-inclusive, burying the everyday words under a mountain of archaisms
and esoterica.
The flaws of the vast majority of these lists are worth recounting:
- Failure to proofread. Many of these lists are littered with
misspellings and typos, sometimes approaching gibberish. (I
presume, for instance, that the bizarre string nondploe,
which was found in a purported Scrabble word list, is a typo
for something more or less legitimate, but I have no idea what.)
Working on my own lists has helped me understand that 100 %
accuracy is a very demanding goal, seldom actually achieved, but
I still feel it reasonable to expect no more than 1 or 2 errors
per 10,000 words.
- Acceptance of completely undocumented lazy spellings, such as
bullseye and courtmartial.
- Failure to respect capitalization.
- Failure to distinguish abbreviations from other entries.
- Treating esoteric computer jargon, and especially UNIX jargon,
as everyday English. (Beware any list which includes bitblt,
emacs, inode or lvalue.)
- Apparently random word selection. For instance, the most common
version of /usr/dict/words contains a large set of apparently
randomly chosen personal names (uncapitalized, of course, and
missing wanda, marge, polly and sid).
- Inconsistent inflection. Some lists include all inflections of
their vocabulary, while others include only singulars and
infinitives. Either policy is fine, and has its advantages. I
am personally very annoyed when inflected forms appear at random.
I find this generally happens when a compiler merges several lists
with different characteristics, with no attempt to reconcile their
divergent styles.
- Omission of everyday words. I've seen a purported general-purpose list
that includes bremsstrahlung, yet omits log and
beer. Or that includes saxophone but not
sax, and rhinoceros but not rhino.
Of course, due to my original purpose in seeking out common short
words, I found this especially annoying.
One result of my frustration with this situation was my working with
Mendel Cooper on ENABLE (for further information, check out this link),
which was close to unique in having an active caretaker,
one clearly concerned with quality, and in being oriented towards
American rather than British English. But ENABLE is an all-encompassing
list and, even if it had been complete at the time I started my search
for a list of common words, it would not have been what I wanted for
that reason.
I finally decided that only starting from scratch with a systematic
approach was likely to get me what I was looking for, and that
dictionaries intended for non-native speakers of English were the
best possible source for words that are in some cases so familiar
that we never think of them. This has led to the 12dicts lists,
which I hope have managed to avoid the flaws recited above.
(I should acknowledge one form of inconsistency exhibited by the
12dicts lists, which is that sometimes related words are spelled
inconsistently. For instance, the 2of12 list contains both
broadminded and broad-mindedness. This
generally occurs as a result of the methodology used to build the lists.
In the case of broadminded, only one source dictionary listed
broadmindedness, which was therefore excluded. I felt unequal
to trying to correct these inconsistencies, some of which are real and not
mere artifacts of 12dicts, such as the contrast between self-conscious
and unselfconscious.)
Conclusions
When I released the first version of 12dicts in 1999, I assumed I was
done with it. It hasn't worked out that way. Before I declare it finished
for a second time, there are a few more things I'd like to accomplish.
- As mentioned above, I would like to rework the 2of12inf list to remove
the dependency on the Moby lists.
- As may be seen by inspecting the table of file characteristics, the
12dicts files now form a spectrum of word lists, with contents ranging
from the extremely common to the mildly esoteric. I would like to
extend the spectrum further by applying the 12dicts methodology to
dictionaries of larger size. Whether I will ever get the time for a
project this large remains to be seen. If it ever comes to pass,
it will probably be released separately from 12dicts itself, as
anything larger than the 5desk list will be too large to even pretend
to represent a "core English" vocabulary. (Even the 5desk list itself
is too large for that purpose.)
- It is possible that in the future the "n" of n-dicts will increase
again, but, in fact, consideration of an additional dictionary now
generally ends with the discovery that its vocabulary matches 12dicts
pretty closely. At the very least, this phenomenon gives me hope that
the 12dicts lists have now fulfilled their basic purpose.
The 12dicts lists were compiled by Alan Beale. I explicitly release
them to the public domain, but request acknowledgment of their use.
(Actually, the dependency of the 2of12inf list on AGID prevents its
release into the public domain. However, I do not impose any additional
requirements on its use beyond those imposed by AGID and its sources,
as described in agid.txt.) Feel free to send comments, suggestions,
inquiries and/or large sums of money to me at
[email protected]. If you find 12dicts useful, I'd love to hear about it.