
se_tpb_xmldetection
Transformer documentation: se_tpb_xmldetection
Transformer Purpose
This transformer can do abbreviation, initialism and acronym detection, sentence
detection and word detection in XML documents.
Multiple XML grammars can be supported: adding a new grammar requires only a
configuration file. So far, only support for DTBook documents has been added.
The internal Java BreakIterator is used to perform the sentence and word
detection, so any language supported by Java should work with this transformer.
xml:lang markup is used to switch the language.
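As a rough illustration of the kind of segmentation java.text.BreakIterator provides, the following sketch splits a string into sentences and words for a given locale. It is not the transformer's own code: the class and method names (BreakIteratorSketch, sentences, words) are invented for this example, and turning an xml:lang value into a Locale with Locale.forLanguageTag is only an assumption about how the language switch could be done.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the transformer.
class BreakIteratorSketch {

    // Splits text into sentences using the sentence iterator for the locale.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Splits text into words, skipping segments without any letter or digit
    // (whitespace and punctuation are reported as segments too).
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String word = text.substring(start, end);
            if (word.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: an xml:lang value such as "en" maps directly to a Locale.
        Locale locale = Locale.forLanguageTag("en");
        String text = "This is one sentence. This is another one.";
        System.out.println(sentences(text, locale));
        System.out.println(words(text, locale));
    }
}
```

A fresh iterator is created per call here for simplicity; code that processes a whole document would more likely keep one iterator per language and call setText repeatedly.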
The abbreviation, initialism and acronym detection is based on word lists in
configuration files. So far, there are configuration files for English, Swedish and
French. The transformer will not fail if it encounters a language it has no
configuration file for; no abbreviations, initialisms or acronyms will be detected
for that particular language.
This transformer differentiates between three types of abbreviations.
In initialisms, each letter is pronounced (e.g. HTML). An acronym is
pronounced as a word (e.g. DAISY), whereas an abbreviation is pronounced by
speaking its expansion ("e.g." is pronounced as "for example").
Input Requirements
A document having a doctype declaration or root element XML namespace supported by the
configuration files.
Output
On success
An XML document having abbreviation and acronym markup, sentence markup or word markup.
On error
On error, this transformer will throw an exception and abort execution.
Configuration/Customization
Parameters (tdf)
- input
- Required. Path to the input XML document
- output
- Required. Path of the output XML document
- doAbbrAcronymDetection
- Optional. If set to true (the default), abbreviation, initialism
and acronym detection will be performed.
- doSentenceDetection
- Optional. If set to true (the default), sentence detection will
be performed.
- doWordDetection
- Optional. If set to true (the default), word detection will be performed.
- copyReferredFiles
- Optional. If set to true (the default), referred files, such as images
referenced from a DTBook document, will be copied to the output.
- customLang
- Optional. A file containing custom abbreviations, initialisms and acronyms. These
abbreviations and acronyms will be available regardless of language.
- doOverride
- Optional. If set to true (default is false), the abbreviations,
initialisms and acronyms in the custom language file will override the language-specific ones
defined in the language-dependent configuration files.
- doSingleSentAdd
- Optional. If set to false (default is true, which is also the original
behavior of this transformer), sent elements will not be added where they would become the only
descendant of the parent element.
Extended configurability
File format for abbreviations and acronyms
The language root element basically contains three sub elements
(initialism, acronym and abbreviation). Each of these
elements can have three attributes:
- before
- A regular expression describing the text before an abbreviation, initialism
or acronym.
- after
- A regular expression describing the text after an abbreviation, initialism
or acronym.
- suffix
- A regular expression describing the allowed suffixes to the acronym (such as
a plural 's').
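How the three expressions are combined internally is not described here. As one possible reading, the sketch below wraps a literal name in the before expression, an optional suffix and a lookahead for the after expression. The class name RegexContextSketch, the method names and the sample attribute values are all invented for illustration, and the transformer may well combine the attributes differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the transformer.
class RegexContextSketch {

    // Builds one pattern: before-context, literal name with optional suffix
    // (captured as group 1), then the after-context as a lookahead.
    // This composition is an assumption, not the transformer's documented behavior.
    static Pattern build(String name, String before, String after, String suffix) {
        String regex = "(?:" + before + ")"
                + "(" + Pattern.quote(name) + "(?:" + suffix + ")?)"
                + "(?=" + after + ")";
        return Pattern.compile(regex);
    }

    // Returns every matched occurrence of the name (including any suffix).
    static List<String> findAll(String text, String name,
                                String before, String after, String suffix) {
        Matcher m = build(name, before, after, suffix).matcher(text);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample values chosen for this example only.
        String before = "^|\\s";          // start of text or whitespace
        String after = "$|\\s|[.,;]";     // end of text, whitespace or punctuation
        String suffix = "s";              // plural 's'
        System.out.println(findAll("Two HTMLs and xHTML.", "HTML", before, after, suffix));
    }
}
```

Pattern.quote keeps dots in names such as "o.s.v." from being treated as regular expression wildcards.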
Each abbreviation, initialism or acronym consists of a key element. Each
key has one or more name elements describing the string(s) to
be matched. The expansion element contains the expanded version of the
abbreviation, initialism or acronym.
<key>
  <name>o.s.v.</name>
  <name>o.s.v</name>
  <name>osv.</name>
  <name>o s v</name>
  <expansion>och så vidare</expansion>
</key>
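Since StAX is the XML API this transformer depends on, a key element like the one above could be read with an XMLStreamReader along the following lines. This is a minimal sketch and not the transformer's actual loader: the class KeyReaderSketch, the Key holder type and the readKeys method are hypothetical names, and the sketch ignores the surrounding language, initialism, acronym and abbreviation elements and their attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the transformer.
class KeyReaderSketch {

    // Holds the names and the expansion of one <key> element.
    static final class Key {
        final List<String> names = new ArrayList<>();
        String expansion;
    }

    // Collects every <key> element found in the given XML text.
    static List<Key> readKeys(String xml) {
        List<Key> keys = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Key current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String local = reader.getLocalName();
                    if ("key".equals(local)) {
                        current = new Key();
                    } else if (current != null && "name".equals(local)) {
                        // getElementText() also consumes the matching end tag.
                        current.names.add(reader.getElementText());
                    } else if (current != null && "expansion".equals(local)) {
                        current.expansion = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "key".equals(reader.getLocalName())) {
                    keys.add(current);
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed abbreviation file", e);
        }
        return keys;
    }

    public static void main(String[] args) {
        // \u00e5 is the letter "å", escaped to avoid source-encoding issues.
        String xml = "<key><name>o.s.v.</name><name>o.s.v</name>"
                + "<name>osv.</name><name>o s v</name>"
                + "<expansion>och s\u00e5 vidare</expansion></key>";
        for (Key key : readKeys(xml)) {
            System.out.println(key.names + " -> " + key.expansion);
        }
    }
}
```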
This section can be expanded.
File format for XML grammar definitions
This section remains to be written
Further development
Dependencies
StAX is used for XML processing.
Author
Linus Ericson, TPB
Licensing
LGPL