Transformer documentation: se_tpb_xmldetection

Transformer Purpose

This transformer detects abbreviations, initialisms and acronyms, as well as sentences and words, in XML documents.

Multiple XML grammars can be supported: adding a new grammar requires only a configuration file. So far, only support for DTBook documents has been added.

Java's built-in BreakIterator (java.text.BreakIterator) performs the sentence and word detection, so any language supported by Java should work with this transformer. The xml:lang attribute is used to switch the language.
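As a rough illustration of the BreakIterator-based detection described above, the following sketch splits plain text into sentences and words for a given locale. The class and method names are hypothetical and are not taken from the transformer's source; the real transformer works on XML text nodes and takes the locale from xml:lang, which this sketch leaves out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the transformer's actual code.
final class BreakIteratorSketch {

    /** Splits text into sentences using the sentence rules of the given locale. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Sentence segments include trailing whitespace, so trim it away.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Splits text into words, skipping whitespace and punctuation segments. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Word boundaries also delimit spaces and punctuation;
            // keep only segments that contain a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }
}
```

A locale for an xml:lang value such as "sv" can be obtained with Locale.forLanguageTag("sv") and passed to either method.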

Abbreviation, initialism and acronym detection is based on word lists in configuration files. So far, there are configuration files for English, Swedish and French. The transformer will not fail if it encounters a language it has no configuration file for; it will simply find no abbreviations, initialisms or acronyms for that language.

This transformer distinguishes three types of abbreviations. In an initialism, each letter is pronounced (e.g. HTML). An acronym is pronounced as a word (e.g. DAISY). An abbreviation is pronounced by reading out its expanded form ("e.g." is pronounced as "for example").

Input Requirements

An XML document whose doctype declaration or root element namespace is supported by the configuration files.
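The namespace half of this requirement could be checked along the following lines. This is only a sketch using StAX: the class and method names are hypothetical, the doctype check is omitted, and how the transformer actually matches a document against its configuration files is not described here.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the transformer's actual code.
final class RootNamespaceSketch {

    /** Returns the namespace URI of the root element of the given XML text. */
    static String rootNamespace(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not load external DTDs just to look at the root element.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getNamespaceURI();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

For a DTBook 2005 document, the returned value would be compared against the namespace URI http://www.daisy.org/z3986/2005/dtbook/.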

Output

On success

An XML document with abbreviation and acronym markup, sentence markup and/or word markup, depending on which detections are enabled.

On error

On error, this transformer will throw an exception and abort execution.

Configuration/Customization

Parameters (tdf)

input
Required. Path to the input XML document.
output
Required. Path to the output XML document.
doAbbrAcronymDetection
Optional. If set to true (the default), abbreviation, initialism and acronym detection will be performed.
doSentenceDetection
Optional. If set to true (the default), sentence detection will be performed.
doWordDetection
Optional. If set to true (the default), word detection will be performed.
copyReferredFiles
Optional. If set to true (the default), referred files, such as images referenced from a DTBook document, will be copied to the output.
customLang
Optional. A file containing custom abbreviations, initialisms and acronyms. These abbreviations and acronyms will be available regardless of language.
doOverride
Optional. If set to true (the default is false), the abbreviations, initialisms and acronyms in the custom language file will override the language-specific ones defined in the language-dependent configuration files.
doSingleSentAdd
Optional. If set to false, sent elements will not be added where they would become the only descendant of the parent element. The default is true, which is also the original behavior of this transformer.

Extended configurability

File format for abbreviations and acronyms

The language root element contains three child elements (initialism, acronym and abbreviation). Each of these elements can have three attributes:

before
A regular expression describing the text before an abbreviation, initialism or acronym.
after
A regular expression describing the text after an abbreviation, initialism or acronym.
suffix
A regular expression describing the allowed suffixes to the acronym (such as a plural 's').
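One way to picture how the three attributes could work together is to combine them with a string to be matched into a single regular expression. The sketch below is an assumption made for illustration: the class, the method names, the way the pieces are combined and the sample attribute values are all hypothetical and do not come from the transformer's source or its configuration files.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the transformer's actual matching logic.
final class AffixPatternSketch {

    /**
     * Builds a pattern that finds the literal name, optionally followed by
     * the suffix, when preceded by "before" and followed by "after".
     */
    static Pattern compile(String name, String before, String after, String suffix) {
        String regex = "(?:" + before + ")"                              // text before the match
                + "(" + Pattern.quote(name) + "(?:" + suffix + ")?)"     // name plus optional suffix
                + "(?=" + after + ")";                                   // text after, not consumed
        return Pattern.compile(regex);
    }

    /** Returns the first matched name (including any suffix), or null if none. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

With the assumed values before = "^|\\s", after = "$|\\s|[.,;]" and suffix = "s", the name "CD" would be found as "CDs" in "Two CDs here" but would not be found inside "ABCD x".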

Each abbreviation, initialism or acronym is defined by a key element. Each key has one or more name elements describing the string(s) to be matched, and an expansion element containing the expanded version of the abbreviation, initialism or acronym.

<key>
	<name>o.s.v.</name>
	<name>o.s.v</name>
	<name>osv.</name>
	<name>o s v</name>
	<expansion>och så vidare</expansion>
</key>
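Reading such key elements with StAX could look roughly like the sketch below, which maps every name to the expansion of its key. The class and method names are hypothetical, and the sketch assumes that the expansion element follows the name elements inside each key, as in the example above; it ignores the before, after and suffix attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader, not the transformer's actual configuration loader.
final class KeyParserSketch {

    /** Maps each name found in a key element to that key's expansion. */
    static Map<String, String> parse(String xml) {
        Map<String, String> expansions = new LinkedHashMap<>();
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                switch (reader.getLocalName()) {
                    case "key":
                        names.clear();                      // a new entry starts
                        break;
                    case "name":
                        names.add(reader.getElementText()); // one spelling to match
                        break;
                    case "expansion":
                        String expansion = reader.getElementText();
                        for (String name : names) {
                            expansions.put(name, expansion);
                        }
                        break;
                    default:
                        break;                              // language, abbreviation, ...
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return expansions;
    }
}
```

Applied to the key shown above, the resulting map would contain four entries, all pointing to the same expansion.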

This section can be expanded.

File format for XML grammar definitions

This section remains to be written

Further development

Dependencies

StAX is used for XML processing.

Author

Linus Ericson, TPB

Licensing

LGPL




