se_tpb_xmldetection



Transformer documentation: se_tpb_xmldetection




Transformer Purpose
Input Requirements
Output
	On success
	On error
Configuration/Customization
	Parameters (tdf)
	Extended configurability
Further development
Dependencies
Author
Licensing



Transformer Purpose

This transformer performs abbreviation, initialism and acronym detection, sentence
detection and word detection in XML documents.
Multiple XML grammars can be supported: adding a new grammar requires only a
configuration file. So far, only support for DTBook documents has been added.
Sentence and word detection use Java's built-in BreakIterator, so any language
supported by Java should work with this transformer.
xml:lang markup is used to switch the language.
Abbreviation, initialism and acronym detection is based on word lists in
configuration files. So far, there are configuration files for English, Swedish and
French. The transformer does not fail if it encounters a language it has no
configuration file for; it simply finds no abbreviations, initialisms or acronyms
for that particular language.
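
To illustrate the BreakIterator-based detection described above, here is a minimal sketch. It is not the transformer's actual code: the class and method names are hypothetical, and the language tag argument merely stands in for the value of an xml:lang attribute. Only java.text.BreakIterator and java.util.Locale are real JDK APIs.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of the transformer.
class BreakIteratorSketch {

    // Splits text into sentences using the rules for the given language tag
    // (the kind of value an xml:lang attribute carries, e.g. "en" or "sv").
    static List<String> sentences(String text, String langTag) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.forLanguageTag(langTag));
        return segments(it, text, false);
    }

    // Splits text into words; segments without any letter or digit
    // (spaces, punctuation) are skipped.
    static List<String> words(String text, String langTag) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.forLanguageTag(langTag));
        return segments(it, text, true);
    }

    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        List<String> result = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (wordsOnly) {
                if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    result.add(segment);
                }
            } else if (!segment.trim().isEmpty()) {
                result.add(segment.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The book is open. Read the first page.";
        System.out.println(sentences(text, "en"));
        System.out.println(words(text, "en"));
    }
}
```

Switching the language tag per call mirrors the idea of switching rules whenever xml:lang changes in the document.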

This transformer differentiates between three types of abbreviations.
In an initialism, each letter is pronounced (e.g. HTML). An acronym is
pronounced as a word (e.g. DAISY), whereas an abbreviation is pronounced by
reading out its expansion ("e.g." is pronounced as "for example").

Input Requirements

An XML document whose doctype declaration or root element namespace is supported by the
configuration files.
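
As a rough illustration of the namespace half of this requirement, the sketch below reads the root element's namespace with StAX. It shows how such a check could be done and makes no claim about how the transformer does it. The class and method names are hypothetical, and the namespace in the usage example is the DTBook 2005 namespace.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration; not part of the transformer.
class RootNamespaceSketch {

    // Returns the namespace URI of the root element, or "" if it has none.
    static String rootNamespace(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Only the root element is needed, so DTD processing and external
        // entity resolution are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        return ns == null ? "" : ns;
                    }
                }
                return "";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dtbook xmlns=\"http://www.daisy.org/z3986/2005/dtbook/\"><book/></dtbook>";
        System.out.println(rootNamespace(xml));
    }
}
```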

Output

On success

An XML document with abbreviation, initialism and acronym markup, sentence markup and word markup, depending on which detections are enabled.

On error

On error, this transformer will throw an exception and abort execution.


Configuration/Customization

	Parameters (tdf)
	
	
	input
	Required. Path to the input XML document
	output
	Required. Path of the output XML document
	doAbbrAcronymDetection
	Optional. If set to true (the default), abbreviation, initialism
	and acronym detection will be performed.
	doSentenceDetection
	Optional. If set to true (the default), sentence detection will
	be performed.
	doWordDetection
	Optional. If set to true (the default), word detection will be performed.
	copyReferredFiles
	Optional. If set to true (the default), referred files, such as images
	referenced from a DTBook document, will be copied to the output.
	customLang
	Optional. A file containing custom abbreviations, initialisms and acronyms. These
	abbreviations and acronyms will be available regardless of language.
	doOverride
	Optional. If set to true (the default is false), the abbreviations,
	initialisms and acronyms in the custom language file will override the language-specific ones
	defined in the language-dependent configuration files.
	doSingleSentAdd
	Optional. If set to false (the default is true, which is also the original
	behavior of this transformer), sent elements will not be added where they would become the only
	descendant of the parent element.
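
The interaction between customLang and doOverride can be pictured as merging two lookup tables. The sketch below is an interpretation of the parameter descriptions above, not the transformer's code. The class name, method name and the use of plain string maps are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the transformer.
class OverrideSketch {

    // Merges language-specific and custom expansions.
    // Custom entries are always available; when both tables define the same
    // key, doOverride decides whether the custom entry wins.
    static Map<String, String> merge(Map<String, String> languageSpecific,
                                     Map<String, String> custom,
                                     boolean doOverride) {
        Map<String, String> merged = new LinkedHashMap<>(languageSpecific);
        for (Map.Entry<String, String> entry : custom.entrySet()) {
            if (doOverride) {
                merged.put(entry.getKey(), entry.getValue());
            } else {
                merged.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> english = new LinkedHashMap<>();
        english.put("e.g.", "for example");
        Map<String, String> custom = new LinkedHashMap<>();
        custom.put("e.g.", "exempli gratia");
        System.out.println(merge(english, custom, false));
        System.out.println(merge(english, custom, true));
    }
}
```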
		
	
	
	Extended configurability
	
	File format for abbreviations and acronyms
	The language root element contains three child elements
	(initialism, acronym and abbreviation). Each of these
	elements can have three attributes:
	
	before
A regular expression describing the text before an abbreviation, initialism
	or acronym.
	after
A regular expression describing the text after an abbreviation, initialism
	or acronym.
	suffix
A regular expression describing the allowed suffixes to the acronym (such as
	a plural 's').
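
How the transformer combines these three attributes internally is not documented here. One plausible reading, sketched below purely as an assumption, treats before and after as surrounding context (lookbehind and lookahead) and suffix as an optional tail of the match. The class name, method name and the sample expressions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the transformer.
class RegexContextSketch {

    // Finds occurrences of name that are preceded by text matching before,
    // followed by text matching after, and optionally extended by suffix.
    // Assumption: before must have a bounded length, because Java
    // lookbehind does not accept unbounded expressions.
    static List<String> matches(String text, String name,
                                String before, String after, String suffix) {
        Pattern pattern = Pattern.compile(
                "(?<=" + before + ")"
                + Pattern.quote(name)        // the name is matched literally
                + "(?:" + suffix + ")?"      // optional suffix, e.g. a plural s
                + "(?=" + after + ")");
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "XHTML" is not reported because the text before "HTML" is a letter.
        System.out.println(matches("Valid HTMLs and XHTML files.",
                "HTML", "^|\\s", "\\s|$|[.,]", "s"));
    }
}
```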
	
	Each abbreviation, initialism or acronym is represented by a key element. Each
	key has one or more name elements describing the string(s) to
	be matched. The expansion element contains the expanded version of the
	abbreviation, initialism or acronym.
	
	<key>
		<name>o.s.v.</name>
		<name>o.s.v</name>
		<name>osv.</name>
		<name>o s v</name>
		<expansion>och så vidare</expansion>
	</key>
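
To show how a key entry like the one above could be read with StAX, here is a minimal sketch. It is not the transformer's parser: the class and method names are hypothetical, and it maps every name variant to the expansion of its key while ignoring the before, after and suffix attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration; not part of the transformer.
class KeyParserSketch {

    // Maps every <name> variant to the <expansion> of its enclosing <key>.
    static Map<String, String> parse(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        List<String> names = new ArrayList<>();
        String expansion = null;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String local = reader.getLocalName();
                    if ("key".equals(local)) {
                        names.clear();
                        expansion = null;
                    } else if ("name".equals(local)) {
                        names.add(reader.getElementText());
                    } else if ("expansion".equals(local)) {
                        expansion = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "key".equals(reader.getLocalName())) {
                    // The key is complete: register all of its name variants.
                    for (String name : names) {
                        result.put(name, expansion);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<key><name>o.s.v.</name><name>osv.</name>"
                + "<expansion>och så vidare</expansion></key>";
        System.out.println(parse(xml));
    }
}
```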

	This section can be expanded.
	
	
	File format for XML grammar definitions
	This section remains to be written
	
Further development


Dependencies

StAX is used for XML processing.

Author

Linus Ericson, TPB

Licensing

LGPL