
int_daisy_unicodeNormalizer
Transformer documentation: int_daisy_unicodeNormalizer
Transformer Purpose
Performs Unicode normalization on all XML documents in a fileset, using one
of the four standard normalization forms defined by the Unicode Consortium.
For more information on the reasons for and practice of Unicode normalization, see:
- http://www.w3.org/TR/charmod-norm/
- http://www.unicode.org/reports/tr15/
- http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/Normalizer.html
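To make the difference between the four forms concrete, here is a minimal sketch. It is not the transformer's own code: it assumes the JDK's java.text.Normalizer as a stand-in for the ICU4J class linked above, so that it runs without extra jars. The class and method names (NormalizationFormsDemo, normalize, codePoints) are hypothetical.

```java
import java.text.Normalizer;

// Illustrative only: uses the JDK normalizer, not the ICU4J class the transformer depends on.
class NormalizationFormsDemo {

    // Maps a form name (NFD, NFKD, NFC or NFKC) onto java.text.Normalizer.Form
    // and applies it. Any other name makes valueOf throw IllegalArgumentException.
    static String normalize(String text, String formName) {
        return Normalizer.normalize(text, Normalizer.Form.valueOf(formName));
    }

    // Renders a string as a space-separated list of U+XXXX code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Precomposed e-acute (U+00E9) followed by the "fi" ligature (U+FB01).
        String input = "\u00E9\uFB01";
        for (String form : new String[] {"NFD", "NFKD", "NFC", "NFKC"}) {
            System.out.println(form + ": " + codePoints(normalize(input, form)));
        }
    }
}
```

For this sample input, the canonical forms (NFD, NFC) only decompose or compose the accented letter and leave the ligature alone, while the compatibility forms (NFKD, NFKC) additionally replace the ligature with plain f and i.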
Input Requirements
The transformer is written to work on any file/fileset that can be represented by the org.daisy.util.fileset
package.
Normalization will only be done on XML members of the input fileset; all other types of members pass through untouched.
If no file in the fileset is of type XML, then the whole fileset will pass through untouched. It is therefore safe to place this transformer in contexts whose dataflow varies considerably.
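The pass-through behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the transformer's implementation: the fileset is modelled as an in-memory map from member name to content, XML detection is reduced to a hypothetical file-extension check (the real org.daisy.util.fileset package does its own type detection), and XML members are normalized as whole strings using the JDK's java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy model of "normalize XML members, pass everything else through".
class PassThroughSketch {

    // Hypothetical stand-in for fileset type detection: names ending in .xml count as XML.
    static boolean isXml(String memberName) {
        return memberName.toLowerCase(Locale.ROOT).endsWith(".xml");
    }

    // Returns a new map in which XML members are normalized and all other
    // members are carried over unchanged. Member order is preserved.
    static Map<String, String> process(Map<String, String> members, Normalizer.Form form) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> member : members.entrySet()) {
            String content = member.getValue();
            out.put(member.getKey(),
                    isXml(member.getKey()) ? Normalizer.normalize(content, form) : content);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fileset = new LinkedHashMap<>();
        fileset.put("book.xml", "<p>e\u0301</p>");   // decomposed e-acute
        fileset.put("notes.txt", "e\u0301");          // same characters, non-XML member
        Map<String, String> result = process(fileset, Normalizer.Form.NFC);
        System.out.println("book.xml changed:  " + !result.get("book.xml").equals(fileset.get("book.xml")));
        System.out.println("notes.txt changed: " + !result.get("notes.txt").equals(fileset.get("notes.txt")));
    }
}
```

A fileset with no XML members comes out of process identical to its input, which is the property that makes the transformer safe to place in varying dataflows.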
Output
On success
A file/fileset whose XML members have been normalized using one of the four Unicode normalization forms. See Parameters (tdf).
On error
No specific recovery scheme. On error, this transformer will send a fatal message, then throw an exception and abort.
Configuration/Customization
Parameters (tdf)
- input
- pathspec of the manifest member of the input fileset
- output
- pathspec of the output directory
- textnodesOnly
- If set to true, only element text nodes are normalized (attribute values and other types of value-carrying nodes are left untouched). Default: false.
- normalizationForm
- Selects the normalization form to use. Allowed values: NFD|NFKD|NFC|NFKC. Default: NFC, which is the form recommended in Character Model for the World Wide Web.
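The effect of the textnodesOnly and normalizationForm parameters can be illustrated with a small DOM walk. This is a sketch under assumptions, not the transformer's actual code: it uses the JDK's DOM parser and java.text.Normalizer instead of ICU4J, it treats CDATA sections as text, and it takes "other value-carrying nodes" to mean attributes and comments. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.text.Normalizer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: shows what a textnodesOnly switch could mean on a DOM tree.
class TextNodesOnlySketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Recursively normalizes the tree below node. Text is always normalized;
    // attributes and comments (an assumption about "other value-carrying nodes")
    // are normalized only when textnodesOnly is false.
    static void normalize(Node node, Normalizer.Form form, boolean textnodesOnly) {
        switch (node.getNodeType()) {
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                node.setNodeValue(Normalizer.normalize(node.getNodeValue(), form));
                break;
            case Node.COMMENT_NODE:
                if (!textnodesOnly) {
                    node.setNodeValue(Normalizer.normalize(node.getNodeValue(), form));
                }
                break;
            case Node.ELEMENT_NODE:
                if (!textnodesOnly) {
                    NamedNodeMap attrs = node.getAttributes();
                    for (int i = 0; i < attrs.getLength(); i++) {
                        Node attr = attrs.item(i);
                        attr.setNodeValue(Normalizer.normalize(attr.getNodeValue(), form));
                    }
                }
                break;
            default:
                break;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            normalize(child, form, textnodesOnly);
        }
    }

    // Returns {text content, title attribute} of a sample element after NFC normalization.
    static String[] run(boolean textnodesOnly) {
        Document doc = parse("<p title=\"e\u0301\">e\u0301</p>");
        normalize(doc, Normalizer.Form.NFC, textnodesOnly);
        Element p = doc.getDocumentElement();
        return new String[] { p.getTextContent(), p.getAttribute("title") };
    }

    public static void main(String[] args) {
        for (boolean flag : new boolean[] {true, false}) {
            String[] r = run(flag);
            System.out.println("textnodesOnly=" + flag
                    + " text length=" + r[0].length()
                    + " attribute length=" + r[1].length());
        }
    }
}
```

With textnodesOnly set to true, the element text is composed to a single character while the title attribute keeps its two-character decomposed form; with the default of false, both are composed.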
Extended configurability
None.
Further development
No known refactoring wishes at the time of writing.
Dependencies
- IBM icu4j (at time of writing: icu4j_3_4_4.jar)
Author
Markus Gylling, Daisy Consortium
Licensing
LGPL