
Pipeline Script: Character Repertoire Manipulator
Overview
This script manipulates the character repertoire of the XML documents in a fileset. In practice, this means replacing a given character with one or more other characters.
Character repertoire manipulation is done, for example, when preparing an XML file for a specific output medium.
One example is speech synthesizers, which typically do not recognize and correctly pronounce all
characters in the Unicode repertoire. Another example is preparing an XML document for Braille.
The manipulation process is multilayered. You can use tables that explicitly define replacement strings for a set of characters.
You can also use generic Unicode-based transliteration routines. See Configuration for details.
Configuration
- Input file
- Required. The input XML file. This can be either a fileset manifest (NCC, OPF, etc.) or a standalone XML document. If it is a manifest, all XML files of the fileset are manipulated.
- Output directory
- Required. The directory to store the result in.
- Substitution table(s)
- A list of one or several tables containing substitution strings. See Substitution Table Syntax.
- Exclude
- Optional. The name of a character set defining a set of characters that should be excluded from substitution. Default is none.
- Fallback to non-spacing mark removal
- Optional. Determines whether substitution should fall back to transliteration by non-spacing mark removal (typically removing accents) when no replacement text is found in the user-provided substitution table(s). Default is false.
- Fallback to Latin
- Optional. Determines whether substitution should fall back to transliteration to Latin when no replacement text is found in the user-provided substitution table(s). Default is false.
- Fallback to UCD names
- Optional. Determines whether substitution should fall back to the character names in the UCD table when no replacement text is found in the user-provided substitution table(s). Default is false.
- Output encoding
- Optional. Selects the output character set encoding. If not set, the output is encoded as UTF-8.
- Linebreaks
- Optional. Selects the type of linebreak to use. Values are UNIX, Mac, Dos or System default. The default is the system default.
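The way the exclude set, the substitution table(s) and the fallbacks might interact can be sketched roughly as follows. This is an illustration under stated assumptions, not the transformer's actual code: the class and method names are hypothetical, the order in which the fallbacks are tried is assumed, and the exclude set is assumed to mean "every character that the named character set can encode". The Latin transliteration fallback is left out because the Java standard library has no equivalent routine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are not taken from the Pipeline source.
class FallbackChainSketch {

    /**
     * Returns the replacement for one code point.
     * table:   user-provided substitutions (code point -> replacement string)
     * exclude: characters encodable in this charset are left untouched; null means "none"
     */
    static String replacementFor(int codePoint, Map<Integer, String> table, Charset exclude,
                                 boolean stripMarks, boolean ucdNames) {
        String original = new String(Character.toChars(codePoint));

        // Assumption: excluded characters are never substituted, even if a table lists them.
        if (exclude != null && exclude.newEncoder().canEncode(original)) {
            return original;
        }
        // 1. Explicit substitution table.
        String fromTable = table.get(codePoint);
        if (fromTable != null) {
            return fromTable;
        }
        // 2. Non-spacing mark removal: decompose, then drop marks of category Mn.
        if (stripMarks) {
            String stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{Mn}", "");
            if (!stripped.isEmpty() && !stripped.equals(original)) {
                return stripped;
            }
        }
        // 3. Character name as known to the Java runtime's Unicode data.
        if (ucdNames) {
            String name = Character.getName(codePoint);
            if (name != null) {
                return name;
            }
        }
        // Nothing matched: keep the character as it is.
        return original;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x00AE, "registered sign");

        System.out.println(replacementFor(0x00AE, table, null, true, true)); // registered sign
        System.out.println(replacementFor(0x00E9, table, null, true, true)); // e
        System.out.println(replacementFor(0x00A5, table, null, true, true)); // YEN SIGN
        // With ISO-8859-1 as the exclude set, U+00E9 is returned unchanged.
        System.out.println(replacementFor(0x00E9, table, StandardCharsets.ISO_8859_1, true, true));
    }
}
```

In this sketch a table entry always wins over the generic fallbacks, which matches the description of the fallbacks as applying only when no replacement text is found in the table(s).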
Substitution Table Syntax
The character translation table, which maps characters to
their replacement strings, must comply with the XML format used by
java.util.Properties. See http://java.sun.com/dtd/properties.dtd
and the java.util.Properties documentation
for details.
The key attribute of the entry element must be a hexadecimal value
representing a Unicode code point, and the content of the entry element
a string of arbitrary length.
Example of replacement text table (this also exists as a real
file (example-table.xml) in the transformer directory):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>
This is an example of an input translation table for int_daisy_unicodeTranscoder.
The key attribute contains the hex codepoint to be translated,
and the entry text node the replacement string.
The entries match two hebrew characters and some other stuff.
The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
</comment>
<entry key="05E2">hebrew ayin</entry>
<entry key="05DD">hebrew final mem</entry>
<entry key="00A5">currency yen</entry>
<entry key="00AE">registered sign</entry>
</properties>
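Because the table uses the java.util.Properties XML format, it can be read with the standard Properties.loadFromXML method. The following is a minimal sketch of loading such a table and applying it to a string; the class and method names are hypothetical and do not come from the transformer, and the inline table is a shortened copy of the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Pipeline code base.
class SubstitutionTableSketch {

    /** Reads a table in java.util.Properties XML format into a code point -> replacement map. */
    static Map<Integer, String> load(InputStream in) {
        Properties props = new Properties();
        try {
            props.loadFromXML(in); // parses the <properties>/<entry> format
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Keys are hexadecimal code points such as "05E2".
            table.put(Integer.parseInt(key, 16), props.getProperty(key));
        }
        return table;
    }

    /** Replaces every code point that has a table entry; all others are copied as they are. */
    static String substitute(String text, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String xml =
              "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "<entry key=\"00A5\">currency yen</entry>\n"
            + "<entry key=\"00AE\">registered sign</entry>\n"
            + "</properties>\n";
        Map<Integer, String> table =
            load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Prints: currency yen 100 registered sign
        System.out.println(substitute("\u00A5 100 \u00AE", table));
    }
}
```

Working on code points rather than char values means keys above FFFF (characters outside the Basic Multilingual Plane) are handled the same way as the four-digit keys in the example table.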
Appendix: List of Transformers used
The documents linked below are part of the Transformer technical documentation. They are aimed at developers and system administrators.