
int_daisy_unicodeTranscoder
Transformer documentation: int_daisy_unicodeTranscoder
Transformer Purpose
Performs character set transcoding on all XML documents in a
fileset, roundtripping through a Unicode representation.
Can optionally replace characters in the XML file with
substitution strings. This latter feature is intended primarily for use
when preparing an XML file for a specific output medium: one example is
speech synthesizers (which typically do not recognize and pronounce all
characters in the Unicode repertoire). Another example is the preparation of an XML document for Braille.
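The roundtrip through a Unicode representation can be illustrated with a minimal sketch that uses only the JDK. The class and method names (TranscodeSketch, transcode) are hypothetical and are not taken from the transformer's source. The transformer itself works on XML documents, where the encoding declaration also has to match the output encoding; this sketch only shows the byte-level principle.

```java
import java.nio.charset.Charset;

final class TranscodeSketch {

    /**
     * Decodes the input bytes from the source charset into a Java String
     * (the Unicode intermediate form), then encodes that String in the
     * target charset.
     */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from);
        // Characters the target charset cannot encode are replaced by the
        // charset's default replacement (typically '?').
        return unicode.getBytes(to);
    }
}
```

Calling transcode with ISO-8859-1 as source and UTF-8 as target turns a one-byte "é" into its two-byte UTF-8 form. The silent replacement of unencodable characters is one reason character substitution may be wanted before transcoding to a narrow character set.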
Input Requirements
The transformer is written to work on any file/fileset that can
be represented by the org.daisy.util.fileset
package.
Character set transcoding will only be done on XML members of the
input fileset; all other types of members pass through untouched.
If no file in the fileset is of type XML, then the whole fileset
will pass through untouched. It is therefore safe to place this
transformer in contexts whose dataflow varies considerably.
Output
On success
A file/fileset whose XML members have been transcoded and,
optionally, have had certain characters substituted by replacement
strings. See Parameters (tdf).
On error
No specific recovery scheme. On error, this transformer will send
a fatal message, then throw an exception and abort.
Configuration/Customization
Parameters (tdf)
- input
- The input XML file (standalone or manifest)
- output
- The output directory
- outputEncoding
- Character set encoding of the output file(s). If not set, the character set encoding of the input is maintained.
- performCharacterSubstitution
- Enables/disables the optional character substitution process.
- substitutionTables
- An optional list of tables containing substitution strings. Each provided table must comply with the Java Properties XML format. This parameter only has effect if the parameter performCharacterSubstitution is set to true.
- excludeFromSubstitution
- A character set name defining a set of characters that should be excluded from substitution. This parameter only has effect if the parameter performCharacterSubstitution is set to true.
- fallbackToLatinTransliteration
- Determines whether a character substitution attempt should fall back to an attempted transliteration to Latin if no replacement text was found in the user-provided substitution table(s). This parameter only has effect if the parameter performCharacterSubstitution is set to true.
- fallbackToNonSpacingMarkRemovalTransliteration
- Determines whether a character substitution attempt should fall back to an attempted transliteration by nonspacing mark removal (typically disaccentuation) if no replacement text was found in the user-provided substitution table(s). This parameter only has effect if the parameter performCharacterSubstitution is set to true.
- fallbackToUCD
- Determines whether a character substitution attempt should fall back to names in the UCD table if no replacement text was found in the user-provided substitution table(s). This parameter only has effect if the parameter performCharacterSubstitution is set to true.
- substituteInAttributeValues
- Determines whether character substitution processing should be applied to attribute values. If this is set to false, substitution processing will only be applied to element text nodes. This parameter only has effect if the parameter performCharacterSubstitution is set to true.
Extended configurability
Replacement processing
The substitution is attempted in a series of steps, in order of preference;
each step is a fallback for its predecessor.
- Locate a replacement string in one or several user provided tables;
- Optional fallback: attempt to create a replacement using transliteration by nonspacing mark removal;
- Optional fallback: attempt to create a replacement using transliteration to Latin characters;
- Optional fallback: retrieve a replacement string based on UCD names
All fallbacks are disabled by default.
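The fallback order above can be sketched as follows. This is an illustration under stated assumptions, not the transformer's implementation. The class SubstitutionSketch and its method are hypothetical, and the sketch uses JDK classes (java.text.Normalizer, Character.getName) as stand-ins where the transformer depends on icu4j. The Latin transliteration step is left out because the JDK has no equivalent.

```java
import java.text.Normalizer;
import java.util.Map;

final class SubstitutionSketch {

    /**
     * Returns a replacement string for the code point, or null if no
     * enabled step produced one. The boolean flags loosely mirror the
     * fallback parameters; both default to false in the transformer.
     */
    static String substitute(int codePoint, Map<Integer, String> table,
                             boolean fallbackToNonSpacingMarkRemoval,
                             boolean fallbackToUCD) {
        // Step 1: look up the user-provided table.
        String fromTable = table.get(codePoint);
        if (fromTable != null) {
            return fromTable;
        }

        String original = new String(Character.toChars(codePoint));

        // Step 2 (optional): decompose, then drop nonspacing marks (category Mn).
        if (fallbackToNonSpacingMarkRemoval) {
            String stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{Mn}+", "");
            if (!stripped.isEmpty() && !stripped.equals(original)) {
                return stripped;
            }
        }

        // Step 3 (optional): transliteration to Latin is omitted here;
        // it would need a transliteration library such as icu4j.

        // Step 4 (optional): fall back to the Unicode character name.
        if (fallbackToUCD) {
            return Character.getName(codePoint); // null if unassigned
        }
        return null;
    }
}
```

With an empty table and mark removal enabled, U+00E9 (é) yields "e". With every fallback disabled and no table entry, the method returns null, which corresponds to leaving the character unsubstituted.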
By setting an "exclusion repertoire", a set of characters is defined that is considered "allowed": replacement
will not be attempted on a character that is a member of an excluded repertoire.
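One way to picture an exclusion repertoire is as the set of characters that a named character set can encode. The sketch below assumes that reading and tests membership with a JDK CharsetEncoder. The class ExclusionSketch is hypothetical, and the transformer may determine membership differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class ExclusionSketch {

    private final CharsetEncoder encoder;

    /** charsetName is any name accepted by Charset.forName, e.g. "ISO-8859-1". */
    ExclusionSketch(String charsetName) {
        this.encoder = Charset.forName(charsetName).newEncoder();
    }

    /** True if the code point is in the excluded ("allowed") repertoire. */
    boolean isExcluded(int codePoint) {
        return encoder.canEncode(new String(Character.toChars(codePoint)));
    }
}
```

With "ISO-8859-1" as the repertoire, U+00E9 (é) is excluded from substitution while U+05E2 (Hebrew ayin) remains a candidate for it.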
The use of this class may result in a change in Unicode character composition between input and output.
If you need a certain normalization form, normalize after the use of this class.
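A minimal sketch of such a post-processing step, assuming NFC is the desired form and using the JDK's java.text.Normalizer. The class NormalizeSketch is hypothetical and not part of the transformer.

```java
import java.text.Normalizer;

final class NormalizeSketch {

    /** Returns the text in Normalization Form C, composing where possible. */
    static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // already in NFC, nothing to do
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

For example, "e" followed by U+0301 (combining acute accent) is composed into the single code point U+00E9.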
Replacement string table syntax
The character translation table with a mapping between characters
and their replacement strings must comply with the XML format used in
java.util.Properties. See http://java.sun.com/dtd/properties.dtd
and java.util.Properties
for details.
The key attribute of the entry element must be a hex value
representing a Unicode code point, and the entry element value an
arbitrary-length string of characters.
Example of replacement text table (this also exists as a real
file (example-table.xml) in the transformer directory):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>
This is an example of an input translation table for int_daisy_unicodeTranscoder.
The key attribute contains the hex codepoint to be translated,
and the entry text node the replacement string.
The entries match two hebrew characters and some other stuff.
The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
</comment>
<entry key="05E2">hebrew ayin</entry>
<entry key="05DD">hebrew final mem</entry>
<entry key="00A5">currency yen</entry>
<entry key="00AE">registered sign</entry>
</properties>
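A table in this format can be read with java.util.Properties.loadFromXML, after which the hex keys are parsed into code points. The sketch below shows one way to do that. The class TableLoaderSketch is hypothetical and is not the transformer's own loader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class TableLoaderSketch {

    /**
     * Reads a Java Properties XML document and returns a map from
     * Unicode code point to replacement string.
     */
    static Map<Integer, String> load(InputStream xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(xml); // the stream is closed by loadFromXML

        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Keys are hex code points such as "05E2".
            // A malformed key raises NumberFormatException.
            int codePoint = Integer.parseInt(key.trim(), 16);
            table.put(codePoint, props.getProperty(key));
        }
        return table;
    }
}
```

Loading the example table above would give a map in which 0x05E2 maps to "hebrew ayin" and 0x00AE maps to "registered sign".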
Further development
Note: after a priori code review, the sjsxp StAX implementation
seems safer to use than Woodstox when it comes to transcoding. This
should be tested.
Dependencies
- IBM icu4j (at time of writing: icu4j_3_4_4.jar)
Author
Markus Gylling, Daisy Consortium
Licensing
LGPL