




int_daisy_unicodeTranscoder



Transformer documentation: int_daisy_unicodeTranscoder

Transformer Purpose

Performs character set transcoding on all XML documents in a fileset, roundtripping through a Unicode representation.

Optionally, the transformer can replace characters in the XML file with substitution strings. This feature is intended primarily for preparing an XML file for a specific output medium. One example is speech synthesizers, which typically do not recognize and pronounce all characters in the Unicode repertoire. Another is preparing an XML document for Braille.
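
The round trip through Unicode mentioned above can be pictured with a minimal Java sketch. It is not the transformer's actual code: the class and method names are hypothetical, and it works on raw bytes only. A real XML transcoder would also have to rewrite the encoding declaration in the XML prolog, which this sketch does not do.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class TranscodeSketch {
    /**
     * Decodes the input bytes from the source charset into a Java String
     * (the intermediate Unicode representation), then encodes that String
     * in the target charset.
     */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // bytes -> Unicode
        return unicode.getBytes(to);              // Unicode -> bytes
    }
}
```

Note that String.getBytes(Charset) silently replaces characters the target charset cannot represent, which is one reason a substitution step before encoding can be useful.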

Input Requirements

The transformer is written to work on any file/fileset that can be represented by the org.daisy.util.fileset package.

Character set transcoding will only be done on XML members of the input fileset; all other types of members pass through untouched.

If no file in the fileset is of type XML, then the whole fileset will pass through untouched. It is therefore safe to place this transformer in contexts whose dataflow varies considerably.

Output

On success

A file/fileset whose XML members have been transcoded and, optionally, have had certain characters substituted by replacement strings. See the parameters below.

On error

No specific recovery scheme. On error, this transformer will send a fatal message, then throw an exception and abort.

Configuration/Customization

Parameters (tdf)

input
The input XML file (standalone or manifest)
output
The output directory
outputEncoding
Character set encoding of the output file(s). If not set, the input character set will be maintained.
performCharacterSubstitution
Enables/disables the optional character substitution process.
substitutionTables
An optional list of tables containing substitution strings. Each table must comply with the Java Properties XML format. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
excludeFromSubstitution
A character set name defining a set of characters that should be excluded from substitution. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
fallbackToLatinTransliteration
Determines whether a character substitution attempt should fall back to a transliteration to Latin if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
fallbackToNonSpacingMarkRemovalTransliteration
Determines whether a character substitution attempt should fall back to a transliteration by nonspacing mark removal (typically disaccentuation) if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
fallbackToUCD
Determines whether a character substitution attempt should fall back to names in the UCD table if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
substituteInAttributeValues
Determines whether character substitution processing should be applied to attribute values. If this is set to false, substitution processing will only be applied to element text nodes. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
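
The difference between substituting in text nodes only and also in attribute values can be illustrated with a small StAX pass. This is a simplified sketch under several assumptions, not the transformer's implementation: the class name, the rewrite method and the substitute callback are hypothetical, namespaces are ignored, and comments and processing instructions are dropped.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.UnaryOperator;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, for illustration only.
final class AttributeSubstitutionSketch {
    /** Copies the XML, applying 'substitute' to text nodes and, if requested, to attribute values. */
    static String rewrite(String xml, UnaryOperator<String> substitute, boolean inAttributes) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        w.writeStartElement(in.getLocalName());
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            String value = in.getAttributeValue(i);
                            // Attribute values are only touched when the flag is set.
                            w.writeAttribute(in.getAttributeLocalName(i),
                                    inAttributes ? substitute.apply(value) : value);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Element text nodes are always processed.
                        w.writeCharacters(substitute.apply(in.getText()));
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        w.writeEndElement();
                        break;
                    default:
                        break; // other event types are not copied in this sketch
                }
            }
            w.flush();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With inAttributes set to false, an input such as a p element carrying a yen sign both in its title attribute and in its text would have only the text replaced.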

Extended configurability

Replacement processing

The substitution is attempted in several steps, in order of preference; each step is a fallback for the one before it.

  1. Locate a replacement string in one or several user-provided tables;
  2. Optional fallback: attempt to create a replacement using transliteration by nonspacing mark removal;
  3. Optional fallback: attempt to create a replacement using transliteration to Latin characters;
  4. Optional fallback: retrieve a replacement string based on UCD names.

All fallbacks are disabled by default.
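
The fallback order above can be sketched in Java using only the standard library. This is an approximation under stated assumptions, not the transformer's code: the class and method names are hypothetical, nonspacing mark removal is imitated with java.text.Normalizer, UCD names come from Character.getName, and step 3 (transliteration to Latin) is left out because it would need a transliteration library such as ICU4J.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only.
final class SubstitutionChainSketch {
    static Optional<String> substitute(int codePoint, Map<Integer, String> table,
                                       boolean markRemoval, boolean ucdNames) {
        // Step 1: user-provided table.
        String hit = table.get(codePoint);
        if (hit != null) {
            return Optional.of(hit);
        }
        // Step 2 (optional): decompose, then strip nonspacing marks (category Mn).
        if (markRemoval) {
            String original = new String(Character.toChars(codePoint));
            String stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{Mn}", "");
            if (!stripped.isEmpty() && !stripped.equals(original)) {
                return Optional.of(stripped);
            }
        }
        // Step 3 (optional) would be transliteration to Latin; omitted in this sketch.
        // Step 4 (optional): Unicode character name as known to the running JDK.
        if (ucdNames) {
            String name = Character.getName(codePoint);
            if (name != null) {
                return Optional.of(name);
            }
        }
        return Optional.empty(); // no replacement found; the character is left as is
    }
}
```

Both optional steps are controlled by flags that a caller would set to false to match the default configuration.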

Setting an "exclusion repertoire" defines a set of characters that are considered "allowed": replacement will not be attempted on a character that is a member of an excluded repertoire.
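
One way to picture membership in an exclusion repertoire named by a character set is to ask whether that character set can encode the character. The sketch below assumes this interpretation; the class and method names are hypothetical, and the transformer may determine membership differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
final class ExclusionSketch {
    /** Returns true if the named charset can encode the code point, i.e. the character is "allowed". */
    static boolean isExcluded(int codePoint, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        // Going through a String also handles supplementary code points (surrogate pairs).
        return encoder.canEncode(new String(Character.toChars(codePoint)));
    }
}
```

Under this assumption, with ISO-8859-1 as the exclusion repertoire, U+00E9 would be left alone while U+05E2 would remain a candidate for substitution.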

The use of this class may result in a change in Unicode character composition between input and output. If you need a certain normalization form, normalize after using this class.
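
A follow-up normalization step could look like the sketch below, which uses the standard java.text.Normalizer. The class and method names are hypothetical, and NFC is chosen only as an example; pick whichever form the downstream consumer requires.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class NormalizeSketch {
    /** Recomposes the text into Unicode Normalization Form C. */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```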

Replacement string table syntax

The character translation table, which maps characters to their replacement strings, must comply with the XML format used by java.util.Properties. See http://java.sun.com/dtd/properties.dtd and java.util.Properties for details.

The key attribute of the entry element must be a hex value representing a Unicode code point, and the entry element value a string of characters of arbitrary length.

Example of a replacement text table. This also exists as a real file, example-table.xml, in the transformer directory:


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>		
	<comment>
	  This is an example of an input translation table for int_daisy_unicodeTranscoder.
	  The key attribute contains the hex codepoint to be translated,
	  and the entry text node the replacement string.
	  The entries match two hebrew characters and some other stuff.
	  The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
	</comment>	
	<entry key="05E2">hebrew ayin</entry>	
	<entry key="05DD">hebrew final mem</entry>		
	<entry key="00A5">currency yen</entry>
	<entry key="00AE">registered sign</entry>
</properties>
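
A table in this format can be read with the standard java.util.Properties.loadFromXML method. The sketch below is one possible way to turn it into a code point lookup; the class and method names are hypothetical, and it is not necessarily how the transformer itself loads its tables.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, for illustration only.
final class TableLoaderSketch {
    /** Parses a Properties XML document whose keys are hex code points. */
    static Map<Integer, String> load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // "05E2" -> 0x05E2; a malformed key raises NumberFormatException.
            table.put(Integer.parseInt(key, 16), props.getProperty(key));
        }
        return table;
    }
}
```

For the example table above, looking up 0x05E2 in the resulting map would yield "hebrew ayin".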
	

Further development

Note: based on an a priori code review, the sjsxp StAX implementation seems safer to use than Woodstox for transcoding. This remains to be tested.

Dependencies

  • IBM icu4j (at time of writing: icu4j_3_4_4.jar)

Author

Markus Gylling, Daisy Consortium

Licensing

LGPL




