




int_daisy_unicodeTranscoder



Transformer documentation: int_daisy_unicodeTranscoder




	Transformer Purpose
	Input Requirements
	Output
	
		On success
		On error
	
	
	Configuration/Customization
	
		Parameters (tdf)
		Extended configurability
	
	
	Further development
	Dependencies
	Author
	Licensing



Transformer Purpose

Performs character set transcoding on all XML documents in a
fileset, roundtripping through a Unicode representation.
Can optionally replace characters in the XML file with
substitution strings. This latter feature is intended primarily for use
when preparing an XML file for a specific output medium. One example is
speech synthesizers, which typically do not recognize and pronounce all
characters in the Unicode repertoire. Another example is an XML document that is being prepared for Braille.
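
As a rough illustration of what roundtripping through a Unicode representation means, the sketch below decodes bytes from one character set into a Java string and re-encodes that string in another character set. The class and method names are hypothetical and are not part of the transformer. A real XML transcoder would also have to rewrite the encoding declaration in the XML prolog, which this sketch ignores.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the transformer's API.
final class TranscodeSketch {

    // Decode the bytes with the source charset, then re-encode with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // intermediate Unicode representation
        return unicode.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

The byte counts differ because the accented character takes one byte in ISO-8859-1 and two bytes in UTF-8.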

Input Requirements

The transformer is written to work on any file/fileset that can
be represented by the org.daisy.util.fileset package.
Character set transcoding will only be done on XML members of the
input fileset; all other types of members pass through untouched.
If no file in the fileset is of type XML, then the whole fileset
will pass through untouched. It is therefore safe to place this
transformer in contexts whose dataflow varies considerably.

Output

On success

A file/fileset whose XML members have been transcoded and,
optionally, have had certain characters substituted by replacement
strings. See Parameters (tdf).

On error

No specific recovery scheme. On error, this transformer will send
a fatal message, then throw an exception and abort.

Configuration/Customization

Parameters (tdf)


	input
	The input XML file (standalone or manifest)

	output	
	The output directory
	
	outputEncoding
	Character set encoding of the output file(s). If not set, the character set of the input is maintained.
	
	performCharacterSubstitution
	Enables/disables the optional character substitution process.
	
	substitutionTables
	An optional list of tables containing substitution strings. Each table must comply with the Java Properties XML format. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
	
	excludeFromSubstitution
	A character set name defining a set of characters that should be excluded from substitution. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
	
	fallbackToLatinTransliteration
	Determines whether a character substitution attempt should fall back to transliteration to Latin if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
	
	fallbackToNonSpacingMarkRemovalTransliteration
	Determines whether a character substitution attempt should fall back to transliteration by nonspacing mark removal (typically the removal of accents) if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
	
	fallbackToUCD
	Determines whether a character substitution attempt should fall back to names in the UCD (Unicode Character Database) table if no replacement text was found in the user-provided substitution table(s). This parameter only has an effect if the parameter performCharacterSubstitution is set to true.
	
	substituteInAttributeValues
	Determines whether character substitution processing should be applied to attribute values. If this is set to false, substitution processing will only be applied to element text nodes. This parameter only has an effect if the parameter performCharacterSubstitution is set to true.


Extended configurability


Replacement processing
  The substitution is made through a series of attempts in order of preference;
  each attempt is a fallback for its predecessor.
  
    Locate a replacement string in one or several user provided tables;
    Optional fallback: attempt to create a replacement using transliteration by nonspacing mark removal;
    Optional fallback: attempt to create a replacement using transliteration to Latin characters;
    Optional fallback: retrieve a replacement string based on UCD names
  
  
  All fallbacks are disabled by default.
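
The order of attempts described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the transformer's actual code: the class name, constructor, and flags are hypothetical, java.text.Normalizer and Character.getName stand in for whatever the transformer does internally, and the Latin transliteration step is left out because the JDK has no transliterator (the transformer depends on icu4j).

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fallback order; not the transformer's actual code.
final class SubstitutionChainSketch {
    private final Map<Integer, String> table; // code point -> replacement string
    private final boolean markRemovalFallback;
    private final boolean ucdNameFallback;

    SubstitutionChainSketch(Map<Integer, String> table,
                            boolean markRemovalFallback,
                            boolean ucdNameFallback) {
        this.table = table;
        this.markRemovalFallback = markRemovalFallback;
        this.ucdNameFallback = ucdNameFallback;
    }

    // Returns a replacement string, or null if every enabled attempt fails.
    String replacementFor(int codePoint) {
        // 1. User-provided table.
        String fromTable = table.get(codePoint);
        if (fromTable != null) {
            return fromTable;
        }
        // 2. Optional: decompose, then drop nonspacing marks (general category Mn).
        if (markRemovalFallback) {
            String original = new String(Character.toChars(codePoint));
            String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
            String stripped = decomposed.replaceAll("\\p{Mn}", "");
            // Accept the result only if a mark was actually removed.
            if (!stripped.isEmpty() && !stripped.equals(decomposed)) {
                return stripped;
            }
        }
        // 3. Optional transliteration to Latin would go here (omitted, see lead-in).
        // 4. Optional: character name from the Unicode Character Database.
        if (ucdNameFallback) {
            return Character.getName(codePoint); // null for unassigned code points
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x00A5, "currency yen");
        SubstitutionChainSketch chain = new SubstitutionChainSketch(table, true, true);
        System.out.println(chain.replacementFor(0x00A5)); // found in the table
        System.out.println(chain.replacementFor(0x00E9)); // mark removal applies
        System.out.println(chain.replacementFor(0x05E2)); // falls through to the UCD name
    }
}
```

With both flags set to false, the sketch mirrors the default behaviour: only the user-provided table is consulted.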
  
  By setting an "exclusion repertoire", a set of characters is defined that is considered "allowed": replacement
  will not be attempted on a character that is a member of an excluded repertoire.
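
One plausible way to test membership of an exclusion repertoire is to ask whether the named character set can encode the character. The sketch below assumes exactly that interpretation, which the text above does not spell out; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch: a character is "excluded" (left alone) when the
// named charset can encode it. This interpretation is an assumption.
final class ExclusionSketch {
    private final CharsetEncoder encoder;

    ExclusionSketch(String charsetName) {
        this.encoder = Charset.forName(charsetName).newEncoder();
    }

    boolean isExcluded(int codePoint) {
        return encoder.canEncode(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        ExclusionSketch latin1 = new ExclusionSketch("ISO-8859-1");
        System.out.println(latin1.isExcluded(0x00E9)); // Latin small letter e with acute
        System.out.println(latin1.isExcluded(0x05E2)); // Hebrew letter ayin
    }
}
```

Under this assumption, a substitution pass would call isExcluded first and skip the table lookup and all fallbacks for any character it accepts.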

  The use of this class may change the Unicode character composition between input and output.
  If you need a certain normalization form, normalize after the use of this class.
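
A post-processing step of that kind could look like the sketch below, which uses the JDK's java.text.Normalizer to bring text into NFC. The class and method names are hypothetical, and NFC is only an example; substitute whichever form is required.

```java
import java.text.Normalizer;

// Hypothetical post-processing helper; not part of the transformer.
final class NormalizeAfterSketch {

    // Returns the text in Normalization Form C (canonical composition).
    static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // already composed, nothing to do
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // e followed by a combining acute accent
        String composed = toNfc(decomposed);
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```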
  
Replacement string table syntax
The character translation table, which maps characters to their
replacement strings, must comply with the XML format used by
java.util.Properties. See http://java.sun.com/dtd/properties.dtd
and the java.util.Properties documentation
for details.

The key attribute of the entry element must be a hex value
representing a Unicode code point, and the entry element value a
string of characters of arbitrary length.

Example of a replacement text table (this also exists as a real
file, example-table.xml, in the transformer directory):


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>		
	<comment>
	  This is an example of an input translation table for int_daisy_unicodeTranscoder.
	  The key attribute contains the hex codepoint to be translated,
	  and the entry text node the replacement string.
	  The entries match two hebrew characters and some other stuff.
	  The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
	</comment>	
	<entry key="05E2">hebrew ayin</entry>	
	<entry key="05DD">hebrew final mem</entry>		
	<entry key="00A5">currency yen</entry>
	<entry key="00AE">registered sign</entry>
</properties>
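
A table in this format can be read with java.util.Properties.loadFromXML and its keys parsed as hexadecimal code points. The sketch below shows one way to do that; the class and method names are hypothetical, and the transformer's own loading code may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical loader for illustration; not the transformer's actual code.
final class TableLoaderSketch {

    // Reads a Java Properties XML document and maps code point -> replacement string.
    static Map<Integer, String> load(InputStream xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            int codePoint = Integer.parseInt(key.trim(), 16); // key is a hex code point
            table.put(codePoint, props.getProperty(key));
        }
        return table;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "<entry key=\"05E2\">hebrew ayin</entry>\n"
                + "<entry key=\"00A5\">currency yen</entry>\n"
                + "</properties>\n";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Map<Integer, String> table = load(in);
        System.out.println(table.get(0x05E2));
    }
}
```

A key that is not valid hexadecimal makes Integer.parseInt throw a NumberFormatException, so a malformed table fails early instead of being silently ignored.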
	


Further development


Note: based on code review alone, the SJSXP StAX implementation
seems safer to use than Woodstox when it comes to transcoding. This
should be verified by testing.

Dependencies


	IBM
	icu4j (at time of writing: icu4j_3_4_4.jar)


Author

Markus Gylling, Daisy Consortium

Licensing

LGPL