



	
	Pipeline Script: Character Repertoire Manipulator
	




	
		Overview
		Configuration
		Appendix: List of Transformers used
	


Overview
This script lets you manipulate the character repertoire of the XML documents in a fileset. In practice, this means replacing a given character with one or more other characters.
Character repertoire manipulation is done, for example, when preparing an XML file for a specific output medium.
One example is speech synthesizers, which typically do not recognize and correctly pronounce all
characters in the Unicode repertoire. Another example is when an XML document is being prepared for Braille.

The manipulation process is multilayered. You can use tables that explicitly define replacement strings for a set of characters.
You can also use generic Unicode-based transliteration routines. See Configuration for details.

Configuration
	

		Input file
		Required. The input XML file. This can either be a fileset manifest (NCC, OPF, etc) or a standalone XML document. If it is a manifest, all XML files of the fileset will be manipulated.
	
		Output directory
		Required. The directory to store the result in.
	
		Substitution table(s)
		A list of one or several tables containing substitution strings. See Substitution Table Syntax.
	
		Exclude
		Optional. The name of a character set defining a set of characters that should be excluded from substitution. Default is none.
	
		Fallback to non-spacing mark removal
		Optional. Determines whether a character substitution attempt should fall back to nonspacing mark removal (typically the removal of accents) if no replacement text is found in the user-provided substitution table(s). Default is false.
	
		Fallback to Latin
		Optional. Determines whether a character substitution attempt should fall back to transliteration to Latin if no replacement text is found in the user-provided substitution table(s). Default is false.
	
		Fallback to UCD names
		Optional. Determines whether a character substitution attempt should fall back to the character names in the UCD table if no replacement text is found in the user-provided substitution table(s). Default is false.
	
		Output encoding
		Optional. Select the character set encoding of the output. If not set, the output encoding defaults to utf-8.
			
		Linebreaks
		Optional. Select the type of linebreak to use; values are UNIX, Mac, Dos or System default. The default is the system default.
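
The layered lookup described by the Exclude and Fallback options can be illustrated with a minimal sketch. It is not the transformer's actual implementation: the class name FallbackChainSketch, the method substitute, and the order in which the fallbacks are tried are all assumptions made for illustration. It also assumes that Exclude means "leave untouched any character the named character set can encode". Transliteration to Latin is left out because it needs a library outside the Java standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and fallback order are assumptions, not the transformer's API.
class FallbackChainSketch {

    static String substitute(String input, Map<Integer, String> table, Charset exclude,
                             boolean stripMarks, boolean ucdNames) {
        CharsetEncoder excluded = (exclude == null) ? null : exclude.newEncoder();
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            // Assumed Exclude semantics: characters the charset can encode are kept as-is.
            if (excluded != null && excluded.canEncode(original)) {
                out.append(original);
                return;
            }
            // Layer 1: the user-provided substitution table.
            String replacement = table.get(cp);
            // Layer 2: decompose, then drop nonspacing marks (category Mn).
            if (replacement == null && stripMarks) {
                String stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                        .replaceAll("\\p{Mn}", "");
                if (!stripped.isEmpty() && !stripped.equals(original)) {
                    replacement = stripped;
                }
            }
            // Layer 3: the Unicode character name as known to the Java runtime.
            if (replacement == null && ucdNames) {
                replacement = Character.getName(cp);
            }
            // No layer matched: keep the character unchanged.
            out.append(replacement != null ? replacement : original);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x00A5, "currency yen");
        // Input: a, YEN SIGN, e with acute, REGISTERED SIGN
        System.out.println(substitute("a\u00A5\u00E9\u00AE", table,
                StandardCharsets.US_ASCII, true, true));
    }
}
```

In this sketch, a character set passed as the exclude argument keeps plain ASCII letters from being replaced by their Unicode names when the name fallback is switched on.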

	
		
	Substitution Table Syntax
The character translation table, which maps characters to
their replacement strings, must comply with the XML format used by
java.util.Properties. See http://java.sun.com/dtd/properties.dtd
and java.util.Properties
for details.

The key attribute of the entry element must be a hex value
representing a Unicode code point, and the entry element value an
arbitrary-length string of characters.

Example of a replacement text table (this also exists as a real
file, example-table.xml, in the transformer directory):


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>		
	<comment>
	  This is an example of an input translation table for int_daisy_unicodeTranscoder.
	  The key attribute contains the hex codepoint to be translated,
	  and the entry text node the replacement string.
	  The entries match two hebrew characters and some other stuff.
	  The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
	</comment>	
	<entry key="05E2">hebrew ayin</entry>	
	<entry key="05DD">hebrew final mem</entry>		
	<entry key="00A5">currency yen</entry>
	<entry key="00AE">registered sign</entry>
</properties>
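
Because the table uses the java.util.Properties XML format, it can be read with the standard Properties.loadFromXML method. The following is a minimal sketch of how such a table might be turned into a map from code points to replacement strings. The class name SubstitutionTableSketch and its methods are hypothetical and do not reflect the transformer's own code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch; not the transformer's own loader.
class SubstitutionTableSketch {

    /** Reads a Properties XML table and maps each code point to its replacement string. */
    static Map<Integer, String> load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(in); // the stream is closed when this method returns
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Each key is a hex code point such as "05E2".
            int codePoint = Integer.parseInt(key.trim(), 16);
            table.put(codePoint, props.getProperty(key));
        }
        return table;
    }

    /** Convenience wrapper for a table held in a string. */
    static Map<Integer, String> parse(String xml) {
        try {
            return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "<entry key=\"05E2\">hebrew ayin</entry>\n"
                + "<entry key=\"00A5\">currency yen</entry>\n"
                + "</properties>\n";
        Map<Integer, String> table = parse(xml);
        System.out.println(table.get(0x00A5)); // prints "currency yen"
    }
}
```

A key that is not valid hex makes Integer.parseInt throw a NumberFormatException, so a malformed table fails early instead of being silently ignored.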
	

		
Appendix: List of Transformers used
The documents linked below are parts of the Transformer technical documentation. They are aimed at developers and system administrators.


	Unicode Transcoder
	Charset Switcher