



	
	


Pipeline Script: Character Repertoire Manipulator

Overview

This script lets you manipulate the character repertoire of the XML documents in a fileset. In practice, this means replacing a given character with one or more other characters.

Character repertoire manipulation is useful, for example, when preparing an XML file for a specific output medium. One example is speech synthesizers, which typically do not recognize or correctly pronounce every character in the Unicode repertoire. Another example is preparing an XML document for Braille.

The manipulation process is multilayered. You can use tables that explicitly define replacement strings for a set of characters. You can also use generic Unicode-based transliteration routines. See Configuration for details.

Configuration

Input file
Required. The input XML file. This can be either a fileset manifest (NCC, OPF, etc.) or a standalone XML document. If it is a manifest, all XML files in the fileset will be manipulated.
Output directory
Required. The directory in which to store the result.
Substitution table(s)
A list of one or more tables containing substitution strings. See Substitution Table Syntax.
Exclude
Optional. The name of a character set defining a set of characters that should be excluded from substitution. Default is none.
Fallback to non-spacing mark removal
Optional. Determines whether a substitution attempt should fall back to nonspacing mark removal (typically stripping accents) when no replacement text is found in the user-provided substitution table(s). Default is false.
Fallback to Latin
Optional. Determines whether a substitution attempt should fall back to transliteration to Latin when no replacement text is found in the user-provided substitution table(s). Default is false.
Fallback to UCD names
Optional. Determines whether a substitution attempt should fall back to the character names in the Unicode Character Database (UCD) when no replacement text is found in the user-provided substitution table(s). Default is false.
Output encoding
Optional. The character encoding of the output. If not set, the output is encoded as UTF-8.
Linebreaks
Optional. The type of line break to use; values are UNIX, Mac, Dos or System default. The default is the system default.
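
The options above can be read as a lookup chain for each character. The following Java sketch illustrates one possible reading and is not the transformer's actual implementation. The class and method names are hypothetical, the order in which the fallbacks are tried is an assumption, and the Exclude option is interpreted here as "leave alone any character that the named character set can encode". The Latin transliteration fallback is omitted because the Java standard library has no equivalent; it would need a transliteration library.

```java
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Map;

// Hypothetical sketch of the per-character fallback chain described above.
final class FallbackChainSketch {

    // Decomposes the character (NFD) and drops nonspacing marks (category Mn).
    static String stripNonSpacingMarks(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    // Returns the replacement for one code point, or the character itself
    // when nothing applies. `exclude` may be null (the documented default: none).
    static String replace(int codePoint,
                          Map<Integer, String> table,
                          CharsetEncoder exclude,
                          boolean fallbackToMarkRemoval,
                          boolean fallbackToUcdNames) {
        String original = new String(Character.toChars(codePoint));

        // Assumption: excluded characters are never substituted.
        if (exclude != null && exclude.canEncode(original)) {
            return original;
        }
        // 1. User-provided substitution table(s) take precedence.
        String fromTable = table.get(codePoint);
        if (fromTable != null) {
            return fromTable;
        }
        // 2. Assumed first fallback: nonspacing mark removal.
        if (fallbackToMarkRemoval) {
            String stripped = stripNonSpacingMarks(codePoint);
            if (!stripped.isEmpty() && !stripped.equals(original)) {
                return stripped;
            }
        }
        // 3. Assumed last fallback: the character name known to the JVM.
        if (fallbackToUcdNames) {
            String name = Character.getName(codePoint);
            if (name != null) {
                return name;
            }
        }
        return original;
    }
}
```

With an empty table and only mark removal enabled, U+00E9 (e with acute) comes back as a plain "e". With only the UCD names fallback enabled, U+00AE comes back as the name the JVM reports for it. Passing an encoder such as `StandardCharsets.US_ASCII.newEncoder()` as `exclude` keeps plain ASCII text untouched, which matters once the UCD names fallback is switched on.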

Substitution Table Syntax

The character translation table, which maps characters to their replacement strings, must comply with the XML format used by java.util.Properties. See http://java.sun.com/dtd/properties.dtd and java.util.Properties for details.

The key attribute of each entry element must be a hex value representing a Unicode code point, and the content of the entry element is a replacement string of arbitrary length.

Example of a replacement text table (also available as a real file, example-table.xml, in the transformer directory):


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>		
	<comment>
	  This is an example of an input translation table for int_daisy_unicodeTranscoder.
	  The key attribute contains the hex codepoint to be translated,
	  and the entry text node the replacement string.
	  The entries match two hebrew characters and some other stuff.
	  The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
	</comment>	
	<entry key="05E2">hebrew ayin</entry>	
	<entry key="05DD">hebrew final mem</entry>		
	<entry key="00A5">currency yen</entry>
	<entry key="00AE">registered sign</entry>
</properties>
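
Because the table uses the java.util.Properties XML format, it can be read with `Properties.loadFromXML`. The Java sketch below shows how such a table might be loaded and applied to a string. It is an illustration under assumptions, not the transformer's own code: the class name `SubstitutionTableSketch` and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: loads a substitution table and applies it to text.
final class SubstitutionTableSketch {

    // Reads a Properties XML document and returns code point -> replacement.
    static Map<Integer, String> load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(in); // also closes the stream
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Each key is a hex code point such as "05E2".
            int codePoint = Integer.parseInt(key.trim(), 16);
            table.put(codePoint, props.getProperty(key));
        }
        return table;
    }

    // Convenience wrapper for an in-memory, UTF-8 encoded table.
    static Map<Integer, String> parse(String xml) {
        try {
            return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replaces every code point that has a table entry; others pass through.
    static String substitute(String text, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String replacement = table.get(codePoint);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

Given the example table above, `substitute("\u00A5", table)` would yield "currency yen". Iterating by code point rather than by `char` keeps the lookup correct for keys outside the Basic Multilingual Plane.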
	

Appendix: List of Transformers used

The documents linked below are part of the Transformer technical documentation. They are aimed at developers and system administrators.

  1. Unicode Transcoder
  2. Charset Switcher



