Localizing and customizing Pipeline Narrator
 


  Martin Blomberg
  Latest update: 2006-08-24
  


	Introduction
	Available Localizations
		Default Configuration
	Narrator Transformer Localization
		Abbreviation and Acronym Detection
		Structure Announcer
		Sentence Detection
		Synchronization Point Normalization
		Speech Generation
		File Set Creator
		Audio Encoder
		Z3986-2005 to Daisy 2.02 Converter
	User Interface Localization
		Default messages.properties
		Localized Messages
	Committing Localizations
	Author
	Licensing

	
Introduction
	
		The purpose of this document is to show anyone localizing Pipeline Narrator 
		where the localizable features are, and which files to edit or 
		create new versions of. 
		
		Localizing Pipeline Narrator means adjusting Pipeline Narrator to produce 
		digital talking books in languages not yet covered by Narrator. It can also mean 
		localizing the user interface. The sections Narrator Transformer 
		Localization and User Interface Localization describe 
		each of those tasks. Note: user interface localization is not necessary in order
		to localize the production of books.
	
	
	
		You will be asked for language codes in several places when localizing Narrator.
		These language codes are the lower-case, two-letter codes defined by 
		ISO 639-1. You can find a full list of these codes at a number of sites, such as:
		
		http://www.loc.gov/standards/iso639-2/englangn.html. 
	
	
	
		Note: This document is not transformer documentation. To learn more about each 
		of the transformers, please read the respective transformer documentation, 
		which should be found in the doc/transformers/ directory.
	
	
	
	Available Localizations
	
		There is no neat way of finding out what localizations are
		available in your Pipeline Narrator installation. The easiest way is 
		to examine the files in each transformer directory and see what they
		contain, or run a book with xml:lang="xx", where xx is
		your language code, and see what comes out. 
	
	
	Default Configuration
	
		Pipeline Narrator is supposed to work for English texts out of the box. The default 
		configuration is the one used at TPB when producing university-level course 
		literature in English. There are more settings to tweak than the localizable ones, and they are
		described elsewhere. Please read the documentation for each transformer to learn more.
	
	
			

		
	Narrator Transformer Localization
	
	
	
	
	
	Abbreviation and Acronym Detection (se_tpb_xmldetection)
	Transformer documentation.
	
	
	
		se_tpb_xmldetection is a highly language-dependent transformer when used for
		abbreviation and acronym detection (see 
		Sentence Detection for other usage). 
		Despite the transformer name, it isn't really XML that is detected, but rather patterns 
		and strings in the text. Such patterns and strings are defined in certain language
		files that reside in 
		../../transformers/se_tpb_xmldetection/lang/.
		The language files contain abbreviations, acronyms and initialisms together
		with their corresponding expansions, for the TTS to read. That way, the TTS
		may be able to say "that is" instead of just "i e", and so on.
	
	
	
	
		If you are using Narrator to produce digital talking books in a language not yet
		covered by Narrator, you probably want to write your own language file. 
		A short example follows, but you may want to consult the 
		transformer documentation 
		for a more thorough description of how to write such files.
	
	
	
<language xml:lang="en">
    <initialism before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>ACP</name>
            <expansion>African, Caribbean and Pacific Countries</expansion>
        </key>
    </initialism>

    <acronym before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>DAISY</name>
            <expansion id="daisyBook">Digital Accessible Information System</expansion>
        </key>
    </acronym>

    <abbreviation before=".*[\s(]|^" after="([,\.\s:;?!)].*)|$">
        <key>
            <name>e.g.</name>
            <name>eg.</name>
            <expansion>for example</expansion>
        </key>
    </abbreviation>
</language>	

	
	
		In the above example, there are three main elements: initialism, acronym
		and abbreviation. All three can have multiple key children.
	
	
		
			
				Initialisms are things supposed to be spelled out, in this example "A, C, P" 
				rather than having the TTS mumble something quite unintelligible.
			
		
		
		
			
				Acronyms are supposed to be read out like a word. In this case, proper acronym
				mark-up is added to the document.
			
		
		
		
			
				Abbreviations are exchanged at TTS processing-time. The expansion, instead of
				the name, is read by the TTS. 
			
		
	
	
	
		Once you have produced a file for your language, you have to tell Narrator the file exists.
		You do so by editing the file 
		../../transformers/se_tpb_xmldetection/lang.xml,
		adding the mapping between a language code and your new file.
	
	
	
	
	
	Structure Announcer (se_tpb_annonsator)
	Transformer documentation.
	
		Structure announcer adds spoken introductions and/or terminations of structures,
		such as tables, sidebars and notes. The announcements are read by the TTS and need to be rewritten if a language
		not yet covered by Narrator is being used. The announcements are found in the
		../../transformers/se_tpb_annonsator/type
		directory. The file dtbook-2005.xml
		contains the announcements made in a book that complies with the DTBook 2005 standard.
	
	
	
		The file contains rule elements, each with a match attribute that 
		contains an XPath expression defining which elements the rule applies to.
		Typically, no new rules have to be added when localizing Narrator. What you need to
		add instead is the lang child of the rule element, with
		the lang attribute matching your language. 
		The lang element has two optional
		children, before and after, which contain the text to be read
		before and after any matching structure from the book.
	
	
	
		The file also contains an element called copy. That element contains 
		XSLT code that produces spoken announcements of list items in numbered lists
		(<list type="ol"...). If you want the spoken announcements to appear
		in lists with roman numerals, you have to edit the file adding a <xsl:when 
		test="lang('xx')">... where xx is your language code. You'll see
		tests for lang('yy') and the easiest way is just to copy one of them,
		and change the language code and the announcement text. If you don't have numbered
		lists using roman numerals, you can skip this and your lists will be fine anyway.
	
	
	
	
	Sentence Detection (se_tpb_xmldetection)
	Transformer documentation.
	
		The sentence detection uses Java's java.text.BreakIterator to find sentence
		boundaries. All localization is done automagically by Java using the document's current locale.
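
		To give a feel for what locale-aware sentence detection with java.text.BreakIterator 
		looks like, here is a minimal sketch. The class and method names are made up for this 
		example and are not taken from the transformer source, and the way the transformer 
		derives the locale from the document is not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    /** Splits text into sentences using the break rules of the given locale. */
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary until the iterator reports DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Illustrative input only; a real run would use text from the book.
        for (String s : split("This is one sentence. Here is another one!", Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

		Passing a different Locale is the only change needed to try another language, which is 
		why no language files have to be written for this step.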
	
	
	
	
	Synchronization Point Normalization (se_tpb_syncPointNormalizer)
	Transformer documentation.
	
		Language agnostic.
	
	
	
	
	
	Speech Generation (se_tpb_speechgen2)
	
	
		
			Transformer documentation.
	
	
	
		se_tpb_speechgenerator takes care of the audio file/speech generation. It has
		several language specific features that need to be adjusted to get the most out
		of the system.
	
	
	
		
			TTS Builder Configuration
			
				se_tpb_speechgenerator is mainly configured using the file
				ttsbuilder.xml. That is the file to edit to change file names for
				the following features.
				Please refer to the 
				transformer documentation and the multilanguage support documentation for a more thorough description
				of the transformer configuration.
			
		
		
		
			Regular Expressions
			
				Every chunk of text sent to the TTS optionally goes through a search-and-replace routine.
				The routine applies a list of regular expressions, specified in a dedicated XML format.
				At run time, the regular expressions are read from the file named by the 
				generalRegexFilename parameter in ttsbuilder.xml.
				You can edit the supplied file, or create a new one and change the TTS builder 
				configuration to point to it instead.
			
			
				ttsbuilder.xml: parameter name: generalRegexFilename.
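
				The idea of an ordered search-and-replace pass can be sketched as follows. 
				This is only an illustration: the rules are hard-coded and hypothetical, whereas 
				Narrator reads its rules from the XML file described above, whose format is 
				covered in the transformer documentation and not reproduced here.

```java
public class RegexFilterSketch {

    // Hypothetical rules as {pattern, replacement} pairs, applied in order.
    private static final String[][] RULES = {
        {"\\bkm/h\\b", "kilometres per hour"}, // expand a unit the TTS may misread
        {"\\s+", " "},                         // collapse runs of whitespace
    };

    /** Applies every rule, in order, to the text about to be sent to the TTS. */
    public static String apply(String text) {
        String result = text;
        for (String[] rule : RULES) {
            result = result.replaceAll(rule[0], rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(apply("Max speed:  80 km/h"));
    }
}
```

				Because each rule sees the output of the previous one, the order of the rules 
				matters when writing a regex file for a new language.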
			
		
		
		
			Years
			
				Many numbers in text are really years. To have the TTS read such numbers as years
				rather than as ordinary numbers (1952 » "nineteen fifty two" instead of 
				"one thousand nine hundred and fifty two"),
				regular expressions can be used. For Swedish and English, the expressions
				have already been written, and the English ones can be found in 
				year_en.xml.
				If your language distinguishes between reading an ordinary number and reading a
				year, a localization of Narrator should include a localized version of that file.
			
			
				ttsbuilder.xml parameter name: yearFilename.
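
				One possible approach to year rules, shown here only as a sketch, is to split a 
				four-digit year into two two-digit numbers so that an English TTS reads 
				"19 52" as "nineteen fifty two". The pattern below is an assumption made for 
				illustration and is not copied from year_en.xml. It deliberately skips years 
				such as 1900 to 1909, which would need rules of their own.

```java
import java.util.regex.Pattern;

public class YearRegexSketch {

    // A standalone four-digit number 1110..1999 whose last two digits are 10..99.
    // Hypothetical pattern; the shipped year files may work differently.
    private static final Pattern YEAR = Pattern.compile("\\b(1[1-9])([1-9]\\d)\\b");

    /** Rewrites matching years as two two-digit numbers separated by a space. */
    public static String splitYears(String text) {
        return YEAR.matcher(text).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(splitYears("She was born in 1952."));
    }
}
```

				A rule this simple also rewrites four-digit numbers that are not years, so a real 
				rule set usually needs to look at the surrounding words as well.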
			
		
		
		
			XSLT
			
				Every sync point is extracted from the content document with its XML context
				intact. An XSL transformation is done on that small XML fragment using text
				as output format. This gives the ability to add text to some elements
				(for example: add the word "page" prior to the text node from a pagenum) 
				or add SSML before and after some constructs.
			
			
				Some announcements are made using XSLT instead of
				se_tpb_annonsator. The reason is that the
				announcement can then be placed in the same sentence as the content, giving the synthetic voice 
				better flow. For example, the XSLT announcement of the element 
				<pagenum id="p-7">7</pagenum> would be "page 7",
				whereas se_tpb_annonsator produces "Page. 7". Those two text strings
				give very different output from the synthesis. Such rules are localized 
				by adding your own xml:lang='something' to the XSLT 
				match attribute.
			
			
			
				ttsbuilder.xml parameter name: xsltFilename.
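
				The sketch below runs a tiny stylesheet over a pagenum fragment with text 
				output, using the standard javax.xml.transform API. The stylesheet is a 
				simplified assumption written for this example: it ignores the DTBook namespace, 
				tests xml:lang only on the element itself, and uses "sida" as a sample Swedish 
				announcement. The stylesheet shipped with Narrator may be organized differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class PagenumXsltSketch {

    // Hypothetical stylesheet: a default rule plus one localized rule.
    private static final String XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"pagenum\">page <xsl:value-of select=\".\"/></xsl:template>"
        + "<xsl:template match=\"pagenum[@xml:lang='sv']\">sida <xsl:value-of select=\".\"/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms one XML fragment into the text that would be sent to the TTS. */
    public static String speak(String xmlFragment) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xmlFragment)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(speak("<pagenum id=\"p-7\">7</pagenum>"));
        System.out.println(speak("<pagenum xml:lang=\"sv\" id=\"p-7\">7</pagenum>"));
    }
}
```

				The localized template wins over the generic one because a match pattern with a 
				predicate has a higher default priority in XSLT 1.0.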
			
		
		
		
			Character Translation Table
			
				Some TTS systems are unable to pronounce some characters. For example, 
				an English TTS might not be able to pronounce the Swedish characters "å", "ä" and "ö".
				To prevent TTS crashes, you can translate certain characters to 
				arbitrary text strings using a simple key-value mapping. The file containing 
				the mapping uses Java's properties XML format, with the hex code point as
				key and the replacement string as the value.
			
			
				ttsbuilder.xml: parameter name: characterTranslationTable.
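
				As an illustration, the sketch below loads a small table in Java's properties 
				XML format and applies it to a string. The exact key notation is an assumption: 
				keys are written here as plain hexadecimal digits such as 00E5, and the 
				replacement strings are examples only. Check the supplied table for the notation 
				Narrator actually expects.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CharTranslationSketch {

    // Hypothetical table: hex code point as key, replacement text as value.
    public static final String TABLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"00E5\">aa</entry>\n"
        + "  <entry key=\"00E4\">ae</entry>\n"
        + "  <entry key=\"00F6\">oe</entry>\n"
        + "</properties>\n";

    /** Parses the properties XML into a code point to replacement map. */
    public static Map<Integer, String> loadTable(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read translation table", e);
        }
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            table.put(Integer.parseInt(key, 16), props.getProperty(key));
        }
        return table;
    }

    /** Replaces every code point found in the table; other characters pass through. */
    public static String translate(String text, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("sj\u00F6", loadTable(TABLE_XML)));
    }
}
```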
			
		
	
	
	
	
	
	File Set Creator (se_tpb_filesetcreator)
	Transformer documentation.
	
		A Z39.86 fileset contains a resource file. To add more languages, just extend the 
		existing file by adding more resources with another xml:lang. Note that audio must be 
		supplied. 
	
	
	
	
	
	Audio Encoder (se_tpb_dtbAudioEncoder)
	Transformer documentation.
	Language agnostic.
	
	
	
	
	Z3986-2005 to Daisy 2.02 Converter (se_tpb_zed2daisy202)
	Transformer documentation.
	Language agnostic.
			
	
	User Interface Localization
	
		The Pipeline transformers make use of the internationalization features
		in the DMFC package. That way the messages displayed via the standard
		EventSender during transformer execution are localizable. There is no need
		to do interface localization in order to produce books in different languages.
	
	
	Default messages.properties
	
		In every transformer directory, there is a file called messages.properties.
		The file has a simple syntax and describes a key-value mapping. The key is typically 
		a fairly understandable and descriptive name of a message, and the value is the message 
		itself, i.e. what's supposed to be printed on screen. messages.properties
		contains the default messages as defined by the transformer developer. The language used
		should be English. The file should not be removed or edited.
	
	
		The following example of a messages.properties file comes from the 
		se_tpb_filesetcreator-transformer.
		Lines starting with # are considered comments. The left hand side of the 
		equals sign is the key (the message name) and the right hand side the value (message text).
		The curly braces in the message text denote parameters sent 
		by the transformer:
	
	
	
########## Message properties for FileSetCreator ##########
# {0} is the current input filename
USING_INPUT_FILE = Using input file {0}
# {0} is the current output directory name
USING_OUTPUT_DIR = Using output directory {0}
SEARCHING_FOR_REFERRED_FILES = Searching for referred files...
GENERATING_SMIL = Generating SMIL files...
GENERATING_NCX = Generating NCX...
GENERATING_OPF = Generating OPF...
AUDIO_FILE_COPY = Copying audio files...
DONE = Done!
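
		To illustrate how a key, a message text and a {0} parameter fit together, here is a 
		minimal sketch using java.util.Properties and java.text.MessageFormat. Whether the 
		DMFC package formats messages exactly this way is an assumption; the class and method 
		names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.Properties;

public class MessageLookupSketch {

    // A shortened copy of the example file above, inlined to keep the sketch self-contained.
    public static final String DEFAULT_MESSAGES =
          "# {0} is the current input filename\n"
        + "USING_INPUT_FILE = Using input file {0}\n"
        + "DONE = Done!\n";

    /** Looks up a message by key and fills in its parameters. */
    public static String message(String propertiesText, String key, Object... params) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read messages", e);
        }
        String pattern = props.getProperty(key);
        if (pattern == null) {
            return key; // fall back to the key itself when no message is defined
        }
        return MessageFormat.format(pattern, params);
    }

    public static void main(String[] args) {
        System.out.println(message(DEFAULT_MESSAGES, "USING_INPUT_FILE", "book.xml"));
        System.out.println(message(DEFAULT_MESSAGES, "DONE"));
    }
}
```

		Note that MessageFormat treats the single quote as an escape character, so if messages 
		are formatted this way, a literal apostrophe in a message text has to be written as two 
		single quotes.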
	
	
	
	Localized Messages
	
		If you'd like to rewrite some of the messages, or have messages displayed in 
		a language other than the default English, you can 
		do so by adding a localized message properties file. The file must of course follow
		the same simple syntax as messages.properties and also have the same
		keys. You only need to change the values, and save the file with a name like
		messages_xx.properties, where xx is your language code.
		The localized file is to be placed in the transformer directory.
	
	
	
		The name of a Swedish message properties file is messages_sv.properties 
		and to write a Swedish localization of the file shown above, one could produce
		the following: 
	
	
########## Message properties for FileSetCreator ##########
USING_INPUT_FILE = Använder som indata {0}
USING_OUTPUT_DIR = Använder som utkatalog {0}
SEARCHING_FOR_REFERRED_FILES = Söker efter refererade filer...
GENERATING_SMIL = Genererar SMIL-filer...
GENERATING_NCX = Genererar NCX...
GENERATING_OPF = Genererar OPF...
AUDIO_FILE_COPY = Kopierar ljudfiler...
DONE = Klart!
	
		

		
Committing Localizations
	
		If you have produced a localization of Narrator and want to share it with others,
		please contact one of the developers listed as administrators on the SourceForge 
		daisymfc project members list. That way your localization may be committed to
		the project CVS, free for anyone to download and possibly included in future releases of 
		Pipeline Narrator.
	
	

Author
	Martin Blomberg, TPB


Licensing
	LGPL