Localizing and customizing Pipeline Narrator

Martin Blomberg

Latest update: 2006-08-24

Introduction

This document gives anyone trying to localize Pipeline Narrator pointers to where the localizable features are and which files to edit or create new versions of. Localizing Pipeline Narrator means adjusting it to produce digital talking books in languages not yet covered by Narrator. It can also mean localizing the user interface. The sections Narrator Transformer Localization and User Interface Localization describe these two tasks. Note: user interface localization is not necessary in order to localize the production of books.

You will be asked for language codes in several places when localizing Narrator. These are the lower-case, two-letter codes defined by ISO 639-1. A full list of the codes is available at a number of sites, such as: http://www.loc.gov/standards/iso639-2/englangn.html.

Note: This document is not transformer documentation. To learn more about each of the transformers, please read the respective transformer documentation, which should be found in the doc/transformers/ directory.

Available Localizations

There is no neat way of finding out what localizations are available in your Pipeline Narrator installation. The easiest way is to examine the files in each transformer directory and see what they contain, or run a book with xml:lang="xx", where xx is your language code, and see what comes out.

Default Configuration

Pipeline Narrator is supposed to work for English texts out of the box. The default configuration is the one used at TPB when producing university level course literature in English. There are more settings to tweak than the localizable ones, and they are described elsewhere. Please read the documentation of each transformer to learn more.

Narrator Transformer Localization

Abbreviation and Acronym Detection (se_tpb_xmldetection)

Transformer documentation.

se_tpb_xmldetection is a highly language-dependent transformer when used for abbreviation and acronym detection (see Sentence Detection for other usage). Despite the transformer name, it isn't really xml that is detected, but rather patterns and strings in the text. Such patterns and strings are defined in language files that reside in ../../transformers/se_tpb_xmldetection/lang/. The language files contain abbreviations, acronyms and initialisms together with their corresponding expansions, for the TTS to read. That way, the TTS may be able to say "that is" instead of just "i e", and so on.

If you are using Narrator to produce digital talking books in a language not yet covered by Narrator, you probably want to write your own language file. A short example follows, but you may want to consult the transformer documentation for a more thorough description of how to write such files.

<language xml:lang="en">
    <initialism before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>ACP</name>
            <expansion>African, Caribbean and Pacific Countries</expansion>
        </key>
    </initialism>

    <acronym before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>DAISY</name>
            <expansion id="daisyBook">Digital Accessible Information System</expansion>
        </key>
    </acronym>

    <abbreviation before=".*[\s(]|^" after="([,\.\s:;?!)].*)|$">
        <key>
            <name>e.g.</name>
            <name>eg.</name>
            <expansion>for example</expansion>
        </key>
    </abbreviation>
</language>	

In the above example, there are three main elements: initialism, acronym and abbreviation. All three can have multiple key children.

  • Initialisms are things supposed to be spelled out, in this example "A, C, P" rather than having the TTS mumble something quite unintelligible.

  • Acronyms are supposed to be read out like a word. In this case, proper acronym mark-up is added to the document.

  • Abbreviations are exchanged at TTS processing-time. The expansion, instead of the name, is read by the TTS.
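The before, after and suffix attributes in the example are regular expressions describing the context in which a name may occur. The following is a minimal sketch of how such a context test could work, using the attribute values from the initialism element above. How se_tpb_xmldetection actually combines the three attributes is not described here, so the composition in matches() and the class name ContextMatchSketch are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

class ContextMatchSketch {
    // Attribute values copied from the <initialism> element in the example above.
    static final String BEFORE = ".*[\\s(]|^";
    static final String AFTER = "([\\-,\\.\\s:;?!)].*)|$";
    static final String SUFFIX = "s|:s";

    /**
     * Assumed composition (not taken from the transformer source):
     * before + literal name + optional suffix + after must cover the whole text.
     */
    static boolean matches(String name, String text) {
        Pattern p = Pattern.compile(
                "(?:" + BEFORE + ")"
                + Pattern.quote(name)          // the <name> is treated as a literal string
                + "(?:" + SUFFIX + ")?"
                + "(?:" + AFTER + ")");
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"The ACP states", "(ACP)", "ACP:s agenda", "ACPX", "XACP"};
        for (String s : samples) {
            System.out.println(s + " -> " + matches("ACP", s));
        }
    }
}
```

Under this reading, a name is only recognized when it is delimited by whitespace, an opening parenthesis or the start of the text on the left, and by punctuation, whitespace or the end of the text on the right. When writing a language file for a new language, trying the patterns against a few sample sentences in this way may help catch contexts that are too strict or too loose.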

Once you have produced a file for your language, you have to tell Narrator the file exists. You do so by editing the file ../../transformers/se_tpb_xmldetection/lang.xml, adding the mapping between a language code and your new file.

Structure Announcer (se_tpb_annonsator)

Transformer documentation.

Structure announcer adds spoken introductions and/or terminations of structures, such as tables, sidebars and notes. The announcements are read by the TTS and need to be rewritten if a language not yet covered by Narrator is being used. The announcements are found in the ../../transformers/se_tpb_annonsator/type directory. The file dtbook-2005.xml contains the announcements made in a book that complies with the DTBook 2005 standard.

The file contains rule elements, each one with the attribute match, which contains an xpath defining which elements the rule should be applied to. Typically, no new rules have to be added when localizing Narrator. What you need to add is instead a lang child of the rule element, with the lang attribute matching your language. The lang element has two optional children, before and after, that contain the text to be read before and after any matching structure from the book.

The file also contains an element called copy. That element contains xslt code that produces the spoken announcements of list items in numbered lists (<list type="ol"...). If you want the spoken announcements to appear in lists with roman numerals, you have to edit the file, adding a <xsl:when test="lang('xx')">... where xx is your language code. You'll see tests for lang('yy'), and the easiest way is to copy one of them and change the language code and the announcement text. If you don't have numbered lists using roman numerals, you can skip this and your lists will be fine anyway.

Sentence Detection (se_tpb_xmldetection)

Transformer documentation.

The sentence detection uses Java's java.text.BreakIterator to find sentence boundaries. All localization is done automagically by Java using the document's current locale.
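As a rough illustration of what java.text.BreakIterator does, the sketch below splits a text into sentences for a given language code. The class and method names are hypothetical, and mapping the language code to a Locale with Locale.forLanguageTag is an assumption; the transformer may obtain its locale differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {
    /** Splits text into sentences using the break rules Java provides for the given language. */
    static List<String> sentences(String text, String langCode) {
        // Assumption: the two-letter code from xml:lang is turned into a Locale like this.
        Locale locale = Locale.forLanguageTag(langCode);
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each segment includes trailing whitespace, so trim it away.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("This is the first sentence. Here is the second one.", "en")) {
            System.out.println(s);
        }
    }
}
```

Running a few representative paragraphs in your language through such a snippet gives an idea of how well the built-in rules handle it before producing a whole book.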

Synchronization Point Normalization (se_tpb_syncPointNormalizer)

Transformer documentation.

Language agnostic.

Speech Generation (se_tpb_speechgen2)

Transformer documentation.

se_tpb_speechgenerator takes care of the audio file/speech generation. It has several language specific features that need to be adjusted to get the most out of the system.

  • TTS Builder Configuration

    se_tpb_speechgenerator is mainly configured using the file ttsbuilder.xml. That is the file to edit to change file names for the following features. Please refer to the transformer documentation and the multilanguage support documentation for a more thorough description of the transformer configuration.

  • Regular Expressions

    Every chunk of text sent to the TTS optionally goes through a search-replace routine. The routine consists of a list of regexes to use, specified using a certain xml format. At run-time, the regular expressions are read from disk according to what's in the ttsbuilder.xml-file associated with the parameter name generalRegexFilename. You can edit the supplied file or create a new one and change the TTS builder configuration to point to that one instead.

    ttsbuilder.xml parameter name: generalRegexFilename.

  • Years

    Many numbers in running text are in fact years. To have the TTS read such numbers as years rather than as ordinary numbers (1952 » "nineteen fifty two" instead of "one thousand nine hundred and fifty two"), regular expressions can be used. For Swedish and English, the expressions have already been written, and the English ones can be found in year_en.xml. If your language distinguishes between reading an ordinary number and reading a year, a localization of Narrator should contain a localized version of this file.

    ttsbuilder.xml parameter name: yearFilename.

  • XSLT

    Every sync point is extracted from the content document with its xml context intact. An xsl transformation is done on that small xml fragment using text as output format. This gives the ability to add text to some elements (for example: add the word "page" prior to the text node from a pagenum) or add ssml before and after some constructs.

    Some announcements are made using xslt instead of se_tpb_annonsator. The reason is that the announcement can then be placed in the same sentence as the text, giving the synthetic voice better flow. For example, the xslt announcement of the element <pagenum id="p-7">7</pagenum> would be "page 7" whereas se_tpb_annonsator produces "Page. 7". Those two text strings give very different output from the synthesis. Such rules are localized by adding your own xml:lang='something' to the xslt match attribute.

    ttsbuilder.xml parameter name: xsltFilename.

  • Character Translation Table

    Some TTS systems are unable to pronounce some characters. For example, an English TTS might not be able to pronounce the Swedish characters "å", "ä" and "ö". To prevent TTS crashes, you are able to translate certain characters to arbitrary text strings using a simple key-value mapping. The file containing the mapping uses Java's properties xml format, with the hex codepoint as key and the replacement string as the value.

    ttsbuilder.xml parameter name: characterTranslationTable.
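The character translation table described above can be sketched as follows. The example loads a mapping in Java's properties xml format with Properties.loadFromXML and applies it to a string. The exact key format expected by the transformer (upper or lower case, zero padding) is not stated here, so parsing the key as a plain hexadecimal number is an assumption, as are the class name, the sample entry and the replacement text "aa".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class CharacterTranslationSketch {
    // Hypothetical table with a single entry: U+00E5 (a with ring above) is replaced by "aa".
    static final String SAMPLE_TABLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"00E5\">aa</entry>\n"
            + "</properties>\n";

    /** Reads the properties xml and converts each hex key into a code point. */
    static Map<Integer, String> load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            // Assumption: keys are bare hexadecimal code points such as "00E5".
            table.put(Integer.parseInt(key, 16), props.getProperty(key));
        }
        return table;
    }

    /** Replaces every code point found in the table and leaves all others untouched. */
    static String translate(String text, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("f\u00E5r", load(SAMPLE_TABLE)));
    }
}
```

Working on code points rather than char values means characters outside the Basic Multilingual Plane can be listed in the table as well.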

File Set Creator (se_tpb_filesetcreator)

Transformer documentation.

A Z39.86 fileset contains a resource file. To add more languages, just extend the existing file by adding more resources with another xml:lang. Note that audio must be supplied.

Audio Encoder (se_tpb_dtbAudioEncoder)

Transformer documentation.

Language agnostic.

Z3986-2005 to Daisy 2.02 Converter (se_tpb_zed2daisy202)

Transformer documentation.

Language agnostic.

User Interface Localization

The Pipeline transformers make use of the internationalization features in the DMFC package. That way the messages displayed via the standard EventSender during transformer execution are localizable. There is no need to do interface localization in order to produce books in different languages.

Default messages.properties

In every transformer directory, there is a file called messages.properties. The file has a simple syntax and describes a key-value mapping. The key is typically a fairly understandable and descriptive name of a message, and the value is the message itself, i.e. what's supposed to be printed on screen. messages.properties contains the default messages as defined by the transformer developer. The language used should be English. The file should not be removed or edited.

The following example of a messages.properties file comes from the se_tpb_filesetcreator-transformer. Lines starting with # are considered comments. The left hand side of the equals sign is the key (the message name) and the right hand side the value (message text). The curly braces in the message text denote parameters sent by the transformer:

########## Message properties for FileSetCreator ##########
# {0} is the current input filename
USING_INPUT_FILE = Using input file {0}
# {0} is the current output directory name
USING_OUTPUT_DIR = Using output directory {0}
SEARCHING_FOR_REFERRED_FILES = Searching for referred files...
GENERATING_SMIL = Generating SMIL files...
GENERATING_NCX = Generating NCX...
GENERATING_OPF = Generating OPF...
AUDIO_FILE_COPY = Copying audio files...
DONE = Done!
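
The key-value syntax and the {0} parameters shown above follow the conventions of java.util.Properties and java.text.MessageFormat. The sketch below reads a fragment of the file and fills in a parameter. Whether the DMFC package uses exactly these classes internally is an assumption, and the class name and the file name book.xml are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Properties;

class MessageFormatSketch {
    /** Parses text in messages.properties syntax; lines starting with # are skipped. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Looks up the message text for a key and substitutes {0}, {1}, ... with the parameters. */
    static String message(Properties messages, String key, Object... params) {
        return MessageFormat.format(messages.getProperty(key), params);
    }

    public static void main(String[] args) {
        Properties messages = parse(
                "# {0} is the current input filename\n"
                + "USING_INPUT_FILE = Using input file {0}\n"
                + "DONE = Done!\n");
        System.out.println(message(messages, "USING_INPUT_FILE", "book.xml"));
        System.out.println(message(messages, "DONE"));
    }
}
```

Note that MessageFormat treats the single quote as a special character, so a message text containing an apostrophe would need it doubled under this assumption.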
	

Localized Messages

If you'd like to rewrite some of the messages, or have messages displayed in a language other than the default English, you can do so by adding a localized message properties file. The file must follow the same simple syntax as messages.properties and have the same keys. You only need to change the values and save the file with a name like messages_xx.properties, where xx is your language code. The localized file is to be placed in the transformer directory.

The name of a Swedish message properties file is messages_sv.properties and to write a Swedish localization of the file shown above, one could produce the following:

########## Message properties for FileSetCreator ##########
USING_INPUT_FILE = Använder som indata {0}
USING_OUTPUT_DIR = Använder som utkatalog {0}
SEARCHING_FOR_REFERRED_FILES = Söker efter refererade filer...
GENERATING_SMIL = Genererar SMIL-filer...
GENERATING_NCX = Genererar NCX...
GENERATING_OPF = Genererar OPF...
AUDIO_FILE_COPY = Kopierar filer...
DONE = Klart!
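
Since a localized file is expected to have the same keys as messages.properties, a small check like the following can help find keys that were forgotten. It is only a sketch: the class and method names are hypothetical, and the file contents are passed in as strings to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

class MessageKeyCheckSketch {
    /** Parses text in messages.properties syntax. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the keys present in the default file but absent from the localized one, sorted. */
    static Set<String> missingKeys(String defaults, String localized) {
        Set<String> missing = new TreeSet<>(parse(defaults).stringPropertyNames());
        missing.removeAll(parse(localized).stringPropertyNames());
        return missing;
    }

    public static void main(String[] args) {
        String defaults = "GENERATING_NCX = Generating NCX...\nDONE = Done!\n";
        String localized = "DONE = Klart!\n";
        System.out.println("Missing keys: " + missingKeys(defaults, localized));
    }
}
```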
	

Committing Localizations

If you have produced a localization of Narrator and want to share it with others, please contact one of the developers listed as administrators on the SourceForge daisymfc project members list. That way your localization may be committed to the project CVS, where it is free for anyone to download, and possibly included in future releases of Pipeline Narrator.

Author

Martin Blomberg, TPB

Licensing

LGPL