
Localizing and customizing Pipeline Narrator
Latest update: 2006-08-24
Introduction
The purpose of this document is to show anyone localizing Pipeline Narrator
where to find the localizable features, and which files to edit or
create new versions of.
Localizing Pipeline Narrator means adjusting it to produce
digital talking books in languages not yet covered by Narrator. It can also mean
localizing the user interface. The sections Narrator Transformer
Localization and User Interface Localization describe
each of those tasks. Note: user interface localization is not necessary in order
to localize the production of books.
You will need to supply language codes in several places when localizing Narrator.
These language codes are the lower-case, two-letter codes defined by
ISO 639-1. A full list of these codes is available at a number of sites, such as
http://www.loc.gov/standards/iso639-2/englangn.html.
Note: This document is not transformer documentation. To learn more about each
of the transformers, please read the respective transformer documentation,
which should be found in the doc/transformers/
directory.
Available Localizations
There is no neat way of finding out what localizations are
available in your Pipeline Narrator installation. The easiest way is
to examine the files in each transformer directory and see what they
contain, or run a book with xml:lang="xx", where xx is
your language code, and see what comes out.
Default Configuration
Pipeline Narrator is supposed to work for English texts out of the box. The default
configuration is the one used at TPB when producing university-level course
literature in English. There are more settings to tweak than the localizable ones, and they are
described elsewhere. Please read the documentation for each transformer to learn more.
Narrator Transformer Localization
Abbreviation and Acronym Detection (se_tpb_xmldetection)
se_tpb_xmldetection is a highly language-dependent transformer when used for
abbreviation and acronym detection (see
Sentence Detection for other usage).
Despite the transformer name, it isn't really xml that is detected, but rather patterns
and strings in the text. Such patterns and strings are defined in certain language
files that reside in
../../transformers/se_tpb_xmldetection/lang/.
The language files contain abbreviations, acronyms and initialisms together
with their corresponding expansions, for the TTS to read. That way, the TTS
may be able to say "that is" instead of just "i e", and so on.
If you are using Narrator to produce digital talking books in a language not yet
covered by Narrator, you probably want to write your own language file.
A short example follows, but you may want to consult the
transformer documentation
for a more thorough description of how to write such files.
<language xml:lang="en">
  <initialism before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
    <key>
      <name>ACP</name>
      <expansion>African, Caribbean and Pacific Countries</expansion>
    </key>
  </initialism>
  <acronym before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
    <key>
      <name>DAISY</name>
      <expansion id="daisyBook">Digital Accessible Information System</expansion>
    </key>
  </acronym>
  <abbreviation before=".*[\s(]|^" after="([,\.\s:;?!)].*)|$">
    <key>
      <name>e.g.</name>
      <name>eg.</name>
      <expansion>for example</expansion>
    </key>
  </abbreviation>
</language>
In the above example, there are three main elements: initialism, acronym
and abbreviation. All three can have multiple key children.
-
Initialisms are supposed to be spelled out letter by letter, in this example "A, C, P",
rather than having the TTS mumble something quite unintelligible.
-
Acronyms are supposed to be read out like a word. In this case, proper acronym
mark-up is added to the document.
-
Abbreviations are exchanged at TTS processing-time. The expansion, instead of
the name, is read by the TTS.
Once you have produced a file for your language, you have to tell Narrator the file exists.
You do so by editing the file
../../transformers/se_tpb_xmldetection/lang.xml,
adding the mapping between a language code and your new file.
Structure Announcer (se_tpb_annonsator)
The structure announcer adds spoken introductions and/or terminations to structures
such as tables, sidebars and notes. The announcements are read by the TTS and need to be rewritten if a language
not yet covered by Narrator is being used. The announcements are found in the
../../transformers/se_tpb_annonsator/type
directory. The file dtbook-2005.xml
contains the announcements made in a book that complies with the DTBook 2005 standard.
The file contains rule elements, each with a match attribute that
contains an xpath expression defining which elements the rule should be applied to.
Typically, when localizing Narrator, no new rules have to be added. What you need to
add instead is a lang child of the rule element, with
the lang attribute matching your language.
The lang element has two optional
children: before and after that contain the text to be read
before and after any matching structure from the book.
The file also contains an element called copy. That element contains
xslt code that produces the spoken announcements of list items in numbered lists
(<list type="ol"...). If you want the spoken announcements to appear
in lists with roman numerals, you have to edit the file, adding a <xsl:when
test="lang('xx')">... where xx is your language code. You'll see
tests for lang('yy') and the easiest way is just to copy one of them,
and change the language code and the announcement text. If you don't have numbered
lists using roman numerals, you can skip this and your lists will be fine anyway.
Sentence Detection (se_tpb_xmldetection)
The sentence detection uses Java's java.text.BreakIterator to find sentence
boundaries. All localization is done automagically by Java using the document's current locale.
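As a minimal sketch of this mechanism (not the transformer's actual code), the following shows how java.text.BreakIterator finds sentence boundaries for a given locale. The class and method names are hypothetical, and passing the locale explicitly is an assumption about how the document language would be supplied.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitSketch {

    /** Splits text into sentences using the BreakIterator rules for the given locale. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. Here is another one!";
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Because the boundary rules come from the Java runtime, there is no language file to write for this step; the quality of the result depends on how well the runtime supports the locale in question.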
Synchronization Point Normalization (se_tpb_syncPointNormalizer)
Language agnostic.
Speech Generation (se_tpb_speechgen2)
se_tpb_speechgenerator takes care of the audio file/speech generation. It has
several language specific features that need to be adjusted to get the most out
of the system.
-
TTS Builder Configuration
se_tpb_speechgenerator is mainly configured using the file
ttsbuilder.xml. That is the file to edit to change file names for
the following features.
Please refer to the
transformer documentation and the multilanguage support documentation for a more thorough description
of the transformer configuration.
-
Regular Expressions
Every chunk of text sent to the TTS optionally goes through a search-and-replace routine.
The routine consists of a list of regular expressions, specified using a certain xml format.
At run-time, the regular expressions are read from the file named by the
generalRegexFilename parameter in ttsbuilder.xml.
You can edit the supplied file or create a new one and change the TTS builder
configuration to point to that one instead.
ttsbuilder.xml: parameter name: generalRegexFilename.
-
Years
Many four-digit numbers in text are years. Regular expressions can be used to have the TTS
read such numbers as years rather than as ordinary numbers (1952 » "nineteen fifty two" instead of
"one thousand nine hundred and fifty two"). For Swedish and English, the expressions
have already been completed and the English ones can be found in
year_en.xml.
If your language distinguishes between reading an ordinary number and reading a
year, a localization of Narrator should include a localized version of this file.
ttsbuilder.xml parameter name: yearFilename.
-
XSLT
Every sync point is extracted from the content document with its xml context
intact. An xsl transformation is done on that small xml fragment using text
as output format. This gives the ability to add text to some elements
(for example: add the word "page" prior to the text node from a pagenum)
or add ssml before and after some constructs.
Some announcements are made using xslt instead of
se_tpb_annonsator. The reason is that the
announcement can then be placed in the same sentence as the content, giving the synthetic voice
better flow. For example, the xslt announcement of the element
<pagenum id="p-7">7</pagenum> would be "page 7"
whereas se_tpb_annonsator produces "Page. 7". Those two text strings
give very different output from the synthesis. Such rules are localized
by adding your own xml:lang='something' to the xslt
match-attribute.
ttsbuilder.xml parameter name: xsltFilename.
-
Character Translation Table
Some TTS systems are unable to pronounce some characters. For example,
an English TTS might not be able to pronounce the Swedish characters "å", "ä" and "ö".
To prevent TTS crashes, you can translate such characters to
arbitrary text strings using a simple key-value mapping. The file containing
the mapping uses Java's properties xml format, with the hex codepoint as
key and the replacement string as the value.
ttsbuilder.xml: parameter name: characterTranslationTable.
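Two of the features listed above, the year expressions and the character translation table, can be illustrated with a small stand-alone sketch. It is not the transformer's code. The year pattern is a hypothetical one and is not taken from year_en.xml, the class and method names are made up, and the key spelling in the table ("00E5" for U+00E5) is an assumption about what "hex codepoint" means here. Only the overall properties xml layout (a properties root with entry elements) is the standard Java format.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.regex.Pattern;

final class TtsTextPrepSketch {

    // Hypothetical year pattern: a standalone four-digit number whose first two
    // digits are 11-19 and whose third digit is not 0 (for example 1952).
    private static final Pattern YEAR = Pattern.compile("\\b(1[1-9])([1-9]\\d)\\b");

    /** Splits such numbers in two ("1952" becomes "19 52") so they are read as two pairs. */
    static String splitYears(String text) {
        return YEAR.matcher(text).replaceAll("$1 $2");
    }

    // Inline stand-in for a translation table file in Java's properties xml format.
    // The key spelling (four upper-case hex digits) is an assumption.
    static final String TABLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"00E5\">aa</entry>\n"
        + "  <entry key=\"00E4\">ae</entry>\n"
        + "  <entry key=\"00F6\">oe</entry>\n"
        + "</properties>\n";

    /** Loads a key-value table from properties xml text. */
    static Properties loadTable(String xml) {
        Properties table = new Properties();
        try {
            table.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    /** Replaces every character that has a table entry; all others pass through unchanged. */
    static String translate(String text, Properties table) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String replacement = table.getProperty(String.format("%04X", codePoint));
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Properties table = loadTable(TABLE_XML);
        System.out.println(translate(splitYears("Bl\u00E5b\u00E4r, 1952"), table));
    }
}
```

A pattern this simple will also split four-digit numbers that are not years, which is one reason the real expressions are kept in a separate, language-specific file that can be tuned.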
File Set Creator (se_tpb_filesetcreator)
A Z39.86 fileset contains a resource file. To add more languages, just extend the
existing file by adding more resources with another xml:lang. Note that audio must be
supplied for the added resources.
Audio Encoder (se_tpb_dtbAudioEncoder)
Language agnostic.
Z3986-2005 to Daisy 2.02 Converter (se_tpb_zed2daisy202)
Language agnostic.
User Interface Localization
The Pipeline transformers make use of the internationalization features
in the DMFC package. That way the messages displayed via the standard
EventSender during transformer execution are localizable. There is no need
to do interface localization in order to produce books in different languages.
Default messages.properties
In every transformer directory, there is a file called messages.properties.
The file has a simple syntax and describes a key-value mapping. The key is typically
a fairly understandable and descriptive name of a message, and the value is the message
itself, i.e. what's supposed to be printed on screen. messages.properties
contains the default messages as defined by the transformer developer. The language used
should be English. The file should not be removed or edited.
The following example of a messages.properties file comes from the
se_tpb_filesetcreator transformer. Lines starting with # are considered comments.
The left hand side of the equals sign is the key (the message name) and the right hand
side the value (message text). The curly braces in the message text denote parameters sent
by the transformer:
########## Message properties for FileSetCreator ##########
# {0} is the current input filename
USING_INPUT_FILE = Using input file {0}
# {0} is the current output directory name
USING_OUTPUT_DIR = Using output directory {0}
SEARCHING_FOR_REFERRED_FILES = Searching for referred files...
GENERATING_SMIL = Generating SMIL files...
GENERATING_NCX = Generating NCX...
GENERATING_OPF = Generating OPF...
AUDIO_FILE_COPY = Copying audio files...
DONE = Done!
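The {0} placeholders follow the syntax of java.text.MessageFormat. Assuming that is how the parameters are filled in (this document does not say so explicitly), the lookup and substitution can be sketched as follows. The class and method names are hypothetical and the messages are parsed from an inline string rather than from a transformer directory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Properties;

final class MessageLookupSketch {

    /** Parses key-value text written in the messages.properties syntax. */
    static Properties parse(String propertiesText) {
        Properties messages = new Properties();
        try {
            messages.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }

    /** Looks up a message by key and fills in {0}, {1}, ... with the given parameters. */
    static String message(Properties messages, String key, Object... params) {
        // Assumes the key exists; a missing key would need separate handling.
        return MessageFormat.format(messages.getProperty(key), params);
    }

    public static void main(String[] args) {
        Properties messages = parse(
              "# {0} is the current input filename\n"
            + "USING_INPUT_FILE = Using input file {0}\n"
            + "DONE = Done!\n");
        System.out.println(message(messages, "USING_INPUT_FILE", "book.xml"));
        System.out.println(message(messages, "DONE"));
    }
}
```

If MessageFormat is indeed what formats the messages, note that it treats a single quote as an escape character, so a literal apostrophe in a message text has to be written as two single quotes.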
Localized Messages
If you'd like to rewrite some of the messages, or have messages displayed in
a language other than the default English, you can
do so by adding a localized message properties file. The file must follow
the same simple syntax as messages.properties and have the same
keys. You only need to change the values, and save the file with a name like
messages_xx.properties, where xx is your language code.
The localized file is to be placed in the transformer directory.
The name of a Swedish message properties file is messages_sv.properties,
and a Swedish localization of the file shown above could look like
the following:
########## Message properties for FileSetCreator ##########
USING_INPUT_FILE = Använder som indata {0}
USING_OUTPUT_DIR = Använder som utkatalog {0}
SEARCHING_FOR_REFERRED_FILES = Söker efter refererade filer...
GENERATING_SMIL = Genererar SMIL-filer...
GENERATING_NCX = Genererar NCX...
GENERATING_OPF = Genererar OPF...
AUDIO_FILE_COPY = Kopierar filer...
DONE = Klart!
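Since a localized file has to contain the same keys as messages.properties, it can be handy to check that before using it. The following is a small sketch of such a check. It is a hypothetical helper and not part of Narrator, and it compares two inline strings instead of reading the files from a transformer directory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

final class MessageKeyCheckSketch {

    /** Parses key-value text written in the messages.properties syntax. */
    static Properties parse(String propertiesText) {
        Properties messages = new Properties();
        try {
            messages.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }

    /** Returns the keys present in the default messages but absent from the localized ones. */
    static Set<String> missingKeys(Properties defaults, Properties localized) {
        Set<String> missing = new TreeSet<>(defaults.stringPropertyNames());
        missing.removeAll(localized.stringPropertyNames());
        return missing;
    }

    public static void main(String[] args) {
        Properties defaults = parse("GENERATING_NCX = Generating NCX...\nDONE = Done!\n");
        Properties swedish = parse("GENERATING_NCX = Genererar NCX...\n");
        // Prints the set of keys that still lack a translation.
        System.out.println(missingKeys(defaults, swedish));
    }
}
```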
Committing Localizations
If you have produced a localization of Narrator and want to share it with others,
please contact one of the developers listed as administrators on the SourceForge
daisymfc project members list. That way your localization may be committed to
the project CVS, free for anyone to download, and possibly included in future releases of
Pipeline Narrator.
Author
Martin Blomberg, TPB
Licensing
LGPL