
Transformer documentation: se_tpb_speechgen2
Transformer Purpose
Generates audio for a full-text dtbook
file. Makes the input file and generated audio ready for se_tpb_filesetcreator.
This transformer can manipulate its input before it is
passed to the TTS system. The text is extracted from the document as
XML fragments, and XSLT can be applied at the sync point level. Arbitrary
Unicode codepoints can be replaced by user-defined strings, and it is
also possible to use regular expressions in a
search-and-replace manner.
This transformer uses a 2-pass approach, i.e. it first
reads through the input file, extracting XML fragments to pass to
the different TTS systems, and then reads through the file once more to
collect the generated audio files.
Regardless of the audio format, attributes will be placed on
the elements representing synch points. Those attributes are smil:clipBegin,
smil:clipEnd and smil:src,
with namespace URI http://www.w3.org/2001/SMIL20/.
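For illustration, a synch point element in the output could carry attributes like the ones below (the sent element and the clip and file values are made-up examples):
<sent smil:clipBegin="0:00:01.250" smil:clipEnd="0:00:04.100"
smil:src="speechgen0001.mp3"
xmlns:smil="http://www.w3.org/2001/SMIL20/">A sentence rendered as audio.</sent>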
Input Requirements
This transformer is written to work with a manuscript, that is,
a dtbook-2005-1 or dtbook-2005-2 document possibly enriched with
elements and attributes from other namespaces. The input document must
be "synch point normalized"; see
se_tpb_syncPointNormalizer for that transformation.
Some elements are supposed to be announced audibly. Those
elements must have attributes holding the say-before and say-after text
strings. se_tpb_annonsator
can be used to add those attributes to a dtbook
document. Since those attribute names are configurable, make sure they
match whatever se_tpb_annonsator uses.
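As an illustration, an element prepared for announcements could look like the fragment below, assuming the attribute names used in the example Speech Generator Configuration further down (the text values are made up):
<noteref annon:before="note" annon:after="end of note"
xmlns:annon="http://www.daisy.org/ns/pipeline/annon">1</noteref>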
Output
On success
Given the expected input, the transformer outputs a manuscript,
that is a dtbook-2005-1 or dtbook-2005-2 document with
additional attributes indicating the corresponding audio. Those
attributes, clipBegin, clipEnd
and src,
with namespace URI http://www.w3.org/2001/SMIL20/,
point out which elements are represented by audio in the
generated talking book. The output also includes the generated audio files
referenced by the smil attributes.
Sentence (sent) level synchronization should be
used; although this is configurable, other usage has not been tested.
On error
No specific recovery scheme. On error, this transformer will
send a fatal message, then throw an exception and abort.
Configuration/Customization
Parameters (tdf)
- inputFilename
- required="true"
The input manuscript file.
Example: /home/books/manuscript.xml
- outputDirectory
- required="true"
Path to the output directory
Example: /home/books/audio
- outputFilename
- required="true"
The desired name of the output manuscript.
Example: /home/books/audio/speechgen-manuscript.xml
- concurrentAudioMerge
- required="false"
Whether the merging of audio should be done concurrently with the speech
generation or not. Due to licensing, some TTS systems spend most of their
time sleeping just to avoid being too effective; Loquendo is an example
of that. If that is the case, why not use the time for something
useful instead, like merging tiny audio clips? Parallel threads will be
spawned to merge the audio.
Possible enum values:
- true
- false
Default: true
- mp3Output
- required="false"
Is mp3 the preferred audio output format? The default option is wav.
Possible enum values:
- true
- false
Default: false
- multiLang
- required="false"
Select whether to use the multi-language support of the TTSBuilder or to always use the default TTS engine configured in ttsBuilder.xml.
Possible enum values:
- true
- false
Default: true
- sgConfigFilename
- required="false"
Speech generator configuration file. See Speech
Generator Configuration for details.
Example: /home/config/file.xml
Default: ${transformer_dir}/config/sgConfig.xml
- ttsBuilderConfig
- required="false"
The tts builder configuration file. See TTS
Builder Configuration for details.
Example: /home/ttsbfiles/file.xml
Default: ${transformer_dir}/ttsbuilder.xml
- ttsBuilderRNG
- required="false"
A relaxng schema with embedded schematron used to validate the tts builder
configuration file.
Example: /home/ttsbfiles/file.rng
Default: ${transformer_dir}/ttsbuilder-configtest.rng
- doSmilSyncAttributeBasedSyncPointLocation
- required="false", default=false
Determines whether synchronization points should be located using an attribute sync in the SMIL namespace in the input document.
This defaults to false, which means that the original behavior of transformer is the default.
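For illustration, with this parameter set to true, an element like the one below would be treated as a synch point (the attribute value shown is an assumption; the requirement stated above is only that a sync attribute in the SMIL namespace is present):
<sent smil:sync="true" xmlns:smil="http://www.w3.org/2001/SMIL20/">Some sentence text.</sent>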
Extended configurability
Speech Generator Configuration
The file pointed to by the tdf parameter sgConfigFilename makes it
possible to affect the processing of the document. Things like which
elements to synch on, where to merge audio and so on are configured there. A
description of the possibilities follows, together with a short example:
- /sgConfig/absoluteSynch/item
- The names of the elements that should be synch points,
no matter where they are.
- /sgConfig/containsSynch/item
- The name of the element defining the synch point level.
- /sgConfig/announceAttributes/item
- Elements of this type show which attributes contain
announcements. Two elements of this kind are allowed, with id values before
(which tells which attribute contains the "say-before" content)
and after (which tells which attribute
contains the "say-after" content). On those elements three attributes (plus
id) must be placed:
- uri: the namespace uri of the
announce-attributes.
- prefix: the namespace prefix.
- local: the attribute's local name.
The element body is empty.
- /sgConfig/mergeAudio/item
- Elements at which to divide the audio into different files.
The values can be seen as the element-only tail of an xpath expression, i.e.
level/hd rather than //level/hd.
- /sgConfig/silence
- There is a possibility to add extra silence before and/or
after certain events in the talking book. Silence is added at the end
of a synch point, never at the beginning. In the current
implementation, the duration of the desired silence is given in
milliseconds, and extra silence can be added for five different events:
- afterLast: After the last phrase
in an audio clip. Typical usage: when audio is merged at a
heading, this adds silence just before the heading.
- afterFirst: After the first phrase
in an audio clip. Typical usage would be just after a heading.
- beforeAnnouncement: Before an
audible announcement.
- afterAnnouncement: After an
audible announcement.
- afterRegularPhrase: After every
regular phrase that's generated.
An example follows:
<?xml version="1.0" encoding="utf-8"?>
<sgConfig>
<absoluteSynch>
<item>pagenum</item>
<item>noteref</item>
<item>annoref</item>
<item>linenum</item>
</absoluteSynch>
<containsSynch>
<item>sent</item>
</containsSynch>
<announceAttributes>
<item id="before" uri="http://www.daisy.org/ns/pipeline/annon" prefix="annon" local="before"/>
<item id="after" uri="http://www.daisy.org/ns/pipeline/annon" prefix="annon" local="after"/>
</announceAttributes>
<mergeAudio>
<item>h1</item>
<item>h2</item>
<item>h3</item>
<item>h4</item>
<item>h5</item>
<item>h6</item>
<item>level/hd</item>
</mergeAudio>
<silence>
<afterLast>2000</afterLast>
<afterFirst>800</afterFirst>
<beforeAnnouncement>300</beforeAnnouncement>
<afterAnnouncement>300</afterAnnouncement>
<afterRegularPhrase>200</afterRegularPhrase>
</silence>
</sgConfig>
TTS Builder Configuration
se_tpb_speechgen2 uses a simple
factory/builder to get hold of TTS implementations. The factory must be
configured properly since it is not able to locate TTS systems on its
own. The configuration consists of sections that are operating system
specific. As subsections, there are language specific sections. Each
language may contain at most one TTS system. At runtime, the
TTS Builder configuration file is validated using relaxng and
schematron, but since a DTD is a compact way of showing a document's
structure, here's one:
<!DOCTYPE ttsbuilder [
<!ELEMENT ttsbuilder (os+)>
<!ELEMENT os (property*, lang*)>
<!ELEMENT property EMPTY>
<!ELEMENT lang (tts)>
<!ELEMENT tts (param+)>
<!ELEMENT param EMPTY>
<!ATTLIST property name CDATA #REQUIRED>
<!ATTLIST property match CDATA #REQUIRED>
<!ATTLIST lang lang CDATA #REQUIRED>
<!ATTLIST tts default (true) #IMPLIED>
<!ATTLIST tts instances CDATA #IMPLIED>
<!ATTLIST param name CDATA #REQUIRED>
<!ATTLIST param value CDATA #REQUIRED>
]>
Besides the rules expressible in a DTD, there are a few
others, asserted using schematron:
- The length of the lang-attribute
value must be 2. This is to follow the ISO-639 2-letter lower-case
standard used in Java.
- lang siblings must not have the same lang-attribute
value.
- For each os, there is at most one
descendant tts with the attribute default="true",
to be used as a fallback.
- For each tts, there must not be two
descendant params with the same value for the
name attribute, except for the value command. For
each name="command" entry, one instance of the
class given by the
name="class" entry will be created, with the
command value and the
other parameters passed in a map; see the fragment after this list.
- For each tts, there is a param
whose name attribute has the value "class".
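For example, the following fragment (the command paths are made up) would create two instances of the class given by the class entry, one per command entry, each receiving the remaining parameters in a map:
<tts>
<param name="class" value="se_tpb_speechgen2.tts.adapters.LocalStreamTTS"/>
<param name="command" value="${transformer_dir}/tts/tts-one.exe"/>
<param name="command" value="${transformer_dir}/tts/tts-two.exe"/>
</tts>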
More information on the per-language TTS engine selection is available
in the separate Speechgen2 multi-language support
documentation.
Configuration of a TTS mainly consists of parameters for a
certain TTS wrapper, such as the Java class name or the command to run a TTS
program. Each TTS system needs its own Java wrapper, and hence their
configurations can differ extensively. The wrapper communicates with the
TTS system of your choice. The properties read from the TTS Builder
Configuration are passed to the TTS Java wrapper constructor together
with some utility functions wrapped together in the class
se_tpb_speechgen2.tts.TTSUtils. The TTSUtils
instance will also look at some configuration
parameters to be able to provide the desired functionality, e.g. regex
filtering, character substitution and so on. TTSUtils will look at the
parameters described below. After that, it is up to the wrapper
to decide what to do with the remaining parameters. This gives a
developer great freedom when it comes to creating a TTS wrapper
and its configuration.
TTSUtils will treat parameters as follows:
- regex - comma-separated URLs
pointing to files containing regular expressions in a
search-and-replace manner.
- xslt - comma-separated URLs
pointing to xslt files to be applied to every sync point xml
fragment.
- year - url pointing to a file
containing regular expressions specific for pronouncing years properly.
- characterSubstitutionTables - a
comma-separated list of absolute file paths to character substitution
tables. If this parameter is present, the program will also look for the
following two:
- characterExcludeFromSubstitution
- name of character set to exclude from substitution.
- characterFallbackStates - what
to do if no mapping is found, the following values are valid:
- fallbackToNonSpacingMarkRemovalTransliteration
- Determines whether a character substitution attempt
should fallback to a transliteration to nonspacing mark removal
(typically disaccentuation) attempt if a replacement text was not found
in user provided substitution table(s).
- fallbackToLatinTransliteration
- Determines whether a character substitution attempt
should fallback to a transliteration to Latin attempt if a replacement
text was not found in user provided substitution table(s).
- fallbackToUCD
- Determines whether a character substitution attempt
should fallback to names in the UCD table if a replacement text is not
found in user provided substitution table(s).
- timeout - Milliseconds to wait for the
tts before throwing an exception.
The Java wrapper can choose whether to care about the rest of the parameters
sent to it, and whether to use the functions in TTSUtils.
An example of the configuration follows:
<?xml version="1.0" encoding="UTF-8"?>
<!-- the Java class parameter must be supplied -->
<!-- ${transformer_dir} variable will be evaluated to the directory where se_tpb_speechgenerator resides. -->
<ttsbuilder>
<!--******************************************************************************
Windows
*******************************************************************************-->
<os>
<!-- all properties must match Java's System.getProperties() properties
(standard regex match) for an os section to be usable by this program. -->
<property name="os.name" match="[Ww]indows.*" />
<lang lang="en">
<!-- since xml:lang determines which tts to use in
this program, provide only one tts per language! -->
<!-- this is configuration for one tts impl. the "default" attribute
should be set to true for one configuration for each os. -->
<tts default="true">
<!-- the Java class name -->
<param name="class" value="se_tpb_speechgen2.tts.adapters.LocalStreamTTS"/>
<!-- the binary SAPI-talking program used for tts conversion -->
<param
name="command"
value="${transformer_dir}/tts/SimpleCommandLineTTS/SimpleCommandLineTTS.exe"/>
<!-- an xml file containing simple search-replace regex rules. -->
<param name="regex" value="${transformer_dir}/regex/general.xml"/>
<!-- xslt applied on each synchpoint -->
<param name="xslt" value="${transformer_dir}/xslt/transform.xsl"/>
<!-- an xml file containing simple search-replace regex rules.
Those rules specifically replace years in digits with text. -->
<param name="yearFilename" value="${transformer_dir}/config/year_en.xml"/>
<!-- SAPI specific parameter: The value will be used to embed the text in
SAPI's xml-like way. This value will result in the following tags
surrounding the input text:
<voice optional="Gender=Male"></voice>
Where the starting point is <voice optional=""></voice>.
More on SAPI xml codes:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/SAPI51sr/Whitepapers/WP_XML_TTS_Tutorial.asp
-->
<param name="sapiVoiceSelection" value="Gender=Male"/>
<!-- An ability to filter characters and replace them with custom strings. -->
<param
name="characterSubstitutionTables"
value="${transformer_dir}/character-translation-table.xml"/>
<!-- What to do if no substitution mapping is found. -->
<param name="characterFallbackStates" value="fallbackToLatinTransliteration"/>
</tts>
</lang>
</os>
<!--******************************************************************************
Linux
*******************************************************************************-->
<os>
<property name="os.name" match="[Ll]inux.*" />
<lang lang="en">
<tts default="true">
<param name="class" value="se_tpb_speechgen2.tts.adapters.LocalStreamTTS"/>
<param name="regex" value="${transformer_dir}/regex/general.xml"/>
<param name="ttsProperties" value="${transformer_dir}/conf/loquendo.xml"/>
<param name="xslt" value="${transformer_dir}/xslt/loquendo-en.xsl"/>
<param name="year" value="${transformer_dir}/regex/year_en.xml"/>
<!-- character substitution choices -->
<param name="characterSubstitutionTables" value="${transformer_dir}/charsubst/character-translation-table.xml"/>
</tts>
</lang>
</os>
</ttsbuilder>
TTS Java wrappers
The transformer comes with two Java TTS wrappers. One is named
LocalStreamTTS and it works in a very simple way. It communicates with
an external TTS program through the standard input and output streams. That
is, it writes to the external program's standard input stream first a
filename, a line break, and then, on one line, the phrase to be
generated and written to that file. The external program
generates the audio, writes it to the file, and then prints "OK" to its
standard output stream. If the external program reads an empty line, it
means it is time to exit. If the program does not print "OK", Narrator will stop.
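Schematically, one round trip with the external program could look like this (the filename and phrase are made up; lines starting with > are written to the program's standard input, the line starting with < is its reply on standard output):
> speechgen0001.wav
> This is the phrase to be synthesized.
< OK
>
The final empty line tells the external program to exit.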
If you need to use a TTS system that cannot be used this way, it is
possible to develop your own TTS Java wrapper. To do so, you
develop a Java class that implements the se_tpb_speechgen2.tts.TTSAdapter interface. The class should have a constructor taking two parameters: the first one an instance of se_tpb_speechgen2.tts.TTSUtils, and the second one a java.util.Map containing the parameters from the configuration file. This lets you use the TTS system -
and possibly the inter-process communication - of your choice. Once you
have set up a proper TTS Builder Configuration, your new TTS wrapper is
ready to run.
Further development
- Refactoring: Instead of letting se_tpb_speechgen2
figure out which elements represent synch points by searching for
certain element structures with text nodes, an attribute should be
present on those elements, making the synch point search trivial.
Identifying synch points should be assigned to only one transformer,
and a possible candidate in the Narrator transformer chain would be se_tpb_syncPointNormalizer.
- RNG/Schematron test the Speech Generator Configuration file
before running.
- Generate audio for non-empty dc:Creator
and dc:Title. Since the fileset creator uses
those elements in the absence of docAuthor and docTitle,
it would be nice to be able to supply audio for them as well.
- Develop a Java wrapper for use with FreeTTS.
Dependencies
May need access to a TTS system, which is not part of the
Daisy Pipeline.
Author
Martin Blomberg, TPB.
Licensing
LGPL