******************************************************
* Documentation for the Speech Corpus selection tool *
******************************************************
Anna Hunecke, August 2007
The selection tools consist of three Java programs:
- DatabaseSelector : program for selecting the speech corpus
- FeatureMakerMaryServer : program for building the text corpus
from which to select
- SortTestResults : program for sorting the test results according to
the four coverage measures
Furthermore, there are two perl scripts:
- features2sentences.pl : program for converting a list of selected
feature files to a list of sentence files and their content.
- sentences2features.pl : program for converting the file produced
by features2sentences.pl into two files containing a list of feature
files. One file is for those feature files that are wanted in the
final script and one file for the bad feature files that are to be
ignored in further selection steps.
There are also three files that can be used for selection:
- covDef.config : file containing the settings for the selection
algorithm
- featureDefinition.txt : feature definition file used by selection
algorithm. Matches the file german-targetfeatures-selection.config.
- german-targetfeatures-selection.config : file for computing the
features with the FeatureMaker classes. Matches the given feature
definition file.
In the following, the usage of the programs is documented in more
detail.
********************
* DatabaseSelector *
********************
Selects a set of sentences for a speech corpus
*** Usage: ***
java -cp /path/to/mary/java/mary-common.jar
de.dfki.lt.mary.dbselection.DatabaseSelector
-basenames
-featDef
-stop
Optional arguments:
-coverageConfig
-initFile
-selectedSentences
-unwantedSentences
-vectorsOnDisk
-overallLog
-selectionDir
-logCoverageDevelopment
-verbose
*** Arguments: ***
-basenames : The list of feature files to select from. The file
either starts with the number of feature files followed by the actual
list, or it contains just the list. The first form may be quicker to
load when there is a large number of files.
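The two layouts of the basenames file can be illustrated with a small
reader. This is only a sketch: the class BasenameListReader and the file
names in main are hypothetical and are not part of the Mary code base,
and it assumes one entry per line. A name consisting only of digits on
the first line would be mistaken for the count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from DatabaseSelector.
final class BasenameListReader {

    // Accepts both layouts: an optional leading count line followed by
    // one feature file name per line, or just the names.
    static List<String> parse(List<String> lines) {
        int start = 0;
        int expected = -1; // -1 means "no count line present"
        if (!lines.isEmpty() && lines.get(0).trim().matches("\\d+")) {
            expected = Integer.parseInt(lines.get(0).trim());
            start = 1;
        }
        // With a known count the list can be sized once up front.
        List<String> names = expected >= 0
                ? new ArrayList<>(expected)
                : new ArrayList<>();
        for (int i = start; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        if (expected >= 0 && expected != names.size()) {
            throw new IllegalArgumentException(
                    "count line says " + expected + " but found " + names.size());
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical file names, first with and then without the count line.
        System.out.println(parse(List.of("2", "features1/a.feats", "features1/b.feats")));
        System.out.println(parse(List.of("features1/a.feats", "features1/b.feats")));
    }
}
```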
-featDef : The feature definition for the feature files. It has
to be consistent with the features. The given feature definition file
can be used if the file german-targetfeatures-selection.config was
used for computing the features.
-stop : which stop criterion to use.
There are five stop criteria:
- numSentences : selection stops after n sentences
- simpleDiphones : selection stops when simple diphone coverage has
reached maximum
- clusteredDiphones : selection stops when clustered diphone
coverage has reached maximum
- simpleProsody : selection stops when simple prosody coverage has
reached maximum
- clusteredProsody : selection stops when clustered prosody coverage
has reached maximum
The criteria can be used individually or can be combined. Examples:
- stop criteria simpleDiphones and simpleProsody: selection stops
when both criteria are fulfilled
- stop criteria simpleDiphones and numSentences 300: selection stops
when simpleDiphones coverage reaches its maximum or the number of
sentences is 300.
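The way the criteria combine in the two examples can be sketched as a
small predicate. This is an assumption-laden illustration, not the
DatabaseSelector implementation: the class StopCriteria, the enum
Coverage and the method shouldStop are hypothetical names, and the rule
"all coverage criteria must be fulfilled, or the sentence limit is
reached" is generalised from the two examples above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the -stop semantics; names are illustrative only.
final class StopCriteria {

    enum Coverage { SIMPLE_DIPHONES, CLUSTERED_DIPHONES, SIMPLE_PROSODY, CLUSTERED_PROSODY }

    private final Set<Coverage> required; // coverage criteria given to -stop
    private final int maxSentences;       // numSentences value, or -1 if not given

    StopCriteria(Set<Coverage> required, int maxSentences) {
        // EnumSet.copyOf rejects an empty non-EnumSet collection, so guard it.
        this.required = required.isEmpty()
                ? EnumSet.noneOf(Coverage.class)
                : EnumSet.copyOf(required);
        this.maxSentences = maxSentences;
    }

    // atMaximum: the coverage measures that have currently reached their maximum.
    boolean shouldStop(int selectedSentences, Set<Coverage> atMaximum) {
        boolean limitReached = maxSentences >= 0 && selectedSentences >= maxSentences;
        boolean coverageDone = !required.isEmpty() && atMaximum.containsAll(required);
        return limitReached || coverageDone;
    }
}
```

With simpleDiphones and simpleProsody as the required set, shouldStop
returns true only once both are at their maximum. With simpleDiphones
and a limit of 300, it returns true as soon as either condition holds.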
-coverageConfig : The config file for the coverage
definition. Contains the settings for the current pass of the
algorithm. Standard config file is selection/covDef.config. You can
use the file covDef.config as a template.
-vectorsOnDisk : If this option is given, the feature vectors are not
loaded into memory while the program runs. This slows the program down
noticeably!
-initFile : The file containing the coverage data needed to
initialise the algorithm. This file is automatically created by the
program. Standard init file is selection/init.bin
-selectedSentences : File containing a list of sentences
selected in a previous pass of the algorithm. They are added to the
cover before selection starts. The sentences can be part of the
basename list.
-unwantedSentences : File containing those sentences that are to
be removed from the basename list prior to selection.
-overallLog : Log file for all runs of the program: date,
settings and coverage of the current pass are appended to the end of
the file. This file is needed if you want to analyse your results
with the ResultAnalyser later on.
-selectionDir : the directory where all selection data is
stored. Standard directory is ./selection
-logCoverageDevelopment : If this option is given, the coverage
development over time is stored in text format. It can be converted
into a table/diagram with OpenOffice or similar programs.
-verbose : If this option is given, there will be more output on the
command line during the run of the program.
**************************
* FeatureMakerMaryServer *
**************************
Takes a list of files containing text. The text of each file is
divided into sentences; for each sentence, the features are computed,
and both the features and the sentence are written to disk. Sentences
with unreliable phonetic transcriptions are discarded. The result is
a list of feature files that can be used by DatabaseSelector.
FeatureMakerMaryServer needs a running Mary server. The most important
thing is that the target feature file
german-targetfeatures-selection.config is used for the computation of
the features. This file has to be in the /conf directory of the Mary
installation.
*** Usage: ***
Startup script for Windows. Save the following lines in a .bat file
and edit them according to your needs. Start the script from the command line:
@echo off
set MARY_BASE=drive:\path\to\mary
set CLASSPATH="%MARY_BASE%\java\mary-common.jar;%MARY_BASE%\java\log4j-1.2.8.jar;%MARY_BASE%\java\mary-german.jar;%MARY_BASE%\java\jsresources.jar"
java -Xmx512m -cp %CLASSPATH% ^
"-Djava.endorsed.dirs=%MARY_BASE%\lib\endorsed" ^
"-Dmary.base=%MARY_BASE%" ^
de.dfki.lt.mary.dbselection.FeatureMakerMaryServer
Startup script for Linux:
export MARY_BASE="/path/to/mary"
export CLASSPATH="$MARY_BASE/java/mary-common.jar:$MARY_BASE/java/log4j-1.2.8.jar:$MARY_BASE/java/mary-german.jar:$MARY_BASE/java/jsresources.jar"
java -classpath "$CLASSPATH" \
-Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \
-Dmary.base="$MARY_BASE" \
de.dfki.lt.mary.dbselection.FeatureMakerMaryServer
*** Arguments: ***
-textFiles : File containing the list of text files to be
processed. Default: textFiles.txt
-doneFile : File containing the list of files that have already
been processed. This file is created automatically during the run of
the program. Default: done.txt
-featureDir : Directory where the features are stored. Default:
features1. By default, the corresponding sentence files are stored in
sentences1. The index of the feature/sentence directory is increased when
the feature dir is full.
-timeOut