******************************************************
* Documentation for the Speech Corpus selection tool *
******************************************************

Anna Hunecke, August 2007

The selection tools consist of three Java programs:

- DatabaseSelector : program for selecting the speech corpus

- FeatureMakerMaryServer :  program for building the text corpus
  from which to select

- SortTestResults : program for sorting the test results according to 
  the four coverage measures


Furthermore, there are two Perl scripts:

- features2sentences.pl : program for converting a list of selected
  feature files to a list of sentence files and their content.

- sentences2features.pl : program for converting the file produced
  by features2sentences.pl into two files, each containing a list of
  feature files. One file lists the feature files that are wanted in
  the final script; the other lists the bad feature files that are to
  be ignored in further selection steps.


There are also three files that can be used for selection:

- covDef.config : file containing the settings for the selection
  algorithm 

- featureDefinition.txt : feature definition file used by selection
  algorithm. Matches the file german-targetfeatures-selection.config.

- german-targetfeatures-selection.config : file for computing the
  features with the FeatureMaker classes. Matches the given feature
  definition file. 

In the following, the usage of the programs is documented in more
detail. 


********************
* DatabaseSelector *
********************

Selects a set of sentences for a speech corpus

*** Usage: ***

java -cp /path/to/mary/java/mary-common.jar 
de.dfki.lt.mary.dbselection.DatabaseSelector 
-basenames  
-featDef  
-stop 

Optional arguments: 
-coverageConfig 
-initFile 
-selectedSentences 
-unwantedSentences 
-vectorsOnDisk
-overallLog 
-selectionDir 
-logCoverageDevelopment
-verbose


*** Arguments: ***

-basenames  : The list of feature files to select from. The file
 either starts with the number of feature files followed by the actual
 list, or it contains just the list. The first form might be
 quicker when there is a large number of files. 
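
 The two accepted layouts of the basename file can be read by one
 routine. The following is a minimal sketch, not the code used by
 DatabaseSelector. The class BasenameListSketch and its methods are
 hypothetical, and it assumes one feature file name per line and that
 no feature file name consists of digits only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the MARY sources.
class BasenameListSketch {

    // Accepts both layouts: a leading count followed by the names,
    // or just the names (assumed: one name per line).
    static List<String> parse(List<String> lines) {
        ArrayList<String> names = new ArrayList<>();
        boolean first = true;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (first) {
                first = false;
                if (line.matches("\\d+")) {
                    // Leading count: used only to size the list up front.
                    names.ensureCapacity(Integer.parseInt(line));
                    continue;
                }
            }
            names.add(line);
        }
        return names;
    }

    // Reads the basename file from disk (UTF-8) and parses it.
    static List<String> read(Path file) throws IOException {
        return parse(Files.readAllLines(file));
    }

    public static void main(String[] args) {
        // Both calls yield the same two names.
        System.out.println(parse(List.of("2", "features1/a.feats", "features1/b.feats")));
        System.out.println(parse(List.of("features1/a.feats", "features1/b.feats")));
    }
}
```

 Knowing the count in advance lets a reader allocate its storage once,
 which is one plausible reason why the first form can be quicker.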

-featDef  : The feature definition for the feature files. It has
 to be consistent with the features. The given feature definition file
 can be used if the file german-targetfeatures-selection.config was
 used for computing the features.

-stop  : which stop criterion to use. 
 There are five stop criteria: 
  - numSentences  : selection stops after n sentences
  - simpleDiphones : selection stops when simple diphone coverage has
    reached maximum 
  - clusteredDiphones : selection stops when clustered diphone
    coverage has reached maximum 
  - simpleProsody : selection stops when simple prosody coverage has
    reached maximum 
  - clusteredProsody : selection stops when clustered prosody coverage
    has reached maximum 
 
 The criteria can be used individually or can be combined. Examples:
  - stop criteria simpleDiphones and simpleProsody: selection stops
    when both criteria are fulfilled
  - stop criteria simpleDiphones and numSentences 300: selection stops
    when simpleDiphones coverage reaches its maximum or the number of
    selected sentences reaches 300. 
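
 The two examples suggest that coverage criteria are combined with
 "and", while numSentences acts as an additional upper limit combined
 with "or". The following sketch encodes that reading. It is an
 interpretation of the examples above, not the DatabaseSelector
 implementation, and the class, enum and parameter names are
 hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration of how the stop criteria might combine.
class StopCriterionSketch {

    enum Coverage { SIMPLE_DIPHONES, CLUSTERED_DIPHONES, SIMPLE_PROSODY, CLUSTERED_PROSODY }

    // requested     : coverage criteria named after -stop
    // maxSentences  : the n of numSentences, or -1 if numSentences was not given
    // atMaximum     : coverage measures that have reached their maximum
    // selectedSoFar : number of sentences selected so far
    static boolean shouldStop(Set<Coverage> requested, int maxSentences,
                              Set<Coverage> atMaximum, int selectedSoFar) {
        // The sentence limit alone is enough to stop.
        boolean limitHit = maxSentences >= 0 && selectedSoFar >= maxSentences;
        // All requested coverage criteria must be fulfilled together.
        boolean coverageDone = !requested.isEmpty() && atMaximum.containsAll(requested);
        return limitHit || coverageDone;
    }

    public static void main(String[] args) {
        Set<Coverage> both = EnumSet.of(Coverage.SIMPLE_DIPHONES, Coverage.SIMPLE_PROSODY);
        // Only one of the two coverage criteria is fulfilled: keep selecting.
        System.out.println(shouldStop(both, -1, EnumSet.of(Coverage.SIMPLE_DIPHONES), 120));
        // simpleDiphones not yet at maximum, but 300 sentences reached: stop.
        System.out.println(shouldStop(EnumSet.of(Coverage.SIMPLE_DIPHONES), 300,
                EnumSet.noneOf(Coverage.class), 300));
    }
}
```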


-coverageConfig  : The config file for the coverage
 definition. Contains the settings for the current pass of the
 algorithm. The default config file is selection/covDef.config. You
 can use the supplied file covDef.config as a template. 

-vectorsOnDisk: If this option is given, the feature vectors are not
 loaded into memory while the program runs. This reduces memory use
 but noticeably slows down the program. 

-initFile  : The file containing the coverage data needed to
 initialise the algorithm. This file is automatically created by the
 program. The default init file is selection/init.bin

-selectedSentences : File containing a list of sentences
 selected in a previous pass of the algorithm. They are added to the
 cover before selection starts. The sentences can be part of the
 basename list.

-unwantedSentences : File containing those sentences that are to
 be removed from the basename list prior to selection.
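
 The effect of these two options on the candidate list can be
 pictured as follows. This is a sketch under assumptions, not the
 DatabaseSelector code: the class and method names are hypothetical,
 and skipping sentences that are already in the cover is an assumption
 that the text above does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of preparing the candidate list.
class SelectionInputSketch {

    // Returns the basenames still open for selection, in their original order.
    // Unwanted sentences are removed, as described for -unwantedSentences.
    // Assumption: sentences from -selectedSentences are already in the cover
    // and are therefore not offered for selection a second time.
    static List<String> candidates(List<String> basenames,
                                   Set<String> selected,
                                   Set<String> unwanted) {
        List<String> result = new ArrayList<>();
        for (String name : basenames) {
            if (!unwanted.contains(name) && !selected.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> basenames = List.of("s1.feats", "s2.feats", "s3.feats", "s4.feats");
        // s1 was selected in an earlier pass, s3 was marked as bad.
        System.out.println(candidates(basenames, Set.of("s1.feats"), Set.of("s3.feats")));
    }
}
```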

-overallLog  : Log file for all runs of the program: date,
 settings and coverage of the current pass are appended to the end of
 the file. This file is needed if you want to analyse your results
 with the ResultAnalyser later on. 

-selectionDir  : the directory where all selection data is
 stored. Standard directory is ./selection

-logCoverageDevelopment : If this option is given, the coverage
 development over time is stored in text format. It can be converted
 into a table/diagram with OpenOffice or similar programs.

-verbose : If this option is given, there will be more output on the
 command line during the run of the program.



**************************
* FeatureMakerMaryServer *
**************************

Takes a list of files containing text. For each file, the text is
divided into sentences and for each sentence, the features are
computed and features and sentences are written to disk. Sentences
with unreliable phonetic transcriptions are sorted out. The result is
a list of feature files that can be used by DatabaseSelector.
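
The sentence-splitting step can be illustrated with the JDK class
java.text.BreakIterator. This is only a sketch of the idea: whether
FeatureMakerMaryServer splits sentences this way is not stated here,
and the class SentenceSplitSketch is hypothetical. Feature computation
and the check of the phonetic transcription need the Mary server and
are not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of dividing a text into sentences.
class SentenceSplitSketch {

    // Splits text at the sentence boundaries found by the JDK's BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints each sentence on its own line.
        for (String s : split("Das ist ein Satz. Hier ist noch einer.", Locale.GERMAN)) {
            System.out.println(s);
        }
    }
}
```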

FeatureMakerMaryServer needs a running Mary server. The most important 
thing is that the target feature file 
german-targetfeatures-selection.config is used for the computation of
the features. This file has to be in the /conf directory of the Mary
installation.   


*** Usage: ***

Startup script for Windows. Save the following lines in .bat
and edit them according to your needs. Start the script from the command line:

@echo off
set MARY_BASE=drive:\path\to\mary

rem The classpath must be on a single line.
set CLASSPATH="%MARY_BASE%\java\mary-common.jar;%MARY_BASE%\java\log4j-1.2.8.jar;%MARY_BASE%\java\mary-german.jar;%MARY_BASE%\java\jsresources.jar"

rem A trailing ^ continues the command on the next line.
java -Xmx512m -cp %CLASSPATH% ^
 "-Djava.endorsed.dirs=%MARY_BASE%\lib\endorsed" ^
 "-Dmary.base=%MARY_BASE%" ^
 de.dfki.lt.mary.dbselection.FeatureMakerMaryServer 


Startup script for Linux:

export MARY_BASE="/path/to/mary"

# The classpath must be on a single line.
export CLASSPATH="$MARY_BASE/java/mary-common.jar:$MARY_BASE/java/log4j-1.2.8.jar:$MARY_BASE/java/mary-german.jar:$MARY_BASE/java/jsresources.jar"

# A trailing backslash continues the command on the next line.
java -classpath "$CLASSPATH" \
 "-Djava.endorsed.dirs=$MARY_BASE/lib/endorsed" \
 "-Dmary.base=$MARY_BASE" \
 de.dfki.lt.mary.dbselection.FeatureMakerMaryServer 



*** Arguments: ***

-textFiles : File containing the list of text files to be
 processed. Default: textFiles.txt

-doneFile : File containing the list of files that have already
 been processed. This file is created automatically during the run of
 the program. Default: done.txt

-featureDir : Directory where the features are stored. Default:
 features1. By default, the corresponding sentence files are stored in
 sentences1. The index of the feature/sentence directory is
 incremented when the feature directory is full. 

-timeOut 



