
Pipeline Script: DTBook Fix
Overview
This script will attempt to repair and tidy a suboptimal
DTBook document. The script is primarily intended to address structural
problems that occur in files that are output from automated conversion
processes, such as the WordML to DTBook Script.
The actual manipulation routines performed are described in the DTBookFix Categories section below.
Input Requirements
A DTBook document. Unusually for the Pipeline, the document need not be
valid.
Note that the manipulations that are performed are heavily dependent on
which version of DTBook is used (2005-1, 2005-2, etc.). If your input
document is of a version that is not supported by all or some of the
manipulation routines, those routines will be disabled, and warnings will
be issued.
Output
Depending on the input document version and your settings, the output
document will have had anything from zero to several different structural
modifications. Note that no guarantees are made that the output will be
valid.
The output document is validated at the end of the process, so watch the
validation messages that are issued at that point.
If you keep encountering DTBook documents with recurring problems that are
not fixed by this Script, please contact the Pipeline development team.
Configuration
- Input file
- Required. Select the input DTBook file.
- Output directory
- Required. Select where to store the output result.
- Active Categories
- Select the type of manipulation to be performed by activating one or
several Categories.
Read more on what is included in each category in the DTBookFix Categories section below.
- Force Execution
- Optional. When checked, DTBookFix will run all
selected categories regardless of the input document's state (by default,
the Repair category is run only if the document is invalid, and the
Tidy category is run only if the document is valid).
- Simplify heading layout
- This is an optional routine within the Tidy
category. Check the box to simplify the level structure by removing
redundant levels.
See further Level cleanup below.
- Tidy inline whitespace
- This is an optional routine within the Tidy category. Check the box to
move leading and trailing whitespace outside of em, strong, sub, sup,
noteref and pagenum elements.
See further Tidy inline whitespace below.
- Fix Character set
- This is an optional routine within the Repair
category. Check the box to attempt to fix an incorrectly stated character
set.
See further Character Set recoder below.
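The default behaviour described under Force Execution can be summarised as a small decision rule. The following Python sketch is illustrative only: the function name `should_run` and the category strings `"repair"` and `"tidy"` are assumptions made for this example and are not part of the Pipeline itself.

```python
def should_run(category, document_is_valid, force=False):
    """Decide whether a selected DTBookFix category is executed.

    Hypothetical helper: names and signature are assumptions that only
    mirror the rule described under Force Execution.
    """
    if force:
        # Force Execution: run every selected category regardless of state.
        return True
    if category == "repair":
        # By default, Repair runs only on invalid documents.
        return not document_is_valid
    if category == "tidy":
        # By default, Tidy runs only on valid documents.
        return document_is_valid
    raise ValueError("unknown category: " + category)
```

For instance, `should_run("tidy", document_is_valid=False)` is false, while the same call with `force=True` is true.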
DTBookFix Categories
This section gives a technical summary of the manipulations that are done
within each DTBookFix manipulation category.
The Repair Category
- Level splitter
- Splits a level into several levels if a certain level1-6 element has
several headings on the same level.
- Level 1-6 repair
- Inserts level1-6 elements where needed to meet the requirements on
proper nesting.
- Illegal heading removal
- Changes an illegal heading (for example, an h3 element inside a level2
element) into a p element. The p element will have the class attribute
value of the original heading element name (e.g. <p class="h3">).
- Flatten redundant nesting
- Removes nested p elements.
- Complete structure
- Adds an empty paragraph if the last element in the level is a
heading
- List repair
- Wraps a list in li when the parent of the list is another list.
- Adds a type attribute if missing (default value is "pl")
- Corrects the depth attribute if it is incorrect
- Removes the enum attribute if the list is not ordered
- Removes the start attribute if the list is not ordered
- Character Set recoder
- This will run a character set detection algorithm on the
input file, disregarding any stated character set, and then recode the
entire file. Only enable this if you have explicit issues with
character display, or when your document is reported to be malformed
with a stated reason such as "invalid byte sequence".
- IDREF repair
- Adds the idref attribute to noteref and annoref elements if missing
- Estimates and gives the idref attribute a value if empty
- Adds a hash mark at the beginning of all idref attributes that
don't contain a hash mark.
- Empty elements remover
- Removes empty/whitespace elements that must have children.
- Page number type repair
- Changes the type attribute of the pagenum element to match the contents
of the element (i.e. the page number value).
Incorrect "normal" page numbers will be changed to "front" if the
content contains roman numerals and the element is located in the
frontmatter of the book. Incorrect "front" page numbers will be changed
to "normal" if the content contains arabic numbers. Otherwise the type
attribute will be changed to "special" if it is incorrect.
- Metadata repair
- Fixes Dublin Core metadata name case errors (e.g. dc:title is changed
to dc:Title)
- Removes unknown Dublin Core metadata (e.g. dc:Hello)
- Adds a dtb:uid from dc:Identifier, if missing
- Adds a dc:Title from the first doctitle element of the book, if missing
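To make two of the Repair routines above more concrete, here is a minimal Python sketch of Illegal heading removal and of the hash-mark step of IDREF repair. It is not the Pipeline's implementation. The function names are hypothetical, the sample markup is assumed to carry no XML namespace, and a heading is treated as illegal only when an hN element sits directly inside a levelM element with a different number, which is a simplification of the real nesting rules.

```python
import re
import xml.etree.ElementTree as ET


def demote_illegal_headings(root):
    """Turn hN directly inside levelM (N != M) into <p class="hN">.

    Simplified, hypothetical rule for illustration only.
    """
    for level in root.iter():
        level_match = re.fullmatch(r"level([1-6])", level.tag)
        if not level_match:
            continue
        for child in level:
            heading_match = re.fullmatch(r"h([1-6])", child.tag)
            if heading_match and heading_match.group(1) != level_match.group(1):
                # Keep the original heading name as the class value.
                child.set("class", child.tag)
                child.tag = "p"


def add_missing_hash(root):
    """Prefix non-empty idref values on noteref/annoref with '#'."""
    for element in root.iter():
        if element.tag in ("noteref", "annoref"):
            idref = element.get("idref")
            # Empty or missing idrefs are left alone in this sketch.
            if idref and "#" not in idref:
                element.set("idref", "#" + idref)


sample = (
    '<level2><h2>A</h2><h3>B</h3>'
    '<p><noteref idref="n1">1</noteref>'
    '<annoref idref="#a1">2</annoref></p></level2>'
)
root = ET.fromstring(sample)
demote_illegal_headings(root)
add_missing_hash(root)
repaired = ET.tostring(root, encoding="unicode")
```

After running the sketch, the h3 element in the sample has become a p element with class "h3", and the noteref idref reads "#n1"; the annoref idref already contained a hash mark and is left unchanged.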
The Tidy Category
- Level cleanup
- Simplifies the level structure by removing redundant levels
(subordinate levels will be moved upwards). Note that the headings of
the affected levels will also change, which will alter the appearance
of the layout.
- Pagenum mover
- Pagenums in headings are placed before the heading
- Pagenums in words are placed after the word.
- Change inline pagenum to block
- Removes otherwise empty p or li around pagenum (except p in td)
- Empty elements remover
- Removes empty/whitespace elements (p, em, strong, sub, sup), unless
required for validity. E.g. an empty p that is preceded by a heading
and followed only by other empty p is not removed.
- Author and Title addition
- Inserts docauthor and doctitle elements to frontmatter using Dublin
Core metadata.
- Tidy inline whitespace
- Moves leading and trailing whitespace outside of em, strong, sub, sup
and pagenum elements. For example:
"this is an<em> example </em>of what<strong> Tidy inline whitespace </strong>does"
will change to:
"this is an <em>example</em> of what <strong>Tidy inline whitespace</strong> does".
This is a requirement for accurate braille rendering.
- Indenter
- Performs a "pretty print" of the XML elements in the document.
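The Tidy inline whitespace routine described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Pipeline's own code: the function names are hypothetical, the markup is assumed to carry no XML namespace, and whitespace is moved verbatim rather than normalised.

```python
import xml.etree.ElementTree as ET

# Inline elements named in the Tidy inline whitespace description.
INLINE = {"em", "strong", "sub", "sup", "pagenum"}


def _move_leading(parent, previous, child):
    """Move leading whitespace of child.text in front of the element."""
    text = child.text or ""
    stripped = text.lstrip()
    lead = text[: len(text) - len(stripped)]
    if not lead:
        return
    child.text = stripped
    if previous is None:
        parent.text = (parent.text or "") + lead
    else:
        previous.tail = (previous.tail or "") + lead


def _move_trailing(child):
    """Move trailing whitespace of the element's content behind it."""
    if len(child):
        # Content ends with the tail of the last nested element.
        last = child[-1]
        text = last.tail or ""
        stripped = text.rstrip()
        last.tail = stripped
    else:
        text = child.text or ""
        stripped = text.rstrip()
        child.text = stripped
    trail = text[len(stripped):]
    if trail:
        child.tail = trail + (child.tail or "")


def tidy_inline_whitespace(root):
    """Apply the whitespace move to every inline element in the tree."""
    for parent in root.iter():
        previous = None
        for child in list(parent):
            if child.tag in INLINE:
                _move_leading(parent, previous, child)
                _move_trailing(child)
            previous = child


source = (
    "<p>this is an<em> example </em>of what"
    "<strong> Tidy inline whitespace </strong>does</p>"
)
root = ET.fromstring(source)
tidy_inline_whitespace(root)
tidied = ET.tostring(root, encoding="unicode")
```

With the sample input, `tidied` holds the paragraph with single spaces placed outside the em and strong elements, matching the example given in the routine's description.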
Appendix: List of Transformers used
The documents linked below are parts of the Transformer technical
documentation. These are developer and systems-administrator centric
documents.