
Pipeline Script: XHTML to DTBook [BETA]
Overview
This script transforms a valid XHTML file in the canonical form into a valid DTBook file.
Input Requirements
A note on the two-stage process for creating a DTBook file from an HTML document
Transforming a general HTML document into a DTBook document is likely to be a two-stage process.
- The first stage will be to turn the HTML document into a canonical form of an XHTML document.
Depending on the state of the HTML document, this stage will involve both automatic and manual processing.
Requirements for the canonical form are given below.
- The second stage is to create the DTBook document from the canonical XHTML.
This will be done in a completely automatic XSLT 2.0 transformation process, normally controlled by the DAISY Pipeline.
The DAISY Consortium does not supply any tools for performing the first stage, because the required process will probably differ between organizations.
For the second stage, the DAISY Consortium has developed the script described in this document.
Requirements for the canonical form of the XHTML document
- The XHTML document must be a valid XHTML Transitional or XHTML Strict document.
- The document must contain at least one h1 element.
- The first child element (in the XML sense of the term) of the body element should be an h1 element.
Any child element placed before the first h1 will be ignored during transformation and will not be present in the generated DTBook document.
- All heading elements (h1 to h6) must be child elements (in the XML sense of the term) of the body element.
This excludes markup such as:
<body>
<div class="start-of-book">
<h1>The title</h1>
:
:
<h1>Content</h1>
:
:
</div>
<div class="main-stuff">
<h1>Chapter 1 How it all began</h1>
:
:
<h1>Chapter 2 How it continued</h1>
:
:
</div>
</body>
- Heading levels must not be skipped. That is, the next heading after an h3 cannot be an h5 heading.
It must be an h4 heading, or one of h1, h2 or h3.
- A heading should not have another heading on the same or a higher level as its first following sibling (in the XML sense of the term).
Thus the following kind of markup should be avoided:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
Note that the following is perfectly okay (and makes sense):
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h4>A subheading on level 4</h4>
<p>Some more text in a paragraph ....
:
Where a heading has no relevant following siblings before the next heading on the same or a higher level, a "dummy" paragraph is inserted in the generated DTBook document.
So the following piece of XHTML code:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
would be transformed into:
:
.... some text in a paragraph.</p>
</level3>
<level3>
<h3>A heading on level 3</h3>
<p class="dummy" />
</level3>
<level3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
- A br element may not be a child element (in the XML sense of the term) of the body element.
- A span element may not be a child element (in the XML sense of the term) of the body element, unless the class attribute ...
- ... starts with the string page-,
- ... is equal to noteref,
- ... ends with the string -prodnote, or
- ... is equal to caption, and the span element is evaluated to be a part of an image group.
This excludes markup such as:
<body>
<h1>The title</h1>
<span class="sentence">This is a sentence,
and also a child of the body element.</span>
<span class="sentence">And so is this.</span>
:
:
</body>
Rather, you should use:
<body>
<h1>The title</h1>
<p>
<span class="sentence">This is a sentence,
and also a child of the body element.</span>
<span class="sentence">And so is this.</span>
</p>
:
:
</body>
- A span element with a class attribute value starting with the string page- must have text content that, when normalized, is suitable to form part of an id attribute value in the DTBook file.
The following four examples:
<span class="page-normal">4</span>
<span class="page-normal">
89
</span>
<span class="page-front">xiv</span>
<span class="page-special">B-34</span>
are perfectly okay, and will result in the id values page-4, page-89, page-xiv and page-B-34, respectively, in the DTBook file.
The markup:
<span class="page-normal">page 4</span>
does not comply with this requirement.
- The div and blockquote elements may not have br or span as child elements (in the XML sense of the term).
- The div and blockquote elements may not have direct text content.
This excludes markup such as:
<div>
This is some text before the picture.
<img src="fig01.png" alt="Map: Norway" />
This is some text after the picture.
</div>
Instead you should use:
<div>
<p>This is some text before the picture.</p>
<img src="fig01.png" alt="Map: Norway" />
<p>This is some text after the picture.</p>
</div>
or, perhaps better, skip the div element:
<p>This is some text before the picture.</p>
<img src="fig01.png" alt="Map: Norway" />
<p>This is some text after the picture.</p>
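Several of the requirements listed above lend themselves to a mechanical check before the script is run. The following Python sketch is not part of the DAISY Pipeline. The names check_headings and page_id are hypothetical, and the sketch assumes that the XHTML can be parsed by Python's standard xml.etree.ElementTree module (so no DOCTYPE-declared entities such as &nbsp;). The "page-" prefix used in page_id is inferred from the four page examples above, not taken from the script's source.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])")


def local_name(element):
    """Return the tag name without an XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def heading_level(element):
    """Return 1-6 for h1-h6 elements, otherwise None."""
    match = HEADING.fullmatch(local_name(element))
    return int(match.group(1)) if match else None


def check_headings(xhtml_text):
    """Return a list of problems found in the heading structure."""
    root = ET.fromstring(xhtml_text)
    body = next((e for e in root.iter() if local_name(e) == "body"), None)
    if body is None:
        return ["no body element"]

    problems = []
    children = list(body)
    direct = {id(child) for child in children}

    # All headings must be direct children of body.
    all_headings = [e for e in body.iter() if heading_level(e) is not None]
    for heading in all_headings:
        if id(heading) not in direct:
            problems.append(f"{local_name(heading)} is not a child of body")

    if not any(heading_level(h) == 1 for h in all_headings):
        problems.append("no h1 element")
    if not children or heading_level(children[0]) != 1:
        problems.append("first child of body is not h1")

    # A heading may be at most one level deeper than the previous heading.
    levels = [heading_level(c) for c in children if heading_level(c) is not None]
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} is followed by h{current}")
    return problems


def page_id(span_text):
    """Normalize whitespace and add the assumed 'page-' prefix."""
    return "page-" + " ".join(span_text.split())


sample = "<html><body><h1>Title</h1><h3>Skipped</h3></body></html>"
print(check_headings(sample))  # ['h1 is followed by h3']
print(page_id("\n89\n"))       # page-89
```

Text such as "page 4" keeps its inner space after whitespace normalization, which is presumably why it is not suitable as part of an id value.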
A very "flat" markup structure is generally recommended.
In particular, avoid placing block elements inside other block elements, as in the following example:
<p>This is some text before the list.
<ul>
<li>The first list item.</li>
<li>This is the second and last item.</li>
</ul>
And this is the text after the list.</p>
Proper markup should rather be as follows:
<p>This is some text before the list.</p>
<ul>
<li>The first list item.</li>
<li>This is the second and last item.</li>
</ul>
<p>And this is the text after the list.</p>
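The div and blockquote text rule and the flat-structure recommendation can be illustrated with a small check. This is a minimal Python sketch under stated assumptions, not a tool supplied by the DAISY Consortium. The function name check_structure is hypothetical, the set of element names treated as block elements is an assumption chosen for illustration, and only nesting inside p is reported, since the examples above still allow p inside div.

```python
import xml.etree.ElementTree as ET

# Assumed, non-exhaustive list of block elements that should not sit inside p.
BLOCK_ELEMENTS = {"p", "ul", "ol", "dl", "div", "blockquote", "table", "pre"}
# Elements that may not carry text of their own.
NO_TEXT_PARENTS = {"div", "blockquote"}


def local_name(element):
    """Return the tag name without an XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def has_direct_text(element):
    """True if the element holds non-whitespace text outside its children."""
    if element.text and element.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in element)


def check_structure(xhtml_text):
    """Return a list of structural problems in a well-formed XHTML fragment."""
    problems = []
    for element in ET.fromstring(xhtml_text).iter():
        name = local_name(element)
        if name in NO_TEXT_PARENTS and has_direct_text(element):
            problems.append(f"{name} has text content")
        if name == "p":
            for child in element:
                if local_name(child) in BLOCK_ELEMENTS:
                    problems.append(
                        f"p contains block element {local_name(child)}"
                    )
    return problems


nested = "<p>Before.<ul><li>One.</li></ul>After.</p>"
print(check_structure(nested))  # ['p contains block element ul']
```

Run against the corrected markup, where the list sits between two separate paragraphs, the same check reports nothing.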
Configuration
- Input XHTML: Required. The input XHTML file to be converted.
- Output DTBook: Required. The output DTBook file to be created.
- Title: Optional. The title of the publication. If no value is supplied, the information is extracted from the source file, if possible.
- dtb:uid: Optional. The publication's unique identifier. If no value is supplied, the information is extracted from the source file, if possible.
- CSS: Optional. The Cascading Style Sheet (CSS) to be referenced from the generated DTBook document.
Output
A DTBook document. The output is validated automatically, so check for error reports; validity is not guaranteed.
Appendix: List of Transformers used
The documents linked below are parts of the Transformer technical documentation. They are aimed at developers and systems administrators.