
Transformer documentation: no_hks_xhtml2dtbook
Transformer Purpose
To transform a valid XHTML file in the canonical form into a valid DTBook file.
Input Requirements
A note on the two-stage process for creating a DTBook file from an HTML document
Transforming a general HTML document into a DTBook document is likely to be a two-stage process.
- The first stage will be to turn the HTML document into a canonical form of an XHTML document.
Depending on the state of the HTML document, this will be a process consisting of both automatic and manual processing.
Requirements for the canonical form are given below.
- The second stage is to create the DTBook document from the canonical XHTML.
This will be done in a completely automatic XSLT 2.0 transformation process, normally controlled by the DAISY Pipeline.
The DAISY Consortium does not supply any tools for performing the first stage, because the required process will probably differ among organizations.
For the second stage, the DAISY Consortium has developed the no_hks_xhtml2dtbook transformer.
Requirements for the canonical form of the XHTML document
- The XHTML document must be a valid XHTML Transitional or XHTML Strict document.
- The document must contain at least one h1 element.
- The first child element (in the XML sense of the term) of the body element should be an h1 element.
Any child element placed before the first h1 will be ignored during transformation, and will not be present in the generated DTBook document.
- All heading elements (h1 to h6) must be child elements (in the XML sense of the term) of the body element.
This excludes markup such as:
<body>
<div class="start-of-book">
<h1>The title</h1>
:
:
<h1>Content</h1>
:
:
</div>
<div class="main-stuff">
<h1>Chapter 1 How it all began</h1>
:
:
<h1>Chapter 2 How it continued</h1>
:
:
</div>
</body>
- Heading levels must not be skipped. That is, the next heading after an h3 cannot be an h5 heading.
It must be an h4 heading, or one of h1, h2 or h3.
- A heading should not have another heading on the same, or a higher, level as its first following sibling (in the XML sense of the term).
Thus the following kind of markup should be avoided:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
Note that the following is perfectly okay (and makes sense):
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h4>A subheading on level 4</h4>
<p>Some more text in a paragraph ....
:
In the cases where a heading has no relevant following siblings before a heading on the same, or higher, level, a "dummy" paragraph is
inserted in the generated DTBook document.
So the following piece of XHTML code:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
would be transformed into:
:
.... some text in a paragraph.</p>
</level3>
<level3>
<h3>A heading on level 3</h3>
<p class="dummy" />
</level3>
<level3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:
- A br element may not be a child element (in the XML sense of the term) of the body element.
- A span element may not be a child element (in the XML sense of the term) of the body element, unless the class attribute ...
- ... starts with the string page-,
- ... is equal to noteref,
- ... ends with the string -prodnote, or
- ... is equal to caption, and the span element is evaluated to be a part of an image group (see Required markup for image groups).
This excludes markup such as:
<body>
<h1>The title</h1>
<span class="sentence">This is a sentence, and also a child of the body element.</span>
<span class="sentence">And so is this.</span>
:
:
</body>
Rather, you should use:
<body>
<h1>The title</h1>
<p>
<span class="sentence">This is a sentence, and also a child of the body element.</span>
<span class="sentence">And so is this.</span>
</p>
:
:
</body>
- A span element, with a value for the class attribute starting with the string page-, must have a text content that, when normalized, is suitable to form part of an id attribute value in the DTBook file.
The following four examples:
<span class="page-normal">4</span>
<span class="page-normal">
89
</span>
<span class="page-front">xiv</span>
<span class="page-special">B-34</span>
are perfectly okay, and will result in the id values page-4, page-89, page-xiv and page-B-34, respectively, in the DTBook file.
The markup:
<span class="page-normal">page 4</span>
does not comply with this requirement.
- The div and blockquote elements may not have br or span as child elements (in the XML sense of the term).
- The div and blockquote elements may not have text content.
This excludes markup such as:
<div>
This is some text before the picture.
<img src="fig01.png" alt="Map: Norway" />
This is some text after the picture.
</div>
Instead you should use:
<div>
<p>This is some text before the picture.</p>
<img src="fig01.png" alt="Map: Norway" />
<p>This is some text after the picture.</p>
</div>
or, perhaps better, skip the div element:
<p>This is some text before the picture.</p>
<img src="fig01.png" alt="Map: Norway" />
<p>This is some text after the picture.</p>
It is generally recommended to have a markup with a very "flat"
structure. One should especially avoid having block elements inside
block elements, as in the following example:
<p>This is some text before the list.
<ul>
<li>The first list item.</li>
<li>This is the second and last item.</li>
</ul>
And this is the text after the list.</p>
Proper markup should rather be as follows:
<p>This is some text before the list.</p>
<ul>
<li>The first list item.</li>
<li>This is the second and last item.</li>
</ul>
<p>And this is the text after the list.</p>
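Some of the heading requirements above lend themselves to an automated check. The following is a minimal sketch in Python, not a tool supplied with the transformer. The function name check_headings and the wording of the reported problems are illustrative assumptions. The sketch assumes the input is well-formed XML that the standard library parser can read (for example, without undeclared entities), and it covers only the rules on the presence of h1, the first child of body, the placement of headings, and skipped levels.

```python
import xml.etree.ElementTree as ET

# Map heading element names to their numeric level.
HEADINGS = {f"h{n}": n for n in range(1, 7)}


def local_name(element):
    """Return the tag name without any namespace part."""
    return element.tag.rsplit("}", 1)[-1]


def check_headings(xhtml_text):
    """Return a list of heading-rule problems; the list is empty if none are found."""
    root = ET.fromstring(xhtml_text)
    body = next((e for e in root.iter() if local_name(e) == "body"), None)
    if body is None:
        return ["no body element"]
    problems = []
    children = list(body)
    if not children or local_name(children[0]) != "h1":
        problems.append("first child of body is not h1")
    # Compare headings found anywhere below body with those directly under it.
    direct = [c for c in children if local_name(c) in HEADINGS]
    everywhere = [e for e in body.iter() if local_name(e) in HEADINGS]
    if not any(local_name(h) == "h1" for h in everywhere):
        problems.append("no h1 element")
    if len(everywhere) != len(direct):
        problems.append("heading not a child of body")
    # A heading may go at most one level deeper than the heading before it.
    previous = 0
    for heading in direct:
        level = HEADINGS[local_name(heading)]
        if level > previous + 1:
            problems.append(f"skipped level before {local_name(heading)}")
        previous = level
    return problems
```

An empty result only means that these particular rules were not violated; the remaining requirements (br, span, div and blockquote content) would need separate checks.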
Parameters
Parameters (tdf)
- input: Path to input XHTML file
- output: Path to output DTBook file
- title: The name of the publication. If no value is supplied, the information is extracted from the original file, if possible.
- uid: The unique identifier for the publication. If no value is supplied, the information is extracted from the original file, if possible.
- outputCSS: A CSS to use for output textual content files.
Extended configurability
For correct transformation from XHTML to DTBook, the following parameters must be given to the transformation style sheet:
Parameter name
Default value
Comments
uid
[UID]
The unique identifier for the publication.
This parameter should be given a sensible value to be sure that the generated DTBook file has correct meta data.
If using DAISY Pipeline to perform the transformation, the user will be offered the opportunity to specify the identifier.
If the transformation is used by Pipeline as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, DAISY Pipeline
should be able to provide an identifier based on the DAISY 2.02 DTB.
transformationMode
standalone
Used to define how the style sheet transforms the document. If this parameter is given the value DTBmigration, transformation rules are used that are appropriate for a migration of DAISY 2.02 XHTML content to a Z39.86-2005 DTBook file.
Any other value results in transformation rules suitable for a generic XHTML to DTBook conversion.
title
[DTB_TITLE]
The title of the publication.
This parameter should be given a sensible value to be sure that the generated DTBook file has correct meta data.
If using DAISY Pipeline to perform the transformation, the user will be offered the opportunity to specify the title.
If the transformation is used by Pipeline as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, DAISY Pipeline
should be able to provide a title based on the DAISY 2.02 DTB.
cssURI
[cssURI]
The URI to the Cascading Style Sheet (CSS) to be used for the DTBook file.
If this parameter is not specified, no reference will be made to a style sheet.
transferDcMetadata
false
This parameter is only applicable if the transformation is used as a part of a DAISY 2.02 to DAISY 3.0 DTB migration.
If the parameter is set to true, the transformer will try to transfer appropriate meta data from the DAISY 2.02 NCC file to the generated DTBook file. In this context, appropriate meta data is simply meta data with a name attribute value starting with dc:.
Note: the dc:title meta data is not handled through this mechanism, as it is specified with the title parameter.
If this parameter is set to true, the parameter nccURI must be specified.
nccURI
[nccURI]
This parameter is only applicable if the transformation is used as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, and if the parameter transferDcMetadata is set to true.
The parameter is used to specify the URI to the DAISY 2.02 NCC file, in order to facilitate meta data transferring as described above.
If the transformation is used by Pipeline as a part of a DAISY 2.02 DTB to DAISY 3.0 migration, DAISY Pipeline
should be able to provide a suitable value for this parameter.
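When the style sheet is run outside DAISY Pipeline, the parameters above have to be passed to the XSLT 2.0 processor by hand. The sketch below assumes Saxon is used as the processor and builds a command line in its name=value parameter syntax. The function name build_saxon_command, the jar name saxon-he.jar, and the style sheet path passed in by the caller are hypothetical; only the parameter names and the rule that nccURI is required when transferDcMetadata is true come from the table above.

```python
def build_saxon_command(stylesheet, source, output, uid, title,
                        transformation_mode="standalone", css_uri=None,
                        transfer_dc_metadata=False, ncc_uri=None,
                        saxon_jar="saxon-he.jar"):
    """Build an argument list for running the style sheet with Saxon.

    The jar name and file paths are placeholders chosen by the caller.
    """
    # Rule from the parameter table: nccURI is mandatory with transferDcMetadata.
    if transfer_dc_metadata and not ncc_uri:
        raise ValueError("nccURI must be specified when transferDcMetadata is true")
    command = [
        "java", "-jar", saxon_jar,
        f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}",
        f"uid={uid}", f"title={title}",
        f"transformationMode={transformation_mode}",
    ]
    # cssURI is optional; without it no style sheet reference is generated.
    if css_uri:
        command.append(f"cssURI={css_uri}")
    if transfer_dc_metadata:
        command += ["transferDcMetadata=true", f"nccURI={ncc_uri}"]
    return command
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems for titles that contain spaces.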
Output
On success
A DTBook file compliant with the DTBook 2005-2 DTD. The various elements in the XHTML file are handled according to the following information.
XHTML element
Generated DTBook element
Comments
head/meta
head/meta
- Meta data is only transferred from the XHTML document to the DTBook document if the input parameter transferDcMetadata is set to false (or left unspecified). See Extended configurability for information about parameters.
- The name, content and scheme attributes are copied.
head/title
frontmatter/doctitle
body
book
- The id, class, title, dir and xml:lang attributes are copied.
h1 to h6
<levelx>
<hx>....</hx>
:
:
</levelx>
- The id, title, dir and xml:lang attributes are copied.
- If the hx element carries a class attribute, it is transferred to the corresponding levelx element.
Specifically for the h1 elements, the following rules apply:
- If the class attribute equals the string frontmatter, the generated level1 element will be a child of the frontmatter element.
- If the class attribute equals the string rearmatter, the generated level1 element will be a child of the rearmatter element.
- If the class attribute has any other value, or is absent, the generated level1 element will be a child of the bodymatter element.
span, with a value for the class attribute equal to sentence.
sent
- The id, title, dir and xml:lang attributes are copied.
span, where the value for the class attribute starts with the string page-
pagenum
- The page attribute gets its value from the class attribute, so that if this attribute has a value equal to the string page-normal, the page attribute gets the value normal.
Note that, for the generated DTBook file to be valid, the only other possible values for the XHTML class attribute are the strings page-front and page-special.
- The id attribute gets a value constructed by adding the string page- to the normalized content of the span element.
- The title, dir and xml:lang attributes are copied.
span, where the value for the class attribute ends with the string -prodnote.
prodnote
- The render attribute gets its value from the class attribute, so that if this attribute has a value equal to the string optional-prodnote, the render attribute gets the value optional.
Note that, for the generated DTBook file to be valid, the only other possible value for the XHTML class attribute is the string required-prodnote.
- The id, title, dir and xml:lang attributes are copied.
If the span element is evaluated to be a part of an image group (see Required markup for image groups), the following rules apply:
- The generated prodnote element will be a child of an imggroup element.
- The id attribute will not be copied, but will be generated by the transformation engine.
- An imgref attribute will get an appropriate value to reference the relevant img element.
span, with a value for the class attribute equal to noteref.
noteref
- The idref attribute for the noteref element gets a value based on the value of the bodyref attribute for the span element.
- The id, title, dir and xml:lang attributes are copied.
span, with a value for the class attribute equal to caption, and the span element evaluated to be a part of an image group (see Required markup for image groups).
<imggroup>
<img.../>
<caption>...</caption>
:
</imggroup>
- The title, dir and xml:lang attributes are copied.
- id and imgref attributes are created, with appropriate values.
span, with no class attribute, or a value for the class attribute different from any of the ones mentioned above.
span
- The id, class, title, dir and xml:lang attributes are copied.
div, with a value for the class attribute equal to notebody.
note
- The id, title, dir and xml:lang attributes are copied.
div, with no class attribute, or a value for the class attribute different from the one mentioned above.
div
- The id, class, title, dir and xml:lang attributes are copied.
img, where the element is evaluated to be a part of an image group.
<imggroup>
<img.../>
:
:
</imggroup>
See Required markup for image groups.
img, where the element is evaluated not to be a part of an image group.
img
- The alt, src, id, class, title, dir and xml:lang attributes are copied.
ol
list
- The type attribute will have the value ol.
- The id, class, title, dir and xml:lang attributes are copied.
ul
list
- The type attribute will have the value ul.
- The id, class, title, dir and xml:lang attributes are copied.
table, tr, td, th and col
table, tr, td, th and col respectively
- The rowspan, colspan, valign, id, class, title, dir and xml:lang attributes are copied.
p, blockquote, li, dl, dt, dd, strong, em, sub, sup and br
p, blockquote, li, dl, dt, dd, strong, em, sub, sup and br respectively
- The id, class, title, dir and xml:lang attributes are copied.
- If the br element is a child (in the XML sense of the term) of the body element, it will not be transformed to a br in the DTBook.
a, where the value of the href attribute does not contain the string .smil#.
a
- The href, id, class, title, dir and xml:lang attributes are copied.
See also References to SMIL files for more information on how a elements are handled during transformation.
Other elements than the ones listed above will result in a comment in the DTBook file.
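The span rows of the table above amount to a small dispatch on the class attribute. The following Python sketch restates those rows as a function; it is not taken from the style sheet. The name map_span and the order in which the class tests are tried are assumptions, and the caption and image group cases are left out. The whitespace normalization is an approximation of XPath normalize-space().

```python
def map_span(class_value, text=""):
    """Return (DTBook element name, derived attributes) for an XHTML span."""
    cls = class_value or ""
    if cls == "sentence":
        return "sent", {}
    if cls.startswith("page-"):
        # Collapse runs of whitespace and trim, then build the id from the content.
        normalized = " ".join(text.split())
        return "pagenum", {"page": cls[len("page-"):], "id": "page-" + normalized}
    if cls.endswith("-prodnote"):
        return "prodnote", {"render": cls[:-len("-prodnote")]}
    if cls == "noteref":
        return "noteref", {}
    # Any other class value, or none at all, stays a span and keeps its class.
    return "span", ({"class": cls} if cls else {})
```

For example, map_span("page-normal", "\n89\n") yields a pagenum with page set to normal and id set to page-89, matching the page number examples in the input requirements.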
References to SMIL files
If, and only if, the style sheet input parameter transformationMode is given the value DTBmigration, then for all XHTML elements listed above, the following rule applies:
If the XHTML element has an a element as a child (in the XML sense of the term), and this a element has an href attribute with a value containing the string .smil#, the value of the href attribute is transferred to a smilref attribute for the DTBook element that results from the transformation of the XHTML element.
The a element will not be transformed in this process.
So the following piece of XHTML code:
<h2 id="baaw_0007"><a href="baaw0004.smil#baaw_0007">Section 1.1</a></h2>
<span class="page-normal" id="baaw_0008"><a href="baaw0004.smil#baaw_0008">4</a></span>
<span class="page-normal" id="baaw_0009"><a href="baaw0004.smil#baaw_0009">5</a></span>
:
would be transformed into:
<level2>
<h2 id="baaw_0007" smilref="baaw0004.smil#baaw_0007">Section 1.1</h2>
<pagenum page="normal" id="page-4" smilref="baaw0004.smil#baaw_0008">4</pagenum>
<pagenum page="normal" id="page-5" smilref="baaw0004.smil#baaw_0009">5</pagenum>
:
</level2>
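The smilref rule can be illustrated on a single element. The following Python sketch is an approximation written for this documentation, not the style sheet's own logic. The function name apply_smilref is hypothetical, and the sketch makes two simplifying assumptions: the a element is the only child of its parent, and it contains plain text only, as in the example above.

```python
import xml.etree.ElementTree as ET


def apply_smilref(element, transformation_mode):
    """Move a SMIL link from a sole child <a> to a smilref attribute on element."""
    if transformation_mode != "DTBmigration":
        return element  # the rule only applies in migration mode
    children = list(element)
    if len(children) != 1:
        return element
    anchor = children[0]
    href = anchor.get("href", "")
    # Simplification: only handle an <a> holding plain text and no child elements.
    if anchor.tag != "a" or ".smil#" not in href or len(anchor) > 0:
        return element
    element.set("smilref", href)
    # Keep the link text as the element's own text, then drop the <a> itself.
    element.text = (element.text or "") + (anchor.text or "") + (anchor.tail or "")
    element.remove(anchor)
    return element
```

Applied to the h2 element from the example with transformation_mode set to DTBmigration, this leaves an h2 with a smilref attribute and the text Section 1.1. With any other mode the element is returned unchanged.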
Required markup for image groups
When an img element occurs in the input document, an imggroup element is created if the img element has one, or both, of the following elements as its first following sibling(s) (in the XML sense of the term):
- a span element, where the value for the class attribute ends with the string -prodnote.
- a span element, where the value for the class attribute is equal to the string caption.
The img element, and whatever results from transformation of the two elements listed above, will be placed in the imggroup element, and appropriate values for id and imgref attributes are created for all elements in the image group.
As an example, the following markup in the XHTML:
:
<h2><a href="smil0026.smil#0001">Town halls in Norway</a></h2>
<p><a href="smil0026.smil#0002">This is the paragraph before the image, caption and description.</a></p>
<img id="fig04" src="file04.png" alt="Picture: The Oslo Town Hall" />
<span class="caption"><a href="smil0026.smil#0003">The town hall in Oslo, located close to the harbor,
is one of the largest brick buildings in the city.</a></span>
<span class="optional-prodnote"><a href="smil0026.smil#0004">A photography showing a rather large building with two towers.</a></span>
<p><a href="smil0026.smil#0005">This is the paragraph after the image, caption and description.</a></p>
:
will result in the following markup in the generated DTBook file (id and imgref values may differ):
:
<level2>
<h2 smilref="smil0026.smil#0001">Town halls in Norway</h2>
<p smilref="smil0026.smil#0002">This is the paragraph before the image, caption and description.</p>
<imggroup id="imggrp-d1e340">
<img src="file04.png" alt="Picture: The Oslo Town Hall" id="img-d1e340"/>
<caption imgref="img-d1e340" id="caption-d1e340" smilref="smil0026.smil#0003">The town hall in Oslo,
located close to the harbor, is one of the largest brick buildings in the city.</caption>
<prodnote render="optional" imgref="img-d1e340" id="pnote-d1e340"
smilref="smil0026.smil#0004">A photography showing a rather large building with two towers.</prodnote>
</imggroup>
<p smilref="smil0026.smil#0005">This is the paragraph after the image, caption and description.</p>
:
</level2>
:
However, the following, very similar, markup in the XHTML:
:
<h2><a href="smil0026.smil#0001">Town halls in Norway</a></h2>
<p><a href="smil0026.smil#0002">This is the paragraph before the image, caption and description.</a></p>
<img id="fig04" src="file04.png" alt="Picture: The Oslo Town Hall" />
<span class="caption"><a href="smil0026.smil#0003">The town hall in Oslo, located close to the harbor,
is one of the largest brick buildings in the city.</a></span>
<span class="prodnote"><a href="smil0026.smil#0004">A photography showing a rather large building with two towers.</a></span>
<p><a href="smil0026.smil#0005">This is the paragraph after the image, caption and description.</a></p>
:
will result in the following markup in the generated DTBook file (id and imgref values may differ):
:
<level2>
<h2 smilref="smil0026.smil#0001">Town halls in Norway</h2>
<p smilref="smil0026.smil#0002">This is the paragraph before the image, caption and description.</p>
<imggroup id="imggrp-d1e340">
<img src="file04.png" alt="Picture: The Oslo Town Hall" id="img-d1e340"/>
<caption imgref="img-d1e340" id="caption-d1e340" smilref="smil0026.smil#0003">The town hall in Oslo,
located close to the harbor, is one of the largest brick buildings in the city.</caption>
</imggroup>
<span class="prodnote" smilref="smil0026.smil#0004">A photography showing a rather large building with two towers.</span>
<p smilref="smil0026.smil#0005">This is the paragraph after the image, caption and description.</p>
:
</level2>
:
Note that this DTBook markup is in fact invalid. It is left as an exercise to the reader to trace the cause of this error back to the XHTML code.
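The sibling test that decides image group membership can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the transformer's implementation. The function names are hypothetical, at most two following siblings are considered, and any text between the elements is ignored.

```python
import xml.etree.ElementTree as ET


def is_group_member(element):
    """True for a span whose class is caption or ends with -prodnote."""
    if element.tag != "span":
        return False
    cls = element.get("class", "")
    return cls == "caption" or cls.endswith("-prodnote")


def image_group_members(parent, img):
    """Return the spans directly following img that would join its image group."""
    siblings = list(parent)
    start = siblings.index(img) + 1
    members = []
    # Look at the first one or two following siblings; stop at the first non-match.
    for sibling in siblings[start:start + 2]:
        if not is_group_member(sibling):
            break
        members.append(sibling)
    return members
```

An img with an empty result would be transformed as a plain img. Note that the suffix test is strict: a class value must end with -prodnote, including the hyphen, to qualify.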
On error
On error, this transformer will send a fatal message, then throw an exception and abort.
Further development
- Extend the transformation rules to handle more elements in the XHTML document.
- Provide some mechanism for validation of the canonical XHTML document.
Dependencies
- XSLT 2.0 processor
Author
- Markus Gylling (TPB/DC)
- Per Sennels (Huseby)
Licensing
LGPL