Transformer documentation: no_hks_xhtml2dtbook

Transformer Purpose

To transform a valid XHTML file in the canonical form into a valid DTBook file.

Input Requirements

A note on the two-stage process for creating a DTBook file from an HTML document

Transforming a general HTML document into a DTBook document is likely to be a two-stage process.

  1. The first stage will be to turn the HTML document into a canonical form of an XHTML document. Depending on the state of the HTML document, this will be a process consisting of both automatic and manual processing. Requirements for the canonical form are given below.
  2. The second stage is to create the DTBook document from the canonical XHTML. This will be done in a completely automatic XSLT 2.0 transformation process, normally controlled by the DAISY Pipeline.

The DAISY Consortium does not supply any tools for performing the first stage, as the required process will probably differ among organizations. For the second stage, the DAISY Consortium has developed the no_hks_xhtml2dtbook transformer.

Requirements for the canonical form of the XHTML document

  • The XHTML document must be a valid XHTML Transitional or XHTML Strict document.
  • The document must contain at least one h1 element.
  • The first child element (in the XML sense of the term) of the body element should be an h1 element. Any child element placed before the first h1 will be ignored during transformation and will not be present in the generated DTBook document.
  • All heading elements (h1 to h6) must be child elements (in the XML sense of the term) of the body element.
    This excludes markup such as:
    <body>
        <div class="start-of-book">
            <h1>The title</h1>
            :
            :
            <h1>Content</h1>
            :
            :
        </div>
        <div class="main-stuff">
            <h1>Chapter 1 How it all began</h1>
            :
            :
            <h1>Chapter 2 How it continued</h1>
            :
            :
        </div>
    </body>
        
  • Heading levels must not be skipped. That is, the next heading after an h3 cannot be an h5 heading. It must be an h4 heading, or one of h1, h2 or h3.
  • A heading should not have another heading on the same, or a higher, level as its first following sibling (in the XML sense of the term).
    Thus the following kind of markup should be avoided:
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h3>Another heading on level 3</h3>
    <p>Some more text in a paragraph ....
    :
    
    Note that the following is perfectly okay (and makes sense):
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h4>A subheading on level 4</h4>
    <p>Some more text in a paragraph ....
    :
    
    In the cases where a heading has no relevant following siblings before a heading on the same, or higher, level, a "dummy" paragraph is inserted in the generated DTBook document.
    So the following piece of XHTML code:
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h3>Another heading on level 3</h3>
    <p>Some more text in a paragraph ....
    :
    
    would be transformed into:
    :
    .... some text in a paragraph.</p>
    </level3>
    <level3>
        <h3>A heading on level 3</h3>
        <p class="dummy" />
    </level3>
    <level3>
        <h3>Another heading on level 3</h3>
        <p>Some more text in a paragraph ....
    :
    
  • The br element may not be a child element (in the XML sense of the term) of the body element.
  • The span element may not be a child element (in the XML sense of the term) of the body element, unless the class attribute ...
    • ... starts with the string page-,
    • ... is equal to noteref,
    • ... ends with the string -prodnote, or
    • ... is equal to caption, and the span element is evaluated to be a part of an image group (see Required markup for image groups).
    This excludes markup such as:
    <body>
        <h1>The title</h1>
        <span class="sentence">This is a sentence, and also a child of the body element.</span>
        <span class="sentence">And so is this.</span>
        :
        :
    </body>
        
    Rather, you should use:
    <body>
        <h1>The title</h1>
        <p>
            <span class="sentence">This is a sentence, and also a child of the body element.</span>
            <span class="sentence">And so is this.</span>
        </p>
        :
        :
    </body>
        
  • A span element, with a value for the class attribute starting with the string page-, must have a text content that, when normalized, is suitable to form part of an id attribute value in the DTBook file.
    The following four examples:
    <span class="page-normal">4</span>
    
    <span class="page-normal">
        89
    </span>
    
    <span class="page-front">xiv</span>
    
    <span class="page-special">B-34</span>
    are perfectly okay, and will result in the id values page-4, page-89, page-xiv and page-B-34, respectively, in the DTBook file.
    The markup:
    <span class="page-normal">page 4</span>
    does not comply with this requirement.
  • The div and blockquote elements may not have br or span as child elements (in the XML sense of the term).
  • The div and blockquote elements may not have text content. This excludes markup such as:
    <div>
        This is some text before the picture.
        <img src="fig01.png" alt="Map: Norway" />
        This is some text after the picture.
    </div>
    
    Instead you should use:
    <div>
        <p>This is some text before the picture.</p>
        <img src="fig01.png" alt="Map: Norway" />
        <p>This is some text after the picture.</p>
    </div>
    
    or, perhaps better, skip the div element:
    <p>This is some text before the picture.</p>
    <img src="fig01.png" alt="Map: Norway" />
    <p>This is some text after the picture.</p>
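
Several of the requirements above can be checked mechanically before the transformation is run. The following is a minimal sketch in Python, not part of the transformer: the function name check_canonical, the message texts and the ID_SAFE character set are assumptions made for illustration. It covers only the rules on h1 presence, heading placement, skipped heading levels, br and span as children of body, and page number content. It does not validate against the XHTML DTD, and it accepts a caption span without checking that it belongs to an image group.

```python
import re
import xml.etree.ElementTree as ET

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
# Classes that allow a span to be a child of body. Simplification: "caption"
# is accepted without checking that the span belongs to an image group.
ALLOWED_BODY_SPAN = re.compile(r"page-.*|noteref|.*-prodnote|caption")
# Assumption: characters treated as safe inside a generated id value.
ID_SAFE = re.compile(r"[A-Za-z0-9._-]+")


def local(tag):
    """Return the element name without a namespace part."""
    return tag.rsplit("}", 1)[-1]


def check_canonical(xhtml_text):
    """Return a list of problems found in a well-formed XHTML document."""
    root = ET.fromstring(xhtml_text)
    body = next(el for el in root.iter() if local(el.tag) == "body")
    children = list(body)
    problems = []

    top_headings = [el for el in children if local(el.tag) in HEADINGS]
    if not any(local(el.tag) == "h1" for el in top_headings):
        problems.append("no h1 among the children of body")
    elif local(children[0].tag) != "h1":
        problems.append("first child of body is not h1")

    # A heading found below a child of body is nested too deeply.
    for child in children:
        for el in child.iter():
            name = local(el.tag)
            if el is not child and name in HEADINGS:
                problems.append(f"{name} is not a child of body")

    # A heading may be at most one level deeper than the previous heading.
    previous = 0
    for el in top_headings:
        level = int(local(el.tag)[1])
        if previous and level > previous + 1:
            problems.append(f"h{previous} is followed by h{level}")
        previous = level

    for el in children:
        name = local(el.tag)
        cls = el.get("class", "")
        if name == "br":
            problems.append("br is a child of body")
        elif name == "span" and not ALLOWED_BODY_SPAN.fullmatch(cls):
            problems.append(f"span class={cls!r} is a child of body")

    # Page number content is whitespace-normalized before it is checked.
    for el in body.iter():
        if local(el.tag) == "span" and el.get("class", "").startswith("page-"):
            text = " ".join("".join(el.itertext()).split())
            if not ID_SAFE.fullmatch(text):
                problems.append(f"page number {text!r} is not usable in an id")
    return problems
```

An empty list means that none of the checked rules were violated. Note that the standard library parser used here rejects undeclared named entities such as &nbsp;, so real documents may need a different parser.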
    

It is generally recommended to use markup with a very "flat" structure. In particular, avoid placing block elements inside other block elements, as in the following example:

<p>This is some text before the list.
    <ul>
        <li>The first list item.</li>
        <li>This is the second and last item.</li>
    </ul>
And this is the text after the list.</p>
Proper markup should rather be as follows:
<p>This is some text before the list.</p>
<ul>
    <li>The first list item.</li>
    <li>This is the second and last item.</li>
</ul>
<p>And this is the text after the list.</p>
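
Nesting of this kind can be detected with a short script. The sketch below is only an illustration: the function name nested_blocks and the set of element names treated as blocks are assumptions, and the fragment is expected to be well-formed and free of namespaces.

```python
import xml.etree.ElementTree as ET

# Assumption: element names treated as block elements for this check.
BLOCKS = {"p", "ul", "ol", "dl", "div", "blockquote", "table"}


def nested_blocks(xhtml_fragment):
    """Return (parent, child) name pairs where a block element sits inside a p."""
    root = ET.fromstring(xhtml_fragment)
    found = []
    for parent in root.iter("p"):
        for child in parent:
            if child.tag in BLOCKS:
                found.append((parent.tag, child.tag))
    return found
```

Applied to the first example above, wrapped in a single root element, the function reports the ul inside the p. The second example yields an empty list.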

Parameters

Parameters (tdf)

input
Path to input XHTML file
output
Path to output DTBook file
title
The name of the publication. If no value is supplied, the information is extracted from the original file, if possible.
uid
The unique identifier for the publication. If no value is supplied, the information is extracted from the original file, if possible.
outputCSS
A CSS style sheet to use for the output textual content files.

Extended configurability

For correct transformation from XHTML to DTBook, the following parameters must be given to the transformation style sheet:

uid (default value: [UID])
The unique identifier for the publication.
This parameter should be given a sensible value to be sure that the generated DTBook file has correct meta data. If using DAISY Pipeline to perform the transformation, the user will be offered the opportunity to specify the identifier. If the transformation is used by Pipeline as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, DAISY Pipeline should be able to provide an identifier based on the DAISY 2.02 DTB.
transformationMode (default value: standalone)
Used to define how the style sheet transforms the document. If this parameter is given the value DTBmigration, transformation rules are used that are appropriate for a migration of DAISY 2.02 XHTML content to a Z39.86-2005 DTBook file.
Any other value will result in the use of transformation rules suitable for a generic XHTML to DTBook conversion process.
title (default value: [DTB_TITLE])
The title of the publication.
This parameter should be given a sensible value to be sure that the generated DTBook file has correct meta data. If using DAISY Pipeline to perform the transformation, the user will be offered the opportunity to specify the title. If the transformation is used by Pipeline as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, DAISY Pipeline should be able to provide a title based on the DAISY 2.02 DTB.
cssURI (default value: [cssURI])
The URI to the Cascading Style Sheet (CSS) to be used for the DTBook file.
If this parameter is not specified, no reference will be made to a style sheet.
transferDcMetadata (default value: false)
This parameter is only applicable if the transformation is used as a part of a DAISY 2.02 to DAISY 3.0 DTB migration.
If the parameter is set to true, the transformer will try to transfer appropriate meta data from the DAISY 2.02 NCC file to the generated DTBook file. In this context, appropriate meta data is simply meta data with a name attribute value starting with dc:. Note: the dc:title meta data is not handled through this mechanism, as it is specified with the title parameter.
If this parameter is set to true, the parameter nccURI must be specified.
nccURI (default value: [nccURI])
This parameter is only applicable if the transformation is used as a part of a DAISY 2.02 to DAISY 3.0 DTB migration, and if the parameter transferDcMetadata is set to true.
The parameter is used to specify the URI to the DAISY 2.02 NCC file, in order to facilitate the meta data transfer described above.
If the transformation is used by Pipeline as a part of a DAISY 2.02 DTB to DAISY 3.0 migration, DAISY Pipeline should be able to provide a suitable value for this parameter.
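
When the style sheet is run outside DAISY Pipeline, these parameters have to be passed to the XSLT 2.0 processor directly. The sketch below assumes Saxon as the processor and uses its -s:, -xsl: and -o: command line options followed by name=value parameters. The helper name build_saxon_command, the jar file name, the style sheet path and the example values are hypothetical and must be adapted to the local installation.

```python
def build_saxon_command(source, output, stylesheet, saxon_jar, **params):
    """Build an argument list for running Saxon on the canonical XHTML file."""
    # The documentation requires nccURI whenever transferDcMetadata is true.
    wants_dc = str(params.get("transferDcMetadata", "false")).lower() == "true"
    if wants_dc and "nccURI" not in params:
        raise ValueError("nccURI must be specified when transferDcMetadata is true")

    command = ["java", "-jar", saxon_jar,
               f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        # Style sheet parameters are appended as name=value arguments.
        command.append(f"{name}={value}")
    return command


command = build_saxon_command(
    "book.xhtml", "book.xml",
    stylesheet="xhtml2dtbook.xsl",  # hypothetical style sheet path
    saxon_jar="saxon-he.jar",       # hypothetical jar file name
    uid="no-hks-000123",            # made-up identifier
    title="Town halls in Norway",
    transformationMode="standalone",
)
```

The resulting list can be handed to subprocess.run(command, check=True). Passing the arguments as a list avoids shell quoting problems with titles that contain spaces.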

Output

On success

A DTBook file compliant with the DTBook 2005-2 DTD. The various elements in the XHTML file are handled according to the following information.

XHTML element Generated DTBook element Comments
head/meta head/meta
  • Meta data is only transferred from the XHTML document to the DTBook document if the input parameter transferDcMetadata is set to false (or left unspecified). See the Parameters section for details.
  • The name, content and scheme attributes are copied.
head/title frontmatter/doctitle
body book
  • The id, class, title, dir and xml:lang attributes are copied.
h1 to h6
<levelx>
    <hx>....</hx>
    :
    :
</levelx>
  • The id, title, dir and xml:lang attributes are copied.
  • If the hx element carries a class attribute, it is transferred to the corresponding levelx element.
Specifically for the h1 elements, the following rules apply:
  • If the class attribute equals the string frontmatter, the generated level1 element will be a child of the frontmatter element.
  • If the class attribute equals the string rearmatter, the generated level1 element will be a child of the rearmatter element.
  • If the class attribute has any other value, or is absent, the generated level1 element will be a child of the bodymatter element.
span, with a value for the class attribute equal to sentence. sent
  • The id, title, dir and xml:lang attributes are copied.
span, where the value for the class attribute starts with the string page- pagenum
  • The page attribute gets its value from the class attribute, so that if this attribute has a value equal to the string page-normal, the page attribute gets the value normal. Note that, for the generated DTBook file to be valid, the only other possible values for the XHTML class attribute, are the strings page-front and page-special.
  • The id attribute gets a value constructed by adding the string page- to the normalized content of the span element.
  • The title, dir and xml:lang attributes are copied.
span, where the value for the class attribute ends with the string -prodnote. prodnote
  • The render attribute gets its value from the class attribute, so that if this attribute has a value equal to the string optional-prodnote, the render attribute gets the value optional. Note that, for the generated DTBook file to be valid, the only other possible value for the XHTML class attribute, is the string required-prodnote.
  • The id, title, dir and xml:lang attributes are copied.
If the span element is evaluated to be a part of an image group (see Required markup for image groups), the following rules apply:
  • The generated prodnote element will be a child of an imggroup element.
  • The id attribute will not be copied, but will be generated by the transformation engine.
  • An imgref attribute will get an appropriate value to reference the relevant img element.
span, with a value for the class attribute equal to noteref. noteref
  • The idref attribute for the noteref element gets a value based on the value of the bodyref attribute for the span element.
  • The id, title, dir and xml:lang attributes are copied.
span, with a value for the class attribute equal to caption, and the span element evaluated to be a part of an image group (see Required markup for image groups).
<imggroup>
    <img.../>
    <caption>...</caption>
    :
</imggroup>
  • The title, dir and xml:lang attributes are copied.
  • id and imgref attributes are created, with appropriate values.
span, with no class attribute, or a value for the class attribute different from any of the ones mentioned above. span
  • The id, class, title, dir and xml:lang attributes are copied.
div, with a value for the class attribute equal to notebody. note
  • The id, title, dir and xml:lang attributes are copied.
div, with no class attribute, or a value for the class attribute different from the one mentioned above. div
  • The id, class, title, dir and xml:lang attributes are copied.
img, where the element is evaluated to be a part of an image group.
<imggroup>
    <img.../>
    :
    :
</imggroup>
See section on image groups.
img, where the element is evaluated not to be a part of an image group. img
  • The alt, src, id, class, title, dir and xml:lang attributes are copied.
ol list
  • The type attribute will have the value ol.
  • The id, class, title, dir and xml:lang attributes are copied.
ul list
  • The type attribute will have the value ul.
  • The id, class, title, dir and xml:lang attributes are copied.
table, tr, td, th and col table, tr, td, th and col respectively
  • The rowspan, colspan, valign, id, class, title, dir and xml:lang attributes are copied.
p, blockquote, li, dl, dt, dd, strong, em, sub, sup and br p, blockquote, li, dl, dt, dd, strong, em, sub, sup and br respectively
  • The id, class, title, dir and xml:lang attributes are copied.
  • If the br element is a child (in the XML sense of the term) of the body element, it will not be transformed to a br in the DTBook.
a, where the value of the href attribute does not contain the string .smil#. a
  • The href, id, class, title, dir and xml:lang attributes are copied.
See also References to SMIL files for more information on how a elements are handled during transformation.

Elements other than the ones listed above will result in a comment in the DTBook file.
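
The class-based rules for span and h1 elements in the table above can be summarized in a few lines of Python. This is a sketch of the mapping as described in this document, not the code of the transformer, which is an XSLT style sheet. The function names map_span and matter_for_h1 are hypothetical. The caption case is left out because it depends on the image group context, and the idref handling for noteref is not shown.

```python
def map_span(css_class, text=""):
    """Return (DTBook element name, generated attributes) for an XHTML span."""
    css_class = css_class or ""
    if css_class == "sentence":
        return "sent", {}
    if css_class.startswith("page-"):
        # The id is built from the whitespace-normalized content of the span.
        normalized = " ".join(text.split())
        return "pagenum", {"page": css_class[len("page-"):],
                           "id": "page-" + normalized}
    if css_class.endswith("-prodnote"):
        return "prodnote", {"render": css_class[:-len("-prodnote")]}
    if css_class == "noteref":
        return "noteref", {}
    # Any other span is kept as a span, and its class attribute is copied.
    attributes = {"class": css_class} if css_class else {}
    return "span", attributes


def matter_for_h1(css_class):
    """Return the DTBook element that becomes the parent of the level1 element."""
    if css_class in ("frontmatter", "rearmatter"):
        return css_class
    return "bodymatter"
```

For example, map_span("page-normal", " 89 ") gives a pagenum with page="normal" and id="page-89", which matches the page number examples in the input requirements.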

References to SMIL files

If, and only if, the style sheet input parameter transformationMode is given the value DTBmigration, then for all XHTML elements listed above, the following rule applies:
If the XHTML element has an a element as a child (in the XML sense of the term), and this a element has an href attribute with a value containing the string .smil#, the value of the href attribute is transferred to a smilref attribute for the DTBook element that results from the transformation of the XHTML element. The a element will not be transformed in this process.

So the following piece of XHTML code:

<h2 id="baaw_0007"><a href="baaw0004.smil#baaw_0007">Section 1.1</a></h2>
<span class="page-normal" id="baaw_0008"><a href="baaw0004.smil#baaw_0008">4</a></span>
<span class="page-normal" id="baaw_0009"><a href="baaw0004.smil#baaw_0009">5</a></span>
:
would be transformed into:
<level2>
   <h2 id="baaw_0007" smilref="baaw0004.smil#baaw_0007">Section 1.1</h2>
   <pagenum page="normal" id="page-4" smilref="baaw0004.smil#baaw_0008">4</pagenum>
   <pagenum page="normal" id="page-5" smilref="baaw0004.smil#baaw_0009">5</pagenum>
   :
</level2>
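
The attribute transfer itself can be illustrated with the following Python sketch. It shows only the step that moves the href value to a smilref attribute and removes the a element while keeping its content. The renaming of elements (for example span to pagenum) and the level wrapping are not shown. The function name lift_smilref is hypothetical, and only the first matching a child is handled.

```python
import xml.etree.ElementTree as ET


def lift_smilref(element, transformation_mode="standalone"):
    """Move the href of a child <a> that points into a SMIL file to smilref."""
    if transformation_mode != "DTBmigration":
        return element  # the rule only applies in DTBmigration mode

    children = list(element)
    for index, child in enumerate(children):
        if child.tag == "a" and ".smil#" in child.get("href", ""):
            break
    else:
        return element  # no SMIL reference among the children

    element.set("smilref", child.get("href"))

    # Unwrap the <a>: its text joins whatever precedes it in the parent.
    text = child.text or ""
    if index == 0:
        element.text = (element.text or "") + text
    else:
        children[index - 1].tail = (children[index - 1].tail or "") + text

    # The text after the <a> follows its last child, or the text just moved.
    inner = list(child)
    tail = child.tail or ""
    if inner:
        inner[-1].tail = (inner[-1].tail or "") + tail
    elif index == 0:
        element.text += tail
    else:
        children[index - 1].tail += tail

    element.remove(child)
    for offset, node in enumerate(inner):
        element.insert(index + offset, node)
    return element
```

Calling lift_smilref on the h2 element from the example above, with transformation_mode set to DTBmigration, leaves an h2 with the attributes id and smilref and the text Section 1.1. With the default mode the element is returned unchanged.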

Required markup for image groups

When an img element occurs in the input document, an imggroup element is created, if the img element has one, or both, of the following elements:

  • span element, where the value for the class attribute ends with the string -prodnote.
  • span element, where the value for the class attribute is equal to the string caption.

as the first following sibling(s) (in the XML sense of the term). The img element, and whatever results from transformation of the two elements listed above, will be placed in the imggroup element, and appropriate values for id and imgref attributes are created for all elements in the image group.

As an example, the following markup in the XHTML:

    :
    <h2><a href="smil0026.smil#0001">Town halls in Norway</a></h2>
    <p><a href="smil0026.smil#0002">This is the paragraph before the image, caption and description.</a></p>
    <img id="fig04" src="file04.png" alt="Picture: The Oslo Town Hall" />
    <span class="caption"><a href="smil0026.smil#0003">The town hall in Oslo, located close to the harbor,
        is one of the largest brick buildings in the city.</a></span>
    <span class="optional-prodnote"><a href="smil0026.smil#0004">A photography showing a rather large building with two towers.</a></span>
    <p><a href="smil0026.smil#0005">This is the paragraph after the image, caption and description.</a></p>
    :

will result in the following markup in the generated DTBook file (id and imgref values may differ):

    :
    <level2>
       <h2 smilref="smil0026.smil#0001">Town halls in Norway</h2>
       <p smilref="smil0026.smil#0002">This is the paragraph before the image, caption and description.</p>
       <imggroup id="imggrp-d1e340">
          <img src="file04.png" alt="Picture: The Oslo Town Hall" id="img-d1e340"/>
          <caption imgref="img-d1e340" id="caption-d1e340" smilref="smil0026.smil#0003">The town hall in Oslo,
            located close to the harbor, is one of the largest brick buildings in the city.</caption>
          <prodnote render="optional" imgref="img-d1e340" id="pnote-d1e340"
                    smilref="smil0026.smil#0004">A photography showing a rather large building with two towers.</prodnote>
       </imggroup>
       <p smilref="smil0026.smil#0005">This is the paragraph after the image, caption and description.</p>
       :
    </level2>
    :

However, the following, very similar, markup in the XHTML:

    :
    <h2><a href="smil0026.smil#0001">Town halls in Norway</a></h2>
    <p><a href="smil0026.smil#0002">This is the paragraph before the image, caption and description.</a></p>
    <img id="fig04" src="file04.png" alt="Picture: The Oslo Town Hall" />
    <span class="caption"><a href="smil0026.smil#0003">The town hall in Oslo, located close to the harbor,
        is one of the largest brick buildings in the city.</a></span>
    <span class="prodnote"><a href="smil0026.smil#0004">A photography showing a rather large building with two towers.</a></span>
    <p><a href="smil0026.smil#0005">This is the paragraph after the image, caption and description.</a></p>
    :

will result in the following markup in the generated DTBook file (id and imgref values may differ):

    :
    <level2>
       <h2 smilref="smil0026.smil#0001">Town halls in Norway</h2>
       <p smilref="smil0026.smil#0002">This is the paragraph before the image, caption and description.</p>
       <imggroup id="imggrp-d1e340">
          <img src="file04.png" alt="Picture: The Oslo Town Hall" id="img-d1e340"/>
          <caption imgref="img-d1e340" id="caption-d1e340" smilref="smil0026.smil#0003">The town hall in Oslo,
            located close to the harbor, is one of the largest brick buildings in the city.</caption>
       </imggroup>
       <span class="prodnote" smilref="smil0026.smil#0004">A photography showing a rather large building with two towers.</span>
       <p smilref="smil0026.smil#0005">This is the paragraph after the image, caption and description.</p>
       :
    </level2>
    :

Note that this DTBook markup is in fact invalid. It is left as an exercise to the reader to trace the cause of this error back to the XHTML code.
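
The sibling test that decides whether an image group is created can be sketched in Python as follows. This is one simplified reading of the rule stated at the start of this section, not the XSLT code of the transformer. The function name image_group_members is hypothetical, and the sketch looks at no more than the first two following siblings of each img element.

```python
import xml.etree.ElementTree as ET


def image_group_members(parent):
    """List, for each img child, the following spans that would join its group."""
    groups = []
    children = list(parent)
    for index, child in enumerate(children):
        if child.tag != "img":
            continue
        members = []
        # Only the first following siblings count, and at most two of them.
        for sibling in children[index + 1:index + 3]:
            cls = sibling.get("class", "")
            if sibling.tag == "span" and (cls == "caption" or cls.endswith("-prodnote")):
                members.append(cls)
            else:
                break  # the run of qualifying siblings has ended
        groups.append((child.get("id"), members))
    return groups
```

An img element with an empty member list would not get an imggroup element. Running the function on the two XHTML fragments above, each wrapped in a body element and with the lines containing only a colon removed, is one way to approach the exercise.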

On error

On error, this transformer will send a fatal message, then throw an exception and abort.

Further development

  • Extend the transformation rules to handle more elements in the XHTML document.
  • Provide some mechanism for validation of the canonical XHTML document.

Dependencies

  • XSLT 2.0 processor

Author

  • Markus Gylling (TPB/DC)
  • Per Sennels (Huseby)

Licensing

LGPL