	Pipeline Script: XHTML to DTBook [BETA]
    
	




	Overview
	Input Requirements
		A note on the two stage process for creating a DTBook file from an HTML document
		Requirements for the canonical form of the XHTML document
	Configuration
	Output
	Appendix: List of Transformers used



Overview

This script transforms a valid XHTML file in the canonical form into a valid DTBook file.


Input Requirements
A note on the two stage process for creating a DTBook file from an HTML document
Transforming a general HTML document into a DTBook document is likely to be a two stage process.


    The first stage will be to turn the HTML document into a canonical form of an XHTML document.
    Depending on the state of the HTML document, this will be a process consisting of both automatic and manual processing.
    Requirements for the canonical form are given below.
    The second stage is to create the DTBook document from the canonical XHTML.
    This will be done in a completely automatic XSLT 2.0 transformation process, normally controlled by the DAISY Pipeline.


    The DAISY Consortium does not supply any tools for performing the first stage, because the required process will probably differ between organizations.
    For the second stage, the DAISY Consortium has developed the script described in this document.

Requirements for the canonical form of the XHTML document

    The XHTML document must be a valid XHTML Transitional or XHTML Strict document.
    The document must contain at least one h1 element.
    The first child element (in the XML sense of the term) of the body element should be an h1 element.
        Any child element placed before the first h1 is ignored during transformation and will not be present in the generated DTBook document.
    All heading elements (h1 to h6) must be child elements (in the XML sense of the term) of the
        body element.

        This excludes markup such as:
<body>
    <div class="start-of-book">
        <h1>The title</h1>
        :
        :
        <h1>Content</h1>
        :
        :
    </div>
    <div class="main-stuff">
        <h1>Chapter 1 How it all began</h1>
        :
        :
        <h1>Chapter 2 How it continued</h1>
        :
        :
    </div>
</body>
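
The heading placement rules above can be checked before running the script. The following is a minimal sketch, not part of the Pipeline: the function names are hypothetical, it uses only Python's standard library, and it assumes the input parses as XML without undefined entities.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(tag):
    """Strip a namespace such as {http://www.w3.org/1999/xhtml} from a tag."""
    return tag.rsplit("}", 1)[-1]


def check_heading_placement(xhtml_text):
    """Return a list of problems with heading placement in the body element."""
    root = ET.fromstring(xhtml_text)
    body = next((e for e in root.iter() if local_name(e.tag) == "body"), None)
    if body is None:
        return ["no body element found"]
    problems = []
    children = [local_name(child.tag) for child in body]
    if "h1" not in children:
        problems.append("no h1 element as a child of body")
    elif children[0] != "h1":
        problems.append("content before the first h1 will be ignored")
    # Any heading that is not a direct child of body violates the rule.
    for child in body:
        for desc in child.iter():
            if desc is not child and local_name(desc.tag) in HEADINGS:
                problems.append(
                    f"{local_name(desc.tag)} is nested inside {local_name(child.tag)}"
                )
    return problems
```

An empty list means that none of the three heading placement rules was violated; it does not replace validation against the XHTML DTD.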
    
    
    
        Heading levels must not be skipped. That is, the next heading after an h3 cannot be an h5 heading.
        It must be an h4 heading, or one of h1, h2 or h3.
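
As an illustration of this rule, the sketch below scans a sequence of heading levels in document order and reports every skip. The function name is hypothetical, and treating the position before the first heading as level 0 (so that only h1 may come first) is an assumption based on the h1 requirement above.

```python
def find_skipped_levels(levels):
    """Return (index, previous, current) for every heading that skips a level."""
    skipped = []
    previous = 0  # Assumption: before the first heading, only level 1 is acceptable.
    for index, level in enumerate(levels):
        # Going deeper by more than one level is a skip; going back up is always fine.
        if level > previous + 1:
            skipped.append((index, previous, level))
        previous = level
    return skipped
```

For example, the levels 1, 2, 3, 5 yield one report for the last heading, while 1, 2, 3, 4, 2, 3, 1 yield none.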
    
    A heading should not have another heading on the same or a higher level as its
    first following sibling (in the XML sense of the term).

        Thus the following kind of markup should be avoided:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:

    Note that the following is perfectly okay (and makes sense):
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h4>A subheading on level 4</h4>
<p>Some more text in a paragraph ....
:

    Where a heading has no relevant following siblings before the next heading on the same or a higher level, a "dummy" paragraph is
        inserted in the generated DTBook document.

So the following piece of XHTML code:
:
.... some text in a paragraph.</p>
<h3>A heading on level 3</h3>
<h3>Another heading on level 3</h3>
<p>Some more text in a paragraph ....
:

would be transformed into:
:
.... some text in a paragraph.</p>
</level3>
<level3>
    <h3>A heading on level 3</h3>
    <p class="dummy" />
</level3>
<level3>
    <h3>Another heading on level 3</h3>
    <p>Some more text in a paragraph ....
:
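
To find the places where a dummy paragraph would be generated, one can look for headings that are directly followed by a heading on the same or a higher level. This is a minimal sketch with a hypothetical function name; it takes the element names of the body's children in document order and does not cover a heading that is the last child of body, since the text above does not describe that case.

```python
import re


def headings_needing_dummy(child_names):
    """Return indexes of headings whose next sibling is a heading on the same or a higher level."""
    heading = re.compile(r"h([1-6])$")
    hits = []
    for index in range(len(child_names) - 1):
        current = heading.match(child_names[index])
        following = heading.match(child_names[index + 1])
        # A smaller number means a higher level, so h2 after h3 also counts.
        if current and following and int(following.group(1)) <= int(current.group(1)):
            hits.append(index)
    return hits
```

Applied to the two fragments above, the sequence p, h3, h3, p reports the first h3, and the sequence p, h3, h4, p reports nothing.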

    

    The br element may not be a child element (in the XML sense of the term) of the body element.
    The span element may not be a child element (in the XML sense of the term) of the body element,
        unless the class attribute ...
        
            ... starts with the string page-,
            ... is equal to noteref,
            ... ends with the string -prodnote, or
            ... is equal to caption, and the span element is evaluated to be a part of an image group
            (more details).
        
This excludes markup such as:
<body>
    <h1>The title</h1>
    <span class="sentence">This is a sentence, 
            and also a child of the body element.</span>
    <span class="sentence">And so is this.</span>
    :
    :
</body>
    
    Rather, you should use:
<body>
    <h1>The title</h1>
    <p>
        <span class="sentence">This is a sentence, 
            and also a child of the body element.</span>
        <span class="sentence">And so is this.</span>
    </p>
    :
    :
</body>
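
The exceptions for span elements directly under body can be expressed as a single predicate. This is a sketch under assumptions: the function name and the in_image_group flag are hypothetical, and how the script decides that a span belongs to an image group is not described here, so the caller has to supply that answer.

```python
def span_allowed_in_body(class_value, in_image_group=False):
    """Return True if a span with this class value may be a child of body."""
    if class_value is None:
        return False  # A span without a class attribute matches no exception.
    return (
        class_value.startswith("page-")
        or class_value == "noteref"
        or class_value.endswith("-prodnote")
        # Hypothetical flag: the image group evaluation itself is not modelled.
        or (class_value == "caption" and in_image_group)
    )
```

With this predicate, class="sentence" is rejected, while class="page-normal" and class="noteref" are accepted.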
    
    
    A span element, with a value for the class attribute starting with the string page-,
        must have a text content that, when normalized, is suitable to form part of an id attribute value in the DTBook file.

        The following four examples:
        <span class="page-normal">4</span>

<span class="page-normal">
    89
</span>

<span class="page-front">xiv</span>

<span class="page-special">B-34</span>
        are perfectly okay, and will result in the id values page-4, page-89,
        page-xiv
        and page-B-34, respectively, in the DTBook file.

        The markup:
        <span class="page-normal">page 4</span>
        does not comply with this requirement.
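
The mapping from page span content to id values can be sketched as follows. The function name is hypothetical, and the set of accepted characters (letters, digits, '.', '-' and '_') is an assumption about what is "suitable" in an id; the actual script may accept more or less.

```python
import re


def page_id(span_text):
    """Return the id derived from a page span, or None if the text is unsuitable."""
    # Collapse runs of whitespace and trim both ends.
    normalized = " ".join(span_text.split())
    # Assumption: only these characters are acceptable in the generated id.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", normalized):
        return None
    return "page-" + normalized
```

Under these assumptions the four accepted examples give page-4, page-89, page-xiv and page-B-34, and the text "page 4" is rejected because of the embedded space.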

    
    The div and blockquote elements may not have br
        or span as child elements (in the XML sense of the term).
    The div and blockquote elements may not have text content.
        This excludes markup such as:
<div>
    This is some text before the picture.
    <img src="fig01.png" alt="Map: Norway" />
    This is some text after the picture.
</div>

    Instead you should use:
<div>
    <p>This is some text before the picture.</p>
    <img src="fig01.png" alt="Map: Norway" />
    <p>This is some text after the picture.</p>
</div>

    or, perhaps better, skip the div element:
<p>This is some text before the picture.</p>
<img src="fig01.png" alt="Map: Norway" />
<p>This is some text after the picture.</p>
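
Both rules for div and blockquote can be checked with a short script. The sketch below is not part of the Pipeline; the function names are hypothetical, and the input is assumed to parse as XML.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace such as {http://www.w3.org/1999/xhtml} from a tag."""
    return tag.rsplit("}", 1)[-1]


def check_containers(xhtml_text):
    """Report div and blockquote elements with text content or br/span children."""
    root = ET.fromstring(xhtml_text)
    problems = []
    for element in root.iter():
        name = local_name(element.tag)
        if name not in ("div", "blockquote"):
            continue
        # Direct text is either element.text or the tail that follows a child.
        pieces = [element.text] + [child.tail for child in element]
        if any(piece and piece.strip() for piece in pieces):
            problems.append(f"{name} has text content")
        for child in element:
            if local_name(child.tag) in ("br", "span"):
                problems.append(f"{name} has a {local_name(child.tag)} child")
    return problems
```

Whitespace between child elements is not counted as text content, so the indented form of the corrected example passes.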

    


It is generally recommended to have a markup with a very "flat"
structure. One should especially avoid having block elements inside
block elements, as in the following example:

<p>This is some text before the list.
    <ul>
        <li>The first list item.</li>
        <li>This is the second and last item.</li>
    </ul>
And this is the text after the list.</p>

Proper markup should rather be as follows:
<p>This is some text before the list.</p>
<ul>
    <li>The first list item.</li>
    <li>This is the second and last item.</li>
</ul>
<p>And this is the text after the list.</p>
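
Nested block elements can be listed with a few lines of code. This is a rough sketch: the function names are hypothetical and the set of element names treated as blocks is an assumption. Because nesting is a recommendation rather than a requirement, hits are hints only; a p inside a div, which the requirements above accept, is also reported.

```python
import xml.etree.ElementTree as ET

# Assumption: the elements treated as "block elements" for this check.
BLOCKS = {"p", "div", "blockquote", "ul", "ol", "dl", "table", "pre"}


def local_name(tag):
    """Strip a namespace such as {http://www.w3.org/1999/xhtml} from a tag."""
    return tag.rsplit("}", 1)[-1]


def nested_blocks(xhtml_text):
    """Return (parent, child) name pairs where a block is a direct child of a block."""
    root = ET.fromstring(xhtml_text)
    pairs = []
    for parent in root.iter():
        if local_name(parent.tag) not in BLOCKS:
            continue
        for child in parent:
            if local_name(child.tag) in BLOCKS:
                pairs.append((local_name(parent.tag), local_name(child.tag)))
    return pairs
```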


Configuration

	Input XHTML
	Required. The input XHTML file to be converted.
    Output DTBook
    Required. The output DTBook file to be created.
    Title
    Optional. The title of the publication. If no value is supplied, the information is extracted from the source file, if possible.
    dtb:uid
    Optional. The publication's unique identifier. If no value is supplied, the information is extracted from the source file, if possible.
    CSS
    Optional. The Cascading Style Sheet (CSS) to be referenced from the generated DTBook document.




Output
A DTBook document, which should be valid. The output is validated automatically, so watch for error reports.

Appendix: List of Transformers used
The documents linked below are part of the Transformer technical documentation. They are aimed at developers and system administrators.

	XHTML to DTBook
	Validator