Pipeline Script: XHTML to DTBook [BETA]

Overview

This script transforms a valid XHTML file in the canonical form into a valid DTBook file.

Input Requirements

A note on the two-stage process for creating a DTBook file from an HTML document

Transforming a general HTML document into a DTBook document is likely to be a two-stage process.

  1. The first stage turns the HTML document into a canonical form of XHTML. Depending on the state of the HTML document, this stage involves both automatic and manual processing. Requirements for the canonical form are given below.
  2. The second stage creates the DTBook document from the canonical XHTML. This is a fully automatic XSLT 2.0 transformation, normally controlled by the DAISY Pipeline.

The DAISY Consortium does not supply any tools for the first stage, because the required process will probably differ among organizations. For the second stage, the DAISY Consortium has developed the script described in this document.

Requirements for the canonical form of the XHTML document

  • The XHTML document must be a valid XHTML Transitional or XHTML Strict document.
  • The document must contain at least one h1 element.
  • The first child element (in the XML sense of the term) of the body element should be an h1 element. Any child element placed before the first h1 is ignored during transformation and will not be present in the generated DTBook document.
  • All heading elements (h1 to h6) must be child elements (in the XML sense of the term) of the body element.
    This excludes markup such as:
    <body>
        <div class="start-of-book">
            <h1>The title</h1>
            :
            :
            <h1>Content</h1>
            :
            :
        </div>
        <div class="main-stuff">
            <h1>Chapter 1 How it all began</h1>
            :
            :
            <h1>Chapter 2 How it continued</h1>
            :
            :
        </div>
    </body>
        
  • Heading levels must not be skipped. That is, the next heading after an h3 cannot be an h5 heading. It must be an h4 heading, or one of h1, h2 or h3.
  • A heading should not have another heading on the same or a higher level as its first following sibling (in the XML sense of the term).
    Thus the following kind of markup should be avoided:
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h3>Another heading on level 3</h3>
    <p>Some more text in a paragraph ....
    :
    
    Note that the following is perfectly okay (and makes sense):
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h4>A subheading on level 4</h4>
    <p>Some more text in a paragraph ....
    :
    
    Where a heading has no relevant following siblings before the next heading on the same or a higher level, a "dummy" paragraph is inserted in the generated DTBook document.
    So the following piece of XHTML code:
    :
    .... some text in a paragraph.</p>
    <h3>A heading on level 3</h3>
    <h3>Another heading on level 3</h3>
    <p>Some more text in a paragraph ....
    :
    
    would be transformed into:
    :
    .... some text in a paragraph.</p>
    </level3>
    <level3>
        <h3>A heading on level 3</h3>
        <p class="dummy" />
    </level3>
    <level3>
        <h3>Another heading on level 3</h3>
        <p>Some more text in a paragraph ....
    :
    
  • A br element may not be a child element (in the XML sense of the term) of the body element.
  • A span element may not be a child element (in the XML sense of the term) of the body element, unless the class attribute ...
    • ... starts with the string page-,
    • ... is equal to noteref,
    • ... ends with the string -prodnote, or
    • ... is equal to caption, and the span element is evaluated to be part of an image group.
    This excludes markup such as:
    <body>
        <h1>The title</h1>
        <span class="sentence">This is a sentence, 
                and also a child of the body element.</span>
        <span class="sentence">And so is this.</span>
        :
        :
    </body>
        
    Rather, you should use:
    <body>
        <h1>The title</h1>
        <p>
            <span class="sentence">This is a sentence, 
                and also a child of the body element.</span>
            <span class="sentence">And so is this.</span>
        </p>
        :
        :
    </body>
        
  • A span element whose class attribute value starts with the string page- must have text content that, when normalized, is suitable to form part of an id attribute value in the DTBook file.
    The following four examples:
    <span class="page-normal">4</span>
    
    <span class="page-normal">
        89
    </span>
    
    <span class="page-front">xiv</span>
    
    <span class="page-special">B-34</span>
    are perfectly okay, and will result in the id values page-4, page-89, page-xiv and page-B-34, respectively, in the DTBook file.
    The markup:
    <span class="page-normal">page 4</span>
    does not comply with this requirement.
  • The div and blockquote elements may not have br or span as child elements (in the XML sense of the term).
  • The div and blockquote elements may not contain text directly. This excludes markup such as:
    <div>
        This is some text before the picture.
        <img src="fig01.png" alt="Map: Norway" />
        This is some text after the picture.
    </div>
    
    Instead you should use:
    <div>
        <p>This is some text before the picture.</p>
        <img src="fig01.png" alt="Map: Norway" />
        <p>This is some text after the picture.</p>
    </div>
    
    or, perhaps better, skip the div element:
    <p>This is some text before the picture.</p>
    <img src="fig01.png" alt="Map: Norway" />
    <p>This is some text after the picture.</p>
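
    Several of the requirements above can be pre-checked automatically before the script is run. The following is a minimal sketch in Python, not part of the DAISY Pipeline: the function names check_headings and page_id are hypothetical, and the set of characters accepted in a page number is an assumption, since this document only says the text must be "suitable" for an id value. The sketch parses with the standard library, so it assumes a well-formed document without named entities such as &nbsp; that the parser cannot resolve.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")


def local_name(element):
    """Return the tag name without its namespace part."""
    return element.tag.rsplit("}", 1)[-1]


def check_headings(xhtml_text):
    """Return a list of problems found in the heading structure."""
    root = ET.fromstring(xhtml_text)
    body = next((e for e in root.iter() if local_name(e) == "body"), None)
    if body is None:
        return ["no body element"]
    problems = []
    children = list(body)
    # At least one h1 must be a child of body, preferably the first child.
    if not any(local_name(c) == "h1" for c in children):
        problems.append("body has no h1 child")
    elif local_name(children[0]) != "h1":
        problems.append("content before the first h1 will be ignored")
    # Headings nested below a child of body are not allowed.
    for child in children:
        for descendant in child.iter():
            if descendant is not child and HEADING.match(local_name(descendant)):
                problems.append(f"{local_name(descendant)} is not a child of body")
    # A heading may be at most one level deeper than the previous heading.
    previous = 0
    for child in children:
        match = HEADING.match(local_name(child))
        if match:
            level = int(match.group(1))
            if previous and level > previous + 1:
                problems.append(f"h{level} follows h{previous}: level skipped")
            previous = level
    return problems


def page_id(span_text):
    """Return the id for a page-* span, or None if the text is unsuitable."""
    normalized = " ".join(span_text.split())
    # Assumption: only letters, digits, '.', '_' and '-' are suitable.
    if re.fullmatch(r"[A-Za-z0-9._-]+", normalized):
        return "page-" + normalized
    return None
```

    With these definitions, page_id("B-34") gives page-B-34 and page_id("page 4") gives None, matching the examples above. The remaining requirements (for br, span, div and blockquote) could be checked in the same way.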
    

A very "flat" markup structure is generally recommended. In particular, avoid placing block elements inside block elements, as in the following example:

<p>This is some text before the list.
    <ul>
        <li>The first list item.</li>
        <li>This is the second and last item.</li>
    </ul>
And this is the text after the list.</p>
The proper markup is instead as follows:
<p>This is some text before the list.</p>
<ul>
    <li>The first list item.</li>
    <li>This is the second and last item.</li>
</ul>
<p>And this is the text after the list.</p>
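
Nesting of the kind shown in the first example can be detected mechanically. The sketch below is illustrative only and is not part of the DAISY Pipeline. The function name nested_blocks is hypothetical, the set of block element names is a non-exhaustive assumption, and the check is limited to p as the parent element, as in the example above.

```python
import xml.etree.ElementTree as ET

# Assumption: a small, non-exhaustive set of block-level element names.
BLOCK_ELEMENTS = {"p", "ul", "ol", "dl", "div", "blockquote", "table", "pre"}


def local_name(element):
    """Return the tag name without its namespace part."""
    return element.tag.rsplit("}", 1)[-1]


def nested_blocks(xhtml_text):
    """Return (parent, child) name pairs where a block element sits inside a p."""
    root = ET.fromstring(xhtml_text)
    found = []
    for parent in root.iter():
        if local_name(parent) != "p":
            continue
        # Only direct children are inspected; deeper levels are reached
        # when the outer loop visits the nested elements themselves.
        for child in parent:
            if local_name(child) in BLOCK_ELEMENTS:
                found.append((local_name(parent), local_name(child)))
    return found
```

An empty result means no block element was found directly inside a paragraph.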

Configuration

Input XHTML
Required. The input XHTML file to be converted.
Output DTBook
Required. The output DTBook file to be created.
Title
Optional. The title of the publication. If no value is supplied, the information is extracted from the source file, if possible.
dtb:uid
Optional. The publication's unique identifier. If no value is supplied, the information is extracted from the source file, if possible.
CSS
Optional. The Cascading Style Sheet (CSS) to be referenced from the generated DTBook document.
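
The fallback behaviour of the Title and dtb:uid parameters can be pictured as follows. This is a hedged sketch, not the script's actual implementation: this document does not say where in the source file the values are read from, so the use of the title element and of a meta element named dtb:uid are assumptions, and the function name resolve_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without its namespace part."""
    return element.tag.rsplit("}", 1)[-1]


def resolve_metadata(xhtml_text, title=None, uid=None):
    """Return (title, uid), preferring supplied values over the source file."""
    root = ET.fromstring(xhtml_text)
    for element in root.iter():
        name = local_name(element)
        # Assumption: the title is taken from the first non-empty title element.
        if title is None and name == "title" and element.text:
            title = " ".join(element.text.split())
        # Assumption: the identifier is taken from <meta name="dtb:uid" .../>.
        elif uid is None and name == "meta" and element.get("name") == "dtb:uid":
            uid = element.get("content")
    return title, uid
```

A value that is neither supplied nor found in the source stays None, which corresponds to the "if possible" qualification above.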

Output

A DTBook document. The script validates the output automatically, so check for error reports.

Appendix: List of Transformers used

The documents listed below are parts of the Transformer technical documentation. They are aimed at developers and system administrators.

  1. XHTML to DTBook
  2. Validator
