



  
  Pipeline Script: DTBook Fix
  






  Overview
  Input Requirements
  Output
  Configuration
  Appendix: List of Transformers used



Overview

This script will attempt to repair and tidy a suboptimal
DTBook document. The script is primarily intended to address structural
problems that occur in files that are output from automated conversion
processes, such as the WordML to DTBook Script.

The actual manipulation routines performed are described in the DTBookFix Categories section below.

Input Requirements

A DTBook document. Unlike most Pipeline scripts, this script does not
require the input document to be valid.

Note that the manipulations performed depend heavily on
which version of DTBook is used (2005-1, 2005-2,
etc). If your input document uses a version that some or all of the
manipulation routines do not support, those routines will be
disabled, and warnings will be issued.

Output

Depending on the input document version and your settings, the output
document will have had anything from zero to several different structural
modifications. Note that no guarantees are made that the output will be
valid.

The output document is validated at the end of the process, so watch the
validation messages that are issued at that point.

If you keep encountering DTBook documents with recurring problems that are
not fixed by this Script, please contact the Pipeline development team.

Configuration

  Input file
    Required. Select the input DTBook file.
  Output directory
    Required. Select where to store the output.
  Active Categories
    Select the type of manipulation to be performed by activating one or
      several Categories.

      
       Read more on what is included in each category in the DTBookFix Categories section below.
  Force Execution
    Optional. When checked, DTBookFix will run all
      selected categories regardless of the input document's state (by default,
      the Repair category is run only if the document is invalid, and the
      Tidy category is run only if the document is valid).
  Simplify heading layout
    This is an optional routine within the Tidy
      category. Check the box to simplify the level structure by removing
      redundant levels.

      See further Level cleanup below.
  Tidy inline whitespace
    This is an optional routine within the Tidy category. Check the box to
      move leading and trailing whitespace outside of em, strong, sub, sup,
      noteref and pagenum elements.

      See further Tidy inline whitespace
      below.
  Fix Character set
    This is an optional routine within the Repair
      category. Check the box to attempt to fix an incorrectly stated
      character set.
      See further Character Set recoder
    below.
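
The interaction between Force Execution and the default behaviour can be
summarised in a few lines. The following Python sketch only illustrates the
rule described under Force Execution above; the function name
categories_to_run and its parameters are hypothetical and are not part of
DTBookFix or the Pipeline.

```python
def categories_to_run(selected, document_is_valid, force=False):
    """Return the selected categories that would actually be executed.

    Hypothetical helper that mirrors the documented defaults only;
    it is not taken from the DTBookFix source.
    """
    if force:
        # Force Execution: run everything that was selected.
        return list(selected)

    runnable = []
    for category in selected:
        if category == "Repair" and document_is_valid:
            continue  # by default, Repair runs only on invalid input
        if category == "Tidy" and not document_is_valid:
            continue  # by default, Tidy runs only on valid input
        runnable.append(category)
    return runnable


if __name__ == "__main__":
    print(categories_to_run(["Repair", "Tidy"], document_is_valid=False))
    print(categories_to_run(["Repair", "Tidy"], document_is_valid=False, force=True))
```

Under these assumptions, an invalid document with both categories selected
would only be repaired unless Force Execution is checked.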


DTBookFix Categories

This section gives a technical summary of the manipulations that are done
within each DTBookFix manipulation category.

The Repair Category

  
  Level splitter
    Splits a level into several levels if a level1-6
      element contains several headings of the same level.
  Level 1-6 repair
    Inserts level1-6 elements where needed to meet the
      requirements on proper nesting.
  Illegal heading removal
    Changes an illegal heading (for example, an h3 element
      inside a level2 element) into a p element.
      The p element will have the class attribute value of the
      original heading element name (e.g. <p class="h3">).
    
  Flatten redundant nesting
    Removes nested p elements.
  Complete structure
    Adds an empty paragraph if the last element in the level is a
    heading.
  List repair
    
        Wraps a list in li when the parent of
          the list is another list.
        Adds a type attribute if missing (default value is
          "pl").
        Corrects the depth attribute if it is incorrect.
        Removes the enum attribute if the list is not
        ordered.
        Removes the start attribute if the list is not
          ordered.
      
    
  Character Set recoder
    This will run a character set detection algorithm on the
      input file, disregarding any stated character set, and then recode the
      entire file. Only enable this if you have explicit issues with
      character display, or when your document is reported to be malformed
      for reasons such as "invalid byte sequence".
  IDREF repair
    
        Adds the idref attribute to noteref and
          annoref elements if missing.
        Estimates and assigns a value to the idref attribute if it is
          empty.
        Adds a hash mark at the beginning of all idref attributes that
          don't contain a hash mark.
      
    
  Empty elements remover
    Removes empty/whitespace elements that must have children.
  Page number type repair
    Changes the type attribute of the
      pagenum element to match the contents of the element (i.e.
      the page number value).
      Incorrect "normal" page numbers will be changed to "front" if the
      contents contain Roman numerals and the element is located in the
      frontmatter of the book. Incorrect "front" page numbers will be changed
      to "normal" if the contents contain Arabic numerals. Otherwise the page
      attribute will be changed to "special" if it is incorrect.
    
  Metadata repair
    
        Fixes Dublin Core metadata name case errors (e.g.
          dc:title is changed to dc:Title).
        Removes unknown Dublin Core metadata (e.g. dc:Hello).
        Adds a dtb:uid from dc:Identifier, if
          missing.
        Adds a dc:Title from the first doctitle
          element of the book, if missing.
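
The Page number type repair rules above can be read as a small decision
table. The Python sketch below is one possible reading, not the Transformer's
implementation. It assumes that a "normal" value is correct when the contents
are Arabic digits, that a "front" value is correct when the contents look like
Roman numerals, and that "special" values are left alone. The function name
repaired_page_type and the loose Roman numeral check are hypothetical.

```python
import re

# Assumption: a loose check for Roman numerals (it also accepts ill-formed
# sequences such as "iiii"); the real routine may be stricter.
ROMAN = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)
ARABIC = re.compile(r"^[0-9]+$")


def repaired_page_type(declared, text, in_frontmatter):
    """Return the page type a pagenum would have after repair (sketch)."""
    text = text.strip()
    is_arabic = bool(ARABIC.match(text))
    is_roman = bool(ROMAN.match(text))

    if declared == "normal":
        if is_arabic:
            return "normal"      # already correct
        if is_roman and in_frontmatter:
            return "front"       # Roman numerals in the frontmatter
        return "special"         # incorrect, no better match
    if declared == "front":
        if is_roman:
            return "front"       # assumed correct
        if is_arabic:
            return "normal"      # Arabic digits marked as "front"
        return "special"
    return declared              # assumption: "special" is never changed


if __name__ == "__main__":
    for args in [("normal", "iv", True), ("front", "12", False), ("normal", "A-1", False)]:
        print(args, repaired_page_type(*args))
```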
      
    


The Tidy Category

  Level cleanup
    Simplifies the level structure by removing redundant levels
      (subordinate levels will be moved upwards). Note that the headings of
      the affected levels will also change, which will alter the appearance
      of the layout.
  Pagenum mover
    
        Pagenums in headings are placed before the heading.
        Pagenums in words are placed after the word.
      
    
  Change inline pagenum to block
    Removes an otherwise empty p or li around a pagenum (except a p in a td).
  Empty elements remover
    Removes empty/whitespace elements (p, em, strong, sub, sup), unless
      required for validity. E.g. an empty p that is preceded by a heading
      and followed only by other empty p is not removed.
  Author and Title addition
    Inserts docauthor and doctitle elements into the frontmatter using Dublin
      Core metadata.
  Tidy inline whitespace
    Moves leading and trailing whitespace outside of em, strong, sub, sup
      and pagenum elements. For example: "this is an<em> example
      </em>of what<strong> Tidy inline whitespace
      </strong>does" will change to: "this is an
      <em>example</em> of what <strong>Tidy inline
      whitespace</strong> does". This is a requirement for accurate
      braille rendering.
  Indenter
    Performs a "pretty print" of the XML elements in the document.
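
The effect of Tidy inline whitespace, described above, can be reproduced on a
markup string with a regular expression. This is a minimal Python sketch for
illustration only: the actual routine operates on the XML document, and the
function name tidy_inline_whitespace is hypothetical. The sketch does not
handle an element nested inside another element of the same name.

```python
import re

# Element names taken from the Tidy inline whitespace description.
INLINE = ("em", "strong", "sub", "sup", "pagenum")

_PATTERN = re.compile(
    r"<(?P<tag>%s)(?P<attrs>(?:\s[^>]*)?)>"      # start tag, optional attributes
    r"(?P<lead>\s*)(?P<body>.*?)(?P<trail>\s*)"  # leading ws, content, trailing ws
    r"</(?P=tag)>" % "|".join(INLINE),           # matching end tag
    re.DOTALL,
)


def tidy_inline_whitespace(markup):
    """Move leading/trailing whitespace outside the listed inline elements."""
    def move(match):
        return "%s<%s%s>%s</%s>%s" % (
            match["lead"], match["tag"], match["attrs"],
            match["body"], match["tag"], match["trail"],
        )
    return _PATTERN.sub(move, markup)


if __name__ == "__main__":
    before = ("this is an<em> example </em>of what"
              "<strong> Tidy inline whitespace </strong>does")
    print(tidy_inline_whitespace(before))
```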


Appendix: List of Transformers used

The documents linked below are parts of the Transformer technical
documentation. They are aimed at developers and system administrators.

  se_tpb_dtbookFix