
Pipeline Script: WordML to DTBook
Overview
This script converts Word documents saved as XML from within Word 2003 into DTBook. The purpose is to provide an automatic conversion process from structured Word files into DTBook. The output can be used for further processing by other scripts in the Daisy Pipeline, e.g. to produce a Daisy book.
This documentation covers both the simple script "Word 2003 XML to DTBook" and the advanced script "Word 2003 XML to DTBook (production)". What applies to the simple script also applies to the "Word 2003 XML to XHTML" script.
Input Requirements
This script accepts Word documents saved as XML from within Word 2003 as input. To ensure error-free output, the following restrictions apply.
Text flows
Only use a single flow of text. Most people only ever use one text flow; you would have to put some effort into your layout to break this rule by accident.
Floating objects
Never use floating objects. This applies to images as well as to text and other objects. A floating object is an object that is positioned on a page without reference to the surrounding text. To test whether an object is floating, insert about a page of text on any page preceding the object. If the object stays on the same page and in the same position while the text around it has shifted, it is a floating object.
Footnotes
To create high quality output containing footnotes, use the footnotes feature in Word.
Note: A production facility with knowledge of DTBook markup might benefit more from semi-automatic footnote creation, especially when working with OCR material. Refer to the transformer documentation for further details.
Paragraph styles
The following built-in paragraph styles can be used to structure the document: heading 1, heading 2, heading 3, heading 4, heading 5, heading 6, block text.
The style names given here are in English. The actual names as they appear in Word may differ depending on the language version of Word you use. The localized style names will work as described.
Using styles not defined in this list will not cause an error, but will not enhance the result in any way.
Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.
It is recommended, although not an absolute requirement, that the first heading in a document is a heading 1 and that each following heading never has a level greater than that of the preceding heading plus one. For example, a heading 2 may be followed by a heading 3, but not directly by a heading 4. Not following this recommendation will still produce error-free output, but might cause subsequent scripts that use it to fail.
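The heading recommendation can be checked mechanically once the heading levels are known. The following Python sketch is only an illustration and is not part of the script: the function name `find_heading_problems` is hypothetical, and it assumes the heading levels have already been collected, in document order, as a list of integers from 1 to 6.

```python
def find_heading_problems(levels):
    """Return (position, level, highest_allowed) for every heading that
    breaks the recommendation. Positions are 1-based."""
    problems = []
    previous = 0  # treating "no heading yet" as level 0 forces the first heading to be level 1
    for position, level in enumerate(levels, start=1):
        highest_allowed = previous + 1
        if level > highest_allowed:
            problems.append((position, level, highest_allowed))
        previous = level
    return problems


if __name__ == "__main__":
    # heading 1, heading 3 (skips a level), heading 4, heading 6 (skips a level)
    for position, level, allowed in find_heading_problems([1, 3, 4, 6]):
        print(f"heading #{position}: level {level}, expected {allowed} or lower")
```

Going back up by any number of levels, for example from heading 4 to heading 2, is never reported, because only a jump to a deeper level can skip one.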
Note! Never apply a paragraph style to only part of a paragraph. This is a very common mistake and can be very hard to spot. The most common case is selecting the entire paragraph except the paragraph mark, which looks perfectly fine upon visual inspection. The output will be error-free, but it will not reflect the author's intention. This is not a malfunction of the script, but a design flaw/feature in Word.
Character styles
The following built-in character styles can be used to structure the document: strong, emphasis, page number.
The style names given here are in English. The actual names as they appear in Word may differ depending on the language version of Word you use. The localized style names will work as described.
Using styles not defined in this list will not cause an error, but will not enhance the result in any way.
Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.
Manual formatting
The following manual formatting is preserved: italic, bold, superscript and subscript. Any other formatting applied directly to a group of characters will not enhance the result and should only be used for layout that does not communicate anything important to the reader. If the layout is important to the reader (as it should be), use styles to express it.
Lists
Use list nesting on list styles only (identified by a list icon next to the name in the Styles and Formatting Pane).
Keep list nesting neat by using the same principle that applies to headings: the first list item in a list must not be indented, and following list items must never have a greater indentation than the preceding list item plus one (use tab to indent).
Note! Never use list nesting on paragraph styles with list formatting (identified by a paragraph icon next to the name in the Styles and Formatting Pane). Using tab to indent a paragraph style list will appear correct, but the result will be wrong.
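The indentation rule for lists exists because a nested list needs a parent item to attach to. The Python sketch below illustrates this by turning a flat sequence of list items into a nested structure. It is not the script's actual implementation: the function name `nest_items` and the `(depth, text)` input format, with depth 0 meaning "not indented", are assumptions made for the example.

```python
def nest_items(items):
    """Build a nested list from (depth, text) pairs given in document order.

    Raises ValueError when an item is indented further than the
    preceding item plus one, or when the first item is indented.
    """
    root = []
    stack = [root]  # stack[d] is the list that receives items at depth d
    for position, (depth, text) in enumerate(items, start=1):
        if depth < 0 or depth >= len(stack):
            raise ValueError(
                f"item {position} ({text!r}) is at depth {depth}, "
                f"but the deepest depth allowed here is {len(stack) - 1}"
            )
        node = {"text": text, "children": []}
        stack[depth].append(node)
        del stack[depth + 1:]            # close any deeper lists that were open
        stack.append(node["children"])   # the next item may nest under this one
    return root


if __name__ == "__main__":
    print(nest_items([(0, "Fruit"), (1, "Apple"), (1, "Pear"), (0, "Bread")]))
```

An item that skips a level has no open list at its depth, which is why the sketch rejects it instead of guessing where it belongs.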
Images
All images that are to be part of the result must be embedded in the original document. To ensure that images are embedded, do the following:
- On the Edit menu, click Links. If this item is not selectable, all images are embedded, and no further steps are needed.
- Select all links. To select multiple linked objects, hold down CTRL and click each linked object.
- Click Break Links.
Images can be converted to JPEG by checking the "Convert images to JPEG" checkbox.
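Whether any linked images remain can also be checked in the exported XML file. The Python sketch below rests on an assumption about the Word 2003 XML format that is not stated in this documentation: that every image is referenced by a VML `imagedata` element, and that embedded images use a `src` value starting with `wordml://` while linked images point elsewhere. The function name `find_linked_images` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

VML_NS = "urn:schemas-microsoft-com:vml"

# Hypothetical, heavily reduced sample: one embedded and one linked image.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body>
    <w:p><w:r><w:pict>
      <w:binData w:name="wordml://02000001.jpg">AAAA</w:binData>
      <v:shape><v:imagedata src="wordml://02000001.jpg"/></v:shape>
    </w:pict></w:r></w:p>
    <w:p><w:r><w:pict>
      <v:shape><v:imagedata src="pictures/cover.jpg"/></v:shape>
    </w:pict></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def find_linked_images(wordml_text):
    """Return the src of every image reference that is not embedded.

    Assumption: embedded image data is referenced as "wordml://...".
    """
    root = ET.fromstring(wordml_text)
    linked = []
    for imagedata in root.iter(f"{{{VML_NS}}}imagedata"):
        src = imagedata.get("src", "")
        if not src.startswith("wordml://"):
            linked.append(src)
    return linked


if __name__ == "__main__":
    for src in find_linked_images(SAMPLE):
        print("linked image, break the link in Word first:", src)
```

If the check reports anything, return to the original document in Word and break the links as described above, then export again.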
Word templates
Two document templates are available in the transformer directory. Both include a macro that prepares a document for input into the Pipeline; run it when the document is finished. To run the Pipeline preparation macro:
- Hit ALT+F8
- Select the "PrepareForPipeline" macro
- Click Run
In order to make use of this feature the macro security must be set to "medium" or lower in Word (click Macros/Security... in the Tools menu).
The behaviour is similar regardless of which template you are using:
- A properties dialog is displayed to ensure that the document has a correct title and author. Change the settings and click OK.
- If the document has unsaved changes (e.g. if the properties were changed), you will be asked to confirm that you want your document saved before proceeding. If your document has not been saved before, you will also be asked where to save your document.
- A "save as" dialog will now appear, asking where you want to save your exported document. Do not change the file type; it should be XML.
- A list of preparations is now run on your document.
- The exported document is saved.
- Your original document is re-opened if you want to make further edits. Do not edit the exported document. It has been altered to fit the pipeline process.
Note! This procedure can contain one or two "save as" dialogs in sequence. Pay attention to which dialog you are currently in.
native.dot
This template is designed to be used with the simple script and contains a few basic styles. Focus is on documents that were created in Word.
Page numbers
Documents that are created in Word have page numbering that matches the layout on the screen. Therefore, the macro contained in this template will insert the current page number automatically at the top of each page.
scanned.dot
This template is designed to be used with the advanced script and contains a wider set of styles. Focus is on documents that were imported into Word from another source, e.g. OCR-software or print publishing software. A basic understanding of the DTBook format is highly recommended as manual corrections usually are needed.
Page numbers
Documents that come from a source other than Word never have page numbering that matches the layout on the screen. Therefore, the page breaks of the source format have to be inserted manually using the page number style.
Output
The output of the script is a DTBook document including images.
Configuration
- Input file
- Select input xml file
- Output directory
- Select output directory
- Extract images
- Check box to extract images
- Convert images to JPEG
- Check box to convert all images into JPEG format. Note that if the "Extract images" checkbox is unchecked, this box will have no effect. The conversion is done by an external program called ImageMagick, which must be installed on your system. Refer to the transformer documentation for more information on how to set it up.
- Overwrite existing files
- Check box to overwrite existing files in the output directory
- Append XHTML stylesheet
- Check box to include XHTML stylesheet for display in a browser
- Title
- The title of the publication. If no value is supplied, the information is extracted from the file properties
- Author
- The author of the publication. If no value is supplied, the information is extracted from the file properties
- dtb:uid
- A unique identifier. If no value is supplied, an identifier will be generated.
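How the options above depend on each other can be sketched in Python. This is only an illustration, not the Pipeline's own code: the class `ScriptOptions`, its field and method names, its default values and the `uuid`-based identifier format are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptOptions:
    # Hypothetical names and defaults; the real script may differ.
    input_file: str
    output_dir: str
    extract_images: bool = True
    convert_images_to_jpeg: bool = False
    overwrite: bool = False
    append_stylesheet: bool = False
    title: Optional[str] = None
    author: Optional[str] = None
    uid: Optional[str] = None

    def jpeg_conversion_active(self) -> bool:
        # "Convert images to JPEG" has no effect unless images are extracted.
        return self.extract_images and self.convert_images_to_jpeg

    def resolved_title(self, title_from_file_properties: str) -> str:
        # A supplied title wins; otherwise fall back to the file properties.
        return self.title if self.title else title_from_file_properties

    def resolved_author(self, author_from_file_properties: str) -> str:
        return self.author if self.author else author_from_file_properties

    def resolved_uid(self) -> str:
        # Assumed format: the documentation only says that an identifier is generated.
        return self.uid if self.uid else f"uid-{uuid.uuid4()}"


if __name__ == "__main__":
    options = ScriptOptions("book.xml", "out", extract_images=False,
                            convert_images_to_jpeg=True)
    print(options.jpeg_conversion_active())        # conversion is switched off
    print(options.resolved_title("Title from file properties"))
```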
Appendix: List of Transformers used
The documents linked below are parts of the Transformer technical documentation. They are aimed at developers and system administrators.