Pipeline Script: WordML to DTBook

Overview

This script converts Word documents saved as XML from within Word 2003 into DTBook. The purpose is to provide an automatic conversion process from structured Word files into DTBook. The output can be used for further processing by other scripts in the Daisy Pipeline, e.g. to produce a Daisy book.

This documentation covers both the simple script "Word 2003 XML to DTBook" and the advanced script "Word 2003 XML to DTBook (production)". What applies to the simple script also applies to the "Word 2003 XML to XHTML" script.

Input Requirements

This script accepts Word documents saved as XML from within Word 2003 as input. To ensure that the output is error free, the following restrictions apply.
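Before running the script, it can help to confirm that a file really was saved as Word 2003 XML rather than as a .doc file or another XML flavour. The following is a minimal sketch, not part of the Pipeline: it assumes that Word 2003 XML files have a `wordDocument` root element in the namespace shown, and the function name `is_word2003_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for Word 2003 XML (WordprocessingML 2003).
WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def is_word2003_xml(path):
    """Return True if the root element of the file is w:wordDocument."""
    try:
        with open(path, "rb") as handle:
            # The first "start" event is the root element, so the rest of
            # the document never has to be parsed.
            for _event, element in ET.iterparse(handle, events=("start",)):
                return element.tag == "{%s}wordDocument" % WORDML_NS
    except (ET.ParseError, OSError):
        # Not well-formed XML, or the file could not be read.
        return False
    return False
```

A file that fails this check was most likely saved in another format and needs to be re-saved from Word 2003 with the XML Document file type.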

Text flows

Only use a single flow of text. Most people use only one text flow; you would have to put some effort into your layout to break this rule by accident.

Floating objects

Never use floating objects. This applies to images as well as to text and other objects. A floating object is an object that is positioned on a page without reference to the surrounding text. To test whether an object is floating, insert about a page of text on any page preceding the object. If the object stays on the same page and in the same position while the text around it changes, it is a floating object.

Footnotes

To create high quality output containing footnotes, use the footnotes feature in Word.

Note: A production facility with knowledge of DTBook markup might benefit more from semi-automatic footnote creation, especially when working with OCR material. Refer to the transformer documentation for further details.

Paragraph styles

The following built-in paragraph styles can be used to structure the document: heading 1, heading 2, heading 3, heading 4, heading 5, heading 6, block text.

The style names given here are in English. The actual names as they appear in Word may differ depending on which language version of Word you use. The localized style names will work as described.

Using styles not defined in this list will not cause an error, but will not enhance the result in any way.

Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.
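To see which paragraph styles a document actually uses, the exported XML can be inspected directly. The sketch below is an illustration only and not how the script itself works. It assumes the usual Word 2003 XML layout, in which a paragraph refers to its style through `w:pPr/w:pStyle` and the style definitions under `w:styles` carry the built-in name in `w:name`. The function names and the `RECOGNIZED` set are hypothetical helpers built from the list above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace assumed for Word 2003 XML, in ElementTree's {uri} notation.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Built-in paragraph styles listed in this section.
RECOGNIZED = {"heading 1", "heading 2", "heading 3", "heading 4",
              "heading 5", "heading 6", "block text"}


def paragraph_style_usage(root):
    """Count paragraphs per style name in a parsed w:wordDocument element."""
    # Map style IDs (as referenced from paragraphs) to style names.
    names = {}
    for style in root.iter(W + "style"):
        if style.get(W + "type") == "paragraph":
            name = style.find(W + "name")
            if name is not None and name.get(W + "val") is not None:
                names[style.get(W + "styleId")] = name.get(W + "val")

    usage = Counter()
    # Only look at real paragraphs, not at style definitions.
    for paragraph in root.iter(W + "p"):
        p_style = paragraph.find(W + "pPr/" + W + "pStyle")
        if p_style is None:
            continue  # paragraph uses the default style
        style_id = p_style.get(W + "val")
        usage[names.get(style_id, style_id)] += 1
    return usage


def unrecognized_styles(usage):
    """Return the used style names that are not in the list above."""
    return sorted(name for name in usage if name.lower() not in RECOGNIZED)
```

Styles reported by `unrecognized_styles` will not cause an error; they simply add nothing to the structure of the result.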

It is recommended, although not an absolute requirement, that the first heading in a document is a heading 1 and that following headings never have a greater number than the preceding heading plus one. Not following this recommendation will still create error-free output, but might cause subsequent scripts that use it to fail.
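The heading recommendation can be expressed as a simple check on the sequence of heading levels. This is a minimal sketch for illustration: the function name is hypothetical, and the levels are assumed to have been collected from the document in reading order (1 for heading 1, 2 for heading 2, and so on).

```python
def find_heading_level_jumps(levels):
    """Return (index, level, max_allowed) for each heading that breaks the rule."""
    problems = []
    previous = 0  # treating the start as level 0 forces the first heading to be 1
    for index, level in enumerate(levels):
        if level > previous + 1:
            problems.append((index, level, previous + 1))
        previous = level
    return problems


# Going back up any number of levels is fine; only skipping downwards is flagged.
print(find_heading_level_jumps([1, 2, 3, 1, 2]))  # prints []
print(find_heading_level_jumps([1, 3, 2, 4]))     # prints [(1, 3, 2), (3, 4, 3)]
```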

Note! Never apply a paragraph style to only part of a paragraph. This is a very common mistake and can be very hard to spot. The most common cause is selecting the entire paragraph except the paragraph marker, which looks perfectly fine on visual inspection. The output will be error-free, but it will not reflect the author's intention. This is not a malfunction of the script, but a design flaw/feature in Word.

Character styles

The following built-in character styles can be used to structure the document: strong, emphasis, page number.

The style names given here are in English. The actual names as they appear in Word may differ depending on which language version of Word you use. The localized style names will work as described.

Using styles not defined in this list will not cause an error, but will not enhance the result in any way.

Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.

Manual formatting

The following manual formatting is preserved: italic, bold, superscript and subscript. Any other formatting done directly on, or close to, a group of characters will not enhance the result and should only be used for layout that does not communicate anything important to the reader. If the layout is important to the reader (as it should be), use styles to express it.

Lists

Use list nesting on list styles only (identified by a list icon next to the name in the Styles and Formatting Pane).

Keep list nesting neat by using the same principle that applies to headings: the first list item in a list must not be indented, and following list items must never have a greater indentation than the preceding list item plus one (use Tab to indent).

Note! Never use list nesting on paragraph styles with list formatting (identified by a paragraph icon next to the name in the Styles and Formatting Pane). Using tab to indent a paragraph style list will appear correct, but the result will be wrong.

Images

All images that are to be part of the result must be embedded in the original document. To ensure that images are embedded, do the following:

  1. On the Edit menu, click Links. If this item is not selectable, all images are embedded, and no further steps are needed.
  2. Select all links. To select multiple linked objects, hold down CTRL and click each linked object.
  3. Click Break Links.

Images can be converted to JPEG by checking the "Convert images to JPEG" checkbox.

Word templates

Two document templates are available in the transformer directory. Both include a macro that prepares a document for input into the Pipeline. Run the macro when the document is finished. To run the Pipeline preparation macro:

  1. Hit ALT+F8
  2. Select the "PrepareForPipeline" macro
  3. Click Run

In order to make use of this feature the macro security must be set to "medium" or lower in Word (click Macros/Security... in the Tools menu).

The behaviour is similar regardless of which template you are using:

  • A properties dialog is displayed to ensure that the document has a correct title and author. Change the settings and click OK.
  • If the document has unsaved changes (e.g. if the properties were changed), you will be asked to confirm that you want your document saved before proceeding. If your document has not been saved before, you will also be asked where to save your document.
  • A "save as" dialog will now appear, asking where you want to save your exported document. Do not change the file type; it should be XML.
  • A list of preparations is now run on your document.
  • The exported document is saved.
  • Your original document is re-opened in case you want to make further edits. Do not edit the exported document. It has been altered to fit the Pipeline process.

Note! This procedure can contain one or two "save as" dialogs in sequence. Pay attention to which dialog you are currently in.

native.dot

This template is designed to be used with the simple script and contains a few basic styles. Focus is on documents that were created in Word.

Page numbers

Documents that are created in Word have a page numbering that matches the layout on the screen. Therefore, the macro contained in this template will insert the current page number automatically at the top of each page.

scanned.dot

This template is designed to be used with the advanced script and contains a wider set of styles. Focus is on documents that were imported into Word from another source, e.g. OCR software or print publishing software. A basic understanding of the DTBook format is highly recommended, as manual corrections are usually needed.

Page numbers

Documents that have another source than Word never have a page numbering that matches the layout on the screen. Therefore, the page breaks in the source format have to be inserted manually using the page number style.

Output

The output of the script is a DTBook document including images.

Configuration

Input file
  Select the input XML file.
Output directory
  Select the output directory.
Extract images
  Check box to extract images.
Convert images to JPEG
  Check box to convert all images into JPEG format. Note that if the "Extract images" checkbox is unchecked, this box will have no effect. The conversion is done using an external program called ImageMagick, which must be installed on your system. Refer to the transformer documentation for more information on how to set it up.
Overwrite existing files
  Check box to overwrite existing files in the output directory.
Append XHTML stylesheet
  Check box to include an XHTML stylesheet for display in a browser.
Title
  The title of the publication. If no value is supplied, the information is extracted from the file properties.
Author
  The author of the publication. If no value is supplied, the information is extracted from the file properties.
dtb:uid
  A unique identifier. If no value is supplied, an identifier will be generated.
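
The way the options interact can be summarized in a short sketch. This is an illustration of the rules described above and not the script's actual implementation: the class `ScriptOptions`, the function `effective_settings`, the field names, and the default values are all hypothetical, and the random UUID is only a stand-in, since the format of the generated identifier is not specified here.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptOptions:
    """Hypothetical container mirroring the options listed above."""
    input_file: str
    output_dir: str
    extract_images: bool = False      # defaults are assumptions
    convert_to_jpeg: bool = False
    overwrite: bool = False
    append_stylesheet: bool = False
    title: Optional[str] = None
    author: Optional[str] = None
    uid: Optional[str] = None


def effective_settings(options, file_properties):
    """Apply the fallbacks described above and return the effective values."""
    return {
        # JPEG conversion has no effect unless images are extracted.
        "convert_to_jpeg": options.extract_images and options.convert_to_jpeg,
        # An empty title or author falls back to the Word file properties.
        "title": options.title or file_properties.get("title"),
        "author": options.author or file_properties.get("author"),
        # Assumption: a random UUID stands in for the generated identifier.
        "uid": options.uid or "uid-" + str(uuid.uuid4()),
    }
```

For example, with "Extract images" unchecked and "Convert images to JPEG" checked, the effective `convert_to_jpeg` value is `False`, which matches the note on that option.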

Appendix: List of Transformers used

The documents linked below are parts of the Transformer technical documentation. These are developer and systems-administrator centric documents.

  1. se_tpb_wordml2dtbook
  2. int_daisy_validator



