DCL  

Roundtrip Document Conversion - 7 Rules for Convertible Documents

Mike Gross, Chief Technical Officer at DCL, reveals the secrets of using XML document conversion tools for effective roundtripping and legacy document conversion.

As XML has gained in popularity, there is growing interest in producing documents that are stored in XML. However, since most people aren't familiar with XML, there is a need to take documents originally authored in traditional formats like MS Word and convert them to XML. This is referred to as "legacy document conversion."


In addition, since many documents are updated on a regular basis (usually by authors who are expert in their field but have no XML knowledge), there is also a need for a "roundtrip conversion" capability.

This involves converting XML documents to a proprietary publishing format (such as Word, WordPerfect, Quark, or InDesign) that authors can edit in their favorite word processor or desktop publishing tool, with the intent that, when the editing is done, the documents will be converted back to XML.

Off-the-shelf Document Conversion Tools

Various tools on the market support converting back and forth between XML and DTP/word processing formats. These attempt to map XML tagging structures to the stylesheets found in publishing software. They also offer some ability to apply customized rules and conditions to the transformation.

Customized conversion rules and conditions are needed because there is rarely an exact mapping between XML tagging and stylesheets, and there are features supported on one side but not on the other. For example, the tag nesting and document hierarchies in XML are not easy to simulate with the much "flatter" structure of a stylesheet.
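To make the mismatch concrete, here is a minimal sketch of one way a nested XML hierarchy could be flattened into a list of styled paragraphs. It is an illustration only, not how any particular conversion tool works: the sample document, the function name flatten_to_styles, and the convention of joining ancestor tags into a style name are all assumptions, and mixed content is ignored.

```python
import xml.etree.ElementTree as ET


def flatten_to_styles(xml_text):
    """Return (style_name, text) pairs, one per text-bearing leaf element.

    The style name is built from the element's ancestor tags, which is a
    hypothetical way to keep the hierarchy visible in a flat stylesheet.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []

    def walk(element, path):
        path = path + [element.tag]
        children = list(element)
        if not children:
            text = (element.text or "").strip()
            if text:
                paragraphs.append(("_".join(path), text))
        for child in children:
            walk(child, path)

    walk(root, [])
    return paragraphs


# Hypothetical sample document.
SAMPLE = """<chapter>
  <title>Setup</title>
  <section>
    <title>Install</title>
    <para>Unpack the archive.</para>
  </section>
</chapter>"""

paragraphs = flatten_to_styles(SAMPLE)
for style, text in paragraphs:
    print(f"{style}: {text}")
```

Note that the two title elements end up with different style names (chapter_title and chapter_section_title). Without that distinction, nothing in the flat paragraph list would say which level of the hierarchy a title belongs to.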

The conversion of a document from XML to a publishing format is usually straightforward - particularly if you have described each element of your material with XML tags, and have built a DTD or Schema to further constrain your content.

The potential problems lie in converting documents back from publishing formats to XML - the "roundtripping" part. This is because publishing tools contain many features that allow users to create colorful and intricate designs - the ones you see in glossy magazines and corporate brochures. Most of these, however, are impossible to map directly into an XML tagging structure.

To successfully roundtrip documents you need to build a comprehensive publishing stylesheet. This will have "containers" that hold your XML structure. That way, when you convert documents back to XML from the DTP or word processor format, the structure will be reasonably intact.
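As a rough sketch of the return trip, the snippet below rebuilds XML from a list of styled paragraphs. Everything in it is hypothetical: the style names, the STYLE_TO_TAG table, the target tags, and the rule that a section title style opens a new section container. A real stylesheet and DTD would dictate their own mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet mapping: style name -> (XML tag, opens a new <section>?)
STYLE_TO_TAG = {
    "SectionTitle": ("title", True),
    "BodyText": ("para", False),
    "FigureTitle": ("figtitle", False),
    "TableTitle": ("tabletitle", False),
}


def paragraphs_to_xml(paragraphs):
    """Rebuild an XML tree from (style_name, text) pairs."""
    root = ET.Element("document")
    container = root
    for style, text in paragraphs:
        if style not in STYLE_TO_TAG:
            # An unmapped style means content would be "misplaced" on the way back.
            raise ValueError(f"no XML mapping for style {style!r}")
        tag, opens_section = STYLE_TO_TAG[style]
        if opens_section:
            container = ET.SubElement(root, "section")
        ET.SubElement(container, tag).text = text
    return root


# Hypothetical paragraphs as they might come out of an edited document.
edited = [
    ("SectionTitle", "Install"),
    ("BodyText", "Unpack the archive."),
    ("FigureTitle", "Figure 1. Archive contents"),
    ("SectionTitle", "Configure"),
    ("BodyText", "Edit the settings file."),
]

tree = paragraphs_to_xml(edited)
print(ET.tostring(tree, encoding="unicode"))
```

The sketch stops with an error on any style it does not recognize, rather than guessing a tag for it. That choice reflects the point above: the structure survives only when every paragraph sits in a known container.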

It is also important to define a set of authoring rules that must be enforced among the authors - otherwise you risk "misplacing" information on the return trip.

The following guidelines will help ensure smoother roundtripping:

  1. All paragraphs must be styled using one of the available template styles.

  2. Unique styles should be defined for paragraphs with different meanings, even if they look the same. For instance, if a figure and table title have the same appearance, separate styles should be created for each of them. That way, when you go back to XML, the two will be clearly marked out.

  3. Paragraph styles should not be overridden to give a paragraph a different appearance than the base style would give it. A different style should be used. For example, rather than applying highlighting or emphasis to a whole "body text" paragraph style, a special style for the purpose should be provided.

  4. Tables, or items meant to be tabular, should be created using a table editing facility - assuming your publishing tool has one available (not all do, unfortunately).

  5. Absolute frame positioning is available in many publishing tools, but should not be used to mimic a table. Nor should tabs or spacing be used. The rule is: "Tables should be tables."

  6. The method used to insert foreign and special characters (such as Greek and mathematical symbols) should be agreed on in advance, including which fonts are allowed. This stops authors selecting fonts at random to add an obscure character, which would cause problems converting back to XML.

  7. Linked content (such as table and page footnotes, figure and table references) should be done using a method defined in advance. Often publishing tools provide a preferred way to do this, making the conversion of references far simpler. Trying to infer references from text is more difficult and more prone to error.

The above guidelines will allow roundtripping tools to do a better job. But since each tool has its own capabilities and limitations, you'll need to assess the software available before setting up a roundtripping strategy.
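Some of these rules lend themselves to automated checking before a document is sent back through conversion. The sketch below checks rules 1 and 3, plus the tab part of rule 5, against a simplified in-memory paragraph model. The Paragraph class, the ALLOWED_STYLES set, and the sample data are assumptions made for illustration; they do not correspond to the API of any publishing tool.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical template styles.
ALLOWED_STYLES = {"BodyText", "BodyTextHighlighted", "FigureTitle", "TableTitle"}


@dataclass
class Paragraph:
    style: Optional[str]  # paragraph style name, or None if unstyled
    text: str
    overrides: dict = field(default_factory=dict)  # local formatting overrides


def check_paragraphs(paragraphs, allowed_styles):
    """Return (paragraph_number, rule, message) tuples for each violation."""
    problems = []
    for number, para in enumerate(paragraphs, start=1):
        if para.style not in allowed_styles:
            problems.append(
                (number, "rule 1", f"style {para.style!r} is not in the template")
            )
        if para.overrides:
            names = ", ".join(sorted(para.overrides))
            problems.append((number, "rule 3", f"local overrides: {names}"))
        if "\t" in para.text:
            problems.append(
                (number, "rule 5", "tab characters may be simulating a table")
            )
    return problems


# Hypothetical document content.
document = [
    Paragraph("BodyText", "Plain text."),
    Paragraph("Normal", "Text in a style outside the template."),
    Paragraph("BodyText", "Important!", {"bold": True}),
    Paragraph("BodyText", "Part\tQty\tPrice"),
]

problems = check_paragraphs(document, ALLOWED_STYLES)
for number, rule, message in problems:
    print(f"paragraph {number}: {rule}: {message}")
```

A check like this only flags likely problems. Deciding whether a tabbed line really is a disguised table, for instance, still takes a human reviewer.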

Performing legacy document conversion using off-the-shelf tools

You might be tempted to use roundtripping tools for legacy document conversion as well. However, be aware that you'll only get good results if the legacy documents were written in a strict environment and the authors knew how to use the publishing software properly. This is very, very rare.

The harsh reality is that even getting a good document to convert in the controlled roundtripping environment is not always possible. Authors are usually experts in their own field, but have little knowledge of publishing tools. They know how to use the basic formatting buttons - such as bold, italic and indents - to make pages come out the way they want them to look. But they have little knowledge of how to set up even simple stylesheets. When given an "authoring spec" they are often bewildered.

Therefore, it is unrealistic to expect to easily convert legacy documents that were authored primarily with the intention of producing good looking documents on paper. What's more, such documents would often have been created to tight deadlines, since documentation is often the last rung on the ladder to delivering a new product or service. Such pressure means little time to worry about the niceties of using word processors and DTP systems correctly. "Whatever works" is the maxim of the day.

In addition, the structure of the DTD or Schema may not have been built with these types of documents in mind, leaving you without a tagging structure to hold the content of documents created with publishing software.

Assess the effort involved

These are by no means all of the challenges you are likely to face. But the key thing to remember here is that off-the-shelf tools are suitable for converting documents that were authored with a definite XML structure in mind.

If you use them to convert all your legacy materials, you may well be able to get some (or even a lot) of the conversion right. But if your documents are somewhat complex, you will likely have to do a good deal of work on them before they are ready for prime time. The bottom line is: These tools will work when you can carefully control the environment. However, if there is uncontrollable variation, more specialized or tailored tools may be a better choice.

Mike Gross
May 20th, 2004

 


Corporate office:
61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365, P: 718-357-8700
Data Conversion Lab
Copyright © 1997-2009  Data Conversion Laboratory, Inc. All rights reserved.