Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Roundtrip Document Conversion - 7 Rules for Convertible Documents

Mike Gross, Chief Technical Officer at DCL, reveals the secrets of using XML document conversion tools for effective roundtripping and legacy document conversion.

AS XML HAS GAINED in popularity, there is growing interest in producing documents that are stored in XML. However, since most authors aren't familiar with XML, documents originally written in traditional authoring formats such as MS Word must be converted to XML. This is referred to as "legacy document conversion."

In addition, because many documents are updated on a regular basis (usually by authors who are experts in their field but have no XML knowledge), there is also a need for a "roundtrip conversion" capability.

This involves converting XML documents to a proprietary publishing format (such as Word, WordPerfect, Quark, or InDesign) that authors can edit in their favorite word processor or desktop publishing (DTP) tool, with the intent that the finished documents will be converted back to XML.

Off-the-shelf Document Conversion Tools

Various tools on the market support converting back and forth between XML and DTP/word processing formats. These attempt to map XML tagging structures to the stylesheets found in publishing software. They also offer some ability to apply customized rules and conditions to the transformation.

Customized conversion rules and conditions are needed because there is rarely an exact mapping between XML tagging and stylesheets, and there are features supported on one side but not on the other. For example, the tag nesting and document hierarchies in XML are not easy to simulate with the much "flatter" structure of a stylesheet.
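The flattening problem can be illustrated with a minimal sketch. The sample document, its element names, and the convention of joining ancestor names with underscores are all assumptions made for illustration; they are not taken from any particular conversion tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """<chapter>
  <title>Engines</title>
  <section>
    <title>Fuel system</title>
    <para>Check the pump.</para>
  </section>
</chapter>"""


def flatten(element, ancestors=()):
    """Yield (style_name, text) pairs for every element that has text.

    A flat stylesheet cannot nest, so the style name itself has to
    carry the element's ancestry. Tail text (text that follows a child
    element) is ignored to keep the sketch short.
    """
    path = ancestors + (element.tag,)
    text = (element.text or "").strip()
    if text:
        yield "_".join(path), text
    for child in element:
        yield from flatten(child, path)


paragraphs = list(flatten(ET.fromstring(SAMPLE)))
for style, text in paragraphs:
    print(f"{style}: {text}")
```

Both titles use the same XML tag but end up with different style names, which is the only way the flat format can keep them apart.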

The conversion of a document from XML to a publishing format is usually straightforward - particularly if you have described each element of your material with XML tags, and have built a DTD or Schema to further constrain your content.

The potential problems lie in converting documents back from publishing formats to XML - the "roundtripping" part. This is because publishing tools contain many features that allow users to create colorful and intricate designs - the ones you see in glossy magazines and corporate brochures. Most of these, however, are impossible to map directly into an XML tagging structure.

To successfully roundtrip documents you need to build a comprehensive publishing stylesheet. This will have "containers" that hold your XML structure. That way, when you convert documents back to XML from the DTP or word processor format, the structure will be reasonably intact.
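One way to picture such "containers" is a style naming convention that records where each paragraph belongs. The sketch below assumes hypothetical style names in which underscores separate the containing elements from the paragraph's own element, and it assumes that a "title" paragraph opens a new container. Both conventions are illustrative assumptions, not features of any specific tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical styled paragraphs, as they might come back from an author.
STYLED = [
    ("chapter_title", "Engines"),
    ("chapter_section_title", "Fuel system"),
    ("chapter_section_para", "Check the pump."),
    ("chapter_section_para", "Replace the filter."),
]


def rebuild(styled_paragraphs):
    """Rebuild an XML tree from (style_name, text) pairs."""
    root = None
    open_path = []  # containers currently open, outermost first
    for style, text in styled_paragraphs:
        *containers, leaf = style.split("_")
        if not containers:
            raise ValueError(f"style {style!r} names no container")
        if root is None:
            root = ET.Element(containers[0])
            open_path = [root]
        elif containers[0] != root.tag:
            raise ValueError(f"style {style!r} does not belong to <{root.tag}>")
        # Count how many open containers this paragraph can reuse.
        keep = 1
        while (keep < min(len(open_path), len(containers))
               and open_path[keep].tag == containers[keep]):
            keep += 1
        # Assumption: a title in a container that already has content
        # starts a fresh container of the same kind.
        if (leaf == "title" and keep > 1 and keep == len(containers)
                and len(open_path[keep - 1]) > 0):
            keep -= 1
        del open_path[keep:]
        for tag in containers[keep:]:
            open_path.append(ET.SubElement(open_path[-1], tag))
        ET.SubElement(open_path[-1], leaf).text = text
    return root


xml_text = ET.tostring(rebuild(STYLED), encoding="unicode")
print(xml_text)
```

The nesting is recovered only because every paragraph carries a style that says where it belongs. A paragraph with an unknown or overridden style gives the converter nothing to work with.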

It is also important to define a set of authoring rules that must be enforced among the authors - otherwise you risk "misplacing" information on the return trip.

The following guidelines will help ensure smoother roundtripping:

  1. All paragraphs must be styled using one of the available template styles.

  2. Unique styles should be defined for paragraphs with different meanings, even if they look the same. For instance, if a figure and table title have the same appearance, separate styles should be created for each of them. That way, when you go back to XML, the two will be clearly marked out.

  3. Paragraph styles should not be overridden to make a paragraph look different from its base style. A different style should be used instead. For example, rather than applying highlighting or emphasis to a whole "body text" paragraph, a special style should be provided for the purpose.

  4. Tables, or items meant to be tabular, should be created using a table editing facility - assuming your publishing tool has one available (not all do, unfortunately).

  5. Absolute frame positioning is available in many publishing tools, but should not be used to mimic a table. Nor should tabs or spacing be used. The rule is: "Tables should be tables."

  6. The method used to insert foreign and special characters (such as Greek and mathematical symbols) should be agreed on in advance, including which fonts are allowed. This stops authors selecting fonts at random to add an obscure character, which would cause problems converting back to XML.

  7. Linked content (such as table and page footnotes, figure and table references) should be created using a method defined in advance. Publishing tools often provide a preferred way to do this, which makes the conversion of references far simpler. Trying to infer references from text is more difficult and more prone to error.
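Several of these rules can be checked mechanically before a document is sent back for conversion. The sketch below covers rules 1, 3, 5 and 6 under stated assumptions: the `Paragraph` record, the template style names, and the approved font list are all hypothetical, and a real checker would have to read these properties from the publishing tool's own file format.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical template; the style and font names are assumptions.
TEMPLATE_STYLES = {"Body", "Heading1", "FigureTitle", "TableTitle", "Emphasis"}
ALLOWED_FONTS = {"Times New Roman", "Symbol"}


@dataclass
class Paragraph:
    style: Optional[str]                         # None means no style applied
    text: str
    overrides: set = field(default_factory=set)  # direct formatting, e.g. {"bold"}
    fonts: set = field(default_factory=set)      # fonts used inside the paragraph


def check(paragraph):
    """Return a list of rule violations for one paragraph."""
    problems = []
    if paragraph.style not in TEMPLATE_STYLES:
        problems.append("rule 1: paragraph does not use a template style")
    if paragraph.overrides:
        problems.append("rule 3: style overridden with "
                        + ", ".join(sorted(paragraph.overrides)))
    if "\t" in paragraph.text or "   " in paragraph.text:
        problems.append("rule 5: tabs or runs of spaces suggest a simulated table")
    stray = paragraph.fonts - ALLOWED_FONTS
    if stray:
        problems.append("rule 6: unapproved font " + ", ".join(sorted(stray)))
    return problems


document = [
    Paragraph("Body", "Plain text."),
    Paragraph(None, "Part\tQty", {"bold"}, {"Wingdings"}),
]
for number, paragraph in enumerate(document, start=1):
    for problem in check(paragraph):
        print(f"paragraph {number}: {problem}")
```

Rules 2, 4 and 7 concern the author's intent (what a paragraph means, whether something is tabular, what a reference points to), so they are much harder to verify automatically and are left out of this sketch.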

The above guidelines will allow roundtripping tools to do a better job. But since each tool has its own unique capabilities, you'll need to assess the capabilities and limitations of the software available before setting up a roundtripping strategy.

Performing legacy document conversion using off-the-shelf tools

You might be tempted to use roundtripping tools for legacy document conversion as well. However, be aware that you'll only get good results if the legacy documents were written in a strict environment and the authors knew how to use the publishing software properly. This is very, very rare.

The harsh reality is that even getting a good document to convert in the controlled roundtripping environment is not always possible. Authors are usually experts in their own field, but have little knowledge of publishing tools. They know how to use the basic formatting buttons - such as bold, italic and indents - to make pages come out the way they want them to look. But they have little knowledge of how to set up even simple stylesheets. When given an "authoring spec" they are often bewildered.

Therefore, it is unrealistic to expect to easily convert legacy documents that were authored primarily with the intention of producing good-looking documents on paper. What's more, such documents would often have been created under tight deadlines, since documentation is often the last rung on the ladder to delivering a new product or service. Such pressure leaves little time to worry about the niceties of using word processors and DTP systems correctly. "Whatever works" is the maxim of the day.

In addition, the structure of the DTD or Schema may not have been built with these types of documents in mind, leaving you without a tagging structure to hold the content of documents created with publishing software.

Assess the effort involved

These are by no means all of the challenges you are likely to face. But the key thing to remember here is that off-the-shelf tools are suitable for converting documents that were authored with a definite XML structure in mind.

If you use them to convert all your legacy materials, you may well be able to get some (or even a lot) of the conversion right. But if your documents are somewhat complex, you will likely have to do a good deal of work on them before they are ready for prime time. The bottom line is: These tools will work when you can carefully control the environment. However, if there is uncontrollable variation, more specialized or tailored tools may be a better choice.
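A rough first assessment can be made by measuring how consistently each legacy document uses the template's styles. The sketch below is one possible heuristic, not an established method: the file names, the style lists, and the 90% threshold are all assumptions chosen for illustration.

```python
def styled_share(paragraph_styles, template_styles):
    """Fraction of paragraphs whose style belongs to the template."""
    if not paragraph_styles:
        return 0.0
    hits = sum(1 for style in paragraph_styles if style in template_styles)
    return hits / len(paragraph_styles)


def triage(documents, template_styles, threshold=0.9):
    """Split documents by how likely an off-the-shelf tool is to cope.

    `documents` maps a document name to the list of paragraph style
    names found in it (None for an unstyled paragraph). The threshold
    is an arbitrary starting point and should be tuned on real samples.
    """
    plan = {"off_the_shelf": [], "tailored": []}
    for name, styles in documents.items():
        share = styled_share(styles, template_styles)
        bucket = "off_the_shelf" if share >= threshold else "tailored"
        plan[bucket].append((name, round(share, 2)))
    return plan


# Hypothetical inventory of legacy documents.
TEMPLATE = {"Body", "Heading1", "TableTitle"}
INVENTORY = {
    "manual.doc": ["Heading1"] + ["Body"] * 9,
    "brochure.doc": ["Normal", "Normal", "Body", None],
}

plan = triage(INVENTORY, TEMPLATE)
print(plan)
```

A high share does not guarantee a clean conversion, since styles can be applied to the wrong content, but a low share is a strong hint that the document falls into the "uncontrollable variation" category.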

Mike Gross
May 20th, 2004

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.