Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
What's the big deal - just cut and paste?

When converting documents from publishing systems like Microsoft Word, Quark or Adobe FrameMaker, many people cut and paste from the original documents - yet this can prove inaccurate, time consuming and costly, writes DCL's Mike Gross.

Over the years a lot of people have said to me: "What's the big deal with data conversion? Why not simply cut and paste? It's easy to do and doesn't cost anything." They think that once they've pulled out the text, all they need to do is throw a couple of tags around it and the job is done. While that’s partially true, the devil is in the details.

The truth is that the old adage "five percent of the details take up ninety-five percent of the effort" applies in the world of document conversion.

You are likely to come across a number of difficulties when using cut and paste to convert documents to XML. And while in a collection of simple documents this may not be a big deal, the cost of cleaning up the “little details” in a collection of complex documents can be substantial. These difficulties include:

Special characters and emphasis

While standard text is typically extracted accurately using cut and paste, characters such as mathematical and foreign symbols can be dropped in the course of moving into an XML environment.

So while a registered trademark might transfer okay, a Greek Alpha will often be converted to a plain capital "A". If that character occurs 300 times in a document, you will need to find and fix all 300 of them manually.
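
As a rough illustration of how such losses could be found automatically rather than by eye, the sketch below compares the non-ASCII characters in the source text with those in the converted text. The function name and the sample strings are hypothetical, and the sketch assumes both versions are available as plain Unicode strings.

```python
from collections import Counter


def dropped_characters(source: str, converted: str) -> dict:
    """Return each non-ASCII character that occurs fewer times in the
    converted text than in the source, with the number of lost copies."""
    src = Counter(ch for ch in source if ord(ch) > 127)
    out = Counter(ch for ch in converted if ord(ch) > 127)
    # Counter subtraction keeps only the characters with a positive shortfall.
    return dict(src - out)


source = "α decay and α particles ®"
converted = "A decay and A particles ®"
print(dropped_characters(source, converted))
```

A report like this only says which characters went missing and how often; locating and repairing each occurrence is still a separate step.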

Likewise, emphasis such as bold, italic, small caps, underline, superscript and subscript does not convert well with cut and paste. In most cases you will not end up with the proper tagging needed to represent the desired emphasis in the resulting XML document.
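
To make the point concrete, here is a minimal sketch of what an automated conversion has to do with emphasis. It assumes the source text has already been read as a list of runs, each a piece of text plus a set of style names. The run model, the tag names (b, i, sup, sub) and the para wrapper are all assumptions made for illustration, not the markup of any particular DTD.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from source style names to target XML tags.
TAGS = {"bold": "b", "italic": "i", "superscript": "sup", "subscript": "sub"}


def runs_to_xml(runs):
    """Wrap each run of text in the tags that represent its emphasis."""
    parts = []
    for text, styles in runs:
        fragment = escape(text)
        # Sort the styles so nested tags always appear in the same order.
        for style in sorted(styles):
            tag = TAGS[style]
            fragment = f"<{tag}>{fragment}</{tag}>"
        parts.append(fragment)
    return "<para>" + "".join(parts) + "</para>"


runs = [("H", set()), ("2", {"subscript"}), ("O is ", set()),
        ("vital", {"bold", "italic"})]
print(runs_to_xml(runs))
```

Cut and paste throws away the style information on each run, which is exactly the input this kind of mapping depends on.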

It is also important to note that superscripts are very heavily used in documentation as hyperlinks that designate footnotes - especially within tables and journal bibliographies. If these hyperlinks are lost in cut and paste conversion, reconnecting the links can be difficult.

Tables

Technical documents tend to contain a lot of tables. When converting tables using cut and paste, the text in the table cells, and often the tab characters between cells, will be retained if the table is relatively simple, particularly if it has little in the way of column or row spanning.

However, the majority of tables in technical documents are made up of much more than cell contents. Properties such as spanning, alignment, header row designation and cell borders often do not convert accurately using cut and paste. If the conversion proves to be inaccurate, you will have to insert these important properties by hand.
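
The sketch below shows the kind of table properties that have to be carried across explicitly. It assumes the source table has been read into rows of cell dictionaries. The dictionary keys and the HTML-like element names (table, tr, th, td) are illustrative assumptions, since the real target might be CALS or another table model.

```python
import xml.etree.ElementTree as ET


def table_to_xml(rows):
    """Serialize rows of cell dictionaries, keeping spans, alignment
    and header designation as explicit markup."""
    table = ET.Element("table")
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            el = ET.SubElement(tr, "th" if cell.get("header") else "td")
            if cell.get("colspan", 1) > 1:
                el.set("colspan", str(cell["colspan"]))
            if "align" in cell:
                el.set("align", cell["align"])
            el.text = cell["text"]
    return ET.tostring(table, encoding="unicode")


rows = [
    [{"text": "Torque limits", "header": True, "colspan": 2}],
    [{"text": "Bolt A"}, {"text": "25 Nm", "align": "right"}],
]
print(table_to_xml(rows))
```

A tab-delimited paste of the same table would keep only the three text strings; the span, the header flag and the alignment would all have to be re-entered by hand.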

Tagging inconsistencies

If you do decide to use cut and paste to convert your documents, someone will likely need to manually insert the necessary tagging into the resulting document. This will require people with some XML training and a good understanding of all the rules and tagging requirements for your particular markup environment. This makes it difficult to put together a scalable process.

On top of this, the reality is that humans are rarely 100% correct or consistent in performing tagging. In the majority of cases it makes sense to automate the bulk of a conversion - significantly reducing tagging inconsistencies.
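
One simple automated check for tagging consistency is sketched below, assuming the converted document is well-formed XML and that the permitted element names are known. The element names and the sample document are made up for the example; a real project would normally validate against its DTD or schema instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def unexpected_tags(xml_text, allowed):
    """Count every element whose name is not in the allowed vocabulary."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter() if el.tag not in allowed)


doc = "<doc><para><b>x</b> and <bold>y</bold></para></doc>"
print(unexpected_tags(doc, {"doc", "para", "b"}))
```

Here one operator used b and another used bold for the same emphasis, the sort of inconsistency that manual tagging tends to produce.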

Hyperlinking

Technical documentation is typically filled with potential XML hyperlinks, such as: "See Figure 2.2.7" or "Refer to Step 12". These need to be set up. Cut and paste conversion will always require the manual setting up of hyperlinks like this - even if the hyperlinks have already been created in the original document. This is a labor-intensive task and one that is even more prone to error than simple cut and paste.

Automated conversions, on the other hand, will retain the hyperlinks in the original document, and will create links even when only the text existed before. So, even if the text "See Figure 2.2.7" were simply typed into the original document, the conversion would produce the desired hyperlink.
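
A minimal sketch of how typed references might be turned into links is shown below. The xref element, its target attribute and the ID pattern are assumptions chosen for illustration, and the pattern only recognizes the words Figure, Table and Step followed by a dotted number.

```python
import re
from xml.sax.saxutils import escape

# Matches references such as "Figure 2.2.7" or "Step 12".
REF = re.compile(r"\b(Figure|Table|Step) (\d+(?:\.\d+)*)")


def link_references(text):
    """Wrap each recognized reference in a hypothetical <xref> element."""
    def repl(match):
        kind, number = match.group(1), match.group(2)
        target = f"{kind.lower()}-{number.replace('.', '-')}"
        return f'<xref target="{target}">{match.group(0)}</xref>'
    # Escape first so the inserted markup is not itself escaped.
    return REF.sub(repl, escape(text))


print(link_references("See Figure 2.2.7 or refer to Step 12."))
```

Real documents phrase references in many more ways (ranges, abbreviations, "the figure above"), so a production rule set would be considerably larger than this one pattern.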

When it comes to hyperlinking, automated conversion is especially useful for large and complex procedures, and when converting repair manuals.

Besides creating hyperlinks, you will need to create the IDs that they point to. When tagging IDs it is vital to follow a definite pattern. This is straightforward when using automated conversion, but a nightmare when using cut and paste.
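
As an illustration of what "a definite pattern" can mean in practice, the sketch below derives every ID from the element kind and its number, then reports any link whose target was never defined. The naming scheme and both function names are hypothetical.

```python
def element_id(kind, number):
    """Build an ID from an element kind and its dotted number,
    e.g. ("Figure", "2.2.7") becomes "figure-2-2-7"."""
    return f"{kind.lower()}-{number.replace('.', '-')}"


def unresolved_links(defined, referenced):
    """List the IDs that are referenced but never defined."""
    ids = {element_id(kind, number) for kind, number in defined}
    targets = {element_id(kind, number) for kind, number in referenced}
    return sorted(targets - ids)


defined = [("Figure", "2.2.7")]
referenced = [("Figure", "2.2.7"), ("Step", "12")]
print(unresolved_links(defined, referenced))
```

Because links and IDs are generated by the same rule, a mismatch points to a missing target in the document rather than to a typing slip in the ID.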

Other special mark-up requirements

During cut and paste conversion there is a risk that important information could be lost. Most at risk is information "buried" in source documents, such as tables of contents and indexes. These would need to be maintained in some way by the target mark-up environment.

In addition, there may be other embedded pieces of information required for final output, such as GUIDs and LOINC codes for pharmaceutical SPL documents. These may be stored in document header fields or in separate files, such as Excel spreadsheets. This information would need to be manually inserted if you use the cut and paste approach.
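
The sketch below shows one way such externally held values could be merged in automatically, assuming the spreadsheet has been exported as CSV with document, guid and loinc columns. The column names, the metadata element and the placeholder values are all invented for the example and do not follow the actual SPL schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def add_metadata(doc_xml, csv_text, doc_name):
    """Insert the GUID and code listed for doc_name as the first child
    of the document's root element."""
    root = ET.fromstring(doc_xml)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["document"] == doc_name:
            meta = ET.Element("metadata")
            ET.SubElement(meta, "guid").text = row["guid"]
            ET.SubElement(meta, "loinc").text = row["loinc"]
            root.insert(0, meta)
            break
    else:
        # No row matched: fail loudly instead of emitting an incomplete file.
        raise KeyError(doc_name)
    return ET.tostring(root, encoding="unicode")


sheet = ("document,guid,loinc\n"
         "label1.doc,00000000-0000-0000-0000-000000000001,00000-0\n")
print(add_metadata("<doc><body>text</body></doc>", sheet, "label1.doc"))
```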

Best approach?

The cut and paste approach is feasible and may even make sense as a viable conversion method for small and simple documentation sets. For large or complex documentation sets, the “little details” loom larger and an automated conversion should be seriously considered for swifter, more accurate and complete tagging.

Mike Gross
September 20, 2005

Further reading:

Roundtrip conversions: seven golden rules for making convertible documents
http://www.dclab.com/xml_document_conversion_tools.asp

Converting from PDF to XML & MS Word: avoiding the pitfalls
http://www.dclab.com/converting_from_pdf2.asp

PDF white paper, Part 1: PDF overview
http://www.dclab.com/pdf_conversion.asp

Getting your content into XML
http://www.dclab.com/XMLwhitepaper.asp

An egg too far
http://www.dclab.com/do_it_yourself.asp

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.