What's the big deal - just cut and paste?

When converting documents from publishing systems like Microsoft Word, Quark or Adobe FrameMaker, many people cut and paste from the original documents. Yet this can prove inaccurate, time-consuming and costly, writes DCL's Mike Gross.

Over the years a lot of people have said to me: "What's the big deal with data conversion? Why not simply cut and paste, it's easy to do and doesn't cost anything." They think that once they've pulled out the text all they need to do is throw a couple of tags around it and the job is done. While that’s partially true, the devil is in the details.

The truth is that the old adage "five percent of the details take up ninety-five percent of the effort" applies in the world of document conversion.

You are likely to come across a number of difficulties when using cut and paste to convert documents to XML. While this may not be a big deal in a collection of simple documents, the cost of cleaning up the “little details” in a collection of complex documents can be substantial. These difficulties include:

Special characters and emphasis

While standard text is typically extracted accurately using cut and paste, characters such as mathematical and foreign symbols can be dropped in the course of moving into an XML environment.

So while a registered trademark symbol might transfer intact, a Greek Alpha will often be converted to a plain capital "A". If that character occurs 300 times in a document, you will need to find and fix all 300 occurrences manually.
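
As an illustration of how an automated step can protect such characters, here is a minimal Python sketch. It is not DCL's tooling; the function names are hypothetical. It counts every non-ASCII character in the source text and writes each one as an XML numeric character reference, so a symbol cannot silently turn into a plain letter.

```python
from collections import Counter
from xml.sax.saxutils import escape


def special_character_report(text: str) -> Counter:
    """Count every non-ASCII character so none can vanish unnoticed."""
    return Counter(ch for ch in text if ord(ch) > 127)


def encode_for_xml(text: str) -> str:
    """Escape markup characters, then write non-ASCII characters as
    numeric character references such as &#945; for a Greek alpha."""
    escaped = escape(text)  # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


source = "The angle α is < 90° and α ≠ β."
print(special_character_report(source))
print(encode_for_xml(source))
```

Comparing the report for the source with the same report for the converted text is a quick way to see which characters were dropped.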

Likewise, emphasis such as bold, italic, small caps, underline, superscripting and subscripting does not convert well with cut and paste. In most cases you will not end up with the proper tagging needed to represent the desired emphasis in the resulting XML document.

Superscripts are also heavily used in documentation as hyperlinks that designate footnotes, especially within tables and journal bibliographies. If these hyperlinks are lost in a cut and paste conversion, reconnecting them can be difficult.
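
To make the emphasis problem concrete, the sketch below assumes the source document can be read as a list of text runs, each with a set of style names. Both that input shape and the element names (`b`, `i`, `sup`, `sub`) are assumptions for illustration; a real target DTD will define its own. Plain pasted text keeps only the characters, while this kind of mapping keeps the styling as tags.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from source style names to target element names.
TAG_FOR_STYLE = {
    "bold": "b",
    "italic": "i",
    "superscript": "sup",
    "subscript": "sub",
}


def runs_to_xml(runs):
    """Turn (text, styles) runs into one paragraph with emphasis tags."""
    parts = []
    for text, styles in runs:
        fragment = escape(text)
        # Sort the styles so the nesting order is always the same.
        for style in sorted(styles):
            tag = TAG_FOR_STYLE[style]
            fragment = f"<{tag}>{fragment}</{tag}>"
        parts.append(fragment)
    return "<p>" + "".join(parts) + "</p>"


runs = [("H", set()), ("2", {"subscript"}), ("O is ", set()),
        ("essential", {"bold", "italic"})]
print(runs_to_xml(runs))
```

Sorting the styles before wrapping is one simple way to keep the nesting consistent from one paragraph to the next.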

Tables

Technical documents tend to contain a lot of tables. When you convert a table using cut and paste, the text in the cells, and often the tab characters between them, will be retained if the table is relatively simple, particularly if it has little column or row spanning.

However, the majority of tables in technical documents are made up of much more than cell contents. Properties like spanning, alignment, header row designation and cell borders often do not convert accurately using cut and paste. If they are lost, you have to insert these important properties by hand.
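
The difference can be sketched in a few lines of Python. The example assumes a simple in-memory table model in which each cell is a (text, column span) pair, and it writes HTML-style `th`, `td` and `colspan` markup; the model and the choice of target markup are assumptions, and a CALS or DITA target would use different names. The pasted version keeps only tab-separated text, while the generated XML keeps the header row and the span.

```python
import xml.etree.ElementTree as ET


def pasted_text(rows):
    """What cut and paste typically keeps: cell text separated by tabs."""
    return "\n".join("\t".join(text for text, _ in row) for row in rows)


def build_table(rows, header_rows=1):
    """Build table markup that keeps header rows and column spans."""
    table = ET.Element("table")
    for index, row in enumerate(rows):
        tr = ET.SubElement(table, "tr")
        for text, colspan in row:
            cell = ET.SubElement(tr, "th" if index < header_rows else "td")
            if colspan > 1:
                cell.set("colspan", str(colspan))
            cell.text = text
    return ET.tostring(table, encoding="unicode")


rows = [[("Torque", 2)], [("Min", 1), ("Max", 1)]]
print(pasted_text(rows))
print(build_table(rows))
```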

Tagging inconsistencies

If you do decide to use cut and paste to convert your documents, someone will likely need to manually insert the necessary tagging into the resulting document. This requires people with some XML training and a good understanding of all the rules and tagging requirements for your particular markup environment, which makes it difficult to put together a scalable process.

On top of this, the reality is that humans are rarely 100% correct or consistent in performing tagging. In the majority of cases it makes sense to automate the bulk of a conversion - significantly reducing tagging inconsistencies.

Hyperlinking

Technical documentation is typically filled with potential XML hyperlinks, such as "See Figure 2.2.7" or "Refer to Step 12". These need to be set up. Cut and paste conversion always requires setting up such hyperlinks manually, even if they have already been created in the original document. This is a labor-intensive task and one that is even more prone to error than simple cut and paste.

Automated conversions, on the other hand, will retain the hyperlinks in the original document, and will create links even when only the text existed before. So, even if the text "See Figure 2.2.7" were simply typed into the original document, the conversion would produce the desired hyperlink.
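
A rough idea of how links can be created from plain text is shown below. This is only a sketch: the `xref` element, its `target` attribute and the ID convention are hypothetical, and a production converter would need to handle many more reference styles. A regular expression finds phrases such as "Figure 2.2.7" or "Step 12" and wraps each one in a link element.

```python
import re
from xml.sax.saxutils import escape

# Matches "Figure 2.2.7", "Table 4" or "Step 12" (dotted numbers allowed).
REFERENCE = re.compile(r"\b(Figure|Table|Step)\s+(\d+(?:\.\d+)*)")


def target_id(kind: str, number: str) -> str:
    """Hypothetical ID convention: 'Figure 2.2.7' becomes 'figure-2-2-7'."""
    return f"{kind.lower()}-{number.replace('.', '-')}"


def link_references(text: str) -> str:
    """Wrap every recognized reference in an <xref> element."""
    def wrap(match):
        ident = target_id(match.group(1), match.group(2))
        return f'<xref target="{ident}">{match.group(0)}</xref>'
    return REFERENCE.sub(wrap, escape(text))


print(link_references("See Figure 2.2.7 and refer to Step 12."))
```

Because the same rule is applied to every occurrence, all references of the same form are linked in the same way, which is the consistency that manual tagging struggles to deliver.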

When it comes to hyperlinking, automated conversion is especially useful for large and complex procedures, and when converting repair manuals.

Besides creating hyperlinks, you will need to create the IDs that they point to. When tagging IDs it is vital to follow a definite pattern. This is straightforward when using automated conversion, but a nightmare when using cut and paste.
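
One way to enforce a definite pattern is to check every ID and every link target mechanically. The sketch below assumes a hypothetical convention in which IDs look like `figure-2-2-7` or `step-12`; the pattern and function names are illustrative only. It reports IDs that break the pattern and link targets that point at nothing.

```python
import re

# Hypothetical convention: element kind, then the number parts joined by "-".
ID_PATTERN = re.compile(r"^(figure|table|step)-\d+(?:-\d+)*$")


def malformed_ids(ids):
    """Return the IDs that do not follow the agreed pattern."""
    return [ident for ident in ids if not ID_PATTERN.match(ident)]


def unresolved_targets(defined_ids, link_targets):
    """Return link targets that have no matching ID in the document."""
    defined = set(defined_ids)
    return [target for target in link_targets if target not in defined]


ids = ["figure-2-2-7", "Fig_3", "step-12"]
links = ["figure-2-2-7", "step-13"]
print(malformed_ids(ids))
print(unresolved_targets(ids, links))
```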

Other special markup requirements

During cut and paste conversion there is a risk that important information could be lost. Most at risk is information "buried" in source documents, such as tables of contents and indexes. These would need to be maintained in some way by the target markup environment.

In addition, there may be other embedded pieces of information required for final output, such as GUIDs and LOINC codes for pharmaceutical SPL documents. These are often stored in document header fields or outside the documents themselves, for example in Excel spreadsheets. This information would need to be manually inserted if you use the cut and paste approach.
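
As a sketch of how such information can be merged automatically, the Python example below assumes the spreadsheet has been exported to CSV with hypothetical columns `document`, `guid` and `section_code`. The values are placeholders rather than real GUIDs or LOINC codes, and the `header` element is illustrative, not the SPL schema. Each converted document receives a header built from its own row.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder spreadsheet export; column names and values are made up.
SPREADSHEET_CSV = """document,guid,section_code
label-001,GUID-PLACEHOLDER-1,CODE-PLACEHOLDER-1
label-002,GUID-PLACEHOLDER-2,CODE-PLACEHOLDER-2
"""


def load_metadata(csv_text):
    """Index the spreadsheet rows by document name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["document"]: row for row in reader}


def add_header(root, metadata):
    """Insert a header element holding the metadata as the first child."""
    header = ET.Element("header")
    for field in ("guid", "section_code"):
        ET.SubElement(header, field).text = metadata[field]
    root.insert(0, header)
    return root


metadata = load_metadata(SPREADSHEET_CSV)
document = ET.fromstring("<document><body>Converted text</body></document>")
add_header(document, metadata["label-001"])
print(ET.tostring(document, encoding="unicode"))
```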

Best approach?

The cut and paste approach is feasible, and may even make sense, for small and simple documentation sets. For large or complex documentation sets, the “little details” loom larger, and an automated conversion should be seriously considered for swifter, more accurate and more complete tagging.

Mike Gross
September 20, 2005

Further reading:

Roundtrip conversions: seven golden rules for making convertible documents
http://www.dclab.com/xml_document_conversion_tools.asp

Converting from PDF to XML & MS Word: avoiding the pitfalls
http://www.dclab.com/converting_from_pdf2.asp

PDF white paper, Part 1: PDF overview
http://www.dclab.com/pdf_conversion.asp

Getting your content into XML
http://www.dclab.com/XMLwhitepaper.asp

An egg too far
http://www.dclab.com/do_it_yourself.asp

 
Copyright © 1997-2010  Data Conversion Laboratory, Inc. All rights reserved.