Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Convert From PDF Part 2 of a White Paper

Avoiding Pitfalls When You Convert From PDF To XML & MS Word

Part 2 of a white paper on issues to address when you convert from PDF. Out-of-the-box solutions are a start, but there's more to the story, writes Mike Gross, Chief Technology Officer at Data Conversion Laboratory (DCL).

Part One of this white paper discussed the general difficulties of converting from PDF into editable formats. This second part examines issues as they relate to specific target formats and discusses the state of the art in accomplishing the task.

Issues related to specific target formats

So far, we have discussed the global issues to address when you convert from PDF. We didn't cover the specifics of any particular target format, because most common document formats need the same elements recovered from the source document. For example, the logical elements of the source document - such as proper word spacing, de-hyphenation, paragraph boundaries, special characters, multiple columns, text flow, and table layout - all need to be identified.
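As a rough illustration of the inference involved, the following minimal Python sketch rebuilds paragraphs from extracted text lines and removes end-of-line hyphenation. The function name and its rules (a blank line ends a paragraph; a trailing hyphen after a lowercase letter, followed by a line that starts in lowercase, marks a broken word) are assumptions made for this example, not the behavior of any particular conversion tool.

```python
import re


def rebuild_paragraphs(lines):
    """Join extracted text lines into paragraphs and undo line-end hyphenation.

    Assumed rules (hypothetical, for illustration only):
    - a blank line marks a paragraph break;
    - a line ending in "<lowercase letter>-" followed by a line that
      starts with a lowercase letter is one word broken across lines.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if re.search(r"[a-z]-$", current) and line[0].islower():
            # Drop the hyphen and glue the two word halves together.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


sample = ["The conver-", "sion program", "runs.", "", "Second para-", "graph."]
print(rebuild_paragraphs(sample))
```

Note that a genuine hyphen that happens to fall at a line end (for example "well-" followed by "known") would be joined incorrectly by this rule, which is exactly the kind of error a review step has to catch.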

In this section we will go further and look at several popular target formats and the issues specific to them.

 

MS Word - Most of the issues discussed in the first part of this white paper relate directly to MS Word, as it is the most common program for authoring documents and is a natural target format for PDF documents that will need to be maintained and modified. It's also the easiest program in which to fix elements of the source document that weren't converted properly. The key item to be aware of is that, while professionally authored Word documents would ideally use styles to maintain a consistent look, the conversion programs normally do not apply styles. Therefore, if you need the documents to conform to a specific style sheet, one of your cleanup tasks, besides the normal ones, will be to go through and manually style the document paragraphs.

Besides the lack of styling, the other key issue is that it is often possible to virtually replicate the look of the original PDF, but in a form that is not necessarily maintainable. This is especially true for tabular material, where the look can be replicated by exact positioning of lines and other elements; the result looks correct but is extremely difficult to edit and maintain.
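To give a sense of what the styling cleanup involves, here is a minimal Python sketch that suggests a Word style name for a paragraph from its direct formatting. The function name and the point-size thresholds are hypothetical; real documents would need rules tuned to their own style sheet, and the suggestions would still need human review.

```python
def suggest_style(font_size, bold, h1_min=16.0, h2_min=12.0):
    """Suggest a Word style name from a paragraph's direct formatting.

    font_size is in points. The thresholds h1_min and h2_min are
    illustrative assumptions, not values taken from any real style sheet.
    """
    if font_size >= h1_min:
        return "Heading 1"
    if font_size >= h2_min or bold:
        return "Heading 2"
    return "Normal"


# (font size, bold) pairs as a converter might report them.
for size, bold in [(18, False), (12, False), (10, True), (10, False)]:
    print(size, bold, "->", suggest_style(size, bold))
```

A rule this simple will misclassify bold body text as a heading, so it is best treated as a first pass that narrows down the manual work rather than a replacement for it.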

RTF - If you convert to RTF, you can then import your converted documents into any authoring program that supports RTF import. What applies to MS Word holds true for RTF and for most other desktop publishing and word processing programs.

XML & SGML - Both XML and SGML require similar tagging, which needs to be done at two levels: tagging of the document structure and tagging of content elements. While the structural issues discussed above apply directly, there is the additional issue of tagging to your specific DTD or Schema, which the generic conversion programs know nothing about. Furthermore, XML/SGML documents will also require tagging related to document content (such as section titles and cross references). This tagging will need to be applied either manually or by software in a post-process.

Some conversion programs produce an intermediate-level "vanilla" XML, and you can then use an XSLT script to transform the intermediate document into its final form. In all these transformations it's important to realize that you are constantly inferring information that doesn't appear explicitly in the document. This is an inherently difficult process. The conversion programs will typically represent XML and SGML tables using either CALS or HTML table formats. These do a good job of replicating the look and feel of the original PDF table, but you should expect to have to do some cleanup on the marked-up tables.
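The idea of transforming "vanilla" XML into a target tag set can be sketched as follows. Python's standard library does not include an XSLT processor, so this example uses a plain tree rewrite with xml.etree.ElementTree instead of XSLT. The tag names and the mapping are hypothetical stand-ins for whatever your converter emits and your DTD requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a converter's generic tags to a target DTD's tags.
TAG_MAP = {"doc": "article", "h1": "title", "p": "para"}


def retag(vanilla_xml, tag_map=TAG_MAP):
    """Rename elements of a generic XML document according to tag_map.

    Elements whose tags are not in the map are left unchanged.
    """
    root = ET.fromstring(vanilla_xml)
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


print(retag("<doc><h1>Intro</h1><p>Text</p></doc>"))
```

A one-to-one rename is the easy part. Deciding which generic paragraph is really a section title or a cross reference is the inference step that usually needs extra rules or manual review.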

HTML - There aren't enough tags in generic HTML tagging to fully replicate the document structures you'd find in most moderately complex documents. As a result, many conversions to HTML are approximations at best. However, using Cascading Style Sheets (CSS) in the latest versions of HTML, you can produce quite sophisticated paragraph layouts. The use of CSS, however, is a double-edged sword. Some of the PDF conversion tools tend to use these features to replicate the look of the original page. They accomplish that task admirably, but they do so with features that are very difficult to properly edit and maintain. For example, since the layout done for paper publishing is not necessarily the layout you'd use for a computer screen inside a web browser, you could have difficulties re-wrapping the text for the new format (due to the way converted documents are coded). This is particularly true of tables in the original PDF document, which should be rendered using HTML table tagging, making them much more flexible and "re-wrappable" so that they can be more readily displayed on smaller display devices (such as a PDA).
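As a small illustration of the table point, the following Python sketch emits logical HTML table tagging from rows of cell values, rather than absolutely positioned text. The function name is hypothetical, and the input is assumed to be rows that an earlier step has already recovered from the PDF.

```python
from html import escape


def rows_to_html_table(rows, header=True):
    """Render rows of cell values as a plain HTML table.

    When header is True, the first row is emitted with <th> cells.
    Cell text is escaped so characters such as "&" stay valid HTML.
    """
    parts = ["<table>"]
    for i, row in enumerate(rows):
        cell = "th" if header and i == 0 else "td"
        cells = "".join(f"<{cell}>{escape(str(v))}</{cell}>" for v in row)
        parts.append(f"  <tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


print(rows_to_html_table([["Part", "Qty"], ["Bolt & nut", 4]]))
```

Because the output carries rows and cells instead of page coordinates, a browser is free to re-wrap it for whatever screen width is available.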
 

DID YOU KNOW?

Data Conversion Laboratory (DCL) uses the most up-to-date and best software to assist in the process of converting from PDF to XML, MS Word, and other electronic formats. This white paper comes out of our development team's research into the issues that prevent conversion from PDF from being an automated process.

But conversion is just part of the service DCL provides. Our process includes software that takes automation as far as feasible. This is used in conjunction with software that checks for the issues discussed in this white paper and identifies the problem areas. But we also use expert reviewers - real live humans - to review the results of the conversion process and make sure that what gets delivered is ready for prime time.

Several approaches to PDF conversion

As discussed above, PDF was designed as a print-layout format and was not intended to be editable, so PDF conversion software has a lot to do. It is a field in which a considerable amount of work is being done, and the software is getting better over time. But the fact is that PDF conversion is rarely done perfectly, and you should expect to have to do some cleanup.

There are currently several software tools that support conversion from PDF into specific target formats. Some of them are Acrobat plugins, others are standalone software programs, while others operate in a service bureau mode (you upload the files to them, and they send you back converted documents). There are even freeware programs available that will do some level of conversion on PDF documents.

The options and control available to the user also vary greatly. With many of the tools that support PDF conversion, the user simply asks the software to export the PDF document to a particular target format (this method offers some user-controlled options, but generally there is very little user intervention). Other tools require the user to "zone" the document manually. In this case, the user defines the page flow zones on the page, such as multiple columns, and differentiates between graphics, tables, and text. There are also tools that guess the zones on a page themselves and then allow the user to override those guesses.

DCL's preference is this last approach, where the software attempts to guess page zones and the user can override its guesses. In general, allowing user control of the document before it is completely decomposed is important. As already mentioned, these tools occasionally fail to de-columnize a page or break apart a table properly, and it is much harder to fix these elements after the output software has run.
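The "guess, then let the user override" workflow can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model: words are given as (x_left, x_right, text) tuples, the page has at most two columns, and the guess is the midpoint of the widest horizontal gap that no word covers. Real zoning tools work with far richer layout information.

```python
def guess_column_split(words, min_gap=20.0):
    """Guess the x position that separates two columns.

    words is a list of (x_left, x_right, text) tuples. Returns the midpoint
    of the widest horizontal gap not covered by any word, or None when no
    gap is at least min_gap wide (treated as a single-column page).
    """
    spans = sorted((x0, x1) for x0, x1, _ in words)
    if not spans:
        return None
    best_width, split = 0.0, None
    covered_to = spans[0][1]
    for x0, x1 in spans[1:]:
        gap = x0 - covered_to
        if gap > best_width:
            best_width, split = gap, (x0 + covered_to) / 2
        covered_to = max(covered_to, x1)
    return split if best_width >= min_gap else None


def assign_zones(words, override_split=None):
    """Assign words to column zones; a user-supplied split wins over the guess."""
    split = override_split if override_split is not None else guess_column_split(words)
    if split is None:
        return {"column 1": [t for _, _, t in words]}
    return {
        "column 1": [t for x0, _, t in words if x0 < split],
        "column 2": [t for x0, _, t in words if x0 >= split],
    }


page = [(50, 90, "Left"), (100, 250, "column"), (320, 380, "Right"), (390, 500, "column")]
print(assign_zones(page))                      # software's guess
print(assign_zones(page, override_split=95))   # user override
```

The override is applied before any text is assigned to a zone, which mirrors the point made above: correcting the zones up front is much cheaper than repairing mis-flowed text in the output.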

It should be mentioned that one of the available options is the 'Save As' feature within Acrobat itself. Adobe Acrobat 6 offers a whole array of formats in which you can save a document, and it has a good degree of functionality: for example, it can find paragraphs, decolumnize text, and find and decompose tables. Depending on the complexity of your documents, this solution may suit your needs. Unfortunately, it doesn't allow user intervention after the page is "zoned" and before it is output, so you are forced to clean up those types of errors afterwards.

OCR software tools are another option. Some of the OCR tools that have been around for many years have recently added support for PDF Normal documents - and since they have fairly sophisticated page layout capabilities, they may produce decent results. You need to be aware, however, that they may simply be OCRing the page, in which case you may get text accuracy errors that normally would not be an issue with PDF Normal documents. Making use of the text layer in the PDF document is a better choice than attempting to "recognize" the document text.

Conclusion

In summary, there are no magic bullets - the features you need supported depend greatly on the complexity of your source documents. With any of the available tools, you will need to test carefully to see how well they convert your particular materials. These tools have made great strides in recent years. But because you can do so much in a PDF document, the conversion process will become ever more complex. You may find that a particular tool does very well with most of your documents, but breaks down on others; so it may be only a partial solution.

The various tools may output to different target formats, but these are somewhat interchangeable (an HTML document, for instance, can be imported directly into MS Word). Therefore, how well the program decomposes the various page elements is more important than the specific format in which it saves its output.

In recent years, Adobe has added the concept of re-flowable PDF constructs, so that publishing tools that produce PDF will be able to pass more information to the PDF documents. This is in Adobe's own best interests, as they would like to make their PDF documents more easily renderable on the ever more prevalent handheld devices.

Theoretically, this will aid PDF conversion software tools in doing their job. So PDF document conversion may get easier in the future.

Certain conversion tools make use of very specific features of the various output formats to replicate the look of the original page. In most cases, you should avoid using these representations of the converted PDF documents, since they tend to sacrifice logical structure in favor of a specific appearance. This makes the converted documents harder to maintain and less "re-purposable."

Lastly, it should be reiterated that, except for simple documents, you should not expect to get perfect final output from these conversion tools. Some level of proofing, manual review, and cleanup will always be required - and you should plan for it.

Because there are no magic bullets, our approach at DCL is to constantly re-evaluate the various tools on the market so that we can incorporate the best of what's out there into what we do. But since tools will be only part of the solution for the foreseeable future, and since what we deliver is ready-to-use documents, we've built the review and cleanup processes into our workflow as an integral part. We've found that the key to providing quality content is to have the right level of intervention and quality control at the right moments.

Mike Gross
11/3/2003

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.