Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Converting From PDF White Paper: Part 1

Converting From PDF To XML & MS Word: Avoiding The Pitfalls

If you're serious about converting from PDF, out-of-the-box solutions should stay-in-the-box, writes Mike Gross, Chief Technology Officer at Data Conversion Laboratory (DCL).

CONVERTING FROM PDF WHITE PAPER

Part 1: Problems of Converting From PDF

Part 2: Target Format Issues

IN TODAY'S ELECTRONIC/INTERNET AGE, documents can be created using a vast array of text formatting, word processing, desktop publishing, and drawing tools. But Adobe's PDF format has rapidly become a standard way of distributing documents in electronic format. As a result, most documents that are published today exist as a PDF document at some step along the way, even if they are ultimately going to paper.

Although PDF is an easy and convenient way to represent a paper document electronically, it is more correctly a page layout file format than a word processing or desktop publishing file format. So while some minor modifications to a document are possible via Adobe's Acrobat software, Acrobat is not intended as an actual source publishing tool (nor is it practical to use it as one).

There often arises a need to republish PDF documents, which requires converting from PDF into a format that is easier to work with. In this white paper, we will address some of the issues and problems you can expect to encounter when converting from PDF into a "source document" format.

Flavors of PDF

There are several different flavors of PDF. In this white paper we are looking at PDF Normal, which is what you get when you produce a PDF document from a text publishing tool (via Acrobat Distiller or other PDF writers). Adobe Acrobat also supports other flavors of PDF that contain raster images of each of the pages of the document (with or without some text in the background to allow text searching). These PDF documents are referred to as Image Only or as Image+Text. You get these when you scan paper documents (via Acrobat Exchange or some other method). They are typically used for delivering some form of PDF quickly. But converting from Image Only or Image+Text PDF is not much different from converting from paper, in the sense that OCR issues arise, so these flavors are beyond the scope of this white paper.

> > > For more on the different flavors of PDF, read our four-part white paper on converting to PDF.

Fundamental Problems of Converting From PDF

Extracting Document Text

The good thing about starting with PDF as your source format is that the actual text of the document is stored in an easily accessible way - not just the character chosen, but the specific font (such as Times New Roman). The font weight and font size are also specified in the PDF, so that in most cases, the text that you extract from the PDF document will be completely accurate. But even in the case of document text, there are certain elements in a PDF document that can produce errors when put through the conversion process. These are:

  1. Word Spaces - In most cases, the spaces between words are properly extracted from a PDF document. However, in some instances, the spacing between characters is such that a conversion program cannot completely know whether a space should be there. If the software guesses wrong, and inserts a space where there should not be one, you can end up with a word split in two when it was not meant to be. If, on the other hand, it does not insert a space where there was meant to be one, you can get two words incorrectly joined. Both of these are ugly errors in the converted document.

  2. Hyphens - End-of-line hyphens that appear in published documents are not distinguishable in the PDF from other types of hyphens that appear within lines, and this gives conversion software a problem. Assuming that it has properly determined the end of a line, it knows that most end-of-line hyphens are probably meant to be soft hyphens (only there to make the page layout look nice), but some might be hard hyphens (that are always there in certain word pairs, such as 'life-cycle'). Most conversion tools use dictionary-based algorithms to try to decide whether to leave these hyphens or remove them. These algorithms work pretty well. But unfortunately, there is simply no way to get this right 100% of the time.

  3. Emphasis, Super and Subscripting - Sometimes the PDF renders parts of the document in an indirect way, so getting font emphasis (such as Bold and Underline) correct is not always a given. In addition, vertical positioning within a PDF works in such a way that, as with word spacing, the extraction software has a tough time determining whether something is in fact super or subscripted, and it simply makes a guess.

  4. Special Characters and Sub-fonting - When converting special characters (such as foreign symbols and mathematical symbols), the source document often makes use of unusual or proprietary fonts, and the special characters need to be converted to more standard representations (such as ISO character entities or Unicode character representations). Typically, conversion software builds character conversion tables, but it is simply not feasible or practical to have these built for every font that a conversion program may encounter, so some of these characters may convert improperly. A more difficult problem relates to PDF's ability to do font embedding. A user can ask that only the part of a font that is used in a document be stored in the PDF file. Sometimes, when this is done, the characters within the sub-font are referred to through an indirect table within the PDF document, thereby making conversion of these characters extremely difficult. Many conversion tools "choke" on these types of characters - often rendering gibberish.
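
The first two items above can be made concrete with a minimal Python sketch. It is not how any particular conversion tool works. The `Glyph` record, the `gap_ratio` threshold and the tiny dictionary are all assumptions made for illustration. The sketch inserts a space when the gap between two glyphs exceeds a fraction of the font size, and removes an end-of-line hyphen only when the rejoined word is found in a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a line."""
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points
    size: float   # font size, in points


def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild a line of text, guessing a space wherever the gap between
    consecutive glyphs is wider than gap_ratio times the font size."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * g.size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def dehyphenate(lines, dictionary):
    """Join extracted lines into one string. An end-of-line hyphen is
    dropped (treated as soft) only if the rejoined word is in dictionary;
    otherwise it is kept (treated as hard)."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            words = text[:-1].split()
            head = words[-1] if words else ""
            tail = line.split()[0]
            joined = (head + tail).strip(".,;:!?").lower()
            if joined in dictionary:
                text = text[:-1] + line   # soft hyphen: remove it
            else:
                text = text + line        # hard hyphen: keep it
        elif text:
            text = text + " " + line
        else:
            text = line
    return text
```

Both functions are guesses by design. A tighter `gap_ratio` splits words that were merely letter-spaced, a looser one joins neighboring words, and a word missing from the dictionary keeps a hyphen it should have lost. That is why the output still needs proofing.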

The bottom line with text is that you can expect PDF conversion tools to get most of the text in your document correct. In some cases, it may get all of your text correct. But since the possibilities for error from all of the problem areas we mentioned above are real, it is a good idea to do some level of proofing on your documents to ensure that all of the text was extracted properly.
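
As a rough illustration of table-driven character conversion and of automated help for proofing, here is a minimal Python sketch. The two-entry `SYMBOL_TABLE` is a deliberately incomplete, illustrative table for a symbol font, and the function names are assumptions rather than part of any real tool. Codes from a table-driven font that have no table entry are replaced with U+FFFD, and a second pass flags characters that commonly signal a failed extraction so a human reviewer can look at them.

```python
import unicodedata

# Illustrative (and deliberately incomplete) conversion table:
# (font name, character code) -> Unicode character.
SYMBOL_TABLE = {
    ("Symbol", 0x61): "\u03b1",  # Greek small letter alpha
    ("Symbol", 0x62): "\u03b2",  # Greek small letter beta
}


def convert_char(font, code, table=SYMBOL_TABLE):
    """Return (text, ok). Fonts that appear in the table are converted
    through it; a code with no entry becomes U+FFFD and ok is False.
    Fonts not in the table are assumed to use standard character codes."""
    table_fonts = {name for name, _ in table}
    if font in table_fonts:
        if (font, code) in table:
            return table[(font, code)], True
        return "\ufffd", False
    return chr(code), True


def flag_for_proofing(text):
    """Return the indexes of characters that suggest extraction trouble:
    the replacement character, private-use or unassigned code points,
    and control characters other than newline and tab."""
    suspects = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in ("Co", "Cn"):
            suspects.append(i)
        elif category == "Cc" and ch not in "\n\t":
            suspects.append(i)
    return suspects
```

A check like this only narrows down where a reviewer should look. It cannot tell whether a character that converted to something plausible is actually the right one.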

DID YOU KNOW?

Data Conversion Laboratory (DCL) uses the best and most up-to-date software to assist in the process of converting from PDF to XML, MS Word, and other electronic formats. This white paper comes out of our development team's research into the issues that prevent conversion from PDF from being a fully automated process.

But conversion is just part of the service DCL provides. Our process includes software that takes automation as far as feasible. This is used in conjunction with software that checks for the issues discussed in this white paper and identifies the problem areas. But we also use expert reviewers - real live humans - to review the results of the conversion process and make sure that what gets delivered is ready for prime time.

Extracting Document Structure

On the whole, extraction of text from a PDF document works pretty well. The big downside is that the PDF document specifies where text is positioned on the page, but not much else. In most cases there is no information about the structure of a document. To do a decent job, conversion software is forced to "reverse engineer" the structure. This involves educated guessing, which sometimes leads to mistakes.
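
To give a feel for what this reverse engineering looks like, here is a minimal Python sketch that guesses paragraph breaks from nothing but line positions. The `Line` record and the `gap_factor` and `indent` thresholds are assumptions chosen for illustration, not values taken from any real converter. A new paragraph is guessed when the vertical distance to the previous line is unusually large, or when a line starts noticeably further right than the line before it.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted line of text."""
    text: str
    top: float     # distance from the top of the page, in points
    left: float    # left edge, in points
    height: float  # line height, in points


def group_paragraphs(lines, gap_factor=1.5, indent=10.0):
    """Group positioned lines into paragraphs using two guesses:
    a large vertical gap, or a first-line indent, starts a new paragraph."""
    paragraphs = []
    current = []
    for line in sorted(lines, key=lambda item: item.top):
        if current:
            prev = current[-1]
            big_gap = (line.top - prev.top) > gap_factor * prev.height
            indented = (line.left - prev.left) >= indent
            if big_gap or indented:
                paragraphs.append(" ".join(item.text for item in current))
                current = []
        current.append(line)
    if current:
        paragraphs.append(" ".join(item.text for item in current))
    return paragraphs
```

Neither signal is reliable on its own. A document set with no extra space between paragraphs and no indents gives this sketch nothing to work with, and it would run everything together as one paragraph.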

Now let's discuss the structural elements that typically cause problems:

  1. Multiple (Newspaper) Columns - Many PDF documents contain "newspaper" type columns of text. Unfortunately, there is usually no column boundary demarcation in the PDF. So software is forced to guess which text belongs in each column of the page and break it apart based on the page geometry. This task is often accomplished successfully by conversion tools. But some layouts give them problems, especially with short columns, where there is not much useful geometry information. Getting this wrong is particularly painful, because lines from separate paragraphs end up completely intermingled, producing a very ugly paragraph that is hard to comprehend and very difficult to clean up.

  2. Text Flows - Some documents, such as magazine articles and textbooks, often have text boxes set off to the side, and commentary text that runs alongside paragraphs, in which case the text flow from paragraph to paragraph is not at all obvious. This is also a challenge for conversion software.

  3. Paragraph Delineation - In most cases, there is nothing in the PDF document indicating where a paragraph ends (usually noted by inserting a hard return) and a new one begins, so this too is guessed at by software. Again, shorter paragraphs are harder to determine. You should not expect software to guess paragraph delineation correctly all the time. You may get two paragraphs running together as one, or one paragraph running as two. Paragraphs that span two pages are also difficult to deal with.

  4. Page Headers and Footers - Typical published pages contain headers and footers at the tops and bottoms of pages. Although this information appears on every page of the source, you usually do not want it to appear at the same places in the target document. So conversion software needs to attempt to guess which text belongs to these elements.

  5. Tables - One of the hardest document elements to deal with is tables and tabular material. Ideally, you want a conversion tool to faithfully represent the original table in the converted document. The reality is that there is so much to a table that you should not expect a tool to get it completely right. Table attributes, such as column and row delineation, header and body delineation, vertical and horizontal cell spanning, cell separators, and vertical and horizontal cell alignment, all require a certain amount of guessing. And given that even a human sometimes needs to study a table for a few minutes to infer the table's structure, it's just not reasonable to expect software to get this right. Even recognizing that something is a table versus some other element is not trivial.

  6. Graphics - When there are graphics within a document, the conversion software will typically convert the graphic to some sort of raster image format. However, in some cases, guessing which parts of the page belong to the graphic, as well as determining what comprises the graphic caption, can be quite tricky.

  7. Mathematical Equations - These are extremely complicated document elements, which are usually authored using sophisticated equation authoring tools. You should not expect these to be extracted from a source document. You will either need to leave them as images in the converted documents (which may be okay, depending on the situation); or you may have to have them rekeyed in your new source document, in whatever format you decide to use to represent mathematical equations.
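
Two of the guesses described in the list above, column boundaries and page headers and footers, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any real tool. The `(x_left, x_right, text)` tuples, the `min_gap` gutter width and the `min_share` repetition threshold are all hypothetical. Columns are split at the first vertical gutter that no line crosses, and a first or last line is treated as a running header or footer if, with digits masked, it repeats on most pages.

```python
import re
from collections import Counter


def split_columns(lines, min_gap=18.0):
    """lines: list of (x_left, x_right, text) in reading order.
    If an empty vertical gutter at least min_gap points wide separates
    the lines, return (left_texts, right_texts); otherwise (all_texts, [])."""
    ordered = sorted(lines, key=lambda item: item[0])
    reach = ordered[0][1] if ordered else 0.0
    for x_left, x_right, _ in ordered[1:]:
        if x_left - reach >= min_gap:
            left = [t for xl, _, t in lines if xl < x_left]
            right = [t for xl, _, t in lines if xl >= x_left]
            return left, right
        reach = max(reach, x_right)
    return [t for _, _, t in lines], []


def _norm(line):
    """Mask digits so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, min_share=0.6):
    """pages: list of pages, each a list of text lines, top to bottom.
    Return normalized first/last lines seen on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        if page:
            counts.update({_norm(page[0]), _norm(page[-1])})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_running_lines(pages, min_share=0.6):
    """Drop the first and last line of each page when they look like
    running headers or footers."""
    running = find_running_lines(pages, min_share)
    cleaned = []
    for page in pages:
        body = list(page)
        if body and _norm(body[0]) in running:
            body = body[1:]
        if body and _norm(body[-1]) in running:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

The weaknesses match the ones described in the list. A short column or a full-width heading can hide the gutter, and a chapter title that changes every few pages may fall below the repetition threshold and be left in the text.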

> > > Continue with Part 2 of this PDF white paper.

Mike Gross
10/8/2003

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.