From PDF to E-Book: Problems and Solutions for PDF to ePub Conversions

Because PDF is a print format, PDF documents are typically less-structured versions of their word-processor originals. While PDF content is laid out to look good, it includes very little structure; that is, it contains few clues as to the function of text elements (e.g., paragraphs, spaces, line breaks) or how they ought to be displayed in a different context (for instance, an e-book). For this reason, converting a PDF document to ePub is generally best accomplished by first performing the intermediary step of converting the PDF document to a more structured text format, such as Microsoft Word or another word processor format.
The issues that arise when converting PDF to Word are similar to the obstacles encountered when converting PDF to any other format. While converting thoroughly structured content to ePub is typically a straightforward process, most of the difficulty in PDF-to-ePub conversion has to do with properly extracting the content from the PDF to begin with.
The degree of difficulty involved in extracting PDF content depends largely on two factors: 1) the degree of structure included in the document to be converted; and 2) the nature of the source content. As with any document conversion, the more structure included in the original PDF document, the easier it is to properly extract the content and convert. But while it is possible to include some degree of structure in PDF documents, they are typically much less structured than their word processor originals. This is why the nature of the source content is significant, since a few paragraphs of simple text (requiring relatively little structure) are likely to include fewer of the conversion obstacles than might be found in a complex scientific document that may contain numerous special characters, unusual text alignment, and tables.
- Extraction Sample: PDF Page of a science textbook in PDF Normal format
- Extraction Sample: Same page after undergoing PDF-to-Word extraction. Notice that spacing, emphasis, and special characters are not accurately reproduced by the extraction tool.
The greater the difference in structure between a word processor source document (typically very structured) and a PDF document (typically unstructured), the more likely it is that your conversion software will have to “guess” as to the intended document structure—and the more guessing required, the higher the chances that conversion software will run into obstacles when extracting text from the PDF.
Common obstacles found in PDF-to-Word conversion include:
Word Spaces
These are usually extracted correctly, but since PDF documents create spaces visually (i.e., they are not really labeled as “one standard space” or “two standard spaces”), spacing between words is sometimes misinterpreted by conversion software, causing spaces to be added or deleted incorrectly during PDF-to-Word extraction. See the extraction sample above for an example of incorrectly extracted word spaces.
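To make the guessing concrete, here is a minimal sketch of how an extraction tool might infer word spaces from glyph positions. The glyph tuple layout, the `rebuild_line` name, and the `gap_ratio` threshold are all assumptions made for illustration; they are not taken from any particular conversion product.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end), measured in points.
Glyph = Tuple[str, float, float]

def rebuild_line(glyphs: List[Glyph], font_size: float, gap_ratio: float = 0.25) -> str:
    """Insert a space wherever the horizontal gap between two consecutive
    glyphs exceeds gap_ratio * font_size (an assumed threshold)."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g[1])  # left to right
    out = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - prev[2]  # white space between the two glyphs
        if gap > gap_ratio * font_size:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)

line = [("h", 0.0, 5.0), ("i", 5.2, 7.0),
        ("y", 11.0, 16.0), ("o", 16.1, 21.0), ("u", 21.3, 26.0)]

print(rebuild_line(line, font_size=10.0))                 # hi you
print(rebuild_line(line, font_size=10.0, gap_ratio=0.5))  # hiyou
```

The same glyphs yield "hi you" or "hiyou" depending only on the threshold, which is why tightly or loosely set text can gain or lose spaces during extraction.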
Paragraph Delineation
In most cases, PDF documents contain no explicit information to indicate where a paragraph begins or ends, so this too must be guessed at by conversion software, based on its “visual” interpretation of the appearance of chunks of text. While conversion software frequently does guess correctly, paragraph delineation can be a source of extraction errors, particularly when paragraphs are very short or span pages.
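A rough sketch of one such visual heuristic follows. It assumes each extracted line arrives with its left edge and baseline position, and it starts a new paragraph on an unusually large vertical gap or a first-line indent. The function name, the tuple layout, and both thresholds are hypothetical.

```python
from typing import List, Tuple

# Hypothetical line record: (text, left_x, baseline_y), with y growing downward.
Line = Tuple[str, float, float]

def group_paragraphs(lines: List[Line], leading: float, left_margin: float,
                     gap_factor: float = 1.5, indent_tol: float = 2.0) -> List[str]:
    """Guess paragraph boundaries from layout alone."""
    paragraphs: List[List[str]] = []
    prev_y = None
    for text, left_x, y in lines:
        # A gap clearly larger than normal line spacing suggests a break.
        big_gap = prev_y is not None and (y - prev_y) > gap_factor * leading
        # A line starting to the right of the margin suggests a first-line indent.
        indented = (left_x - left_margin) > indent_tol
        if not paragraphs or big_gap or indented:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        prev_y = y
    return [" ".join(p) for p in paragraphs]

page = [
    ("First paragraph starts", 72.0, 100.0),
    ("and continues here.", 72.0, 112.0),
    ("Second one after a gap.", 72.0, 136.0),
    ("Third is indented", 90.0, 148.0),
    ("and wraps.", 72.0, 160.0),
]
for paragraph in group_paragraphs(page, leading=12.0, left_margin=72.0):
    print(paragraph)
```

A heuristic like this has no way of knowing that the first line on a new page continues the paragraph from the previous one, and a one-line paragraph set flush left with normal spacing is indistinguishable from a continuation line.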
Hyphens
Hyphens pose a problem because they serve various purposes among which an automated system cannot distinguish. While the hyphen joining a term such as “half-life” should appear no matter where the words are placed within a document, a hyphen that appears halfway through a word because of a line break (e.g., hyphen-ated) becomes an ugly error once the word is moved to the middle of a line.
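One common workaround, sketched below under the assumption that a word list is available, is to drop a line-end hyphen only when the joined word is a known word. The `join_lines` name and the tiny vocabulary are illustrative only.

```python
from typing import Iterable, List, Set

def join_lines(lines: Iterable[str], vocabulary: Set[str]) -> str:
    """Rejoin words split across lines. A trailing hyphen is dropped only
    if the merged word appears in the (assumed) vocabulary."""
    words: List[str] = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if words and words[-1].endswith("-"):
            merged = words[-1][:-1] + tokens[0]
            if merged.strip(".,;:!?").lower() in vocabulary:
                words[-1] = merged                 # soft hyphen from a line break
            else:
                words[-1] = words[-1] + tokens[0]  # keep the real hyphen
            tokens = tokens[1:]
        words.extend(tokens)
    return " ".join(words)

vocab = {"hyphenated"}
text = ["The word is hyphen-", "ated, but half-", "life keeps its hyphen."]
print(join_lines(text, vocab))
# The word is hyphenated, but half-life keeps its hyphen.
```

The result is only as good as the word list: a technical term missing from the vocabulary keeps its stray hyphen, and a compound whose joined form happens to be a valid word loses a hyphen it should have kept.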
Emphasis
Depending on how a document is rendered in PDF, extracting the correct emphasis from a PDF document can sometimes pose problems for conversion software. Again, this is because PDF structure is nothing more than a visual representation; while text may appear emphasized, the PDF does not tag it as “emphasized”—conversion software must make its best guess based on what it can glean from the text’s appearance. See the extraction sample above for an example of incorrectly extracted emphasis.
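As an illustration of such a best guess, the sketch below infers emphasis from the font name attached to a run of text, on the assumption that the name contains a style word such as "Bold" or "Italic". The `guess_emphasis` helper is hypothetical.

```python
from typing import Dict

def guess_emphasis(font_name: str) -> Dict[str, bool]:
    """Guess bold/italic from a font name such as 'ABCDEF+Times-BoldItalic'.
    Subset fonts carry a 'TAG+' prefix, which is stripped first."""
    base = font_name.split("+", 1)[-1].lower()
    return {
        "bold": "bold" in base,
        "italic": "italic" in base or "oblique" in base,
    }

print(guess_emphasis("ABCDEF+Times-BoldItalic"))  # {'bold': True, 'italic': True}
print(guess_emphasis("Helvetica-Oblique"))        # {'bold': False, 'italic': True}
print(guess_emphasis("F17"))                      # {'bold': False, 'italic': False}
```

When the font name is opaque, as in the last call, or when emphasis was produced by other visual means, a name-based guess finds nothing and the emphasis is silently lost.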
Superscripting and Subscripting
A PDF document records superscripts and subscripts only by their position on the page; there is no “superscript” or “subscript” tag. Extraction software must therefore infer them from the vertical alignment and size of the text, and it frequently misinterprets them as a result.
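The sketch below shows the kind of test an extraction tool might apply: a run of text is treated as a superscript or subscript only if it is both smaller than the surrounding text and shifted off the line's baseline. The function name and the two ratios are assumptions chosen for illustration.

```python
def classify_script(baseline_y: float, font_size: float,
                    line_baseline_y: float, line_font_size: float,
                    shift_ratio: float = 0.2, size_ratio: float = 0.85) -> str:
    """Classify a text run by position and size alone.
    Coordinates follow the PDF convention of y growing upward."""
    smaller = font_size <= size_ratio * line_font_size
    shift = baseline_y - line_baseline_y
    if smaller and shift > shift_ratio * line_font_size:
        return "superscript"
    if smaller and shift < -shift_ratio * line_font_size:
        return "subscript"
    return "normal"

print(classify_script(104.0, 7.0, 100.0, 12.0))  # superscript
print(classify_script(97.0, 7.0, 100.0, 12.0))   # subscript
print(classify_script(101.5, 7.0, 100.0, 12.0))  # normal
```

The last call is a small character raised by only 1.5 points. It falls under the assumed threshold and is flattened into normal text, which is how an exponent can end up on the baseline after extraction.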
Special Characters
In PDF documents, special characters like foreign or mathematical symbols are frequently represented by unusual or proprietary fonts. In order to extract them to a word processor, these characters first need to be converted to a more standard character representation (e.g., ISO or Unicode). While many conversion software suites build conversion tables to handle such characters, it is impossible to keep up with the vast variety of atypical and proprietary fonts in use, and so many special characters fail to extract properly. See the extraction sample above for an example of incorrectly extracted special characters.
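A conversion table of this kind can be pictured as a per-font lookup from the stored character to its Unicode equivalent. The sketch below uses a three-entry excerpt for the Symbol font, in which the codes for "a", "b", and "p" display as Greek alpha, beta, and pi; the table structure and the `to_unicode` helper are assumptions, not the design of any specific tool.

```python
from typing import Dict

# Illustrative excerpt only; a real table would cover the whole font.
FONT_TABLES: Dict[str, Dict[str, str]] = {
    "Symbol": {"a": "\u03b1", "b": "\u03b2", "p": "\u03c0"},
}

def to_unicode(font_name: str, raw_text: str,
               tables: Dict[str, Dict[str, str]] = FONT_TABLES) -> str:
    """Map text set in a special font to Unicode using a per-font table."""
    table = tables.get(font_name)
    if table is None:
        # No table for this font: the raw codes pass through unchanged,
        # so the reader sees ordinary letters instead of the symbols.
        return raw_text
    # Known font, unknown character: mark it with U+FFFD rather than guess.
    return "".join(table.get(ch, "\ufffd") for ch in raw_text)

print(to_unicode("Symbol", "ab"))    # αβ
print(to_unicode("AcmeMath", "ab"))  # ab
```

"AcmeMath" stands in for a proprietary font with no table. Its symbols come out as the plain letters "ab", which is the kind of silent error that only a review of the converted text will catch.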
Sub-fonting
PDF’s approach to font embedding is another obstacle to proper extraction. Sometimes when PDFs are created, the PDF document does not store the information for the entire font, but rather stores only the parts of the font which are used in a given document. The characters within this “sub-font” are accessed via an indirect table within the PDF document itself, making correct interpretation and extraction of sub-fonted characters difficult. Many conversion tools cannot extract these characters at all, and produce “garbage” text instead of accurately extracted content. See the extraction sample above for an example of “garbage” text.
Tables
Tables are among the trickiest document elements to extract. This is because the appearance of even a simple table is determined by numerous attributes, including but not limited to column and row delineation, header and body delineation, vertical and horizontal cell spanning, cell separators, and vertical and horizontal cell alignment. With none of this information included in the source PDF, it is nearly impossible for an automated tool to reproduce a table exactly as it appeared in the original document.
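To show how much guessing is involved even for a simple grid, here is a minimal sketch that rebuilds rows and columns purely from word positions. It assumes words arrive in reading order with a left edge and a baseline, and that every column is left-aligned; the `rebuild_table` name and the tolerances are hypothetical.

```python
from typing import List, Tuple

# Hypothetical word record: (text, x_left, baseline_y), with y growing downward.
Word = Tuple[str, float, float]

def _cluster(values: List[float], tol: float) -> List[float]:
    """Collapse sorted positions into anchors separated by more than tol."""
    anchors: List[float] = []
    for v in sorted(values):
        if not anchors or v - anchors[-1] > tol:
            anchors.append(v)
    return anchors

def rebuild_table(words: List[Word], x_tol: float = 5.0,
                  y_tol: float = 2.0) -> List[List[str]]:
    """Guess a grid: column anchors from left edges, rows from baselines."""
    cols = _cluster([w[1] for w in words], x_tol)
    rows = _cluster([w[2] for w in words], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for text, x, y in words:
        c = min(range(len(cols)), key=lambda i: abs(cols[i] - x))
        r = min(range(len(rows)), key=lambda i: abs(rows[i] - y))
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid

words = [("Element", 72.0, 100.0), ("Symbol", 200.0, 100.0),
         ("Iron", 72.0, 114.0), ("Fe", 200.5, 114.0),
         ("Gold", 72.0, 128.0), ("Au", 200.0, 128.0)]
print(rebuild_table(words))
# [['Element', 'Symbol'], ['Iron', 'Fe'], ['Gold', 'Au']]
```

This works for the tidy example, but a two-word header such as "Atomic number" is split into two columns because the second word starts at a new left edge. Spanned cells, centered text, and header rows are not handled at all, since none of that information survives in the positions.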
While some short or simple documents may be able to undergo a PDF-to-Word (and subsequent Word-to-ePub) conversion with minimal difficulty, any long or complex document set will encounter several of these obstacles. The obstacles inherent in any PDF text extraction should underscore, first, the utility of retaining original versions of source documents in word processor format, if possible; and second, the critical importance of a good quality assurance strategy in any conversion process.
Quality assurance is included as a component of all DCL conversion services. We also offer stand-alone quality assurance services, which may be used for independent reviews of converted results or to oversee an in-house or outsourced conversion project. For more information on PDF-to-Word conversion, PDF-to-E-Book conversion, or DCL’s quality assurance services, contact us.