From PDF to E-Book: Problems and Solutions for PDF to ePub Conversions

04/29/2010 Posted in Featured, Online Publishing

As the e-book business starts to boom in earnest (see Coming Soon: The Dawn of the Digital Textbook and E-Books, iPads, and the Next-Big-Thing), many publishers find themselves needing to convert their PDF documents to the e-reader-friendly ePub standard.

But since PDF is a print format, PDF documents are typically less-structured versions of their word-processor originals. While PDF content is laid out to look good, it includes very little structure: it contains few clues as to the function of text elements (e.g., paragraphs, spaces, line breaks) or how they ought to be displayed in a different context (for instance, an e-book). For this reason, converting a PDF document to ePub is generally best accomplished by first converting the PDF to a more structured format, such as a Microsoft Word document or a similar word-processor file.

The issues that arise when converting PDF to Word are similar to the obstacles encountered when converting PDF to any other format. While converting thoroughly structured content to ePub is typically a straightforward process, most of the difficulty in PDF-to-ePub conversion has to do with properly extracting the content from the PDF to begin with.

The degree of difficulty involved in extracting PDF content depends largely on two factors: 1) the degree of structure included in the document to be converted; and 2) the nature of the source content. As with any document conversion, the more structure included in the original PDF document, the easier it is to properly extract the content and convert. But while it is possible to include some degree of structure in PDF documents, they are typically much less structured than their word processor originals. This is why the nature of the source content is significant, since a few paragraphs of simple text (requiring relatively little structure) are likely to include fewer of the conversion obstacles than might be found in a complex scientific document that may contain numerous special characters, unusual text alignment, and tables.

The greater the difference in structure between a word processor source document (typically very structured) and a PDF document (typically unstructured), the more likely it is that your conversion software will have to “guess” as to the intended document structure—and the more guessing required, the higher the chances that conversion software will run into obstacles when extracting text from the PDF.

Common obstacles found in PDF-to-Word conversion include:

Word Spaces

These are usually extracted correctly, but since PDF documents create spaces visually (i.e., they are not really labeled as “one standard space” or “two standard spaces”), spacing between words is sometimes misinterpreted by conversion software, causing spaces to be added or deleted incorrectly during PDF-to-Word extraction. See the extraction sample above for an example of incorrectly extracted word spaces.
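
As a rough illustration of why word spaces get guessed wrong, here is a minimal Python sketch. It assumes the extractor sees only positioned glyphs and inserts a space wherever the horizontal gap passes a threshold. The `Glyph` class, the `layout` helper, and the `gap_ratio` value are hypothetical names and numbers chosen for this example, not taken from any particular conversion tool.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge, in points
    x1: float  # right edge, in points


def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild a line of text from positioned glyphs.

    A space is inserted wherever the gap between two neighbouring glyphs is
    wider than gap_ratio times the average glyph width. The threshold is a
    guess, which is exactly why this step can go wrong.
    """
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g.x0)
    avg_width = sum(g.x1 - g.x0 for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        if cur.x0 - prev.x1 > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def layout(text, char_width=5.0, tracking=0.5, space_width=2.5):
    """Hypothetical helper: place text as glyphs, with spaces stored only as gaps."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width  # a space leaves no glyph, only extra distance
            continue
        glyphs.append(Glyph(ch, x, x + char_width))
        x += char_width + tracking
    return glyphs


print(join_glyphs(layout("half life")))                   # normal spacing
print(join_glyphs(layout("half life", space_width=0.5)))  # tight spacing
print(join_glyphs(layout("ab", tracking=2.0)))            # loose letter spacing
```

With ordinary spacing the line comes back intact. Tightly set text loses its space, and loosely tracked text gains spaces between letters, which mirrors the two failure modes described above.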

Paragraph Delineation

In most cases, PDF documents contain no explicit information to indicate where a paragraph begins or ends, so this too must be guessed at by conversion software, based on its “visual” interpretation of the appearance of chunks of text. While conversion software frequently does guess correctly, paragraph delineation can be a source of extraction errors, particularly when paragraphs are very short or span pages.
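
The kind of "visual" guess described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each line of text arrives with its position on a single page, and a new paragraph is signalled by either extra vertical space or a first-line indent. The `Line` class and the `gap_factor` and `indent_tol` thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: float      # left edge, in points
    top: float     # distance from the top of the page
    bottom: float  # distance from the top of the page


def group_paragraphs(lines, gap_factor=0.8, indent_tol=2.0):
    """Group the lines of one page into paragraphs by appearance alone.

    A new paragraph starts when the vertical gap to the previous line is
    large relative to the line height, or when the line is indented past
    the left margin.
    """
    if not lines:
        return []
    left_margin = min(line.x0 for line in lines)
    paragraphs = [[lines[0].text]]
    for prev, cur in zip(lines, lines[1:]):
        line_height = prev.bottom - prev.top
        vertical_gap = cur.top - prev.bottom
        indented = cur.x0 - left_margin > indent_tol
        if vertical_gap > gap_factor * line_height or indented:
            paragraphs.append([cur.text])
        else:
            paragraphs[-1].append(cur.text)
    return [" ".join(p) for p in paragraphs]


page = [
    Line("First paragraph, line one", 72, 100, 110),
    Line("and line two.", 72, 112, 122),
    Line("Second paragraph.", 72, 134, 144),
    Line("Indented third.", 90, 146, 156),
]
for paragraph in group_paragraphs(page):
    print(paragraph)
```

Because the sketch works one page at a time, a paragraph that continues onto the next page would be split in two, and a one-line paragraph set without extra space or indent would be absorbed into its neighbour. Those are the cases the text above singles out.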

Hyphens

Hyphens pose a problem because they serve various purposes among which an automated system cannot distinguish. While the hyphen joining a term such as “half-life” should appear no matter where the words are placed within a document, a hyphen that appears halfway through a word because of a line break (e.g., hyphen-ated) becomes an ugly error once the word is moved to the middle of a line.
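
One common workaround, sketched below in Python, is to check the rejoined word against a word list: if the word exists without the hyphen, the hyphen is assumed to come from a line break and is dropped. This is an assumption-laden illustration rather than a description of any specific product. The function name and the tiny `known_words` set are hypothetical.

```python
def join_lines(lines, known_words):
    """Join extracted lines, deciding for each line-end hyphen whether to keep it.

    If the fragments on either side of the hyphen form a known word when glued
    together, the hyphen is treated as a line-break hyphen and removed.
    Otherwise it is kept, on the guess that it belongs to the term itself.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            head = out[:-1]
            last_fragment = head.split()[-1] if head.split() else ""
            first_fragment = line.split()[0]
            candidate = (last_fragment + first_fragment).strip('.,;:!?"()').lower()
            if candidate in known_words:
                out = head + line  # line-break hyphen: drop it
            else:
                out = out + line   # real hyphen: keep it
        elif out:
            out += " " + line
        else:
            out = line
    return out


known_words = {"hyphenated", "half", "life"}  # stand-in for a real dictionary
print(join_lines(["The word was hyphen-", "ated at the margin."], known_words))
print(join_lines(["Measure the half-", "life of the sample."], known_words))
```

The word list only shifts the guess. A pair such as "re-cover" and "recover", where both forms are valid words, still cannot be resolved automatically.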

Emphasis

Depending on how a document is rendered in PDF, extracting the correct emphasis from a PDF document can sometimes pose problems for conversion software. Again, this is because PDF structure is nothing more than a visual representation; while text may appear emphasized, the PDF does not tag it as “emphasized”—conversion software must make its best guess based on what it can glean from the text’s appearance. See the extraction sample above for an example of incorrectly extracted emphasis.
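
One clue an extractor can fall back on is the name of the font used for each run of text. The Python sketch below assumes the PDF kept descriptive font names (such as "Times-Bold"); that is often true but never guaranteed. The function names and keyword lists are hypothetical, and the `<b>`/`<i>` output is just a convenient way to show the result.

```python
def guess_emphasis(font_name):
    """Guess (bold, italic) from a font name alone.

    Assumption: the producing application kept a descriptive font name.
    An opaque name such as "F7" gives the guesser nothing to work with.
    """
    # Subset fonts carry a six-letter prefix, e.g. "ABCDEF+Times-Bold"; drop it.
    name = font_name.split("+", 1)[-1].lower()
    bold = any(word in name for word in ("bold", "black", "heavy"))
    italic = any(word in name for word in ("italic", "oblique"))
    return bold, italic


def to_markup(runs):
    """Wrap each (text, font_name) run in <b>/<i> tags according to the guess."""
    parts = []
    for text, font_name in runs:
        bold, italic = guess_emphasis(font_name)
        if italic:
            text = f"<i>{text}</i>"
        if bold:
            text = f"<b>{text}</b>"
        parts.append(text)
    return "".join(parts)


runs = [
    ("A ", "Times-Roman"),
    ("very", "ABCDEF+Times-Italic"),
    (" bold", "F7"),  # set in a bold face, but the name does not say so
]
print(to_markup(runs))
```

In this made-up example the italic word is recovered, while the bold word is lost because its font name carries no hint of weight.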

Superscripting and Subscripting

Since PDF documents’ treatment of super and subscripts is limited to the way they appear when laid out in the PDF (rather than by some kind of “superscript” or “subscript” tag), extraction software tends to run into problems with determining the vertical alignment of text. As a result, super and subscripts are frequently misinterpreted by extraction software.
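
To make the vertical-alignment guess concrete, here is a small Python sketch. It assumes each run of text is known only by its font size and baseline position, and it compares those against the surrounding body text. The `Span` class and the two ratio thresholds are hypothetical values picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    size: float      # font size, in points
    baseline: float  # baseline height above the bottom of the page (y grows upward)


def classify(span, body_size, body_baseline, size_ratio=0.8, shift_ratio=0.15):
    """Label a span "sup", "sub", or "normal" from geometry alone.

    A span counts as a superscript or subscript only if it is noticeably
    smaller than the body text AND its baseline is shifted up or down by
    more than shift_ratio times the body font size.
    """
    smaller = span.size <= size_ratio * body_size
    shift = span.baseline - body_baseline
    if smaller and shift > shift_ratio * body_size:
        return "sup"
    if smaller and shift < -shift_ratio * body_size:
        return "sub"
    return "normal"


# Body text: 10 pt, baseline at y = 500.
print(classify(Span("2", 6, 504), 10, 500))  # raised and small
print(classify(Span("2", 6, 498), 10, 500))  # lowered and small
print(classify(Span("2", 6, 501), 10, 500))  # raised only slightly
```

The third span is a superscript that was set with a small rise. It falls under the threshold and comes out as ordinary text, which is the kind of misreading described above.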

Special Characters

In PDF documents, special characters like foreign or mathematical symbols are frequently represented by unusual or proprietary fonts. In order to extract them to a word processor, these characters first need to be converted to a more standard character representation (e.g., ISO or Unicode). While many conversion software suites build conversion tables to handle such characters, it is impossible to keep up with the vast variety of atypical and proprietary fonts in use, and so many special characters fail to extract properly. See the extraction sample above for an example of incorrectly extracted special characters.
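
A conversion table of the sort mentioned here can be pictured as a lookup keyed by font name and character code. The Python sketch below is illustrative only: the table holds three entries for the Symbol font, the font name "AcmeMathPi" is invented, and the function name is hypothetical. Anything without a table entry becomes the Unicode replacement character.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character

# Illustrative conversion table: font name -> {character code -> Unicode}.
# A real tool would need one table per symbol font it hopes to support.
CONVERSION_TABLES = {
    "Symbol": {
        0x61: "\u03b1",  # alpha
        0x62: "\u03b2",  # beta
        0xB1: "\u00b1",  # plus-minus
    },
}


def convert_run(font_name, codes):
    """Map a run of character codes in a symbol font to Unicode text."""
    table = CONVERSION_TABLES.get(font_name)
    if table is None:
        # Unknown or proprietary font: nothing can be recovered.
        return REPLACEMENT * len(codes)
    return "".join(table.get(code, REPLACEMENT) for code in codes)


print(convert_run("Symbol", [0x61, 0xB1, 0x62]))
print(convert_run("AcmeMathPi", [0x61]))  # invented proprietary font, no table
```

The second call shows the limit of the approach: the same code that means alpha in one font is meaningless in a font the table has never seen.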

Sub-fonting

PDF’s approach to font embedding is another obstacle to proper extraction. Sometimes when PDFs are created, the PDF document does not store the information for the entire font, but rather stores only the parts of the font which are used in a given document. The characters within this “sub-font” are accessed via an indirect table within the PDF document itself, making correct interpretation and extraction of sub-fonted characters difficult. Many conversion tools cannot extract these characters at all, and produce “garbage” text instead of accurately extracted content. See the extraction sample above for an example of “garbage” text.
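
The effect of that indirect table can be imitated with a toy model in Python. The sketch assumes a sub-font that numbers its characters 1, 2, 3, and so on in order of first use, so the stored codes mean nothing without the accompanying table. Both function names and the numbering scheme are hypothetical simplifications of what a real PDF does.

```python
def build_subset(text):
    """Toy sub-font: number each distinct character in order of first use.

    Returns the stored codes and the lookup table (code -> character) that
    is needed to turn them back into text.
    """
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    table = {code: ch for ch, code in code_for.items()}
    return codes, table


def decode(codes, table=None):
    """Decode stored codes, with or without the sub-font's lookup table."""
    if table is None:
        # A tool that ignores the table treats codes as ordinary character codes.
        return "".join(chr(code) for code in codes)
    return "".join(table.get(code, "\ufffd") for code in codes)


codes, table = build_subset("letter")
print(codes)                       # the numbers actually stored for the word
print(decode(codes, table))        # decoded through the table
print(repr(decode(codes)))         # decoded naively: control characters
```

Read through the table, the word survives. Read naively, the same codes turn into unprintable control characters, which is one way "garbage" text arises.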

Tables

Tables are among the trickiest document elements to extract. This is because the appearance of even a simple table is determined by numerous attributes, including but not limited to column and row delineation, header and body delineation, vertical and horizontal cell spanning, cell separators, and vertical and horizontal cell alignment. With none of this information included in the source PDF, it is nearly impossible for an automated tool to reproduce a table exactly as it appeared in the original document.
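
A simple way to see the problem is to try rebuilding a table from word positions alone. The Python sketch below assumes each word arrives as a tuple of text and coordinates, groups words into rows by vertical position, and starts a new cell at every wide horizontal gap. The function name, the tolerances, and the sample data are all made up for illustration.

```python
def rows_from_words(words, row_tol=3.0, col_gap=15.0):
    """Rebuild table rows from positioned words.

    words: list of (text, x0, x1, y) tuples, y measured from the top of the page.
    Words whose y values differ by at most row_tol share a row; within a row,
    a horizontal gap wider than col_gap starts a new cell.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w[3], w[1])):
        if rows and abs(word[3] - rows[-1][0]) <= row_tol:
            rows[-1][1].append(word)
        else:
            rows.append((word[3], [word]))

    table = []
    for _, row_words in rows:
        row_words.sort(key=lambda w: w[1])
        cells = [row_words[0][0]]
        for prev, cur in zip(row_words, row_words[1:]):
            if cur[1] - prev[2] > col_gap:
                cells.append(cur[0])         # wide gap: next cell
            else:
                cells[-1] += " " + cur[0]    # narrow gap: same cell
        table.append(cells)
    return table


words = [
    ("Item", 72, 95, 100), ("Unit", 200, 220, 100), ("price", 223, 248, 100),
    ("Widget", 72, 105, 115), ("4.50", 200, 220, 115), ("USD", 223, 243, 115),
    ("9.00", 200, 220, 130), ("USD", 223, 243, 130),  # first cell left empty
]
for row in rows_from_words(words):
    print(row)
```

The first two rows come out as expected, but the third row, whose first cell is empty, collapses to a single cell with no record of which column it belonged to. Spanned cells, alignment, and header rows are lost in the same way.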

While some short or simple documents may be able to undergo a PDF-to-Word (and subsequent Word-to-ePub) conversion with minimal difficulty, any long or complex document set will encounter several of these obstacles. The obstacles inherent in any PDF text extraction should underscore, first, the utility of retaining original versions of source documents in word processor format, if possible; and second, the critical importance of a good quality assurance strategy in any conversion process.

Quality assurance is included as a component of all DCL conversion services. We also offer stand-alone quality assurance services, which may be used for independent reviews of converted results or to oversee an in-house or outsourced conversion project. For more information on PDF-to-Word conversion, PDF-to-E-Book conversion, or DCL’s quality assurance services, contact us.

  • http://Anthonyintl.com sherman watstein

    It would be a mistake to consider that all documents are created using Microsoft Word.

    Adobe offers FrameMaker in both structured and unstructured versions. It is a powerful authoring and publishing solution and a customizable WYSIWYG XML editor for technical writing, and it can be used with Acrobat’s Distiller to generate a PDF.

    I have used Microsoft Word since 1983, when “WYSIWYG” was first released, and Adobe’s FrameMaker since 1995, for printed documents and later for creating PDFs.

    With 40 years of technical writing experience in manufacturing, aerospace and software documentation, I do not recommend that anyone write a manual in Microsoft Word. Most writers do not understand how to use Microsoft Word correctly, which is one reason this article points out problems with structure.

    • http://www.dclab.com Jeremy Seideman

      Sherman —

      While you are correct that there are many authoring and publishing tools that are extremely powerful, the average user (that is, someone who is not employed as a technical writer or even that experienced in documentation writing) tasked with some sort of documentation or writing project may very well use Word or a similar word processing product simply because of his or her familiarity with it.

      Many writers, though, are not familiar with the capabilities of Word and its ability to create documents with well-defined structure, leading to the difficulties that you point out.

      When dealing with source files that lack such a structure, the most efficient solution is generally to employ a conversion method (such as DCL’s) that attempts to determine the best possible structure based on the document’s appearance.

      Jeremy Seideman
      Conversion Engineer
      Data Conversion Laboratory, Inc.

      • Bcollinsmaster

        I agree. When I published my book, I tried the free PDF-to-ePub converter Calibre, but I couldn’t get it to work. I ended up hiring out and had http://www.1pdftoepub.com/ do it.

  • Terry Ham

    Thank you for the great post. I used Calibre to convert my PDFs to EPUB before, and it worked well. Recently, though, I have had a lot of e-books to convert to EPUB, most of them encrypted PDFs, and Calibre has trouble with them. So I turned to AnyBizSoft PDF to EPUB Converter, which supports batch conversion and encrypted PDF conversion. It works well, but unfortunately it is paid software. Just a heads-up.

    • http://www.dclab.com Michael Gross

      The reality is that off-the-shelf conversion software can work in certain cases. More typically, though, because of the challenges of extracting text perfectly from the PDF file format, the software will produce an imperfect EPUB file that requires manual cleanup. You would need to try the software on samples of your material to see whether it will satisfy your needs.

      Mike Gross
      CTO
      Data Conversion Laboratory

      • Vengadesan

        Great discussion. I would like to participate in this.
        I have studied these issues for more than 2 years, and we found an automated solution. Yes, it is feasible to extract content from PDF without these issues, with no manual cleanup required.

        Vengadesan

      • Dipika

        Hi Michael Gross,

        I agree with you. But manual cleanup always requires lots of time and passion. So if any of you have a new idea or know how to convert PDF to any format in a short time, then please let me know.

  • Chitthuhlaing

    where are the solutions?

    • http://www.pdf-to-epub-converters.com/ PDFtoEPUB

      There is no perfect solution for this kind of conversion. I used the freeware Calibre, and it performs well for simple PDF files. However, if the PDF file is complicated, say it contains tables, drawings, etc., the converted file is horrible. I did some research and found that Wondershare PDF to EPUB Converter works pretty well in this situation.

  • http://www.facebook.com/people/Jack-Smith/100002995429650 Jack Smith

    Very informative article on ePub conversion. I have gone through the same process myself, but that was just a conversion. I ordered my book’s conversion through http://www.ebookconversion.com/, and it needed many edits to format the book to look 99.99% alike.