PDF Conversion: How, For Whom, And When?

PDF Conversion White Paper, Part 1: Overview.

The lowdown on PDF conversion from Data Conversion Laboratory (DCL) by Lazar Weisz.

PDF, or Portable Document Format, is Adobe's flagship document publishing and distribution format. It has become the most widely used format for distributing documents within businesses and schools and on the Web.

This white paper addresses PDF conversion and the attendant issues. One of the secrets behind the success of PDF is the fact that it is portable. Regardless of the user's operating system, whether it be Windows, Linux or Macintosh, Adobe's free Acrobat Reader makes PDFs readable and printable everywhere. In a changing world of constant struggle for compatibility, this is a tremendously powerful factor. If you want to make sure your documents will be viewable by the largest number of people at low cost, Adobe PDF is the way to go.

If your primary goal is to disseminate information in its existing form and look, PDF will do an excellent job at much lower cost than other alternatives. PDF is an outstanding choice for reference documents that must retain their original look, and for documents that would normally be printed. However, if your requirements include repurposing and normalizing your documents so that they can be republished and shared with other organizations, PDF may not be the ideal choice. PDF files are also typically larger than marked-up text.

>>> A brief discussion of this topic can be found in our FAQ section - see PDF vs. SGML

Not all PDF files are equal. There are three forms of PDF files, each with its own characteristics:

  • PDF Normal
  • PDF Searchable Image
  • PDF Image Only

Let's look at each of them in turn ...

PDF Normal

Adobe officially calls this Formatted Text & Graphics, but we'll continue to refer to it as PDF Normal. This is the best kind of PDF. You get it when your materials have been produced on a modern word processing or publishing system with a PDF output capability. It contains the full text of the page with appropriate coding to define fonts, sizes, etc. The downloaded files are relatively small, and they look as good on the screen as the printed version would.

If PDF works for your application, and you have the original word processing or publishing files, this is the best bet. However, if you are starting from legacy materials and don't have suitable electronic files, producing PDF Normal is complex and relatively costly, usually requiring that you convert to a word processing or publishing format first, and from there produce the PDF files.

PDF Image Only

This type of PDF is the easiest to produce from legacy sources. It is an image of the page in a PDF wrapper and contains no searchable text. All you need to do is scan the materials and put the images through an automated PDF loading process. Image Only PDF could be seen as a replacement for microfilm: it is an archival format from which page images can be retrieved. However, there is no ability for text searching, and files tend to be fairly large and therefore harder to store and download. The image quality depends on the quality of the source materials and of the scanning operation.

Searchable Image PDF

This is a good compromise for many legacy applications. It is an image of the page, but with the text portions of the image converted to text for search purposes. In a search application, when the text is found, the image corresponding to the found text is displayed, and the materials can be read in context. This type of PDF is relatively inexpensive to produce since the pages can be scanned and run through an automated optical recognition process, commonly referred to as Optical Character Recognition (OCR).

Raw OCR output is usually not suitable for use as final text because its accuracy is unlikely to be high enough (raw OCR accuracy is only about 95-99% for most materials). For search purposes, however, it is good enough for the majority of applications. Also, since the image needs to be retained, file sizes are larger than for PDF Normal and for other text formats. If you can live with these constraints, Searchable Image PDF could be a very good compromise. This approach is frequently suitable for library and legal applications.

NOTE: Searchable Image PDF allows text to be selected and copied to the Windows clipboard for use in other applications. But care needs to be taken during the conversion process, because any OCR errors that have not been "cleaned up" will be seen if someone pastes the text. What's more, searches will fail wherever the text did not OCR properly. Both problems would reflect poorly on the quality of the product.
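To put the 95-99% raw OCR accuracy figure in perspective, the following is a rough sketch of how many character errors that range implies per page. The 2,000 characters per page is an assumption chosen for illustration, not a figure from this white paper, and the function name is hypothetical.

```python
def expected_ocr_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of misrecognized characters on one page.

    accuracy is the fraction of characters recognized correctly (0.0-1.0).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return chars_per_page * (1.0 - accuracy)


# Assumption: a text page holds roughly 2,000 characters (illustrative only).
CHARS_PER_PAGE = 2000

for accuracy in (0.95, 0.99):
    errors = expected_ocr_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page")
```

Under that assumption, even the upper end of the range leaves a noticeable number of errors on every page, which is consistent with raw OCR being acceptable behind an image for searching but not as text that readers will see or paste.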

Table 1 gives a general overview of the prices you can expect when converting to the various types of PDF. Note that these prices depend on a wide variety of factors. Each conversion project requires its own, unique conversion methodology. The prices shown should be regarded as benchmarks for the average project.

Table 2 shows typical file sizes per PDF page generated from the various types of paper and electronic sources.

Table 1: Estimated Prices For PDF Conversion

Source                             PDF Image Only   Searchable Image PDF   PDF Normal*
Bitonal Page                       $0.15-0.30       $0.17-0.30             $1.00-10.00
Grayscale Page                     $0.25-0.40       $0.30-0.45             $1.00-10.00
Color Page                         $0.30-0.45       $0.30-0.50             $1.00-10.00
Composite Page                     $0.40-0.80       $0.40-0.80             $1.00-10.00
From Word Processing Application   n/a              n/a                    Trivial

* NOTE: In PDF Normal, complex pages often need to be recomposed to retain the original look. The amount of work involved varies widely.
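As a rough illustration of how the benchmark prices in Table 1 could be turned into a project estimate, here is a minimal sketch. The price ranges are copied from the table; the function name, the dictionary keys, and the sample page counts are hypothetical, and real quotes depend on the factors noted above.

```python
# Per-page price ranges in US dollars (low, high), copied from Table 1.
PRICES = {
    "bitonal":   {"image_only": (0.15, 0.30), "searchable": (0.17, 0.30), "normal": (1.00, 10.00)},
    "grayscale": {"image_only": (0.25, 0.40), "searchable": (0.30, 0.45), "normal": (1.00, 10.00)},
    "color":     {"image_only": (0.30, 0.45), "searchable": (0.30, 0.50), "normal": (1.00, 10.00)},
    "composite": {"image_only": (0.40, 0.80), "searchable": (0.40, 0.80), "normal": (1.00, 10.00)},
}


def estimate_cost(page_counts: dict, pdf_type: str) -> tuple:
    """Return the (low, high) cost estimate for a mix of source page types."""
    low = high = 0.0
    for page_type, count in page_counts.items():
        price_low, price_high = PRICES[page_type][pdf_type]
        low += price_low * count
        high += price_high * count
    return low, high


# Hypothetical project: 10,000 scanned pages of mixed types.
project = {"bitonal": 8000, "grayscale": 1500, "color": 500}

for pdf_type in ("image_only", "searchable", "normal"):
    low, high = estimate_cost(project, pdf_type)
    print(f"{pdf_type}: ${low:,.2f} to ${high:,.2f}")
```

For this hypothetical mix, the Image Only and Searchable Image ranges overlap heavily, while the PDF Normal range is wider by an order of magnitude, which mirrors the recomposition effort described in the note above.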

Table 2: PDF file sizes - per page

Page properties          Compression   Typical file size per page
Bitonal                  G4            100K
Bitonal                  JBIG2         30K
Grayscale                              250K [1]
Color                                  600K [2]
Composite                G4            200K [3]
Composite                JBIG2         100K [4]
PDF Normal - text only                 30K

NOTE: G4 and JBIG2 are bitonal compression algorithms.

[1] Assuming scan at 150 DPI using medium-strength 8-bit JPEG compression.
[2] Assuming scan at 150 DPI using medium-strength 24-bit JPEG compression.
[3] Assuming bitonal scan at 300 DPI using G4 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
[4] Assuming bitonal scan at 300 DPI using JBIG2 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
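The per-page sizes in Table 2 can be scaled up to a whole collection to compare storage needs. The sketch below assumes that "K" in the table means kilobytes and that 1 MB is 1,024 KB; the dictionary keys, the function name, and the 10,000-page collection are hypothetical.

```python
# Typical file size per page in kilobytes, copied from Table 2.
SIZE_KB = {
    "bitonal_g4": 100,
    "bitonal_jbig2": 30,
    "grayscale": 250,
    "color": 600,
    "composite_g4": 200,
    "composite_jbig2": 100,
    "pdf_normal_text": 30,
}


def estimate_storage_mb(page_counts: dict) -> float:
    """Estimate total storage in megabytes for a mix of page types."""
    total_kb = sum(SIZE_KB[page_type] * count
                   for page_type, count in page_counts.items())
    return total_kb / 1024  # assumption: 1 MB = 1,024 KB


# Hypothetical collection: the same 10,000 bitonal pages under each compression.
for page_type in ("bitonal_g4", "bitonal_jbig2"):
    size_mb = estimate_storage_mb({page_type: 10_000})
    print(f"{page_type}: about {size_mb:,.0f} MB")
```

With these typical figures, choosing JBIG2 over G4 for bitonal pages cuts the estimate to roughly a third, so the compression choice matters more as the collection grows.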

Lazar Weisz
Data Conversion Laboratory

Read more of this PDF conversion white paper:

  1. Overview: www.dclab.com/pdf_conversion.asp
  2. PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
  3. PDF Searchable Image: www.dclab.com/pdfconversion3.asp
  4. PDF Normal: www.dclab.com/pdf_whitepaper_4.asp


© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.
 