Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
PDF Conversion: How, For Whom, And When?

PDF Conversion White Paper, Part 1: Overview.

The lowdown on PDF conversion from Data Conversion Laboratory (DCL) by Lazar Weisz.

READ MORE OF THIS PDF CONVERSION WHITE PAPER:

  1. Overview
  2. PDF Image Only
  3. PDF Searchable Image
  4. PDF Normal


PDF, or Portable Document Format, is Adobe's flagship document publishing and distribution format. It has become the most widely used format for distributing documents within businesses and schools and on the Web.

This white paper addresses PDF conversion and the attendant issues. One of the secrets behind the success of PDF is that it is portable. Regardless of the user's operating system, whether Windows, Linux, or Macintosh, PDFs are readable and printable everywhere with Adobe's free Acrobat Reader. In a world of constant struggle for compatibility, this is a tremendously powerful factor. If you want your documents to be viewable by the largest number of people at low cost, Adobe PDF is the way to go.

If your primary goal is to disseminate information in its existing form and look, PDF will do an excellent job at much lower cost than other alternatives. PDF is an outstanding choice for reference documents that must retain their original look, and for documents that would normally be printed. However, if your requirements include repurposing and normalizing your documents so that they can be republished and shared with other organizations, PDF may not be the ideal choice. PDF files are also typically larger than marked-up text.

>>> A brief discussion of this topic can be found in our FAQ section - see PDF vs. SGML

Not all PDF files are equal. There are three forms of PDF files, each with its own characteristics:

  • PDF Normal
  • PDF Searchable Image
  • PDF Image Only

Let's look at each of them in turn ...

PDF Normal

Adobe officially calls this Formatted Text & Graphics, but we'll continue to refer to it as PDF Normal. This is the best kind of PDF. You get it when your materials have been produced on a modern word processing or publishing system with PDF output capability. It contains the full text of the page, with appropriate coding to define fonts, sizes, etc. The downloaded files are relatively small, and the page will look as good on the screen as the printed version would.

If PDF works for your application and you have the original word processing or publishing files, this is the best bet. However, if you are starting from legacy materials and don't have suitable electronic files, producing PDF Normal is complex and relatively costly: you usually have to convert to a word processing or publishing format first and produce the PDF files from there.

PDF Image Only

This type of PDF is the easiest to produce from legacy sources. It is an image of the page in a PDF wrapper and contains no searchable text. All you need to do is scan the materials and put the images through an automated PDF loading process. Image Only PDF could be seen as a replacement for microfilm: it is an archival format from which pages can be retrieved. However, the text cannot be searched, and files tend to be fairly large and therefore harder to store and download. The image quality depends on the quality of the source materials and of the scanning operation.

Searchable Image PDF

This is a good compromise for many legacy applications. It is an image of the page, but with the text portions of the image converted to text for search purposes. In a search application, when the text is found, the image corresponding to the found text is displayed, and the materials can be read in context. This type of PDF is relatively inexpensive to produce, since the pages can be scanned and run through an automated recognition process, commonly referred to as "Optical Character Recognition" (OCR).
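The search behavior described above can be pictured with a small model. The following is only an illustrative sketch, not how any particular PDF viewer is implemented: the `ScannedPage` class, the `find_pages` function, and the file names are all hypothetical. Each page carries a visible image and a hidden OCR text layer; the search runs against the text, and the reader is shown the image.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    """One page of a Searchable Image PDF: a visible image plus hidden OCR text."""
    number: int
    image_file: str  # hypothetical file name of the scanned page image
    ocr_text: str    # text recovered by OCR, used only for searching


def find_pages(pages, query):
    """Return the pages whose hidden OCR text contains the query (case-insensitive)."""
    needle = query.lower()
    return [page for page in pages if needle in page.ocr_text.lower()]


# Made-up sample pages for illustration.
pages = [
    ScannedPage(1, "page-001.tif", "Annual report for the fiscal year"),
    ScannedPage(2, "page-002.tif", "Maintenance schedule for the engine"),
]

for hit in find_pages(pages, "engine"):
    # The reader is shown the page image, not the OCR text.
    print(f"page {hit.number}: display {hit.image_file}")
```

Because the match is made against the OCR text alone, a word that the OCR misread on a page will not be found on that page, even though it is plainly legible in the image.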

Raw OCR output is usually not accurate enough to stand on its own as text (raw OCR accuracy is only about 95-99% for most materials). But for search purposes, it is good enough for the majority of applications. Also, since the image needs to be retained, file sizes are larger than PDF Normal and larger than other text formats. If you can live with these constraints, Searchable Image PDF could be a very good compromise. This approach is frequently suitable for library and legal applications.
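To get a feel for what 95-99% accuracy means in practice, here is a rough back-of-the-envelope sketch. The figure of 2,000 characters per page and the assumption that character errors are independent of one another are ours, chosen only for illustration; real pages and real OCR errors vary.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of wrong characters on a page at a given character accuracy."""
    return chars_per_page * (1.0 - accuracy)


def word_intact_probability(word_length, accuracy):
    """Chance that every character of a word was recognized correctly,
    assuming character errors are independent (a simplification)."""
    return accuracy ** word_length


CHARS_PER_PAGE = 2000  # assumed page density, not a figure from this paper

for accuracy in (0.95, 0.99):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    intact = word_intact_probability(8, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page, "
          f"8-letter word intact {intact:.0%} of the time")
```

Under these assumptions a page carries roughly 20 to 100 wrong characters, which is too many for clean text, while an eight-letter search term still survives intact about 66% to 92% of the time.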

NOTE: Searchable Image PDF allows text to be selected and copied to the Windows clipboard for use in other applications. But care needs to be taken during the conversion process, because any OCR errors that have not been "cleaned up" will be seen if someone pastes the text. What's more, searches will fail wherever the OCR misread the text -- all of which would reflect poorly on the quality of the product.

Table 1 gives a general overview of the prices you can expect when converting to the various types of PDF. Note that these prices depend on a wide variety of factors. Each conversion project requires its own unique conversion methodology. The prices shown should be regarded as benchmarks for the average project.

Table 2 shows typical file sizes per PDF page generated from the various types of paper and electronic sources.

Table 1: Estimated Prices For PDF Conversion

Source page                        PDF Image Only   Searchable Image PDF   PDF Normal*
Bitonal Page                       $0.15-0.30       $0.17-0.30             $1.00-10.00
Grayscale Page                     $0.25-0.40       $0.30-0.45             $1.00-10.00
Color Page                         $0.30-0.45       $0.30-0.50             $1.00-10.00
Composite page                     $0.40-0.80       $0.40-0.80             $1.00-10.00
From Word Processing Application   n/a              n/a                    Trivial

* NOTE: In PDF Normal, complex pages often need to be recomposed to retain the original look. The amount of work involved varies widely.
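As a rough illustration of how the Table 1 benchmarks might be used for budgeting, the sketch below multiplies a price range by a page count. It assumes the Table 1 prices are per page, and the `PRICES` table and `estimate_cost` function are hypothetical names of our own. The result is only as good as the benchmarks, which, as noted, vary from project to project.

```python
# Price ranges in US dollars, copied from Table 1 as (low, high).
# Assumption: the Table 1 figures are prices per page.
PRICES = {
    "bitonal": {"image_only": (0.15, 0.30), "searchable": (0.17, 0.30), "normal": (1.00, 10.00)},
    "grayscale": {"image_only": (0.25, 0.40), "searchable": (0.30, 0.45), "normal": (1.00, 10.00)},
    "color": {"image_only": (0.30, 0.45), "searchable": (0.30, 0.50), "normal": (1.00, 10.00)},
    "composite": {"image_only": (0.40, 0.80), "searchable": (0.40, 0.80), "normal": (1.00, 10.00)},
}


def estimate_cost(page_type, pdf_type, page_count):
    """Return a (low, high) cost estimate in dollars for a batch of pages."""
    low, high = PRICES[page_type][pdf_type]
    return round(low * page_count, 2), round(high * page_count, 2)


# Example: 10,000 bitonal pages converted to Searchable Image PDF.
print(estimate_cost("bitonal", "searchable", 10_000))
```

For 10,000 bitonal pages, the Searchable Image range works out to between $1,700 and $3,000, while the PDF Normal range for the same pages runs from $10,000 to $100,000, which is why the choice of PDF type dominates the budget.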

Table 2: PDF file sizes - per page

Page properties          Compression   Typical file size per page
Bitonal                  G4            100K
Bitonal                  JBIG2         30K
Grayscale                -             250K [1]
Color                    -             600K [2]
Composite                G4            200K [3]
Composite                JBIG2         100K [4]
PDF Normal - text only   -             30K

NOTE: G4 and JBIG2 are bitonal compression algorithms.

[1] Assuming scan at 150 DPI using medium-strength 8-bit JPEG compression.
[2] Assuming scan at 150 DPI using medium-strength 24-bit JPEG compression.
[3] Assuming bitonal scan at 300 DPI using G4 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
[4] Assuming bitonal scan at 300 DPI using JBIG2 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
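The Table 2 figures can be turned into a quick storage estimate in the same way. This is a minimal sketch under stated assumptions: the `SIZE_KB` table and `estimate_storage_mb` function are hypothetical names, 1 MB is taken as 1,000 KB, and the typical sizes are treated as averages over the whole collection.

```python
# Typical kilobytes per page, copied from Table 2.
# Keys are (page kind, compression); None means Table 2 lists no G4/JBIG2 variant.
SIZE_KB = {
    ("bitonal", "G4"): 100,
    ("bitonal", "JBIG2"): 30,
    ("grayscale", None): 250,
    ("color", None): 600,
    ("composite", "G4"): 200,
    ("composite", "JBIG2"): 100,
    ("pdf_normal_text", None): 30,
}


def estimate_storage_mb(page_kind, compression, page_count):
    """Approximate collection size in megabytes, taking 1 MB as 1,000 KB."""
    return SIZE_KB[(page_kind, compression)] * page_count / 1000


# Example: 10,000 bitonal pages under each bitonal compression scheme.
for compression in ("G4", "JBIG2"):
    print(compression, estimate_storage_mb("bitonal", compression, 10_000), "MB")
```

On these numbers, 10,000 bitonal pages take about 1,000 MB with G4 and about 300 MB with JBIG2, so the compression choice alone changes the storage requirement by more than a factor of three.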

Lazar Weisz
Data Conversion Laboratory

Read more of this PDF conversion white paper:

  1. Overview: www.dclab.com/pdf_conversion.asp
  2. PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
  3. PDF Searchable Image: www.dclab.com/pdfconversion3.asp
  4. PDF Normal: www.dclab.com/pdf_whitepaper_4.asp


© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.
Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.