Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 

Adobe PDF Conversion: How, For Whom, And When?

PDF White Paper, Part 2: PDF Image Only.

Get the lowdown on how to convert documents to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).

What is PDF Image Only?

PDF Image Only is simply a scanned, non-searchable image of each page inside a PDF wrapper. This limited approach to distributing documents is the cheapest: for simple text documents, prices range between $0.15 and $0.30 per page. It is an ideal solution for archiving legacy documents in digital format.

PDF Image Only File Sizes

In contrast to the relatively small PDF Normal documents authored in word processors or publishing platforms, PDF Image Only files are subject to the same file size concerns as all image formats, such as TIFF, JPEG, and BMP. Depending on the type of image, color range, and image resolution, file size is frequently a major concern. Two methods for reducing file size are Image Compression (see Appendix A) and Composite PDF (see Appendix B). A combination of the two might yield the best results.

Image Quality

When scanning to PDF Image Only, keep in mind that the quality of the final PDF depends largely on the initial capture to digital format. A high-quality scan that needs minimal post-scan clean-up will always yield better results than a low-quality scan followed by extensive image clean-up. Investing in an excellent scanner, or in better training for the people doing the scanning, pays off quickly compared with the cost of manually de-speckling, de-skewing, and otherwise fixing a bad scan.

Hyperlinking an Image Only PDF

While images are not searchable, other navigational aids can be used with Image Only PDF files. Adobe Acrobat and other tools can add hyperlinks to a PDF Image Only document. For example, you can provide a Table of Contents, Index, or other intra-document linking structure whose entries link directly to the relevant pages. Alternatively, you can use Acrobat's bookmarking feature, which lets you create your own Table of Contents-like list of headings linked to their respective pages. These bookmarks become part of the PDF file but are not an actual page in the file.


Appendix A:

Image Compression: An Overview

Image compression refers to any of several techniques used to reduce image file sizes, usually by removing either redundant information or information that can be recreated prior to display. Reducing file sizes is often important so that image-heavy files can be easily transmitted and stored.

The scope of the problem depends on a number of factors. For example, if the page contains only text or a few black/white (bi-tonal) images, the problem is limited, since bi-tonal images compress to very small sizes (typically using CCITT Group 4 compression at the industry-standard 300 DPI resolution). You'll be able to scan the entire page at a single resolution (300 DPI), color depth (bi-tonal), and compression (CCITT Group 4), and retain a small file size. Using JBIG2 compression, you can even achieve file sizes similar to those of PDF Normal. If the page contains grayscale or color images, however, file size increases dramatically. An 8 ½ by 11 inch page scanned at 300 DPI with 24-bit color depth would result, uncompressed, in a TIFF file of around 25 MB:

Width: 8 ½ x 300 = 2,550 pixels
Length: 11 x 300 = 3,300 pixels

2,550 x 3,300 = 8,415,000 total page pixels

8,415,000 x 24 (color-depth bits) = 201,960,000 bits

201,960,000 / 8 = 25,245,000 bytes, or 25.2 MB.
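
The same arithmetic can be written as a small function, which makes it easy to try other page sizes, resolutions, and color depths. This is only a sketch: the function name and parameters are illustrative, and it ignores any header or row-padding overhead that a real TIFF file would add.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the raw (uncompressed) pixel data size of a scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits // 8


# The 8 ½ by 11 inch page from the text, at 300 DPI.
color_page = uncompressed_size_bytes(8.5, 11, 300, 24)   # 24-bit color
bitonal_page = uncompressed_size_bytes(8.5, 11, 300, 1)  # 1 bit per pixel

print(f"24-bit color: {color_page:,} bytes ({color_page / 1_000_000:.1f} MB)")
print(f"Bitonal:      {bitonal_page:,} bytes ({bitonal_page / 1_000_000:.1f} MB)")
```

Even before any compression is applied, the bitonal scan of the same page is 24 times smaller than the color scan, simply because it stores 1 bit per pixel instead of 24.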

A PDF file carrying 25.2 MB per page would take a long time to download and require a great deal of disk space to store. Image compression is intended to reduce these sizes.

Image compression can be categorized as lossy or lossless. Lossy compression algorithms give up some image quality in exchange for smaller files. JPEG, for example, is a lossy compression method. It is frequently used for color images on the Web, where small file sizes, and thus shorter download times, are more important than high image quality. TIFF G4, on the other hand, is a lossless bitonal compression method often used to scan medical, legal, and governmental documents that must retain their original look and feel. Also, when converting to Searchable Image PDF, the OCR (Optical Character Recognition) process required to add the text layer to the PDF works much better when applied to a lossless, purely bitonal scan, so TIFF G4 is often used for OCR. The right compression method for your conversion therefore depends on the following factors:

  1. Type of Information (medical, legal, etc.)
  2. Range of colors (bitonal, grayscale, color)
  3. Resolution required, in DPI (dots per inch)
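
To make "lossless" concrete, and to show why bitonal scans compress so well, here is a minimal sketch of run-length encoding applied to one scan line. It is a deliberately simplified scheme chosen for illustration; it is not the CCITT Group 4, JBIG2, or Packbits algorithm, all of which are more elaborate. The function names and the sample scan line are hypothetical.

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 pixels as a list of [value, run_length] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([bit, 1])     # start a new run
    return runs


def rle_decode(runs):
    """Rebuild the original pixel sequence from [value, run_length] pairs."""
    bits = []
    for value, length in runs:
        bits.extend([value] * length)
    return bits


# One made-up scan line, 2550 pixels wide (8 ½ inches at 300 DPI):
# white margin, two black strokes, white margin. 0 = white, 1 = black.
row = [0] * 1000 + [1] * 12 + [0] * 30 + [1] * 8 + [0] * 1500
encoded = rle_encode(row)

print(len(row), "pixels ->", len(encoded), "runs")
print("lossless:", rle_decode(encoded) == row)
```

Decoding returns exactly the pixels that went in, which is what distinguishes a lossless method from a lossy one such as JPEG. The long white runs typical of text pages are also why bitonal scans shrink so dramatically.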

The following table illustrates the most popular methods of compression and where they are commonly used:

Compression Method   Lossy/Lossless                Color Range Supported   Application                  Compression Ratio
------------------   ---------------------------   ---------------------   --------------------------   -----------------------------
TIFF G4              Lossless                      Bitonal                 Legal, Defense, Government   90-95%
JBIG2                Supports both                 Bitonal                 Legal, Defense, Government   95-98%
LZW/Packbits         Lossless                      Color                   Medical, IT                  LZW: 80-85%; Packbits: 75-80%
JPEG, GIF            JPEG: Lossy; GIF: Lossless    Color                   WWW                          JPEG: 90-95%; GIF: 60-80%
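
As a rough illustration of what these figures mean in practice, the sketch below applies the ranges from the table to a raw page size. It assumes that "Compression Ratio" means the percentage by which the raw file shrinks, which is an interpretation rather than something the table states. Actual results vary with page content, and the dictionary and function names are illustrative.

```python
# Size-reduction ranges from the table, as (low, high) whole percentages.
# Assumption: the percentage is the share of the raw size that is removed.
REDUCTION_PERCENT = {
    "TIFF G4": (90, 95),
    "JBIG2": (95, 98),
    "LZW": (80, 85),
    "Packbits": (75, 80),
    "JPEG": (90, 95),
    "GIF": (60, 80),
}


def estimated_size_range(raw_bytes, method):
    """Return (smallest, largest) estimated compressed size in bytes."""
    low, high = REDUCTION_PERCENT[method]
    smallest = raw_bytes * (100 - high) // 100
    largest = raw_bytes * (100 - low) // 100
    return smallest, largest


# Raw sizes of an 8 ½ by 11 inch page at 300 DPI (see the calculation in Appendix A).
color_raw = 25_245_000    # 24-bit color
bitonal_raw = 1_051_875   # 1 bit per pixel

for label, raw, method in [("color", color_raw, "JPEG"),
                           ("bitonal", bitonal_raw, "TIFF G4")]:
    smallest, largest = estimated_size_range(raw, method)
    print(f"{label} page, {method}: {smallest:,} to {largest:,} bytes")
```

Under these assumptions a compressed color page still comes out at well over 1 MB, while a bitonal page lands at roughly 50 to 100 KB.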

While compressing the entire page with one method is simplest, it does not necessarily give optimal results. Different types of compression are frequently suited to different parts of the page. Areas of a page containing text that will undergo OCR to produce Searchable PDF, for example, should be scanned at a resolution of at least 300 DPI and at bitonal color depth. Images on the same page, however, can't be scanned at bitonal color depth, since that would convert the color image to monochrome. Scanning the entire page at 300 DPI color results in a large file even with image compression. So if a page contains both images and text, a dilemma unfolds: because the compression methods mentioned above allow only one color depth and one resolution setting, the final PDF produced from the image will either contain color but be large and suffer from below-par OCR results, or be created bitonally for small file sizes and good OCR but lose its color. This problem is solved with Composite PDF.

Appendix B:

Composite PDF

Standard image file formats have a major drawback: you can have only one resolution and one color depth setting for the entire image. For example, to scan a page containing mostly text but also a few color images surrounded by text (think of a medical journal or a computer magazine), you'll typically scan the whole page either at a bitonal setting, which captures the text and white space optimally but converts all images to monochrome, or at a color setting, which picks up the color images beautifully but creates unnecessary gray shadings for the text and white space and results in a huge file. You'd also be limited to one resolution.

As a solution, PDF allows you to combine many 'zones' on a single page. In the example above, you could scan the whole page to a 300 DPI bitonal TIFF, scan it again as a 150 DPI color JPEG, and combine the two in the final PDF to yield the perfect balance: Composite PDF. This PDF enjoys the best of both worlds: purely bitonal text and white space areas (which is important for the best OCR and print results) and true-color, compressed image areas. File size is kept to a minimum, since you can use G4 or JBIG2 compression on all text and white space areas and JPEG for the images.
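
A back-of-the-envelope comparison shows the kind of saving involved. The sketch below is an estimate under stated assumptions, not a measurement. It assumes a single 3 by 4 inch color image on the page (a made-up figure), that only this zone is kept from the 150 DPI color scan, and a 90% size reduction for both G4 and JPEG (the conservative end of the ranges in Appendix A). It ignores PDF container overhead.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed pixel data size in bytes."""
    return round(width_in * dpi) * round(height_in * dpi) * bits_per_pixel // 8


def compressed(raw, reduction_percent):
    """Estimated size after removing reduction_percent of the raw bytes."""
    return raw * (100 - reduction_percent) // 100


# Option 1: the whole page as one 300 DPI, 24-bit color image, JPEG compressed.
full_color = compressed(raw_bytes(8.5, 11, 300, 24), 90)

# Option 2: Composite PDF.
text_layer = compressed(raw_bytes(8.5, 11, 300, 1), 90)   # bitonal page, G4
image_zone = compressed(raw_bytes(3, 4, 150, 24), 90)     # hypothetical color zone, JPEG
composite = text_layer + image_zone

print(f"Single color image: {full_color:,} bytes")
print(f"Composite PDF:      {composite:,} bytes")
print(f"Roughly {full_color / composite:.0f} times smaller")
```

The exact ratio depends heavily on how much of the page the color images occupy, but the pattern holds: keeping text and white space bitonal removes most of the data that a full-page color scan would carry.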

Lazar Weisz
Data Conversion Laboratory

Read more of this white paper on how to convert documents to PDF:

  1. Overview: www.dclab.com/pdf_conversion.asp
  2. PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
  3. PDF Searchable Image: www.dclab.com/pdfconversion3.asp
  4. PDF Normal: www.dclab.com/pdf_whitepaper_4.asp


© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.
Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.