Adobe PDF Conversion: How, For Whom, And When?
PDF White Paper, Part 2: PDF Image Only.
Get the lowdown on how to convert documents to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).
What is PDF Image Only?
PDF Image Only is simply a scanned, non-searchable image of each page
inside a PDF wrapper. This limited approach to distributing documents is
the cheapest: for simple text documents, prices range between $0.15
and $0.30 per page. It is an ideal solution for archiving legacy
documents in digital format.
PDF Image Only File Sizes
In contrast to the relatively small PDF Normal documents authored in
word processors or publishing platforms, PDF Image Only files share the
file size concerns of all image formats, such as TIFF, JPEG and BMP.
Depending on the type of image, color range, and image resolution,
file size is frequently a major concern. Two methods to reduce file
size are Image Compression (Appendix A) and Composite PDF (Appendix B).
A combination of the two might yield the best results.
Image Quality
When scanning to PDF Image Only, keep in mind that the quality of the
final PDF is largely dependent on the initial capture to digital
format. A high-quality scan that needs minimal post-scan clean-up will
always yield better results than a low-quality scan followed by lots of
image clean-up. Investing more in an excellent scanner or in better
training of the people doing the scanning pays off quickly when
compared to the costs of manually having to de-speckle, de-skew, and
otherwise fix a bad scan.
Hyperlinking an Image Only PDF
While images are not searchable, there are other navigational aids
that can be used with Image Only PDF files. Adobe Acrobat and other
tools can be used to add hyperlinks to a PDF Image Only document.
For example, you can provide a Table of Contents, Index, or other
intra-document linking structure, which would be linked directly to
the relevant page. Alternatively, you can use Acrobat's bookmarking
feature, which enables you to create your own Table of Contents-like
list of headings that are linked to their respective pages. These
bookmarks become part of the PDF file but are not an actual page in
the file.
Appendix A:
Image Compression: An Overview
Image compression refers to any of several techniques used
to reduce image file sizes, usually by removing either redundant
information or information that can be recreated prior to display.
Reducing file sizes is often important so that image-heavy
files can be easily transmitted and stored.
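To make "removing redundant information" concrete, here is a minimal sketch of run-length encoding on a single bitonal scan line. It is a deliberately simplified scheme for illustration only, not the CCITT Group 4 or PackBits algorithm, and the 2550-pixel row with its one black stroke is a made-up example.

```python
def rle_encode(pixels):
    """Collapse a row of 0/1 pixels into [value, run_length] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # same value as the previous pixel: extend the run
        else:
            runs.append([p, 1])       # value changed: start a new run
    return runs


def rle_decode(runs):
    """Rebuild the original pixel row from [value, run_length] pairs."""
    pixels = []
    for value, length in runs:
        pixels.extend([value] * length)
    return pixels


# Hypothetical scan line at 300 DPI across an 8.5 inch page (2550 pixels):
# white margin, one short black stroke, white margin.
row = [0] * 1200 + [1] * 150 + [0] * 1200
runs = rle_encode(row)

# Lossless: decoding gives back exactly the pixels we started with.
assert rle_decode(runs) == row
print(len(row), "pixels stored as", len(runs), "runs")
```

Because the decoded row is identical to the input, nothing is lost; the saving comes purely from not repeating long stretches of identical pixels, which is why mostly-white text pages compress so well.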
The scope of the problem is related to a number of factors. For
example, if the page contains only text or a few black/white
(bi-tonal) images, the problem is limited since bi-tonal images
compress to very small sizes (typically using CCITT Group 4
compression at the industry standard 300 DPI resolution). You'll be
able to scan the entire page at a single resolution (300 DPI),
color-depth (bi-tonal) and compression (CCITT Group 4), and retain a
small file size. Using JBIG2 compression, you can even achieve file
sizes similar to PDF Normal. If the page contains grayscale or color
images, however, file size increases dramatically. An 8 ½ by
11 inch page scanned at 300 DPI with 24-bit color depth would
result, uncompressed, in a TIFF file of around 25 MB:
Width: 8 ½ x 300 = 2550 pixels
Length: 11 x 300 = 3300 pixels
2550 x 3300 = 8415000 total page pixels
8415000 x 24 (color-depth bits) = 201960000 bits
201960000 / 8 = 25245000 bytes, or 25.2 MB.
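The same arithmetic can be sketched as a small function so that other page sizes, resolutions, and color depths can be tried. The function name and the choice of 1 MB = 1,000,000 bytes (which is what the 25.2 MB figure above implies) are assumptions made for this illustration.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes for a page scanned without compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8      # round up to whole bytes


# Letter-size page at 300 DPI, at three common color depths.
for label, depth in [("bitonal", 1), ("grayscale", 8), ("24-bit color", 24)]:
    size = uncompressed_bytes(8.5, 11, 300, depth)
    print(f"{label}: {size} bytes ({size / 1_000_000:.1f} MB)")
```

For the 24-bit case this gives 25,245,000 bytes, matching the figure worked out above; the bitonal page is 24 times smaller before any compression is applied.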
PDF files containing 25.2 MB per page would take a long time to
download and would require a great deal of disk space to store. Image
compression is intended to reduce these sizes.
Image compression can be categorized as lossy or lossless. Lossy
compression algorithms discard some image detail in order to shrink
the file further, while lossless algorithms reproduce the original
image exactly. JPEG, for example, is a lossy compression method. It is
frequently used for color images on the Web, where small image file
sizes and thus shorter download times are more important than
high-quality images. TIFF G4, on the other hand, is a lossless bitonal
compression method often used to scan medical, legal, and governmental
documents that must retain their original look and feel. Also, when
converting to Searchable Image PDF, the OCR (Optical Character
Recognition) process required to add the text layer to the PDF works
much better when applied to a lossless, purely bitonal scan, so TIFF G4
is often used for OCR. The right compression method for your
conversion depends on the following factors:
- Type of Information (medical, legal, etc.)
- Range of colors (bitonal, grayscale, color)
- Resolution required, in DPI (dots per inch)
The following table illustrates the most
popular methods of compression and where they are commonly used:
| Compression Method | Lossy/Lossless | Color Range Supported | Application | Compression Ratio |
|---|---|---|---|---|
| TIFF G4 | Lossless | Bitonal | Legal, Defense, Government | 90-95% |
| JBIG2 | Supports both | Bitonal | Legal, Defense, Government | 95-98% |
| LZW/Packbits | Lossless | Color | Medical, IT | LZW: 80-85%; Packbits: 75-80% |
| JPEG, GIF | JPEG: Lossy; GIF: Lossless | Color | WWW | JPEG: 90-95%; GIF: 60-80% |
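As a rough illustration of how the table can guide a choice, the sketch below transcribes its rows into a small lookup and filters them by color range and by whether the scan must be lossless. The data structure and the `candidates` function are hypothetical helpers written for this example; the classifications are simply the ones the table lists.

```python
# Rows transcribed from the table above. "modes" records whether the
# method is listed as lossless, lossy, or both.
METHODS = [
    {"name": "TIFF G4",  "modes": {"lossless"},          "color": "bitonal"},
    {"name": "JBIG2",    "modes": {"lossless", "lossy"}, "color": "bitonal"},
    {"name": "LZW",      "modes": {"lossless"},          "color": "color"},
    {"name": "Packbits", "modes": {"lossless"},          "color": "color"},
    {"name": "JPEG",     "modes": {"lossy"},             "color": "color"},
    {"name": "GIF",      "modes": {"lossless"},          "color": "color"},
]


def candidates(color_range, must_be_lossless):
    """Return the table's methods that fit the color range and fidelity need."""
    result = []
    for method in METHODS:
        if method["color"] != color_range:
            continue
        if must_be_lossless and "lossless" not in method["modes"]:
            continue
        result.append(method["name"])
    return result


# A legal document scanned in black and white that must be reproduced exactly:
print(candidates("bitonal", must_be_lossless=True))
# A color image for the Web, where some loss is acceptable:
print(candidates("color", must_be_lossless=False))
```

A real selection would also weigh the required resolution and the type of information, the other two factors listed earlier, which this sketch leaves out.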
While compressing the entire page using one method is the
simplest approach, it does not necessarily provide optimal results.
Frequently, different types of compression are suited to different
parts of the page. Areas on a page containing text that will undergo
an OCR process to produce Searchable PDF, for example, should be
scanned at a resolution not lower than 300 DPI and using bitonal
color depth. Images on the same page, however, can't be scanned at
bitonal color depth since that would convert the color image to
monochrome. Scanning the entire page at 300 DPI color will result in
a large file size even when using image compression. So if a page
contains images and text, a dilemma unfolds: if the compression
methods mentioned above allow for only one color depth and one
resolution setting, the final PDF produced from the image will either
contain color but be large and suffer from below-par OCR
results, or be created bitonally, which allows for small
file sizes and good OCR but loses the color in the images. This
problem is solved with Composite PDF.
Appendix B:
Composite PDF
Standard image file formats have a major drawback: you can only have
one resolution and one color depth setting for the entire image. For
example: in order to scan a page containing mostly text, but also a
few color images surrounded by text (think of a medical journal or a
computer magazine), you'll typically either scan the whole page at a
bitonal setting, which will capture the text and white space
optimally but will convert all images to monochrome, or at a color
setting, which will pick up the color images beautifully but create
unnecessary gray shadings for the text and white space and result in
a huge file. You'd also be limited to one resolution. As a solution
to this, PDF allows you to combine many 'zones' on a single page. In
the example above, you could scan the whole page to a 300 DPI bitonal
TIFF, and then again at 150 DPI JPEG color, and combine them in the
final PDF to yield the perfect balance: Composite PDF. This PDF will
enjoy the best of both worlds: purely bitonal text and white space
areas (which is important to get best OCR and print results) and true
color, compressed image areas. File size will be kept to a minimum
since you'll be able to use G4 or JBIG2 compression on all text and
white space areas and JPEG for the images.
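A back-of-the-envelope estimate shows why the composite approach keeps files small. The sketch below compares a whole page scanned in 300 DPI color with a composite of a 300 DPI bitonal layer plus one 150 DPI color zone. The 3 by 4 inch image zone is a hypothetical example, and the 90% size reduction is the conservative end of the ranges given for TIFF G4 and JPEG in Appendix A; real results depend heavily on page content.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned area."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8


def compressed_bytes(raw, reduction_pct):
    """Size left after removing reduction_pct percent of the raw bytes."""
    return raw * (100 - reduction_pct) // 100


REDUCTION = 90  # assumed: low end of the 90-95% range for both G4 and JPEG

# Option 1: the whole letter-size page as 300 DPI, 24-bit color JPEG.
full_color = compressed_bytes(raw_bytes(8.5, 11, 300, 24), REDUCTION)

# Option 2: composite page.
text_layer = compressed_bytes(raw_bytes(8.5, 11, 300, 1), REDUCTION)   # bitonal, G4
image_zone = compressed_bytes(raw_bytes(3, 4, 150, 24), REDUCTION)     # color, JPEG
composite = text_layer + image_zone

print("full-page color:", full_color, "bytes")
print("composite:      ", composite, "bytes")
print(f"composite is about {full_color / composite:.1f}x smaller")
```

Under these assumptions the composite page comes out more than ten times smaller than the full-page color scan, while the text stays purely bitonal for OCR and printing.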
Lazar Weisz
Data Conversion Laboratory
Read more of this white paper on how to convert documents to PDF:
- Overview: www.dclab.com/pdf_conversion.asp
- PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
- PDF Searchable Image: www.dclab.com/pdfconversion3.asp
- PDF Normal: www.dclab.com/pdf_whitepaper_4.asp
© 2002/2003 Data Conversion Laboratory. All rights reserved. This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.