Part 1: Overview
Part 2: PDF Image-Only
Part 3: PDF Searchable Image
Part 4: PDF Normal
App. A: Image Compression
App. B: Composite PDF
PDF, or Portable Document Format, is Adobe's flagship document publishing and distribution format. It has become the most widely used format for distributing documents in businesses, in schools, and on the Web.
This white paper addresses PDF conversion and the attendant issues. One of the secrets behind the success of PDF is that it is portable. Regardless of the user's operating system, whether Windows, Linux, or Macintosh, PDFs are readable and printable everywhere with Adobe's free Acrobat Reader. In a world of constant struggle for compatibility, this is a tremendously powerful factor. If you want to make sure your documents will be viewable by the largest number of people at low cost, Adobe PDF is the way to go.
If your primary goal is to disseminate information in its existing form and look, PDF will do an excellent job at much lower cost than other alternatives. PDF is an outstanding choice for reference documents that must retain their original look, and for documents that would normally be printed. However, if your requirements include repurposing and normalizing your documents so that they can be republished and shared with other organizations, PDF may not be the ideal choice. PDF files are also typically larger than marked-up text.
Not all PDF files are equal. There are three forms of PDF files, each with its own characteristics:
PDF Normal
PDF Searchable Image
PDF Image Only
Let's look at each of them in turn ...
PDF Normal
Adobe officially calls this Formatted Text & Graphics, but we'll continue to refer to it as PDF Normal. This is the best kind of PDF. You get it when your materials have been produced on a modern word processing or publishing system with a PDF output capability. It contains the full text of the page with appropriate coding to define fonts, sizes, etc. The downloaded files are relatively small, and pages look as good on the screen as the printed version would.
If PDF works for your application, and you have the original word processing or publishing files, this is the best bet. However, if you are working from legacy materials and don't have suitable electronic files, producing PDF Normal is complex and relatively costly, usually requiring that you convert to a word processing or publishing format first, and from there produce the PDF files.
PDF Image Only
This type of PDF is easiest to produce from legacy sources. It is an image of the page in a PDF wrapper and contains no searchable text. Producing it is easy. All you need to do is scan the materials and put the images through an automated PDF loading process. Image Only PDF could be seen as a replacement for microfilm: It is an archival format which can be retrieved. However, there is no ability for text searching and files tend to be fairly large and therefore harder to store and download. The image quality is dependent on the quality of the source materials and the quality of the scanning operation.
PDF Searchable Image
This is a good compromise for many legacy applications. It is an image of the page, but with the text portions of the image converted to text for search purposes. In a search application, when the text is found, the image corresponding to the found text is displayed, and the materials can be read in context. This type of PDF is relatively inexpensive to produce since the pages can be scanned and run through an automated Optical Character Recognition (OCR) process.
Raw OCR output is usually not suitable for display as text because its accuracy is unlikely to be high enough (raw OCR accuracy is only about 95-99% for most materials). But for search purposes, it is good enough for the majority of applications. Also, since the image needs to be retained, file sizes are larger than PDF Normal and larger than other text formats. If you can live with these constraints, Searchable Image PDF could be a very good compromise. This approach is frequently suitable for library and legal applications.
NOTE: Searchable Image PDF allows text to be selected and copied to the clipboard for use in other applications. Care needs to be taken during the conversion process, because any OCR errors that have not been "cleaned up" will be seen if someone pastes text. What's more, searches will fail wherever the text did not OCR properly. Both problems reflect poorly on the quality of the product.
Table 1 contains a general overview of the prices you can expect when converting to the various types of PDF. Note that these prices depend on a wide variety of factors. Each conversion project requires its own, unique conversion methodology. The prices shown should be regarded as benchmarks for the average project.
Table 2 illustrates typical file sizes per PDF page generated from the various types of paper and electronic sources.
| Source page | PDF Image Only | Searchable Image PDF | PDF Normal* |
|---|---|---|---|
| Bitonal Page | $0.15-0.30 | $0.17-0.30 | $1.00-10.00 |
| Grayscale Page | $0.25-0.40 | $0.30-0.45 | $1.00-10.00 |
| Color Page | $0.30-0.45 | $0.30-0.50 | $1.00-10.00 |
| Composite Page | $0.40-0.80 | $0.40-0.80 | $1.00-10.00 |
| From Word Processing Application | n/a | n/a | Trivial |
* NOTE: In PDF Normal, complex pages often need to be recomposed to retain the original look. The amount of work involved varies widely.
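To turn the benchmarks in Table 1 into a rough project budget, the per-page ranges can be multiplied by page counts. The following is a minimal sketch, not a pricing tool: the dictionary keys and the function name `estimate_cost` are assumptions made for illustration, and the figures are simply the benchmark ranges from Table 1.

```python
# Benchmark price ranges in USD per page, copied from Table 1.
# The keys and the function name are illustrative assumptions.
PRICE_PER_PAGE = {
    "image_only": {
        "bitonal": (0.15, 0.30),
        "grayscale": (0.25, 0.40),
        "color": (0.30, 0.45),
        "composite": (0.40, 0.80),
    },
    "searchable_image": {
        "bitonal": (0.17, 0.30),
        "grayscale": (0.30, 0.45),
        "color": (0.30, 0.50),
        "composite": (0.40, 0.80),
    },
    "normal": {
        "bitonal": (1.00, 10.00),
        "grayscale": (1.00, 10.00),
        "color": (1.00, 10.00),
        "composite": (1.00, 10.00),
    },
}


def estimate_cost(pdf_type, page_counts):
    """Return the (low, high) benchmark cost in USD for a batch of pages."""
    low = high = 0.0
    for page_type, count in page_counts.items():
        low_price, high_price = PRICE_PER_PAGE[pdf_type][page_type]
        low += low_price * count
        high += high_price * count
    return round(low, 2), round(high, 2)


# Example: 10,000 bitonal pages and 500 color pages as Searchable Image PDF.
low, high = estimate_cost("searchable_image", {"bitonal": 10_000, "color": 500})
print(f"Estimated cost: ${low:,.2f} to ${high:,.2f}")
```

Because each project requires its own methodology, the result should be read as a benchmark range rather than a quote.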
| Page properties | Compression | Typical file size per page |
|---|---|---|
| Bitonal | G4 | 100K |
| Bitonal | JBIG2 | 30K |
| Grayscale | | 250K [1] |
| Color | | 600K [2] |
| Composite | G4 | 200K [3] |
| Composite | JBIG2 | 100K [4] |
| PDF Normal - text only | | 30K |

NOTE: G4 and JBIG2 are bitonal compression algorithms.
[1] Assuming scan at 150 DPI using medium-strength 8-bit JPEG compression.
[2] Assuming scan at 150 DPI using medium-strength 24-bit JPEG compression.
[3] Assuming bitonal scan at 300 DPI using G4 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
[4] Assuming bitonal scan at 300 DPI using JBIG2 compression, grayscale at 8-bit 150 DPI, and color at 24-bit 150 DPI using medium-strength JPEG compression.
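The typical sizes in Table 2 can be used to estimate the storage a whole document will need. This is a rough sketch only: the lookup keys and the function name `estimate_size_mb` are assumptions for illustration, and 1 MB is taken as 1,000 KB.

```python
# Typical file sizes in kilobytes per page, copied from Table 2.
# Keys are (page properties, compression); the names are illustrative assumptions.
TYPICAL_KB_PER_PAGE = {
    ("bitonal", "G4"): 100,
    ("bitonal", "JBIG2"): 30,
    ("grayscale", "JPEG"): 250,
    ("color", "JPEG"): 600,
    ("composite", "G4"): 200,
    ("composite", "JBIG2"): 100,
    ("normal_text", None): 30,
}


def estimate_size_mb(page_counts):
    """Estimate total document size in MB (assuming 1 MB = 1,000 KB)."""
    total_kb = sum(
        TYPICAL_KB_PER_PAGE[key] * count for key, count in page_counts.items()
    )
    return total_kb / 1000


# Example: a 420-page manual with 400 bitonal pages and 20 color pages.
with_g4 = estimate_size_mb({("bitonal", "G4"): 400, ("color", "JPEG"): 20})
with_jbig2 = estimate_size_mb({("bitonal", "JBIG2"): 400, ("color", "JPEG"): 20})
print(with_g4, with_jbig2)
```

In this example, switching the bitonal pages from G4 to JBIG2 roughly halves the estimate, because the few color pages then account for half of the total.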
Part 2: PDF Image-Only
Get the lowdown on how to convert documents to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).
PDF Image Only is simply a scanned, non-searchable image of the page inside a PDF wrapper. This limited approach to distributing documents is the cheapest: for simple text documents, prices range between $0.15 and $0.30 per page. It is an ideal solution for archiving legacy documents in digital format.
PDF Image Only File Sizes
In contrast to the relatively small PDF Normal documents authored in word processors or publishing platforms, PDF Image Only files raise the same file size concerns as all image formats, such as TIFF, JPEG, and BMP. Depending on the type of image, color range, and image resolution, file size is frequently a major concern. Two methods to reduce file size are Image Compression and Composite PDF. A combination of the two might yield the best results.
Image Quality
When scanning to PDF Image Only, keep in mind that the quality of the final PDF is largely dependent on the initial capture to digital format. A high-quality scan with minimal post-scan clean-up will always yield better results than a low-quality scan followed by lots of image clean-up. Investing more in an excellent scanner or in better training of the people doing the scanning pays off quickly when compared to the costs of manually having to de-speckle, de-skew, and otherwise fix a bad scan.
Hyperlinking an Image Only PDF
While images are not searchable, there are other navigational aids that can be used with Image Only PDF files. Adobe Acrobat and other tools can be used to add hyperlinks to a PDF Image Only document. For example, you can provide a Table of Contents, Index, or other intra-document linking structure, with each entry linked directly to the relevant page. Alternatively, you can use Acrobat's bookmarking feature, which enables you to create your own Table of Contents-like list of headings that are linked to their respective pages. These bookmarks become part of the PDF file but are not an actual page in the file.
Part 3: PDF Searchable Image
What is PDF Searchable Image?
PDF Searchable Image is a PDF Image Only document with the addition of a text layer beneath the image. This approach inexpensively retains the look of the original page while enabling text searchability. This balanced approach is especially suitable for documents that have to be searchable but would be too expensive to recompose. The text layer is created by an Optical Character Recognition (OCR) application that scans the text on each page. It then creates a PDF file with the recognized text stored in a layer beneath the image of the text.
Who needs this format?
Many corporations, universities, governmental agencies, and other organizations have millions of pages of invaluable information sitting on storage shelves with limited accessibility. When faced with the challenge of converting these pages to digital format, the two most important factors (besides low cost) are easy distribution and an ability to search, link, and index the text. Since much of the process can be automated, PDF Searchable Image allows low-cost conversion from paper to PDF, while permitting the same linking, bookmarking, searching, and indexing that a recomposed PDF Normal document allows.
The disadvantages of this format are:
Care needs to be taken with the conversion process because any OCR errors that have not been cleaned up will be seen if someone pastes text.
If text does not OCR properly, searches will fail.
The technology behind it
Every PDF document, unlike static image formats such as TIFF, JPEG, and BMP, can contain several 'layers' of information. First, there is the 'image layer'. If your PDF page contains any bitmap images, their information, such as the actual image, resolution, compression method, and color depth, is included in this layer. Then there is the 'text layer'. With PDF Searchable Image, the text layer includes the actual ASCII text and an identification of the text's location behind the bitmap of the page. This means that any page, regardless of its contents, is scanned to bitmap format and not recomposed. An OCR run is then performed against any desired area on the bitmap, and the results are stored in the text layer of the final PDF document. The result is an exact bitmapped replica of the scanned paper page, with text information stored behind the bitmap image of the page.
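The two-layer idea can be pictured with a small conceptual model. This sketch does not reflect how a PDF file physically stores its data; the class names, fields, and the file name `page_0001.tif` are all hypothetical and serve only to show how a search against the text layer leads back to a location on the page image.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """One OCR result in the text layer: the text plus its position on the page."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class SearchableImagePage:
    """Conceptual model: a page image with a hidden text layer behind it."""
    image_file: str                                   # image layer (what the reader sees)
    words: List[Word] = field(default_factory=list)   # text layer (what search sees)

    def find(self, term: str) -> List[Word]:
        """Return the text-layer words matching the term, ignoring case."""
        needle = term.lower()
        return [w for w in self.words if w.text.lower() == needle]


# Hypothetical page: the second word was misrecognized by OCR ("t0tal").
page = SearchableImagePage(
    image_file="page_0001.tif",
    words=[Word("Invoice", 72, 700, 60, 12), Word("t0tal", 72, 680, 40, 12)],
)

hits = page.find("invoice")   # found: a viewer could highlight this region of the image
missed = page.find("total")   # not found: the OCR error hides the word from search
print(len(hits), len(missed))
```

The misrecognized word is still perfectly legible in the image layer; it is only the search against the text layer that misses it.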
The trade-offs
While the costs of this approach are lower than those of the re-authoring approaches, they are counterbalanced by two factors that may be an issue for your application: text accuracy and file size.
Text Accuracy
The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters for every 100 may seem like a lot of errors, but this is not a problem for the applications this approach is designed for. Since the user sees a scanned image representation of the original paper page, OCR errors will not be visible to the eye. The errors are only an issue when searching or copying text, which accesses the text layer.
Most text accuracy errors, when converting from good quality paper, result from special characters being picked up incorrectly by the OCR engine. Since the vast majority of searches and linking performed on the PDF file will be done against regular characters and not special characters (which are usually not searched against), the search accuracy for many applications is good enough even at 97-99% textual accuracy. If a higher accuracy level is desired, expect higher conversion prices, since someone will have to manually proofread and correct the documents after they have gone through the OCR process.
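To get a feel for what character-level accuracy means for searching, one can estimate the chance that every character of a search term was recognized correctly. The sketch below makes a strong simplifying assumption that errors are independent and evenly spread across all characters. Since most errors actually fall on special characters, this is a pessimistic lower bound for ordinary words, not a measured result. The function name is an illustrative assumption.

```python
def word_hit_probability(char_accuracy, word_length):
    """Chance that all characters of a word were recognized correctly,
    assuming independent, evenly distributed character errors."""
    return char_accuracy ** word_length


# An 8-letter search term at the two ends of the typical OCR accuracy range.
for accuracy in (0.97, 0.99):
    probability = word_hit_probability(accuracy, 8)
    print(f"{accuracy:.0%} character accuracy: {probability:.1%} chance of a clean word")
```

Under this assumption, an 8-letter word is recognized cleanly roughly 78% of the time at 97% character accuracy and roughly 92% of the time at 99%.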
File Size
File sizes are generally larger since the full image of each page needs to be retained. However, as discussed in Part 2 of this White Paper, where the primary focus is PDF Image Only documents, there are a number of ways to decrease the file size of a PDF file. All information discussed there also pertains to PDF Searchable Image, since Searchable Image is in fact the same as PDF Image Only with an added text layer. A smaller final PDF document usually costs more than a larger one, since the conversion process will have to implement additional steps to decrease file size. An example of this would be Composite PDF.
Part 4: PDF Normal
What is PDF Normal?
PDF Normal is an exact print-ready representation of the source format, whether paper or electronic. All page layout information, such as font properties, resolution and compression of images, and their location on the page, is contained within this format. The easiest way to understand PDF Normal is to think of it as a viewing platform for documents created in a word processing or publishing application: it displays exactly what the author has created. This allows for the most realistic representation of the source. Text in PDF Normal documents is not a scanned bitmap representation of the original, as is the case in PDF Searchable Image. It comes directly from the application in which the document was authored. This ensures that text accuracy is extremely high. Also, the absence of bitmapped text enables the PDF file size to remain as small as possible. In eBooks, for example, this is very important because eBooks are frequently downloaded and small file sizes are therefore essential.
How do I get PDF Normal?
If your source is already in a typeset, electronic format and has been created using a word processor such as MS Word or a desktop publishing application such as Quark, Interleaf or FrameMaker, going to PDF Normal is simple. These applications typically come with a 'Save As PDF' or 'Print To PDF' function, which allows the user to painlessly convert the document to PDF Normal. The author ensures that all text, images, hyperlinking, and other elements of the document are correctly formatted within the authoring application. Once that is done, the document is saved as PDF Normal.
If your source data is paper, however, creating PDF Normal becomes significantly more complicated and expensive.
Paper to PDF Normal conversion
Depending on the quality of the source paper documents, the information must be converted to electronic format either by scanning and OCR or by manual keying. Other elements of the page, such as tables and images, will also have to be ported over to electronic format. OCR engines do a reasonably good job of detecting simple tables; however, expect to do post-OCR clean-up on complex tables. Raster (bitmap) images will have to be scanned, cleaned up, and adjusted to the right color space and resolution. If you would like to include vector images in your final PDF Normal file, you will have to draw them from scratch, since OCR is not able to create vector images from paper.
Typesetting and conversion to PDF Normal
Once all document elements have been captured from paper into electronic format, they need to be typeset in a desktop publishing or word processing environment. This is the step where all final PDF Normal components are created: text layout, hyperlinks, image properties, headers and footers, table structures, and so on. Remember: if you OCR'ed the text from paper, you will need to carefully proofread it to ensure it conforms to the high textual accuracy that PDF Normal users demand: typically 99.995%, or 5 errors in 100,000 characters. In contrast to Searchable Image PDF, any typo in the text will be immediately visible in the final PDF.
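The gap between raw OCR accuracy and the 99.995% target translates directly into proofreading work. The following sketch estimates how many character errors would have to be found and corrected. The function names are illustrative assumptions, and the model simply treats accuracy as the fraction of correct characters.

```python
def expected_errors(char_count, accuracy):
    """Expected number of wrong characters at a given accuracy (0.0 to 1.0)."""
    return char_count * (1.0 - accuracy)


def corrections_needed(char_count, ocr_accuracy, target_accuracy=0.99995):
    """Errors that proofreading must remove to reach the target accuracy."""
    gap = expected_errors(char_count, ocr_accuracy) - expected_errors(
        char_count, target_accuracy
    )
    return max(0.0, gap)


# 100,000 characters of raw OCR text at 97% accuracy, proofread to 99.995%.
print(round(expected_errors(100_000, 0.99995)))    # errors allowed by the target
print(round(corrections_needed(100_000, 0.97)))    # errors that must be fixed
```

For 100,000 characters this works out to about 2,995 corrections at 97% raw accuracy, which illustrates why proofreading dominates the cost of paper-to-PDF Normal conversion.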
This is also a good place to add elements to the document that the paper did not have. For example, if the original paper document did not have a Table of Contents or an Index, you can create one now, link the various entries to the appropriate pages in the file, and thus add value to the overall project.
As mentioned earlier, once typesetting is complete, you can produce the final PDF Normal file simply by using the 'Save As PDF' or 'Print to PDF' function.
Why not scan and OCR straight to PDF Normal?
Most OCR applications are able to produce PDF Normal right out of the OCR stage. Why, then, go through the trouble of typesetting the document? The answer to this question comes with a good understanding of PDF Normal. This format does not leave any room for textual inconsistencies. If one line of text in the PDF is composed of Times New Roman font size 10, and the next line is made up of font size 9.5, the reader will immediately pick it up, just like she would in a Word document. Therefore, you can't rely on the OCR engine to produce a 100% consistent representation of the original paper page in terms of font type and size as well as textual accuracy. Another reason: going directly from OCR to PDF Normal does not allow you to add any value to the project - what you see on paper is what you'll get in the PDF Normal file. This is a wasted opportunity.
PDF Normal: Summary
The complexity and cost of the journey to PDF Normal depend on the format of the source (paper or already typeset electronic format), the complexity of the page layout, and whether you would like to add value to the document you want to produce. Conversion from typeset electronic format to PDF is trivial; conversion from paper is difficult and expensive. However, once you have created PDF Normal from your documents, you are in possession of the best format possible for distributing and publishing your documents on the local network and the Web. For many companies this is an invaluable resource and one that may be critical to business success. It is therefore often worth the extra money to get the best quality PDF Normal.
PDF White Paper: Summary
The PDF format has become a primary choice for representing and distributing information at low cost, both on local networks and on the World Wide Web. The unique ability of PDF to enable documents to be viewed and printed easily has been a prime factor in its success. Just as Microsoft has done with its Windows family of operating systems, PDF has gained a critical mass of end users to achieve a self-sustaining customer base. This ensures that the format will live on for many years to come. The sheer number of plug-ins available for Adobe's Acrobat application also allows users to manipulate their PDF files in any number of ways. The PDF format is thus not a dead end. Using the many tools available, images and text in PDF files can be exported, changed, deleted, and adjusted. Additionally, the many security options that come with PDF permit documents to be protected from tampering, piracy, and fraud. All of these broad possibilities have contributed to PDF's popularity and success.
It is, however, important to point out that PDF is not the panacea of publishing. As pointed out in Part 1 of this White Paper, PDF is not in competition with markup languages such as SGML and XML. If you intend to normalize and repurpose your documents, PDF is not a solution, since the text in PDF files is not styled. Often the ideal solution is a combination of SGML/XML and PDF, where documents are first converted to SGML/XML, loaded into a publishing platform, and then printed to PDF.
Appendix A: Image Compression
Image compression refers to any of several techniques used to reduce image file sizes, usually by removing either redundant information or information that can be recreated prior to display. Reducing file sizes is often important in order to allow image-heavy files to be easily transmitted and stored.
The scope of the problem depends on a number of factors. For example, if the page contains only text or a few black/white (bitonal) images, the problem is limited, since bitonal images compress to very small sizes (typically using CCITT Group 4 compression at the industry-standard 300 DPI resolution). You'll be able to scan the entire page at a single resolution (300 DPI), color depth (bitonal), and compression (CCITT Group 4), and retain a small file size. Using JBIG2 compression, you can even achieve file sizes similar to those of PDF Normal. If the page contains grayscale or color images, however, file size increases dramatically. An 8½ by 11 inch page scanned at 300 DPI with 24-bit color depth would result, uncompressed, in a TIFF file of around 25 MB:
Width: 8 ½ x 300 = 2550 pixels
Length: 11 x 300 = 3300 pixels
2550 x 3300 = 8415000 total page pixels
8415000 x 24 (color-depth bits) = 201960000 bits
201960000 / 8 = 25245000 bytes, or 25.2 MB.
PDF files containing 25.2 MB per page would take a long time to download and will require much disk space to store. Image compression is intended to reduce image sizes.
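The same arithmetic can be written as a small function so that other page sizes, resolutions, and color depths are easy to try. This is a minimal sketch of the calculation shown above; the function name and parameter names are illustrative assumptions.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scanned page in bytes."""
    width_px = round(width_in * dpi)    # 8.5 in x 300 DPI = 2550 pixels
    height_px = round(height_in * dpi)  # 11 in x 300 DPI = 3300 pixels
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits // 8


# Letter-size page under the scan settings discussed in this paper.
color_300 = uncompressed_bytes(8.5, 11, 300, 24)   # 24-bit color at 300 DPI
bitonal_300 = uncompressed_bytes(8.5, 11, 300, 1)  # bitonal at 300 DPI
color_150 = uncompressed_bytes(8.5, 11, 150, 24)   # 24-bit color at 150 DPI

for label, size in [
    ("300 DPI color", color_300),
    ("300 DPI bitonal", bitonal_300),
    ("150 DPI color", color_150),
]:
    print(f"{label}: {size:,} bytes ({size / 1_000_000:.1f} MB)")
```

Even before any compression is applied, halving the resolution cuts the size to a quarter, and dropping from 24-bit color to bitonal cuts it by a factor of 24.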
Image compression methods are either lossy or lossless. Lossy compression algorithms prioritize reducing file size over retaining image quality. JPEG, for example, is a lossy compression method. It is frequently used for color images on the Web, where small image file sizes and thus shorter download times are more important than high quality images. TIFF G4, on the other hand, is a lossless bitonal compression method often used to scan medical, legal, and governmental documents that must retain their original look and feel. Also, when converting to Searchable Image PDF, the OCR (Optical Character Recognition) process required to add the text layer to the PDF will work much better if applied to a lossless, purely bitonal scan. TIFF G4 is therefore often used for OCR. The right compression method for your conversion depends on the following factors:
1. Type of Information (medical, legal, etc.)
2. Range of colors (bitonal, grayscale, color)
3. Resolution required, in DPI (dots per inch)
The following table illustrates the most popular methods of compression and where they are commonly used:
| Compression Method | Lossy/Lossless | Color Range Supported | Application | Compression Ratio |
|---|---|---|---|---|
| TIFF G4 | Lossless | Bitonal | Legal, Defense, Government | 90-95% |
| JBIG2 | Supports both | Bitonal | Legal, Defense, Government | 95-98% |
| LZW/Packbits | Lossless | Color | Medical, IT | LZW: 80-85% |
| JPEG, GIF | JPEG: Lossy | Color | WWW | JPEG: 90-95% |
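As a rough illustration of what these ratios mean in practice, they can be applied to the uncompressed size of a page. The sketch below assumes that "compression ratio" means the percentage by which the file shrinks, which the table does not state explicitly. The function name is also an assumption, and the uncompressed sizes are for an 8½ by 11 inch page.

```python
def compressed_range(uncompressed_bytes, reduction_low_pct, reduction_high_pct):
    """Return (smallest, largest) compressed size in bytes, assuming the
    ratio is the percentage by which the file size is reduced."""
    smallest = uncompressed_bytes * (100 - reduction_high_pct) // 100
    largest = uncompressed_bytes * (100 - reduction_low_pct) // 100
    return smallest, largest


# Bitonal page at 300 DPI: 2550 x 3300 pixels at 1 bit each = 1,051,875 bytes.
g4 = compressed_range(1_051_875, 90, 95)

# 24-bit color page at 150 DPI: 1275 x 1650 pixels at 3 bytes each = 6,311,250 bytes.
jpeg = compressed_range(6_311_250, 90, 95)

print("G4 bitonal:", g4)
print("JPEG color:", jpeg)
```

Under this reading, the upper ends of both ranges land close to the typical per-page sizes in Table 2 (about 100K for G4 bitonal and 600K for JPEG color), which suggests the interpretation is at least plausible.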
While compressing the entire page using one method is the simplest approach, it does not necessarily provide optimal results. Frequently, different types of compression are suitable for different parts of the page. Areas on a page containing text that will undergo an OCR process to produce Searchable Image PDF, for example, should be scanned at a resolution not lower than 300 DPI and using bitonal color depth. Images on the same page, however, can't be scanned at bitonal color depth, since that would convert the color image to monochrome. Scanning the entire page at 300 DPI color will result in a large file size even when using image compression. So if a page contains images and text, a dilemma unfolds: if the compression methods mentioned above allow for only one color depth and one resolution setting, the final PDF produced from the image will either contain color but be large in size and suffer from below-par OCR results, or it will have to be created bitonally to allow for small file sizes and good OCR. This problem is solved with Composite PDF.
Appendix B: Composite PDF
Standard image file formats have a major drawback: you can only have one resolution and one color depth setting for the entire image. For example, in order to scan a page containing mostly text, but also a few color images surrounded by text (think of a medical journal or a computer magazine), you'll typically scan the whole page either at a bitonal setting, which will capture the text and white space optimally but will convert all images to monochrome, or at a color setting, which will pick up the color images beautifully but create unnecessary gray shadings for the text and white space and result in a huge file. You'd also be limited to one resolution. To solve this, PDF allows you to combine many 'zones' on a single page. In the example above, you could scan the whole page to a 300 DPI bitonal TIFF, and then again at 150 DPI JPEG color, and combine them in the final PDF to yield the perfect balance: Composite PDF. This PDF will enjoy the best of both worlds: purely bitonal text and white space areas (which is important to get the best OCR and print results) and true color, compressed image areas. File size will be kept to a minimum since you'll be able to use G4 or JBIG2 compression on all text and white space areas and JPEG for the images.
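The benefit of zoning can be seen even before compression is applied. The sketch below compares the raw pixel data of a whole page scanned in 300 DPI color with a composite made of a full-page 300 DPI bitonal layer plus one color zone at 150 DPI. The 3 by 4 inch color zone and the function names are hypothetical values chosen for illustration.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned area."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def composite_raw_bytes(page_size_in, color_zones_in):
    """Full-page bitonal layer at 300 DPI plus 24-bit color zones at 150 DPI."""
    page_width, page_height = page_size_in
    total = raw_bytes(page_width, page_height, 300, 1)
    for zone_width, zone_height in color_zones_in:
        total += raw_bytes(zone_width, zone_height, 150, 24)
    return total


# Letter-size page with one hypothetical 3 x 4 inch color photograph.
full_color = raw_bytes(8.5, 11, 300, 24)
composite = composite_raw_bytes((8.5, 11), [(3, 4)])

print(f"Whole page in 300 DPI color: {full_color:,} bytes")
print(f"Composite (bitonal + zone):  {composite:,} bytes")
```

The composite starts from well under a tenth of the raw data, and each layer can then be compressed with the method that suits it: G4 or JBIG2 for the bitonal layer and JPEG for the color zone.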
Lazar Weisz
Data Conversion Laboratory















