Adobe PDF Conversion: How, For Whom, And When?

PDF White Paper, Part 3: PDF Searchable Image

Get the lowdown on converting data to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).

What is PDF Searchable Image?
PDF Searchable Image is a PDF Image Only document with a text layer added beneath the image. This approach inexpensively retains the look of the original page while making the text searchable, so it is especially suitable for documents that have to be searchable but would be too expensive to recompose. The text layer is created by an Optical Character Recognition (OCR) application that scans the text on each page and then creates a PDF file with the recognized text stored in a layer beneath the image of the text.

Who needs this format?

Many corporations, universities, governmental agencies, and other organizations have millions of pages of invaluable information sitting on storage shelves with limited accessibility. When they face the challenge of converting these pages to digital format, the two most important factors (besides low cost) are easy distribution and the ability to search, link, and index the text. Since much of the process can be automated, PDF Searchable Image allows low-cost conversion from paper to PDF, while permitting the same linking, bookmarking, searching, and indexing that a recomposed PDF Normal document allows. The disadvantages of this format are reduced text accuracy and larger file size.
The technology behind it

Every PDF document, unlike static image formats such as TIFF, JPEG and BMP, can contain several 'layers' of information. First, there is the 'image layer'. If your PDF page contains any bitmap images, their information, such as the actual image, resolution, compression method and color depth, is included in this layer. Then there is the 'text layer'. With PDF Searchable Image, the text layer includes the actual ASCII text and an identification of the text's location behind the bitmap of the page. This means that any page, regardless of its contents, is scanned to bitmap format and not recomposed. An OCR run is then performed against any desired area of the bitmap and the results are stored in the text layer of the final PDF document. The result is an exact bitmapped replica of the scanned paper page, with text information stored behind the bitmap image of the page.

The trade-offs

While the costs of this approach are lower than those of the re-authoring approaches, the savings are counterbalanced by two factors that may be an issue for your application: text accuracy and file size.

Text Accuracy - The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters for every 100 may seem like a lot of errors, but this is not a problem for the applications this approach is designed for. Since the user sees a scanned image of the original paper page, OCR errors are not visible to the eye. The errors are only an issue when searching or copying text, which accesses the text layer. When converting from good quality paper, most text accuracy errors result from special characters being picked up incorrectly by the OCR engine.
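To put the 97 to 99 percent figure in perspective, the following minimal Python sketch estimates how many wrong characters a page would carry and how likely a search term is to have been recognized without error. It rests on two assumptions that are not from DCL: that character errors are independent and spread evenly across the text, and that a page holds about 2,000 characters (a hypothetical figure chosen for illustration). The function names are likewise hypothetical. Because errors in practice tend to cluster in special characters, the numbers for ordinary words are likely pessimistic.

```python
def expected_errors(char_accuracy: float, chars_per_page: int) -> float:
    """Expected number of misrecognized characters on one page."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if chars_per_page < 0:
        raise ValueError("chars_per_page must not be negative")
    return (1.0 - char_accuracy) * chars_per_page


def word_hit_rate(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word was recognized correctly.

    Assumes independent, evenly distributed character errors, which is
    a simplification and not a property of any particular OCR engine.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 0:
        raise ValueError("word_length must not be negative")
    return char_accuracy ** word_length


if __name__ == "__main__":
    CHARS_PER_PAGE = 2000  # hypothetical page size, for illustration only
    SEARCH_TERM_LENGTH = 8  # hypothetical length of a typical search term

    for accuracy in (0.97, 0.99):
        errors = expected_errors(accuracy, CHARS_PER_PAGE)
        hit = word_hit_rate(accuracy, SEARCH_TERM_LENGTH)
        print(f"accuracy {accuracy:.0%}: about {errors:.0f} wrong characters "
              f"per page, {hit:.1%} chance an {SEARCH_TERM_LENGTH}-character "
              f"term is intact")
```

Under these assumptions, an occurrence of a search term is missed only when one of its own characters was misread, so a term that appears several times in a document is still very likely to be found at least once.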
Since the vast majority of searches and linking performed on the PDF file will be done against regular characters rather than special characters (which are usually not searched against), the search accuracy for many applications is good enough even at 97 to 99 percent text accuracy. If a higher accuracy level is desired, expect higher conversion prices, since someone will have to manually proofread and correct the documents after they have gone through the OCR process.

File Size - File sizes are generally larger since the full image of each page needs to be retained. However, as discussed in Part II of this White Paper, where the primary focus is PDF Image Only documents, there are a number of ways to decrease the file size of a PDF file. All information discussed there also pertains to PDF Searchable Image, since Searchable Image is in fact the same as PDF Image Only with an added text layer. A smaller final PDF document usually costs more than a larger one, since the conversion process has to include additional steps to decrease file size. An example of this would be Composite PDF.

Lazar Weisz
© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.