Adobe PDF Conversion: How, For Whom, And When?

PDF White Paper, Part 3: PDF Searchable Image.

Get the lowdown on converting data to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).

What is PDF Searchable Image?

PDF Searchable Image is a PDF Image Only document with a text layer added beneath the image. This approach inexpensively retains the look of the original page while making the text searchable, so it is especially suitable for documents that have to be searchable but would be too expensive to recompose. The text layer is created by an Optical Character Recognition (OCR) application, which reads the text on each page and then creates a PDF file with the recognized text stored in a layer beneath the image of the text.

Who needs this format?

Many corporations, universities, governmental agencies, and other organizations have millions of pages of invaluable information sitting on storage shelves with limited accessibility. When faced with the challenge of converting these pages to digital format, the two most important factors (besides low cost) are easy distribution and an ability to search, link, and index the text. Since much of the process can be automated, PDF Searchable Image allows low-cost conversion from paper to PDF, while permitting the same linking, bookmarking, searching, and indexing that a recomposed PDF Normal document allows.

The disadvantages of this format are:

  1. Care needs to be taken with the conversion process, because any OCR errors that have not been cleaned up will show when someone copies and pastes text.
  2. If text is not recognized properly by the OCR engine, searches for that text will fail.

The technology behind it

Every PDF document, unlike static image formats such as TIFF, JPEG, and BMP, can contain several 'layers' of information. First, there is the 'image layer'. If your PDF page contains any bitmap images, their information (the actual image, resolution, compression method, and color depth) is included in this layer. Then there is the 'text layer'. With PDF Searchable Image, the text layer includes the actual ASCII text and an identification of the text's location behind the bitmap of the page. This means that any page, regardless of its contents, is scanned to bitmap format and not recomposed. An OCR run is then performed against any desired area on the bitmap, and the results are stored in the text layer of the final PDF document. The result is an exact bitmapped replica of the scanned paper page, with text information stored behind the bitmap image of the page.
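
To make the two layers concrete, here is a minimal sketch in Python of how a single page might be modeled: one bitmap plus a list of recognized words, each with its position behind the image. This is an illustration only and not how PDF stores the data internally. The class names, field names, file name, and coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """One OCR-recognized word and where it sits behind the page bitmap."""
    text: str
    x: int        # left edge, in pixels (hypothetical coordinates)
    y: int        # top edge, in pixels
    width: int
    height: int


@dataclass
class SearchablePage:
    """Hypothetical model of a page: an image layer plus a text layer."""
    image_file: str                                   # the bitmap the reader sees
    dpi: int                                          # scan resolution
    words: List[Word] = field(default_factory=list)   # the hidden text layer

    def search(self, term: str) -> List[Word]:
        """Return the text-layer words that match term, ignoring case."""
        needle = term.lower()
        return [w for w in self.words if w.text.lower() == needle]


page = SearchablePage(
    image_file="page_001.tif",
    dpi=300,
    words=[
        Word("Data", 300, 450, 120, 40),
        Word("Conversion", 440, 450, 310, 40),
        Word("Laboratory", 770, 450, 300, 40),
    ],
)

# A search runs against the text layer and returns positions on the bitmap.
hits = page.search("conversion")
for w in hits:
    print(f"'{w.text}' found at x={w.x}, y={w.y}")
```

The search never looks at the bitmap itself. It matches against the text layer and uses the stored position to point at the right spot on the page image, which is why a word the OCR engine misread cannot be found even though it looks correct on screen.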

The trade-offs

While this approach costs less than the re-authoring approaches, the savings are counterbalanced by two factors that may be an issue for your application: text accuracy and file size.

Text Accuracy - The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters in every 100 may seem like a lot of errors, but this is not a problem for the applications this approach is designed for. Since the user sees a scanned image representation of the original paper page, OCR errors will not be visible to the eye. The errors are only an issue when searching or copying text, which accesses the text layer.

Most text accuracy errors, when converting from good quality paper, result from special characters being picked up incorrectly by the OCR engine. Since the vast majority of searches and linking performed on the PDF file will be done against regular characters and not special characters (which are usually not searched against), the search accuracy for many applications is good enough even at 97 to 99 percent textual accuracy. If a higher accuracy level is desired, expect higher conversion prices, since someone will have to manually proofread and correct the documents after they have gone through the OCR process.
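
A rough way to see what 97 to 99 percent character accuracy means for searching is to work out the arithmetic. The sketch below assumes that errors fall independently and evenly across all characters, which is a simplification and a pessimistic one, since the errors described above cluster in special characters. The 2,000-character page and the 8-letter search word are assumed figures chosen for illustration.

```python
def expected_character_errors(char_accuracy: float, char_count: int) -> float:
    """Expected number of wrong characters in a run of char_count characters."""
    return (1.0 - char_accuracy) * char_count


def word_correct_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word was recognized correctly,
    assuming independent errors (a simplifying assumption)."""
    return char_accuracy ** word_length


PAGE_CHARACTERS = 2000   # assumed size of a typical text page
WORD_LENGTH = 8          # assumed length of a search term

for accuracy in (0.97, 0.99):
    errors = expected_character_errors(accuracy, PAGE_CHARACTERS)
    p_word = word_correct_probability(accuracy, WORD_LENGTH)
    print(f"accuracy {accuracy:.0%}: about {errors:.0f} wrong characters per page, "
          f"{p_word:.1%} chance an {WORD_LENGTH}-letter word is intact")
```

Under these assumptions an 8-letter word survives intact roughly 78 percent of the time at 97 percent accuracy and roughly 92 percent of the time at 99 percent. In practice the odds for ordinary words should be better than this, because errors concentrated in special characters leave regular text largely untouched.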

File Size - File sizes are generally larger since the full image of each page needs to be retained. However, as discussed in Part 2 of this White Paper, where the primary focus is PDF Image Only documents, there are a number of ways to decrease the file size of a PDF file. All information discussed there also pertains to PDF Searchable Image, since Searchable Image is in fact the same as PDF Image Only with an added text layer. A smaller final PDF document usually costs more than a larger one, since the conversion process will have to implement additional steps to decrease file size. An example of this would be Composite PDF.
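
To give a sense of why retaining the full page image drives file size, the following sketch computes the uncompressed size of a scanned page bitmap from its dimensions, resolution, and color depth. The US Letter page size and the 300 dpi resolution are assumptions chosen for illustration. Compression, which is what actually determines the final PDF size, is deliberately not modeled, and row padding is ignored.

```python
def raw_page_bytes(width_in: float, height_in: float,
                   dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed bitmap size in bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed scan settings: US Letter (8.5 x 11 inches) at 300 dpi.
COLOR_DEPTHS = {
    "black and white (1-bit)": 1,
    "grayscale (8-bit)": 8,
    "full color (24-bit)": 24,
}

for label, bits in COLOR_DEPTHS.items():
    size = raw_page_bytes(8.5, 11, 300, bits)
    print(f"{label}: {size:,} bytes before compression")
```

For these settings the raw bitmap comes to 1,051,875 bytes in black and white and 25,245,000 bytes in full color, which shows how much the choice of color depth matters before any compression is applied.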

Lazar Weisz
Data Conversion Laboratory

Read more of this white paper on converting data to PDF:

  1. Overview: www.dclab.com/pdf_conversion.asp
  2. PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
  3. PDF Searchable Image: www.dclab.com/pdfconversion3.asp
  4. PDF Normal: www.dclab.com/pdf_whitepaper_4.asp


© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.
 