Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Adobe PDF Conversion: How, For Whom, And When?

PDF White Paper, Part 3: PDF Searchable Image.

Get the lowdown on converting data to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).

What is PDF Searchable Image?


PDF Searchable Image is a PDF Image Only document with the addition of a text layer beneath the image. This approach inexpensively retains the look of the original page while enabling text searchability. This balanced approach is especially suitable for documents that have to be searchable but would be too expensive to recompose. The text layer is created by an Optical Character Recognition (OCR) application that scans the text on each page. It then creates a PDF file with the recognized text stored in a layer beneath the image of the text.

Who needs this format?

Many corporations, universities, governmental agencies, and other organizations have millions of pages of invaluable information sitting on storage shelves with limited accessibility. When faced with the challenge of converting these pages to digital format, the two most important factors (besides low cost) are easy distribution and an ability to search, link, and index the text. Since much of the process can be automated, PDF Searchable Image allows low-cost conversion from paper to PDF, while permitting the same linking, bookmarking, searching, and indexing that a recomposed PDF Normal document allows.

The disadvantages of this format are:

  1. The conversion process requires care, because any OCR errors that have not been cleaned up become visible when someone copies and pastes text.
  2. If text is not recognized correctly by OCR, searches for that text will fail.
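
Both disadvantages have the same cause: searching and copying read the hidden OCR text, not the page image. The short Python sketch below illustrates this with a made-up sentence and a made-up OCR misreading ("rn" stored where the page shows "m"). The strings and the search function are illustrative assumptions, not output from any real OCR engine or PDF viewer.

```python
# Hypothetical example: what the page image shows vs. what OCR stored.
printed_text = "The modem settings are listed in the manual."
ocr_text = "The rnodem settings are listed in the rnanual."  # assumed OCR errors


def search(text_layer: str, term: str) -> bool:
    """Case-insensitive exact search against the hidden text layer."""
    return term.lower() in text_layer.lower()


# The reader sees "manual" on the page image, but the search runs
# against the OCR text, so the misread word is never found.
found_in_print = search(printed_text, "manual")
found_in_ocr = search(ocr_text, "manual")

# Words that were recognized correctly are still found.
found_clean_word = search(ocr_text, "settings")

print(found_in_print, found_in_ocr, found_clean_word)
```

Copying the sentence from such a page would paste the OCR text, errors included, which is why uncorrected errors surface only at that point.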

The technology behind it

Unlike static image formats such as TIFF, JPEG, and BMP, every PDF document can contain several 'layers' of information. First, there is the 'image layer'. If your PDF page contains any bitmap images, their information, such as the actual image, resolution, compression method, and color depth, is included in this layer. Then there is the 'text layer'. With PDF Searchable Image, the text layer includes the actual ASCII text and an identification of the text's location behind the bitmap of the page. This means that any page, regardless of its contents, is scanned to bitmap format and not recomposed. An OCR run is then performed against any desired area of the bitmap, and the results are stored in the text layer of the final PDF document. The result is an exact bitmapped replica of the scanned paper page, with the text information stored behind the bitmap image of the page.
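
The two-layer idea can be sketched as a small data model. This is a conceptual illustration in Python, not the PDF file format itself: the class names, the field names, and the coordinates are all hypothetical, and the image bytes are a placeholder for a real scan.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One OCR result: the recognized text and where it sits on the page."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in page units; values below are made up


@dataclass
class SearchablePage:
    """Conceptual model only: a page bitmap plus a hidden text layer."""
    image: bytes                                # image layer: what the reader sees
    words: list = field(default_factory=list)   # text layer: what search uses

    def find(self, term: str) -> list:
        """Return the boxes of words matching the term, e.g. for highlighting."""
        term = term.lower()
        return [w.box for w in self.words if w.text.lower() == term]

    def copy_text(self) -> str:
        """Copying reads the text layer, never the bitmap."""
        return " ".join(w.text for w in self.words)


page = SearchablePage(
    image=b"<scanned bitmap bytes would go here>",
    words=[
        Word("Annual", (72, 700, 130, 714)),
        Word("Report", (136, 700, 190, 714)),
    ],
)

hits = page.find("report")
print(hits)
print(page.copy_text())
```

Because each word carries its position, a viewer can highlight a search hit on top of the scanned image even though the visible page is only a bitmap.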

The trade-offs

While this approach costs less than re-authoring, the savings are counterbalanced by two factors that may be an issue for your application: text accuracy and file size.

Text Accuracy - The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters in every 100 may seem like a lot of errors, but this is not a problem for the applications this approach is designed for. Since the user sees a scanned image of the original paper page, OCR errors are not visible to the eye. The errors matter only when searching or copying text, both of which access the text layer.

Most text accuracy errors, when converting from good-quality paper, result from the OCR engine misreading special characters. Since the vast majority of searches and links in the PDF file involve regular characters rather than special characters (which are rarely searched for), search accuracy is good enough for many applications even at 97 to 99 percent text accuracy. If a higher accuracy level is required, expect higher conversion prices, since someone will have to manually proofread and correct the documents after OCR.
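
To put the 97 to 99 percent figures in perspective, the sketch below works through the arithmetic in Python. It rests on two assumptions that are not from this paper: a page of 2,000 characters, and errors that fall independently and evenly on every character. Since errors actually cluster in special characters, the word-level figures should be read as a pessimistic estimate for ordinary search terms.

```python
def expected_errors(chars_on_page: int, accuracy: float) -> float:
    """Expected number of wrong characters on a page."""
    return chars_on_page * (1.0 - accuracy)


def word_ok_probability(word_length: int, accuracy: float) -> float:
    """Chance that every character of a word is correct,
    assuming independent, evenly spread errors (a simplification)."""
    return accuracy ** word_length


PAGE_CHARS = 2000  # assumed page size, for illustration only
WORD_LENGTH = 8    # assumed length of a typical search term

for accuracy in (0.97, 0.99):
    errors = expected_errors(PAGE_CHARS, accuracy)
    p_word = word_ok_probability(WORD_LENGTH, accuracy)
    print(
        f"{accuracy:.0%} accuracy: about {errors:.0f} wrong characters per page, "
        f"{p_word:.0%} chance an {WORD_LENGTH}-letter word is fully correct"
    )
```

Under these assumptions, moving from 97 to 99 percent accuracy cuts the expected errors per page to a third, which is the improvement that manual proofreading is paying for.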

File Size - File sizes are generally larger, since the full image of each page must be retained. However, as discussed in Part 2 of this White Paper, which focuses on PDF Image Only documents, there are a number of ways to decrease the size of a PDF file. Everything discussed there also applies to PDF Searchable Image, since Searchable Image is in fact the same as PDF Image Only with an added text layer. A smaller final PDF document usually costs more to produce than a larger one, since the conversion process has to include additional steps to decrease file size. An example of this would be Composite PDF.
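
A rough calculation shows why the page image dominates file size. The Python sketch below estimates the raw, uncompressed size of one scanned page. The page dimensions (8.5 by 11 inches), the 300 dpi resolution, and the color depths are assumptions chosen for illustration. Real PDF files are much smaller because the images are compressed, so these numbers are only a starting point for comparison.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one scanned page image, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed scan settings, for illustration only.
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11, 300

bitonal = raw_page_bytes(WIDTH_IN, HEIGHT_IN, DPI, 1)   # black and white
gray = raw_page_bytes(WIDTH_IN, HEIGHT_IN, DPI, 8)      # 256 gray levels
color = raw_page_bytes(WIDTH_IN, HEIGHT_IN, DPI, 24)    # full color

for label, size in (("bitonal", bitonal), ("grayscale", gray), ("color", color)):
    print(f"{label}: {size:,} bytes before compression")
```

By comparison, the text layer for a page of a few thousand characters is tiny, so the choices that matter for file size are the ones made about the image: resolution, color depth, and compression.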

Lazar Weisz
Data Conversion Laboratory

Read more of this white paper on converting data to PDF:

  1. Overview: www.dclab.com/pdf_conversion.asp
  2. PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
  3. PDF Searchable Image: www.dclab.com/pdfconversion3.asp
  4. PDF Normal: www.dclab.com/pdf_whitepaper_4.asp


© 2002/2003 Data Conversion Laboratory. All rights reserved.
This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.