Adobe PDF Conversion: How, For Whom, And When?
PDF White Paper, Part 3: PDF Searchable Image.
Get the lowdown on converting data to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).
What is PDF Searchable Image?
PDF Searchable Image is a PDF Image Only document with the addition of a text layer beneath the image. This approach inexpensively retains the look of the original page while enabling text searchability. This balanced approach is especially suitable for documents that have to be searchable but would be too expensive to recompose. The text layer is created by an Optical Character Recognition (OCR) application that scans the text on each page. It then creates a PDF file with the recognized text stored in a layer beneath the image of the text.
Who needs this format?
Many corporations, universities,
governmental agencies, and other organizations have millions of pages
of invaluable information sitting on storage shelves with limited
accessibility. When faced with the challenge of converting these
pages to digital format, the two most important factors (besides low
cost) are easy distribution and an ability to search, link, and index
the text. Since much of the process can be automated, PDF Searchable
Image allows low-cost conversion from paper to PDF, while permitting
the same linking, bookmarking, searching, and indexing that a
recomposed PDF Normal document allows.
The disadvantages of this format are:
- The conversion process requires care, because any OCR errors that
have not been cleaned up will appear when someone copies and pastes
text.
- If text is not recognized correctly by the OCR engine, searches for
that text will fail.
The technology behind it
Every PDF document, unlike static image
formats such as TIFF, JPEG and BMP, can contain
several 'layers' of information. First, there is the 'image layer'.
If your PDF page contains any bitmap images, their information, such
as the actual image, resolution, compression method and color depth,
is included in this layer. Then there is the 'text layer'. With PDF
Searchable Image, the text layer includes the actual ASCII text and
an identification of the text's location behind the bitmap of the
page. This means that any page, regardless of its contents, is
scanned to bitmap format and not recomposed. An OCR run is then
performed against any desired area on the bitmap and the results are
stored in the text layer of the final PDF document. The result is an
exact bitmapped replica of the scanned paper page, with text
information stored behind the bitmap image of the page.
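The relationship between the two layers can be sketched as a small data model. This is only an illustration of the idea described above, not how a PDF file is actually encoded: the class names `Word` and `Page`, the file name, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """One OCR result: the recognized text and where it sits on the page."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Page:
    """A searchable-image page: a scanned bitmap plus a hidden text layer."""
    image_file: str          # image layer: the scanned bitmap (hypothetical name)
    dpi: int                 # image layer: scan resolution
    words: List[Word] = field(default_factory=list)  # text layer

    def search(self, term: str) -> List[Word]:
        """Return the text-layer words containing the term, ignoring case."""
        needle = term.lower()
        return [w for w in self.words if needle in w.text.lower()]


# Illustrative page with three recognized words and made-up positions.
page = Page(
    image_file="page-001.tif",
    dpi=300,
    words=[
        Word("Annual", 72, 700, 60, 12),
        Word("Report", 136, 700, 58, 12),
        Word("2002", 198, 700, 36, 12),
    ],
)

hits = page.search("report")
for hit in hits:
    print(hit.text, "found at", (hit.x, hit.y))
```

The reader only ever sees the bitmap. A search runs against the text layer, and the stored positions tell the viewer which area of the image to highlight.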
The trade-offs
While the cost of this approach is
lower than that of the re-authoring approaches, this advantage is
counterbalanced by two factors that may be an issue for your
application: text accuracy and file size.
Text Accuracy - The OCR
process required to create PDF Searchable Image typically provides
text accuracy of 97 to 99 percent. One to three wrong characters for
every 100 may seem like a lot of errors, but this is not a problem for
the applications this approach is designed for. Since the user
sees a scanned image representation of the original paper page, OCR
errors will not be visible to the eye. The errors are only an issue
when searching or copying text, which accesses the text layer.
Most text accuracy errors, when
converting from good quality paper, result from special characters
being picked up incorrectly by the OCR engine. Since the vast
majority of searches and linking performed on the PDF file will be
done against regular characters and not special characters (which are
usually not searched against), the search accuracy for many
applications is good enough even at 97 to 99 percent text accuracy.
If a higher accuracy level is desired, expect higher conversion
prices, since someone will have to manually proofread and correct
the documents after the OCR process.
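To get a feel for what 97 to 99 percent means in practice, the arithmetic can be sketched as follows. The page size of 2,000 characters and the 8-letter search word are assumptions chosen for illustration. The sketch also assumes every character is equally likely to be wrong, which is pessimistic, since most errors fall on special characters.

```python
def expected_char_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of wrongly recognized characters on one page."""
    return chars_per_page * (1.0 - accuracy)


def word_hit_probability(word_length: int, accuracy: float) -> float:
    """Chance that every character of a word was recognized correctly,
    assuming errors are independent and uniform (a simplification)."""
    return accuracy ** word_length


# Assumed figures: 2,000 characters per page, an 8-letter search word.
for accuracy in (0.97, 0.99):
    errors = expected_char_errors(2000, accuracy)
    hit = word_hit_probability(8, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} wrong characters "
          f"per page, {hit:.0%} chance an 8-letter word is intact")
```

Under these assumptions, an 8-letter word survives OCR intact roughly 78 percent of the time at 97 percent accuracy and roughly 92 percent of the time at 99 percent. Real search results should be better than this, because the errors are not spread evenly across ordinary words.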
File Size - File sizes
are generally larger since the full image of each page needs to be
retained. However, as discussed in Part II of this White
Paper, where the primary focus is PDF Image Only documents, there
are a number of ways to decrease the file size of a PDF file. All
information discussed there also pertains to PDF Searchable Image,
since Searchable Image is in fact the same as PDF Image Only with an
added text layer. A smaller final PDF document usually costs more
than a larger one, since the conversion process will have to
implement additional steps to decrease file size. An example of this
would be Composite PDF.
Lazar Weisz
Data Conversion Laboratory
Read more of this white paper on converting data to PDF:
- Overview: www.dclab.com/pdf_conversion.asp
- PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
- PDF Searchable Image: www.dclab.com/pdfconversion3.asp
- PDF Normal: www.dclab.com/pdf_whitepaper_4.asp
© 2002/2003 Data Conversion Laboratory. All rights reserved. This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.