Adobe PDF Conversion: How, For Whom, And When?
PDF White Paper, Part 4: PDF Normal.
Get the lowdown on how to convert data to PDF from Lazar Weisz, PDF expert at Data Conversion Laboratory (DCL).
What is PDF Normal?
PDF Normal is an exact print-ready representation of the source format, whether paper or electronic. All page layout information, such as font properties, resolution and compression of images, and their location on the page, is contained within this format. The easiest way to understand PDF Normal is to think of it as a viewing platform for documents created in a word processing or publishing application: it displays exactly what the author has created. This allows for the most realistic representation of the source. Text in a PDF Normal document is not a scanned bitmap representation of the original, as is the case in PDF Searchable Image; it comes directly from the application in which the document was authored. This ensures that text accuracy is extremely high. Also, the absence of bit-mapped text images keeps the PDF file size as small as possible. In eBooks, for example, this is very important because eBooks are frequently downloaded and small file sizes are therefore essential.
How do I get PDF Normal?
If your source is already in a typeset, electronic format and has
been created using a word processor such as MS Word or a desktop
publishing application such as Quark, Interleaf or FrameMaker, going
to PDF Normal is simple. These applications typically come with a
'Save As PDF' or 'Print To PDF' function, which allows the user to
painlessly convert the document to PDF Normal. The author ensures
that all text, images, hyperlinking, and other elements of the
document are correctly formatted within the authoring application.
Once that is done, the document is saved as PDF Normal.
If your source data is paper, however, creating PDF Normal becomes
significantly more complicated and expensive.
Paper to PDF Normal conversion
Depending on the quality of the source paper documents, the
information must be converted to electronic format either by scanning
& OCR or by manual keying. Other elements of the page, such as
tables and images, will also have to be ported over to electronic
format. OCR engines do a reasonably good job of detecting simple tables;
however, expect to do post-OCR cleanup on complex tables. Raster -
or bitmap - images will have to be scanned, cleaned up, and adjusted
to the right color space and resolution. If you would like to include
vector images in your final PDF Normal file, you will have to draw
them from scratch, since OCR is not able to create vector images from
paper.
Typesetting and conversion to PDF Normal
Once all document elements have been captured from paper into electronic format, they need to be typeset in a desktop publishing or word processing environment. This is the step where all final PDF Normal components are created: text layout, hyperlinks, image properties, headers and footers, table structures, and so on. Remember: if you OCR'ed the text from paper, you will need to proofread it carefully to ensure it meets the high textual accuracy that PDF Normal users demand: typically 99.995%, or 5 errors in 100,000 characters. Unlike in PDF Searchable Image, any typo in the text will be immediately visible in the final PDF.
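To make the accuracy target concrete, here is a minimal sketch of the arithmetic behind "99.995%, or 5 errors in 100,000 characters". The function names and the idea of counting errors during proofreading are assumptions made for illustration; they are not part of any DCL tool. Integer arithmetic is used so that rounding does not affect the result.

```python
# Target from the text: 99.995% accuracy, i.e. at most 5 errors
# per 100,000 characters. All names below are hypothetical.
ERRORS_PER_BLOCK = 5
BLOCK_SIZE = 100_000


def max_allowed_errors(char_count: int) -> int:
    """Largest number of character errors that still meets the target."""
    return char_count * ERRORS_PER_BLOCK // BLOCK_SIZE


def meets_target(error_count: int, char_count: int) -> bool:
    """True if error_count errors in char_count characters is within the target."""
    return error_count * BLOCK_SIZE <= char_count * ERRORS_PER_BLOCK


if __name__ == "__main__":
    chars = 250_000  # assumed size of a proofread document, in characters
    print(max_allowed_errors(chars))   # 12
    print(meets_target(12, chars))     # True
    print(meets_target(13, chars))     # False
```

For a 250,000-character document, for example, the budget works out to 12 errors, which gives a sense of how little room OCR output leaves before proofreading.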
This is also a good place to add elements to the document that the
paper did not have. For example, if the original paper document did
not have a Table of Contents or an Index, you can create one now,
link the various entries to the appropriate pages in the file, and
thus add value to the overall project.
As mentioned earlier, once typesetting is complete, you can
produce the final PDF Normal file simply by using the 'Save As PDF'
or 'Print to PDF' function.
Why not scan and OCR straight to PDF Normal?
Most OCR applications are able to produce PDF Normal right out of the
OCR stage. Why, then, go through the trouble of typesetting the
document? The answer to this question comes with a good understanding
of PDF Normal. This format does not leave any room for textual
inconsistencies. If one line of text in the PDF is composed of Times
New Roman font size 10, and the next line is made up of font size
9.5, the reader will immediately pick it up, just as she would in a
Word document. Therefore, you can't rely on the OCR engine to produce
a 100% consistent representation of the original paper page in terms
of font type and size as well as textual accuracy. Another reason:
going directly from OCR to PDF Normal does not allow you to add any
value to the project - what you see on paper is what you'll get in
the PDF Normal file. This is a wasted opportunity.
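The font inconsistency described above can be checked mechanically before typesetting. The sketch below assumes, purely for illustration, that the OCR output has already been exported as a list of lines, each with a font name and size; the OcrLine record and the find_font_outliers function are hypothetical and do not correspond to any particular OCR product. It flags every line whose font differs from the most common one.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class OcrLine(NamedTuple):
    """One line of recognized text (hypothetical record layout)."""
    text: str
    font: str
    size: float


def find_font_outliers(lines: List[OcrLine]) -> List[Tuple[int, OcrLine]]:
    """Return (index, line) pairs whose font/size differ from the dominant one."""
    if not lines:
        return []
    # The most frequent (font, size) pair is treated as the intended body font.
    dominant, _ = Counter((ln.font, ln.size) for ln in lines).most_common(1)[0]
    return [(i, ln) for i, ln in enumerate(lines) if (ln.font, ln.size) != dominant]


if __name__ == "__main__":
    page = [
        OcrLine("First line of the paragraph", "Times New Roman", 10.0),
        OcrLine("second line of the paragraph", "Times New Roman", 10.0),
        OcrLine("third line, misdetected size", "Times New Roman", 9.5),
        OcrLine("fourth line of the paragraph", "Times New Roman", 10.0),
    ]
    for index, line in find_font_outliers(page):
        print(index, line.font, line.size, line.text)
```

A check like this only finds candidates for review; headings and captions legitimately use other fonts, so the flagged lines still need a human decision during typesetting.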
PDF Normal: Summary
The complexity and cost of the journey to PDF Normal depend on the
format of the source (paper or already typeset electronic format),
the complexity of the page layout, and whether you would like to add
value to the document you want to produce. Conversion from typeset
electronic format to PDF is trivial; conversion from paper is
difficult and expensive. However, once you have created PDF Normal
from your documents, you are in possession of the best format
possible for distributing and publishing your documents on the local
network and the Web. For many companies this is an invaluable
resource and one that may be critical to business success. It is
therefore often worth the extra money to get the best quality PDF
Normal.
PDF White Paper: Summary
The PDF format has become a primary choice for representing and
distributing information at low cost, both on local networks and on
the World Wide Web. The unique ability of PDF to enable documents
to be viewed and printed easily has been a prime factor in its
success. Just as Microsoft has done with its Windows family of
Operating Systems, PDF has gained a critical mass of end-users to
achieve a self-sustaining customer base. This ensures that the format
will live on for many years to come. The sheer number of plug-ins
available for Adobe's Acrobat application also allows users to
manipulate their PDF files in any number of ways. The PDF format is
thus not a dead-end. Using the many tools available, images and text
in PDF files can be exported, changed, deleted, and adjusted.
Additionally, the many security options that come with PDF permit
documents to be protected from tampering, piracy, and fraud. All of
these broad possibilities have contributed to PDF's popularity and
success.
It is, however, important to point out that PDF is not the panacea
of publishing. As noted in Part I of this White
Paper, PDF is not in competition with markup languages such as SGML
and XML. If you intend to normalize and repurpose your documents, PDF
is not a solution, since the text in PDF files is not styled. Often
the ideal solution is a combination of SGML/XML and PDF, where
documents are first converted to SGML/XML, loaded into a publishing
platform, and then printed to PDF.
For additional information on the relationship between PDF and
SGML/XML, please see the following items on our FAQ page:
Lazar Weisz
Data Conversion Laboratory
Read more of this white paper on how to convert data to PDF:
- Overview: www.dclab.com/pdf_conversion.asp
- PDF Image Only: www.dclab.com/pdfwhitepaper2.asp
- PDF Searchable Image: www.dclab.com/pdfconversion3.asp
- PDF Normal: www.dclab.com/pdf_whitepaper_4.asp
© 2002/2003 Data Conversion Laboratory. All rights reserved. This White Paper is for informational purposes only. Data Conversion Laboratory makes no warranties in this document, expressed or implied.