Alphabet Soup
Or, What File Format Should I Really Use?
adapted from The End Sheet, Summer 2000
by John Lynch
There’s no doubt: The Internet has changed the face of publishing, and data standardization has become a major issue. Whether you’re a small publisher or one of the giants, you’ve most likely produced huge volumes of material in a multitude of electronic formats. You may have tried to set standards, but it’s difficult to expect all authors to format files exactly the way you specify. Who are you to tell that important author that she can’t use her 1985 Macintosh? Or that Nobel Prize winner that he must use Word 97?
As data reuse and relicensing become essential to success (or even survival), standardization starts to make more and more sense. Increasingly, publishers are faced with the daunting task of integrating disparate files, produced in a multitude of inconsistent formats, and converting them to formats suitable for Web publishing.
The choices are confusing, and making the right decisions early is
critical.

What Are the Data-Use Issues?
Before you begin the physical task of conversion, there are several important issues to address:
1. You must first be able to produce and distribute pages that look exactly the way you’d like them to look, with the fonts you specify and the page integrity you have chosen.
2. The data you produce must be consistent and meet your standards.
3. You must produce a version of that data suitable for derivative uses (e.g., the Internet, handheld devices, audio devices, CDs, etc.).
4. When producing that repurposed data, you must also build in the ability to find information through text searches and more advanced searches that depend on context and “understanding.”
5. In addition, you should consider using portions of that data for different products and for the “back end” of your operation, such as customer service and accounting.
6. Finally, it may be advantageous to produce your materials so that others can use them and so that you can incorporate that information into products belonging to other organizations.

What Are the File Formats?
Several file formats have emerged as standards in the industry. You may already use some; others may never make it into your arsenal of tools.
Native Formats: Common word processing or publishing formats in which documents are generally produced. Word, WordPerfect, Quark, PageMaker, and Excel are native formats.
TIFF (Tagged Image File Format): Common format for exchanging raster (bitmapped) images between application programs. Similar to a photographic image of the page, and usually produced by scanning. Files are flat, which means that text in the image cannot be searched.
PDF (Portable Document Format): Proprietary print format intended to reproduce documents as originally composed. Depending on how they’re produced, these files may contain text (with varying degrees of accuracy), or they may be image only. Requires freely available software, Adobe Acrobat Reader, to view, print, and search. Not optimized for viewing on-screen.
HTML (Hypertext Markup Language): Set of “markup tags,” loosely modeled on SGML, specifically intended to support files for display on the World Wide Web. The markup tells the browser how to display a Web page’s text and images.
SGML (Standard Generalized Markup Language): Internationally agreed-upon standard for information representation. Provides a structure for defining document tag sets for a wide variety of applications. The tag sets allow the appearance and content to be separated so that the information can be reformatted for different uses.
XML (Extensible Markup Language): Streamlined version of SGML that makes it possible to use and display information in different ways by defining and separating structure and elements.

How Do the Different Technologies Stack Up?
Finding the right format requires analysis on several levels. When choosing a file format to use for your data, keep in mind your target audience, the long-term goals of your organization, and expected future uses. Here’s a brief analysis of each of the file formats, with pros and cons and typical uses:
| DATA USE ISSUES | NATIVE FORMATS | TIFF | PDF | HTML | SGML | XML |
|---|---|---|---|---|---|---|
| Distributing page images | very good | excellent | excellent | good | good | good |
| Enforcing standards | limited | none | none | limited | excellent | excellent |
| Repurposing | limited | none | limited | limited | excellent | excellent |
| Searching | limited | none | limited | good | excellent | excellent |
| Component reuse | limited | none | none | limited | excellent | excellent |
| Data interchange | limited | none | none | limited | excellent | excellent |
| Relative cost per page in $ | 0.00 | 0.25-0.75 | 0.50-3.00 | 2.00-5.00 | 2.00-8.00 | 2.00-6.00 |
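
The per-page figures in the cost row can be turned into a rough budget range for a whole collection. The sketch below is only an illustration: the dollar ranges are copied from the table, while the function name `conversion_cost` and the 10,000-page collection size are assumptions made for the example, not figures from the article.

```python
# Per-page conversion cost ranges in dollars, copied from the comparison table.
COST_PER_PAGE = {
    "Native": (0.00, 0.00),
    "TIFF": (0.25, 0.75),
    "PDF": (0.50, 3.00),
    "HTML": (2.00, 5.00),
    "SGML": (2.00, 8.00),
    "XML": (2.00, 6.00),
}


def conversion_cost(fmt, pages):
    """Return the (low, high) dollar estimate for converting `pages` pages."""
    low, high = COST_PER_PAGE[fmt]
    return low * pages, high * pages


if __name__ == "__main__":
    pages = 10_000  # hypothetical collection size, not a figure from the article
    for fmt in COST_PER_PAGE:
        low, high = conversion_cost(fmt, pages)
        print(f"{fmt:<6} ${low:>10,.2f} to ${high:>10,.2f}")
```

A range like this covers conversion only. It says nothing about the training, tools, or support staff that the richer formats may require.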
Native Formats
Pros:
• No additional investment
• System already in place
• No additional training
Cons:
• Will it continue to be supported?
• Limited enforcement of standards
• Limited repurposing capabilities
Typical uses:
• Documents intended for print only
• Internal documents
• Documents with limited audience

TIFF
Pros:
• Exact representation of pages
• Inexpensive to produce from paper via scanning
Cons:
• Large file sizes
• Not suitable for applications that must be searched
• Cannot reorganize information
Typical uses:
• Images
• Paper document archives
• “Dead” documents

PDF
Pros:
• “Almost exact” representation of page
• Inexpensive to produce
Cons:
• Limited search capability
• Cannot edit or modify resulting files
• Difficult to read on-screen
• Proprietary format
• Large file sizes
• Requires separate software; versions can change and affect display
Typical uses:
• Reference documents that must retain original look
• Documents that would normally be printed

HTML
Pros:
• Designed specifically for the Internet
• Widely supported
• Automatically produced by some software
Cons:
• Limited formatting capabilities
• Tagging relates to look, not content
• Making it “look right” needs tweaking
• Difficult to edit
• A moving standard
Typical uses:
• Internet publishing
• Quick Web applications

SGML
Pros:
• Adaptable to many applications
• Well established
• International standard
• Content searching
Cons:
• Significant investment
• Training issues
• Requires professional support staff
• Must use structured documents
Typical uses:
• Large collections
• Technical documentation
• Materials intended for multipurposing

XML
Pros:
• Robust
• Advantages of SGML but easier to get started
• Can start gradually, adding functionality as you go
• Enhanced features for Internet and e-commerce
• Widely supported
• Many are building tools that may be inexpensive
Cons:
• Tools not yet fully developed
• Additional training for new tools
Typical uses:
• Internet applications
• Catalogs
• “Living” documents
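
To make the contrast between “tagging relates to look” and tagging that describes content more concrete, here is a minimal sketch using Python’s standard `xml.etree.ElementTree` module. The catalog records, the element names (`catalog`, `book`, `title`, `author`), and the helper functions are all invented for illustration and do not come from any real publisher’s tag set. The sketch shows one structured source being searched by context and then rendered two different ways.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical records; element names and values are invented for this sketch.
CATALOG_XML = """
<catalog>
  <book>
    <title>A Guide to Smith</title>
    <author>Jones</author>
  </book>
  <book>
    <title>Scanning Basics</title>
    <author>Smith</author>
  </book>
</catalog>
"""

ROOT = ET.fromstring(CATALOG_XML.strip())


def titles_by_author(root, author):
    """Context-aware search: match only text inside <author> elements."""
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("author") == author
    ]


def to_html(root):
    """Render the data for the Web; the look is decided here, not in the data."""
    items = "".join(
        f"<li><b>{escape(book.findtext('title'))}</b>, {escape(book.findtext('author'))}</li>"
        for book in root.iter("book")
    )
    return f"<ul>{items}</ul>"


def to_plain_text(root):
    """Render the same data as plain lines, e.g. for a handheld device."""
    return "\n".join(
        f"{book.findtext('title')} / {book.findtext('author')}"
        for book in root.iter("book")
    )


if __name__ == "__main__":
    print(titles_by_author(ROOT, "Smith"))
    print(to_html(ROOT))
    print(to_plain_text(ROOT))
```

A plain text search for “Smith” would match both records, because the word also appears in a title. The structured search returns only the book whose author is Smith, which is the kind of context-dependent search described in the data-use issues.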

What's the Bottom Line?
Think cost-effective, not just cost. You'll want to enrich your data so that you can extract maximum utility from it long into the future. Take some time to talk to your peers and other experts in the field. Planning is everything, and you won't want to go through multiple conversions.