
 Alphabet Soup

Or, What File Format Should I Really Use?

adapted from The End Sheet, Summer 2000

by John Lynch

There’s no doubt: The Internet has changed the face of publishing, and data standardization has become a major issue.  Whether you’re a small publisher or one of the giants, you’ve most likely produced huge volumes of materials in a multitude of electronic formats.  You may have tried to set standards, but it’s difficult to expect all authors to format files exactly the way you specify.  Who are you to tell that important author that she can’t use her 1985 Macintosh?  Or that Nobel Prize winner that he must use Word 97?

As data reuse and relicensing become critical to success (or even survival), standardization starts to make more and more sense.  Increasingly, publishers are faced with the daunting task of integrating disparate files, produced in a multitude of inconsistent formats, and converting them to formats suitable for Web publishing.  The choices are confusing, and making the right decisions early is critical.

What Are the Data-Use Issues?

Before you begin the physical task of conversion, there are several important issues to address:

1. You must first be able to produce and distribute pages that look exactly the way you’d like them to look, with the fonts you specify and the page integrity you have chosen.

2. The data you produce must be consistent and meet your standards.

3. You must produce a version of that data suitable for derivative uses (e.g., the Internet, handheld devices, audio devices, CDs, etc.).

4. When producing that repurposed data, you must also build in the ability to find information through text searches and more advanced searches that depend on context and “understanding.”

5. In addition, you should consider using portions of that data for different products and for the “back end” of your operation, such as customer service and accounting.

6. Finally, it may be advantageous to produce your materials so that others can use them and so that you can incorporate that information into products belonging to other organizations.
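The difference between a plain text search and a search that depends on context (issue 4) can be illustrated with a minimal Python sketch. The sample document and its tag names (article, author, para) are hypothetical and chosen only for illustration; they are not taken from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged article; the element names are illustrative only.
SAMPLE = """
<article>
  <title>Alphabet Soup</title>
  <author>John Lynch</author>
  <body>
    <para>A reader named Lynch asked which format to use.</para>
  </body>
</article>
""".strip()


def text_search(xml_text, term):
    """Plain text search: True if the term appears anywhere in the raw text."""
    return term in xml_text


def context_search(xml_text, tag, term):
    """Context search: text of every <tag> element that contains the term."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag) if el.text and term in el.text]


print(text_search(SAMPLE, "Lynch"))               # the term occurs somewhere
print(context_search(SAMPLE, "author", "Lynch"))  # only matches in an author element
print(context_search(SAMPLE, "title", "Lynch"))   # no title mentions the term
```

The plain search can only say that "Lynch" occurs somewhere in the file. The context search can distinguish an author named Lynch from a passing mention in the body, which is only possible when the data is tagged by content.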

What Are the File Formats?

Several file formats have emerged as standards in the industry.  You may already use some; others may never make it into your arsenal of tools.

Native Formats

Common word processing or publishing formats in which documents are generally produced.  Word, WordPerfect, Quark, PageMaker, and Excel are native formats.

TIFF (Tagged Image File Format)

Common format for exchanging raster (bitmapped) images between application programs.  Similar to a photographic image of the page, and usually produced by scanning.  Files are flat, which means that text in the image cannot be searched.

PDF (Portable Document Format)

Proprietary print format intended to reproduce documents as originally composed.  Depending on how they’re produced, these files may contain text (at varying degrees of accuracy), or they may be image only.  Requires freely available software, Adobe Acrobat Reader, to view, print, and search.  Not optimized for viewing on-screen.

HTML (Hypertext Markup Language)

Set of “markup tags,” loosely modeled on SGML, specifically intended to support files for display on the World Wide Web.  The markup tells the browser how to display a Web page’s text and images.

SGML (Standard Generalized Markup Language)

Internationally agreed-upon standard for information representation.  Provides a structure for defining document tag sets for a wide variety of applications.  The tag sets allow the appearance and content to be separated so that the information can be reformatted for different uses.

XML (Extensible Markup Language)

Streamlined version of SGML that makes it possible to use and display information in different ways by defining and separating structure and elements.
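As a rough illustration of what separating structure from appearance makes possible, the Python sketch below renders one tagged record in two different ways. The record, its tag names (book, title, author), and the two output styles are assumptions made for this example, not part of any published tag set.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical record: the tags describe what the content is, not how it looks.
ENTRY = "<book><title>Alphabet Soup</title><author>John Lynch</author></book>"


def to_html(xml_text):
    """Render the record as an HTML fragment for a Web page."""
    book = ET.fromstring(xml_text)
    title = escape(book.findtext("title"))
    author = escape(book.findtext("author"))
    return f"<h1>{title}</h1><p>by {author}</p>"


def to_plain_text(xml_text):
    """Render the same record as one line of text, e.g. for a simple listing."""
    book = ET.fromstring(xml_text)
    return f"{book.findtext('title').upper()} by {book.findtext('author')}"


print(to_html(ENTRY))
print(to_plain_text(ENTRY))
```

Because the source says only that a piece of text is a title or an author, each output routine decides for itself how that text should look. Adding another output (a handheld device, say) means writing another small routine, not converting the data again.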

How Do the Different Technologies Stack Up?

Finding the right format requires analysis on several levels.  When choosing a file format to use for your data, keep in mind your target audience, the long-term goals of your organization, and expected future uses.

Here’s a brief analysis of each of the file formats, with pros and cons and typical uses:

 

 

DATA USE ISSUES              NATIVE FORMATS  TIFF       PDF        HTML       SGML       XML
Distributing page images     very good       excellent  excellent  good       good       good
Enforcing standards          limited         none       none       limited    excellent  excellent
Repurposing                  limited         none       limited    limited    excellent  excellent
Searching                    limited         none       limited    good       excellent  excellent
Component reuse              limited         none       none       limited    excellent  excellent
Data interchange             limited         none       none       limited    excellent  excellent
Relative cost per page in $  0.00            0.25-0.75  0.50-3.00  2.00-5.00  2.00-8.00  2.00-6.00
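To get a feel for what the relative cost figures mean for a whole collection, the per-page ranges can be multiplied out. The Python sketch below does this for a hypothetical 10,000-page collection. The page count is an assumption chosen for illustration, and the per-page figures are the relative costs from the comparison above, not current price quotes.

```python
# Relative cost-per-page ranges in dollars, copied from the comparison above.
COST_PER_PAGE = {
    "Native": (0.00, 0.00),
    "TIFF": (0.25, 0.75),
    "PDF": (0.50, 3.00),
    "HTML": (2.00, 5.00),
    "SGML": (2.00, 8.00),
    "XML": (2.00, 6.00),
}


def estimate_cost(fmt, pages):
    """Return the (low, high) cost estimate in dollars for converting `pages` pages."""
    low, high = COST_PER_PAGE[fmt]
    return (low * pages, high * pages)


# Hypothetical collection size, for illustration only.
PAGES = 10_000
for fmt in COST_PER_PAGE:
    low, high = estimate_cost(fmt, PAGES)
    print(f"{fmt:<7} ${low:>10,.2f} to ${high:>10,.2f}")
```

A rough estimate like this is only a starting point. The ratings for repurposing, searching, and reuse determine how much value each dollar buys.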

  

 

Native Formats

   PROS
       No additional investment
       System already in place
       No additional training

   CONS
       Will it continue to be supported?
       Limited enforcement of standards
       Limited repurposing capabilities

   TYPICAL USES
       Documents intended for print only
       Internal documents
       Documents with limited audience

TIFF

   PROS
       Exact representation of pages
       Inexpensive to produce from paper via scanning

   CONS
       Large file sizes
       Not suitable for applications that must be searched
       Cannot reorganize information

   TYPICAL USES
       Images
       Paper document archives
       “Dead” documents

PDF

   PROS
       “Almost exact” representation of page
       Inexpensive to produce

   CONS
       Limited search capability
       Cannot edit or modify resulting files
       Difficult to read on-screen
       Proprietary format
       Large file sizes
       Requires separate software; versions can change and affect display

   TYPICAL USES
       Reference documents that must retain original look
       Documents that would normally be printed

HTML

   PROS
       Designed specifically for the Internet
       Widely supported
       Automatically produced by some software

   CONS
       Limited formatting capabilities
       Tagging relates to look, not content
       Making it “look right” needs tweaking
       Difficult to edit
       A moving standard

   TYPICAL USES
       Internet publishing
       Quick Web applications

SGML

   PROS
       Adaptable to many applications
       Well established
       International standard
       Content searching

   CONS
       Significant investment
       Training issues
       Requires professional support staff
       Must use structured documents

   TYPICAL USES
       Large collections
       Technical documentation
       Materials intended for multipurposing

XML

   PROS
       Robust
       Advantages of SGML but easier to get started
       Can start gradually, adding functionality as you go
       Enhanced features for Internet and e-commerce
       Widely supported
       Many are building tools that may be inexpensive

   CONS
       Tools not yet fully developed
       Additional training for new tools

   TYPICAL USES
       Internet applications
       Catalogs
       “Living” documents

What's the Bottom Line?

Think cost-effective, not just cost.  You'll want to enrich your data so that you can extract maximum utility from it long into the future.  Take some time to talk to your peers and other experts in the field.  Planning is everything, and you won't want to go through multiple conversions.

 