The Real Story on XML
DCL's Don Bridges (pictured) provides the lowdown on XML, looking at how it compares to its predecessors, SGML and HTML, and at what to expect in the near future...



THERE HAS BEEN a tremendous amount of buzz in the last few years about XML and how it will revolutionize how information is used, managed, exchanged, and presented. A 1998 technology report went so far as to say that "XML will revolutionize the exchange of business information similar to the way the phone, fax machine, and photocopier did when those devices were invented."  But talk of revolution is bold talk indeed.

In the next few pages, we will attempt to provide a high-level overview of what XML is, how it compares to its predecessor HTML, and what we see in the near future.  Please remember that predicting the future is a dangerous business (if you don't believe us, just ask your local TV weatherman!). These are our opinions, based on 20 years in the industry (starting long before XML, HTML, or, for that matter, SGML).


Data Formats
Before we get too far, it's probably a good idea to map out the data format landscape, as it exists today.  It's a bit of an "alphabet soup" with lots of acronyms to choose from. 

TIFF

A common format for exchanging raster (bitmapped) images between application programs. It is the equivalent of a photographic image of the page, usually produced through scanning. Identifying metadata may also be attached to the image, but the text appearing in the image is not available for searching.

PDF

Created by Adobe, PDF ("Portable Document Format") is a proprietary print format intended to reproduce documents as originally composed.  Viewing, printing, and searching PDF documents requires the freely available Adobe Acrobat Reader.

HTML

Hypertext Markup Language is the set of "markup" tags, loosely modeled on SGML, and specifically intended to support files for display on the web. This markup tells the web browser how to display a web page's text and images.

SGML

Standard Generalized Markup Language is an internationally agreed standard for information representation. It provides an architecture for defining document tag sets for a wide variety of applications.  The tag sets allow the appearance and text to be separated and reformatted for different uses.

XML

Extensible Markup Language is a streamlined version of SGML which makes it possible to use and display information in different ways by defining its structure and elements.

Data Issues
With that bit of housekeeping behind us, we can look at Data Use issues (which, in the final analysis, are why one format is considered 'better' than another). DCL feels that the requirements should always drive the solution. So we have put together a concise list of six areas on which data formats can be evaluated.  Of course, your organization may feel that some areas are more important than others, but more about that later.  The six areas are:

Distributing Page Image Representations

Ability to distribute and produce an exact page image with exact fonts, composition, and page integrity.

Repurposing

Ability to create new versions of data suitable for derivative uses (e.g., the web, diagnostic equipment, hand-held devices, etc.) or customized applications (e.g., showing one "view" of a data set to a mechanic and a different "view" of the same data set to an operator).

Searching

Ability to find information through text searches and through more advanced (e.g., Boolean) searches that depend on context and "understanding".

Component Re-use

Ability to use portions of data for different products and different documentation sets.  Automation of the data process comes into play here.

Enforce Data Standards

Ability to ensure that information is produced consistently and meets corporate standards.

Interchange with Vendors, Customers, and the World

Ability for others to use your information in their own communications and to incorporate it into products belonging to other organizations.

This is a concise list of data use issues.  (As an aside, if you feel that we are missing one, your feedback is welcome).
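To make the "Searching" item concrete, here is a minimal sketch in Python of the difference between a plain text search and a search that depends on context. The catalog, its second product, and the element names are assumptions made up for illustration (the names are borrowed from the product example later in this article); this is not a description of any particular search tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item catalog, invented for this illustration.
catalog_xml = """<catalog>
<product><model>P266 Laptop</model><dealer>Friendly Computer Shop</dealer><price>$1438</price></product>
<product><model>P300 Desktop</model><dealer>Laptop Warehouse</dealer><price>$999</price></product>
</catalog>"""

catalog = ET.fromstring(catalog_xml)

# Plain text search: the word "Laptop" anywhere in the product's text.
text_hits = [
    p.findtext("model")
    for p in catalog.iter("product")
    if "Laptop" in "".join(p.itertext())
]

# Context search: "Laptop" must be in the model name, and the price under 1500.
context_hits = [
    p.findtext("model")
    for p in catalog.iter("product")
    if "Laptop" in p.findtext("model")
    and int(p.findtext("price").lstrip("$")) < 1500
]

print(text_hits)     # both products: the desktop matches via its dealer name
print(context_hits)  # only the laptop
```

The text search returns the desktop as well, because the word happens to appear in its dealer name. The context search can ask a narrower question only because the markup says which piece of text is the model and which is the price.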

Data "Consumer Reports"
So the natural question is "How do different technologies compare against these issues?"  Glad you asked.  With all due respect to Consumer Reports, we present the results of our "Battle of the Data Formats":

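[Table: the six data use issues (Distributing Page Images, Re-purposing, Searching, Component Reuse, Enforce Standards, Interchange) rated for each data format (TIFF, PDF, HTML, XML, SGML) on a five-step scale: None, Limited, Good, Very Good, Excellent. The graphical rating symbols are not reproduced in this text version.]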

It is critical to emphasize that each organization should only evaluate data formats based on the issues that are important to it.  For instance, if "Distributing Page Image Representations" is the ONLY issue that is important, then PDF is a very good option (maybe the best option).  However, when you look at all of the data issues (most of which ARE important to 'high-tech' companies), you start to understand why there is such a buzz around XML.

But if XML is rated so highly, why is HTML still around?  To understand that question, let's take a closer look at HTML and how it compares to XML.

HTML vs. XML
Both HTML and XML are "mark-up languages", meaning that there are tags applied to impart meaning to the data.

HTML (Hypertext Markup Language) is:

  • A pervasive, well-supported means of describing information for web transmission
  • Limited in structure, reuse, interchange, and automation
  • Based on tags that describe how information should appear

XML (Extensible Markup Language) is:

  • Destined to become the mainstream technology in web applications where high degrees of reuse, interchange, and automation are required
  • Based on tags that are separate from the formatting, which means that the tags tell you what the data means, not how it looks

To illustrate this, let's look at an example of the tagging for HTML vs. XML.

In HTML:

<p>
<b>P266 Laptop</b>
<br>
<i>Friendly Computer Shop</i>
<br>$1438
</p>

In XML:

<product>
<model>P266 Laptop</model>
<dealer>Friendly Computer Shop</dealer>
<price>$1438</price>
</product>

XML typically tells us about the data; HTML tells us about the formatting.
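One practical consequence can be sketched in a few lines of Python, using the standard library's xml.etree.ElementTree module. The XML below is the fragment from this article; the variable names and the idea of turning the price into a number are our own illustration.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the article.
xml_source = """<product>
<model>P266 Laptop</model>
<dealer>Friendly Computer Shop</dealer>
<price>$1438</price>
</product>"""

product = ET.fromstring(xml_source)

# Each value is addressed by what it means, not by where it sits on the page.
record = {child.tag: child.text for child in product}
price = int(record["price"].lstrip("$"))

print(record["model"], "from", record["dealer"], "costs", price)
```

With the HTML version, a program would have to guess that the bold text is the model and that the text after the second line break is the price. A change in formatting would break that guess, while the XML version keeps working.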


SGML is the Foundation
Before there was HTML or XML, there was SGML. SGML became an ISO standard in 1986 (ISO 8879).  SGML has been adopted and implemented by many industries in many applications (DCL performed one of the first large-scale SGML conversions for General Motors in 1986). SGML is rich in syntax and very extensible, and today's markup language implementations (HTML) and variations (XML) owe their usefulness to SGML. But if SGML is the foundation, HTML and XML are the evolutionary applications that are strong, reliable, and cost effective.

This is a result of two main issues that are particularly true of HTML and XML:

  • Content creation and rendering is easy
  • Content management and distribution tools are available and affordable

Today's Markup REQUIREMENTS are driven by content creation, management, and distribution needs, which currently target:

  • Paper
  • Web
  • Custom Applications

Today's Markup ACCEPTANCE is driven by effectiveness and ROI.

XML meets the business need
The reality is that XML is simpler and easier to create and distribute than SGML. Features that are important for web delivery have been retained (elements, attributes, linking, validation), while the least used and most difficult to implement features have been dropped (marked sections, inclusions, exclusions).  In addition, XML is extensible, which means transformation capabilities and data-type standards are inherent to the format.
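As a rough illustration of what validation buys you, the Python sketch below checks a record against a house rule. It is a hand-rolled stand-in, not real DTD or schema validation (the standard library's xml.etree.ElementTree module only checks that a document is well-formed), and the rule itself, the function name check_product, and the sample records are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rule: every <product> has these children, in this order.
REQUIRED_CHILDREN = ["model", "dealer", "price"]

def check_product(xml_text):
    """Return a list of problems; an empty list means the record passes."""
    try:
        product = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser itself rejects markup that is not well-formed.
        return [f"not well-formed: {err}"]
    problems = []
    if product.tag != "product":
        problems.append(f"unexpected root element <{product.tag}>")
    found = [child.tag for child in product]
    if found != REQUIRED_CHILDREN:
        problems.append(f"expected children {REQUIRED_CHILDREN}, found {found}")
    return problems

good = ("<product><model>P266 Laptop</model>"
        "<dealer>Friendly Computer Shop</dealer><price>$1438</price></product>")
missing_price = ("<product><model>P266 Laptop</model>"
                 "<dealer>Friendly Computer Shop</dealer></product>")
unclosed = "<product><model>P266 Laptop</product>"

for sample in (good, missing_price, unclosed):
    print(check_product(sample) or "OK")
```

In a real project the rule would normally live in a DTD or schema and be enforced by a validating parser, so that every tool in the chain applies the same definition.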

So is XML the 'Silver Bullet' for content? Not so fast. 

XML is:

  • Not a print format
  • Not suitable for unstructured information
  • Requires planning

These limitations can make XML a difficult format to migrate to.  This is particularly true of large and/or complex materials that are typically characterized by elaborate tables, equations, cross-referencing, special characters, footnotes, and complex imaging requirements, including hotspots.

Another issue is that there is no single XML standard like there is for HTML. There are several reasons for this, including:

  • Everyone uses data differently
  • Each industry has its issues
  • Not all XML is created equal
  • There will always be new ways to use data
  • Creative approaches lead to a competitive advantage

To take the point further, data models tend to be tuned to internal processes and priorities. Since every company differs in those areas, it's natural that the data models would differ as well.  At the same time, it's important for industries to strive to establish interchange data models, which will be subsets of the internal data models of the participating companies.

So what about SGML?
Does XML replace SGML today? MAYBE!
  • XML is designed for data delivery, not authoring
  • XML simplifies the delivery and rendering of complex data

If you're starting up now, XML is easier to implement, and the tools are pretty much in existence.  At the very least you should make your application XML-ready, meaning that the data should be structured in a manner that will allow it to meet (or almost meet) the restrictions of XML if that is desired in the future.

However, if your project is already in process - e.g. you've already defined a DTD, or are using an industry standard DTD that works for you - there's no reason to change in midstream to XML, as you do get the same benefits, and you've done most of the hard work already.

Also, some DTDs use 'Exclusions' and 'Inclusions' (content-model exceptions that forbid or allow a given element anywhere inside another element, at any depth), and these are not currently allowed in XML (but are allowed in SGML).

Does XML replace SGML in the future? PROBABLY!

  • XML tools will become plentiful, powerful, and cheap
  • XML data structures (schemas) will become standards
  • Web based interfaces will become reliable

So what about HTML?
Does XML replace HTML today?  NO!
  • HTML is easy and free
  • HTML works well for a majority of web users
  • HTML is universal

Does XML replace HTML in the future? YES!

  • Users will expect more from their Web experience
  • Web based interfaces will become reliable

XML will not replace HTML as a formatting language. But XML should and certainly will take the place of HTML as a source language for many types of applications.
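A minimal Python sketch of that division of labor: XML is the source, and HTML is one of several outputs generated from it. The XML is the product fragment from earlier in this article; the function names to_html and to_text and the plain-text layout are assumptions for illustration, and a production system would more likely use a stylesheet language such as XSLT.

```python
import xml.etree.ElementTree as ET
from html import escape

# The product fragment from the article, used here as the source of record.
xml_source = """<product>
<model>P266 Laptop</model>
<dealer>Friendly Computer Shop</dealer>
<price>$1438</price>
</product>"""

def to_html(product):
    """Render the product with formatting tags for display in a browser."""
    model = escape(product.findtext("model"))
    dealer = escape(product.findtext("dealer"))
    price = escape(product.findtext("price"))
    return f"<p>\n<b>{model}</b>\n<br>\n<i>{dealer}</i>\n<br>{price}\n</p>"

def to_text(product):
    """Render the same product as one plain line, e.g. for a small display."""
    fields = (product.findtext(name) for name in ("model", "dealer", "price"))
    return " | ".join(fields)

product = ET.fromstring(xml_source)
html_view = to_html(product)
text_view = to_text(product)

print(html_view)
print(text_view)
```

Both views come from the same source, so a price change is made once in the XML and shows up in every output.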

Conclusion
There is a buzz about XML in the market, and for good reason.  Is it a revolution? Technically, no.  Is it the final answer for data formats forever? History tells us no. But it may revolutionize the way that we share and leverage information.

Clearly, XML is not for everyone.  Each organization has to evaluate the benefits and make a thorough analysis to understand if the business case justifies the expense and effort to migrate to XML. As the technology matures (away from the bleeding edge) and tools become easier, cheaper, and more powerful, the business case will become easier to validate.

Post Script
Data Conversion Laboratory's expertise in SGML and XML is recognized in a variety of forums.  DCL's president, Mark Gross, recently authored the chapter on legacy document conversion to XML for Charles Goldfarb's XML Handbook (Prentice Hall), and is currently authoring the Conversion chapter for Columbia University's The Columbia Guide to Digital Publishing. DCL staff frequently speak on document conversion at leading industry conferences.

You can learn more about XML by going to our Technical Library which is a collection of resources about data conversion and related topics gathered from past issues of DCLnews, various papers and presentations from DCL, and materials available in other places. The Library is in a state of evolution and is being updated frequently - so stop by often.

If you are planning to migrate your data to XML (or are just thinking about it), we would be happy to discuss your project with you, and explain how DCL can help put you on track to getting the most out of XML, by fully integrating all your existing documents and data in the most efficient and cost-effective way possible.

Don Bridges
Account Manager for Technical Documents
Data Conversion Laboratory

You can contact us at (800) 321-2816 x267 or send us an e-mail at sales@dclab.com
Copyright © 1997-2010  Data Conversion Laboratory, Inc. All rights reserved.