Is XML Always the Answer?
Keeping Down your Document Conversion Costs, Part 2

By Mark Gross, Data Conversion Laboratory
Click here for Part 1

Related Reading:

The Real Story on XML
www.dclab.com/realxml.asp

The Business Case for XML
www.dclab.com/businessxml.asp

Conversion to XML: Should I stay with SGML?
www.dclab.com/sgmltoxml.asp

FAQ - Difference between SGML and XML
www.dclab.com/dclfaq.asp#diff

FAQ - If I'm just starting out, should I use SGML or XML?
www.dclab.com/dclfaq.asp#xmlvsgml

Although managing costs is important anytime, it is especially important in today's economic reality where budgets are shrinking drastically. Getting your money's worth as well as what you need to support your data should be a core factor of any data project.

The two biggest cost factors are the type of conversion work you need done and how much of it you'll need. This article focuses on how your goals for your project relate to the output format you choose, and how that format impacts costs. While some outputs, like XML, provide higher capabilities, they also cost more to create. Only you can determine the value of that capability to you.

Is XML, then, the only answer? As the saying goes, when your only tool is a hammer, every problem looks like a nail. If all you have is XML, everything looks like an XML problem. However, there are alternatives worth exploring. Well-known output formats certainly include markup formats like SGML, XML, and HTML, but we'll also discuss some lower-cost alternatives in upcoming articles.

What's a Markup Language and Why Should I Care?

XML, HTML, and SGML are all markup languages, and markup languages are conceptually different from other output formats you might consider. The markup capability of these languages defines the power of the output format, but it is also a major contributor to its cost. Markup is the set of explicit tagging elements that describe something about the information or the text but are not the information or the text itself. For example, you are using markup when you indicate in a word processing document that a section of text should be bolded or italicized. The instruction to bold or to italicize is the markup, as distinguished from the text. This is a direct form of markup and is the easiest to use. You use a few keystrokes to tell the computer that you want something in the document to look bold, and hidden codes stored in the document tell the computer how to display the information in the future. In modern word processors, what you see is what you get.

But while this kind of direct markup is easier to understand, it is inflexible if you later change your mind about how you want the document to look, or if you want to incorporate your materials into someone else's document. For example, if you had decided that your headings needed to look a certain way (bold, centered, and in a particular typeface) and later changed your mind, you'd need to find each occurrence of a heading and change the associated markup. You'd have the same problem if you wanted to combine your document with someone else's and that person had made different assumptions about how headings should look.

XML, on the other hand, makes it possible to define the content of a document separately from its formatting, thereby making it easy to reuse that content in other applications or for other presentation environments. Essentially, XML contains a basic structure & syntax that may be easily shared between several different types of applications.

XML Document Example

<?xml version="1.0"?>
<note>
  <to>Mark</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget to bring home milk!</body>
</note>
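
As a rough illustration of what separating content from formatting makes possible, the Python sketch below parses the note above with the standard library's xml.etree.ElementTree and renders the same content two different ways. The function names (note_fields, render_text, render_html) and the chosen layouts are assumptions made for this example, not part of XML or of any particular product.

```python
import xml.etree.ElementTree as ET
from html import escape

NOTE_XML = """<?xml version="1.0"?>
<note>
  <to>Mark</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget to bring home milk!</body>
</note>"""


def note_fields(xml_text):
    """Return the note's content as a dict of tag name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


def render_text(fields):
    """One presentation: plain text, with the heading in capitals."""
    return "\n".join([
        f"To: {fields['to']}",
        f"From: {fields['from']}",
        fields["heading"].upper(),
        fields["body"],
    ])


def render_html(fields):
    """Another presentation of the same content: a small HTML fragment."""
    return (
        f"<h1>{escape(fields['heading'])}</h1>"
        f"<p><b>{escape(fields['to'])}</b>, {escape(fields['body'])}</p>"
        f"<p><i>{escape(fields['from'])}</i></p>"
    )


fields = note_fields(NOTE_XML)
print(render_text(fields))
print(render_html(fields))
```

The XML itself never changes; only the rendering functions decide how a heading or a recipient should look.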

What are Styles?

The next level of markup is known as indirect markup, exemplified by the "styles" that are used in Microsoft Word and other word processors. Styles allow you to mark each heading as having a "heading" style without indicating the specifics of how it displays.

The characteristics that apply to the "heading" style are defined separately in some kind of style library. These "heading" characteristics are then applied automatically each time a "heading" style is used in the document. And if you change your mind about how headings should look, you can change the style once, in the style definition. This change is then automatically propagated throughout the document. The defined characteristics may include font type and size, justification, color, and whatever else the content system supports.
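
The idea can be sketched in a few lines of Python. This is only a toy model: the STYLES dictionary, the DOCUMENT list, and the resolve function are hypothetical names invented for the illustration, and real word processors store styles in their own formats.

```python
# Hypothetical style library: style name -> display characteristics.
STYLES = {
    "heading": {"bold": True, "align": "center", "font": "Helvetica"},
    "body": {"bold": False, "align": "left", "font": "Times"},
}

# The document records only WHICH style each block uses, not how it looks.
DOCUMENT = [
    ("heading", "Introduction"),
    ("body", "Styles separate naming from appearance."),
    ("heading", "Conclusion"),
]


def resolve(document, styles):
    """Pair each block of text with a copy of its style's characteristics."""
    return [(text, dict(styles[style])) for style, text in document]


before = resolve(DOCUMENT, STYLES)

# One change to the style definition...
STYLES["heading"]["align"] = "left"

# ...reaches every heading the next time the document is rendered.
after = resolve(DOCUMENT, STYLES)
```

Nothing in DOCUMENT was edited, yet both headings pick up the new alignment while the body text is unaffected.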

The style library might contain many different styles that you define yourself, or that are defined centrally. As long as those working with the data are consistent about how they use the styles they choose, it is relatively easy to combine documents from multiple authors with minimal effort.

Why Bother with a Markup Language?

Styling as described above can be done with most modern word processors and desktop publishing systems. So why would you need a more formal markup language? The problem with styles is largely due to their success.

Over time, as people actively use styles in their documents they tend to create new styles to fit new needs, and it's cumbersome to communicate these changes to everyone else. As a result it gets increasingly difficult for people to stay consistent in their use of styles.

Formal markup languages such as SGML, XML, and HTML incorporate tools to assure consistency in using "styles". Without standardization, and the means to enforce standardization, many of the benefits are lost. Much as standardization of manufacturing techniques launched the industrial revolution in the 1800s, standardization of how data is treated launched an information revolution over the past few decades.

The standardization afforded by these markup languages allows the use of software that automatically checks and verifies that the markup is done correctly, allows the data to be loaded into various databases, and allows the results to be distributed through various media including print and the Web, all in a consistent manner.
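
The most basic automated check of this kind is whether a document is well-formed, meaning every tag is properly opened, nested, and closed. A minimal sketch using Python's standard xml.etree.ElementTree parser follows; the helper name is_well_formed and the two sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<note><to>Mark</to><body>Milk</body></note>"
bad = "<note><to>Mark</body></note>"  # <to> is closed with the wrong tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

Well-formedness says nothing about whether the right tags were used in the right places; that stricter check needs a formal definition of the tag set.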

Unlike HTML, which is a standard set of tags, SGML and XML are not tag sets but languages that allow you to define tag sets. This means that with SGML and XML, specialized tag sets can be developed for specific uses and specific industries. Examples of such standardized tag sets include ATA-100 for the aerospace industry, SPL for the pharmaceutical industry, and XBRL for the financial industry. The same mechanism also gives you the flexibility to create specialized proprietary tag sets.

What's the difference between SGML & XML?

SGML (Standard Generalized Markup Language), while not the first markup language, was the first of the three more talked-about markup languages (SGML, XML, and HTML), coming into commercial use in the early 1990s. It is the most robust implementation of the three, with many features that make it suitable for large-scale applications. It is still in wide use, primarily for legacy applications. The downside of SGML is that some of the features designed to handle the more arcane requirements make it unwieldy, and make it more difficult to build software that fully supports it. As a result, little new software is being written for it.

XML (Extensible Markup Language) began as a simpler, more flexible subset of SGML that eliminated some of the more esoteric features that were causing the most difficulty in implementation. Removing these rarely used features made it easier to develop tools to work with XML. The relative ease of implementation, coupled with the growing need to support digital data initiatives across all types of organizations, made it the more popular output format, with most new tools and support systems, like XSLT, being built to support XML rather than SGML. Today it is the markup language of choice for most applications that need a full markup language.

What's a DTD? What's an XSD? Why Should You Want One?

It's also important to understand that neither XML nor SGML is by itself a static collection of styles, or tags. Rather, both are languages that allow you to build your own, very sophisticated definition of allowable tags, called a DTD (Document Type Definition) or, for XML, an XSD (XML Schema Definition). Many DTDs and schemas have already been developed for various industries and applications. Having a DTD or XSD also allows you to verify automatically that all data provided meets the formal definitions: not necessarily that the text is correct, but that it is in the right format and style and meets other definable criteria.
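
To make this concrete, the sketch below shows the kind of rule a DTD for the earlier note example might express, together with a hand-written Python check in the same spirit. Python's standard library does not validate against DTDs, so the conforms function and the REQUIRED_CHILDREN list are stand-ins invented for this illustration; a real project would use a validating parser.

```python
import xml.etree.ElementTree as ET

# A DTD for the note example might declare:
#   <!ELEMENT note (to, from, heading, body)>
#   <!ELEMENT to (#PCDATA)>   (and likewise for from, heading, body)
# The rule is hand-coded below as a simplified stand-in for real validation.
REQUIRED_CHILDREN = ["to", "from", "heading", "body"]


def conforms(xml_text):
    """Check that the root is <note> with exactly the required children, in order."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    if [child.tag for child in root] != REQUIRED_CHILDREN:
        return False
    # (#PCDATA) content: each child holds text only, no nested elements.
    return all(len(child) == 0 for child in root)


valid = (
    "<note><to>Mark</to><from>Jani</from>"
    "<heading>Reminder</heading><body>Bring home milk</body></note>"
)
missing_heading = "<note><to>Mark</to><from>Jani</from><body>Bring home milk</body></note>"

print(conforms(valid))
print(conforms(missing_heading))
```

Both samples are well-formed XML, but only the first follows the agreed structure. The check says nothing about whether the reminder itself is accurate, only that it arrives in the expected shape.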

Where Does HTML Fit In?

HTML (Hypertext Markup Language) was defined as an application (i.e. a specific tag set) of SGML, designed for use on the World Wide Web; its XML-based counterpart is XHTML. HTML contains the tags identified as being needed for proper display of Web pages. HTML has been accepted as an international standard for developing and displaying content on the Web, and consequently is supported by most browsers and other Web tools. While well-suited for the Web, HTML does not contain the tagging and features to support more sophisticated print publishing or other specialized uses. But for its purpose, managing the display and rendering of documents on the Web, it's excellent, well supported, and has become the lingua franca of the World Wide Web.

The Other Alternatives

Because markup languages provide both full text and a higher degree of markup, both of which allow a great deal of flexibility, they sit at the higher end of the price spectrum. These languages also more effectively support how information is organized in databases, how it is rendered on multiple media types, and how the data may be re-edited and reworked over time. However, you may not need all that. There are less expensive options, including simple unmarked text, simple images, "dirty OCR", and various kinds of PDF. The next installment will describe these formats and discuss their pros and cons.

DCLnews Editorial
July 2009

 