Converting to XML not only gives you the ability to publish documents to the Web, print, CD-ROM, and to handheld devices at the click of a button, it also brings very real cost savings ...
In this white paper we look at the benefits of XML and discover how much it costs to get those benefits. We also look at strategies for increasing benefits while at the same time keeping costs down. Plus we touch on PDF -- often viewed as an XML alternative -- and discuss when it is appropriate to use it instead of XML. Note that much of the information in this document applies to SGML, which is the "parent" of XML.
What is XML?
XML (eXtensible Markup Language) is a means of representing text information so that:
- Only standard text (both ASCII and Unicode) is used within a document
- No formatting information is contained in the document. (A Document Type Definition, or DTD, can be set up to allow formatting information to be included in the XML tagging.)
- All document elements are clearly identified (for example, <title>Why XML?</title>).
- The document typically conforms to a predefined template or DTD. Strictly speaking you don't have to use a DTD, but it is highly recommended that you do.
- Mechanisms are provided for linking text within a document to information within the same or other documents. The information being linked can be any XML structure including tables, figures, paragraphs, headings, and so on.
(NOTE: Developers invented an XML sub-technology, or "vocabulary," called XLink. XLink is a more powerful way of linking from one item to another than is possible in the standard XML mechanism. It allows you to link to an arbitrary place in a document. In standard linking, like that found in HTML, you can only link to something if you've got an anchor to it. If you've got a complex document this can mean inserting thousands of anchors -- a laborious task. With XLink you can point anywhere you like without anchors.)
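To make the idea of "clearly identified" elements concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The sample document and its tag names (article, warning, para) are assumptions made up for illustration; only the <title> element comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; apart from <title>, the tag names are
# illustrative assumptions, not part of any real DTD.
SAMPLE = """<article>
  <title>Why XML?</title>
  <warning>Disconnect power before servicing.</warning>
  <para>XML identifies what each piece of text is.</para>
</article>"""


def elements_by_tag(xml_text):
    """Return (tag, text) pairs for each child of the root, in document order."""
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip()) for child in root]


pairs = elements_by_tag(SAMPLE)
for tag, text in pairs:
    # Each piece of text arrives labeled with what it is, not how it looks.
    print(f"{tag}: {text}")
```

Because every element carries its own label, a program can pick out all the warnings or all the titles without guessing from fonts or layout.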
Key benefits of XML
The benefits of using XML as a document representation format are great and apply across all areas of industry. Let's look at what you gain when you adopt XML:
- Content identification - Perhaps the most important aspect of XML is that text elements are identified, not on the basis of what they look like, but on the basis of what they are -- that is, of their significance in the context of a document. The <title> example above illustrates this, but the concept goes well beyond identifying things like titles, captions, or body text. Depending on need, warning paragraphs can be identified, procedures can be identified in terms of who they are applicable to, and assembly parts can be identified. Tags are user-defined for each document set, so different documents can be tagged in different ways.
- Databasing - An XML tagged document can be viewed as fielded text. The fielding makes it possible to break documents down to their component parts to any degree of granularity for storage in a document management system. The documents can then be re-assembled in different ways, and for different audiences, without the need to track multiple document versions. This is particularly important in cases where different audiences may need to see different versions of a document (in the military, for example, you might have a "top security" clearance version and a "standard" version).
In this way, boilerplate text, such as a standard warning, can be stored once for use in many manuals. When the warning text is changed, it is changed once, not each time it appears. Also, the warning will appear the same way each time it appears, thus avoiding the embarrassment of incorrect text.
- Enforced structure - XML documents are composed in accordance with a DTD, or schema, which defines the legal tag set for that document type. It also defines valid and invalid relationships between elements (for example, a <header2> tag might be defined as valid only when it comes after a <header1> tag). This "enforced structure" ensures that documents have uniformity -- even when coming from diverse sources.
- Merging materials from diverse suppliers - The uniform structure and lack of internal formatting make it easy to merge documents into seamless document sets -- even if they are coming in from different facilities. An XML-compliant document management system can track the individual pieces by contributor, if necessary.
- International Standard - XML is an international standard that is maintained by an independent standards committee, which means it enjoys widespread support across industry boundaries and gets extensive support from vendors. Being an international standard also means that there is a wide variety of XML editing, document management, validation, and publishing tools available at a range of price and quality levels.
- Industry standardization - Many industries have adopted standardized XML DTDs to allow documents to be easily exchanged across different areas of industry. In fact, developing inter-industry data exchange standards based on XML is currently the big thing among developers and firms alike (Microsoft's BizTalk is an example). Aside from industries coming up with standard DTDs, many organizations have developed new tag sets to fit their subject field. The newspaper industry, for example, recently came up with its own XML-based markup language, called SportsML, which makes it easier for sports writers and editors to format, store, and publish sports information for newspapers, websites, and other media. Plus there's MathML for mathematics and CML (Chemical Markup Language) for chemistry.
- Platform independent - Because "raw" XML consists only of ASCII and Unicode approved characters (the tags themselves are represented in ASCII), XML data can be moved freely between all hardware and operating system platforms that support these character sets. There are no hardware or operating systems that do not support the ASCII character set and Unicode is now widely supported. The Internet Explorer and Netscape browsers, for example, support it, as do most plain text editors.
- Software independent - As noted, there are a wide variety of XML-compliant tools available from many vendors. Because XML is an independent standard, tool sets can be upgraded or changed without fear of data incompatibility. Furthermore, many of the mainstream and "low-end" tools are becoming XML compliant in response to market demand for support of these formats. Such software includes WordPerfect, FrameMaker+XML and Ventura Publisher, among others. Support for XML is already available to some degree in most of the Office 2000 products. It is supported extensively in Internet Explorer 5 and above, as well as in recent versions of Netscape. What's more, any text editor that supports Unicode can be used to view/edit XML. And the XSL (eXtensible Stylesheet Language) standard will allow you to publish XML material to paper or a website using publicly available software.
- Endurance - Appearance-based text representations are constantly changing -- making conversion costly when migrating from one software package to another or even when upgrading an existing software package. There is also potential for data loss when performing such conversions. XML, however, is a "permanent" representation. Even as the standard evolves, there is no problem upgrading data. If the DTD is carefully selected or designed, a conversion to XML will be the last conversion you'll ever need. In a budget-sensitive environment, this is a very important benefit.
- Repurpose data for different publication media - With XML, formatting is done on a "just in time" basis. As noted, tags identify content, not appearance. Appearance decisions are therefore left until documents are actually published, which means they can easily be modified based on the publication platform. This is a big advantage because what looks good on paper won't necessarily look good on screen, and vice versa. XML makes it easy to develop different stylesheets based on the needs of individual publications. The stylesheets map the tags to a set of formatting directives. Thus the same document can easily be published to paper and to the web -- and be customized for each rendition -- simply by customizing stylesheets. When publishing to paper, <title> can be rendered as Times-Roman, 12 point bold. On the web, titles might look better in a more web-friendly typeface, like Verdana, in a larger size. They would simply be defined that way in the web stylesheet, without the need to change the document at all. Because XML data is well-fielded, it can also be adapted directly for non-traditional publishing outlets such as IETMs (Interactive Electronic Technical Manuals) or for use with field maintenance reference software. This is of particular importance in military applications.
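The "one document, many stylesheets" idea described under repurposing can be sketched in a few lines of Python. This is an illustration only, not XSL: the two dictionaries below are hypothetical stand-ins for a print stylesheet and a web stylesheet, and the web font size (18 point) and the <para> styling are assumptions. Only the Times-Roman 12 point bold and Verdana choices for <title> come from the text above.

```python
import xml.etree.ElementTree as ET

# The document itself carries no formatting, only content tags.
DOC = "<article><title>Why XML?</title><para>Tags identify content.</para></article>"

# Hypothetical "stylesheets": each maps a tag name to formatting directives.
PRINT_STYLE = {
    "title": {"font": "Times-Roman", "size": 12, "bold": True},
    "para": {"font": "Times-Roman", "size": 10, "bold": False},
}
WEB_STYLE = {
    "title": {"font": "Verdana", "size": 18, "bold": True},  # size is an assumption
    "para": {"font": "Verdana", "size": 11, "bold": False},
}


def render(xml_text, stylesheet):
    """Apply a stylesheet to each child of the root; return one line per element."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root:
        style = stylesheet[element.tag]
        weight = " bold" if style["bold"] else ""
        lines.append(f"[{style['font']} {style['size']}pt{weight}] {element.text}")
    return lines


# Same source document, two renditions; DOC is never modified.
print("\n".join(render(DOC, PRINT_STYLE)))
print("\n".join(render(DOC, WEB_STYLE)))
```

Changing how titles look on the web means editing one entry in WEB_STYLE; the document and the print rendition are untouched.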
Costs of XML
- Training - XML is conceptually different from the appearance-based text representations that most people are used to. Thus, document authors and maintainers must be educated about the differences between XML and Microsoft Word, for example. And they need to be made aware of the new requirements implied by XML's benefits.
These are:
- All text must be tagged. This is a requirement most authors aren't accustomed to.
- No formatting is applied at authoring time. This violates the habits of many authors. (Studies show that removing the formatting requirement from authors can dramatically increase their productivity; they focus on writing, not on making a small section of text look "just so".)
- The document structure, as defined by the DTD, must be adhered to. XML authoring tools don't allow the writer to put a <heading2> in front of a <heading1>, even if that seems "OK, just this once."
Nonetheless, today's XML tools support WYSIWYG interfaces, drag-and-drop technology, and the other functions that non-technical computer users already understand. The cost of additional training is more than offset by the benefits of using XML. Not only that, but the cost of training is declining as the learning curve becomes less and less steep.
- Specialized software - This is more a perceived cost than a real one. Today, XML-compliant software is available at all price levels and many mainstream word processing tools support XML. Plus Internet Explorer and Netscape, and other browsers, support it natively. Naturally the high-end tools offer more features, support larger environments, and provide greater benefits than their low-end counterparts. A cost/benefits analysis needs to be done to determine the best tool set for each particular installation. If you do need a high-end solution, this will give you a clear picture of the additional benefits you are getting for your money.
- Legacy conversions - To get the most out of XML, it is important that an enterprise's entire active document set is moved over to an XML environment. It is not enough to adopt XML on a "from here on in" basis. XML conversions can be complex. But they needn't tie up internal human resources. In fact, because expertise and experience are essential for a successful conversion to XML, it is advisable to outsource the conversion to a specialized XML conversion vendor. This is less expensive in the long run than doing conversion in-house. Using a single, experienced XML conversion vendor guarantees you'll get a quality XML end product. Not only will your documents be technically valid, they will also be meaningfully and consistently tagged. Unless documents are professionally tagged, you won't get the benefits of XML.
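To give a feel for what "technically valid" means in practice, the sketch below checks the heading-order rule mentioned under Training. It is a hand-written check in Python, standing in for real DTD validation (which the Python standard library does not provide). The tag names heading1 and heading2 and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def heading_order_errors(xml_text):
    """Return one message for each heading2 that appears before any heading1.

    ET.fromstring raises ParseError if the text is not well-formed XML,
    so well-formedness is checked first, structure second.
    """
    root = ET.fromstring(xml_text)
    errors = []
    seen_heading1 = False
    for position, element in enumerate(root, start=1):
        if element.tag == "heading1":
            seen_heading1 = True
        elif element.tag == "heading2" and not seen_heading1:
            errors.append(f"element {position}: heading2 appears before any heading1")
    return errors


# Hypothetical sample documents.
GOOD = "<doc><heading1>Setup</heading1><heading2>Tools</heading2></doc>"
BAD = "<doc><heading2>Tools</heading2><heading1>Setup</heading1></doc>"

print(heading_order_errors(GOOD))
print(heading_order_errors(BAD))
```

A validating parser driven by a DTD or schema enforces rules like this automatically; the point here is only that the rule is mechanical and checkable, whereas meaningful and consistent tagging still depends on the people doing the conversion.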
PDF: An alternative to XML?
PDF is a proprietary page representation format developed by Adobe Systems. It puts documents in a "container" that preserves not only the text but also the image of the page. PDF can be generated directly by many traditional word processing packages. It can also be generated by scanning paper documents.
PDF does not have any of the content tagging capabilities of XML (except for limited linking). And, although widely accepted, PDF is not a recognized independent standard. PDF files are binary; besides text they may contain images of various types, PostScript, and other binary information. All this is useful, but it means PDF is not as portable as XML.
Furthermore, when PDF is generated from paper, text accuracy is very poor. Although readers may see what appears to be a perfectly usable page, what is actually being displayed is a bitmap image of the page. The text itself, extracted via an OCR process during the PDF conversion, is not directly visible. It is searchable -- but if the accuracy is poor, as is inevitable with uncorrected OCR, the searches will be inaccurate, missing many potentially important "hits" and producing irrelevant hits. Correction is possible, but difficult and expensive -- possibly exceeding the cost of an XML conversion.
PDF files are generally large and unwieldy, especially when the page image is preserved in bitmap form (usually the case when PDF was generated from paper). This means they are difficult to transport over networks or to make available over the web.
Data Conversion Laboratory can and does do PDF conversions where appropriate. We recommend, however, that they be limited to situations where paper is being eliminated for space reasons, and the documents are not frequently accessed, but must be available when required. We recommend XML for "live" data that needs to be frequently accessed, modified, or searched.
Conclusion: Use XML! It's just better ...
DCL has a wide variety of experience converting data from many formats into many formats. Our expertise extends well beyond the domains of SGML/XML, so we don't have an XML axe to grind. But we believe that XML should be the format of choice for all industries that need to manage their "intellectual capital." And we recommend the use of XML in these circumstances. Not because it is legally mandated -- though in many cases it is -- but because it provides the most attractive package of benefits at justifiable cost. The truth is, we often find ourselves saying: "Use XML! It's just better."
DCLnews Editorial