Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 

What's the Story Converting SGML to SGML?

Part One in a Series on SGML to SGML Conversion...

ABSTRACT: As the Internet evolves and information becomes increasingly valuable, so too does the competition for revenues from that information. As publishers and societies seek new ways to generate revenue (and maintain existing sources), repurposing and converting data to SGML and to XML becomes a critical part of the overall strategy. SGML/XML is likely the way to go, but which SGML DTD, or which XML Schema? And how do you go about converting SGML data you already have to a different SGML DTD, or to XML?

One of the questions that I’m often asked is: “If I’ve already paid to have my documents converted to SGML, and now ‘all’ I want to do is convert them to another SGML or XML DTD, shouldn’t that be easy? And shouldn’t it be inexpensive?” These are perfectly logical questions. In theory, if all documents were converted to SGML ‘properly’, then converting from one structured markup to another should be as simple as remapping one set of tags to another. However, the reality can be quite different. Consider DTD design, for example. A DTD designer has considerable latitude in choosing the level of granularity that the tag structure captures, and many factors can affect the design. Real-world issues, such as cost and the need to republish the document on paper or in a new electronic format, can also influence or restrict the designer. When converting from one DTD to another, this means that structural components are often not captured at the same level in the source and the target, and compromises made in the design phase add further to the complexity of the conversion.
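The "simple remapping" case described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tag names in `TAG_MAP` are hypothetical and come from no real DTD, and the input is assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` parser can read it (a true SGML instance would need an SGML-aware parser first).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical source-to-target tag names; neither DTD is from the article.
TAG_MAP = {
    "art": "article",
    "ttl": "title",
    "para": "p",
    "ital": "emphasis",
}

def remap_tags(root: ET.Element, tag_map: Dict[str, str]) -> List[str]:
    """Rename elements in place; return the source tags that had no mapping."""
    unmapped: List[str] = []
    for elem in root.iter():
        if elem.tag in tag_map:
            elem.tag = tag_map[elem.tag]
        elif elem.tag not in unmapped:
            unmapped.append(elem.tag)
    return unmapped

source = (
    "<art><ttl>Sample</ttl>"
    "<para>Some <ital>marked</ital> text.</para>"
    "<ref>Unstructured reference</ref></art>"
)
root = ET.fromstring(source)
unmapped = remap_tags(root, TAG_MAP)
result = ET.tostring(root, encoding="unicode")
print(result)
print("No mapping for:", unmapped)
```

The renaming itself is trivial. The `unmapped` list is where the real work starts: an element such as the hypothetical `ref` above has no one-to-one counterpart when the two DTDs capture structure at different levels of granularity.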

I usually explain to people that one of the basic concepts of structured markup is that you say what something is, not how it looks. In practice, however, implementers face real limits when trying to reach this goal. To help illustrate this, let’s take a look at a real-world issue that we often come across.

Figure 1:  A typical journal reference as it exists on the printed page.

In typical scholarly journal publishing, each article contains a reference section at the end pointing the reader to reference sources used in the article.   These references are typically well structured, and contain information such as the author's name, article title, journal title, page number, date and place of publication, etc.  In an ideal world, when such references are converted to SGML, each reference would be completely decomposed to its component pieces, with all emphasis and punctuation removed  (after all, what we said above is that in SGML, we want to say what something is, not how it looks).  However, this can add significant cost and complexity to the conversion process. Because most of us have to function in the real world (budgets to meet, etc.), some electronic publishers may decide not to bother decomposing these references at all, while other publishers may decompose the references but leave in emphasis and punctuation (this makes the act of republishing the reference easier since the display engine has less work to do).  Yet others may go ‘all the way’ and produce a completely tagged reference (this is SGML at its most granular).

Figure 2:  A partially decomposed SGML instance of the reference in Figure 1.

Figure 3:  A more fully decomposed SGML instance of the same reference.
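To make the idea of decomposing a reference concrete, here is a minimal sketch of going from an untagged reference string to a fully tagged one. Everything in it is an assumption made for illustration: the sample reference is invented (it is not the reference shown in the figures), the citation style "Authors. Title. Journal Year;Volume:Pages." is hypothetical, and so are the element names. Real reference lists mix many styles, so a single pattern like this would cover only a fraction of them.

```python
import re
import xml.etree.ElementTree as ET

# Assumed citation style (hypothetical, not taken from the article's figures):
#   Authors. Title. Journal Year;Volume:Pages.
REF_PATTERN = re.compile(
    r"^(?P<authors>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<journal>.+?)\s+"
    r"(?P<year>\d{4});"
    r"(?P<volume>\d+):"
    r"(?P<pages>\d+(?:-\d+)?)\.?$"
)

def decompose(reference):
    """Return the reference's fields as a dict, or None if the style is not recognized."""
    match = REF_PATTERN.match(reference.strip())
    return match.groupdict() if match else None

def to_element(fields):
    """Build a fully tagged <ref> element with punctuation stripped out."""
    ref = ET.Element("ref")
    for author in fields["authors"].split(", "):
        ET.SubElement(ref, "author").text = author
    for name in ("title", "journal", "year", "volume", "pages"):
        ET.SubElement(ref, name).text = fields[name]
    return ref

flat = "Smith J, Jones K. A study of markup. Journal of Examples 1999;12:34-56."
fields = decompose(flat)
tagged = ET.tostring(to_element(fields), encoding="unicode")
print(tagged)

# A reference that does not follow the assumed style is not guessed at.
unparsed = decompose("Doe A. Untitled manuscript, in press.")
print(unparsed)
```

Returning `None` for anything the pattern does not recognize is a deliberate choice in this sketch: a reference that cannot be decomposed with confidence is better routed to a person than tagged incorrectly.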

Herein lies the problem. Suppose someone already has an SGML representation of a journal in which the references have NOT been decomposed. Now they want to license the journal to someone else, and the new DTD requires that the references be fully decomposed. A substantial amount of engineering work is needed to accomplish this. Similarly, if the source SGML is fully decomposed into a set of markup tags and the target DTD does not support this, then the conversion procedure must reinsert into the target document all of the punctuation and emphasis that was removed from the source SGML. Effectively, the conversion software is required to ‘play’ composition engine. This can get particularly complex if a number of tags are involved, because the conversion software and process must be engineered to deal with many different combinations of tags that may appear in a variety of sequences.
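The opposite direction, where the conversion software has to act as a composition engine, might look like the sketch below. It is a simplified illustration, not a description of any production converter: the element names and the punctuation rules ("Authors. Title. Journal Year;Volume:Pages.") are assumptions, the input is assumed to be well-formed XML, and emphasis is left out entirely. Even so, each optional element adds a branch, which hints at how quickly the combinations multiply.

```python
import xml.etree.ElementTree as ET

def with_period(text):
    """Add a closing period unless the text already ends in terminal punctuation."""
    return text if text[-1] in ".?!" else text + "."

def compose_reference(ref):
    """Rebuild a display string from a fully tagged (hypothetical) <ref> element."""
    def field(name):
        text = ref.findtext(name)
        return text.strip() if text and text.strip() else None

    parts = []
    authors = [a.text.strip() for a in ref.findall("author") if a.text]
    if authors:
        parts.append(with_period(", ".join(authors)))
    title = field("title")
    if title:
        parts.append(with_period(title))
    journal = field("journal")
    if journal:
        parts.append(journal)

    # Year, volume, and pages are each optional, so the separators
    # depend on which of the neighboring pieces are actually present.
    citation = field("year") or ""
    volume, pages = field("volume"), field("pages")
    if volume:
        citation += (";" if citation else "") + volume
    if pages:
        citation += (":" if citation else "") + pages
    if citation:
        parts.append(with_period(citation))
    elif journal:
        parts[-1] = with_period(parts[-1])
    return " ".join(parts)

full = ET.fromstring(
    "<ref><author>Smith J</author><author>Jones K</author>"
    "<title>A study of markup</title><journal>Journal of Examples</journal>"
    "<year>1999</year><volume>12</volume><pages>34-56</pages></ref>"
)
sparse = ET.fromstring(
    "<ref><author>Doe A</author><title>Is markup enough?</title>"
    "<journal>Journal of Examples</journal><year>2001</year></ref>"
)
full_text = compose_reference(full)
sparse_text = compose_reference(sparse)
print(full_text)
print(sparse_text)
```

The second sample shows two of the special cases at once: a title that already ends in a question mark must not receive a period, and a reference with no volume or pages must not be left with dangling separators.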

Decomposing references is just one of the challenges involved in SGML to SGML conversion. While there are many other similar issues, the above example should help illustrate the point that while these conversions are doable, they are far from trivial. Depending on the complexity of the conversion, manual editorial review is often required following the automated conversion processing, because unexpected input can occasionally produce unacceptable results (e.g. punctuation or a space in the wrong place). In the follow-up article, we’ll discuss some steps that can be taken up front to minimize the pain and potentially reduce the costs involved in doing SGML to SGML/XML conversion.
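One way to focus the manual review mentioned above is to have the software flag suspicious output automatically. The following is a rough sketch of such a check; the four rules and their labels are hypothetical heuristics chosen for illustration, and they will raise false alarms on legitimate text (an abbreviation such as "e.g." trips the missing-space rule), so they can narrow the editor's workload but not replace the editor.

```python
import re

# Hypothetical heuristics for the kinds of defects mentioned in the text:
# punctuation or a space in the wrong place.
CHECKS = [
    ("space before punctuation", re.compile(r"\s[.,;:]")),
    ("doubled punctuation", re.compile(r"([.,;:])\1")),
    ("missing space after punctuation", re.compile(r"[.,](?=[A-Za-z])")),
    ("repeated spaces", re.compile(r" {2,}")),
]

def review_flags(text):
    """Return the label of every check that the text fails, in CHECKS order."""
    return [label for label, pattern in CHECKS if pattern.search(text)]

clean = "Smith J, Jones K. A study of markup. Journal of Examples 1999;12:34-56."
broken = "Smith J ,Jones K.. A study of markup.Journal of Examples  1999."

clean_flags = review_flags(clean)
broken_flags = review_flags(broken)
print(clean_flags)
print(broken_flags)
```

References that come back with an empty list can pass straight through, while anything flagged goes into the queue for editorial review.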

Michael Gross
Director of Research & Development
Data Conversion Laboratory
Phone: 718-357-8700 x 236
Fax: 718-357-8776
mikegross@dclab.com

 


Click here to see the follow-up "What's The Story?" article on Converting SGML to SGML

Click here to see the previous "What's The Story?" article - Converting Quark to XML.

  


Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.