What's the Story Converting SGML to SGML?
Part One in a Series on SGML to SGML Conversion...
One of the questions that I’m often asked is: “If I’ve already paid to have my documents converted to SGML, and now ‘all’ I want to do is convert them to another SGML or XML DTD, shouldn’t that be easy? And shouldn’t it be inexpensive?”

These are perfectly logical questions. In theory, if all documents were converted to SGML ‘properly’, converting from one structured markup to another should be as simple as remapping one set of tags to another. The reality, however, can be quite different.

Consider DTD design, for example. A DTD designer has considerable latitude in choosing the level of granularity that the tag structure captures, and many factors can affect the design. Real-world issues, such as cost and the need to republish the document on paper or in a new electronic format, can also influence or restrict the designer. For a conversion from one DTD to another, this means that structural components are often not present at a consistent level of granularity, and this, combined with compromises made in the design phase, makes the conversion considerably more complex.

I usually explain to people that one of the basic concepts of structured markup is that you say what something is, not how it looks. In practice, however, implementers face limits when trying to reach this goal. To help illustrate this, let’s take a look at a real-world issue that we often come across.
Figure 1: A typical journal reference as it exists on the printed page.

In typical scholarly journal publishing, each article ends with a reference section that points the reader to the sources used in the article. These references are typically well structured and contain information such as the author's name, article title, journal title, page number, and date and place of publication. In an ideal world, when such references are converted to SGML, each reference would be completely decomposed into its component pieces, with all emphasis and punctuation removed (after all, as we said above, in SGML we want to say what something is, not how it looks).

However, this can add significant cost and complexity to the conversion process. Because most of us have to function in the real world (budgets to meet, etc.), publishers make different choices. Some electronic publishers decide not to decompose these references at all. Others decompose the references but leave in emphasis and punctuation, which makes republishing the reference easier because the display engine has less work to do. Yet others go ‘all the way’ and produce a completely tagged reference (this is SGML at its most granular).
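To make these three levels concrete, here is a minimal Python sketch. It marks up one made-up reference in three ways and runs a naive tag-renaming pass over each. The reference, the element names (`ref`, `au`, `atl`, `citation`, and so on), and the helper functions are all hypothetical. None of them come from a real DTD or from the figures in this article. The regular expression handles only simple tags without attributes, and it ignores SGML features such as tag minimization.

```python
import re

# One made-up reference at three levels of granularity (hypothetical tags).
UNDECOMPOSED = ("<ref>Doe, J. (1999). A sample title. "
                "<i>Journal of Examples</i>, 12, 34-56.</ref>")
PARTIAL = ("<ref><au>Doe, J.</au> (<yr>1999</yr>). <atl>A sample title</atl>. "
           "<i><jtl>Journal of Examples</jtl></i>, <vol>12</vol>, "
           "<pp>34-56</pp>.</ref>")
FULL = ("<ref><au><snm>Doe</snm><fnm>J</fnm></au><yr>1999</yr>"
        "<atl>A sample title</atl><jtl>Journal of Examples</jtl>"
        "<vol>12</vol><pp>34-56</pp></ref>")

# Hypothetical mapping from source element names to target element names.
TAG_MAP = {
    "ref": "citation", "au": "author", "snm": "surname", "fnm": "given-names",
    "yr": "year", "atl": "article-title", "jtl": "source",
    "vol": "volume", "pp": "pages",
}

# Matches simple start and end tags only (no attributes).
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def remap_tags(markup, tag_map):
    """Rename start and end tags; names missing from the map are kept."""
    def rename(match):
        slash, name = match.group(1), match.group(2)
        return "<%s%s>" % (slash, tag_map.get(name, name))
    return TAG_RE.sub(rename, markup)


def element_names(markup):
    """Return the set of element names that have a start tag in the markup."""
    return {m.group(2) for m in TAG_RE.finditer(markup) if not m.group(1)}


if __name__ == "__main__":
    for label, source in (("undecomposed", UNDECOMPOSED),
                          ("partial", PARTIAL),
                          ("full", FULL)):
        print(label, sorted(element_names(remap_tags(source, TAG_MAP))))
```

Renaming tags works whenever the source and target capture the same pieces. Applied to the undecomposed instance, though, it yields only a renamed wrapper around the same flat text. Renaming cannot create an `author` or `year` element that was never tagged in the source.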
Figure 2: A partially decomposed SGML instance of the reference in Figure 1.
Figure 3: A more fully decomposed SGML instance of the same reference.

Herein can lie the problem. Suppose someone already has an SGML representation of a journal in which the references have NOT been decomposed. Now they want to license the journal to someone else, and the new DTD requires that the references be fully decomposed. A substantial amount of engineering work is needed to accomplish this.

Similarly, if the source SGML is fully decomposed into a set of markup tags and the target DTD does not support this, then the conversion procedure must reinsert into the target document all of the punctuation and emphasis that were removed from the source SGML. Effectively, the conversion software is required to ‘play’ composition engine. This can get particularly complex if a number of tags are involved, because the conversion software and process must be engineered to deal with many different combinations of tags that may appear in a variety of sequences.

Decomposing references is just one of the challenges involved in SGML to SGML conversion. While there are many other similar issues, the above example should help illustrate the point that while these conversions are doable, they are far from trivial. Depending on the complexity of the conversion, manual editorial review is often required following the automated conversion processing, because unexpected input can occasionally produce unacceptable results (e.g. punctuation or a space in the wrong place).

In the follow-up article, we’ll discuss some steps that can be taken up front to minimize the pain and potentially reduce the costs involved in doing SGML to SGML/XML conversion.

Michael Gross
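As a rough illustration of the ‘composition engine’ problem discussed in this article, the Python sketch below rebuilds a display string from a fully decomposed reference. The field names, the punctuation style, and the `recompose` function are assumptions made for this example. They do not describe any particular DTD or conversion tool. Every optional field adds a branch, which is why the number of cases grows quickly as more tags are involved.

```python
def close_sentence(text):
    """Add a final period unless the text already ends a sentence."""
    return text if text.endswith((".", "?", "!")) else text + "."


def recompose(fields):
    """Rebuild punctuation and emphasis for one decomposed reference.

    `fields` maps hypothetical element names to their text content.
    Any field may be missing, so each piece must decide its own
    punctuation based on what else is present.
    """
    pieces = []

    if "surname" in fields:
        name = fields["surname"]
        if "initials" in fields:
            name += ", " + fields["initials"]
        # With no year to follow, the author must close its own sentence.
        if "year" not in fields:
            name = close_sentence(name)
        pieces.append(name)

    if "year" in fields:
        pieces.append("(" + fields["year"] + ").")

    if "article-title" in fields:
        pieces.append(close_sentence(fields["article-title"]))

    # Journal title (italicized), volume, and pages share one sentence,
    # separated by commas.
    tail = []
    if "journal" in fields:
        tail.append("<i>" + fields["journal"] + "</i>")
    for key in ("volume", "pages"):
        if key in fields:
            tail.append(fields[key])
    if tail:
        pieces.append(close_sentence(", ".join(tail)))

    return " ".join(pieces)


if __name__ == "__main__":
    # Made-up reference data, for illustration only.
    print(recompose({
        "surname": "Doe", "initials": "J.", "year": "1999",
        "article-title": "A sample title",
        "journal": "Journal of Examples", "volume": "12", "pages": "34-56",
    }))
```

Even this small sketch needs special handling for a missing year and for titles that already end in punctuation. Input that falls outside the anticipated combinations is where a stray period or space tends to appear, so it is consistent with the editorial review step mentioned above.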
Data Conversion Laboratory, Inc. 61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365 718-357-8700 convert@dclab.com Copyright © 1997-2008 Data Conversion Laboratory, Inc. All rights reserved.