SGML Conversion: Issues in Converting SGML to a New DTD
Part One in a Series on SGML to SGML Conversion
ABSTRACT: As the Internet evolves and information becomes increasingly valuable, competition for revenues from that information intensifies. As publishers and societies seek new ways to generate revenue (and maintain existing revenue), repurposing and converting data to SGML and to XML becomes a critical part of the overall strategy. SGML or XML is likely the way to go, but which SGML DTD, or which XML Schema? And how do you go about converting SGML data you already have to a different SGML DTD, or to XML?
One of the questions that I’m often asked is: “If I’ve already paid to have my documents converted to SGML, and now ‘all’ I want to do is convert them to another SGML or XML DTD, shouldn’t that be easy? And shouldn’t it be inexpensive?” These are perfectly logical questions. In theory, if all documents were converted to SGML ‘properly’, then the process of converting from one structured markup to another should be as simple as remapping one set of tags to another.
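To make the ‘in theory’ case concrete, here is a minimal sketch of a pure tag remapping in Python. The tag names in `TAG_MAP` are hypothetical and do not come from any particular DTD, and the regular expression only handles simple start and end tags; it is not an SGML parser.

```python
import re

# Hypothetical source-to-target tag names, chosen only for illustration.
TAG_MAP = {"au": "author", "atl": "article-title", "jtl": "journal-title"}

# Matches a simple start or end tag: optional slash, name, any attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^>]*)>")


def remap_tags(sgml, tag_map):
    """Rename tags one-for-one, leaving content and attributes untouched."""
    def rename(match):
        slash, name, rest = match.groups()
        # Tags without a mapping are passed through unchanged.
        return "<{}{}{}>".format(slash, tag_map.get(name.lower(), name), rest)

    return TAG_RE.sub(rename, sgml)


source = "<au>Smith J</au><atl>Structured markup</atl>"
print(remap_tags(source, TAG_MAP))
```

A one-for-one rename like this only works when both DTDs capture the same components at the same level of detail.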
However, the reality can be quite different. Consider DTD design, for example. A DTD designer has considerable latitude in choosing the level of granularity that the tag structure captures, and many factors can affect that choice. Real-world issues, such as cost and the need to republish the document on paper or in a new electronic format, can also influence or restrict the designer. What this means, when converting from one DTD to another, is that structural components are often not captured at the same level in both DTDs. This mismatch, combined with compromises made in the design phase, makes the conversion considerably more complex.
I usually explain to people that one of the basic concepts of structured markup is that you say what something is, not how it looks. In practice, however, implementers face real limits when trying to reach this goal. To help illustrate this, let’s take a look at a real-world issue that we often come across.

Figure 1: A typical journal reference as it exists on the printed page.
In typical scholarly journal publishing, each article contains a reference section at the end pointing the reader to the sources used in the article. These references are typically well structured, and contain information such as the author's name, article title, journal title, page number, date and place of publication, etc. In an ideal world, when such references are converted to SGML, each reference would be completely decomposed into its component pieces, with all emphasis and punctuation removed (after all, what we said above is that in SGML, we want to say what something is, not how it looks). However, this can add significant cost and complexity to the conversion process. Because most of us have to function in the real world (budgets to meet, etc.), some electronic publishers may decide not to bother decomposing these references at all, while other publishers may decompose the references but leave in emphasis and punctuation (this makes republishing the reference easier, since the display engine has less work to do). Yet others may go ‘all the way’ and produce a completely tagged reference (this is SGML at its most granular).

Figure 2: A partially decomposed SGML instance of the reference in Figure 1.

Figure 3: A more fully decomposed SGML instance of the same reference.
Herein lies the problem. Suppose someone already has an SGML representation of a journal, and the references have NOT been decomposed. Now they want to license the journal to someone else, and the new DTD requires that the references be fully decomposed. A substantial amount of engineering work needs to be done to accomplish this.
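As a rough illustration of that engineering work, the Python sketch below decomposes an untagged reference by pattern matching. Both the house style it expects ("Authors. Title. Journal Year; Volume: First-Last.") and the tag names are assumptions made up for this example; a real journal would need many more patterns, and anything that fails to match has to go to a person.

```python
import re

# Assumed house style: "Authors. Title. Journal Year; Volume: First-Last."
REF_RE = re.compile(
    r"^(?P<authors>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<journal>.+?)\s+"
    r"(?P<year>\d{4});\s*"
    r"(?P<volume>\d+):\s*"
    r"(?P<fpage>\d+)-(?P<lpage>\d+)\.?$"
)


def decompose(ref_text):
    """Return a fully tagged reference, or None if the pattern does not fit."""
    match = REF_RE.match(ref_text.strip())
    if match is None:
        return None  # unexpected style: route to manual review
    p = match.groupdict()
    # Hypothetical tag names; the punctuation is dropped, not carried over.
    authors = "".join(
        "<au>{}</au>".format(name.strip()) for name in p["authors"].split(",")
    )
    return (
        "<ref>" + authors
        + "<atl>" + p["title"].strip() + "</atl>"
        + "<jtl>" + p["journal"] + "</jtl>"
        + "<yr>" + p["year"] + "</yr>"
        + "<vol>" + p["volume"] + "</vol>"
        + "<fpg>" + p["fpage"] + "</fpg>"
        + "<lpg>" + p["lpage"] + "</lpg>"
        + "</ref>"
    )


print(decompose(
    "Smith J, Jones A. Converting structured markup. J Doc Eng 1998; 12: 345-350."
))
print(decompose("not a reference"))
```

Even this toy pattern breaks as soon as an author's initials carry periods or a title contains one, which hints at how many variations production code has to anticipate.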
Similarly, if the source SGML representation is fully decomposed into a set of markup tags, and the target DTD does not support this, then the conversion procedure must reinsert into the target document all of the punctuation and emphasis that was removed from the source SGML. Effectively, this means that the conversion software is required to ‘play’ composition engine. This can get particularly complex if there are a number of tags involved, because the conversion software and process must be engineered to deal with many different combinations of tags that may appear in a variety of sequences.
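The sketch below shows, in Python, what ‘playing’ composition engine can look like for one simple case. The component names (`au`, `atl`, `jtl`, and so on), the `<it>` emphasis tag, and the punctuation style are all assumptions for illustration; note how each separator depends on which neighboring components are actually present.

```python
def compose(ref):
    """Rebuild a display string from a decomposed reference.

    `ref` is a dict of hypothetical components; any of them may be missing.
    """
    pieces = []
    if ref.get("au"):
        pieces.append(", ".join(ref["au"]) + ".")
    if ref.get("atl"):
        pieces.append(ref["atl"] + ".")
    if ref.get("jtl"):
        # Emphasis removed from the source has to be put back as well.
        pieces.append("<it>" + ref["jtl"] + "</it>")

    # Year, volume, and pages share separators that depend on what precedes.
    cite = ref.get("yr", "")
    if ref.get("vol"):
        cite += (";" if cite else "") + ref["vol"]
    if ref.get("fpg"):
        pages = ref["fpg"] + ("-" + ref["lpg"] if ref.get("lpg") else "")
        cite += (":" if cite else "") + pages
    if cite:
        pieces.append(cite + ".")
    return " ".join(pieces)


full = {
    "au": ["Smith J", "Jones A"],
    "atl": "Converting structured markup",
    "jtl": "J Doc Eng",
    "yr": "1998", "vol": "12", "fpg": "345", "lpg": "350",
}
print(compose(full))

# Same code, fewer components: the separators must still come out right.
print(compose({"au": ["Smith J"], "jtl": "J Doc Eng", "yr": "1998", "fpg": "345"}))
```

Every optional component adds branches like these, and a real DTD may also allow the components to appear in more than one order.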
Decomposing references is just one of the challenges involved in SGML to SGML conversion. While there are many other similar issues, the above example should help illustrate the point that while these conversions are doable, they are certainly not anything close to trivial. Depending on the complexity of the conversion, manual editorial review is often required following the automated conversion processing. This is because unexpected input can occasionally produce unacceptable results (e.g. punctuation or a space in the wrong place).
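One way to focus that manual review is to have the software flag suspicious output automatically. The Python sketch below is only a heuristic under assumed rules: the check names and patterns are invented for this example, and they will raise false alarms on legitimate text such as an ellipsis.

```python
import re

# Hypothetical heuristics for output that an editor should look at.
CHECKS = {
    "space before punctuation": re.compile(r"\s[,.;:]"),
    "doubled punctuation": re.compile(r"[,.;:]{2,}"),
    "repeated spaces": re.compile(r" {2,}"),
    "empty element": re.compile(r"<(\w+)>\s*</\1>"),
}


def flag_for_review(text):
    """Return the names of all checks that the converted text trips."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]


print(flag_for_review("Smith J , Jones A..  Title"))
print(flag_for_review("Smith J, Jones A. Title. J Doc Eng 1998;12:345-350."))
```

Checks like these do not replace editorial review; they only help point the reviewer at the records most likely to need attention.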
In the follow-up article, we’ll discuss some steps that can be taken up front to minimize the pain and potentially reduce the costs involved in doing SGML to SGML/XML conversion.
Michael Gross
Director of Research & Development
Data Conversion Laboratory
Phone: 718-357-8700 x 236
Fax: 718-357-8776
mikegross@dclab.com