Part Two in a Series on SGML to SGML ...
ABSTRACT: As the Internet evolves and information becomes increasingly
valuable, so too does the competition for revenues from that information.
As publishers and societies seek new ways to generate revenue (and
maintain existing streams), repurposing data for licensing becomes
a critical part of the overall strategy.
To that end, SGML/XML is the way to go, but which SGML?
And what about multiple SGMLs?
In our last article, we discussed some of the problems
that typically occur in SGML to SGML conversions, making them more
complex. In this article, we’d like to discuss some ways you can set
up your initial DTD so as to make it easier to port your content to
other DTDs later.
1) Try not to dispose of text in the source
material, because you may need to put it back – This is absolutely
critical, but we’ve found it to be a recurring issue. As we discussed
previously, the level at which a paragraph is decomposed into a series
of content tags can vary greatly from DTD to DTD. For instance, if we
take the example of bibliographic entries at the end of a scholarly
journal article, one DTD may require these entries to be completely
decomposed into author, article title, journal title, volume, page
range, etc., disposing of the punctuation that exists between
components, while another may require that just the authors’ names and
article title be decomposed, leaving the rest of the entry as it
appeared in hardcopy. From a structured markup perspective, disposing of
the punctuation makes sense (since it contains ‘how it looks’, not
‘what it is’ information). In reality, however, you may need to
convert to a less structured output that requires these entries to be
fully punctuated. Therefore, if your initial DTD required that this
punctuation be removed, you’ll have to write considerably more
complicated code in order to convert from the first to the second DTD.
The solution is quite simple, but current publishing /
markup practices don’t allow for it. Basically, it makes sense to
preserve the punctuation inside of special tagging. This way, it’s
there if you need it, and it can be ignored if you don’t. Take another
example: you’ve got a cross-reference to a figure. The source file
might contain text like “see Fig. 5” or “see Figure 5”. But
some DTDs require the actual ‘Fig.’ or ‘Figure’ text to be
removed. Here again, if you can hold this information inside its own
tag, it will be there to put back if you’re converting to a DTD that
requires it.
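As a minimal sketch of this idea, assume a hypothetical source DTD that keeps
the punctuation between bibliographic components inside a <punct> element.
The element names (bib, author, atitle, jtitle, vol, pages, punct) are
illustrative only and do not come from any real DTD, and the sample is written
as well-formed XML so that Python's standard parser can read it. With the
punctuation preserved, both kinds of target become simple filters:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical source markup: punctuation is kept, but isolated in <punct>.
SOURCE = (
    "<bib>"
    "<author>Smith J</author><punct>. </punct>"
    "<atitle>Markup in practice</atitle><punct>. </punct>"
    "<jtitle>J Doc</jtitle><punct> </punct>"
    "<vol>12</vol><punct>:</punct>"
    "<pages>34-56</pages><punct>.</punct>"
    "</bib>"
)


def to_structured(bib):
    """Target A: fully decomposed entry, punctuation discarded."""
    out = ET.Element("bib")
    for child in bib:
        if child.tag != "punct":
            ET.SubElement(out, child.tag).text = child.text
    return ET.tostring(out, encoding="unicode")


def to_punctuated(bib):
    """Target B: only author and article title stay tagged; the rest
    of the entry is emitted as plain, fully punctuated text."""
    keep = {"author", "atitle"}
    parts = []
    for child in bib:
        text = escape(child.text or "")
        if child.tag in keep:
            parts.append(f"<{child.tag}>{text}</{child.tag}>")
        else:
            parts.append(text)
    return "<bib>" + "".join(parts) + "</bib>"


bib = ET.fromstring(SOURCE)
print(to_structured(bib))
print(to_punctuated(bib))
```

Had the source discarded the punctuation, the second function would need
per-style rules to regenerate it, which is the extra conversion code
described above.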
2) Avoid nontraditional approaches – Because
you may eventually need to go to another DTD that does not offer as much
flexibility as the one you’re currently designing, try to avoid
approaches that may be difficult to implement in other applications.
Here are two simple examples, related to special
characters. The first is an approach that we’ve seen used to apply
tagging around a character in order to produce a diacritical above it,
such as <acute>A</acute>, instead of using standard ISO
character entities where possible, such as &Aacute; (Á). Although this
tagging approach does allow you to produce more types of diacritical
characters, it is likely that a target DTD will not allow an equivalent,
making the conversion task much more difficult. Another example is one in
which attributes are being used to store actual document text. For example:
<chapter title="Introduction">. In these situations, special
emphasis or formatting is difficult, and we’ve seen DTDs which
overcome this by ‘inventing’ emphasis character entities, such as
<chapter title="Purifying H&sub;2&esub;O">.
Here, &sub; and &esub; are making use of
character entities in a nonstandard way, again adding complexity to
conversion to other DTDs.
To get maximum utility from your data,
these types of workaround approaches are best avoided.
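To illustrate the extra work that accent tagging creates, here is a rough
sketch of a converter that tries to turn <acute>A</acute>-style markup back
into a standard character entity. The tag names other than acute, and the use
of Python's HTML entity table as a stand-in for the ISO entity sets, are
assumptions made for the example. Note that some combinations have no standard
equivalent at all, so they have to be left for manual review:

```python
import re
import unicodedata
from html.entities import codepoint2name

# Hypothetical accent tags mapped to Unicode combining marks.
COMBINING = {
    "acute": "\u0301",
    "grave": "\u0300",
    "uml": "\u0308",
}

ACCENT_TAG = re.compile(r"<(acute|grave|uml)>(.)</\1>")


def resolve_accent_tags(text):
    """Replace accent tagging with a named entity where one exists."""

    def repl(match):
        tag, base = match.group(1), match.group(2)
        composed = unicodedata.normalize("NFC", base + COMBINING[tag])
        if len(composed) != 1:
            # No precomposed character exists: keep the original tagging.
            return match.group(0)
        name = codepoint2name.get(ord(composed))
        # Assumption: HTML entity names stand in for the ISO entity sets.
        return f"&{name};" if name else match.group(0)

    return ACCENT_TAG.sub(repl, text)


print(resolve_accent_tags("<acute>A</acute>vila"))   # &Aacute;vila
print(resolve_accent_tags("<acute>q</acute>"))       # left unchanged
```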
3) Try to anticipate how other DTDs may require
tagging – If possible, examine other DTDs and approaches before you
finalize your own. This will help make the later conversion
much easier.
For example, some DTDs use empty cross-reference
tagging in SGML (id/idref usage), such as: see <xref id="S35">Rules
and Regulations.
In this case, the tagging will cause the display of a
point that one can click on to follow the cross-reference link. Here, the
text around the hyperlink will not be highlighted (which may
satisfy the requirements of the source DTD). However, if you examine
other DTDs or consider other approaches, you may be able to anticipate
cross-reference linking approaches that require tagging around
the whole piece of text to be cross-referenced (i.e., See <xref id="S35">Rules
and Regulations</xref>).
In this scenario, the source SGML
tagging uses the first approach (an empty <xref> tag), leaving the
conversion software ‘guessing’ how far the <xref> text should
extend. Significant coding effort will have to go into optimizing
that guess, and we know from
experience that some percentage of the time, the heuristic used for the
guessing will be wrong. Therefore, we’ll probably have to do
considerable manual review of the converted tagging. Clearly, it’s
better to plan for this up front, if possible. Doing extra tagging in
advance will probably require significantly less effort than doing it
manually later.
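The asymmetry between the two directions can be sketched in a few lines. The
heuristic below (link text runs over capitalized words plus "and" and "of") is
purely hypothetical and is here only to show how such a guess can go wrong.
It is not a recommended rule:

```python
import re

WRAPPED = re.compile(r'<xref id="([^"]+)">(.*?)</xref>', re.DOTALL)

# Hypothetical guess: a run of capitalized words, allowing "and" / "of".
WORD = r"(?:[A-Z]\w*|and\b|of\b)"
EMPTY_WITH_GUESS = re.compile(
    r'<xref id="([^"]+)">(' + WORD + r"(?:\s+" + WORD + r")*)"
)


def wrapped_to_empty(text):
    """Down-convert: keep the link point, drop the end tag. No guessing."""
    return WRAPPED.sub(r'<xref id="\1">\2', text)


def empty_to_wrapped(text):
    """Up-convert: guess where the cross-referenced text ends."""
    return EMPTY_WITH_GUESS.sub(r'<xref id="\1">\2</xref>', text)


# Going from wrapped to empty is mechanical.
print(wrapped_to_empty('See <xref id="S35">Rules and Regulations</xref>.'))

# Going from empty to wrapped works when the guess happens to fit...
print(empty_to_wrapped('see <xref id="S35">Rules and Regulations.'))

# ...and overruns when the following word is also capitalized.
print(empty_to_wrapped('Under <xref id="S35">Rules and Regulations Congress may act.'))
```

In the last call the guessed span also swallows "Congress", which is exactly
the kind of error that forces manual review of the converted tagging.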
While each SGML to SGML conversion will have its
unique set of obstacles to overcome, considering some of these
guidelines will help smooth the conversion effort. Sometimes you may not
have that luxury. Think of a situation where the source DTD has already
been built, and you don’t have the flexibility to change it to help
you deal with potential future DTDs. In such a case, you may consider
first converting to a ‘super-DTD’ that allows for the features you may
need later when converting to other DTDs. Then, it’s pretty easy to ‘dumb’
the SGML down to each of the other DTDs. This approach can be extremely
helpful in allowing you to produce final SGML outputs that are less
prone to the errors or complications typical when trying to automate
SGML to SGML conversions.
Michael Gross
Director of Research & Development
Data Conversion Laboratory
Phone: 718-357-8700 x 236
Fax: 718-357-8776
mikegross@dclab.com
See also: the preceding "What's The Story?" article on Converting
SGML to SGML, and Mike's "What's The Story?" article, Converting
Quark to XML.