Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 

 Part Two in a Series on SGML to SGML ...

 

ABSTRACT:  As the Internet evolves and information becomes increasingly valuable, so too does the competition for revenues from that information.  As publishers and societies seek new ways to yield revenues (and maintain existing ones), repurposing of data for licensing reasons becomes a critical part of the overall strategy.  To that end, SGML/XML is the way to go – but which SGML?  And what about multiple SGMLs?

In our last article, we discussed some of the problems that typically occur in SGML to SGML conversions, making them more complex. In this article, we’d like to discuss some ways you can set up your initial DTD so as to make it easier to port your content to other DTDs later.

1)  Try not to dispose of text in the source material, because you may need to put it back – This is absolutely critical, yet we’ve found it to be a recurring issue. As we discussed previously, the level at which a paragraph is decomposed into a series of content tags can vary greatly from DTD to DTD. Take the bibliographic entries at the end of a scholarly journal article: one DTD may require these entries to be completely decomposed into author, article title, journal title, volume, page range, etc., disposing of the punctuation that exists between components, while another may require that only the authors’ names and article title be decomposed, leaving the rest of the entry as it appeared in hardcopy. From a structured markup perspective, disposing of the punctuation makes sense (since it carries ‘how it looks’ information, not ‘what it is’ information). In reality, however, you may need to convert to a less structured output that requires these entries to be fully punctuated. If your initial DTD required that this punctuation be removed, you’ll have to write considerably more complicated code to regenerate it when converting from the first DTD to the second.

The solution is quite simple, but current publishing / markup practices don’t allow for it. Basically, it makes sense to preserve the punctuation inside of special tagging. This way, it’s there if you need it, and it can be ignored if you don’t. Take another example: you’ve got a cross-reference to a figure. The source file might contain text like “see Fig. 5” or “see Figure 5”. But some DTDs require the actual ‘Fig.’ or ‘Figure’ text to be removed. Here again, if you can hold this text inside of its own tag, it will be there to put back if you’re converting to a DTD that requires it.
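The idea of holding punctuation inside its own tagging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (bibentry, author, atitle, jtitle, volume, pages, punct) and the sample entry are hypothetical and do not come from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup: every element name and value here is an
# assumption made for illustration, not taken from a real DTD.
SOURCE = (
    "<bibentry>"
    "<author>Smith J</author><punct>. </punct>"
    "<atitle>Porting markup</atitle><punct>. </punct>"
    "<jtitle>J Doc Eng</jtitle><punct> </punct>"
    "<volume>12</volume><punct>:</punct>"
    "<pages>34-56</pages><punct>.</punct>"
    "</bibentry>"
)

def to_structured(entry):
    """Fully decomposed target: the preserved punctuation is simply ignored."""
    return {child.tag: child.text for child in entry if child.tag != "punct"}

def to_punctuated(entry):
    """Less structured target: the preserved punctuation is put back in place."""
    return "".join(child.text or "" for child in entry)

entry = ET.fromstring(SOURCE)
structured = to_structured(entry)
punctuated = to_punctuated(entry)
print(structured)
print(punctuated)
```

Both outputs come from the same source with trivial code. Had the punctuation been discarded, the second function would need rules for regenerating it from the field types.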

2)  Avoid nontraditional approaches – Because you may eventually need to go to another DTD that does not yield as much flexibility as the one you’re currently designing, try to avoid using approaches that may be difficult to implement in other applications.

Here are two simple examples, related to special characters. The first is an approach that we’ve seen used to apply tagging around a character in order to produce a diacritical above it, such as <acute>A</acute>, instead of using standard ISO character entities where possible, such as &Aacute;. Although this tagging approach does allow you to produce more types of diacritical characters, it is likely that a target DTD will not allow an equivalent, making the conversion task much more difficult. The second is one in which attributes are used to store actual document text, for example <chapter title="Introduction">. In these situations, special emphasis or formatting is difficult to apply, and we’ve seen DTDs that overcome this by ‘inventing’ emphasis character entities, such as

<chapter title="Purifying H&sub;2&esub;O">

Here, &sub; and &esub; make use of character entities in a nonstandard way, again adding complexity to conversion to other DTDs.

In order to yield maximum utility from your data, these types of workaround approaches are best avoided.
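To see why tagged diacriticals complicate a later conversion, consider a minimal Python sketch that maps them back to standard ISO entities. The element names (acute, grave, uml) are hypothetical stand-ins for whatever a source DTD might define, and the lookup table is deliberately tiny; a real conversion would need the full entity sets.

```python
import re

# A few ISO 8879 (ISOlat1) entity names, keyed by (accent tag, base letter).
ISO_ENTITIES = {
    ("acute", "A"): "Aacute",
    ("acute", "e"): "eacute",
    ("grave", "e"): "egrave",
    ("uml", "o"): "ouml",
}

# The accent element names are assumptions for illustration only.
TAGGED = re.compile(r"<(acute|grave|uml)>(.)</\1>")

def normalize_diacritics(text):
    """Swap tagged diacriticals for ISO entities; collect those with no equivalent."""
    unresolved = []

    def swap(match):
        key = (match.group(1), match.group(2))
        if key in ISO_ENTITIES:
            return "&" + ISO_ENTITIES[key] + ";"
        unresolved.append(match.group(0))
        return match.group(0)  # left untouched for manual review

    return TAGGED.sub(swap, text), unresolved

converted, leftovers = normalize_diacritics(
    "<acute>A</acute>lvaro met Ren<acute>e</acute> and <acute>q</acute>."
)
print(converted)
print(leftovers)
```

The accented q in the sample has no standard entity, so it ends up in the leftover list. That leftover list is the extra work the nontraditional approach creates.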

3)  Try to anticipate how other DTDs may require tagging - If possible, examine other approaches or try to anticipate how other DTDs may require tagging. This will help make the later conversion much easier.

For example, some DTDs use empty cross-reference tagging in SGML (id/idref usage), such as: see <xref id="S35">Rules and Regulations.

In this case, the tagging will cause the display of a point that one can click on to follow the cross-reference link. Here, the text around the hyperlink will not be highlighted (which may satisfy the requirements of the source DTD). However, if you examine other DTDs or consider other approaches, you may be able to anticipate cross-reference linking approaches that require tagging around the whole piece of text being cross-referenced (i.e., see <xref id="S35">Rules and Regulations</xref>).

If the source SGML uses the first approach (an empty <xref> tag) and the target DTD requires the second, the conversion software is left ‘guessing’ how far the <xref> text should extend. This guessing means that significant effort will have to go into the coding to optimize the guessing process. We know from experience that some percentage of the time, the heuristic used for the guessing will be wrong. Therefore, we’ll probably have to do considerable manual review of the converted tagging. Clearly, it’s better to plan for this up front, if possible. Doing extra tagging in advance will probably require significantly less effort than doing it manually later.
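The kind of guessing involved can be illustrated with a small Python sketch. The heuristic here is an assumption chosen for illustration, not a description of any production converter: it wraps a run of capitalized words following the empty tag, allowing a few lowercase connector words inside the run, and flags anything it cannot handle.

```python
import re

# Lowercase words allowed inside a capitalized title (an assumed, partial list).
CONNECTORS = r"(?:and|of|the|for)"

# An empty <xref> tag, optionally followed by a run of capitalized words.
EMPTY_XREF = re.compile(
    r'(<xref id="[^"]+">)'
    r"([A-Z]\w*(?:\s+(?:" + CONNECTORS + r"\s+)*[A-Z]\w*)*)?"
)

def close_xrefs(text):
    """Guess where each empty <xref> should end; return new text and tags needing review."""
    needs_review = []

    def wrap(match):
        tag, title = match.group(1), match.group(2)
        if title is None:
            needs_review.append(tag)  # no capitalized run: leave for a person
            return tag
        return tag + title + "</xref>"

    return EMPTY_XREF.sub(wrap, text), needs_review

good, _ = close_xrefs('see <xref id="S35">Rules and Regulations.')
wrong, _ = close_xrefs('see <xref id="S35">Rules and regulations.')
unchanged, flagged = close_xrefs('see <xref id="S12">section 12 below.')
print(good)
print(wrong)
print(unchanged, flagged)
```

The first call guesses correctly. The second closes the tag after "Rules" because the title is not in title case. The third finds nothing to wrap and can only flag the tag. A wrong guess like the second is not flagged at all, which is why manual review of the converted tagging remains necessary.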

While each SGML to SGML conversion will have its unique set of obstacles to overcome, considering some of these guidelines will help smooth the conversion effort. Sometimes you may not have that luxury. Think of a situation where the source DTD has already been built, and you don’t have the flexibility to change it to help you deal with potential future DTDs. In such a case, you may consider first converting to a ‘super-DTD’ that allows for the features you may need later when converting to other DTDs. Then, it’s pretty easy to ‘dumb’ the SGML down to each of the other DTDs. This approach can be extremely helpful in allowing you to produce final SGML outputs that are less prone to the errors or complications typical when trying to automate SGML to SGML conversions.

Michael Gross
Director of Research & Development
Data Conversion Laboratory
Phone: 718-357-8700 x 236
Fax: 718-357-8776
mikegross@dclab.com

 


See also the preceding "What's The Story?" article, Converting SGML to SGML, and Mike's "What's The Story?" article, Converting Quark to XML.


Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.