Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Converting the World's Knowledge

Getting an Encyclopedia On the Web


ABSTRACT: Getting an encyclopedia on-line is a monumental task. All that information has to be converted to a structured format like SGML. DCL specializes in such large complex conversions. The key is a well-defined process that includes careful design and plenty of customer feedback. From detailed conversion specifications to a review pass with pre-composition software, converting the world's knowledge becomes a possibility. This article details the steps in that process through the use of a composite case study.

Putting together an encyclopedia is a monumental task. Articles are collected from experts in hundreds of different fields and laid out in a consistent format for volume after volume, and then there is the general index, filled with references into this gigantic opus. It takes years to develop an encyclopedia, and the process of updating, refining and expanding never ends.

So how can something of this magnitude get on-line? It's not easy, but clearly it's happening. Not just encyclopedias, but dictionaries, almanacs and most other kinds of reference books, ranging in size from bulky to colossal, are available over the Internet or on CD-ROM.

It's not hard to understand why someone in the Board Room would demand on-line access to their encyclopedia, but what does Joe Vice President do when he gets assigned the task?

Choosing SGML

After a good deal of research and asking around, Joe chooses SGML (Standard Generalized Markup Language) from the many options available for electronic publishing. SGML is designed for structured data, and structure is a necessary element of electronic publishing: it enables multiple uses for the same data (CD-ROM, Web pages, FTP, e-mail), hyperlinking, and sophisticated search and retrieval. SGML achieves this structure with a Document Type Definition (DTD), which is tailored to a specific kind of document (encyclopedias vs. dictionaries vs. repair manuals vs. insurance policies, and so on). Perhaps most importantly, SGML is a standard that is easily converted to most other formats (like HTML for the Web).

Now that Joe has made this decision, the problem becomes getting the encyclopedia into SGML. Conversion to a highly structured format is difficult because it means adding structure to the data. Initially, Joe thinks his SGML team (be they in-house experts or consultants) will convert the data, but he soon realizes that an effort of this scope would require a team of workers dedicated to the conversion alone. Such a team would drain his much-needed SGML people, face a significant learning curve, and add staff who would become redundant the moment the conversion was over.

Bring in the Conversion Pros

Joe decides to hire a conversion specialist. As the following example will demonstrate, document conversion has its own challenges that go well beyond SGML knowledge alone. (By the way, Joe's conversion is really a composite case study of the work Data Conversion Laboratory has done for various encyclopedia publishers.)

Design & Setup

To ease the conversion, DCL has established a well-defined process for planning and designing the project with customer feedback. First, Joe sends DCL the encyclopedia itself, along with typesetting tapes. Fortunately, since much of the typesetting was already done electronically, most of the material will not have to be keyed in.

Our data analysts carefully study this material and compare it to previous encyclopedia projects. They also study a "mark-up" that Joe supplies. This consists of photocopied encyclopedia pages with tag names written in the margins, so we know how the customer wants the data tagged. You might think this can be determined from the DTD, but there are always ambiguities. To further assure customer agreement, DCL prepares a small sample, called a "proof of concept" sample. Once Joe approves this sample, the data analyst writes up conversion specifications.

These specifications form the road map of the conversion. They list all of the relevant structures in the encyclopedia (articles, bylines, reference lists, headings, lifetimes for the biographical entries, etc.), how they can be recognized in the source data and how they will be tagged in the SGML. Trying to do an SGML conversion (or any complex conversion) without conversion specifications is like building a house without a blueprint, yet many conversions are attempted without such a document.
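
As a rough illustration of what such a specification captures, the sketch below writes a few entries as data: each names a structure, says how it is recognized in the source, and says how it is tagged. The typesetting codes (`<HD1>`, `<BY>`, `<LF>`, `<TX>`) and the tag names are hypothetical, invented for this example; real ones would come from the customer's typesetting system and DTD. It is not DCL's software, only a minimal sketch of the idea.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    structure: str  # name of the structure in the specification
    pattern: str    # how it is recognized in the source (hypothetical codes)
    tag: str        # SGML element it becomes (hypothetical tag names)


# Hypothetical specification entries, for illustration only.
SPEC = [
    Rule("article heading", r"^<HD1>(.*)$", "heading"),
    Rule("byline", r"^<BY>(.*)$", "byline"),
    Rule("lifetime", r"^<LF>(.*)$", "lifetime"),
    Rule("paragraph", r"^<TX>(.*)$", "para"),
]


def convert_line(line):
    """Tag one source line using the first matching rule, or return None."""
    for rule in SPEC:
        match = re.match(rule.pattern, line)
        if match:
            return f"<{rule.tag}>{match.group(1)}</{rule.tag}>"
    return None


def convert(lines):
    """Return (tagged lines, unmatched lines with their line numbers)."""
    tagged, unmatched = [], []
    for number, line in enumerate(lines, start=1):
        result = convert_line(line)
        if result is None:
            # Unrecognized input is reported, not guessed at.
            unmatched.append((number, line))
        else:
            tagged.append(result)
    return tagged, unmatched


source = ["<HD1>Example Person", "<LF>1800-1870", "stray text with no code"]
tagged, unmatched = convert(source)
```

Lines that match no rule are collected separately so an analyst can decide whether the specification needs another entry.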

Joe then approves these specifications. DCL has found customer feedback essential for a successful conversion. One of the realities of such a complex project is that one cannot completely visualize it at the start. The customer must be given the opportunity to adjust their vision throughout the project, until the newly implemented system finally takes shape. Conversion is therefore best conceived of as a collaborative effort between the customer and the vendor.

A larger sample is prepared, which is called the production sample, because it emulates the full production process. DCL's software is configured to tag the data in the same way as this sample, so that there are no surprises.

Production

This is the phase we did not rush into, but the careful preparation more than makes up for the wait by minimizing manual work and rework. Weekly deliveries are sent to Joe, who can load them into his CD software to make sure the data works. When it doesn't work, adjustments can be made. For instance, let's say the chemistry doesn't look right. It's tagged properly, but his CD application can't handle it in this format (it couldn't be tested earlier because the chemistry viewer wasn't ready until after production began). Because this discrepancy was caught as early as possible, Joe has options: DCL can change the conversion process to retag formulae, or Joe can have the chemistry viewer modified.

The automated process more than repays the effort spent on setup. The bulk of the work is done through software, which not only saves up-front labor costs, but also reduces the effort needed to check and correct the converted material.

Format Review

Quality control should be part of any production process. DCL's primary quality check is called a "format review," because the SGML is loaded into precomposition software, which formats the documents to visually demonstrate how they are tagged. This sort of specialized composition is more effective for review than a publishing-oriented full composition package.
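
The idea of showing a reviewer how each piece of text is tagged can be sketched in a few lines. The example below is only a stand-in for precomposition software, not a description of DCL's tool: it assumes simple, fully closed tags with hypothetical names (`entry`, `heading`, `byline`, `para`) and uses Python's `html.parser` as a tolerant tag reader rather than a real SGML parser.

```python
from html.parser import HTMLParser


class FormatReview(HTMLParser):
    """Collect one labeled, indented line per run of text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.lines = []  # review lines produced so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the close matches the innermost open element.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            indent = "  " * (len(self.stack) - 1)
            label = self.stack[-1] if self.stack else "UNTAGGED"
            self.lines.append(f"{indent}[{label}] {text}")


def review(sgml):
    """Return the review lines for a tagged document fragment."""
    parser = FormatReview()
    parser.feed(sgml)
    parser.close()
    return parser.lines


# The author's name has been tagged as a second heading. The markup is
# still structurally clean, but the label makes the error visible.
sample = (
    "<entry><heading>John I</heading>"
    "<heading>A. Author</heading>"
    "<para>Body text.</para></entry>"
)
report = review(sample)
```

Printing `report` line by line puts the element name next to the text it wraps, so a mis-tagged byline is obvious to a human reader even though nothing about the markup is malformed.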

DCL's format review phase is unique in the industry, but we feel it's critical to the quality of the finished product. Other vendors promise parseable SGML, but we found that parsing isn't enough. If a magazine editor sent all of his articles to a copy editor for spell checking and grammar checking, but never looked at them himself, he would soon be fired. We feel that a conversion service should have the same responsibility to make sure the data is correct, not just parseable.

Final Review

As mentioned above, conversion is not simply a matter of dropping off your old data and picking up finished documents. Conversion is a team effort. Even after the converted material is received, there is additional work for the client. As a publisher, Joe is very demanding about his documentation. A final, thorough review is done by his own staff. Because of our feedback process and quality control, his editors are able to focus on high-level subject-matter issues (e.g., is this link to John I connected to the right John I?).

Without a format review, customer cleanup can be a long and costly process. And unless you provide feedback throughout the process, you may find that your data is perfectly valid, but does not meet your needs. The subjective nature of SGML makes this mishap a likely, if not inevitable, occurrence if nothing is done to avoid it.

Moving Mountains

Joe's encyclopedia is not only on-line, it's on budget and on schedule. By breaking the project down into manageable stages, doing plenty of preparation up front, and working together, he and DCL were able to successfully move a mountain of data complete with multiple indexing, customized searches and multimedia hyperlinks.

And Joe can hardly wait for the next Board meeting.

Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.