DCL  

Converting the World's Knowledge

Getting an Encyclopedia On the Web


ABSTRACT: Getting an encyclopedia on-line is a monumental task. All that information has to be converted to a structured format like SGML. DCL specializes in such large, complex conversions. The key is a well-defined process that includes careful design and plenty of customer feedback. With detailed conversion specifications at the start and a review pass with pre-composition software at the end, converting the world's knowledge becomes a possibility. This article details the steps in that process through the use of a composite case study.

Putting together an encyclopedia is a monumental task. Articles are collected from experts in hundreds of different fields and then laid out in a consistent format for volume after volume. Then there's the general index, filled with references into this gigantic opus. It takes years to develop an encyclopedia, and the process of updating, refining and expanding never ends.

So how can something of this magnitude get on-line? It's not easy, but clearly it's happening. Not just encyclopedias, but dictionaries, almanacs and most other kinds of reference books, ranging in size from bulky to colossal, are available over the Internet or on CD-ROM.

It's not hard to understand why someone in the Board Room would demand on-line access to their encyclopedia, but what does Joe Vice President do when he gets assigned the task?

Choosing SGML

After a good deal of research and asking around, Joe chooses SGML (Standard Generalized Markup Language) from the many options available for electronic publishing. SGML is designed for structured data, a necessary element of electronic publishing. This structure is what enables multiple uses for your data (CD-ROM, Web pages, FTP, e-mail), hyperlinking and sophisticated search and retrieval. SGML achieves this structure with a Document Type Definition (DTD), which is tailored to a specific kind of document (encyclopedias vs. dictionaries vs. repair manuals vs. insurance policies, and so on). Perhaps most importantly, SGML is a standard that is easily converted to most other formats (like HTML for the Web).
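To make the "multiple uses" point concrete, here is a minimal sketch in Python of one structured article rendered two ways. It is an illustration only, not DCL's tooling: the tag names, the `HTML_TAGS` mapping and both function names are assumptions, and well-formed XML stands in for SGML because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured article. Well-formed XML stands in for SGML here
# because Python's standard library has no SGML parser.
ARTICLE = (
    "<article>"
    "<title>Volcano</title>"
    "<p>A vent in the crust of a planet.</p>"
    "</article>"
)

# Assumed mapping from structural tags to HTML presentation tags.
HTML_TAGS = {"title": "h1", "p": "p"}


def to_html(markup: str) -> str:
    """Render each top-level element as its mapped HTML tag."""
    root = ET.fromstring(markup)
    parts = []
    for child in root:
        tag = HTML_TAGS.get(child.tag, "div")
        parts.append(f"<{tag}>{escape(child.text or '', quote=False)}</{tag}>")
    return "\n".join(parts)


def to_plain_text(markup: str) -> str:
    """Render the same structure as plain text, for example for e-mail."""
    root = ET.fromstring(markup)
    return "\n".join((child.text or "").strip() for child in root)


print(to_html(ARTICLE))
print(to_plain_text(ARTICLE))
```

Because the source says what each piece is (a title, a paragraph) rather than how it looks, each output format needs only its own small mapping.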

Now that Joe has made this decision, the problem becomes getting the encyclopedia into SGML. Conversion to a highly structured format is difficult because it means adding structure to the data. Initially, Joe thinks his SGML team (be they in-house experts or consultants) will convert the data, but he soon realizes that the scope of such an effort would require a team of workers dedicated to the conversion alone. Such a team would drain his much-needed SGML people, face a significant learning curve and add extra staff who would become redundant the moment the conversion was over.

Bring in the Conversion Pros

Joe decides to hire a conversion specialist. As the following example will demonstrate, document conversion has its own challenges that go well outside the scope of SGML knowledge alone. (By the way, Joe's conversion is really a composite case study of the work Data Conversion Laboratory has done for various encyclopedia publishers.)

Design & Setup

To ease the conversion, DCL has established a well-defined process for planning and designing the project with customer feedback. First, Joe sends DCL the encyclopedia itself, along with typesetting tapes. Fortunately, since much of the typesetting was already done electronically, most of the material will not have to be keyed in.

Our data analysts carefully study this material and compare it to previous encyclopedia projects. They also study a "mark-up" that Joe sent. This consists of photocopied encyclopedia pages with tag names written in the margins, so we know how the customer wants the data tagged. You might think this can be determined from the DTD, but there are always ambiguities. To further assure customer agreement, DCL prepares a small sample, called a "proof of concept" sample. Once Joe approves this sample, the data analyst writes up conversion specifications.

These specifications form the road map of the conversion. They list all of the relevant structures in the encyclopedia (articles, bylines, reference lists, headings, lifetimes for the biographical entries, etc.), how they can be recognized in the source data and how they will be tagged in the SGML. Trying to do an SGML conversion (or any complex conversion) without conversion specifications is like building a house without a blueprint, yet many conversions are attempted without such a document.
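As a rough illustration of what one slice of such a specification might look like once it is turned into software, the sketch below pairs each structure with a recognition pattern and an output tag. Everything specific in it is hypothetical: the `@HD:` and `@BY:` typesetting codes, the tag names, and the `Rule` and `tag_line` names are invented for the example and are not taken from any real DCL specification.

```python
import re
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Rule:
    """One spec entry: a structure, how to recognize it, how to tag it."""
    structure: str
    pattern: re.Pattern
    tag: str


# Hypothetical typesetting codes and tag names. Each pattern has exactly
# one capture group holding the content to be wrapped in the output tag.
RULES = [
    Rule("heading", re.compile(r"^@HD:(.*)$"), "title"),
    Rule("byline", re.compile(r"^@BY:(.*)$"), "byline"),
    Rule("lifetime", re.compile(r"^\((\d{4}-\d{4})\)$"), "lifetime"),
]


def tag_line(line: str) -> str:
    """Apply the first matching rule; untagged lines become paragraphs."""
    for rule in RULES:
        match = rule.pattern.match(line)
        if match:
            content = escape(match.group(1), quote=False)
            return f"<{rule.tag}>{content}</{rule.tag}>"
    return f"<p>{escape(line, quote=False)}</p>"


for source_line in ["@HD:Mozart, Wolfgang Amadeus", "(1756-1791)", "Austrian composer."]:
    print(tag_line(source_line))
```

A real specification also has to cover the ambiguous cases (a parenthesized date range that is not a lifetime, for instance), which is exactly why it needs customer review before production starts.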

Joe then approves these specifications. DCL has found customer feedback essential for a successful conversion. One of the realities of such a complex project is that one cannot completely visualize it at the start. The customer must be given the opportunity to adjust their vision throughout the project, until the newly implemented system finally takes shape. Conversion is therefore best conceived of as a collaborative effort between the customer and the vendor.

Next, DCL prepares a larger sample, called the production sample because it emulates the full production process. DCL's software is configured to tag the data in the same way as this sample, so that there are no surprises.
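One way to picture the "no surprises" check is as a comparison between the approved sample and what the configured software actually produces for the same input. The sketch below uses Python's standard `difflib` module for that; the function name, the file labels and the sample strings are assumptions made for illustration, and the article does not describe how DCL performs this comparison.

```python
import difflib


def sample_diff(approved: str, produced: str) -> list:
    """Return unified-diff lines; an empty list means the output matches."""
    return list(
        difflib.unified_diff(
            approved.splitlines(),
            produced.splitlines(),
            fromfile="approved_sample",    # hypothetical label
            tofile="converter_output",     # hypothetical label
            lineterm="",
        )
    )


# Hypothetical approved sample and a converter run that mis-tags the byline.
approved = "<title>Volcano</title>\n<byline>J. Smith</byline>"
produced = "<title>Volcano</title>\n<p>J. Smith</p>"

for line in sample_diff(approved, produced):
    print(line)
```

An empty result for every article in the sample is a cheap signal that the configuration reproduces what the customer signed off on.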

Production

We did not rush into this phase, but the wait is more than repaid by careful preparation, which minimizes the amount of manual work and rework. Weekly deliveries are sent to Joe, who can load them into his CD software to make sure the data works. When it doesn't work, adjustments can be made. For instance, let's say the chemistry doesn't look right. It's tagged properly, but his CD application can't handle it in this format (it couldn't be tested earlier because the chemistry viewer wasn't ready until after production began). Because this discrepancy was caught as early as possible, Joe has options: DCL can change the conversion process to retag formulae, or Joe can have the chemistry viewer modified.

The automated process more than makes up for the effort spent on setup. The bulk of the work is done through software, which not only saves up-front labor costs, but also reduces the effort needed to check and correct the converted material.

Format Review

Quality control should be part of any production process. DCL's primary quality check is called a "format review," because the SGML is loaded into precomposition software, which formats the documents to visually demonstrate how they are tagged. This sort of specialized composition is more effective for review than a publishing-oriented full composition package.
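The idea of formatting a document so that its tagging becomes visible can be sketched in a few lines. The example below is not DCL's precomposition software; it is a toy Python outline view in which every element is printed with its tag name and nesting depth. The tag names and the `review_lines` function are hypothetical, and well-formed XML stands in for SGML because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET


def review_lines(markup: str) -> list:
    """Return one '[tag] text' line per element, indented by nesting depth.

    Only each element's leading text is shown; text that follows a child
    element (its "tail") is ignored to keep the sketch short.
    """
    lines = []

    def walk(element, depth):
        text = (element.text or "").strip()
        lines.append(f"{'  ' * depth}[{element.tag}] {text}".rstrip())
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(markup), 0)
    return lines


# Hypothetical converted article.
SAMPLE = (
    "<article>"
    "<title>Mozart, Wolfgang Amadeus</title>"
    "<lifetime>1756-1791</lifetime>"
    "<p>Austrian composer.</p>"
    "</article>"
)

print("\n".join(review_lines(SAMPLE)))
```

A reviewer scanning this view sees at a glance which text landed in which tag, which is far easier than reading raw markup.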

DCL's format review phase is unique in the industry, but we feel it's critical to the quality of the finished product. Other vendors promise parseable SGML, but we found that parsing isn't enough. If a magazine editor sent all of his articles to a copy editor for spell checking and grammar checking, but never looked at them himself, he would soon be fired. We feel that a conversion service should have the same responsibility to make sure the data is correct, not just parseable.
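The gap between "parseable" and "correct" is easy to demonstrate. In the hypothetical fragment below, the lifetime and the description have been swapped: the markup parses cleanly, yet the tagging is wrong. The Python sketch adds one content check on top of parsing. The tag names, the date-range pattern and the `content_problems` function are assumptions for illustration, and XML well-formedness stands in for SGML parsing because Python's standard library has no SGML parser.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Parses without error, but the lifetime and paragraph content are swapped.
SAMPLE = (
    "<article>"
    "<title>Mozart, Wolfgang Amadeus</title>"
    "<lifetime>Austrian composer</lifetime>"
    "<p>1756-1791</p>"
    "</article>"
)

# Assumed convention: a lifetime is a four-digit year range such as 1756-1791.
LIFETIME = re.compile(r"^\d{4}-\d{4}$")


def content_problems(markup: str) -> List[str]:
    """Parse the markup, then report lifetimes that do not look like dates."""
    root = ET.fromstring(markup)  # raises ParseError if not well-formed
    problems = []
    for element in root.iter("lifetime"):
        text = (element.text or "").strip()
        if not LIFETIME.match(text):
            problems.append(f"lifetime does not look like a date range: {text!r}")
    return problems


print(content_problems(SAMPLE))
```

Automated checks like this catch only the patterns someone thought to encode, so they complement a human review of the formatted output rather than replace it.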

Final Review

As mentioned above, conversion is not simply a matter of dropping off your old data and picking up finished documents. Conversion is a team effort. Even after the converted material is received, there is additional work for the client. As a publisher, Joe is very demanding about his documentation. A final, thorough review is done by his own staff. Because of our feedback process and quality control, his editors are able to focus on high-level subject-matter issues (e.g., is this link to John I connected to the right John I?).

Without a format review, customer cleanup can be a long and costly process. And unless you provide feedback throughout the process, you may find that your data is perfectly valid, but does not meet your needs. Because many SGML tagging decisions are subjective, this mishap is a likely, if not inevitable, occurrence if nothing is done to avoid it.

Moving Mountains

Joe's encyclopedia is not only on-line, it's on budget and on schedule. By breaking the project down into manageable stages, doing plenty of preparation up front, and working together, he and DCL were able to move a mountain of data, complete with multiple indexing, customized searches and multimedia hyperlinks.

And Joe can hardly wait for the next Board meeting.


Corporate office:
61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365, P: 718-357-8700
Data Conversion Lab
Copyright © 1997-2009  Data Conversion Laboratory, Inc. All rights reserved.