
Why Publishers Should Use XML...

Every moment publishers put off embracing XML, they miss a powerful opportunity to reduce composition costs and increase revenues, says David Skurnik, VP of Sales at Data Conversion Laboratory.

[Thanks to Lori Barber, Business Development Manager, and Beth Friedman, Senior Project Manager, for their input.]

It's 3pm. You've just had a meeting with your Executive Editor. He told you to cut the cost of production and reduce the time to market of your books and journals. You now go to another meeting, this time with the VP of Marketing. She tells you that, in an effort to increase revenue from existing customers, her team is planning to develop dozens of separately purchasable, value-added features for the online products. What's more, to attract new buyers, they are targeting smaller markets with virtual niche journals built from existing journal content. They are also building up extra distribution channels through content aggregators and abstracting and indexing services.

The upshot is you will have to produce a lot more products ... and they will have to be easily adaptable for use in multiple vendors' systems.

You then make a beeline to your Executive Editor's office and tell him you can't decrease overall production costs because you now have to support more platforms and produce more products. The Editor smiles sagely at you and wishes you good luck in realizing organizational goals.

Sound Familiar?


In today's high-tech and ever more demanding publishing climate, many production managers yearn for the "good old days" when all they had to produce was paper. But, as we quickly learned, those days are only a pleasant memory. To keep the Executive Editor and the VP of Marketing happy, we have to embrace new technology and re-engineer our production process to exploit it.

Before we discuss technology, let's look at what the perfect production environment would be. From my conversations with numerous publishing professionals, I came up with the following list:

  1. "I want to automate the production process as much as possible so I can shorten the time to publication and reduce costs."
  2. "I want to be able to produce print and multiple types of electronic products from a single, seamless production process."
  3. "I want to make content available for public consumption at a very early stage in the production process."
  4. "I want to retain a single instance of all my content in a repository, so I can reduce the amount of time and expense needed to maintain existing products and produce new products."
  5. "I want the capability to enhance my product by improving the end user experience without painful new development."
  6. "I want Print on Demand capability."
  7. "I want to protect my content."

Are these goals attainable?

Yes. Technology has made these goals fully attainable. But before the mansion can be built you have to dig the foundation. In the case of Publishing, the foundation is the data.  That is why data needs to be in "smart formats" like SGML and XML. With intelligent data, technology can be exploited to accomplish your goals.

Let's examine the tasks you will ask of your data:

  1. Repurposing - You will need to create new versions of data suitable for derivative uses (e.g. the web, diagnostic equipment, hand-held devices, voice devices).
  2. Searching - You will need the ability to find information through text searches and through more advanced searches that depend on context and "understanding."
  3. Component Reuse - You will need the ability to reuse portions of data for different products and different documentation sets.
  4. Interchange with Vendors - You will need to provide your data in a form that your vendors and distributors can easily use.
  5. Enforce Data Standards - The data will have to be consistent to reflect your branding and be predictable in the sense that people will know where to find the information they are seeking. This will only be accomplished if the way the information is presented is standard throughout your products.
  6. Security - Make it easier for security solutions to protect the data.

Not surprisingly, the benefits of SGML and XML are exactly those mentioned above.

What are SGML and XML?

Definition and History of SGML, HTML and XML

SGML (Standard Generalized Markup Language) was developed in the 1980s as a non-proprietary, platform-independent method of describing the structure of a document rather than its appearance.

What does it mean to tag a document based on structure? 

In a bibliographic reference, for example, there is a title, author details, a publisher, a year of publication, and perhaps page citations. A reader can distinguish between the different types of information from both experience and the appearance of the data (the title may be in italics, for example). A reference containing SGML or XML mark-up, by contrast, has tags that explicitly identify the structures within the reference.

A reference could be tagged as simply as this:

<citation citation_type="journal" id="A123">Abdel Malek, Z, Swope, VB, Pallas, J, Krug, K &amp; Nordlund, JJ. <it>Mitogenic, melanogenic and cAMP responses of cultured neonatal human melanocytes to commonly used mitogens</it>. J Cell Physiol, 150, 416&ndash;425, 1992.</citation>

The more the data is tagged, the more "intelligence" is added to the document. This intelligence makes it easier to "slice and dice" the data for creating new products and to transform the data for other systems. The following example shows every reference element tagged separately.

<jnlref><au><snm>Abdel Malek</snm>, <fnms>Z</fnms></au>, <au><snm>Swope</snm>, <fnms>VB</fnms></au>, <au><snm>Pallas</snm>, <fnms>J</fnms></au>, <au><snm>Krug</snm>, <fnms>K</fnms></au> &amp; <au><snm>Nordlund</snm>, <fnms>JJ.</fnms></au> <tl>Mitogenic, melanogenic and cAMP responses of cultured neonatal human melanocytes to commonly used mitogens</tl>. <pubtl>J Cell Physiol</pubtl>, <vid>150</vid>, <ppf>416</ppf>&ndash; <ppl>425</ppl>, <cd year="1992">1992</cd></jnlref>.
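To make the "slice and dice" point concrete, here is a minimal Python sketch that pulls individual fields out of a shortened copy of the fully tagged reference. It uses the standard library's xml.etree.ElementTree parser. The helper names (summarize, has_author) are made up for this illustration, and the en dash is written as a numeric character reference because a plain XML parser does not recognize the entity name "ndash" unless a DTD declares it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fully tagged reference (three of the five authors).
# "&#8211;" replaces "&ndash;", which is undefined without a DTD.
REFERENCE = (
    '<jnlref>'
    '<au><snm>Abdel Malek</snm>, <fnms>Z</fnms></au>, '
    '<au><snm>Swope</snm>, <fnms>VB</fnms></au> &amp; '
    '<au><snm>Nordlund</snm>, <fnms>JJ.</fnms></au> '
    '<tl>Mitogenic, melanogenic and cAMP responses of cultured neonatal '
    'human melanocytes to commonly used mitogens</tl>. '
    '<pubtl>J Cell Physiol</pubtl>, <vid>150</vid>, '
    '<ppf>416</ppf>&#8211;<ppl>425</ppl>, <cd year="1992">1992</cd>'
    '</jnlref>'
)


def summarize(ref_xml):
    """Extract the separately tagged parts of one journal reference."""
    ref = ET.fromstring(ref_xml)
    return {
        "surnames": [au.findtext("snm") for au in ref.findall("au")],
        "journal": ref.findtext("pubtl"),
        "volume": ref.findtext("vid"),
        "pages": (ref.findtext("ppf"), ref.findtext("ppl")),
        "year": ref.find("cd").get("year"),
    }


def has_author(ref_xml, surname):
    """Structure-aware search: match only against tagged surnames."""
    return surname in summarize(ref_xml)["surnames"]


info = summarize(REFERENCE)
print(info["surnames"], info["journal"], info["year"])
print(has_author(REFERENCE, "Swope"))
```

None of this is possible with the simpler citation markup, where the authors, journal and year are all part of one undifferentiated run of text.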

Prior to applying tags to a document, you have to define some basic rules that determine:

  1. What structures within the document are to be tagged. In a bibliographic reference, you would need to decide whether to tag every author's name separately or to use one tag for the entire group of authors. The decision depends on what type of intelligence you want to extract from the data.
  2. What the tags will be named. Using our previous example, you may wish to use "authors" as the tag that describes an author's name, or use "au".
  3. The order in which these structures may appear within the document. In our reference example, you would want to ensure that the authors' names always precede the year of publication.

Document Type Definition

These rules make up a document called a Document Type Definition (DTD). The DTD has to be developed before any conversion begins, because it supplies the basic rules the conversion must follow.
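The kind of rule a DTD expresses can be sketched in a few lines of Python. The check below is only a stand-in for what a validating parser does: the element order is taken from the tagged reference example, while the rule table and the function name check_reference are assumptions made for this illustration, not part of any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a DTD content model: one or more <au>,
# followed by exactly one each of the remaining elements, in this order.
EXPECTED_ORDER = ["au", "tl", "pubtl", "vid", "ppf", "ppl", "cd"]
REPEATABLE = {"au"}


def check_reference(ref_xml):
    """Return a list of rule violations; an empty list means the reference conforms."""
    ref = ET.fromstring(ref_xml)
    if ref.tag != "jnlref":
        return [f"root element is <{ref.tag}>, expected <jnlref>"]
    problems = []
    position = 0  # index of the furthest element reached so far
    seen = set()
    for child in ref:
        if child.tag not in EXPECTED_ORDER:
            problems.append(f"<{child.tag}> is not an allowed element")
            continue
        index = EXPECTED_ORDER.index(child.tag)
        if index < position:
            problems.append(f"<{child.tag}> appears out of order")
        elif child.tag in seen and child.tag not in REPEATABLE:
            problems.append(f"<{child.tag}> appears more than once")
        position = max(position, index)
        seen.add(child.tag)
    for tag in EXPECTED_ORDER:
        if tag not in seen:
            problems.append(f"<{tag}> is missing")
    return problems


good = ('<jnlref><au><snm>Swope</snm></au><au><snm>Krug</snm></au><tl>T</tl>'
        '<pubtl>J</pubtl><vid>150</vid><ppf>416</ppf><ppl>425</ppl>'
        '<cd year="1992">1992</cd></jnlref>')
# Same content, but the author follows the year of publication.
bad = ('<jnlref><tl>T</tl><pubtl>J</pubtl><vid>150</vid><ppf>416</ppf>'
       '<ppl>425</ppl><cd year="1992">1992</cd><au><snm>Swope</snm></au></jnlref>')

print(check_reference(good))
print(check_reference(bad))
```

In practice you would not write such checks by hand; an SGML or XML validating parser reads the DTD and applies its rules to every document.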

In the early days, the biggest obstacles to implementing an SGML solution were its complexity and the shortage of tools on the market to support it.

In the infancy of the Web, a universal DTD was developed for tagging documents designed to be viewed online. The markup language it defined came to be known as Hypertext Markup Language (HTML). Since HTML was focused on presentation and not on structure, the HTML tag set was very limited and was therefore much easier to implement.

But its simplicity was also its biggest drawback, since HTML's ability to support complex searching, linking and document maintenance was very limited.

The challenge was to find a way of marking up documents that was not as complex as SGML but was more powerful than HTML. The solution was XML. XML stands for eXtensible Markup Language and is a simplified derivative of SGML. Since its introduction, many corporations and organizations like IBM, Microsoft and General Electric have been converting their documentation to XML, and XML has become the de facto standard for data transfer.

Let's discuss how a combination of good data and technology can help you achieve your publishing goals. We will start at the beginning of the production process and work forward through the rest of it.

In many instances, collaborative authoring may be involved in your production process. There are Web-based systems that allow manuscripts to be authored in Microsoft Word. Edits can either be made directly in the document, or the main author can receive an external document containing the edits. If there is only one author involved, the manuscript is usually created in Word, but a manuscript full of equations might be created in TeX or LaTeX.

Author Templates

In either case, developing Author Templates is important, since templates produce consistently structured and well-styled manuscripts.

The main challenge is making sure the author actually uses the template. Depending on the situation, you will have to decide between a "carrot" and a "stick," or a combination of both. In any event, even if you are only mildly successful in getting author compliance, you will have achieved a greater degree of consistency in your author manuscripts. This will help you reduce the time and cost of converting your manuscripts into your target format.

Pre-Composition Conversion

Once you have the first cut of your "template driven" manuscript, your copyeditors can work with the author to get the manuscript into its final form in whatever format they are familiar with. Another option is to convert to SGML or XML right away and have your copyeditors work within an SGML/XML editing environment until the manuscript reaches its final form. There are several excellent commercially available SGML/XML editors. Converting the manuscripts prior to composition is known as "Up Front Conversion" or "Pre-Composition Conversion".

Finding the best point in the process to convert the manuscript to SGML/XML is tricky. It depends on how your copyeditors and authors collaborate, especially if the process involves multiple page proof passes. We typically discuss this issue in great detail before we recommend an approach.


If you are able to use an up-front, "pre-composition" SGML/XML process, you can realize many benefits:

  1. Reducing Costs - There are SGML/XML publishing systems that, once styling rules have been developed, will accept SGML and automatically render the data. This means that you can actually perform composition in-house, drastically reducing the cost of composition.

    This is especially true for journals, since they follow a similar structure from journal to journal and from issue to issue. Books are more challenging, since their sheer variety is anathema to structured markup.

    If the thought of in-house composition seems too bold for your organization, you can still save as much as 40% on your composition cost by providing your compositor with SGML/XML tagged files.
     
  2. Reducing Production Time - If you move to in-house composition, your composition can be immediate. Even if you have to make changes, the changes can take minutes or hours ... not days ... thus drastically reducing the time needed for composition.

    One of my clients does composition in-house and spends an average of 2 minutes per page priming the pages before composition.
     
  3. Earlier Manuscript Viewing - Once the files have been converted to SGML/XML, they can immediately be viewed and searched on the Web. Your authors will love you, since their material will reach the scientific community much more quickly. (Remember the carrot!)

Content Management System (CMS)

If you would rather wait to re-engineer your production process and prefer to convert to SGML/XML after composition, you will still realize many benefits, since your data will be in SGML/XML. But before we discuss these benefits, we need to mention another piece of technology that you can leverage to improve the process: a Content Management System (CMS). As its name suggests, the CMS is designed to manage the content contained within it. The basic features of an SGML/XML-based CMS are:

  1. It identifies the original author of the document and grants permission to selected individuals who may be required to edit the document.
  2. It tracks all the changes ever made to the document, identifying who made the changes and when the changes were made.
  3. The CMS doesn't store whole documents; it stores pieces or "chunks" of content. The CMS assembles these chunks into a single document when the entire document is required. The granularity of the chunks is determined by the level of tagging that was done to the data, and finer tagging directly increases the usability of the data. In the first citation example used earlier, the text is stored as one chunk, so functionality is limited. The user can link from the point in the article where the citation is referenced to the citation itself, and do some low-level searching on the citation text, but not much else. In the second example, the citation is broken down into its components, which greatly increases its functionality. For example, the user can link to the actual article noted in the citation, search for all articles by a certain author, or search for keywords within the title of the article.
  4. All similar chunks of data are represented only once even though they may appear in multiple documents.
  5. It stores all the information regarding who requested what data chunks and when they were sent.
  6. It allows new documents to be created from existing chunks stored within the CMS.
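The chunk-based storage described in this list can be sketched as a toy in-memory repository. This is an illustration only, not how any particular CMS is built: the class and method names are invented, and identifying a chunk by a hash of its text is simply one assumed way to make sure identical chunks are stored once.

```python
import hashlib


class ChunkStore:
    """Toy repository: each distinct chunk of content is stored only once."""

    def __init__(self):
        self.chunks = {}     # chunk id -> chunk text
        self.documents = {}  # document name -> ordered list of chunk ids

    def add_chunk(self, text):
        # Identical text always hashes to the same id, so it is never duplicated.
        chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.chunks.setdefault(chunk_id, text)
        return chunk_id

    def define_document(self, name, texts):
        # A document is just an ordered list of references to chunks.
        self.documents[name] = [self.add_chunk(text) for text in texts]

    def assemble(self, name):
        # Rebuild the whole document on request from its stored chunks.
        return "\n".join(self.chunks[cid] for cid in self.documents[name])


store = ChunkStore()
intro = "<sec><tl>Melanocyte culture</tl></sec>"
methods = "<sec><tl>Methods</tl></sec>"
review = "<sec><tl>Review of mitogens</tl></sec>"

store.define_document("journal-article", [intro, methods])
# A new product that reuses an existing chunk alongside a new one.
store.define_document("niche-collection", [intro, review])

print(len(store.chunks))
print(store.assemble("niche-collection"))
```

The two documents reference four chunks between them, but only three are stored, because the shared introduction exists once. A real CMS would add the permission, change-tracking and request-logging features listed above on top of this basic idea.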

Some of the benefits that you will receive from an SGML/XML based CMS are:

  1. Component Reuse/Quicker Product Development - You will be able to store each chunk of data only once, tagged consistently, eliminating redundancy and giving you a clean inventory of your content. As a result, modifying content and preparing new content by reusing portions of existing content will be much easier. For example, if you have material on a particular sub-topic residing in 9 different properties (books, journals, monographs, etc.), you can create a new product combining all this information almost instantaneously, since it all resides in the CMS. You can therefore create more targeted products without much additional effort.
  2. Print on Demand - Since an SGML/XML based CMS can have your properties stored on a very granular level, it is a perfect "front end" to any Print on Demand System, allowing you to print a combination of very finely selected pieces of information.
  3. Manage QA Function - Since the CMS can track which user is working on a specific document and tracks document flow, it can assist you in managing your QA function from the initial QA viewing until the final signoff of the document.

Other benefits of SGML/XML data are:

  1. Repurposing - Since tagged data is not concerned with how the structure is supposed to "look" but is only concerned with identifying the structure, it does not contain any information on presentation. Therefore, you can vary the presentation rules depending on the platform you will be using to present the data. For example, I can make a rule that when footnotes are presented on paper they will be 6 point; but, to enhance online readability, they will be 9 point when viewed on a computer screen. This means it will be much easier to produce the same information for paper and multiple electronic platforms like Web and handheld device viewing. Also, since the data is in SGML, aggregators can accept the data -- resulting in the ability to increase distribution channels.
     
  2. Searching - Since the tags identify the structure of each piece of data, you're able to search not only on text but also on specific structures contained in the text. For example, you can easily search all the references in the data, since all the references are tagged as references. In addition to enhancing the user's experience, this can assist in the QA process.
    (See http://www.dclab.com/xmlangel.asp)
     
  3. Interchange with Vendors - Most aggregators prefer SGML files, and most of the remainder will accept SGML. As mentioned earlier, compositors may reduce the price of composition by as much as 40% if you provide them with SGML.
  4. Enforce Data Standards - Since chunk positioning is one of the areas checked by the SGML and XML parsers, in most cases, only documents whose elements are in the correct order will be validated. The resulting output files will therefore be much more uniform.
  5. Data Security - In conjunction with a CMS, XML has already solved the issue of internal security. If the data is tagged appropriately, the CMS can control internal user access down to the character level. Publishers' concern over how to combat unauthorized dissemination and use of their materials is much more difficult to address, since that problem is external to their environment. Although the issue has not yet been resolved, I believe the silver bullet will include XML tagged data.
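The repurposing benefit in item 1 can be sketched with a few lines of Python: the same tagged content is rendered under different presentation rules for each platform. The footnote sizes (6 point on paper, 9 point on screen) come from the example above; the element names, the paragraph sizes and the render function are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Presentation rules kept separate from the content: point size per element.
# Footnote sizes follow the example in the text; paragraph sizes are assumed.
STYLE_RULES = {
    "print": {"para": 10, "footnote": 6},
    "screen": {"para": 12, "footnote": 9},
}

# The content carries structure only, with no presentation information.
CONTENT = (
    "<article>"
    "<para>Melanocytes respond to several mitogens.</para>"
    "<footnote>See the cited reference.</footnote>"
    "</article>"
)


def render(xml_text, platform):
    """Apply one platform's styling rules to the shared tagged content."""
    rules = STYLE_RULES[platform]
    return [f"[{rules[element.tag]}pt] {element.text}"
            for element in ET.fromstring(xml_text)]


for platform in ("print", "screen"):
    print(platform, render(CONTENT, platform))
```

Supporting a new platform, such as a handheld device, then means adding one more entry to the rules table rather than producing another copy of the content.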

In conclusion, although the demands on Publishers are increasing, technology is evolving ... and understanding exactly how all the pieces fit together will enable you to meet the current challenges.

DCLnews Editorial



Corporate office:
61-18 190th Street, 2nd Floor, Fresh Meadows, NY 11365
718-357-8700
Data Conversion Lab
Copyright © 1997-2010  Data Conversion Laboratory, Inc. All rights reserved.