Why Publishers Should Use XML
Every moment publishers put off embracing XML, they're missing a powerful
opportunity to reduce composition costs and increase revenues, says David
Skurnik, VP of Sales at Data Conversion Laboratory.
[Thanks to Lori Barber, Business Development Manager and Beth Friedman, Senior Project Manager for their input.]
It's 3pm. You've just had a meeting with your Executive
Editor. He
told you to cut the cost of production and reduce the time to market
of your books and journals.
You now go to another meeting. This time with the VP of Marketing. She
tells you that, in an effort to increase revenue from existing customers, they are planning to develop dozens of separately purchasable,
value-added features for the online products. What's more, to attract
new buyers, they are targeting smaller
markets with virtual niche journals built from existing journal content.
They are also building up extra distribution channels through content
aggregators, and abstracting and indexing services.
The upshot is you will
have to produce a lot more products ... and
they will have to be easily adaptable for use in multiple vendors'
systems.
You then make a beeline to your Executive
Editor's office
and tell him you can't decrease overall production costs because you
now have to support more platforms and produce more products. The Editor smiles sagely
at you and wishes you good luck in realizing organizational
goals.
Sound Familiar?
In today's high tech and ever demanding
publishing climate, many production managers yearn for the "good old days"
when all they had to produce was paper. But, as we quickly learned, those
days are only a pleasant memory. In order to keep the Executive Editor
and
VP of Marketing happy, we have to embrace technology and re-engineer our
production process to exploit new technology.
Before we discuss technology, let's
look at what the perfect production environment would be. From
my conversations with numerous publishing professionals I came up with the
following list:
- "I want to automate the production process
as much as possible so I can shorten the time to publication and reduce costs."
- "I want to be able to produce print and
multiple types of electronic products from one seamless production
process."
- "I want to make content available for
public consumption at a very early stage in the production process."
- "I want to retain a single instance of all
my content in a repository, so I can reduce the amount of time and
expense needed to maintain existing products and produce new
products."
- "I want the capability to enhance my
product by improving the end-user experience without painful new
development."
- "I want Print on Demand
capability."
- "I want to protect my content."
Are these goals attainable?
Yes. Technology has made
these goals fully attainable. But before the mansion can be built you have
to dig the foundation. In the case of Publishing, the foundation is
the data. That is why data needs to be in "smart formats" like SGML
and XML. With intelligent data, technology can be exploited to accomplish
your goals.
Let's examine the tasks you
will ask of your data:
- Repurposing - You will need to create new
versions of data suitable for derivative uses (e.g. the web, diagnostic
equipment, hand-held devices, voice devices).
- Searching - You will need the ability to
find information through text searches and through more advanced
searches that depend on context and "understanding."
- Component Reuse - You will need the
ability to reuse portions of data for different products and different
documentation sets.
- Interchange with Vendors - You will need
to provide your data in a form that your vendors and distributors can
easily use.
- Enforce Data Standards - The data will
have to be consistent to reflect your branding and be predictable in the
sense that people will know where to find the information they are
seeking. This will only be accomplished if the way the information is
presented is standard throughout your products.
- Security - Make it easier for security
solutions to protect the data.
Not surprisingly, the benefits of SGML and
XML are exactly those mentioned above.
What are SGML and XML?
Definition and History of SGML, HTML and XML
SGML (Standard Generalized Markup Language)
was developed in the 1980s as a non-proprietary, platform-independent
method of describing the structure of a document rather than its
appearance.
What does it mean to tag a document based on
structure?
In a bibliographic reference, for example, there is a title,
author(s) details, publisher,
year of publication, and perhaps page citations. A reader can
distinguish between the different
types of information from both experience and the appearance of the data (the title
may be in italics, for example). But a reference containing SGML or
XML mark-up will have tags representing the many structures
within the reference.
A reference could be tagged as simply as this:
<citation
citation_type="journal" id="A123">Abdel Malek, Z,
Swope, VB, Pallas, J, Krug, K &amp; Nordlund, JJ. <it>Mitogenic,
melanogenic and cAMP responses of cultured neonatal human melanocytes to
commonly used mitogens</it>. J Cell Physiol, 150, 416–425, 1992.</citation>
The more the data
is tagged, the more "intelligence" is added to the
document. This intelligence makes it easier to "slice and
dice" the data for creating new products and to transform
the data for other systems. The following example shows every
reference element tagged separately.
<jnlref><au><snm>Abdel Malek</snm>,
<fnms>Z</fnms></au>, <au><snm>Swope</snm>,
<fnms>VB</fnms></au>,
<au><snm>Pallas</snm>, <fnms>J</fnms></au>,
<au><snm>Krug</snm>, <fnms>K</fnms></au>
&amp; <au><snm>Nordlund</snm>,
<fnms>JJ.</fnms></au> <tl>Mitogenic, melanogenic and
cAMP responses of cultured neonatal human melanocytes to commonly used
mitogens</tl>. <pubtl>J Cell Physiol</pubtl>,
<vid>150</vid>, <ppf>416</ppf>–
<ppl>425</ppl>, <cd year="1992">1992</cd></jnlref>.
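To see what the extra tagging buys you, here is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree module, that pulls the author surnames, title, journal and year out of the fully tagged reference above. The ampersand is written as &amp; so that the fragment parses as XML. The function and variable names are illustrative assumptions, not part of any DTD or product.

```python
import xml.etree.ElementTree as ET

# The fully tagged reference from the example above. "&" is escaped as
# "&amp;" so the fragment is well-formed XML; \u2013 is the en dash.
JNLREF = (
    '<jnlref><au><snm>Abdel Malek</snm>, <fnms>Z</fnms></au>, '
    '<au><snm>Swope</snm>, <fnms>VB</fnms></au>, '
    '<au><snm>Pallas</snm>, <fnms>J</fnms></au>, '
    '<au><snm>Krug</snm>, <fnms>K</fnms></au> &amp; '
    '<au><snm>Nordlund</snm>, <fnms>JJ.</fnms></au> '
    '<tl>Mitogenic, melanogenic and cAMP responses of cultured neonatal '
    'human melanocytes to commonly used mitogens</tl>. '
    '<pubtl>J Cell Physiol</pubtl>, <vid>150</vid>, '
    '<ppf>416</ppf>\u2013<ppl>425</ppl>, <cd year="1992">1992</cd></jnlref>'
)


def summarize_reference(xml_text):
    """Return the separately tagged parts of one <jnlref> as a dict."""
    ref = ET.fromstring(xml_text)
    return {
        "surnames": [au.findtext("snm") for au in ref.findall("au")],
        "title": ref.findtext("tl"),
        "journal": ref.findtext("pubtl"),
        "year": ref.find("cd").get("year"),
    }


summary = summarize_reference(JNLREF)
print(summary["surnames"])
print(summary["journal"], summary["year"])
```

The same lookup is not possible against the simpler citation example, where the authors, title and year are all part of one undifferentiated run of text.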
Prior to applying tags to a document, you
have to define some basic rules that
determine:
- What structures within the document are to
be tagged. In a bibliographic reference, you would need to determine
whether you will be separately tagging every author's name or whether you will only use one tag for the
entire group of authors. The decision will be based on what type of
intelligence you will want to extract from the data.
- What the tags will be named. Using
our previous example, you may wish to use "authors" as the tag to
describe the authors' names, or the shorter "au".
- The order in which these structures may appear and where
they can be found within the document. In our reference example, you
would want to ensure that the authors' names always precede the
year of publication.
Document Type Definition
These rules make up a document called a
Document Type Definition (DTD). The DTD has to be developed before any
conversion begins, because it supplies the basic rules that the
conversion must follow.
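As a rough illustration of the kind of rule a DTD captures, the Python sketch below checks one ordering rule from the list above: that every author comes before the year of publication. It is a hand-written stand-in, not DTD validation; a validating parser would apply such rules directly from the DTD. The element names au and cd are borrowed from the earlier reference example, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def authors_precede_year(xml_text, author_tag="au", year_tag="cd"):
    """Return True if every author element comes before the year element.

    A stand-in for one content rule that a DTD would declare; the tag
    names are assumptions taken from the reference example.
    """
    ref = ET.fromstring(xml_text)
    tags = [child.tag for child in ref]  # direct children, in document order
    if author_tag not in tags or year_tag not in tags:
        return False  # this sketch treats both structures as required
    last_author = max(i for i, tag in enumerate(tags) if tag == author_tag)
    return last_author < tags.index(year_tag)


# Two invented test references: one in the agreed order, one not.
good = '<jnlref><au>Swope</au><au>Krug</au><tl>Title</tl><cd year="1992">1992</cd></jnlref>'
bad = '<jnlref><cd year="1992">1992</cd><au>Swope</au><tl>Title</tl></jnlref>'

print(authors_precede_year(good))  # True
print(authors_precede_year(bad))   # False
```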
In the early days, the biggest objections to
implementing an SGML solution were that it was complex and that there were
not many tools on the market to support it.
In the infancy of the Internet, a universal
DTD for tagging documents designed to be viewed on the Internet was
developed. This DTD came to be known as Hypertext Markup Language (HTML).
Since HTML was focused on presentation and not on structure, the HTML tag
set was very limited, and was therefore much easier to
implement.
But its simplicity was also its
biggest drawback, since HTML's ability to support complex searching,
linking and document maintenance was very limited.
The challenge was to find a way of marking up
documents that was not as complex as SGML but was more powerful than HTML.
The solution was XML. XML stands for eXtensible Markup Language and
is a derivative of SGML. Since its introduction,
many corporations and organizations like IBM, Microsoft and
General Electric have been converting their documentation to XML. And XML
has become the de facto standard for data transfer.
Let's discuss how a combination of good data
and technology can help you achieve your publishing goals. We will start at
the beginning of the production process and work forward.
In many instances, collaborative authoring
may be involved in your production process. There are Web-based systems
that allow manuscripts to be authored in Microsoft Word. Edits
can either be made directly in the document or the main author can
receive an external document containing the edits. If there is only one
author involved, the manuscript is usually created in Word; but a manuscript
full of equations might be created in TeX or LaTeX.
Author Templates
In either case, developing Author
Templates is important, since they produce a consistently structured and
well-styled manuscript.
The main challenge is how to make
sure the author actually uses the template. Depending on the situation, you will
have to decide between using a "carrot or a stick" or a combination of
both. In any event, even if you were only mildly successful in getting
author compliance, you would have achieved a greater degree of consistency
in your author manuscripts. This will help you reduce the time and cost of
converting your manuscripts into your target format.
Pre-Composition Conversion
Once you have the first cut of your "template
driven" manuscript, your copy editors can work in conjunction with the
author to get the manuscript into its final form within whatever format
they are familiar with. Another option would be to convert to SGML or XML
right away and your copyeditors would work within an SGML/XML editing
environment until the manuscript reaches its final form. There are several
excellent commercially available SGML/XML editors. Converting the
manuscripts prior to composition is known as "Up Front Conversion"
or "Pre-Composition Conversion".
Finding the best point in the process
to convert the manuscript to SGML/XML is tricky. It depends on your
copyeditor/author collaborating environment. This is especially true if it
involves multiple page proof passes. We typically discuss this issue
in great detail before we recommend an approach.
If you are able to use a
"pre-composition" SGML/XML up-front process, you can realize many
benefits:
- Reducing Costs - There are SGML/XML
publishing systems that, once styling rules have been developed, will
accept SGML and automatically render the data. This means that you can
actually perform composition in-house, drastically reducing the cost of
composition.
This is especially true for journals, since they
follow a similar structure from journal to journal and from issue to
issue. Books are more challenging, since their sheer variety is anathema to
structured markup.
If the thought of in-house composition seems
too bold for your organization, you can still save as much as 40% on
your composition cost by providing your compositor with SGML/XML tagged
files.
- Reducing Production Time - If you move to
in-house composition, your composition can be immediate. Even if you
have to make changes, the changes can take minutes or hours ... not
days ... thus drastically reducing the time needed for
composition.
One of my clients does composition in-house and spends
an average of 2 minutes per page priming the pages before
composition.
- Earlier Manuscript Viewing - Once the
files have been converted to SGML/XML, they can immediately be viewed and
searched on the Web. Your authors will love you, since they
will have their material out to the scientific community much sooner.
(Remember the carrot!)
Content Management System (CMS)
If you would rather wait to re-engineer your
production process and prefer to convert to SGML/XML post-Composition, you
will still realize many benefits since your data will be in SGML/XML. But
before we discuss these benefits we need to mention another piece of
technology that you can leverage to improve the process - a Content
Management System (CMS). As its name suggests, the CMS is designed to
manage the content contained within it. The basic features of an SGML and
XML based CMS are:
- It identifies the original author of the
document and grants permission to selected individuals who may be
required to edit the document.
- It tracks all the changes ever made to the
document, identifying who made the changes and when the changes were
made.
- The CMS doesn't store whole documents; it
stores pieces or "chunks" of content. These chunks are then assembled by
the CMS into a single document when the entire document is required. The
level of granularity of the chunks is determined by the level of tagging
that was done to the data. Increasing tagging granularity will directly increase the usability of the data.
In the case of the first citation example used earlier, because the text is stored as one chunk, there is limited functionality. The user can link from the text of the article where the citation is referenced to the actual citation, and do some low-level searching on the citation text, but not much else. In the second example, the citation is broken down into its components, thereby greatly increasing its functionality. For example, the user can link to the actual article noted in the citation, do searches for all articles by a certain author, or search for keywords within the title of the article.
- All similar chunks of data are represented
only once even though they may appear in multiple documents.
- It stores all the information regarding
who requested what data chunks and when they were sent.
- It allows new documents to be created from
existing chunks stored within the CMS.
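The chunk-based storage described in the list above can be pictured with a toy Python sketch. It is only an illustration of the idea (store each distinct chunk once, assemble documents on request, build new documents from existing chunks), not how any particular CMS is implemented. The class name, method names and sample chunk text are all invented for the example.

```python
import hashlib


class ChunkStore:
    """Toy content store: each distinct chunk is kept once, and a document
    is an ordered list of chunk ids. Purely illustrative, not a real CMS."""

    def __init__(self):
        self.chunks = {}     # chunk id -> chunk text
        self.documents = {}  # document name -> list of chunk ids

    def add_chunk(self, text):
        # Identical text always hashes to the same id, so it is stored once.
        chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.chunks.setdefault(chunk_id, text)
        return chunk_id

    def create_document(self, name, texts):
        self.documents[name] = [self.add_chunk(t) for t in texts]

    def assemble(self, name):
        # Rebuild the whole document from its chunks when it is requested.
        return "\n".join(self.chunks[cid] for cid in self.documents[name])


store = ChunkStore()
warning = "<note>Store samples at 4 degrees C.</note>"  # invented sample chunk
store.create_document("journal-article", ["<p>Methods overview.</p>", warning])
store.create_document("lab-handbook", [warning, "<p>Handbook appendix.</p>"])

print(len(store.chunks))  # 3 distinct chunks, although 4 were submitted
print(store.assemble("lab-handbook"))
```

Because the shared note exists only once in the store, correcting it there would correct it in every document that uses it.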
Some of the benefits that you will receive
from an SGML/XML based CMS are:
- Component Reuse/Quicker Product
Development - You will be able to store all chunks of data tagged the
same way, only one time, eliminating redundancy and giving you a clean
inventory of your content. As a result, modifying content and preparing
new content by reusing portions of existing content will be much easier.
For example, if you have material residing in 9 different properties
(books, journals, monographs, etc.) that deal with a particular
sub-topic, you can create a new product combining all this information
almost instantaneously, since it would all reside in the CMS.
Therefore you can create more targeted products without much
additional effort.
- Print on Demand - Since an SGML/XML based
CMS can have your properties stored on a very granular level, it is a
perfect "front end" to any Print on Demand System, allowing you to print
a combination of very finely selected pieces of information.
- Manage QA Function - Since the CMS can
track which user is working on a specific document and tracks document
flow, it can assist you in managing your QA function from the
initial QA viewing until the final signoff of the document.
Other benefits of SGML/XML
data are:
- Repurposing - Since tagged data is not
concerned with how the structure is supposed to "look" but only
with identifying the structure, it does not contain any information on
presentation. Therefore, you can vary the presentation rules
depending on the platform you will be using to present the data. For
example, I can make a rule that when footnotes are presented on paper they
will be 6 point but, to enhance online readability, they will be 9 point
when viewed on a computer screen. This means it will be much easier
to produce the same information for paper and for multiple electronic
platforms such as the Web and handheld devices. Also, since the
data is in SGML, aggregators can accept it, which lets you
increase your distribution channels.
- Searching - Since the tags identify
the structure of each piece of data, you're able to search not only on text but
also on specific structures contained in the text. For example, you
would have the ability to easily search all the references in the data
since all the references are tagged as references. In addition to
enhancing the user's experience, it can assist in the QA
process.
(See http://www.dclab.com/xmlangel.asp)
- Interchange with Vendors - Most
aggregators prefer SGML files and most of the remainder will accept
SGML. As mentioned earlier, compositors will reduce the price
of composition by as much as 40% if you provide them with SGML.
- Enforce Data Standards - Since the position
of each chunk is one of the things checked by SGML and XML parsers, in
most cases only documents whose elements are in the correct order will
validate. The resulting output files will therefore be much more
uniform.
- Data Security - In conjunction with a CMS,
XML has already solved the issue of internal security. If the data is tagged
appropriately, the CMS will control internal user access down to the
character level. Publishers' concern over how to combat
unauthorized dissemination and use of their materials is much more
difficult to address, since it is external to their environment.
Although the issue has not yet been resolved, I believe the silver
bullet will include XML tagged data.
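The footnote rule in the Repurposing item above can be sketched in a few lines of Python: the tagged content stays the same and only the presentation rules change with the output platform. This is a minimal illustration under stated assumptions, not a publishing system. The 6 point and 9 point footnote sizes come from the text; the body text sizes, the platform names, the element names and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Presentation rules per output platform. The footnote sizes come from the
# example in the text; the "p" sizes and platform names are assumptions.
STYLE_RULES = {
    "print": {"p": 10, "footnote": 6},
    "screen": {"p": 12, "footnote": 9},
}


def render(xml_text, platform):
    """Return (point size, text) pairs for each styled element."""
    rules = STYLE_RULES[platform]
    root = ET.fromstring(xml_text)
    return [(rules[el.tag], el.text) for el in root if el.tag in rules]


# One invented piece of tagged content, with no presentation information.
ARTICLE = "<article><p>Body text.</p><footnote>A note.</footnote></article>"

print(render(ARTICLE, "print"))   # [(10, 'Body text.'), (6, 'A note.')]
print(render(ARTICLE, "screen"))  # [(12, 'Body text.'), (9, 'A note.')]
```

Supporting another platform means adding one more entry to the rules table; the content itself is untouched.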
In conclusion, although the demands on
Publishers are increasing, technology is evolving ... and understanding
exactly how all the pieces fit together will enable you to meet the current
challenges.
DCLnews
Editorial