DocBook to DITA Conversion Automation - Improving the Yield?

By Mikhail Vaysbukh, Data Conversion Laboratory

With DITA implementations on the rise and an entrenched DocBook community already in place, demand for automated DocBook to DITA conversion is growing. I expect offerings of automated DocBook to DITA conversion scripts to emerge in the next 6-10 months. This article addresses the real questions, "What should I expect from automated tools?" and "Will they work for me?", from the viewpoint of hands-on experience with numerous DocBook to DITA conversions. The answers to these questions are not usually obvious.

A number of articles have recently been published about the differences between DocBook and DITA and their respective advantages and disadvantages (http://www.dclab.com/converting_to_dita, http://www.dclab.com/dita_legacy.asp, http://www.dclab.com/dita_docbook.asp), so I'll move right to the topic at hand: how to evaluate, and improve, the quality of what you get out of DocBook to DITA conversion tools and services. The following is a list of questions to ask yourself in the evaluation process:

DITA Checklist
  1. Is my DocBook data compatible with an out-of-the-box automated conversion script?

  2. How compatible is my DocBook document structure with the requirements of DITA and how "pure" should I require my converted DITA to be?

  3. How consistent is my source data?

  4. What are the topic types within my document set?

  5. What is my level of topic granularity?

  6. What are my data enrichment requirements?

1. Is my DocBook data compatible with an out-of-the-box automated conversion script?

The problem is that most software will assume a "pure" out-of-the-box DocBook implementation, and in my experience there are very few "pure" DocBook implementations out there. Most DocBook users have made some customized tweaks and additions to the base DTD. If you are one of them, an out-of-the-box conversion tool will likely not work without adjustment, and you should evaluate how much work it will take to get the tool working well enough that you don't end up with lots of cleanup work. Cleanup work is often more expensive than doing the upfront work to avoid it.
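One rough way to gauge how far a document set has drifted from the base DTD is to count the element names a stock script would not recognize. The following is a minimal sketch, not a definitive audit: it assumes DocBook 4.x style markup without namespaces, the STANDARD_ELEMENTS set is a small illustrative subset rather than the real DocBook element list, and the acme-warning element is a made-up customization.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative subset only (an assumption for this sketch). A real audit would
# load the full element list from the DocBook DTD version your documents use.
STANDARD_ELEMENTS = {
    "book", "chapter", "section", "title", "para", "emphasis",
    "itemizedlist", "orderedlist", "listitem", "procedure", "step",
}

def find_custom_elements(xml_text, standard=STANDARD_ELEMENTS):
    """Count elements whose names are not in the expected DocBook set."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag not in standard:
            counts[element.tag] += 1
    return counts

# Hypothetical sample with one locally added element, acme-warning.
sample = """<chapter>
  <title>Setup</title>
  <section>
    <title>Install</title>
    <para>Run the installer.</para>
    <acme-warning>Back up first.</acme-warning>
    <acme-warning>Close other programs.</acme-warning>
  </section>
</chapter>"""

custom = find_custom_elements(sample)
for name, count in custom.most_common():
    print(f"{name}: {count} occurrence(s) a stock script may not handle")
```

A long list of names here suggests the tool will need custom mapping rules before it is run, which is usually cheaper than cleaning up afterwards.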

2. How compatible is my DocBook document structure with the requirements of DITA and how "pure" should I require my converted DITA to be?

DITA differs from traditional document layouts in its emphasis on modules of reusable content (topics) that can be strung together in different ways; it's not necessarily a linear presentation in the way traditional books are. With "pure DITA", the more modularity and reuse the better. However, most DocBook documents were not authored as modular, stand-alone DITA topics. For example, several procedures are often grouped under one heading, or the same procedure appears in several places with minor variations. The purist approach would require re-authoring it all to be "true" DITA, but the cost of that is usually prohibitive. The conversion process may deal with this issue at a number of levels and may require workarounds to maintain the existing data layout and still produce valid DITA. See http://www.dclab.com/dita_topic.asp for more examples of such incompatibility.

3. How consistent is my source data?

DocBook allows a fair degree of flexibility in its structure, and since writing is creative in nature, many writers have used this flexibility to convey information to the user in the way they felt was best. While the intentions were noble, this approach often results in documents with inconsistent use of element structures, which presents difficulties for automated conversions. Understanding the degree of inconsistency in your document set will help you ensure that the conversion process or tool you choose addresses it appropriately.
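A quick way to get a feel for the degree of inconsistency is to tally how many different child-element patterns appear inside the same kind of element. The sketch below is only an illustration under a few assumptions: non-namespaced DocBook, recursive section elements, and a made-up sample in which one writer marked a procedure as a bold paragraph followed by a numbered list.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def structure_patterns(xml_text, element_name):
    """Tally the distinct child-element sequences used inside one element type."""
    root = ET.fromstring(xml_text)
    patterns = Counter()
    for element in root.iter(element_name):
        signature = tuple(child.tag for child in element)
        patterns[signature] += 1
    return patterns

# Hypothetical sample: two sections follow one pattern, the third improvises.
sample = """<chapter>
  <title>Maintenance</title>
  <section>
    <title>Replace the filter</title>
    <para>Do this monthly.</para>
    <procedure><step><para>Open the cover.</para></step></procedure>
  </section>
  <section>
    <title>Clean the tray</title>
    <para>Do this weekly.</para>
    <procedure><step><para>Remove the tray.</para></step></procedure>
  </section>
  <section>
    <para><emphasis role="bold">Reset the counter</emphasis></para>
    <orderedlist><listitem><para>Hold the button.</para></listitem></orderedlist>
  </section>
</chapter>"""

patterns = structure_patterns(sample, "section")
for signature, count in patterns.most_common():
    print(count, signature)
```

A handful of patterns covering most of the data is a good sign. A long tail of one-off patterns means the conversion will need either more rules or some pre-editing.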

4. What are the topic types within my document set?

Topics are groupings of information that form a reasonable module, one that could be reused in other places if necessary. The next factor to evaluate is the mix of topic types in your document set as they would apply to DITA: task, concept, and reference. Since DocBook does not distinguish its components based on these types, the content you want output as a task may have source tagging identical to the content you want output as a concept.

If the entire document set consists of a single topic type (such as concept), or 80%-90% of it is the same topic type, then an out-of-the-box conversion script may still be a good option. Otherwise, consider that you will need a way to supply the topic type information to the conversion in order to minimize after-the-fact rework.
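To estimate whether one topic type dominates, a first pass can guess a type for each top-level section and report the shares. The rules in this sketch are assumptions made for illustration (a procedure suggests a task, a table suggests a reference, everything else is treated as a concept); as noted above, real DocBook tagging often does not separate the types this cleanly, so treat the numbers as a rough indication only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def guess_topic_type(section):
    """Hypothetical heuristic: procedures suggest a task, tables a reference."""
    if section.find(".//procedure") is not None:
        return "task"
    if (section.find(".//table") is not None
            or section.find(".//informaltable") is not None):
        return "reference"
    return "concept"

def topic_type_shares(xml_text):
    """Return the fraction of top-level sections guessed to be each topic type."""
    root = ET.fromstring(xml_text)
    counts = Counter(guess_topic_type(s) for s in root.findall("section"))
    total = sum(counts.values())
    return {topic_type: n / total for topic_type, n in counts.items()}

# Made-up sample: four descriptive sections and one procedure.
sample = """<chapter>
  <section><title>Overview</title><para>What the product does.</para></section>
  <section><title>Architecture</title><para>How the parts fit.</para></section>
  <section><title>Licensing</title><para>Terms of use.</para></section>
  <section><title>Support</title><para>Who to contact.</para></section>
  <section><title>Installing</title>
    <procedure><step><para>Run the installer.</para></step></procedure>
  </section>
</chapter>"""

shares = topic_type_shares(sample)
dominant = max(shares, key=shares.get)
print(shares, "dominant type:", dominant)
```

If the dominant share lands in the 80%-90% range, converting everything to that type and fixing the exceptions by hand may be reasonable. Below that, the topic type decision needs to be made before or during conversion.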

5. What is my level of topic granularity?

Technical documentation often has multiple levels of heading hierarchy, and a given heading level might represent different levels of granularity in different parts of a document. Since most documents were authored without the DITA idea of modular, stand-alone topics, the points where new topics should begin will vary with the actual content. So "why don't you just…" rules, such as turning each <block> into a new DITA topic, will likely not work that well. More sophistication is needed to make this work well.

If, however, you find that granularity was maintained at a fairly consistent level, or if you choose to pre-edit your documents to define the same level of hierarchy for chunking into DITA topics, then an out-of-the-box conversion script may still be a good option. Otherwise, without some pre-planning, you might end up with significant after-the-fact cleanup.
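One way to check whether granularity is consistent is to record the nesting depth at which the lowest-level sections occur. This is a minimal sketch that assumes recursive, non-namespaced section elements (documents using sect1 through sect5 would need the tag names adjusted); the sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def leaf_section_depths(element, depth=0, tally=None):
    """Count sections that contain no nested sections, keyed by nesting depth."""
    if tally is None:
        tally = Counter()
    children = element.findall("section")
    if element.tag == "section" and not children:
        tally[depth] += 1
    for child in children:
        leaf_section_depths(child, depth + 1, tally)
    return tally

# Made-up sample: "Setup" is split into subsections, "Troubleshooting" is not.
sample = """<chapter>
  <title>Printer guide</title>
  <section>
    <title>Setup</title>
    <section><title>Unpack</title><para>Remove the tape.</para></section>
    <section><title>Connect</title><para>Plug in the cable.</para></section>
  </section>
  <section>
    <title>Troubleshooting</title>
    <para>Paper jams, error codes, and support contacts.</para>
  </section>
</chapter>"""

depths = leaf_section_depths(ET.fromstring(sample))
for depth, count in sorted(depths.items()):
    print(f"depth {depth}: {count} lowest-level section(s)")
```

If nearly all lowest-level sections sit at one depth, chunking at that depth is a reasonable starting rule. A spread across several depths is the situation where a single "split at level N" rule tends to produce topics that are too big in some places and too small in others.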

6. What are my data enrichment requirements?

The true benefits of DITA come from features and tagging that don't exist in your source documents. Adding those tags and features is what we mean here by data enrichment. It's best to truly understand what kinds of added (or enriched) tags you want from your conversion and to find a solution that will support those new requirements. DITA has a large number of specialized tags, such as the user interface control element <uicontrol>, that add levels of granularity. For example, in your existing source data a user interface control may be marked up simply as bolded text, making it difficult for software to distinguish when <emphasis role="bold"> should be converted to a <b> tag and when it should be converted to <uicontrol>.
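To show why the bold versus <uicontrol> decision is hard to automate, here is a minimal sketch of one possible heuristic: treat bold text as a user interface control only when the word immediately before it is an action verb. The verb list and the rule itself are assumptions made for illustration, not the way any particular tool works, and the rule is easy to defeat (for example, "Click the Save button" would be missed).

```python
import xml.etree.ElementTree as ET

# Hypothetical list of verbs that tend to precede a user interface control.
UI_VERBS = ("click", "select", "press", "choose")

def classify_bold(para):
    """Suggest a DITA element for each bold emphasis in a DocBook para."""
    results = []
    preceding = para.text or ""
    for child in para:
        if child.tag == "emphasis" and child.get("role") == "bold":
            words = preceding.lower().split()
            last_word = words[-1] if words else ""
            target = "uicontrol" if last_word in UI_VERBS else "b"
            results.append((child.text, target))
        # Text after this child is what precedes the next child.
        preceding = child.tail or ""
    return results

sample = (
    '<para>Click <emphasis role="bold">Save</emphasis> to continue. '
    '<emphasis role="bold">Do not</emphasis> close the window.</para>'
)

suggestions = classify_bold(ET.fromstring(sample))
for text, target in suggestions:
    print(f"{text!r} -> <{target}>")
```

In practice a rule like this is better used to flag candidates for review than to make the final decision, since a wrong <uicontrol> is harder to spot later than a plain <b>.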

Another example is the use of conrefs to facilitate content reuse. If the variables that you'd like to reuse via conrefs are entered as regular text, special conversion routines may need to be developed to add information that was not previously available. (NOTE: In most cases it's not a simple search/replace, as phrases often appear as part of a bigger context that can be affected by the conversion. For example, you want to replace "DCL" with a variable that will be defined as "Data Conversion Laboratory Inc.", but if your text also contains this term in another context, like Part No. 235-DCL-0001, then a simple search/replace would introduce an error.)
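The part number problem can be shown in a few lines. This sketch works on a plain text string rather than on parsed XML, and the conref target (variables.dita, with the ids variables and company) is a hypothetical file made up for the example. The pattern only accepts "DCL" when it is not attached to a letter, digit, or hyphen on either side, which covers this one case and nothing more.

```python
import re

# Hypothetical conref target; the file name and ids are assumptions.
COMPANY_REF = '<ph conref="variables.dita#variables/company"/>'

# Match "DCL" only when no word character or hyphen touches it on either side.
STANDALONE_DCL = re.compile(r"(?<![\w-])DCL(?![\w-])")

text = "Contact DCL for Part No. 235-DCL-0001."

naive = text.replace("DCL", COMPANY_REF)          # also rewrites the part number
careful = STANDALONE_DCL.sub(COMPANY_REF, text)   # leaves the part number intact

print(naive)
print(careful)
```

Even the careful version is only as good as the contexts you thought to exclude, which is why a sample review of the replacements is worth building into the conversion plan.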

Understanding your data enrichment requirements prior to the conversion is an important checklist item to ensure that you've selected an effective conversion solution. Sometimes an automated script will not be able to meet all these requirements.

So if you are ready to consider converting your DocBook docs to DITA, make sure you thoroughly analyze your document set first, and get answers to some of these questions by asking yourself and asking your vendor. Doing this will take you a long way down the path of knowing if and how you can move to DITA and enjoy all the benefits it offers.

About the Author

Mikhail Vaysbukh is a Senior Project Manager for DCL, a PMI-Certified Project Management Professional (PMP) and an industry recognized expert on DITA, XML and data conversion. Instrumental in developing DCL's DITA conversion software suite, Mikhail has been with DCL for over 10 years and has served in a number of positions in both production management and project management. Mikhail is an active member in technical industry groups and a co-author of the Open eBook Publication structure specification. He holds a BS in Business Management and Data Communications from Touro College.

DCLnews Editorial
June 2008

 