Data Conversion Laboratory, Revolutionizing Publishing for the Digital Age 
Weeding Out Wasted Words Saves Money and Lives

Identifying duplicate paragraphs and phrases in document collections brings savings and removes potentially damaging errors - even for organizations not yet ready for a content management system. DCLnews reports.

Companies and organizations that produce a lot of content might be surprised at how much material is duplicated, or nearly duplicated. Document sets often carry far more "weight" than they need to. This leads to unnecessary maintenance expense and wasted hours searching for the right document or text fragment. It also increases exposure to risk from errors creeping in - some of which may have been in a document set for years.

Removing duplicate data and harmonizing near-duplicates has many advantages. Not only is it a valuable advance step in the run-up to implementing a content management system (CMS), but it also brings measurable cost savings and makes data sets far more accurate and precise.

Did you know?

Long before anyone thought of content management systems or content reuse, document management systems were originally developed to help law offices maintain better control over the many documents that legal professionals create. The basic mechanisms of the first document management systems added information about a document to the file that contained the document. They also organized information supplied by users in a database and included information about the relationships between different documents.

Document management systems essentially created libraries of documents in a computer system or a network. These libraries contained a "card catalog" where information supplied by the user was stored. Information could then be retrieved in more sensible and intuitive ways than scanning through directories and folders, in the hope that a file's name might reveal what the file contained. Eventually, document management systems added version tracking, document sharing, electronic review, publishing management and workflow integration.

Many consider the key achievement of early document management systems to have been the creation of "a file system within the file system."

"Identifying duplicate or near-duplicate paragraphs and phrases in documentation sets is a way of cleaning house and getting your data in order," says Mark Gross, president of Data Conversion Laboratory (DCL). "It also allows you to hone in on a document standard by weeding out the variations in content chunks until you're left with the normalized sections that can be earmarked for content reuse."

Removing wasted words and harmonizing near-duplicates in this way not only reduces costs and cuts down on maintenance, but also makes document sets ready to load into a content management system (CMS).

"That you can prepare your data in parallel to system development, and that no after-the-fact cleaning up is necessary is a big bonus when you consider it can sometimes take as long as two years to implement a content management system," says Gross.

Specialist tools

Gross believes there are very definite gains to be had from evaluating the potential for content reuse and removing variations in data sets before installing a content management system. To this end, DCL has set up specialist tools that can analyze full document sets, even 100,000 pages or more, and provide detailed reports of duplicate and near-duplicate data in document collections.

"We first run a document set through our conversion engine to standardize data, then our content reuse application looks at each paragraph to see whether there is repetition with slight variations anywhere else in the collection," Gross explains.

The variations might prove to be one word or even a comma. Over a large document set this can amount to a lot of unnecessary repetition. What's more, the very act of searching for repetition can reveal typos and even potentially damaging errors.
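The kind of paragraph-by-paragraph comparison Gross describes can be illustrated with a small sketch. This is not DCL's tool: it is a minimal example that assumes the documents have already been split into plain-text paragraphs, and it uses Python's standard `difflib` module as a stand-in similarity measure. The names `normalize` and `find_near_duplicates`, the 0.85 threshold and the sample sentences are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
import re


def normalize(paragraph):
    """Lowercase and collapse whitespace so layout differences do not count as variation."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def find_near_duplicates(paragraphs, threshold=0.9):
    """Return (i, j, ratio) for each pair of paragraphs whose similarity is at or above threshold."""
    normalized = [normalize(p) for p in paragraphs]
    matches = []
    for i, j in combinations(range(len(normalized)), 2):
        ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if ratio >= threshold:
            matches.append((i, j, ratio))
    return matches


# Hypothetical sample: a stray comma in one copy, a dropped "Do not" in another.
paragraphs = [
    "Do not remove the cover while the engine is running.",
    "Do not remove the cover, while the engine is running.",
    "Remove the cover while the engine is running.",
    "Check the oil level every 50 hours of operation.",
]

for i, j, ratio in find_near_duplicates(paragraphs, threshold=0.85):
    print(f"paragraphs {i} and {j} are near-duplicates (similarity {ratio:.2f})")
```

Comparing every pair of paragraphs like this only suits small collections. A set of 100,000 pages would need some form of indexing or hashing to narrow down candidate pairs first, which this sketch does not attempt.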

"People often find errors in their documents that have been there for many years," says Gross. "It might be a misplaced decimal point in a specification or the omission of one word like 'don't', an error which could prove disastrous in a technical manual. It wouldn't be an exaggeration to say, in some instances, removing such errors could save lives."

Huge volumes

Without a content reuse evaluation, the only way to find such errors is to read through the whole document set by hand. This is not an option for aircraft or vehicle repair manuals, for example, which nowadays are huge electronic volumes.

The same is true for Help files. Often segments of documentation are simply variations on others because over the years technical writers have added material to the "melting pot". The downside is every time changes need to be made to a specific Help subject, all the repetitions need to be found by hand so they can be changed too - a big job.

"In cases like that, our tools could first be used to highlight all the repetitions so changes can be made faster," says Gross. "Going on from there, the tools could be used to standardize the material, so it is the same throughout the set of files."
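As a rough illustration of highlighting repetitions of a Help passage, the sketch below looks for every paragraph that closely resembles a given passage and lists the words that differ. It is only a sketch under stated assumptions, not a description of DCL's tools: the function name `locate_repetitions`, the file names, the sample text and the 0.8 cutoff are hypothetical, and the word-level comparison again relies on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def locate_repetitions(target, files, cutoff=0.8):
    """Return (file_name, paragraph_index, changes) for each paragraph similar to target.

    changes is a list of (target_words, paragraph_words) pairs showing where the
    paragraph departs from the target; it is empty for an exact repetition.
    """
    target_words = target.split()
    hits = []
    for name, paragraphs in files.items():
        for index, paragraph in enumerate(paragraphs):
            words = paragraph.split()
            matcher = SequenceMatcher(None, target_words, words)
            if matcher.ratio() >= cutoff:
                changes = [
                    (" ".join(target_words[a1:a2]), " ".join(words[b1:b2]))
                    for tag, a1, a2, b1, b2 in matcher.get_opcodes()
                    if tag != "equal"
                ]
                hits.append((name, index, changes))
    return hits


# Hypothetical Help files, each already split into paragraphs.
help_files = {
    "printing.txt": [
        "Click Save to store your changes before closing the window.",
        "Select a printer from the list.",
    ],
    "export.txt": [
        "Click Save to store your changes before you close the window.",
    ],
}

target = "Click Save to store your changes before closing the window."
for name, index, changes in locate_repetitions(target, help_files):
    print(name, index, changes)
```

Once every repetition and its differences are listed, a writer can decide which wording becomes the standard and apply it to each location.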

Even if you’re not yet ready for a CMS

He stresses that an organization can benefit from a content reuse evaluation even before it is ready for a CMS.

"When it comes to documentation, cleaning up house has a measurable impact in terms of time saved and on the costs of maintaining large document sets. That holds true even if you’re not yet ready to implement a content management system right away," says Gross.

Many industries would gain from standardizing their documentation, argues Gross. Auto manufacturers, for example, offer a number of models of the same car, which means the maintenance manuals have slight variations. It's the same with aircraft or boat engine makers and with the pharmaceutical industry - even the legal profession.

March 22nd, 2005
DCLnews Editorial


Data Conversion Laboratory, Inc.   61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365   718-357-8700   convert@dclab.com

Copyright © 1997-2008  Data Conversion Laboratory, Inc. All rights reserved.