DCL  
Weeding Out Wasted Words Saves Money and Lives

Identifying duplicate paragraphs and phrases in document collections brings savings and removes potentially damaging errors - even for organizations not yet ready for a content management system. DCLnews reports.

Companies and organizations that produce a lot of content might be surprised at how much of their material is duplicated, or nearly duplicated. Document sets often carry far more "weight" than they need to. This leads to unnecessary maintenance expense and wasted hours searching for the right document or text fragment. It also increases exposure to risk from errors creeping in - some of which may have been in a document set for years.

Removing duplicate data and harmonizing near-duplicates has many advantages. Not only is it a valuable preparatory step in the run-up to implementing a content management system (CMS), but it also brings measurable cost savings and makes data sets far more accurate and consistent.

Did you know?

Long before anyone thought of content management systems or content reuse, document management systems were originally developed to help law offices maintain better control over the many documents that legal professionals create. The basic mechanisms of the first document management systems added information about a document to the file that contained the document. They also organized information supplied by users in a database and included information about the relationships between different documents.

Document management systems essentially created libraries of documents in a computer system or a network. These libraries contained a "card catalog" where information supplied by the user was stored. Information could then be retrieved in more sensible and intuitive ways than scanning through directories and folders, in the hope that a file's name might reveal what the file contained. Eventually, document management systems added version tracking, document sharing, electronic review, publishing management and workflow integration.

Many consider the key achievement of early document management systems to have been the creation of "a file system within the file system."

"Identifying duplicate or near-duplicate paragraphs and phrases in documentation sets is a way of cleaning house and getting your data in order," says Mark Gross, president of Data Conversion Laboratory (DCL). "It also allows you to hone in on a document standard by weeding out the variations in content chunks until you're left with the normalized sections that can be earmarked for content reuse."

Removing wasted words and harmonizing near-duplicates in this way not only reduces costs and cuts down on maintenance, but also makes document sets ready to load into a content management system (CMS).

"That you can prepare your data in parallel to system development, and that no after-the-fact cleaning up is necessary is a big bonus when you consider it can sometimes take as long as two years to implement a content management system," says Gross.

Specialist tools

Gross believes there are definite gains to be had from evaluating the potential for content reuse and removing variations in data sets before installing a content management system. To this end, DCL has set up specialist tools that can analyze full document sets, even those of 100,000 pages or more, and provide detailed reports of duplicate and near-duplicate data in document collections.

"We first run a document set through our conversion engine to standardize data, then our content reuse application looks at each paragraph to see whether there is repetition with slight variations anywhere else in the collection," Gross explains.
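The two-step process Gross describes, standardizing the text and then comparing each paragraph against the rest of the collection, can be illustrated with a minimal Python sketch. This is not DCL's tool: the function names, the sample paragraphs and the 0.9 similarity threshold are assumptions made for illustration, and the comparison relies on the standard library's `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as a difference."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def find_near_duplicates(paragraphs, threshold=0.9):
    """Return (i, j, score) for every pair of paragraphs at or above the threshold."""
    normalized = [normalize(p) for p in paragraphs]
    matches = []
    for (i, a), (j, b) in combinations(enumerate(normalized), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Hypothetical sample collection: three variants of one instruction plus an unrelated one.
paragraphs = [
    "Check the oil level before starting the engine.",
    "Check the oil level, before starting the engine.",
    "Replace the air filter every 12 months.",
    "check the oil level   before starting the engine.",
]

for i, j, score in find_near_duplicates(paragraphs):
    print(f"paragraphs {i} and {j} are {score:.1%} similar")
```

Comparing every pair is fine for a small sample, but the number of comparisons grows with the square of the collection size, so a set of 100,000 pages would presumably need some form of indexing or hashing before pairwise scoring.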

The variations might prove to be one word or even a comma. Over a large document set this can amount to a lot of unnecessary repetition. What's more, the very act of searching for repetition can reveal typos and even potentially damaging errors.

"People often find errors in their documents that have been there for many years," says Gross. "It might be a misplaced decimal point in a specification or the omission of one word like 'don't', an error which could prove disastrous in a technical manual. It wouldn't be an exaggeration to say, in some instances, removing such errors could save lives."
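To see how a comparison of near-duplicates can surface the kind of error Gross mentions, consider a word-level diff of two otherwise identical paragraphs. The sketch below is an illustration only, assuming Python's standard `difflib` module; the function name and the sample sentences are invented for the example.

```python
from difflib import SequenceMatcher


def word_differences(a: str, b: str):
    """List the word-level edits that turn paragraph a into paragraph b."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return changes


# Hypothetical variants of a safety instruction and of a specification.
print(word_differences(
    "Do not open the valve while the system is pressurized.",
    "Do open the valve while the system is pressurized.",
))
print(word_differences("Torque to 12.5 Nm.", "Torque to 125 Nm."))
```

The first comparison reports a single deleted word, "not", and the second reports "12.5" replaced by "125". A diff cannot say which variant is correct, only that the two disagree, so each difference still needs review by someone who knows the subject.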

Huge volumes

Without a content reuse evaluation, the only way to find such errors is to look through the whole document manually. This is not an option for aircraft or vehicle repair manuals, for example, which nowadays are huge electronic volumes.

The same is true for Help files. Often segments of documentation are simply variations on others because over the years technical writers have added material to the "melting pot". The downside is that every time changes need to be made to a specific Help subject, all the repetitions have to be found by hand so they can be changed too - a big job.

"In cases like that, our tools could first be used to highlight all the repetitions so changes can be made faster," says Gross. "Going on from there, the tools could be used to standardize the material, so it is the same throughout the set of files."
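The two uses Gross outlines, highlighting repetitions and then standardizing them, might look something like the following Python sketch. It is a simplified illustration rather than a description of DCL's software: the greedy grouping, the 0.9 threshold, the "most frequent wording wins" rule and the sample Help topics are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def group_variants(paragraphs, threshold=0.9):
    """Greedily place each paragraph in the first group whose first member it resembles."""
    groups = []
    for index, text in enumerate(paragraphs):
        for group in groups:
            representative = paragraphs[group[0]]
            if SequenceMatcher(None, representative, text).ratio() >= threshold:
                group.append(index)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([index])
    return groups


def propose_standard(paragraphs, group):
    """Suggest the most frequent wording in a group; an editor should confirm the choice."""
    counts = Counter(paragraphs[i] for i in group)
    return counts.most_common(1)[0][0]


# Hypothetical Help topics: three variants of one instruction and one unrelated topic.
help_topics = [
    "Click Save to store your changes in the current project.",
    "Click Save to store your changes in the current project.",
    "Click Save to store the changes in the current project.",
    "Select Print to open the print dialog.",
]

groups = group_variants(help_topics)
for group in groups:
    if len(group) > 1:
        print("repeated in topics", group, "->", propose_standard(help_topics, group))
```

Picking the most common wording is only a starting suggestion. As the article notes, a long-standing variant may itself contain the error, so the proposed standard should be checked before it replaces the others.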

Even if you’re not yet ready for a CMS

Gross stresses that an organization can benefit from a content reuse evaluation even before it is ready for a CMS.

"When it comes to documentation, cleaning up house has a measurable impact in terms of time saved and on the costs of maintaining large document sets. That holds true even if you’re not yet ready to implement a content management system right away," says Gross.

Many industries would gain from standardizing their documentation, argues Gross. Auto manufacturers, for example, offer a number of models of the same car, which means the maintenance manuals have slight variations. It's the same with aircraft or boat engine makers and with the pharmaceutical industry - even the legal profession.

DCLnews Editorial

Corporate office:
61-18 190th Street, 2nd Floor, Fresh Meadows, NY 11365
718-357-8700
Data Conversion Lab
Copyright © 1997-2010  Data Conversion Laboratory, Inc. All rights reserved.