Identifying duplicate paragraphs and phrases in document collections brings savings and removes potentially damaging errors, even for organizations not yet ready for a content management system. DCLnews reports.

Companies and organizations that produce a lot of content might be surprised at how much of their material is duplicated or nearly duplicated. Document sets often carry far more "weight" than they need to. This leads to unnecessary maintenance expense and wasted hours searching for the right document or text fragment. It also increases the risk of errors creeping in, some of which may have sat in a document set for years.

Removing duplicate data and harmonizing near-duplicates has many advantages. Not only is it a valuable preparatory step in the run-up to implementing a content management system (CMS), but it also brings measurable cost savings and makes data sets more accurate and consistent.
"Identifying duplicate or near-duplicate paragraphs and phrases in documentation sets is a way of cleaning house and getting your data in order," says Mark Gross, president of Data Conversation Laboratory (DCL). "It also allows you to hone in on a document standard by weeding out the variations in content chunks until you're left with the normalized sections that can be earmarked for content reuse." Removing wasted words and harmonizing near-duplicates in this way not only reduces costs and cuts down on maintenance, but also makes document sets ready to load into a content management system (CMS). "That you can prepare your data in parallel to system development, and that no after-the-fact cleaning up is necessary is a big bonus when you consider it can sometimes take as long as two years to implement a content management system," says Gross. Specialist tools Gross believes there are very definite gains to be had from evaluating the potential for content reuse and removing variations in data sets before installing a content management system. To this end, DCL has set up specialist tools that can analyze full document sets, even 100,000 pages or more, and provide detailed reports of duplicate and near-duplicate data in document collections. "We first run a document set through our conversion engine to standardize data, then our content reuse application looks at each paragraph to see whether there is repetition with slight variations anywhere else in the collection," Gross explains. The variations might prove to be one word or even a comma. Over a large document set this can amount to a lot of unnecessary repetition. What's more, the very act of searching for repetition can reveal typos and even potentially damaging errors. "People often find errors in their documents that have been there for many years," says Gross. 
"It might be a misplaced decimal point in a specification or the omission of one word like ‘don't', an error which could prove disastrous in a technical manual. It wouldn't be an exaggeration to say, in some instances, removing such errors could save lives." Huge volumes Without performing a content reuse evaluation, the only way errors would be found is by manually looking through the whole document. This is not an option for an aircraft or vehicle repair manual, for example, which nowadays are huge electronic volumes. The same is true for Help files. Often segments of documentation are simply variations on others because over the years technical writers have added material to the "melting pot". The downside is every time changes need to be made to a specific Help subject, all the repetitions need to be found by hand so they can be changed too - a big job. "In cases like that, our tools could first be used to highlight all the repetitions so changes can be made faster," says Gross. "Going on from there, the tools could be used to standardize the material, so it is the same throughout the set of files." Even if you’re not yet ready for a CMS He stresses that you can benefit from a content reuse evaluation even before an organization is ready for a CMS. "When it comes to documentation, cleaning up house has a measurable impact in terms of time saved and on the costs of maintaining large document sets. That holds true even if you’re not yet ready to implement a content management system right away," says Gross. Many industries would gain from standardizing their documentation, argues Gross. Auto manufacturers, for example, offer a number of models of the same car, which means the maintenance manuals have slight variations. It's the same with aircraft or boat engine makers and with the pharmaceutical industry - even the legal profession.
March 22nd, 2005
Data Conversion Laboratory, Inc. 61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365 718-357-8700 convert@dclab.com Copyright © 1997-2008 Data Conversion Laboratory, Inc. All rights reserved.