In the last issue, I laid out the basics of XML, and why to use it. While I'm a big fan of XML for many purposes, it's a misconception that it's the single best solution in every scenario, and it's worthwhile to consider the alternatives in situations where the benefits of XML are not necessary. In this article, I discuss alternatives to XML, SGML, and HTML that might be suitable when budgets are more limited.
While XML is perfect for highly coded information, other options can work well for many kinds of information. Markup languages are at the high end of the cost spectrum, so if you don't need the benefits they provide, you certainly should consider the alternatives discussed below.
Scanning Your Collection
The cheapest (and simplest) option is scanning, that is, taking an electronic image of a paper document. If it's a small project, you may even want to scan it yourself, either in your office or at home. However, a large-scale scanning project would require more than your home scanner, and more time than you might wish to devote.
Costs for simple scanning of machine-feedable sheets can be as low as 10 to 20 cents per page. If all you want is for people to be able to see a book or article on their screen, then simple scanning may be a sensible option. For example, a library with a collection of rare books might opt for scanning (particularly if the collection is not in English), since the objective would be simply to get the books scanned as an initial preservation method.
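To put the per-page figure in perspective, here is a minimal sketch that turns the 10-to-20-cent range into a project budget. The function name and the 50,000-page collection are illustrative assumptions, not figures from an actual project.

```python
def scanning_cost_range(pages, low_cents=10, high_cents=20):
    """Return (low, high) total cost in dollars for simple scanning.

    The default rates are the 10 to 20 cents per page quoted above for
    machine-feedable sheets; fragile or rare material will cost more.
    """
    low = pages * low_cents / 100
    high = pages * high_cents / 100
    return low, high


# Hypothetical collection size, chosen only for illustration.
low, high = scanning_cost_range(50_000)
print(f"50,000 pages: ${low:,.0f} to ${high:,.0f}")
```

Plugging in your own page count gives a quick floor for the budget before any special handling, metadata, or quality checking is added.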
However, you need to consider the type of material you're scanning. If it's rare, old, and/or fragile, it will require special handling, such as gloves for the workers handling the materials and high-end scanners that recognize differences in paper size, color saturation, and other document irregularities. The quality and type of paper with which you're dealing will also play a part in determining special handling requirements. For this kind of work, you're probably best advised to seek out the expertise of an experienced specialist.
Beyond simple scanning, you may want to add some coding. You want to keep searchability in mind when you're contemplating a conversion project. Adding metadata will make it easier for people to find the documents you've transferred to electronic form.
Dirty OCR
What about "dirty OCR" as a data conversion possibility? This is the step beyond scanning. Dirty OCR costs only slightly more than scanning, and you can even find freeware and royalty-free OCR programs that might suit your needs. If you need better accuracy and more sophisticated features, you should expect to pay between $200 and $700 for an OCR program. But beware: quality can vary greatly. If your original materials are pristine, you can anticipate 99% accuracy. However, if they are not clean, with dirt or handwriting on them, accuracy can drop dramatically, to less than 80%.
Dirty OCR is relatively inexpensive, costing only a few cents more per page than simple scanning. Plus, it produces a searchable document, although a search will miss any word that the OCR got wrong. Despite dirty OCR's shortcomings, it might be the right answer in some situations. Take, for example, a litigation support project. When you need to go through a million pages obtained through a discovery request, the cost of converting them perfectly may be prohibitive, and dirty OCR could be good enough to get the job done. The same applies to large library collections where the cost of clean OCR is prohibitive and imperfect searchability is adequate.
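To get a feel for what an accuracy figure means for searching, here is a rough sketch. It assumes that the accuracy is measured per character and that errors are independent of one another. Both are simplifying assumptions of mine, since OCR vendors measure accuracy in different ways. Under those assumptions, a word can be found by an exact search only if every one of its characters was recognized correctly.

```python
def word_survival_rate(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognized correctly.

    Assumes independent per-character errors, which is a simplification:
    real OCR errors tend to cluster on damaged or marked-up areas.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 0:
        raise ValueError("word_length must not be negative")
    return char_accuracy ** word_length


# Compare clean and dirty originals for an eight-letter search term.
for accuracy in (0.99, 0.80):
    rate = word_survival_rate(accuracy, 8)
    print(f"{accuracy:.0%} character accuracy: "
          f"{rate:.0%} of 8-letter words come through intact")
```

Under these assumptions, about 92% of eight-letter words survive at 99% accuracy, but only about 17% survive at 80%. That is why "good enough" depends so heavily on the condition of the originals.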
If, after you've opted for dirty OCR, you find that higher quality is needed, things could get complicated. Going back to the dirty OCR and cleaning it up to achieve higher accuracy is no longer an automated process, and it adds significant cost. At that point, the additional step from cleaned-up OCR to XML is often not that big.
If you're stuck with OCR that is too dirty for your purposes, rather than converting to full XML, you might benefit from an in-between step that involves tagging only the most important elements. Simple tags that identify whether a particular item is a person's name, an author, a product, or a date can be extremely helpful in improving searchability. For example, if you search for the year 1978, you can be sure the search retrieves a date, rather than a page number or a street address.
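The 1978 example can be sketched in a few lines. The sample text and the element names (doc, p, author, date) are hypothetical, made up for this illustration rather than taken from any particular tag set. The first function searches the raw text, and the second looks only inside the tagged element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lightly tagged sample: only the author and date are marked up.
SAMPLE = (
    "<doc>"
    "<p>The report by <author>J. Smith</author> appeared in "
    "<date>1978</date>.</p>"
    "<p>Send orders to 1978 Main Street. See page 1978 of the catalog.</p>"
    "</doc>"
)


def plain_text_hits(xml_text, term):
    """Count whole-word matches anywhere in the text, ignoring the tags."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())
    return len(re.findall(rf"\b{re.escape(term)}\b", text))


def tagged_hits(xml_text, tag, term):
    """Return the contents of every <tag> element that contains the term."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag) if el.text and term in el.text]


print(plain_text_hits(SAMPLE, "1978"))      # every occurrence, dates or not
print(tagged_hits(SAMPLE, "date", "1978"))  # only occurrences tagged as dates
```

The plain search cannot tell the date from the street address or the page number, while the tagged search returns only the date. A handful of tags buys that precision without the cost of a full XML conversion.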
PDF
PDF, officially known as Portable Document Format, has become fairly ubiquitous (see our PDF Resources page in the Resource Center, http://www.dclab.com/pdf.asp). From a cost perspective, it is important to realize that there are different forms of PDF. It's relatively easy to take a page image and turn it into an image PDF file. This is an automated process, and the drawback is the same as what you encounter with simple images: you cannot search the file.
The next level up is PDF with text behind it, which is usually the same as having dirty OCR associated with your page images, with all the cost/quality tradeoffs I've discussed. The most useful form of PDF is PDF Normal, which is normally easy to produce from a word processor or publishing system. However, PDF Normal is very difficult to produce properly when your source data is coming from paper or from legacy documents, and therefore it is often not the best option given these circumstances.
The general downside of PDF is that it won't reformat to fit different media, so if you've composed a document for letter-size paper you'll need to scroll around a lot when you display it on a computer screen. For this reason, if you need something that's reusable and reformattable, XML is the better choice. You can show it on your computer screen and it will look one way, but if you format it for printing, it will appear appropriately sized for that format too. The information remains the same while the document adapts to whatever the viewing medium is.
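The idea of keeping the information fixed while the presentation changes can be illustrated with a small analogy. This sketch uses plain text and Python's textwrap module rather than XML and a stylesheet, and the two widths are arbitrary assumptions standing in for a printed column and a small screen.

```python
import textwrap

# The content is stored once, with no line breaks or layout baked in.
CONTENT = (
    "The information remains the same while the document adapts "
    "to whatever the viewing medium requires."
)


def render(text, width):
    """Lay the same content out for a medium of the given character width."""
    return textwrap.fill(text, width=width)


# Hypothetical media widths, chosen only for illustration.
for label, width in (("print column", 72), ("small screen", 30)):
    print(f"--- {label} ({width} characters wide) ---")
    print(render(CONTENT, width))
```

A PDF composed for letter-size paper is closer to the already-wrapped output: the line breaks are fixed, and a narrower display can only scroll across them.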
Tradeoffs
Inevitably, the decision you make will involve a tradeoff between cost and functionality. Just how perfect does the final product need to be? Even with scanning, there are levels of quality to be considered: Do you want to take the time for a person to examine the quality of the scanned page, or will you settle for a totally automated process? For example, Google Books uses a highly automated scanning process, but sometimes the quality of the scanned pages is not as good as you might want. Images can be blurred if a page is turned too fast during the scanning process, or page images may include the fingerprints of the person responsible for running the scanner. If you need granular searchability, expect to pay more than if you simply want a scanned copy, since it requires more indexing and more organization.
So… if it does the job, then why do more? One way to answer this is to consider your future needs. Will the product of the conversion effort suffice not just today, but also in five years? Granted, you can't foresee every future contingency, but I've learned that planning for future use can pay off. One of DCL's clients, a company in the publishing industry, was originally happy with very lightly tagged text. Then the company got a new publishing partner who wanted to repurpose the material as part of a different database with more extensive tagging. To suit their new partner's needs, the company had to go back and enhance the tagging, an expensive endeavor that could have been included at almost no additional cost during the original digitization.
Another question to consider is what you plan to do with your source data once you've completed your data conversion project. If you intend to discard your original documents, then quality control of the electronic versions becomes critical to the integrity of your data. You can't go back and rescan something that no longer exists. It's crucial to do it right the first time.
Ask yourself about the market for your data. If it's quite small, there's probably no need to seek elaborate solutions. If it's large, then you'll likely want to choose XML. Sometimes, market size will fool you. In a recent project, DCL used dirty OCR on a collection of sermons scanned for Yeshiva University. The assumption was that there was probably a small cluster of people who would be interested in accessing the collection. However, now that it has become available, it has become a major resource, with a much larger audience than the university thought was possible.
The argument for quality revolves around doing things right the first time around. Putting a little more effort into quality control, as opposed to running an automated process that may miss things, can result in significant savings later. But there is no one-size-fits-all solution to your conversion needs, and not every project requires perfect quality. Take an inventory of your project's needs, and ask yourself: What's the cost of quality? What's its worth?
DCLnews Editorial
September 2009