By Ralph Gammon, Editor, Document Imaging Report
For most of its history, the document imaging market has focused mainly on transactional capture. That’s because making transactions more efficient has always provided the best ROI, or bang for the buck, associated with digitizing paper. However, with document imaging hardware and software prices falling, technology improving, and a recent increased focus on data management (or Big Data), new markets for document digitization are starting to open up.
Data Conversion Laboratory (DCL) is in a perfect position to take advantage of some of these opportunities. The Queens-based service bureau has never focused on transactional document capture. Instead, its roots are in the technical documentation business.
"About 10-15 years ago, we started working with mission critical information like technical manuals," said Mark Gross, CEO of DCL. "We started doing a lot of work with military materials as well as materials for manufacturers like Xerox and Caterpillar".
"We worked with a lot of archival materials and did raster to vector conversions. We also developed a specialty in creating XML that could be used as meta data. This XML data enabled our customers to better identify their information if they needed to do something with it further down the line. We are really focused on cataloguing and indexing information rather than extracting it for transactions."
It was my interest in e-libraries that led to my discussion with Gross. I had received an e-mail from DCL’s PR firm about the value that "millennials" are finding in libraries. "62% of Americans under age 30 agree there is 'a lot of useful, important information that is not available on the Internet,'" was one of the numbers presented in the e-mail, which used it to lead into the question of how libraries can better serve these electronically savvy would-be users.
"Libraries have a lot of the same needs that industries with archival material have," said Gross. "We have worked with organizations like the New York Public Library and the Library of Congress for several years. We don’t typically scan books or copyrighted materials for them. Google has already done a lot of that. Google’s images are good, but often times the text files associated with the images are not very good. We work with organizations that really want higher quality work."
"We capture presidential papers for universities, for example. The New York Public Library hired us to capture the catalogue cards associated with a number of items they had from the 1964/65 World’s Fair. This involved transcribing the text on the cards to create a digital catalogue."
To create metadata for digitized items, DCL uses a combination of OCR, proprietary recognition and correction technology, and key entry. "We work with several different OCR engines," said Gross. "When we can, we like to combine them to get the best of all worlds."
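One simple way to combine several engines is a majority vote over their outputs. The sketch below is illustrative only and does not describe DCL's actual process: the function name `vote_on_tokens` is hypothetical, and it assumes each engine's output has already been split into the same number of aligned tokens.

```python
from collections import Counter


def vote_on_tokens(engine_outputs):
    """Pick the most common reading at each token position.

    engine_outputs is a list of token lists, one per OCR engine,
    assumed (for this sketch) to be aligned and of equal length.
    """
    merged = []
    for candidates in zip(*engine_outputs):
        # most_common(1) returns the reading with the highest count.
        winner, _count = Counter(candidates).most_common(1)[0]
        merged.append(winner)
    return merged


outputs = [
    ["patent", "c1aim", "12"],   # engine A misreads "claim"
    ["patent", "claim", "12"],   # engine B reads everything correctly
    ["patcnt", "claim", "l2"],   # engine C misreads two tokens
]
print(" ".join(vote_on_tokens(outputs)))
```

Real engine outputs rarely line up token for token, so a production system would need an alignment step before any voting could take place.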
"We’ve also built some proprietary processes around OCR. For example, we capture a large number of filing documents every day for the U.S. Patent Office. The Patent Office can’t send the documents out of the country for key entry. So, to save on cost, they wanted to completely automate their capture. Standard OCR couldn’t be applied, because these documents have a lot of math and chemistry information on them."
"We developed a process that we run prior to applying OCR—that can find all non-textual elements on an imaged page, such as graphics and tables. We can suck those elements off and then apply OCR only to the text. We then have the ability to locate the critical meta data, extract it, and convert it to XML. The end result is a batch of documents with 99.6% OCR accuracy, which is not bad for being untouched by humans and good enough for what the Patent Office needs."
DCL has also developed specialized hardware for some applications. "We have some high-speed scanners with ADFs," said Gross. "But, there are a lot of people with that type of equipment, so we also partner for high-speed scanning. We’ve also done work with the Library of Congress’ American Memories collections, where we were working with stuff like Congressional documents from the early 1800s. You don’t touch these documents without white gloves."
"For that, we developed a special scanner with reduced light. This is just the opposite of high-speed scanning. The curator brings in one document at a time and stands there. Once it’s imaged, we try and incorporate some automation to pull out the meta data."
Gross realizes that DCL will never be able to take human intervention completely out of the capture process. "Our goal is to limit the human element," he said. "For example, we’ve developed technology which helps us automate the identification of hyphenated words, vs. words that are just split from one line to the next. That’s just one step, but it’s something."
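To make the hyphenation problem concrete, here is a minimal sketch of one possible approach, a dictionary lookup. It is not DCL's technology: the function names and the tiny `vocabulary` set are hypothetical. When a line ends in a hyphen, the code checks whether the joined form is a known word (a line-break split) or whether the hyphenated form, or both halves, are known (a true hyphenated word).

```python
def join_line_break(left, right, vocabulary):
    """Join two fragments that were separated by a hyphen at a line end."""
    core = right.rstrip(".,;:!?")   # ignore trailing punctuation for lookup
    tail = right[len(core):]
    joined = left + core
    hyphenated = left + "-" + core
    if joined.lower() in vocabulary:
        return joined + tail        # word was merely split across lines
    if hyphenated.lower() in vocabulary or (
        left.lower() in vocabulary and core.lower() in vocabulary
    ):
        return hyphenated + tail    # genuinely hyphenated word
    return joined + tail            # default guess: treat as a split word


def dehyphenate(lines, vocabulary):
    words = []
    carry = None                    # fragment left over from the previous line
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if carry is not None:
            tokens[0] = join_line_break(carry, tokens[0], vocabulary)
            carry = None
        if tokens[-1].endswith("-") and len(tokens[-1]) > 1:
            carry = tokens.pop()[:-1]
        words.extend(tokens)
    if carry is not None:
        words.append(carry + "-")
    return " ".join(words)


vocabulary = {"document", "well", "known", "well-known", "implementation"}
lines = ["The docu-", "ment was well-", "known to the implemen-", "tation team."]
print(dehyphenate(lines, vocabulary))
# prints "The document was well-known to the implementation team."
```

A lookup like this still guesses wrong on words missing from the vocabulary, which is one reason a human review step remains part of the process.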
"Our goal is to continuously make improvements to our processes and automate more and more, while at the same time improving quality to the point where things just need to be looked at rather than proofed. More automation is critical to enabling us to increase our volume while keeping costs down. This helps keep our pricing low, which is important when dealing with library-related projects—that aren’t always the most generously funded."
That said, Gross views the market for conversion services for libraries as growing. "People still use libraries, but there is a lot more focus on distributing and accessing information electronically," he said. "The Queens Public Library, for example, has 5,000 iPads available for loan. Part of the role of a library is dissemination of information, and by digitizing information, they are better able to fulfill that."
"Very often a library doesn’t even know what is has until it starts creating a digital catalogue. This can be very valuable in a research environment. And some organizations just need the room that digitizing collections can create. We’ve worked with several libraries within government agencies that have several years of physical records that were taking up tens of thousands of square feet. Once you digitize these materials and make them available online, all you need is a small reading room."
Gross added that even academically focused publishers are taking advantage of digitization to better distribute research. "There are some scientific associations that are in the business of publishing, but a lot of interested parties have never seen their work because it hasn’t been properly indexed online," he said. "If they can’t find it on Google, nobody is buying it."
"Now, the market is changing. There are books more than a hundred years old that are finally being read because nobody could find them before. Or there is stuff like the archives of southern newspapers that through digitization is finally being made available to interested Civil War historians. “The market for digitization is changing as technology improves. It’s becoming much broader— and electronic distribution is cheap."