How the Optical Society Converted Nearly 100 Years of Content

Mark Gross, President, Data Conversion Laboratory, appearing in Associations Now

Reprinted with permission.

Copyright, ASAE: The Center for Association Leadership, July 13, 2015, Washington, DC.

With almost a century's worth of content in its library, the Optical Society of America decided to digitally convert the material to provide members and others with more robust product offerings. Find out how OSA did it.

The challenge of managing large libraries of content and deciding to convert legacy content into new product offerings can be daunting—especially for a world-class scientific publisher of eight highly rated technical journals, with collections going back to 1917.

In 2011, the Optical Society of America (OSA) decided it was time to convert its legacy journal content to digital formats—all the way back to volume one, issue one. Why convert the entire library? So OSA could offer members, researchers, and others around the world who read and cite the society's legacy journal material more robust, technologically savvy product offerings.

To begin the process, the team could have spent a year analyzing content, handed everything off to the conversion experts, and returned to other activities. Instead, OSA chose a more agile approach to fast-track the project, which included working with a vendor. This allowed the organization to start addressing challenges earlier, produce more consistent results, and complete the three-year project on time and on budget.

The team decided to convert the entire back file to full-text XML, which would free the scientific content (equations, tables, and algorithms) from PDF constraints without losing essential information.

Why XML?

Extensible Markup Language (XML) provides reliable and uniform representation of textual information and other types of content as tagged data, independent of formatting restrictions. In OSA's case, the highly visual nature of optics science—with appealing subjects such as rainbows, holograms, and lasers—made an image bank a natural outgrowth product and one that could not have been developed effectively without the reliable, uniform structure that XML provides.
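To illustrate why tagged data makes a product like an image bank practical, here is a minimal sketch that pulls figure labels, captions, and image file names out of a small XML fragment. The fragment is invented for illustration and only loosely follows NLM-style element names (fig, label, caption, graphic); the function name extract_figures is hypothetical, and nothing here is taken from OSA's or DCL's actual tooling.

```python
import xml.etree.ElementTree as ET

# Invented sample fragment using NLM-style element names (an assumption,
# not real OSA content).
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <sec>
      <title>Results</title>
      <p>The pattern is shown in <xref ref-type="fig" rid="f1">Fig. 1</xref>.</p>
      <fig id="f1">
        <label>Fig. 1</label>
        <caption><p>Hologram reconstruction under laser illumination.</p></caption>
        <graphic xlink:href="fig1.tif"/>
      </fig>
    </sec>
  </body>
</article>"""

# ElementTree exposes namespaced attributes as "{namespace-uri}name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(xml_text):
    """Return one dict per <fig> element: id, label, caption text, image file."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        # Flatten any nested markup inside the caption and collapse whitespace.
        caption_text = ""
        if caption is not None:
            caption_text = " ".join("".join(caption.itertext()).split())
        graphic = fig.find("graphic")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": caption_text,
            "file": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


figures = extract_figures(SAMPLE)
```

Because every figure is wrapped in the same tags, the same few lines work across any article that follows the structure. The same extraction is much harder against a page-oriented format such as PDF.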

Working with Data Conversion Laboratory (DCL), the OSA team formulated a strategic plan with the expectation that things would change. The central challenge was to maintain a collaborative and agile approach as one surprise after another was encountered in the legacy content. The sheer volume of articles (more than 750,000 pages) and the variety of legacy input formats made traditional spec writing impractical.

DCL instead applied a software process based on a hub-and-spoke framework. To set the stage, content was first pre-processed according to its source type, including PDF, XML, and Standard Generalized Markup Language (SGML). The content was then further processed, and additional modules were integrated to handle newly discovered structures. This process (shown in the figure below) also gave team members who were less versed in the technologies a visual representation of a consistent content lifecycle, making the entire project easier to understand.

DCL Software Model flowchart
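The hub-and-spoke idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not DCL's software: each "spoke" is a hypothetical pre-processor for one source type that produces a common intermediate form (here a plain dict), and the "hub" is a shared list of processing modules that can grow as new structures are discovered. All function and variable names are invented.

```python
from typing import Callable, Dict, List

# --- Spokes: one hypothetical pre-processor per source type -----------------
# Each returns the same intermediate shape so the hub never needs to know
# where the content came from.

def preprocess_pdf(raw: str) -> dict:
    return {"source": "pdf", "text": raw, "notes": []}

def preprocess_xml(raw: str) -> dict:
    return {"source": "xml", "text": raw, "notes": []}

def preprocess_sgml(raw: str) -> dict:
    return {"source": "sgml", "text": raw, "notes": []}

SPOKES: Dict[str, Callable[[str], dict]] = {
    "pdf": preprocess_pdf,
    "xml": preprocess_xml,
    "sgml": preprocess_sgml,
}

# --- Hub: shared modules applied to every document, in registration order ---
HUB_MODULES: List[Callable[[dict], dict]] = []

def register(module: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Add a module to the hub; new modules can be registered at any time."""
    HUB_MODULES.append(module)
    return module

@register
def normalize_whitespace(doc: dict) -> dict:
    doc["text"] = " ".join(doc["text"].split())
    return doc

@register
def flag_empty_content(doc: dict) -> dict:
    if not doc["text"]:
        doc["notes"].append("empty content")
    return doc

def convert(source_type: str, raw: str) -> dict:
    """Route content through its spoke, then through every hub module."""
    if source_type not in SPOKES:
        raise ValueError(f"no pre-processor for source type: {source_type}")
    doc = SPOKES[source_type](raw)
    for module in HUB_MODULES:
        doc = module(doc)
    return doc

result = convert("pdf", "  Diffraction   of light \n by a grating ")
```

The point of the design is that a newly discovered structure needs only one new hub module, rather than a change to every source-specific path.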

The team used automated tools to perform quality assurance and developed learning databanks to house examples and support continuing process and quality improvements in the future.

Low-Hanging Fruit: Identifying Quick Wins

DCL identified some quick wins within materials that were already mapped to OSA's own XML Document Type Definition (DTD) and thus ready to map to the NLM 3.0 DTD, a publishing standard developed by the U.S. National Library of Medicine. DCL and OSA worked together to incorporate the OSA-provided rules into DCL's conversion software, which accurately cleans up and normalizes content in the course of conversion. The Optics Image Bank provided a visual search across all figures and images, even by the context of the figure caption, in-text reference, or related images.
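As a rough illustration of what mapping one DTD's elements onto another can look like, the sketch below renames elements according to a rule table and reports anything the rules do not cover. The source element names (arttitle, para, section, footnote) are hypothetical stand-ins, since the article does not describe OSA's DTD. The target names (article-title, p, sec) are NLM-style. The function remap_tags is invented for this example and does not represent DCL's conversion software.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: source element name -> NLM-style element name.
TAG_MAP = {
    "arttitle": "article-title",
    "para": "p",
    "section": "sec",
}

# Names that are already valid in the target and need no change (assumed).
TARGET_TAGS = set(TAG_MAP.values()) | {"article"}


def remap_tags(xml_text):
    """Rename elements per TAG_MAP; return (converted XML, unmapped tag names)."""
    root = ET.fromstring(xml_text)
    unmapped = set()
    for element in root.iter():
        if element.tag in TAG_MAP:
            element.tag = TAG_MAP[element.tag]
        elif element.tag not in TARGET_TAGS:
            # Leave the element alone but record it so a person can add a rule.
            unmapped.add(element.tag)
    return ET.tostring(root, encoding="unicode"), sorted(unmapped)


SOURCE = (
    "<article><arttitle>Laser modes</arttitle>"
    "<section><para>Text.</para><footnote>n</footnote></section></article>"
)

converted, unmapped = remap_tags(SOURCE)
```

Reporting unmapped elements instead of failing fits a project where surprises in legacy content are expected: each one becomes a candidate for a new rule.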

End Results

Within 18 months, six years' worth of content had been converted into two well-received new offerings on the Optics InfoBase platform.

OSA's commitment to a full-text XML conversion of important scientific journal files also allows for more flexible reuse in the future. For example, OSA now uses the converted material to build a robust search offering of image-based content, one that reaches new target markets that value images in scientific material. The content structured in NLM XML will also continue to provide value by supporting rapid development of new offerings that keep the organization's content viable.