
Issues of XML conversion

Converting Tables To XML: Top 10 Challenges & Pitfalls


One of the more challenging issues in document conversion is converting tables and tabular material. Technical documentation and scientific journal and book publishing are typically loaded with long and complex tables, and the challenge is getting the conversion right, so that minimal cleanup work is needed on the converted documents.

In our many years of document conversion, we've been faced with all types of tables to convert, and have come up with a list of the top 10 challenges and pitfalls that one might expect to encounter.

1. Simulated tables using tabs

This is by far the most difficult type of table to convert. Many desktop publishing and word processing programs do not have a built-in table editor (or did not in the past), and so users often resort to using tabs to simulate the look of a table. When converting to a structured markup language such as XML, you can't leave the tabs even if you wanted to, as there is usually no equivalent in the tagging structure. So this has to be turned into a table markup of some sort.

Unfortunately, it is often difficult to discern which cell a particular piece of text belongs to, because the number of tabs frequently does not correspond to the number of columns in the "logical" table.

In these cases, the table's entire structure (including what goes into each cell, spanning, and alignments) has to be guessed at, using sophisticated algorithms that attempt to infer structure from appearance clues.

Tables in QuarkXPress documents have historically been built this way, and many people still use tabs today, even though Quark has added a table editing facility to recent versions of its popular publishing tool. So when converting from Quark to XML, this is one of the big obstacles to doing it properly.
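To illustrate what "inferring structure from appearance" can mean, here is a minimal sketch in Python. It is not the algorithm we use in production; it simply expands tabs to fixed tab stops and looks for vertical bands of white space shared by every line. The function names, the tab size of 8, and the sample data are all assumptions made for the example, and a real converter would work from the actual tab stop definitions and font metrics.

```python
def infer_column_starts(lines, tabsize=8, min_gap=2):
    """Guess column start offsets from blank bands shared by every line."""
    expanded = [ln.expandtabs(tabsize).rstrip() for ln in lines if ln.strip()]
    width = max(len(ln) for ln in expanded)
    padded = [ln.ljust(width) for ln in expanded]
    # A position is "blank" only if every line has a space there.
    blank = [all(ln[i] == " " for ln in padded) for i in range(width)]
    starts, i = [0], 0
    while i < width:
        if blank[i]:
            j = i
            while j < width and blank[j]:
                j += 1
            # A wide enough blank band marks the start of a new column.
            if j - i >= min_gap and j < width and i > 0:
                starts.append(j)
            i = j
        else:
            i += 1
    return starts


def split_rows(lines, tabsize=8, min_gap=2):
    """Cut each line at the inferred column offsets."""
    bounds = infer_column_starts(lines, tabsize, min_gap) + [None]
    rows = []
    for ln in lines:
        if not ln.strip():
            continue
        text = ln.expandtabs(tabsize)
        rows.append([text[a:b].strip() for a, b in zip(bounds, bounds[1:])])
    return rows


# Note that the third line uses fewer tabs than the first two.
sample = ["Part\t\tQty\tPrice", "Widget\t\t10\t1.50", "Long widget\t2\t20.00"]
print(split_rows(sample))
```

The sample shows why counting tabs is not enough: the rows use different numbers of tabs, yet all three resolve to the same three columns once position on the line is taken into account.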

2. Usage of hard returns to simulate rows

Sometimes, even in cases where the publishing tool has a table editor, authors use hard returns within cells to get text to line up properly (on the same line). This is bad. If you want text in multiple columns to appear on the same line, it should be set up as a row within the table. That way, no matter how long the rows are, the table editor can ensure the cells appear together. But in a typical situation where hard returns within each cell are used to maintain alignment, each cell may contain five or six hard returns.

When you convert this type of "ugly" table construct to XML you could, in theory, simply replicate the construct of the original table. This is assuming that a table cell hard return markup is available to you (if it isn't, this course of action is not an option). The problem is, when the table is rendered in its new environment (where the width of the columns is probably different), the text in some cells may wrap differently. This will leave you with the text from different columns no longer matching what the author intended.

The proper way to convert these is to remove (either manually or via software) all of the hard returns, and correctly restructure the tables, using rows and spanning, so that the document will render properly on various types of platforms.
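The simplest case of this restructuring can be sketched in a few lines of Python. The sketch assumes that every hard return in a cell marks a new logical row, which is an assumption you would have to verify for each table; a cell whose text merely wraps onto several lines needs a spanning decision instead. The function name is hypothetical.

```python
from itertools import zip_longest


def split_hard_returns(row):
    """Turn one row whose cells hold aligned lines into several real rows.

    Assumes each hard return ("\n") inside a cell starts a new logical row.
    Cells with fewer lines than their neighbors are padded with empty strings.
    """
    columns = [cell.split("\n") for cell in row]
    return [list(parts) for parts in zip_longest(*columns, fillvalue="")]


ugly = ["Bolt\nNut\nWasher", "M6\nM6\nM6", "0.10\n0.05\n0.02"]
for new_row in split_hard_returns(ugly):
    print(new_row)
```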

3. Table footnotes

Tables often have lots of footnotes at the bottom with important information that could not fit into the table itself (the most famous of which was the asterisk attached to Roger Maris's record of 61 home runs, which indicated that it was set in a 162-game season).

In many cases, desktop published tables have footnote references done, not as logical footnote references (which typically should be converted to hyperlinks to the footnotes in the XML), but as typographic marking, such as a superscript. You should then expect to go in and convert all of the footnote markings to their proper structured markup, either manually or using software (which is preferable).
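As a rough sketch of the software route, the Python below rewrites typographic markers as structured references. It assumes the superscripts have already been extracted as `<sup>...</sup>` runs and that the target markup has a reference element; the `xref` element, its `ref` attribute, and the ID pattern are hypothetical names chosen for the example, not part of any particular DTD.

```python
import re

# One or two characters inside a superscript are treated as a footnote marker.
SUP = re.compile(r"<sup>([^<]{1,2})</sup>")


def link_footnotes(cells, table_id):
    """Replace superscript markers with references and return the marker-to-ID map."""
    ids = {}

    def repl(match):
        marker = match.group(1)
        # The same marker always maps to the same footnote ID.
        ids.setdefault(marker, f"{table_id}-fn{len(ids) + 1}")
        return f'<xref ref="{ids[marker]}"/>'

    linked = [SUP.sub(repl, cell) for cell in cells]
    return linked, ids


cells = ["Weight<sup>a</sup>", "12.5", "Height<sup>b</sup>", "n/a<sup>a</sup>"]
linked, ids = link_footnotes(cells, "t1")
print(linked)
print(ids)
```

A pattern this simple will also catch genuine superscripts, such as the exponent in a unit like square meters, so the results still need review against the footnotes actually listed under the table.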

4. Target table model limitations

When you convert to SGML or XML, you need to use a table markup model to represent your tables. The most common table markups in use today are the CALS table model (and its slimmed-down version, the OASIS table model), and the HTML table model. Each of these table models has limitations, and cannot necessarily represent all the kinds of things that one can do in a sophisticated publishing table editor.

For instance, the CALS table model does not support different types of cell borders (such as double lines and dotted lines) and has no support for cell shadings or vertical cell text. This means you may not be able to exactly replicate the look of the original table when you render it via these structured table markup languages.
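For readers who have not seen CALS markup, the Python sketch below builds a bare-bones CALS-style table (`table`, `tgroup`, `colspec`, `thead`, `tbody`, `row`, `entry`) from a header and a list of rows. It is only an illustration of the basic shape of the model; the function names and column naming scheme are our own, and a real target DTD may require additional attributes or wrappers.

```python
import xml.etree.ElementTree as ET


def _add_row(parent, cells):
    """Append one <row> of <entry> elements to a thead or tbody."""
    row = ET.SubElement(parent, "row")
    for text in cells:
        ET.SubElement(row, "entry").text = text


def rows_to_cals(header, rows):
    """Serialize a simple grid as a minimal CALS-style table."""
    table = ET.Element("table")
    tgroup = ET.SubElement(table, "tgroup", cols=str(len(header)))
    for i in range(len(header)):
        ET.SubElement(tgroup, "colspec", colname=f"c{i + 1}")
    _add_row(ET.SubElement(tgroup, "thead"), header)
    tbody = ET.SubElement(tgroup, "tbody")
    for cells in rows:
        _add_row(tbody, cells)
    return ET.tostring(table, encoding="unicode")


print(rows_to_cals(["Part", "Qty"], [["Bolt", "10"], ["Nut", "25"]]))
```

Notice that nothing in this structure says how a rule should be drawn or whether a cell is shaded; formatting that the model cannot express has to be dropped or carried some other way.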

5. Page image tables

This is a cousin to the "tabbed" tables mentioned above. Some authors construct their tables using spaces and "line-draw" characters to simulate the appearance of a table. This kind of table was very common in the early days of word processing before table editing facilities were introduced. This method typically required the use of a fixed pitch font, which made it easy to get columns to line up.

As in the case of the "tabbed" table, the entire table structure will need to be inferred - although the use of line draw characters often helps in determining the structure.

You would hope this type of table would occur less frequently nowadays. But even though word processing has become prevalent in all aspects of society, there are still lots of authors who treat the computer terminal like a "glass typewriter," and simply use spaces to get things to align (and have no clue why they can't get things to line up properly using spaces within proportional fonts). The difficulties faced in converting these types of tables will probably never go away.
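When line-draw characters are present, they do much of the work for you. The Python sketch below handles the simplest case: ASCII-drawn tables using `|`, `+`, `-`, and `=`. The character set is an assumption for the example; many older files use the box-drawing characters of a particular code page instead, and spanned cells would need more logic than this.

```python
RULE_CHARS = set("+-=|")


def parse_line_draw(lines):
    """Extract rows of cell text from a table drawn with ASCII line characters."""
    rows = []
    for ln in lines:
        ln = ln.strip()
        if not ln or set(ln) <= RULE_CHARS:
            continue  # blank line or horizontal rule: no cell content
        if ln.startswith("|") and ln.endswith("|"):
            # Drop the outer borders, then cut on the inner vertical rules.
            rows.append([cell.strip() for cell in ln[1:-1].split("|")])
    return rows


drawn = [
    "+------+-----+",
    "| Part | Qty |",
    "+------+-----+",
    "| Bolt | 10  |",
    "| Nut  | 25  |",
    "+------+-----+",
]
print(parse_line_draw(drawn))
```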

6. MS Word tables sub-fragments

Microsoft Word's table editor lets you change column widths as often as you like throughout the body of a table. In many respects, it's more of a "row" editor, in the sense that each row can have a different number of columns and cells.

Very often, an author will (even inadvertently) drag a column marker a tiny bit, thereby creating what appears to be 12 tables with one row each, as opposed to a single table with 12 rows (which is what the author intended).

Again, the challenge is to determine what is really going on in the table (either manually or with software), and put it back into the proper table markup.
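One software approach is to rejoin adjacent fragments whose column widths agree within a small tolerance. The Python sketch below assumes the fragments have already been extracted as dictionaries with `widths` (in inches) and `rows`, and that they sit directly next to each other with no text in between; the data layout, the function names, and the 0.05 inch tolerance are all assumptions for illustration.

```python
def _similar(a, b, tolerance):
    """True when two width lists have the same length and nearly equal values."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def merge_fragments(fragments, tolerance=0.05):
    """Rejoin adjacent table fragments whose column widths nearly match."""
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        if prev and _similar(prev["widths"], frag["widths"], tolerance):
            prev["rows"].extend(frag["rows"])
        else:
            merged.append({"widths": list(frag["widths"]), "rows": list(frag["rows"])})
    return merged


fragments = [
    {"widths": [1.00, 2.00], "rows": [["Bolt", "10"]]},
    {"widths": [1.02, 1.98], "rows": [["Nut", "25"]]},   # column marker nudged slightly
    {"widths": [1.00, 2.00], "rows": [["Washer", "40"]]},
]
print(merge_fragments(fragments))
```

Choosing the tolerance is a judgment call: too tight and nudged rows stay separate, too loose and rows that were deliberately laid out differently get forced into one grid.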

7. Dummy tables used to simulate a particular appearance

This problem is most associated with HTML web pages, where people often use table constructs to simulate a particular appearance on a page because the proper facility may not exist. Authors or designers, for example, might do it to force a certain text alignment or to insert some hypertextual page navigation links.

There is a significant challenge when converting these types of tables because, in the converted markup, these really should not be marked up as tables at all, and the table markup should be removed. Determining which tables on these sorts of pages are real can only be partially done using software, and the results often need to be fixed manually.

Data converters' worst nightmare?

Writers and designers working in HTML often use tables for page layout. Look comes first because their priority is getting the message across. Correct usage of HTML code can take second place, making material more difficult to convert.

DCLnews editor and UK journalist John Shreeve puts it this way:

"Deadlines and getting your message heard means you'll blast out an HTML table without worrying about naming elements properly. It's not malicious, of course. But it does make you a worst nightmare for people in the data conversion business. Luckily DCL humors me!"

8. Tables continued across multiple pages

Often in documents that contain tables that span many pages, you'll find tables that are artificially split into many separate tables, with words such as "Table 5-7 (continued)" placed on each page. To represent this properly in structured table markup, all of these tables should be logically merged into one, since the breakup is usually an artificial one done to make the table fit on individual paper pages.

This is the kind of construct that needs to be removed to make the structured document repurposable to multiple platforms.
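The merge itself is straightforward once the continuation titles are recognized. The Python sketch below assumes each extracted table carries a `title`, a `header`, and its body `rows`, and that continuations are labeled with "(continued)" or a close variant; the data layout, the function name, and the exact wording matched are assumptions for the example.

```python
import re

# Matches a trailing "(continued)", "(cont'd)", "(cont.)" or "(cont)".
CONTINUED = re.compile(r"\s*\((?:continued|cont(?:'d|\.)?)\)\s*$", re.IGNORECASE)


def merge_continued(tables):
    """Fold "Table N (continued)" pieces back into the table they continue."""
    merged = []
    for table in tables:
        base_title = CONTINUED.sub("", table["title"])
        is_continuation = base_title != table["title"]
        if is_continuation and merged and merged[-1]["title"] == base_title:
            # The repeated header on the continuation page is dropped.
            merged[-1]["rows"].extend(table["rows"])
        else:
            merged.append(
                {
                    "title": table["title"],
                    "header": list(table["header"]),
                    "rows": list(table["rows"]),
                }
            )
    return merged


pages = [
    {"title": "Table 5-7", "header": ["Part", "Qty"], "rows": [["Bolt", "10"]]},
    {"title": "Table 5-7 (continued)", "header": ["Part", "Qty"], "rows": [["Nut", "25"]]},
    {"title": "Table 5-8", "header": ["Code"], "rows": [["X1"]]},
]
print(merge_continued(pages))
```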

9. Double-page-wide "faked" tables

This doesn't happen often, but often enough that you need to be aware of it. Sometimes you may have a table that's too wide to fit on one published page, and needs to span across two pages. So, for instance, a 12-column table may appear on pages 14 and 15, but be composed as two 6-column tables which, when printed in a book, will appear as one very wide table!

In this case, proper conversion markup requires you to combine the two 6-column tables that appear in the source document into one 12-column table. Again, you would do this manually or, if it happens often enough, through software.
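Combining the two halves amounts to joining them row by row. The Python sketch below assumes both halves have the same number of rows in the same order. The optional `repeated_stub_cols` parameter is a hypothetical convenience for layouts where the right-hand page repeats the first column or two as a reading aid.

```python
def join_side_by_side(left, right, repeated_stub_cols=0):
    """Join the left-page and right-page halves of a table row by row.

    repeated_stub_cols: number of leading columns on the right-hand page
    that merely repeat the left-hand stub and should be dropped.
    """
    if len(left) != len(right):
        raise ValueError("the two halves must have the same number of rows")
    return [l + r[repeated_stub_cols:] for l, r in zip(left, right)]


left_page = [["Part", "Jan", "Feb"], ["Bolt", "10", "12"]]
right_page = [["Part", "Mar", "Apr"], ["Bolt", "9", "14"]]
print(join_side_by_side(left_page, right_page, repeated_stub_cols=1))
```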

10. Tables done as text boxes

Desktop publishing packages often have facilities to draw text boxes, and anchor them to specific locations on the page by specifying an absolute positioning. Sometimes, in order to accomplish a very specific appearance, we see tables drawn in this way. Converting these types of tables to the proper table markup from the original source document can be a real nightmare. This is because the material coming in is a series of text boxes that the user has laid out to specific locations on the page. Without the author being on hand to explain his or her intention (extremely unlikely), discerning the table layout is incredibly difficult.

In these cases, it is better to extract the table structure from the PDF version of the file, where you at least have a better idea where the various cells lie on the page.
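Once you have page positions for each piece of text, a first pass at recovering the grid is to group text by vertical position and then order each group from left to right. The Python sketch below assumes each box is an `(x, y, text)` tuple with `y` increasing down the page (PDF coordinates usually run the other way, so they would need flipping first) and a hypothetical tolerance of 3 units for treating boxes as being on the same line.

```python
def boxes_to_rows(boxes, y_tolerance=3.0):
    """Group positioned text boxes into rows by y, then order cells by x."""
    rows = []
    for x, y, text in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            # Far enough below the previous row: start a new one.
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


boxes = [(200, 101, "Qty"), (50, 100, "Part"), (50, 120, "Bolt"), (200, 119, "10")]
print(boxes_to_rows(boxes))
```

This only recovers rows. Deciding which cells share a column, and where cells span, still requires comparing the horizontal positions across rows.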

The bottom line...

The general rule of thumb is that documents composed using the proper features of a publishing package are the simplest to convert, and this especially applies to tabular material.

Since (unfortunately) people often do not know how to use many of the features of a publishing tool, you should expect to have to deal with many of the problems that we've mentioned above.

Being aware that you might encounter them is important. On first appearance, you may think you've gotten a good table conversion, only to have it look like gibberish when rendering it on a different platform.

Do not underestimate the things that can be done improperly in a table conversion, or the pain that you may experience trying to get it to the proper table markup. Without proper preparation, it can be a nightmare. The key is to be prepared and anticipate the issues.

Mike Gross
12/3/2003

 
Corporate office:
61-18 190th Street, 2nd Floor, Fresh Meadows, NY 11365
718-357-8700
Data Conversion Lab
Copyright © 1997-2010  Data Conversion Laboratory, Inc. All rights reserved.