Clean HTML from Word: Can It Be Done?
"I have to find out how to get clean HTML from Word," I told a colleague.
"Good luck!" he chortled. I thought his raucous laughter was in bad taste. I told him that, too, as I gathered what dignity I could muster, stepped over his prone form, and set about my work.
Here's the problem, and it's a real one. Word comes with a built-in command to Save as Web Page so that you can easily convert print documents. The HTML that you get preserves all the formatting (tables, code, and links) of the original as it strips out any graphics and tucks them away in a separate file. Unfortunately, at the same time, Word creates a lot of junk formatting, including Microsoft-proprietary tags and attributes that could keep your text from displaying correctly in all browsers.
At this point you may need some help, preferably a one-step conversion that produces spotless, fast-loading, unencumbered HTML that will run anywhere.
Now, none of us want to lose any formatting from the original document, even though the original formatting was meant for a printed page, not the changing parameters of the Internet. It only makes sense that, whether in a browser or on a piece of paper, tables should look like tables, images should appear in about the right spot, and code should look like well-formatted code.
Finally, we could wrap up our wish list by adding that this lovely conversion ought to be done without spending much money.
Accordingly, I undertook to test some of the simplest (and cheapest) ways to make life and well-formed HTML easier.
For text to work with, I asked my editor, who does this kind of conversion all the time, if she had a troublesome file that might offer some problems. She assured me that she just happened to have a dandy one.
She was right.
Converting with Word
I began by converting the text my editor provided as a benchmark. As a .doc file, this little monster had weighed in at 68KB: not large, but fierce. It contained all the good stuff: links, tables, and code samples. It didn't contain any graphics. Converting to a web page using Word's Save as Web Page feature left it at 71KB.
When I examined the converted file, it ran to 20 printed pages of markup, full of Office-specific tags, repeated font declarations, and general sludge.
The logical place to start trimming this mess lay in Word itself. You can strip a large hunk of goop from Word HTML by choosing Web Page, Filtered in the Save as Type list when you save the file (see Figure 1). This setting removes all the Office-specific tags from the file and takes out plenty of formatting junk.
Saving in this manner knocked our test file from 71KB to 44KB and preserved the links, tables, and code. I decided that any third-party package had to do at least this well or offer something this option didn't: for example, control over what kinds of things are removed.
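To make "control over what kinds of things are removed" concrete, here is a minimal Python sketch of a cleanup pass with switchable rules. It is not what Word's filter does internally, nor how any particular package works. The rule names, the regular expressions, and the clean_word_html function are all hypothetical, chosen only to illustrate the idea, and a regex pass like this is cruder than a real HTML parser.

```python
import re

# Hypothetical cleanup rules (illustrative names, not from Word or any tool).
# Each rule can be switched on or off independently.
RULES = {
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    "conditional_comments": re.compile(
        r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.S | re.I),
    # Namespaced Office tags such as <o:p> and </o:p> (inner text is kept)
    "office_tags": re.compile(r"</?[a-z]+:[a-z0-9]+[^>]*>", re.I),
    # class="MsoNormal", class=MsoNormal, and similar
    "mso_classes": re.compile(
        r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", re.I),
    # mso-* declarations, normally found inside style attributes
    "mso_styles": re.compile(r"mso-[a-z\-]+\s*:[^;\"']*;?\s*", re.I),
}


def clean_word_html(html, enabled=None):
    """Apply the selected rules in order; by default apply all of them."""
    names = RULES.keys() if enabled is None else enabled
    for name in names:
        html = RULES[name].sub("", html)
    # Drop any style attribute that a rule left empty.
    return re.sub(r"\s+style=(\"\s*\"|'\s*')", "", html)


sample = ('<p class=MsoNormal style="mso-margin-top-alt:auto;color:red">'
          'Hi<o:p></o:p></p><!--[if gte mso 9]><xml>x</xml><![endif]-->')

cleaned = clean_word_html(sample)
tags_only = clean_word_html(sample, enabled=["office_tags"])
```

Passing a list such as enabled=["office_tags"] removes only the namespaced tags and leaves classes and styles alone, which is the kind of per-category control the Filtered option does not offer. Note that the mso_styles pattern would also match that text if it appeared in the body of the document, so treat this as a starting point rather than a finished tool.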