
Clean HTML from Word: Can It Be Done?

Laurie Rowell's short answer to the question above: Yes, with a bit of effort. With a number of utilities available at relatively low cost, you can finagle Microsoft Word's output into something that resembles clean HTML. You might need to roll up your sleeves and dig around in the code or the formatting (depending on the application you choose), but you should end up with something you can put on the web without shame.

"I have to find out how to get clean HTML from Word," I told a colleague.

"Good luck!" he chortled. I thought his raucous laughter was in bad taste. I told him that, too, as I gathered what dignity I could muster, stepped over his prone form, and set about my work.

Here's the problem, and it's a real one. Word comes with a built-in Save as Web Page command so that you can easily convert print documents. The HTML that you get preserves all the formatting of the original (tables, code, and links), while any graphics are stripped out and tucked away in a separate file. Unfortunately, at the same time, Word creates a lot of junk formatting, including Microsoft-proprietary tags and attributes that could keep your text from displaying correctly in some browsers.

At this point you may need some help, preferably a one-step conversion that produces spotless, fast-loading, unencumbered HTML that will run anywhere.

Now, none of us want to lose any formatting from the original document, even though that formatting was designed for a printed page, not the changing parameters of the Internet. It only makes sense that, whether in a browser or on a piece of paper, tables should look like tables, images should appear in about the right spot, and code should look like well-formatted code.

Finally, we could wrap up our wish list by adding that this lovely conversion ought to be done without spending much money.

Accordingly, I undertook to test some of the simplest (and cheapest) ways to make life and well-formed HTML easier.

For a test document, I asked my editor, who does this kind of conversion all the time, whether she had a troublesome file on hand. She assured me that she just happened to have a dandy one.

She was right.

Converting with Word

I began by converting the text my editor provided as a benchmark. As a .doc file, this little monster had weighed in at 68KB—not large, but fierce. It contained all the good stuff: links, tables, and code samples. It didn't contain any graphics. Converting to a web page using Word's Save as Web Page feature left it at 71KB.

When I examined the converted file, it ran to 20 printed pages of code, full of Office-specific tags, repeated font declarations, and general sludge.
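To put a number on that sludge, a quick audit script can count the usual suspects in a converted file. What follows is only a minimal sketch: the function name audit_word_html, the list of patterns, and the sample markup are my own assumptions for illustration, and the patterns cover just a few kinds of markup that Word's HTML export is widely known to emit (namespaced tags such as o:p, mso- style properties, Mso class names, and conditional comments). It is not an exhaustive inventory.

```python
import re

# Illustrative patterns only (an assumption, not a complete list of
# everything Word writes into its HTML output).
OFFICE_PATTERNS = {
    "namespaced tags": re.compile(r"<(?:o|v|w):[A-Za-z]+", re.IGNORECASE),
    "mso- style properties": re.compile(r"mso-[a-z-]+\s*:", re.IGNORECASE),
    "Mso class names": re.compile(r"class\s*=\s*[\"']?Mso[A-Za-z]+", re.IGNORECASE),
    "conditional comments": re.compile(r"<!--\[if [^\]]*\]>", re.IGNORECASE),
}


def audit_word_html(html: str) -> dict:
    """Count occurrences of each Office-specific pattern in an HTML string."""
    return {label: len(pattern.findall(html))
            for label, pattern in OFFICE_PATTERNS.items()}


# Hypothetical snippet in the style of Word's output, not taken from the test file.
SAMPLE = (
    '<p class=MsoNormal style="mso-margin-top-alt:auto">'
    '<span style="font-family:Arial;mso-fareast-font-family:Arial">Hello</span>'
    '<o:p></o:p></p>'
    '<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->'
)

if __name__ == "__main__":
    for label, count in audit_word_html(SAMPLE).items():
        print(f"{label}: {count}")
```

Run against a real converted file (read it in with open(...).read()), the counts give a rough before-and-after yardstick for any cleanup tool you try.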

The logical place to start trimming this mess lay in Word itself. You can strip a large hunk of goop from Word HTML by choosing Web Page, Filtered in the Save as Type list when you save the file (see Figure 1). This setting removes all the Office-specific tags from the file and takes out plenty of formatting junk.
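For a feel of what this kind of filtering involves, here is a rough sketch of stripping Office-specific markup by hand. It is emphatically not how Word's filter works internally (I have no view into that); the function name strip_office_markup and the choice of what to remove are assumptions made for illustration. It uses regular expressions rather than a real HTML parser, so treat it as a demonstration, not a production cleaner.

```python
import re


def strip_office_markup(html: str) -> str:
    """Remove a few kinds of Office-specific markup from an HTML string.

    Assumption: the four passes below are an illustrative selection,
    not a reproduction of Word's own filtering.
    """
    # 1. Drop conditional-comment blocks: <!--[if ...]> ... <![endif]-->
    html = re.sub(r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # 2. Drop namespaced tags such as <o:p> and </o:p>, keeping any inner text.
    html = re.sub(r"</?(?:o|v|w):[^>]*>", "", html, flags=re.IGNORECASE)
    # 3. Drop mso-* declarations (this pass is naive and would also touch
    #    body text that happens to look like "mso-something: value").
    html = re.sub(r"mso-[a-z-]+\s*:[^;\"']*;?\s*", "", html, flags=re.IGNORECASE)
    # 4. Drop style attributes that the previous pass left empty.
    html = re.sub(r"\s+style\s*=\s*([\"'])\s*\1", "", html, flags=re.IGNORECASE)
    return html


if __name__ == "__main__":
    # Hypothetical Word-style input, not taken from the test file.
    dirty = ('<p class=MsoNormal style="mso-margin-top-alt:auto">Hi'
             '<o:p></o:p></p>'
             '<!--[if gte mso 9]><xml><w:X/></xml><![endif]-->')
    print(strip_office_markup(dirty))  # prints: <p class=MsoNormal>Hi</p>
```

Note that the sketch leaves class names such as MsoNormal alone; deciding whether to keep or drop them is exactly the kind of choice a cleanup tool has to make.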

Figure 1

Saving in this manner knocked our test file from 71KB to 44KB and preserved the links, tables, and code. I decided that any third-party package had to do at least this well or offer something this option didn't—for example, control over what kinds of things are removed.
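To make the benchmark concrete, the file sizes quoted above can be turned into percentages. The short sketch below uses the three sizes from this article (68KB, 71KB, and 44KB); the helper name percent_change is my own.

```python
def percent_change(before_kb: float, after_kb: float) -> float:
    """Return the size change from before_kb to after_kb as a percentage."""
    return (after_kb - before_kb) / before_kb * 100


# Sizes reported in the article, in kilobytes.
doc_kb, web_page_kb, filtered_kb = 68, 71, 44

if __name__ == "__main__":
    print(f".doc to Web Page:      {percent_change(doc_kb, web_page_kb):+.1f}%")       # +4.4%
    print(f"Web Page to Filtered:  {percent_change(web_page_kb, filtered_kb):+.1f}%")  # -38.0%
    print(f".doc to Filtered:      {percent_change(doc_kb, filtered_kb):+.1f}%")       # -35.3%
```

In other words, the filtered save trims roughly 38 percent off the unfiltered HTML, which is the figure a third-party tool would need to match.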
