Published Jun 10, 1999 by Addison-Wesley Professional. Part of the Tools and Techniques for Computer Typesetting series. The series editor may be contacted at firstname.lastname@example.org. This book shows how you can publish LaTeX documents on the Web. LaTeX was born of the scientist's need to prepare well-formatted information, particularly with pictures and mathematics included; the Web was born of the scientist's need to communicate information electronically. Until now, it has been difficult to find solutions that address both needs. HTML and today's Web browsers deal inadequately with the nontextual components of scientific documents. This book, at last, describes tools and techniques for transforming LaTeX sources into Web formats for electronic publication, and for transforming Web sources into LaTeX documents for optimal printing.
You will learn how to:
You will find practical descriptions of:
In addition to giving the Internet location of the software described in this book, the authors also provide a full, annotated catalogue of URLs for the standards and documentation relating to this fast-moving area.
Many of the packages and programs described in this book are freely available in public software archives, and the source code for examples has been placed on CTAN, the TeX archives.
List of Figures.
List of Tables.
1. The Web, its documents, and LaTeX.
The Web, a window on the Internet.
The Hypertext Transport Protocol.
Universal Resource Locators and Identifiers.
The Hypertext Markup Language.
LaTeX in the Web environment.
Overview of document formats and strategies.
Staying with DVI.
PDF for typographic quality.
Down-translation to HTML.
Java and browser plug-ins.
Other LaTeX-related approaches to the Web.
Is there an optimal approach?
What is PDF?
Generating PDF from TeX.
Creating and manipulating PDF.
Setting up fonts.
Adding value to your PDF.
Rich PDF with LaTeX: The hyperref package.
Implicit behavior of hyperref.
Additional user macros for hyperlinks.
Special support for other packages.
Creating PDF and HTML forms.
Validating form fields.
Designing PDF documents for the screen.
Catalog of package options.
Generating PDF directly from TeX.
Setting up pdfTeX.
Graphics and color.
A few words on history.
Principles for Web document generation.
Required software and customization.
Running LaTeX2HTML on a LaTeX document.
Customizing the local installation.
Extension mechanisms and LaTeX packages.
Mathematics modes with LaTeX2HTML.
An overview of LaTeX2HTML’s math modes.
Advanced mathematics with the math extension.
Unicode fonts and named entities, in expert mode.
HTML 4.0 and style sheets.
Large images and HTML 2.0.
Future use of MathML.
Support for different languages.
Titles and keywords.
Multilingual documents using babel.
Images using special fonts.
Converting transliterations using preprocessors.
Extending LaTeX sources with hypertext commands using the html package.
Hyperlinks to external documents.
Enhancements appropriate for HTML.
Alternative text for hyperlinks.
Navigation and layout of HTML pages.
Example of linking various external documents.
Picture representation of special content.
A complete example.
Manual creation of hypertext elements.
Raw hypertext code.
Cascading Style Sheets.
How TeX4ht works.
From LaTeX to DVI.
From DVI to HTML.
Extended customization of TeX4ht.
Tables of contents.
Parts, chapters, sections, and so on.
Defining sectioning commands.
The inner workings of TeX4ht.
The translation process.
Running the tex4ht program.
A look at t4ht.
From DVI to GIF.
A taste of the lg file.
The font control files.
The control file.
IBM techexplorer Hypermedia Browser.
Basic formatting issues.
Your browser and techexplorer.
Adding hypertext links.
Popping up windows and footnotes.
Using images, sound, and video.
Defining and using pop-up menus.
Using color in your documents.
Building a document hierarchy.
Alternating between two displayed expressions.
Printing from techexplorer.
Searching in a document.
Optimizing your documents for techexplorer.
An introduction to WebTeX.
Using the APPLET tag with WebEQ.
Preparing HTML pages via the WebEQ Wizard.
Embedded content problems and future developments.
Will HTML lead to the downfall of the Web?
HTML 4: A richer and more coherent language.
HTML 4 goodies.
HTML 4, the end of the old road.
Different types of markup.
Generalized logical markup.
SGML to HTML and XML.
Extensible Markup Languages.
What is XML?
The components of XML.
Declaring document elements.
The detailed structure of an XML document.
XML is truly international.
XML document components.
The XML declaration.
The document type declaration.
XML parsers and tools.
Emacs and psgml.
The perlSGML programs.
The DTDParse tool.
The Language Technology Group XML toolbox.
Validating documents with XML parsers.
Style sheet languages: A short history.
Programming or style sheets, which is better?
Formatting with Perl.
Principles of operation.
Generating a LaTeX instance.
Cascading Style Sheets.
The basic structure of a Cascading Style Sheet.
Associating style sheets with a document.
A quick look at Cascading Style Sheet properties.
Cascading Style Sheets for formatting XML documents.
The invitation example revisited.
Generating HTML with another document instance.
Document Style Semantics and Specification Language.
The components of DSSSL.
Creating style sheets with DSSSL.
The TeX backend for Jade and the JadeTeX macros.
The Jade SGML transformation interface.
Formatting real-life documents with DSSSL.
Extensible Style Language.
The general structure of an XSL style sheet.
Building the source tree.
Formatting objects and their properties.
Proposed extensibility mechanism.
Using XSL to generate HTML or LaTeX.
Using XSL to generate formatting objects.
Introduction to MathML.
MathML, Unicode, and XML entities.
Web browser support for MathML.
Converting LaTeX to MathML.
An example LaTeX file and its translation to XML.
The LaTeX source.
LaTeX converted to XML.
Document Type Definition for XML version.
techexplorer scripting examples.
The HyperTeX standard.
Configuring TeX4ht to produce XML.
Starting from scratch.
Adding XML tags.
Getting deeper for extra configurations.
XML name spaces.
Examples of important DTDs.
The DocBook DTD.
The AAP effort and ISO 12083.
Text Encoding Initiative.
A DTD for BIBTeX.
LaTeX-like markup, from DTD to printed document.
Transforming HTML into XML.
HTML in XML.
The Extensible HyperText Markup Language.
Java event-based interface.
The SAX Java classes.
Running a SAX application.
Codes for languages, countries, and scripts.
The Unicode standard.
Character codes and glyphs.
Unicode and ISO/IEC 10646-1.
UTF-8 and UTF-16 encodings.
Foreign languages in XML.
Handling non-Latin encodings with UTF8.
The aim of this book is to provide help for authors, primarily scientists, who want to invest in the Web or other hypertext presentation systems but are not living in the world of Microsoft Word or QuarkXPress. They have an investment in markup systems such as LaTeX and have special needs in fields like mathematics, non-European languages, and algorithmic graphics. The book will tell them how to
The World Wide Web has invaded all areas of society, and science is no exception to this rule. This should come as no surprise since the Web paradigm was born at CERN, one of the largest scientific laboratories in the world.
The present ubiquitous Web interface is the result of basic research that took place in the first years of the 1990s at CERN. Before then, use of the Internet had been mostly an affair of specialists. It needed the genius and insight of Tim Berners-Lee and collaborators to create a tool that allowed physicists participating in CERN's high-energy physics program but located all over the world to exchange data and information via the Internet in an intuitive and "user-friendly" way. Their work led directly to the development of the HTML language, the HTTP protocol, and the URL addressing scheme--the three basic pillars on which the Web is built. From the very beginning, the group took the farsighted decision to share their work freely with the Internet community. Then, thanks also to the appearance of the graphic interface of the Mosaic browser, the Web paradigm was received enthusiastically by developers and users alike. The growth of the number of Web sites and users became exponential, culminating in the Web Woodstock at CERN in May 1994. CERN, a scientific laboratory dedicated to basic research, did not have the resources to coordinate Web development further, and hence these responsibilities were transferred to the international World Wide Web Consortium (W3C), which at present consists of three main components: the Laboratory for Computer Science at MIT, USA; INRIA, France; and Keio University, Japan. The Consortium is supported by DARPA and the European Commission.
One lesson to be learned from the history of the advent of the Web is that basic research, in completely unexpected ways, can lead to very important and wide-ranging spin-offs for society.
Although most people do not realize it, SGML (in the form of the ubiquitous lingua franca of the Web, HTML) is today without doubt the leading markup language for electronic documents. Similarly, LaTeX has been used for over a decade for marking up scientific documents. Even today there is no viable alternative to LaTeX for printing texts that contain a lot of mathematics. Therefore it seems reasonable to look for a (possibly) automatic procedure to translate LaTeX documents into a form that is exploitable on the Web. Conversely, documents marked up in XML and HTML should be able to benefit from the high typographic qualities of the TeX processor.
Therefore in this book we explain how LaTeX can be used as the central component of an electronic document strategy for the Web. We show how you can reuse your existing LaTeX documents on the Web by translating them into HTML, and how, by using some LaTeX extension packages, you can more fully exploit the hypertext capabilities of HTML. Today HTML and Web browsers cannot deal very well with nontextual document components, such as pictures (which are translated into bitmap images) or mathematics. We also address the translation of LaTeX into PDF and the possibilities of interpreting LaTeX commands directly by extensions of a browser.
We also introduce you to the secrets of XML, the extensible markup language, which uses a subset of SGML and which is set to replace HTML as it allows for application-dependent extensions. In particular, we look at MML--the mathematical markup language--its syntax and how it can be generated, and what it can be used for.
Going in the other direction, we discuss various strategies to transform Web source documents marked up in XML or HTML into LaTeX or PDF for optimal printing, in particular using DSSSL and XSL style sheets.
Many tools for transforming TeX-based source files into HTML have been developed over the years. The programs described in this book are a representative sample chosen mainly because we were familiar with them and have used them ourselves. The absence of a description of other tools in this book in no way implies that we consider them to be less useful or of inferior quality.
We suggest that all readers look at Chapter 1 before going any further, because this chapter introduces how we think--that the Web is not a threat to LaTeX but an opportunity--and why you should or should not continue to write in LaTeX. We also present a short introduction to the Web from the point of view of the LaTeX user.
Chapter 2 treats the subject of how to marry hyperdocuments with page fidelity using the Portable Document Format (PDF).
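As a rough illustration of what marrying hypertext with page fidelity can look like in practice, the following is a minimal sketch of a LaTeX source that yields a hyperlinked PDF. It assumes the hyperref package (listed in the table of contents above) and a PDF-producing engine such as pdfLaTeX; the title, author, section names, and URLs are placeholders chosen for illustration, not examples taken from the book.

```latex
% Minimal sketch: a hyperlinked PDF produced with hyperref.
% All metadata, labels, and URLs below are illustrative placeholders.
\documentclass{article}
\usepackage[colorlinks=true]{hyperref} % usually loaded after other packages
\hypersetup{
  pdftitle={A small hypertext example}, % stored in the PDF document info
  pdfauthor={A. N. Author}
}
\begin{document}
\tableofcontents % entries become clickable links in the PDF

\section{Introduction}\label{sec:intro}
Cross-references such as Section~\ref{sec:details} turn into internal links.

\section{Details}\label{sec:details}
External resources can be linked by name, as in
\href{https://www.ctan.org/}{CTAN}, or shown literally with
\url{https://www.w3.org/}.
\end{document}
```

The document typically needs to be processed twice so that the table of contents and cross-references resolve.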
The conversion of LaTeX documents into HTML is tackled in Chapters 3 and 4. In Chapter 3 we discuss LaTeX2HTML, which uses Perl to interpret LaTeX source documents and to generate HTML code. Extension packages can be easily added in the form of Perl routines, while various extensions to the LaTeX language make LaTeX2HTML a real high-performance tool to generate hypertext documents.
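To give a flavor of the LaTeX extensions mentioned above, here is a small sketch that assumes the html package distributed with LaTeX2HTML (html.sty, which is not part of a standard LaTeX installation). The link text, URL, and paragraph contents are placeholders; the commands shown are only a small part of what the package offers.

```latex
% Minimal sketch: one source that is meaningful both to LaTeX and to LaTeX2HTML.
% Assumes html.sty (shipped with LaTeX2HTML) can be found by LaTeX.
\documentclass{article}
\usepackage{html}
\begin{document}
\section{Resources}
% Becomes a hyperlink in the HTML output; printed as plain text on paper.
The archive is described on the
\htmladdnormallink{CTAN home page}{https://www.ctan.org/}.

\begin{latexonly}
This paragraph appears only in the typeset (paper) version.
\end{latexonly}

\begin{htmlonly}
This paragraph appears only in the HTML version.
\end{htmlonly}
\end{document}
```

Keeping both variants in a single source file means the printed and Web versions are less likely to drift apart.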
We take a different approach in Chapter 4, where TeX4ht uses a redefinition of LaTeX's TeX macros to generate HTML or XML, possibly using also the MML application for expressing the mathematics.
Recently we have seen the development of browsers (with plug-ins) that are able to interpret mathematical markup directly. Chapter 5 looks at implementations that can directly interpret large subsets of native LaTeX code without prior translation into HTML, in particular techexplorer, a plug-in for Netscape and Internet Explorer developed by IBM, and WebEQ, a Java applet for rendering math.
Chapter 6 looks at the broader picture and gives a gentle introduction to SGML (Standard Generalized Markup Language); it explains how XML (eXtensible Markup Language), a simpler and more "Internet and user-friendly" variant of SGML, will become an important element in any future document strategy for the Internet. It is anticipated that XML, combined with object databases and other current object-oriented technologies, will revolutionize our document management at all levels. Tools for authoring and interpreting XML will be described, and we will spend some time building a LaTeX-like XML markup language.
TeX was originally developed by Don Knuth to print his math books in accordance with the highest standards of the typographic art. Therefore it should come as no surprise that TeX has been proposed as a typesetting engine for Web material. Tools to translate XML sources into various output formats are described in Chapter 7. The use of Cascading Style Sheets (CSS), Document Style Semantics and Specification Language (DSSSL), and Extensible Style Language (XSL) for controlling the translation process will be detailed.
Chapter 8 tackles the "hot" issue of how to take maximal advantage of LaTeX's optimal mathematical notation to translate LaTeX markup into XML and MathML (Mathematical Markup Language), a companion to XML to present and work with math on the Web.
The book ends with appendixes that contain technical information to complement the chapters in the book. We provide an introduction to Web name spaces, discuss internationalization issues, and review a few important XML DTDs. We also explain where you can find the software mentioned in this book.
When The LaTeX Graphics Companion was in its early stages, Sebastian Rahtz and Michel Goossens intended to include coverage of the Portable Document Format, SGML, and the Web in that book. It became apparent, however, that the hypertext and SGML material would require a whole book of their own, so as soon as the Graphics Companion was completed, work started on this Web Companion. Even more than is the case with most TeX work, the packages and programs related to the Web and TeX were changing very rapidly; it was decided, therefore, to ask the authors of three of the most important packages to work with Rahtz and Goossens, to make sure that the chapters would be up-to-date and accurate.
The chapter on LaTeX2HTML is primarily the work of Moore; that on TeX4ht the work of Gurari; and that on IBM techexplorer and WebEQ that of Sutor; Goossens and Rahtz shared the remaining chapters between them. Gurari, Moore, and Sutor also contributed significantly to the rest of the book by commenting on material, contributing sections, and discussing the issues involved.
It is, perhaps, a tribute to the Internet that the five authors never met in person as a group during the entire writing and editing process. The nearest they came was a pleasant dinner in St. Malo at the 1998 EuroTeX meeting, where all but Eitan Gurari were present.
Unless explicitly mentioned otherwise, all packages and programs described in this book are freely available in public software archives; some are in the public domain, while others are protected by copyright. Some programs are available only in source form or work only on certain computer platforms, and you should be prepared for a certain amount of "getting your hands dirty" in some cases. We also cannot guarantee that later versions of packages or programs will give results identical to those in our book. Many of them are under active development, and new or changed versions appear several times a year; we completed this book in the winter of 1998-1999, and tested the examples with versions current at that time.
As regular users of the World Wide Web will know, keeping track of URLs is a tricky, error-prone process as sites continually disappear or change their structure. In this book, therefore, we do not give formal URLs in the text, but rather give pointers (typeset like W3C) to a catalog of URLs in the Appendix. This catalog will be kept up to date and will be available in the CTAN directory mentioned earlier. We have also tried to clear up some of the fog of acronyms by providing a glossary of terms.
This book was prepared using LaTeX. The main text font is Adobe Janson, the sans serif font is Y&Y's European Modern Sans, the math is set in Y&Y MathTime Plus, and the literal typewriter text is set in Y&Y's European Modern typewriter.
The LaTeX style was refined and generalized by Frank Mittelbach from that developed by him and Sebastian Rahtz for The LaTeX Graphics Companion, which, in turn, was derived from the style by Frank Mittelbach and Michel Goossens for The LaTeX Companion.
We are grateful to Nelson Beebe (University of Utah), Tim Bray (Textuality), Mimi Burbank (Florida State University), David Carlisle (NAG), Hans Hagen (Pragma), Han The Thanh (Masaryk University, Brno), T. V. Raman (Adobe Systems), D. P. Story (University of Akron), Michael Downes (American Mathematical Society), Peter Flynn (University College, Cork), Chris Maden (O'Reilly), Thomas Merz (Munich), and Chris Rowley (Open University) for advice, encouragement, and comments on draft chapters.
Sebastian Rahtz would like to take this opportunity to thank Tanmoy Bhattacharya, David Carlisle, Patrick Daly, Yannis Haralambous, and many others, for their help with the hyperref package, and Berthold Horn (Y&Y) for sponsoring part of the development.
Eitan M. Gurari is very thankful to Gertjan Klein and Sebastian Rahtz for their contribution to the development of TeX4ht. Gertjan's help came at early stages of the project, offering important code and advice for making TeX4ht a portable tool and providing numerous detailed comments and suggestions for configuring the output. Sebastian got involved in the project at later stages, providing an enormous amount of feedback, setting up challenging objectives, collaborating in the development of interesting configuration files, aggressively promoting the system, and heavily editing my contribution to this book. Aside and beyond the professional aspects, Gertjan and Sebastian were great Net associates!
Robert Sutor wants to express his gratitude to Bill Pulleyblank, Marshall Schor, and Dick Jenks of the IBM Research Division for their support during the time techexplorer was developed.
Ross Moore would like to acknowledge first Nikos Drakos, for his foresight in designing a translator such as LaTeX2HTML and establishing its basic design principles. There is insufficient space here to list all those who have made significant contributions; we thank them all. Among them we especially wish to acknowledge Marcus Hennecke and Herb Swan, who were the most significant contributors when Nikos could no longer be involved. We also wish to acknowledge Jens Lippman, Scott Nelson, and Marek Rouchal, who continue to supply the support necessary to develop, maintain, and distribute the latest revisions of the LaTeX2HTML program. Second, Ross wants to thank Michel Goossens, Mimi Jett, Jerold Marsden, Robert Miner, and Kristoffer Rose for supporting visits to various places around the world, where ideas for extensions to LaTeX2HTML were discussed and/or developed; some of these visits have directly affected the contents of this book.
On the publishing side, Frank Mittelbach (series editor) did an excellent job of trying to keep us on the straight and narrow path, and Peter Gordon (Addison Wesley Longman, Inc.) provided all the encouragement, support, jokes, and help any authors could want. When it came to production, Helen Goldstein and Maureen Willard were very patient with our idiosyncrasies and steered us safely to completion and edited our dubious prose into real English.
We would like to ask you, dear reader, for your collaboration. We kindly invite you to send your comments, suggestions, or remarks to any of the authors. We will be glad to correct any mistakes or oversights in a future edition and are open to suggestions for improvements or the inclusion of important developments we may have overlooked. We will maintain a list of errata in a file called webcomp.err in the LaTeX distribution, and this will contain current addresses for the authors.