HTML to LaTeX (version 2.6)

This page describes version 2.6 of html2tex, a program that converts a collection of related HTML files into a single LaTeX file, which can then be processed into a PostScript file. To generate a single LaTeX file from a number of HTML files, the user supplies a skeleton LaTeX file and indicates where the translated versions of the HTML files should be included. The user also has to specify for each HTML file the level (chapter, section, subsection, ...) at which it should be included. Links between the different HTML files are mapped to references in the LaTeX file. External links can be included as footnotes or as a bibliography.

The generation of LaTeX is configurable. The mapping of each HTML tag to LaTeX commands can be specified. (This mapping can even be changed dynamically during the processing of the HTML file.) It is also possible to exclude certain parts of the HTML files from the generated LaTeX file, or to include LaTeX parts in HTML comment lines, which are ignored by HTML viewers. This makes it possible to maintain sources for both HTML and LaTeX in the same HTML files.

The program performs some checking of the HTML files in order to be able to generate correct LaTeX output, but this checking is not guaranteed to conform to any HTML standard. In some places the checking might be more relaxed, and in other places more restrictive, than HTML 2.0. So far, there is not much support for extensions beyond HTML 2.0.

The program does extensive checking of links between the different files. For this reason it can also be used as a link checking program, by giving it a single HTML file and specifying that it should scan all referenced pages in the local directory (and its sub-directories).

Links to excluded HTML files (and other URLs) can either be reported as footnotes, or as a sorted bibliography in the LaTeX file.

Error messages are reported on the standard output. The program can also generate an extensive cross-reference file listing all the anchor tags.


The HTML to LaTeX conversion program is implemented by the C program html2tex.c, which needs to be compiled first. (The program was developed with the popular gcc compiler, which is freely available under the GNU General Public License.)

The program takes a single file as input. This should be a skeleton LaTeX file (or, if the program is only used for link checking, an HTML file with the extension .html). It will generate a LaTeX file with the same name as the input file, but with the extension .tex added to it.

The input file

The input file should contain valid LaTeX commands. All lines in the file starting with %html will be interpreted as special lines by the conversion program. These are used to indicate which HTML files should be included, and to set the various options. The following special commands are recognized by html2tex:

Special commands in the HTML files

The following special commands (inside HTML comments) are recognized in the HTML files:

The program recognizes comments inside a pair of double dashes (--), in any of the HTML tags including <! >. It also treats any text in a <! > tag that is not surrounded by double dashes as a comment, but generates a warning message for it.

Defining mappings

As described above, the various mappings of HTML tags to LaTeX can be changed both in the input file (as a line of the form %html -d tag-name options "LaTeX-open" "LaTeX-close") and inside comments in the HTML files (in the form of latex-def tag-name options "LaTeX-open" "LaTeX-close").

They change the mapping of the tag-name HTML tag to the given LaTeX formatting commands. The strings LaTeX-open and LaTeX-close are put around the text that is marked by the HTML tag. (The string in LaTeX-close is generated at the proper place even when the closing tag is not obligatory in the HTML syntax.) If the LaTeX command has to include a double quote, use two double quotes in the string. If a real newline (the `\n' character) has to be included, use `\nl' instead. (No standard LaTeX command starts with this sequence, although many start with `\n'.)
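
The two quoting rules above can be illustrated with a small C sketch. It is not taken from html2tex.c: the function name decode_mapping_string and its buffer interface are made up for illustration, and it assumes that every other character, including other backslash sequences, is copied unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Decode the body of a quoted mapping string (the text between the
 * outer double quotes). Two rules from the description are applied:
 *   ""   becomes a single double quote
 *   \nl  becomes a real newline character
 * Everything else, including sequences such as \newline, is copied
 * unchanged (an assumption of this sketch).
 * Returns the decoded length, or (size_t)-1 if the result, including
 * the terminating NUL, does not fit in out. */
size_t decode_mapping_string(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    while (*in != '\0') {
        char c;

        if (in[0] == '"' && in[1] == '"') {
            c = '"';                 /* doubled quote */
            in += 2;
        } else if (strncmp(in, "\\nl", 3) == 0) {
            c = '\n';                /* \nl stands for a newline */
            in += 3;
        } else {
            c = *in++;               /* ordinary character */
        }

        if (n + 1 >= out_size)       /* keep room for the NUL */
            return (size_t)-1;
        out[n++] = c;
    }
    if (out_size == 0)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With this sketch, the default opening string for the dt tag, written as "\nl\item[" in the input file, decodes to a newline followed by \item[.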

The options select special kinds of translation. The following options are possible:

The pseudo HTML tags (which cannot occur in the HTML files) L1 to L9 specify what LaTeX commands should be generated for which section level. The definition of these pseudo-tags is changed by the command %html -s style for setting the document style.

The default settings are the ones given below, using the format to be used in the input file:

%html -d html    ""  ""
%html -d head    ""  ""
%html -d title   ""  ""
%html -d body    -on ""  ""
%html -d address ""  ""
%html -d h1      -l1 "{\LARGE \bf " "}"
%html -d h2      -l2 "{\Large \bf " "}"
%html -d h3      -l3 "{\large \bf " "}"
%html -d h4      -l4 "{\bf " "}"
%html -d h5      -l5 "{\small \bf " "}"
%html -d h6      -l6 "{\footnotesize \bf " "}"
%html -d p       "\nl\nl"  ""
%html -d ul      -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d menu    -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d dir     -gnh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d ol      -igh "\nl\begin{enumerate}"  "\nl\end{enumerate}\nl"
%html -d li      "\nl\item "  ""
%html -d lh      "\nl\item "  ""
%html -d dl      -igh "\nl\begin{description}"  "\nl\end{description}\nl"
%html -d dt      "\nl\item["  "]"
%html -d dd      ""  ""
%html -d a       ""  ""
%html -d q       "``"  "''"
%html -d i       -iim "{\em "  "}"
%html -d em      "{\em "  "}"
%html -d b       "{\bf "  "}"
%html -d strong  "{\bf "  "}"
%html -d tt      "{\tt "  "}"
%html -d samp    "{\tt "  "}"
%html -d kbd     "{\tt "  "}"
%html -d var     "{\sl "  "}"
%html -d dfn     "{\sc "  "}"
%html -d code    "{\tt "  "}"
%html -d blink   ""  ""
%html -d cite    "{\em "  "}"
%html -d blockquote  -igh "\begin{quotation} "  "\end{quotation}\nl"
%html -d bq      -igh "\begin{quotation} "  "\end{quotation}\nl"
%html -d u       "\underbar{"  "}"

%html -d pre     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d xmp     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d listing -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d br      -br "\newline\nl"  ""
%html -d hr      "\vspace{1mm}\hrule "  ""
%html -d img     ""  ""
%html -d isindex ""  ""
%html -d select  ""  ""
%html -d link    ""  ""
%html -d center  "{\centering "  "}"
%html -d meta    ""  ""
%html -d table   ""  ""
%html -d tr      ""  ""
%html -d td      ""  ""
%html -d sup     "$^{" "}$"
%html -d sub     "$_{" "}$"

Suggested alternative settings for the various tags are:

%html -d title -on "\newpage\thispagestyle{myheadings}\markright{\sc{}" "}\pagenumbering{arabic}\nl\nl"
%html -d h1 -l1 "{\nl\nl\smallskip\LARGE\bf\noindent " "}\nl\nl\noindent{}"
%html -d h2 -l2 "{\nl\nl\smallskip\Large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h3 -l3 "{\nl\nl\smallskip\large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h4 -l4 "{\nl\nl\smallskip\bf\noindent " "}\nl\nl\noindent{}"
%html -d h5 -l5 "{\nl\nl\smallskip\small\bf\noindent " "}\nl\nl\noindent{}"
%html -d h6 -l6 "{\nl\nl\smallskip\footnotesize\bf\noindent " "}\nl\nl\noindent{}"
%html -d code -math
%html -d blockquote "\nl{\parindent=2em\narrower\nl" "\nl}\nl"

The default settings for the pseudo tags for the book and report styles are:

%html -d l1      "\nl\nl\chapter{"  "}\nl\nl"
%html -d l2      "\nl\nl\section{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsection{"  "}\nl\nl"
%html -d l4      "\nl\nl\subsubsection{"  "}\nl\nl"
%html -d l5      "\nl\nl\paragraph{"  "}\nl"
%html -d l6      "\nl\nl\subparagraph{"  "}\nl"
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""

The default setting for the pseudo tags for the article style is:

%html -d l1      "\nl\nl\section{"  "}\nl\nl"
%html -d l2      "\nl\nl\subsection{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsubsection{"  "}\nl\nl"
%html -d l4      "\nl\nl\paragraph{"  "}\nl"
%html -d l5      "\nl\nl\subparagraph{"  "}\nl"
%html -d l6      ""  ""
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""

The default setting for the pseudo tags for the plain style is:

%html -d l1      "\nl\nl\section*{"  "}\nl\nl"
%html -d l2      "\nl\nl\subsection*{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsubsection*{"  "}\nl\nl"
%html -d l4      "\nl\nl\paragraph*{"  "}\nl"
%html -d l5      "\nl\nl\subparagraph*{"  "}\nl"
%html -d l6      ""  ""
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""
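
The three sets of defaults above can be read as one lookup table indexed by document style and level. The C sketch below shows that reading; it is not the actual html2tex.c data structure. The names style_levels, level_table and level_open are made up for illustration, the style names are assumed to be spelled as in the text above, and only the opening sectioning command is stored (the surrounding \nl sequences and the closing strings are left out).

```c
#include <stddef.h>
#include <string.h>

#define LEVELS 9

/* Opening sectioning command for pseudo tags L1..L9 of one style. */
struct style_levels {
    const char *style;
    const char *open[LEVELS];
};

/* Values copied from the default settings listed above. */
static const struct style_levels level_table[] = {
    { "book",    { "\\chapter{", "\\section{", "\\subsection{",
                   "\\subsubsection{", "\\paragraph{", "\\subparagraph{",
                   "", "", "" } },
    { "report",  { "\\chapter{", "\\section{", "\\subsection{",
                   "\\subsubsection{", "\\paragraph{", "\\subparagraph{",
                   "", "", "" } },
    { "article", { "\\section{", "\\subsection{", "\\subsubsection{",
                   "\\paragraph{", "\\subparagraph{",
                   "", "", "", "" } },
    { "plain",   { "\\section*{", "\\subsection*{", "\\subsubsection*{",
                   "\\paragraph*{", "\\subparagraph*{",
                   "", "", "", "" } },
};

/* Return the opening command for the given style and level (1..9),
 * an empty string for levels without a command, or NULL if the style
 * or the level is unknown. */
const char *level_open(const char *style, int level)
{
    size_t i;

    if (level < 1 || level > LEVELS)
        return NULL;
    for (i = 0; i < sizeof level_table / sizeof level_table[0]; i++) {
        if (strcmp(level_table[i].style, style) == 0)
            return level_table[i].open[level - 1];
    }
    return NULL;
}
```

For example, level_open("article", 1) yields \section{ while level_open("book", 1) yields \chapter{, which is why the same HTML file included at level 1 produces a different heading depending on the document style.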


The options can be used to configure the LaTeX fragments which are generated by the program for the various kinds of references. The options can be given in the input file (as a line of the form %html -o option-name option-value), and inside comments in the HTML files (in the form of latex-opt option-name option-value).

There are options that determine in which cases references should be generated. For example, an HTML file will often wrap an email address in an HREF tag that can be used to send an email. As the essential information is already provided in the text, it is not necessary to repeat it in a footnote or a bibliographic entry. The following options can be used for this purpose:

By default all these options are on.

The references can be divided into internal and external. Internal references are HREF tags that point to a file that is included in the LaTeX output, and external references are those that do not. Internal references can be mapped to phrases that refer the reader to the corresponding section. External references have to be given completely, either as a footnote at the bottom of the page or as a bibliographic entry. They are generated as bibliographic entries if the input file contains a line with `%html -b' (or if the program option -b is given), otherwise they are generated as footnotes. There are four generation modes:

These four modes can be set for three different environments, namely: the headers, LaTeX alltt environments, and all the remaining parts. The options for this are:

There are also options that determine the format in which the various kinds of references are to be generated (including the format of the bibliographic entries). All these options make use of format strings (like those used in C), where the percentage symbol followed by a letter indicates a place holder for a string or number that has to be output. A double percentage symbol causes a single percentage symbol to be printed. All these options should contain LaTeX formatting commands. Because references can be generated in fragile environments, `%p' has to be used at places where a `\protect' is required in a fragile environment. Also, because a `\footnote' is not allowed everywhere, a `%F' has to be used instead.
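
As an illustration of how such a format string might be expanded, here is a hedged C sketch that handles only the two rules that are fully specified above: `%%' and `%p'. It is not the html2tex implementation. The names expand_format and append are made up, the sketch assumes that `%p' produces nothing outside a fragile environment, and every other place holder (including `%F') is copied through unchanged for a later step to fill in.

```c
#include <stddef.h>
#include <string.h>

/* Append src to out at position *n, keeping room for the final NUL.
 * Returns 0 on success, -1 if out is too small. */
static int append(char *out, size_t out_size, size_t *n, const char *src)
{
    size_t len = strlen(src);

    if (len >= out_size || *n > out_size - len - 1)
        return -1;
    memcpy(out + *n, src, len);
    *n += len;
    return 0;
}

/* Expand fmt into out.
 *   %%  -> a single percentage symbol
 *   %p  -> \protect when fragile is non-zero, otherwise nothing
 *          (the "otherwise nothing" part is an assumption)
 * Any other %x pair is copied unchanged.
 * Returns 0 on success, -1 if out is too small. */
int expand_format(const char *fmt, int fragile, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return -1;
    while (*fmt != '\0') {
        char piece[3] = { fmt[0], '\0', '\0' };
        const char *text = piece;

        if (fmt[0] == '%' && fmt[1] == '%') {
            text = "%";
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] == 'p') {
            text = fragile ? "\\protect" : "";
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] != '\0') {
            piece[1] = fmt[1];       /* other place holder: keep as is */
            fmt += 2;
        } else {
            fmt += 1;                /* ordinary character */
        }
        if (append(out, out_size, &n, text) != 0)
            return -1;
    }
    out[n] = '\0';
    return 0;
}
```

The point of the `%p' place holder is that the same format string can be reused in both kinds of environment: the \protect only appears where it is needed.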

These are the options for internal references:

The options for external references as footnotes are:

The options for citations are:

The options for the bibliographic entries are:

The following options deal with the formatting of all kinds of references. They make it possible to add additional formatting around the anchor text or the image tag. The "%R" indicates the place where the reference should be placed. This can either be an internal or an external reference, in the running text or as a footnote. If the "%R" appears in a fragile environment, it should be changed into "%fR". If it appears in a place where a \footnote would not be allowed, a combination of an "%mR" and an "%tR" can be used to indicate the place of the footnote marker and the footnote text, respectively. (An "f" can be added if they occur in a fragile environment.)

Program options

If the program is given an input file with the extension .html, it does not generate a LaTeX output file, but only analyses the file and, if the -s option is given, the files it references.
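
The rule above amounts to a check on the input file name. The following C sketch illustrates it; it is not taken from html2tex.c. The names has_html_extension and choose_mode and the run_mode values are made up for illustration, and the case-sensitive comparison is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if name ends with the extension ".html", 0 otherwise.
 * The comparison is case-sensitive (an assumption of this sketch). */
int has_html_extension(const char *name)
{
    static const char ext[] = ".html";
    size_t name_len = strlen(name);
    size_t ext_len = sizeof ext - 1;

    return name_len >= ext_len
        && strcmp(name + name_len - ext_len, ext) == 0;
}

/* The two ways of running the program described in the text. */
enum run_mode {
    MODE_CONVERT,      /* skeleton LaTeX input: generate a LaTeX file */
    MODE_CHECK_LINKS   /* .html input: only analyse, no LaTeX output  */
};

/* Pick the mode from the name of the input file. */
enum run_mode choose_mode(const char *input_name)
{
    return has_html_extension(input_name) ? MODE_CHECK_LINKS
                                          : MODE_CONVERT;
}
```

Under these assumptions an input named index.html selects the link checking mode, while any other name is treated as a skeleton LaTeX file.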

The program recognizes the following command line options:


There is still a long road to go with respect to bugs. I still cannot process the web testing pages correctly.

Known bugs are:

The source

The source (zipped) of html2tex falls under the GNU General Public License, and thus no warranty whatsoever is implied! Versions 2.1 to 2.5 are described elsewhere.

Notice: There is a newer version with some bug fixes.


I would like to thank the following people for their contributions:

Future plans

There are a number of things that still have to be fixed: