html2tex Version 2.5 (beta)
This page describes versions 2.1 to 2.5 of
html2tex, a program which
converts a collection of related HTML files into a single LaTeX file.
(The newest version is version 2.6.)
Such a LaTeX file can be processed into a PostScript file. To
generate a single LaTeX file from a number of HTML files, the user
needs to give a skeleton LaTeX file and indicate where translated
versions of the HTML files should be included. The user also has to
specify for each HTML file at which level (chapter, section,
subsection, ...) it should be included. Links between the different
HTML files are mapped to references in the LaTeX.
The generation of LaTeX is configurable. The mapping of each HTML
tag to LaTeX commands can be specified. (This mapping can even be
changed dynamically during the processing of the HTML file.) It is
also possible to exclude certain parts of the HTML files from the
generated LaTeX file, or to include LaTeX parts in HTML comment
lines, which are ignored by HTML viewers. This makes it possible to
maintain sources for both HTML and LaTeX in the same HTML files.
The program performs certain checking of the HTML files, in order
to be able to generate correct LaTeX output, but this checking does
not conform to any HTML standard. At some places the checking might be
more relaxed, while at other places more restrictive than HTML 2.0. So
far, there is not much support for extensions beyond HTML 2.0.
The program does extensive checking of links between the different
files. For this reason it can also be used as a link checking
program, by giving it a single HTML file and specifying that it should
scan all referenced pages in the local directory (and its
sub-directories).
Links to excluded HTML files (and other URLs) can be reported either
as footnotes or as a sorted bibliography in the LaTeX file.
Error messages are reported on the standard output. The
program can also generate an extensive cross-reference file
mentioning all the anchor tags.
Functionality
The HTML to LaTeX conversion program is implemented by the C
program html2tex.c, which needs to be compiled first. (The
program was developed with the popular gcc compiler, which
is freely available under the GNU General Public License.)
The program takes a single file as input. This should be a
skeleton LaTeX file without any extension (or, if the program is only
used for link checking, an HTML file with the extension .html).
It will generate a LaTeX file with the same name as the input file,
but with the extension .tex.
The input file
The input file should contain valid LaTeX commands. In the file
all lines starting with %html will be interpreted as
special lines by the conversion program. These are used to indicate
which HTML files should be included, and to set the various options.
The following special commands are recognized by html2tex:
- %html fn.html level
Causes the file fn.html to be included as
LaTeX at the given input line. The level should be
an integer specifying the nesting depth of the headers. A value
of 1 will map the H1 tag to \section (or to
\chapter for the book document style).
- %html -r URL
Specifies the URL of the directory of the input file. This is
needed to detect whether any URLs given in the HTML files map to local
HTML files. This command should be given before any HTML file is
included as LaTeX.
- %html -b
Causes LaTeX bibitems to be generated at this place in the input
file for all excluded HTML files (and other URLs). If this command
is not given anywhere in the input file (and the -b command line
option is not given either), all external URLs are given as
footnotes.
- %html -d tag-name options "LaTeX-open"
"LaTeX-close"
Changes the mapping of the tag-name HTML tag to
the given LaTeX formatting commands. See below
for a complete description.
- %html -s style
Indicates the style that should be used. The default assumes the
article document style. Currently, the following values for style
are supported:
- book: for the book document style.
- plain: for an article style without section
numbering, as in (most) HTML browsers.
The command causes the mapping of the H1 to H7
tags to be set correctly for the given document style. This command
should be given before all commands to include HTML files as LaTeX.
- %html -l from-URL to-URL
Indicates that the from-URL is a (symbolic)
link to to-URL. To be used when there are two (or
more) URLs for the same physical file. The given URLs should be
relative to the root-URL.
- %html -m rel-URL comp-URL
Displays a different URL than the one found in the HTML files. This
is useful if, for example, one wants an ftp URL instead of an http
URL, or if one wants to reference the original source when one
has a local mirror of certain URLs.
- %html -i URL
Indicates that the URL should be ignored. To be
used when there are additional HTML pages (for navigation purposes)
that you do not want to be referenced in the document. The given URL
should be relative to the root-URL.
- %html fn.txt line-length
Causes the file fn.txt to be included
verbatim at the given input line. The optional value line-length
can be given to wrap long lines. (Not implemented yet.)
- %html -o option-name option-value
Sets various LaTeX generation options. The various options are
explained below. (Not implemented in 2.2.)
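As an illustration of the commands above, a skeleton input file might
look like the following sketch. The file names, the root URL, and the
document layout are assumptions made for this example and do not come
from the html2tex distribution. The thebibliography environment around
`%html -b' is also an assumption, based on the fact that LaTeX expects
bibitems to appear inside such an environment.

```latex
\documentclass{article}
% Root URL of the directory of the input file (hypothetical URL).
% Given before any HTML file is included, as required above.
%html -r http://www.example.org/manual/
\begin{document}
\title{Example manual}
\maketitle

% Include three (hypothetical) HTML files; the number is the level.
%html intro.html 1
%html usage.html 1
%html details.html 2

% Assumption: the skeleton supplies the enclosing environment and
% html2tex only generates the bibitems at this place.
\begin{thebibliography}{99}
%html -b
\end{thebibliography}
\end{document}
```

If this file is saved under the name manual (without extension), the
output would be written to manual.tex, following the naming rule given
under Functionality.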
Special commands in the HTML files
The following special commands (inside HTML comments) are
recognized in the HTML files:
- latex latex-commands
Causes the latex-commands to be copied to the LaTeX output file.
Use `&amp;', `&lt;', `&gt;', and the dash entity (rendered here as
`‐') for the characters `&', `<', `>', and `-' respectively.
- latex-off
- latex-on
Cause the HTML text and tags between latex-off and latex-on to be
omitted from the generated LaTeX files. These special commands are
recognized as tags and should be placed at the proper places with
respect to the other tags. They can be nested.
- latex-def tag-name options "LaTeX-open"
"LaTeX-close"
Changes the mapping of the tag-name HTML tag to
the given LaTeX formatting commands. Follows the same rules as the
special command `%html -d' in the input file, except that
`&amp;', `&lt;', `&gt;', and the dash entity (rendered here as `‐')
should be used for the characters `&',
`<', `>', and `-' respectively.
See below for a detailed description.
- latex-rep latex-commands
Causes the latex-commands to be copied to the LaTeX output file,
just like `--latex latex-commands--', but if it
occurs inside a normal HTML tag, it replaces the LaTeX output that
would otherwise have been generated. (Does not work in version
2.2)
- latex-opt option-name option-value
Causes the LaTeX generation option option-name to
be set to the value of option-value. The various
options are explained below. (Not
implemented in 2.2.)
The program recognizes comments inside a pair of double dashes (--)
in any of the HTML tags, including <! >. It also
recognizes any text in a <! > tag that is not surrounded by
double dashes as a comment, but generates a warning message
for it.
Defining mappings
As described above, the mappings of HTML tags to LaTeX can
be changed both in the input file (as a line of
the form %html -d tag-name options "LaTeX-open"
"LaTeX-close") and inside comments
in the HTML files (in the form of latex-def tag-name options
"LaTeX-open" "LaTeX-close").
Both forms change the mapping of the tag-name HTML
tag to the given LaTeX formatting commands. The strings LaTeX-open
and LaTeX-close are put around the text that is
marked by the HTML tag. (The string in LaTeX-close
is generated at the proper place, in case the closing tag is not
obligatory in the HTML syntax.) If the LaTeX command has to include a
double quote one should use two double quotes in the string. If a
real newline (the `\n' character) has to be included, use `\nl'
instead. (There is no LaTeX command starting with this sequence, but
there are many starting with `\n'.)
The options are used for special kinds of translation. The
following options are possible:
- -math
To be used for math mode. This mode assumes that everything
inside the tags is correct for the LaTeX math environment. The
contents are copied literally, except for # and %,
which are quoted.
- -iim
To be used in combination with -math to ignore the HTML
tags for italics as LaTeX math mode uses italics by default.
- -off
Causes the text inside the HTML tags to be excluded from the
generated LaTeX file. The LaTeX-open string is written
to the LaTeX file (if not inside another tag with -off),
but LaTeX-close is not.
- -on
Causes the text inside the HTML tags to be included in the
generated LaTeX file. At the start of the file, generation is
switched off (one level). In the case of nested tags with -off,
-on cancels only one level.
- -verb
To be used for the LaTeX verbatim environment. Ignores all nested HTML
tags that would conflict with the LaTeX verbatim environment. (Partially
operational in 2.2.)
- -alltt
To be used for the alltt LaTeX environment, which is like
verbatim, but allows some additional formatting. (Equivalent to
-verb in 2.2.)
- -br
To be used for HTML tags whose LaTeX output produces an error message
when generated on an empty line (like \newline).
- -igh
To be used for HTML tags which do not allow section commands inside
their generated LaTeX output.
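To make the quoting rules and options above concrete, here are a few
hypothetical redefinitions as they might appear in the input file. The
particular choices (typewriter text for CODE, suppressing ADDRESS,
straight quotes for Q, a rule for HR) are assumptions chosen for
illustration, not recommended settings.

```latex
% Typeset <CODE> in typewriter type instead of math mode.
%html -d code "{\tt " "}"
% Leave the contents of <ADDRESS> out of the generated LaTeX.
%html -d address -off "" ""
% A double quote inside a string is written as two double quotes,
% so each string below consists of a single " character.
%html -d q """" """"
% \nl stands for a real newline in the output.
%html -d hr "\nl\nl\hrule\nl\nl" ""
```

The same redefinitions could be given inside an HTML comment with
latex-def, in which case the escaping rules for HTML comments described
earlier apply.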
Because HTML files can be included at different levels, the
heading tags (H1 to H6) do not refer to the
heading tags as they occur in the HTML files, but to their translated
equivalents. For this reason, we have added an additional H7
tag for an additional nested level. When heading tags are used
inside other tags as a means of formatting, they are internally
translated to the F1 to F7 tags.
The default settings (for Version 2.5, slightly different from
Version 2.2) are the ones given below, using the format to be used in
the input file:
%html -d html "" ""
%html -d head "" ""
%html -d title "" ""
%html -d body -on "" ""
%html -d address "" ""
%html -d h1 "\nl\nl\chapter{" "}\nl\nl"
%html -d h2 "\nl\nl\section{" "}\nl\nl"
%html -d h3 "\nl\nl\subsection{" "}\nl\nl"
%html -d h4 "\nl\nl\subsubsection{" "}\nl\nl"
%html -d h5 "\nl\nl\paragraph{" "}\nl"
%html -d h6 "\nl\nl\subparagraph{" "}\nl"
%html -d h7 "" ""
%html -d f1 "{\LARGE \bf " "}"
%html -d f2 "{\Large \bf " "}"
%html -d f3 "{\large \bf " "}"
%html -d f4 "{\bf " "}"
%html -d f5 "{\small \bf " "}"
%html -d f6 "{\footnotesize \bf " "}"
%html -d p "\nl\nl" ""
%html -d ul -igh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d menu -igh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d dir -igh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d ol -igh "\nl\begin{enumerate}" "\nl\end{enumerate}\nl"
%html -d li "\nl\item " ""
%html -d lh "\nl\item " ""
%html -d dl -igh "\nl\begin{description}" "\nl\end{description}\nl"
%html -d dt "\nl\item[" "]"
%html -d dd "" ""
%html -d a "" ""
%html -d q "``" "''"
%html -d i -iim "{\em " "}"
%html -d em "{\em " "}"
%html -d b "{\bf " "}"
%html -d strong "{\bf " "}"
%html -d tt "{\tt " "}"
%html -d samp "{\tt " "}"
%html -d kbd "{\tt " "}"
%html -d var "{\sl " "}"
%html -d dfn "{\sc " "}"
%html -d code -math "$" "$"
%html -d blink "" ""
%html -d cite "\begin{quote} " "\end{quote}\nl"
%html -d blockquote -igh "\begin{quotation} " "\end{quotation}"
%html -d bq -igh "\begin{quotation} " "\end{quotation}"
%html -d u "\underbar{" "}"
%html -d pre -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d xmp -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d listing -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d br -br "\newline\nl" ""
%html -d hr "\vspace{1mm}\hrule " ""
%html -d img "" ""
%html -d isindex "" ""
%html -d select "" ""
%html -d link "" ""
%html -d center "{\centering " "}"
%html -d meta "" ""
%html -d table "" ""
%html -d tr "" ""
%html -d td "" ""
Options
The options can be used to configure the LaTeX fragments that are
generated by the program for the various kinds of references. The
options can be given in the input file (as a line
of the form %html -o option-name option-value),
and inside comments in the HTML files (in the
form of latex-opt option-name option-value).
There are options that determine in which cases references
should be generated. For example, an HTML file will often
contain an HREF tag wherever an email address
is given, so that it can be used to send an email. As the essential
information is already provided in the text, it is not necessary to
include it in a footnote or a bibliographic entry. The following
options can be used for this purpose:
- dni_email [on|off]
This option determines whether email addresses are included in the
references/bibliography, if they appear in the text.
- dni_news [on|off]
This option determines whether newsgroups are included in the
references/bibliography, if they appear in the text.
- dni_ftp [on|off]
This option determines whether ftp addresses are included in the
references/bibliography, if they appear in the text.
- dni_other [on|off]
This option determines whether all other kinds of URLs are included
in the references/bibliography, if they appear in the text.
By default all these options are on.
The references can be divided into internal and external. The
internal references are HREF tags that point to a file that is
included in the LaTeX output, and external are those that are not.
Internal references can be mapped to phrases that refer the reader to
the corresponding section. External references have to be given
completely, either as a footnote at the bottom of the page or as a
bibliographic entry. They are generated as bibliographic entries if
the input file contains a line with `%html -b' (or if the
program option -b is given), otherwise
they are generated as footnotes. There are four generation modes:
- normal: this means that the internal references are
generated with a `(cf. Section)' text, and external references as
either a footnote or a citation.
- cffn: this is the same as the above, except that the
internal references are given as a footnote with a `See Section'
text.
- fn: this is the same as the above, except that
citations are also given as a footnote. This option generates a
footnote for each kind of reference.
- none: This option prevents the generation of any
references.
These four modes can be set for three different environments,
namely: the headers, LaTeX alltt environments, and all the remaining
parts. The options for this are:
- href_in_header [normal|cffn|fn|none]
This controls the generation of HREF tags inside headers. The
default value is normal.
- href_in_alltt [normal|cffn|fn|none]
This controls the generation of HREF tags inside LaTeX alltt
environments. The default value is none.
- href [normal|cffn|fn|none]
This controls the generation of HREF tags at all other places. The
default value is normal.
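As a small sketch of how these modes might be selected in the input
file, the lines below use the `%html -o option-name option-value' form
described at the start of this section. The chosen values are only an
example. Note that -o is marked above as not implemented in version 2.2.

```latex
% Give internal references in running text as `See Section' footnotes.
%html -o href cffn
% Generate no references at all inside headers.
%html -o href_in_header none
```

Inside an HTML file the same settings would be written with latex-opt
in a comment, for instance to change the mode for one part of a page
only.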
There are also options that determine the format in which the
various kinds of references are to be generated (including the format
of the bibliographic entries). All these options make use of format
strings (like those used in C), where a percent sign followed
by a letter indicates a placeholder for a string or number that has to
be output. A double percent sign is used to denote a literal
percent sign. All these options should contain LaTeX formatting
commands. Because references can be generated in fragile environments,
`%p' has to be used at places where a `\protect'
is required in a fragile environment. Also, because a `\footnote'
is not allowed everywhere, a `%F' has to be used instead.
These are the options for internal references:
- filenr format-string (Default value: "f%d")
This option is used to specify the format of the file references.
- label format-string (Default value: "%p\label{%f}")
This option is used to specify the format of a label without an
additional name part. "%f" indicates the place
of the file part of the label.
- label_n format-string (Default value: "%p\label{%f:%n}")
This option is used to specify the format of a label with an
additional name part. "%n" indicates the place
of the name text.
- cf format-string (Default value: " (cf. Section~%p\ref{%f})")
This option is used to specify the format of an internal reference
without an additional name part in the running text.
- cf_n format-string (Default value: " (cf. Section~%p\ref{%f:%n})")
This option is used to specify the format of an internal reference
with an additional name part in the running text.
- f_cf format-string (Default value: "%p%F{See
also Section~\ref{%f}.}")
This option is used to specify the format of an internal reference
without an additional name part inside a footnote.
- f_cf_n format-string (Default value: "%p%F{See
also Section~\ref{%f:%n}.}")
This option is used to specify the format of an internal reference
with an additional name part inside a footnote.
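As a sketch of how these format strings might be overridden, the lines
below replace the default `cf.' wording by `see'. They assume that -o
takes the format string in double quotes, exactly as the default values
are written above. The source does not spell out the quoting rules for
option values, so this should be checked against the program.

```latex
% Internal reference in running text, without and with a name part.
%html -o cf " (see Section~%p\ref{%f})"
%html -o cf_n " (see Section~%p\ref{%f:%n})"
% The same reference when it is given as a footnote.
%html -o f_cf "%p%F{See Section~\ref{%f}.}"
```

The `%p' and `%F' placeholders are kept in the same positions as in
the defaults, so that the program can still insert `\protect' and
choose the footnote command where needed.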
The options for external references as footnotes are:
- f_news format-string (Default value: "%p%F{See
URL news:%n}")
This option is used to specify the format for a newsgroup. "%n"
indicates the place of the newsgroup name.
- f_mailto format-string (Default value: "%p%F{See
URL mailto:%m}")
This option is used to specify the format for an email address. "%m"
indicates the place of the email address.
- f_ftp format-string (Default value: "%p%F{See
URL ftp://%s}")
This option is used to specify the format for an ftp site. "%s"
indicates the place of the site.
- f_ftp_d format-string (Default value: "%p%F{See
URL ftp://%s/%d}")
This option is used to specify the format for a directory on an ftp
site. "%d" indicates the place of the directory
path.
- f_ftp_f format-string (Default value: "%p%F{See
URL ftp://%s/%f}")
This option is used to specify the format for a file on an ftp
site. "%f" indicates the place of the file name.
- f_ftp_df format-string (Default value: "%p%F{See
URL ftp://%s/%d/%f}")
This option is used to specify the format for a file, in a
directory on an ftp site.
- f_URL format-string (Default value: "%p%F{See
URL %U}")
This option is used to specify the format of a URL without an
additional name part. "%U" indicates the place
of the URL.
- f_URL_n format-string (Default value: "%p%F{See
URL %U\#%n}")
This option is used to specify the format of a URL with an additional
name part. "%U" indicates the place of the URL.
The options for citations are:
- citenr format-string (Default value: "b%d")
This option is used to specify the format of the citation labels.
- cite format-string (Default value: "%p\cite{%c}")
This option is used to specify the format of a normal citation
without an additional name part. "%c" indicates
the place of the citation label.
- cite_n format-string (Default value: "%p\cite[%n]{%c}")
This option is used to specify the format of a normal citation with
an additional name part. "%n" indicates the
place of the name text.
- f_cite format-string (Default value: "%p%F{See
\cite{%c}}")
This option is used to specify the format of a citation as a
footnote without an additional name part.
- f_cite_n format-string (Default value: "%p%F{See
\cite[%n]{%c}}")
This option is used to specify the format of a citation as a
footnote with an additional name part.
The options for the bibliographic entries are:
- b_news format-string (Default value: "news:%n")
This option is used to specify the format for a newsgroup. "%n"
indicates the place of the newsgroup name.
- b_mailto format-string (Default value: "mailto:%m")
This option is used to specify the format for an email address. "%m"
indicates the place of the email address.
- b_ftp format-string (Default value: "ftp://%s")
This option is used to specify the format for an ftp site. "%s"
indicates the place of the site.
- b_ftp_d format-string (Default value: "ftp://%s/%d")
This option is used to specify the format for a directory on an ftp
site. "%d" indicates the place of the directory
path.
- b_ftp_f format-string (Default value: "ftp://%s/%f")
This option is used to specify the format for a file on an ftp
site. "%f" indicates the place of the file name.
- b_ftp_df format-string (Default value: "ftp://%s/%d/%f")
This option is used to specify the format for a file, in a
directory on an ftp site.
- b_URL format-string (Default value: "%U")
This option is used to specify the format of a URL without an
additional name part. "%U" indicates the place
of the URL.
- b_URL_n format-string (Default value: "%U\#%n")
This option is used to specify the format of a URL with an additional
name part. "%U" indicates the place of the URL.
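For example, external URLs could be set in typewriter type both in
footnotes and in the bibliography. The lines below are a sketch under
the same assumption as before, namely that the option value is given as
a quoted string in the form used for the default values. The choice of
\tt is only an illustration.

```latex
% Footnote form of an external URL, without and with a name part.
%html -o f_URL "%p%F{See URL {\tt %U}}"
%html -o f_URL_n "%p%F{See URL {\tt %U\#%n}}"
% Bibliographic entry form of the same URLs.
%html -o b_URL "{\tt %U}"
%html -o b_URL_n "{\tt %U\#%n}"
```

The footnote formats only take effect when external references are
generated as footnotes, and the bibliographic formats only when
`%html -b' (or the -b program option) is used.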
The following options deal with the formatting of all kinds of
references. They make it possible to add additional formatting around
the anchor text or the image tag. The "%R"
indicates the place where the reference should be placed. This can
either be an internal or an external reference, in the running text
or as a footnote. In case the "%R" appears in a
fragile environment, it should be changed into "%fR".
In case it appears in a place where a \footnote would not
be proper, a combination of an "%mR" and a "%tR"
can be used to indicate the place of the footnote marker and the
footnote text, respectively. (An "f" can be added
if they occur in a fragile environment.)
- t_href format-string (Default value: "%R")
This option is used to specify the format to be used with a
reference in an HREF tag.
- t_img format-string (Default value: "\fbox{\tt
%n %mR}%tR")
This option is used to specify the format to be used with a
reference in an IMG tag, when there is no alternative text
specified. "%n" indicates the place of the file
name (without the path) of the image.
- t_img_r format-string (Default value: "%r")
This option is used to specify the format to be used with a
reference in an IMG tag, when there is an alternative text
specified. "%r" indicates the place of the
alternative text.
Program options
If the program is given an input file with the extension .html,
it does not generate a LaTeX output file, but only analyses the file
and the files it references (if the -s option is given).
The program recognizes the following command line options:
- -i : print info.
- -w : print warning (and info).
- -p : pedantic: does not report omissions of HTML
open and close tags.
- -s : scan HTML files that are not included. The program will
scan all HTML files that can be reached from the included files, and
that are found in the directory (and its sub-directories) of the
input file.
- -r URL : the URL of the directory in which
the program is run. This is needed to find out whether any full URL
points to a local HTML file.
- -b : make a bibliography. If this option is not given,
references to external URLs will appear in footnotes. The input file
should contain a line with %html -b.
- -cr : make a cross-reference file with the extension
.ref (which contains some additional error messages about
referential errors. In version 3.0 these should be given with the
other error messages.)
- -d: generate lots and lots of debug statements :-).
Only to be used if you want to know what it all does.
The sources
There are several versions available, which are given below. For
all versions: no warranties whatsoever are implied! Each
version has a version number and a date at the top of the source
file. Please use these for bug reports. I try to fix small bugs as
soon as possible.
- Version 2.1, March 5, 1996: Contains many improvements,
including the corrections by
Warwick Allison.
This version includes additional HTML checking, and configuration of
LaTeX output generation was added.
- Version 2.2, May 8, 1996: This is the current stable
version. I keep applying small bug-fixes to this version.
- Version 2.4, May 8, 1996: This version includes many more
customization options, but has not been tested properly.
- Version 2.5, May 17, 1996: This version was created to improve
link checking.
- Version 2.6, today, 1996: This is the current Beta-testing
version. Please try this version, and report any bugs found. This
version allows the user to determine which heading tags should be
mapped to which level. The (additional) tags starting with F have
been removed; instead, the (additional) tags L1, L2, to L9 have
been defined. No documentation is available yet. A separate
description is available.
Please check the revision history in the source for more
information. (What happened to version 2.3? I guess I skipped
that number by accident.)
Acknowledgements
I would like to thank the following people for their
contributions:
- Michael Ritzert
- Philip W. Miller
- Wolfgang Wander
- Juergen 'Fuzzy' Matern
- Warwick Allison
- Rejnold Byzio and Arno Schielke
Future plans
There are a number of things that still have to be fixed:
- Suppression of tags inside <PRE> </PRE>
tags in the generated verbatim/alltt environment. (Included in
version 2.2)
- Improvement of generating code for images using a formatting
string. For example, I myself would like to generate a box with
inside the name of the figure, followed by a reference (or footnote)
to the URL. (Included in version 2.4)
- Recognizing active URLs mentioned in the text. For example, an
HTML page could contain an email address inside a <A HREF="mailto:
"> tag. In this case the HREF tag should be ignored. (Included
in version 2.4)
- Making the "(Cf. Section )" text configurable, taking
care of references to chapters as well. Possibly also using
footnotes. (Included in version 2.5)
- Missing files: report once as missing, instead of with each
occurrence (if not -w). (Included in version 2.5)
- Suppression of certain "(Cf. Section)" texts. Distinguish
between references to super sections, references to sub
sections, and other references (to parallel sections).
- Add ability to specify which tags should be used for sections.
E.g., some people want to skip H1, and add TITLE.
- Add support for iso-latin1, ASCII characters > 127, and &#..;
codes.
- More advanced text processing. For example map " to `` or
'' depending on the context.
- Clean up *.ref file generation.
- Read in second pass only for those files that contain errors.
- More checking for PRE.
- Support for forms and tables.