Documentation for the AscToHTM conversion utility : How it works

Documentation for the AscToHTM Text to HTML converter

How it works

AscToHTM analyses your document, looking at how the text is laid out, and trying to identify and quantify the rules used by the author to format the document. These rules are then used to set "policies" that determine how each part of the document should be interpreted. These policies are then used during the output pass to decide how the output document should be formatted.

The user can choose to manually set "Policies", thereby overriding the software's analysis, and may additionally set some policies that only apply to the output pass (such as which fonts should be used). These manual options may be saved in a policy file and reloaded the next time. Different policy files may be created for different document sets, or for different types of output.

For example, analysis might determine that a large number of lines appear to be "underlined", and that these may be headings. Once this decision has been made, underlined lines will become headings, while lines that are numbered or capitalised may not. If this is the wrong decision, the user can disable the use of underlined headings via a policy file, and can even choose to recognise capitalised headings instead.
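The underlined-heading example can be illustrated with a minimal sketch. This is not AscToHTM's actual algorithm: the regular expression, the function names and the `min_count` threshold are all assumptions made for illustration.

```python
import re

# Assumption: an "underline" is a line made of 3+ repeats of one punctuation character.
UNDERLINE_RE = re.compile(r"^\s*([-=~_*])\1{2,}\s*$")


def find_underlined_headings(lines):
    """Return (index, text) for each text line that is followed by an underline."""
    headings = []
    for i in range(len(lines) - 1):
        text = lines[i].strip()
        is_text = bool(text) and not UNDERLINE_RE.match(lines[i])
        if is_text and UNDERLINE_RE.match(lines[i + 1]):
            headings.append((i, text))
    return headings


def expect_underlined_headings(lines, min_count=2):
    """Hypothetical policy decision: only trust underlines if they recur."""
    return len(find_underlined_headings(lines)) >= min_count


sample = [
    "Introduction",
    "============",
    "",
    "Some body text.",
    "",
    "Details",
    "-------",
    "More body text.",
]

print(find_underlined_headings(sample))
print(expect_underlined_headings(sample))
```

A user override would simply replace the result of `expect_underlined_headings` with a fixed value, which is the role the policy file plays in the description above.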

Contents of this section

Assumptions made by the program
The analysis pass
The collating pass
The output pass
Standards compliance

Assumptions made by the program

AscToHTM makes one big assumption :-

Each text file has been laid out in a consistent manner by its author in a way that makes it easy for a human reader to understand

Given this, AscToHTM tries to read the text file and mark it up in HTML accordingly. This is achieved by making three passes through the document: an analysis pass, a collating pass, and an output pass.

Note: Sadly this assumption is not always true :(

The analysis pass

During the analysis pass AscToHTM gathers together all the statistics that it needs to analyse how the author has laid out the file.

For example, the distribution of line indentations and line lengths is observed, together with the number and types of bullets, section headings and other layout features.
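As a rough illustration of this kind of statistics gathering, the sketch below counts indentation widths, line lengths and bullet characters. The function name, the set of bullet characters and the tab width of 8 are assumptions, not details taken from AscToHTM.

```python
from collections import Counter


def gather_layout_stats(lines):
    """Collect simple layout statistics from the non-blank lines of a text file."""
    indents = Counter()   # indentation width -> number of lines
    lengths = Counter()   # line length -> number of lines
    bullets = Counter()   # bullet character -> number of lines

    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no indentation information
        expanded = line.expandtabs(8)  # assumption: tabs are 8 columns wide
        indents[len(expanded) - len(expanded.lstrip())] += 1
        lengths[len(expanded.rstrip())] += 1
        # Assumption: a bullet is '-', '*' or 'o' followed by a space.
        if stripped[0] in "-*o" and stripped[1:2] == " ":
            bullets[stripped[0]] += 1

    return {"indents": indents, "lengths": lengths, "bullets": bullets}


stats = gather_layout_stats([
    "Title",
    "  - first item",
    "  - second item",
    "",
    "    indented text",
])
print(stats["indents"].most_common(1))
print(dict(stats["bullets"]))
```

The most common indentation and the dominant bullet character are the sort of values that a later step could turn into policies.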

Once this has been done, the program uses this data to determine how the author has structured the document. For example, are the section headings underlined, capitalised or numbered? If numbered, what style of numbering is used, and at what level of indentation is the heading placed?

This information is then used to set the analysis policies (see the Policy manual), which may then be overridden by the user, or by loading a policy file with different values.
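The override behaviour can be sketched as a simple merge in which user-supplied values win over calculated ones. The "name : value" line format and the policy names below are illustrative assumptions only; the real policy file syntax is described in the Policy manual.

```python
def parse_policy_text(text):
    """Parse hypothetical 'name : value' lines; blank lines and '#' comments are skipped."""
    policies = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:  # ignore lines that have no ':' separator
            policies[name.strip().lower()] = value.strip()
    return policies


def effective_policies(calculated, user):
    """User policies override the values calculated by the analysis pass."""
    merged = dict(calculated)
    merged.update(user)
    return merged


# Hypothetical values, as the analysis pass might have set them.
calculated = {
    "underlined headings": "yes",
    "capitalised headings": "no",
    "page width": "79",
}

user_text = """
# hypothetical overrides supplied by the user
Underlined headings : no
Capitalised headings : yes
"""

print(effective_policies(calculated, parse_policy_text(user_text)))
```

Policies that the user does not mention (here "page width") keep their calculated values.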

The collating pass

Having performed the analysis, the program makes a second "collating" pass. This is effectively a dry run for the output pass.

During this pass the program determines how the file will be output, what headings there are and where certain key in-line tags occur.

It also assembles any contents list.

This information is then used during the output pass to reduce the likelihood of errors, and to ensure all internal hyperlinks are valid and will point to the correct file location.
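A minimal sketch of what such a dry run could collect is shown below: one anchor per heading, a contents list built from those anchors, and a check for links that point nowhere. The anchor naming scheme and all function names are assumptions for illustration, not AscToHTM's own.

```python
import re


def make_anchor(text, used):
    """Build a unique anchor name from heading text (hypothetical naming scheme)."""
    base = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-") or "section"
    anchor, n = base, 2
    while anchor in used:  # disambiguate repeated headings
        anchor = f"{base}-{n}"
        n += 1
    used.add(anchor)
    return anchor


def collate(headings):
    """Return the contents list as (heading text, anchor) pairs."""
    used = set()
    return [(text, make_anchor(text, used)) for text in headings]


def broken_links(contents, link_targets):
    """List link targets that match no anchor found during collation."""
    known = {anchor for _, anchor in contents}
    return [target for target in link_targets if target not in known]


contents = collate(["The analysis pass", "The output pass", "The output pass"])
print(contents)
print(broken_links(contents, ["the-output-pass", "missing-section"]))
```

Because the anchors are known before any HTML is written, the output pass can emit links with confidence that their targets exist.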

The output pass

During the output pass AscToHTM generates the HTML file (there's nothing like stating the obvious :-)

The HTML generated depends only on the original document, the calculated document policy, and any user policies supplied.
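One way to picture this dependency is as a function whose result is fixed by its inputs. The sketch below assumes the document has already been split into (kind, text) blocks and that a hypothetical "heading level" policy exists; neither is taken from AscToHTM itself.

```python
from html import escape


def render_html(blocks, policies):
    """Render (kind, text) blocks to HTML using only the blocks and the policies."""
    level = int(policies.get("heading level", 2))  # hypothetical policy name
    out = []
    for kind, text in blocks:
        if kind == "heading":
            out.append(f"<H{level}>{escape(text)}</H{level}>")
        else:
            out.append(f"<P>{escape(text)}</P>")
    return "\n".join(out)


blocks = [("heading", "A & B"), ("para", "x < y")]
print(render_html(blocks, {"heading level": 3}))
```

Running the same blocks through the same policies always yields the same markup, which is what makes saved policy files reproducible across conversions.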

The section "Understanding the HTML generated" describes the markup produced in more detail.

Standards compliance

Earlier versions of AscToHTM (before version 3.2) made no real attempt to be standards compliant. Now standards compliance is a stated goal of the program. Sadly I can't guarantee standards compliance because the HTML generation is so complex that errors can and do occur, but it is a goal, and usually documents will validate with few problems.

Compliance has proved to be vital to get cross-browser compatibility, and to stand a chance of successfully applying CSS to created pages.

Original versions of AscToHTM were (loosely) targeted at producing HTML 3.2 code.

Currently the software is targeted at "HTML 4.0 Transitional", which allows CSS, but also permits <FONT> tags (although these are deprecated). This is a compromise standard that is best placed to be well viewed by V3 and V4 browsers.
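For reference, a document that declares itself as HTML 4.0 Transitional starts with the W3C document type declaration shown in the sketch below. The wrapper function and its name are assumptions for illustration, and the exact header AscToHTM writes may differ.

```python
from html import escape

# The W3C document type declaration for HTML 4.0 Transitional.
HTML40_TRANSITIONAL = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">'
)


def wrap_document(title, body_html):
    """Wrap already-generated body markup in a minimal HTML 4.0 Transitional page."""
    return "\n".join([
        HTML40_TRANSITIONAL,
        "<HTML>",
        "<HEAD>",
        f"<TITLE>{escape(title)}</TITLE>",
        "</HEAD>",
        "<BODY>",
        body_html,
        "</BODY>",
        "</HTML>",
    ])


page = wrap_document("How it works", '<P><FONT FACE="Arial">Hello</FONT></P>')
print(page)
```

The <FONT> tag in the example body is accepted by the Transitional DTD, although it is deprecated in favour of CSS.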
