Documentation for the AscToPDF conversion utility : Understanding the PDF generated

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

Understanding the PDF generated

Before converting files to PDF, AscToPDF first attempts to analyse your document looking for the following components.

Text layout

Paragraph detection

Indentation detection

Bullets and list detection

Text formatting

Emphasis detection

Unix Emphasis character detection

Headings and section titles

Numbered heading detection

Capitalised heading detection

Underlined heading detection

Embedded heading detection

Key phrase headings

Numbered paragraph detection

Diagrams and tables

Line detection

Form feed page markers

User defined pre-formatted text

Automatically detected pre-formatted text

Table detection

Code sample detection

ASCII art and diagram detection

Text block detection

Other formatted text

Text layout

The software can detect several types of text layout. For more details see the following topics.

Paragraph detection

Indentation detection

Hanging paragraph indent detection

Bullets and list detection

Paragraph detection

AscToPDF can automatically detect paragraphs in your document. Normally this is done by detecting blank lines between paragraphs, but when there are no blank lines other features such as short lines at the end of a paragraph and an offset at the start of each new paragraph may also be taken into account.
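
The blank-line case described above can be sketched in a few lines of Python. This is only an illustration, not AscToPDF's own code; the function name split_paragraphs is hypothetical, and the short-line and first-line-offset heuristics are deliberately left out.

```python
def split_paragraphs(text):
    """Group lines into paragraphs, using blank lines as separators."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Several blank lines in a row are treated as a single separator.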

Indentation detection

AscToPDF performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.

In calculating the indent positions AscToPDF first converts all tabs to spaces. This may result in unexpected indent positions, but shouldn't normally be a problem. If it is, adjust the Tab size policy.

AscToPDF may reject indentations that appear too close together, so as to keep the number of indent levels manageable.

You can override the analysis by specifying your own indentation policy. This can sometimes be useful to add an extra indentation level, or to better match up bullet paragraphs with non-bullet paragraphs.
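
As a rough sketch of this kind of analysis, the Python below expands tabs, counts how often each indent column occurs, and drops columns that are rare or too close to an accepted one. The function name indent_levels and the parameters tab_size, min_share and min_gap are assumptions made for the example; the thresholds AscToPDF actually uses are not documented here.

```python
from collections import Counter


def indent_levels(lines, tab_size=8, min_share=0.1, min_gap=2):
    """Return the indent columns that occur often enough to count as levels."""
    counts = Counter()
    for line in lines:
        expanded = line.expandtabs(tab_size)  # tabs become spaces first
        if expanded.strip():
            counts[len(expanded) - len(expanded.lstrip(" "))] += 1
    total = sum(counts.values())
    levels = []
    for column in sorted(counts):
        if counts[column] / total < min_share:
            continue  # too rare to be a deliberate indent
        if levels and column - levels[-1] < min_gap:
            continue  # too close to the previous accepted level
        levels.append(column)
    return levels
```

Changing tab_size in the sketch shifts every tab-indented line, which is why an unexpected tab size can produce unexpected indent positions.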

Hanging paragraph indent detection

Some documents have hanging paragraph indents. That is, the first line of each paragraph starts at an offset to the rest of the paragraph.

AscToPDF struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If writing a text file from scratch with AscToPDF in mind, then it is best to avoid this practice.

Bullets and list detection

AscToPDF detects and supports several types of bullets and lists. This has the effect of putting the bulleted text one level of indentation to the right of the current text.

Should the analysis fail, you can override any and all of these via the analysis bullet policies.

Bullet paragraphs

AscToPDF will attempt to detect bullet paragraphs, that is, paragraphs that belong to the bullet point. To do this it attempts to match the indentation of follow-on lines with the position of the text that follows the bullet character(s) on the bullet line itself.

Currently this detection only stretches to the paragraph containing the bullet.

Possible problems

1) Numbered bullets may sometimes get confused with numbered sections. This can be corrected by switching off numbered sections (if there aren't any), replacing the numbered bullets by letters or roman numerals, or by moving the numbered bullets to a different indentation level from the section numbers.

2) AscToPDF currently only detects the first paragraph belonging to a bullet. If the bullet has several paragraphs there may be alignment problems, as the positioning of the second and subsequent paragraphs will depend on the indentation policy. Sometimes careful balancing of the indentations and the indentation policies can sort the problem.

Bullet chars

Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet text. AscToPDF can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
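
One simple way to picture this analysis is to count which character most often starts a line and is followed by a space. The Python below is a sketch under that assumption; likely_bullet_char and min_count are hypothetical names, and the real analysis may weigh other evidence.

```python
from collections import Counter


def likely_bullet_char(lines, min_count=2):
    """Guess the bullet character, or return None if nothing stands out."""
    counts = Counter()
    for line in lines:
        stripped = line.lstrip()
        if len(stripped) > 2 and stripped[1] == " ":
            ch = stripped[0]
            # Punctuation qualifies; 'o' is allowed as a special case.
            if ch == "o" or not ch.isalnum():
                counts[ch] += 1
    if not counts:
        return None
    ch, seen = counts.most_common(1)[0]
    return ch if seen >= min_count else None
```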

Numbered bullet detection

AscToPDF can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

Alphabetic bullet detection

AscToPDF detects upper and lower case alphabetic bullets.

Roman Numeral bullet detection

AscToPDF detects upper and lower case roman numeral bullets.
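
The three labelled bullet types can be told apart, roughly, with a pattern match on the label. The Python below is a minimal sketch, not the program's own logic: classify_bullet is a hypothetical name, the roman pattern only covers 1 to 39, and a lone "i", "v" or "x" is treated as roman even though it could equally be alphabetic. A real detector would use the surrounding sequence to decide.

```python
import re

# Label followed by "." or ")" and then some text, e.g. "1) text" or "iv. text".
BULLET = re.compile(r"^\s*\(?([0-9]+|[A-Za-z]+)[.)]\s+\S")
# Roman numerals 1..39 only; enough for an illustration.
ROMAN = re.compile(r"^(?=.)x{0,3}(ix|iv|v?i{0,3})$", re.IGNORECASE)


def classify_bullet(line):
    """Return 'numbered', 'roman', 'alphabetic' or None for one line."""
    match = BULLET.match(line)
    if not match:
        return None
    label = match.group(1)
    if label.isdigit():
        return "numbered"
    if ROMAN.match(label):
        return "roman"  # note: a lone i, v or x is ambiguous
    if len(label) == 1:
        return "alphabetic"
    return None
```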

Text formatting

In addition to various types of formatted text layouts, the software can detect a number of special types of text formatting, including the following.

Emphasised text

Unix emphasis characters

Emphasis detection

AscToPDF can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. Text enclosed in asterisks is converted to bold, and text enclosed in underscores to italic.

AscToPDF will also look for combinations of asterisks and underscores, which will be rendered in bold italic. The asterisks and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToPDF fails to match opening and closing emphasis marks, the characters are left unconverted.

Tests are made to ignore double asterisks and underscores, and sometimes adjacent punctuation will prevent the text being marked up.

Only markup that occurs in matched pairs over 2-3 lines will be converted, so _this and that* won't be converted.
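
The matched-pair rule can be illustrated with two regular expressions. This is a sketch only: the names BOLD, ITALIC and find_emphasis are hypothetical, nested bold italic is not handled, and the limit on how many lines a phrase may span is not enforced.

```python
import re

# An opening mark must not follow a word character or another mark, and the
# closing mark must not be followed by one. Doubled marks are ignored.
BOLD = re.compile(r"(?<![*\w])\*(?!\*)(\S(?:[^*]*?\S)?)\*(?![*\w])")
ITALIC = re.compile(r"(?<![_\w])_(?!_)(\S(?:[^_]*?\S)?)_(?![_\w])")


def find_emphasis(text):
    """Return (style, phrase) pairs for matched emphasis marks."""
    spans = [("bold", m.group(1)) for m in BOLD.finditer(text)]
    spans += [("italic", m.group(1)) for m in ITALIC.finditer(text)]
    return spans
```

In this sketch the mismatched pair in "_this and that*" yields nothing, and identifiers such as some_variable_name are left alone because the underscores sit inside a word.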

Unix emphasis character detection

AscToPDF also tries to handle use of Ctrl-H in Unix documents. In such documents Ctrl-H can be used to overstrike characters. Common effects are double printing and underlining. Where detected AscToPDF will use bold and underlining markup.

Examples could include:-

The word this^H^H^H^H____ is underlined. The word that^H^H^H^Hthat is bold (overwritten twice).
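
Overstriking can be modelled as a cursor moving over a row of character cells, with Ctrl-H (the backspace character) moving the cursor left. The Python below is a sketch of that idea, assuming one line of input; decode_overstrike is a hypothetical name and the cell layout is chosen for the example.

```python
def decode_overstrike(text):
    """Return (char, bold, underline) tuples for one line of text."""
    cells = []  # each cell is [char, bold, underline]
    pos = 0
    for ch in text:
        if ch == "\b":  # Ctrl-H: step back one column
            pos = max(pos - 1, 0)
        elif pos < len(cells):
            cell = cells[pos]
            if ch == "_":
                cell[2] = True  # underscore printed over a character
            elif cell[0] == "_":
                cell[0], cell[2] = ch, True  # character printed over an underscore
            elif ch == cell[0]:
                cell[1] = True  # same character printed twice
            else:
                cell[0] = ch  # plain overwrite
            pos += 1
        else:
            cells.append([ch, False, False])
            pos += 1
    return [tuple(cell) for cell in cells]
```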

Headings and section titles

AscToPDF recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToPDF will use the standard "Heading n" styles.

In addition to this, AscToPDF will insert a bookmark to allow direct access via the PDF bookmarks feature.

Numbered heading detection

Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported.
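
A consistency check on N.N.N numbers might look something like the Python below, which accepts a heading only if its number is the next sibling at some level or the first child of the previous heading. The names parse_number and follows are hypothetical, and the rules AscToPDF applies may be stricter or looser than this.

```python
import re

SECTION = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def parse_number(line):
    """Return the section number as a tuple of ints, or None."""
    match = SECTION.match(line)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def follows(prev, cur):
    """True if section number cur is a plausible successor of prev."""
    # Next sibling at some level: 2.1.3 -> 2.1.4, 2.2 or 3.
    for depth in range(1, len(prev) + 1):
        if cur == prev[:depth - 1] + (prev[depth - 1] + 1,):
            return True
    # First child: 2.1 -> 2.1.1.
    return cur == prev + (1,)
```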

Capitalised heading detection

AscToPDF can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.

Underlined heading detection

AscToPDF can recognize underlined text (e.g. a row of minus signs), and optionally promote the preceding line to be a section header.

The "underlining" line should have no gaps in it, and should be a similar length to the preceding heading. If these conditions aren't met you'll probably get a horizontal rule instead.

If you're authoring a file from scratch, it is probably best to use underlined headings for ease of use.
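
The two conditions on the underlining line (no gaps, similar length) can be expressed as a small check. The Python below is a sketch: is_underlined_heading and tolerance are hypothetical, and the set of accepted underline characters other than the minus sign is an assumption.

```python
def is_underlined_heading(title, underline, tolerance=2):
    """True if underline looks like an unbroken rule matching title's length."""
    title, underline = title.strip(), underline.strip()
    if not title or len(underline) < 3:
        return False
    if len(set(underline)) != 1 or underline[0] not in "-=_*~":
        return False  # gaps or mixed characters
    return abs(len(title) - len(underline)) <= tolerance
```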

Embedded heading detection

The program can look for headings "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph. Where detected the heading will be marked up in bold, rather than given a "Heading n" style, although it will still be added to, and accessible from, any hyperlinked contents list you generate for the document.

At present such headings are not auto-detected; you need to switch on the Expect Embedded headings policy.

Key phrase headings

The program can now look for lines that start with particular words or phrases of your choice (such as "Chapter", "Part" or "Title") and treat these lines as headings. Previously this only worked in a limited way if the heading line was also numbered ("Chapter 1" etc.).

To use this feature, set the policy Heading Key phrases.
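
In essence the test is whether a line begins with one of the chosen phrases as a whole word. A minimal Python sketch, with the hypothetical name is_key_phrase_heading and the phrases from the example above as defaults:

```python
def is_key_phrase_heading(line, key_phrases=("Chapter", "Part", "Title")):
    """True if the line starts with one of the key phrases as a whole word."""
    stripped = line.strip()
    return any(
        stripped == phrase or stripped.startswith(phrase + " ")
        for phrase in key_phrases
    )
```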

Numbered paragraph detection

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToPDF can recognize this, and mark up such lines by placing the number in bold, and not using the "Heading n" style on the whole line.

Mail and USENET headers

Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.

AscToPDF can recognize these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.
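
Extracting the three fields from a block of header lines can be sketched with Python's standard email package. This shows the general idea only and says nothing about how AscToPDF itself parses headers; summarise_headers is a hypothetical name.

```python
from email.parser import HeaderParser


def summarise_headers(raw):
    """Pull Subject, From and Date out of a block of mail-style headers."""
    message = HeaderParser().parsestr(raw)
    return {
        "subject": message["Subject"],
        "author": message["From"],
        "date": message["Date"],
    }
```

A missing header simply comes back as None.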

Pre-formatted text

The software can detect various forms of pre-formatted text. This is text laid out in such a way that the spacing used is critical. Spacing is not normally preserved in conversion to PDF, so the correct detection and handling of these special types of text is quite important.

Types of text recognised include the following:

Lines

Form feed page markers

User defined pre-formatted text

Automatically detected pre-formatted text

Tables

Code samples

Diagrams and ASCII art

Text blocks

Other formatted text

Line detection

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules.

An attempt is made to interpret half-lines etc as such, although the effect is only approximate.

Form feed page markers

Form feeds or page breaks become page breaks in the PDF.

User defined pre-formatted text

AscToPDF allows users to define their own regions of pre-formatted text, using the BEGIN_PRE and END_PRE pre-processor tags (see Using the pre-processor).

For example :-

      The use of BEGIN_PRE and END_PRE preprocessor
        commands (see 7.1) in
          the text documents
            tells AscToPDF that
              this portion of the
            document
          has been formatted
        by the user and
      should be left unchanged.

Automatically detected pre-formatted text

AscToPDF attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToPDF analyses the section to determine what type of pre-formatted text it is. Options include

Tables

Code samples

ASCII Art and diagrams

some other formatted text

A number of policies allow you to control

whether or not the program looks for such text

how sensitive it is to "pre-formatted" text

how inclined the program is to "extend" the region to adjacent lines

whether or not table generation should be attempted

various aspects of any table analysis that is carried out.

See Pre-formatted text policies for full details.

You can adjust the sensitivity of AscToPDF to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the Minimum size of automatic <PRE> section policy.

When AscToPDF detects such regions it sets them in a fixed-width font so that the original spacing is preserved.

When tables are detected, AscToPDF will attempt to generate the correct PDF table.

When AscToPDF gets the detection wrong you can use the AscToPDF pre-processor to mark up regions of your document you wish preserved.
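
The effect of a minimum-size setting can be pictured with the Python below, which treats any line containing a run of two or more spaces between words as "pre-formatted looking" and keeps only runs of at least min_lines such lines. Both the test for a suspicious line and the names GAP, preformatted_runs and min_lines are assumptions for the example; the program's own criteria are more involved.

```python
import re

GAP = re.compile(r"\S {2,}\S")  # two or more spaces between words


def preformatted_runs(lines, min_lines=2):
    """Return (start, end) index pairs of candidate pre-formatted regions."""
    runs, start = [], None
    for i, line in enumerate(list(lines) + [""]):  # sentinel closes a final run
        if GAP.search(line):
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_lines:
                runs.append((start, i))
            start = None
    return runs
```

Raising min_lines makes the sketch less sensitive, since short runs are then treated as ordinary text.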

Table detection

Tables are marked out by their use of white space, and by a regular pattern of gaps or vertical bars being spotted on each line. AscToPDF will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.

Should AscToPDF wrongly detect the extent of a table, you can mark up a section of text by using the TABLE pre-processor markup (see the Tag manual). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.

You can alter the characteristics of all or individual tables via the table pre-processor commands (see TABLE).
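
The use of white space to find columns can be illustrated by looking for character positions that are blank on every line of the candidate table. The Python below is a sketch of that one idea, with the hypothetical names column_spans and split_row; headings, alignment and spanning cells are not attempted.

```python
def column_spans(rows):
    """Return (start, end) character spans of columns separated by blank gaps."""
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    blank = [all(row[col] == " " for row in padded) for col in range(width)]
    spans, start = [], None
    for col, is_blank in enumerate(blank + [True]):  # sentinel closes last span
        if not is_blank and start is None:
            start = col
        elif is_blank and start is not None:
            spans.append((start, col))
            start = None
    return spans


def split_row(row, spans):
    """Cut one row into cell strings using the detected spans."""
    return [row[start:end].strip() for start, end in spans]
```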

Code sample detection

AscToPDF attempts to recognize code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.

Should AscToPDF wrongly detect the extent of a code fragment, you can mark up a section of text by using the CODE pre-processor markup.

Or you can suppress the whole thing altogether via the policy Expect code samples.
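
A crude version of the ";" indicator is to measure what share of non-blank lines end in a typical statement terminator. In the Python sketch below, looks_like_code and threshold are hypothetical, and counting braces as well as semicolons is an assumption that goes beyond what is described above.

```python
def looks_like_code(lines, threshold=0.5):
    """True if enough non-blank lines end like C++ or Java statements."""
    candidates = [line.rstrip() for line in lines if line.strip()]
    if not candidates:
        return False
    hits = sum(1 for line in candidates if line.endswith((";", "{", "}")))
    return hits / len(candidates) >= threshold
```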

ASCII art and diagram detection

AscToPDF attempts to recognize ASCII art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.

However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.

Should AscToPDF wrongly detect the extent or type of a diagram, you can mark up a section of text by using the DIAGRAM pre-processor markup.
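
The "large numbers of non-alphanumeric characters" indicator can be sketched as a simple ratio. The Python below is an illustration only: looks_like_diagram and the 0.4 threshold are assumptions, and a table drawn with line characters would pass the same test, which is exactly the confusion described above.

```python
def looks_like_diagram(lines, threshold=0.4):
    """True if a high share of the visible characters are symbols."""
    chars = [ch for line in lines for ch in line if not ch.isspace()]
    if not chars:
        return False
    symbols = sum(1 for ch in chars if not ch.isalnum())
    return symbols / len(chars) >= threshold
```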

Text block detection

If AscToPDF detects a block of text at a large indent, it will now place that text in such a way as to preserve as faithfully as possible the original indent.

Other formatted text

If AscToPDF detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with the original line structure preserved.

In such regions other markup (such as bullets) may not be processed as it would be elsewhere.
