Documentation for the AscToPDF conversion utility
The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html
Before converting files to PDF, AscToPDF first attempts to analyse your document looking for the following components.
The software can detect several types of text layout. For more details see the following topics.
AscToPDF can automatically detect paragraphs in your document. Normally this is done by detecting blank lines between paragraphs, but when there are no blank lines other features such as short lines at the end of a paragraph and an offset at the start of each new paragraph may also be taken into account.
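As a rough illustration of the blank-line rule only, paragraph splitting might be sketched as follows. This is not AscToPDF's actual code; the function name `split_paragraphs` is hypothetical, and the fallback cues (short last lines, first-line offsets) are deliberately left out.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs wherever one or more blank lines occur."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the paragraph being collected.
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Runs of several blank lines are treated the same as a single blank line.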
AscToPDF performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.
In calculating the indent positions AscToPDF first converts all tabs to spaces. This may result in unexpected indent positions, but shouldn't normally be a problem. If it is, adjust the Tab size policy.
AscToPDF may reject indentations that appear too close together, so as to keep the number of indent levels manageable.
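The general idea can be sketched as below, assuming a simple frequency count. The function name, the `min_gap` and `min_count` thresholds and the default tab size are all hypothetical, chosen for illustration rather than taken from AscToPDF.

```python
from collections import Counter


def indent_positions(lines, tab_size=8, min_gap=2, min_count=2):
    """Return the indent columns that occur often enough to count as levels."""
    counts = Counter()
    for line in lines:
        expanded = line.expandtabs(tab_size)  # tabs become spaces first
        if expanded.strip():
            counts[len(expanded) - len(expanded.lstrip(" "))] += 1

    accepted = []
    for pos in sorted(counts):
        if counts[pos] < min_count:
            continue  # too rare to be a real indent level
        if accepted and pos - accepted[-1] < min_gap:
            continue  # too close to the previous accepted level
        accepted.append(pos)
    return accepted
```

Note how the tab size changes the result: a tab-indented line lands in column 4 or column 8 depending on the setting, which is why adjusting the Tab size policy can fix unexpected indent positions.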
You can override the analysis by specifying your own indentation policy. This can sometimes be useful to add an extra indentation level, or to better match up bullet paragraphs with non-bullet paragraphs.
See also Indentation policy and Bullet policies.
Some documents have hanging paragraph indents. That is, the first line of each paragraph starts at an offset to the rest of the paragraph.
AscToPDF struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.
If writing a text file from scratch with AscToPDF in mind, then it is best to avoid this practice.
AscToPDF detects and supports several types of bullets and lists. This has the effect of putting the bulleted text one level of indentation to the right of the current text.
Should the analysis fail, you can override any and all of these via the analysis bullet policies.
Bullet paragraphs
AscToPDF will attempt to detect bullet paragraphs, that is, paragraphs
that belong to the bullet point. To do this it attempts to match
the indentation of follow-on lines with that past the bullet
character(s) on the bullet line itself.
Currently this detection only stretches to the paragraph containing the bullet.
Possible problems
Bullet chars are lines of the type
    - this is a bullet line
    - this is a bullet paragraph because it carries
      over onto more lines
That is, a single character followed by the bullet line. AscToPDF can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
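One way such a statistical check could work is sketched here. It is an assumption-laden illustration, not the program's real algorithm: the name `guess_bullet_char` and the `min_count` threshold are hypothetical, and the only special case modelled is allowing the letter 'o' as a candidate.

```python
from collections import Counter


def guess_bullet_char(lines, min_count=3):
    """Guess which single character, if any, is being used as a bullet."""
    counts = Counter()
    for line in lines:
        stripped = line.lstrip()
        # A candidate is one character, then a space, then some text.
        if len(stripped) < 3 or stripped[1] != " ":
            continue
        first = stripped[0]
        # Punctuation qualifies; 'o' is the one letter allowed through.
        if first == "o" or not first.isalnum():
            counts[first] += 1
    if not counts:
        return None
    char, count = counts.most_common(1)[0]
    return char if count >= min_count else None
```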
AscToPDF can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.
AscToPDF detects upper and lower case alphabetic bullets.
AscToPDF detects upper and lower case roman numeral bullets.
In addition to various types of formatted text layouts, the software can detect a number of special types of text formatting, including the following.
AscToPDF can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. Text enclosed in asterisks is converted to bold, and text enclosed in underscores to italic.
AscToPDF will also look for combinations of asterisks and underscores which will be placed in bold italic. The asterisks and underscores should be properly nested.
The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToPDF fails to match opening and closing emphasis marks, the characters are left unconverted.
Tests are made to ignore double asterisks and underscores, and sometimes adjacent punctuation will prevent the text being marked up.
Only markup that occurs in matched pairs over 2-3 lines will be converted, so _this and that* won't be converted.
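The matching rules above can be approximated with a regular expression. The sketch below is illustrative only: the pattern, the function name `find_emphasis` and the two-line-break limit are assumptions, and nested bold-italic combinations are not handled.

```python
import re

# An opening mark not preceded by a word character or another mark,
# content free of marks, then the same mark as the closing one.
EMPHASIS = re.compile(r"(?<![\w*_])([*_])([^\s*_][^*_]*?)\1(?![\w*_])")


def find_emphasis(text, max_line_breaks=2):
    """Return (style, phrase) pairs for properly matched emphasis marks."""
    found = []
    for match in EMPHASIS.finditer(text):
        mark, phrase = match.group(1), match.group(2)
        if phrase != phrase.rstrip():
            continue  # closing mark must touch the text
        if re.search(r"\n\s*\n", phrase) or phrase.count("\n") > max_line_breaks:
            continue  # spans a blank line, or too many lines
        found.append(("bold" if mark == "*" else "italic", phrase))
    return found
```

Mismatched pairs such as _this and that* and doubled marks such as **this** produce no match, so those characters would be left unconverted.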
AscToPDF also tries to handle use of Ctrl-H in Unix documents. In such documents Ctrl-H can be used to overstrike characters. Common effects are double printing and underlining. Where detected AscToPDF will use bold and underlining markup.
Examples could include:-
    The word this^H^H^H^H____ is underlined.
    The word that^H^H^H^Hthat is bold (overwritten twice).
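To make the overstrike idea concrete, the sketch below replays a single line the way a printer would, treating Ctrl-H as the backspace character. The function name `decode_overstrike` and the style labels are hypothetical; this is one plausible reading of the behaviour, not the program's implementation.

```python
def decode_overstrike(line):
    """Return (character, style) pairs; style is 'plain', 'bold' or 'underline'."""
    cells = []   # cells[i] lists everything printed at column i
    cursor = 0
    for ch in line:
        if ch == "\b":                 # Ctrl-H: move back one column
            cursor = max(0, cursor - 1)
            continue
        if cursor == len(cells):
            cells.append([ch])
        else:
            cells[cursor].append(ch)   # overstrike an existing column
        cursor += 1

    result = []
    for cell in cells:
        glyphs = [c for c in cell if c != "_"]
        if glyphs and len(glyphs) < len(cell):
            result.append((glyphs[-1], "underline"))  # character plus underscore
        elif len(cell) > 1 and len(set(cell)) == 1:
            result.append((cell[0], "bold"))          # same character printed again
        else:
            result.append((cell[-1], "plain"))
    return result
```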
AscToPDF recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToPDF will use the standard "Heading n" styles.
In addition to this, AscToPDF will insert a bookmark for each heading to allow direct access via the PDF bookmarks feature.
Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.
At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported.
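A consistency check on N.N.N numbers could look something like the sketch below. The helper names `parse_section_number` and `follows`, and the exact rules for what counts as "in sequence", are assumptions made for illustration.

```python
import re


def parse_section_number(line):
    """Return the leading N.N.N number of a line as a tuple of ints, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)\.?\s+\S", line)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def follows(prev, cur):
    """True if section number cur is a plausible successor of prev."""
    if len(cur) == len(prev) + 1:
        # One level deeper: must be the first subsection of prev.
        return cur[:-1] == prev and cur[-1] == 1
    if len(cur) <= len(prev):
        # Same level or shallower: the last part must increase by one.
        depth = len(cur)
        return cur[:-1] == prev[:depth - 1] and cur[-1] == prev[depth - 1] + 1
    return False
```

A line such as "1995 was a good year" parses as a section number but fails the sequence test against its neighbours, which is the kind of case where a consistency check helps separate headings from numbered bullets and ordinary text.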
AscToPDF can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.
AscToPDF can recognize underlined text (e.g. a row of minus signs), and optionally promote the preceding line to be a section header.
The "underlining" line should have no gaps in it, and should be a similar length to the preceding heading. If these conditions aren't met you'll probably get a horizontal rule instead.
If you're authoring a file from scratch, it is probably best to use underlined headings for ease of use.
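The two conditions for an underlined heading (no gaps, similar length) can be expressed as a small classifier. This is a sketch under assumptions: the set of line characters, the `tolerance` of two characters and the name `classify_line` are all hypothetical.

```python
LINE_CHARS = set("-=_~*")


def classify_line(prev_line, line, tolerance=2):
    """Classify line as 'heading' (underlines prev_line), 'rule' or 'text'."""
    stripped = line.strip()
    if len(stripped) < 3 or not set(stripped) <= LINE_CHARS | {" "}:
        return "text"
    no_gaps = len(set(stripped)) == 1     # a single repeated character
    heading = prev_line.strip()
    if no_gaps and heading and abs(len(heading) - len(stripped)) <= tolerance:
        return "heading"
    return "rule"                         # gaps or a length mismatch
```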
The program can look for headings "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph. Where detected the heading will be marked up in bold, rather than in a "Heading n" style, although it will still be added to, and accessible from, any hyperlinked contents list you generate for the document.
At present such headings are not auto-detected... you need to switch on the Expect Embedded headings policy.
The program can now look for lines that start with particular words or phrases of your choice (such as "Chapter", "Part", "Title") and treat these lines as headings. Previously this only worked in a limited way if the heading line was also numbered ("Chapter 1" etc).
To use this feature, set the policy Heading Key phrases.
Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).
AscToPDF can recognize this, and mark up such lines by placing the number in bold, and not using the "Heading n" style on the whole line.
Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.
AscToPDF can recognize these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.
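For a sense of what this parsing involves, the sketch below uses Python's standard `email` module to pull out the same three fields. It is an analogy rather than a description of AscToPDF's own parser, and the function name `summarise_headers` is hypothetical.

```python
import email


def summarise_headers(text):
    """Extract Subject, Author and Date from keyword-colon-value header lines."""
    message = email.message_from_string(text)
    return {
        "Subject": message["Subject"],
        "Author": message["From"],
        "Date": message["Date"],
        "Body": message.get_payload(),  # everything after the first blank line
    }
```

Headers that are absent come back as None, so a caller can decide whether enough was found to build a replacement heading.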
The software can detect various forms of pre-formatted text. This is text laid out in such a way that the spacing used is critical. Spacing is not normally preserved in conversion to PDF, so the correct detection and handling of these special types of text is quite important.
Types of text recognised include the following
Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules.
An attempt is made to interpret half-lines etc as such, although the effect is only approximate.
Form feeds or page breaks become page breaks in the PDF.
AscToPDF allows users to define their own regions of pre-formatted text, using the BEGIN_PRE and END_PRE pre-processor tags (see Using the pre-processor).
For example :-
The use of BEGIN_PRE and END_PRE preprocessor commands (see 7.1) in the text documents tells AscToHTM that this portion of the document has been formatted by the user and should be left unchanged.
AscToPDF attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.
Where such text is detected AscToPDF analyses the section to determine what type of pre-formatted text it is. Options include
- Tables
- Code samples
- ASCII Art and diagrams
- some other formatted text
A number of policies allow you to control
- whether or not the program looks for such text
- how sensitive it is to "pre-formatted" text
- how inclined the program is to "extend" the region to adjacent lines
- whether or not table generation should be attempted
- various aspects of any table analysis that is carried out.
See Pre-formatted text policies for full details.
You can adjust the sensitivity of AscToPDF to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the Minimum size of automatic <PRE> section policy.
When AscToPDF detects such regions it renders them in a fixed-width font, so that the original spacing and alignment are preserved in the PDF.
When tables are detected, AscToPDF will attempt to generate the correct PDF table.
When AscToPDF gets the detection wrong you can use the AscToPDF pre-processor to mark up regions of your document you wish preserved.
Tables are marked out by their use of white space, and by a regular pattern of gaps or vertical bars on each line. AscToPDF will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.
Should AscToPDF wrongly detect the extent of a table, you can mark up a section of text by using the TABLE pre-processor markup (see the Tag manual). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.
You can alter the characteristics of all or individual tables via the table pre-processor commands (see TABLE).
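The "regular pattern of gaps" idea can be illustrated by looking for character columns that are blank (or a vertical bar) on every line of a candidate region. This is a deliberately simplified sketch with a hypothetical function name; it ignores headings, alignment and spanning cells.

```python
def column_gaps(lines):
    """Return (start, end) character ranges that are blank on every line."""
    rows = [line.rstrip() for line in lines if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    # A position separates columns if every row has a space or '|' there.
    blank = [all(row[i] in " |" for row in padded) for i in range(width)]

    gaps, start = [], None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            gaps.append((start, i))   # end is exclusive
            start = None
    return gaps                       # a trailing blank run is ignored
```

Each returned range is a candidate column boundary. A single prose line running across the region removes every gap, which is one reason adding blank lines before and after a table can help the analysis.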
AscToPDF attempts to recognize code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.
Should AscToPDF wrongly detect the extent of a code fragment, you can mark up a section of text by using the CODE pre-processor markup.
Or you can suppress the whole thing altogether via the policy Expect code samples.
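A heuristic of this kind might be sketched as follows. The choice of line endings, the 50% threshold and the name `looks_like_code` are assumptions for illustration; the real detection is likely to weigh more indicators than this.

```python
def looks_like_code(lines, threshold=0.5):
    """Guess whether a block is C++/Java-like code from its line endings."""
    non_blank = [line.rstrip() for line in lines if line.strip()]
    if not non_blank:
        return False
    # Count lines ending in characters typical of C-like languages.
    hits = sum(1 for line in non_blank if line.endswith((";", "{", "}")))
    return hits / len(non_blank) >= threshold
```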
AscToPDF attempts to recognize ASCII art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.
However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.
Should AscToPDF wrongly detect the extent or type of a diagram, you can mark up a section of text by using the DIAGRAM pre-processor markup.
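The "large numbers of non-alphanumeric characters" indicator can be measured with a simple ratio, sketched below. The function name `symbol_ratio` is hypothetical, and any cut-off applied to the ratio would be a tuning choice rather than a documented value.

```python
def symbol_ratio(lines):
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for line in lines for c in line if not c.isspace()]
    if not chars:
        return 0.0
    symbols = sum(1 for c in chars if not c.isalnum())
    return symbols / len(chars)
```

A boxed diagram scores close to 1.0 and ordinary prose close to 0.0. A table drawn with vertical bars and dashes sits somewhere in between, which illustrates why tables and diagrams sometimes get confused.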
If AscToPDF detects a block of text at a large indent, it will now place that text in such a way as to preserve as faithfully as possible the original indent.
If AscToPDF detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with the original line structure preserved.
In such regions other markup (such as bullets) may not be processed as it would be elsewhere.
Converted from a single text file by AscToHTM © 2006 John A Fotheringham