Documentation for the Detagger markup removal utility : Using a Text Commands File

Documentation for the Detagger html to text converter and markup removal utility

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

Using a Text Commands File

As of version 2.3, Detagger allows the use of "Text Commands". These are commands that allow you to modify the text before it is converted.

The commands should be placed in an external "Text Commands File". This file can be chosen from the Conversion Options -> Config File Locations menu option.

Contents of this section

Text Commands available

Text Command : ignore_line
Text Command : remove_text
Text Command : replace_text

Text Command line elements

line_selection
line_match
match_type
replace_type

An example use of a Text Commands File

Text Commands available

Text Command : ignore_line

The ignore_line command identifies lines that should be ignored in the input.

Syntax:

        ignore_line <line_selection>

Any line matching the specified line_selection criteria will be left out of the output. This can be a useful way of ignoring page markers in an input file, as these don't always transfer well under the conversion.
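
As a sketch, suppose the input marks each page with a line beginning "Page " (a made-up marker, used here only for illustration). A command of the following form would then select those lines:

        ignore_line starting_with string "Page "

Because the string match type is case-insensitive and starting_with ignores leading white space, this would also be expected to catch an indented "PAGE 12" line. It would equally catch an ordinary sentence that happens to begin with "Page ", so the match string should be chosen with care.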

Text Command : remove_text

The remove_text command identifies text that should be removed from the input.

Syntax:

        remove_text <match_type> "match string"

Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text removed.
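
For instance, assuming the input contains stray "click here" link text that is not wanted in the output (a hypothetical case), a command of this form would remove it wherever it appears as a phrase:

        remove_text phrase "click here"

Only the matching text is removed. The rest of the line is kept, which is what distinguishes remove_text from ignore_line.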

Text Command : replace_text

The replace_text command identifies text that should be replaced in the input.

Syntax:

        replace_text <match_type> "match string" by_string "new string"
or
        replace_text <match_type> "match string" by_character "<char>"

Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text replaced.

If the replacement is specified as

by_string "new string"

then the text is replaced by the new string. If the replacement is specified as

by_character "<char>"

then the matched text is replaced by a string of equal length consisting of this single character repeated. This can be useful, for example, to replace change bar characters by spaces in a document where the change bars have confused the program, or to replace other characters inside a table that are confusing the detection of the table's true layout.
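
As an illustration of the difference, take a hypothetical input in which the word DRAFT is to be hidden. The two forms would be written as

        replace_text exact_string "DRAFT" by_string "FINAL"
        replace_text exact_string "DRAFT" by_character "*"

Applied separately to the line

        Status: DRAFT copy

the first form would give "Status: FINAL copy" and the second "Status: ***** copy", the five asterisks standing in for the five characters of DRAFT so that the line length is unchanged.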

Note: The new string can contain the DATA fragment tag, which allows details about the file (e.g. its name) to be included in the substitution. See An example use of a Text Commands File.

Text Command line elements

line_selection

The line_selection element is actually a combination of a number of simpler elements, as follows:

Syntax:

        <line_match> <match_type> "match string"

That is, the line_selection consists of a line_match, a match_type, and then the actual "match string" to be matched. All three elements must be present in order for the line_selection to be valid.

The following are all valid examples:

        starting_with   string          "Chapter"
        starting_with   exact_phrase    "Author : "
 
        containing      phrase          "click here"
        containing      string          "http://"

line_match

The line_match element specifies where on the input line the specified text should be located. The options are

	starting_with	Text should be at start of line (ignoring any white space)
	containing	Text can be anywhere on the input line

Care should be used when using the containing option, as false matches are more likely to occur.

match_type

The match_type element specifies how any supplied match string should be matched. The options are

	string	This specifies that a string should be matched. This is, in fact, the most general of match types and is the one that would normally be used. This match type is case-insensitive.
	exact_string	Same as "string", but case-sensitive.
	phrase	A "phrase" is a string that is surrounded by white space and/or punctuation on either side (see below). This match type is case-insensitive.
	exact_phrase	Same as "phrase", but case-sensitive.
	wildcard	Not yet supported (*)

The match_type phrase is a special case. This is a string that is surrounded by white space or punctuation on either side. So whereas the string "the" would match "then", the phrase "the" wouldn't, because the "n" in "then" is neither white space nor punctuation.

The start and end of a line count as white space, and any leading or trailing punctuation is allowed. Phrase is therefore a more precise match - even for single words - than string.

Consider the following example, concentrating on the letters "ten" in the word "tense"

This is a tense situation....

The following would apply

match_type	Matches?
string "ten"	Yes. "ten" matches the first three characters of "tense" in the middle of the line.
exact_string "Ten"	No. The "t" in "tense" is lower case, so the match fails.
phrase "ten"	No. "ten" is not surrounded by white space or punctuation, because it is followed by "se".
exact_phrase "tense situation"	Yes. The case matches, and there is a space before and punctuation (the "....") afterwards.
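
To see how the choice of match_type affects a command, compare these two remove_text commands (shown as alternatives, not as the contents of one file; the match string is chosen only for illustration):

        remove_text string "the"
        remove_text phrase "the"

Going by the definitions in this section, the first would also match the first three letters of "then", whereas the second would only match "the" where it stands as a separate word.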

replace_type

The replace_type element is used in the replace_text command to specify what type of text replacement should be executed. The element should be immediately followed by the replacement text in quotes.

There are two options:-

	by_string	The matched text should simply be replaced by the replacement text.
	by_character	The matched text should be replaced by an equal length string composed solely of the single character in the replacement text.

The by_character option allows a string to be "blanked out" by the character of your choice, without altering the line length or spacing. This can be useful, for example, to replace all DOS line drawing characters by blanks in a table, so as to let the software make a better stab at detecting the table layout.

An example use of a Text Commands File

The following is a real-life example, sent to me by one of my users. They had files that each consisted of a table of data that was to be imported into an Excel spreadsheet. By using the policies

        output table format : 2
        Add delimited table markers : yes

they were able to turn the table into delimited data which they could easily extract and then import into Excel. However, the problem then was that they couldn't tell which data had come from which file. As it happened, the filename matched an instrument name, and they wanted the imported data to include the instrument/filename.

I actually modified Detagger to make this work for them. The solution was to use a Text Commands File as follows:

        replace_text string "<TR>" by_string "<TR><TD>[[DATA IN_FILENAME]]</TD>"

In this case the opening <TR> of each table row was replaced by a <TR>, a <TD>, a fragment tag and then a </TD>. The fragment tag [[DATA IN_FILENAME]] gets substituted by the input filename. The net effect of this substitution is to create the appearance of an extra column in each table, consisting of the filename in the first cell of each row.

Once the modified table is converted to delimited data, the filename is effectively inserted into each row of the table, so that once imported into Excel each data row can identify which file (and hence which instrument) the data came from.

See The DATA fragment tag
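
As a closing sketch, the commands described in this section could be combined in a single Text Commands File. The layout below assumes one command per line, which this section does not state explicitly, and all match strings other than "<TR>" are invented for illustration:

        ignore_line starting_with exact_string "Page "
        remove_text phrase "click here"
        replace_text exact_string "|" by_character " "
        replace_text string "<TR>" by_string "<TR><TD>[[DATA IN_FILENAME]]</TD>"

Read this way, the file would skip page marker lines, strip "click here" link text, blank out vertical bar characters without changing the line length, and add the input filename as a first cell in each table row.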
