Documentation for the Detagger HTML to text converter and markup removal utility
The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html
As of version 2.3, Detagger allows the use of "Text Commands". These are commands that allow you to modify the text before it is converted.
The commands should be placed in an external "Text Commands File". This file can be chosen using the Conversion Options -> Config File Locations menu option.
Contents of this section
Text Commands available
  Text Command : ignore_line
  Text Command : remove_text
  Text Command : replace_text
Text Command line elements
  line_selection
  line_match
  match_type
  replace_type
An example use of a Text Commands File
The ignore_line command identifies lines that should be ignored in the input.
Syntax:
ignore_line <line_selection>
Any line matching the specified line_selection criteria will be ignored in the output. This can be a useful way of ignoring page markers in an input file, as these don't always transfer well under the conversion.
The remove_text command identifies text that should be removed from the input.
Syntax:
remove_text <match_type> "match string"
Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text removed.
The replace_text command identifies text that should be replaced in the input.
Syntax:
replace_text <match_type> "match string" by_string "new string"
or
replace_text <match_type> "match string" by_character "<char>"
Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text replaced.
If the replacement is specified as
by_string "new string"
then the text is replaced by the new string. If the replacement is specified as
by_character "<char>"
then the string is replaced by a string of equal length consisting of this single character repeated. This can be useful for example to replace change bar characters by spaces in a document where the change bars have confused the program, or to replace other characters inside a table that are confusing the detection of the table's true layout.
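The difference between the two replacement modes can be illustrated with a short Python sketch. This is not Detagger's implementation: the function name `replace_text` and its parameters are hypothetical, and the sketch assumes the case-insensitive `string` match type.

```python
import re


def replace_text(line, match, new, by_character=False):
    """Replace every case-insensitive occurrence of `match` in `line`.

    Hypothetical helper that mimics the documented behaviour only.
    """
    pattern = re.compile(re.escape(match), re.IGNORECASE)
    if by_character:
        if len(new) != 1:
            raise ValueError("by_character expects a single character")
        # Repeat the character over the matched span so the line length is kept.
        return pattern.sub(lambda m: new * len(m.group(0)), line)
    # A function is used so backslashes in `new` are taken literally.
    return pattern.sub(lambda m: new, line)


# by_string: the matched text is swapped for the new string.
print(replace_text("see fig. 1", "FIG.", "Figure"))  # see Figure 1

# by_character: change bars are blanked out, column positions are unchanged.
blanked = replace_text("| 10 | 20 |", "|", " ", by_character=True)
print(repr(blanked), len(blanked))
```

Because the by_character result has the same length as the input, any columns to the right of the replaced text stay aligned.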
The line_selection element is actually a combination of a number of simpler elements as follows
Syntax:
<line_match> <match_type> "match string"
That is, the line_selection consists of a line_match, a match_type, and then the actual "match string" to be matched. All three elements must be present in order for the line_selection to be valid.
The following are all valid examples
starting_with string "Chapter"
starting_with exact_phrase "Author : "
containing phrase "click here"
containing string "http://"
The line_match element specifies where on the input line the specified text should be located. The options are
starting_with | Text should be at start of line (ignoring any white space) |
containing | Text can be anywhere on the input line |
Care should be used when using the containing option, as false matches are more likely to occur.
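The following Python sketch shows why containing is more prone to false matches than starting_with. It only models the behaviour described here, using the case-insensitive `string` match type; the function names `line_selected` and `ignore_lines` and the sample lines are hypothetical.

```python
def line_selected(line, line_match, text):
    """Return True if `line` satisfies `<line_match> string "text"`."""
    haystack = line.lower()  # the `string` match type ignores case
    needle = text.lower()
    if line_match == "starting_with":
        # Leading white space is ignored before comparing.
        return haystack.lstrip().startswith(needle)
    if line_match == "containing":
        return needle in haystack
    raise ValueError("unknown line_match: " + line_match)


def ignore_lines(lines, line_match, text):
    """Drop every selected line, in the manner described for ignore_line."""
    return [ln for ln in lines if not line_selected(ln, line_match, text)]


page = ["   Page 12", "The next page is blank.", "Results"]

# Only the page marker is dropped.
print(ignore_lines(page, "starting_with", "page"))

# The ordinary sentence that mentions "page" is dropped as well.
print(ignore_lines(page, "containing", "page"))
```

In this sample the containing form also discards a line of real content, which is the kind of false match the warning above refers to.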
The match_type element specifies how any supplied match string should be matched. The options are
string
This specifies that a string should be matched. This is, in fact, the most general of match types and is the one that would normally be used. This match type is case-insensitive.
exact_string
Same as "string", but case-sensitive.
phrase
A "phrase" is a string that is surrounded by white space and/or punctuation on either side (see below). This match type is case-insensitive.
exact_phrase
Same as "phrase", but case-sensitive.
wildcard
Not yet supported (*)
The match_type phrase is a special case. This is a string that is surrounded by white space or punctuation on either side. So whereas the string "the" would match "then", the phrase "the" wouldn't, because the "n" in "then" is neither white space nor punctuation.
The start and end of a line count as white space, and any leading or trailing punctuation is allowed. Phrase is therefore a more precise match - even for single words - than string.
Consider the following example, concentrating on the letters "ten" in the word "tense"
This is a tense situation....
The following would apply
match_type | Matches? |
---|---|
string "ten" | Yes. The "ten" matches the first three characters of "tense" in the middle of the line |
exact_string "Ten" | No. The "t" in "tense" is lower case, so the match fails |
phrase "ten" | No. "ten" is not surrounded by white space or punctuation because it is followed by "se" |
exact_phrase "tense situation" | Yes. The case matches, and there is a space before and punctuation (the "....") afterwards |
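The four results above can be reproduced with a small Python sketch of the match types. It follows the rules as stated in this section and is not taken from Detagger itself; the function name `matches` is hypothetical, and treating Python's `string.punctuation` and `string.whitespace` as the boundary characters is an assumption.

```python
import string

# Assumed set of characters that may surround a phrase.
BOUNDARY = set(string.whitespace) | set(string.punctuation)


def matches(line, match_type, text):
    """Return True if `text` is found in `line` under `match_type`."""
    exact = match_type.startswith("exact_")
    kind = match_type[len("exact_"):] if exact else match_type
    if kind not in ("string", "phrase"):
        raise ValueError("unsupported match_type: " + match_type)
    # The non-exact types ignore case.
    hay = line if exact else line.lower()
    needle = text if exact else text.lower()
    start = hay.find(needle)
    while start != -1:
        if kind == "string":
            return True
        end = start + len(needle)
        # The start and end of the line count as white space.
        before = hay[start - 1] if start > 0 else " "
        after = hay[end] if end < len(hay) else " "
        if before in BOUNDARY and after in BOUNDARY:
            return True
        start = hay.find(needle, start + 1)
    return False


line = "This is a tense situation...."
print(matches(line, "string", "ten"))                    # True
print(matches(line, "exact_string", "Ten"))              # False
print(matches(line, "phrase", "ten"))                    # False
print(matches(line, "exact_phrase", "tense situation"))  # True
```

The phrase check looks only at the single character on each side of a candidate match, which is enough to tell "the" apart from "then".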
The replace_type element is used in the replace_text command to specify what type of text replacement should be executed. The element should be immediately followed by the replacement text in quotes.
There are two options:-
by_string
The matched text should simply be replaced by the replacement text.
by_character
The matched text should be replaced by an equal length string composed solely of the single character in the replacement text.
The by_character option allows a string to be "blanked out" by the character of your choice, without altering the line length or spacing. This can be useful, for example, to replace all DOS line drawing characters by blanks in a table, so as to let the software make a better stab at detecting the table layout.
The following is a real-life example, sent to me by one of my users. They had files that each consisted of a table of data that was to be imported into an Excel spreadsheet. By using the policies
output table format : 2
Add delimited table markers : yes
they were able to turn the table into delimited data which they could easily extract and then import into Excel. However, the problem then was that they couldn't tell which data had come from which file. As it happened, the filename matched an instrument name, and they wanted the imported data to include the instrument/filename.
I actually modified Detagger to make this work for them. The solution was to use a text command file as follows
replace_text string "<TR>" by_string "<TR><TD>[[DATA IN_FILENAME]]</TD>"
In this case the opening <TR> of each table row was replaced by a <TR>, a <TD>, a fragment tag and then a </TD>. The fragment tag [[DATA IN_FILENAME]] gets substituted by the input filename. The net effect of this substitution is to create the appearance of an extra column in each table, consisting of the filename in the first cell of each row.
Once the modified table is converted to delimited data, the filename is effectively inserted into each row of the table, so that once imported into Excel each data row can identify which file (and hence which instrument) the data came from.
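The net effect of this command can be mimicked in a few lines of Python. This is only an illustration of the resulting markup, not of how Detagger processes the fragment tag; the function name `add_filename_column` and the file name `probe7.htm` are hypothetical, and whether the real substitution includes a path or extension is not stated here.

```python
import re


def add_filename_column(html, filename):
    """Insert a leading <TD> holding `filename` after every plain <TR> tag."""
    cell = "<TR><TD>" + filename + "</TD>"
    # Case-insensitive, like the `string` match type; <TR> tags that carry
    # attributes are not matched by this literal pattern.
    return re.sub(re.escape("<TR>"), lambda m: cell, html, flags=re.IGNORECASE)


table = "<TABLE><TR><TD>1.5</TD></TR><TR><TD>2.7</TD></TR></TABLE>"
print(add_filename_column(table, "probe7.htm"))
```

Each row gains one extra first cell, so the delimited output carries the file name alongside every data value.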
Converted from a single text file by AscToHTM © 1997-2005 John A Fotheringham