Using Detagger to Convert HTML to Text
As an HTML-to-Text converter, Detagger allows you to:
- convert HTML pages you've browsed into plain ASCII text (.txt), making them easier to read and to email to others.
- convert HTML email into a smaller, safer format that is easier to archive and search.
- convert HTML newsletters into a more compact and email-friendly format, helping authors maintain both HTML and text versions.
- extract data from HTML tables in a format that can be imported into a database.
- extract text from HTML pages so that you can analyse it (e.g. spell checking).
- batch-process whole directories of files in a single go.
When converting an HTML file, the program outputs the document as plain text while preserving the marked-up headings, lists and tables of the original document and turning them into suitable text formats. The text is laid out as faithfully as possible to the original document, within the constraints of your chosen page width.
There are many formatting options which can be saved in "policy" files so that they may be easily reloaded in later sessions.
Note: in addition to converting HTML into plain text, Detagger can also act as a fully featured HTML markup remover.
Features of the text conversion
When you use Detagger to convert HTML to a text file, the conversion can include:
- Using the heading tags to create titles (you can choose to have these
underlined if you wish).
- Respecting the paragraph and line structure of the original.
- Respecting the list tagging on the page.
- Parsing tables (and nested tables) and laying out the text accordingly. By
default the widths of the original table are respected, but if these are
not specified Detagger will intelligently lay out the table on the page.
- Replacing hyperlinks with their display text. URLs may either be placed in
the main text or listed in a reference table added at the end of
the text.
- Formatting the output to your desired page width, so that you end up with
a text format that meets your needs.
- Replacing image tags with an image marker. These can be labelled with the
image URL or the ALT attribute text.
- Adding custom headers and footers to the output. These can have selected
data fields merged in, such as the conversion date, title etc. The evaluation
version adds a standard header; in the registered version this is omitted and
you can choose to add your own headers.
- Changing all HTML entities into the correct characters. You can choose to
have 8-bit characters replaced by 7-bit alternatives where available,
to give the greatest compatibility of the output.
- Supporting the creation of Unicode text files from HTML files that
use non-ANSI character sets or contain non-ANSI HTML entities.
- Intelligently formatting any "dialogue". This is particularly useful
when converting short stories.
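
To make a few of the features above concrete, here is a minimal Python sketch of the general technique: underlining headings, replacing hyperlinks with their display text plus a reference table, replacing images with a marker, decoding entities and wrapping to a page width. It is not Detagger's own code and does not reflect how Detagger is implemented; the names SketchTextConverter and html_to_text, the marker formats and the default width of 70 are all assumptions made for illustration. It uses only the Python standard library and ignores lists, tables and nested markup.

```python
from html.parser import HTMLParser
import textwrap


class SketchTextConverter(HTMLParser):
    """Hypothetical converter for illustration; not Detagger's implementation."""

    # Assumed underline characters for the first two heading levels.
    HEADINGS = {"h1": "=", "h2": "-"}

    def __init__(self, width=70):
        # convert_charrefs defaults to True, so entities such as &amp;
        # arrive in handle_data() already decoded.
        super().__init__()
        self.width = width
        self.blocks = []   # finished output blocks (headings, paragraphs)
        self.buffer = []   # text fragments of the block being built
        self.links = []    # URLs collected for the reference table
        self.href = None   # URL of the <a> element currently open, if any

    def flush(self, underline=None):
        """Close the current block, collapsing runs of whitespace."""
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if not text:
            return
        if underline:
            self.blocks.append(text + "\n" + underline * len(text))
        else:
            self.blocks.append(textwrap.fill(text, self.width))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS or tag == "p":
            self.flush()
        elif tag == "a" and attrs.get("href"):
            self.href = attrs["href"]
        elif tag == "img":
            # Label the marker with the ALT text, falling back to the URL.
            label = attrs.get("alt") or attrs.get("src") or ""
            self.buffer.append("[IMAGE: %s]" % label)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.flush(underline=self.HEADINGS[tag])
        elif tag == "p":
            self.flush()
        elif tag == "a" and self.href is not None:
            # Keep the display text and add a numbered reference marker.
            self.links.append(self.href)
            self.buffer.append(" [%d]" % len(self.links))
            self.href = None

    def handle_data(self, data):
        self.buffer.append(data)

    def result(self):
        self.flush()
        out = list(self.blocks)
        if self.links:
            refs = ["[%d] %s" % (i, url) for i, url in enumerate(self.links, 1)]
            out.append("References\n" + "\n".join(refs))
        return "\n\n".join(out)


def html_to_text(html, width=70):
    """Convert a small HTML fragment to plain text (illustrative only)."""
    parser = SketchTextConverter(width)
    parser.feed(html)
    parser.close()
    return parser.result()
```

Calling html_to_text() on a page fragment returns the headings underlined, each paragraph wrapped to the chosen width, and a "References" block listing the URLs in the order their links appeared.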
Data extraction options
In addition to straight text conversion, Detagger offers some data extraction features:
- Simple tables can also be converted into comma-delimited (CSV) or
tab-delimited data, ready for import into spreadsheets.
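
As a rough illustration of what turning a simple table into delimited data involves, the following Python sketch reads the rows of a non-nested HTML table and writes them out as comma-delimited or tab-delimited text. It is an assumption-laden example, not Detagger's method: the names SketchTableExtractor and table_to_delimited are hypothetical, and nested tables, row spans and column spans are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class SketchTableExtractor(HTMLParser):
    """Hypothetical helper: collects the rows of a simple, non-nested table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self.row = None    # cells of the <tr> currently being read
        self.cell = None   # text fragments of the <td>/<th> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Collapse whitespace inside the cell before storing it.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_delimited(html, delimiter=","):
    """Return the table rows as delimited text (illustrative only)."""
    parser = SketchTableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()
```

Passing delimiter="\t" gives tab-delimited output. The csv module quotes any cell that contains the delimiter, so a cell such as "Tea, green" survives a round trip through a spreadsheet import.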
Documentation
The product comes with extensive documentation, which you can also read online.