Documentation for the JafSoft text conversion APIs
This document describes the APIs that are available for the text conversion products produced by JafSoft Limited. These include
AscToHTM    Text-to-HTML conversion
AscToRTF    Text-to-RTF conversion
AscToTab    Text-to-table (HTML and RTF) conversion
Detagger    HTML-to-Text conversion and tag removal
Although these converters are written in C++, the API is exported as "C"-like methods, and can be called from C/C++, C#, Visual Basic or Java. The standard distribution is supplied under Windows, but customers with access to the source have successfully compiled and integrated the API into systems running under OpenVMS, Linux and Solaris.
If you have any particular enquiries, contact info<at>jafsoft.com (replace "<at>" by "@").
Table of Contents
Overview
Calling the API from C/C++
  Using the DLL
  Using static linking
  C++ example
Calling the API from .NET
Calling the API from C# (C sharp)
Calling the API from Visual Basic
  Passing text data into and out of the API
  Defining the API
  Calling the API
  Visual Basic example
Calling the API from Java
Calling the API from inside Lotus Notes
  LotusNotes example
Using the API on non-Windows platforms
Using The API
  Allocating and releasing the API
  Customising the conversion using policies
    Policy files
    Policy types
    More documentation on policies
  Specifying the conversion types
  Performing the conversion
    Setting up the input and output destinations
    Performing conversion between files
    Performing conversion between string buffers
    Performing mixed conversions
  Testing for success using the API return values and the "Result" argument
    API return values
    API result codes
  Passing character data to and from the converter
    When to use string or (char *) "pointers" to pass character data
    Sample code using C++ strings
    Sample code using (char *) pointers
    Checking the conversion results when using (char *) pointers
  Passing Unicode data to the converter
    The various Unicode implementations
    How the API handles Unicode internally
    How the API detects the presence of Unicode
    Doing file-to-file conversions
    Doing string-to-string conversions
      Using the "input text encoding" policy
      Using the "output text encoding" policy
    Summary of Unicode usage
  Capturing error messages
The API demonstration package
Initialise and release methods
  CONVERTER_Allocate
  CONVERTER_Free
Policy manipulation methods
  CONVERTER_ResetPolicies
  CONVERTER_ReadPolicyFile
  CONVERTER_WritePolicyFile
  CONVERTER_SetPolicyValue
  CONVERTER_GetPolicyValue
Input and output specification methods
  CONVERTER_ResetSources
  CONVERTER_ResetInputSource
  CONVERTER_ResetOutputSource
  CONVERTER_SetInputString
  CONVERTER_SetOutputString
  CONVERTER_SetInputFilename
  CONVERTER_SetOutputFilename
  CONVERTER_GetOutCharArraySize
  CONVERTER_GetOutCharArray_Ptr
Conversion methods
  CONVERTER_DoConversion
  CONVERTER_DoFileConvert
  CONVERTER_DoStringConvert
Error reporting methods
  CONVERTER_SetErrorFn
  CONVERTER_SetOutFn
Debugging methods
  CONVERTER_DebugAPI
  CONVERTER_DebugAPILogMessage
  CONVERTER_GetLastMessage
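The typical calling sequence when using the API is as follows: allocate the API resource (CONVERTER_Allocate), optionally set or load any policies, perform the conversion, test the return value and the Result argument, and finally release the API resource (CONVERTER_Free).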
For example, a small C++ program might look as follows

    #include "converter.h"
    #include "api_defines.h"
    ...
    string inputFile  = "input.txt";
    string outputFile = "output.html";
    long Result    = R_SUCCESS;
    long APIResult = CONV_OK;

    // Allocate the API resource
    long Handle = CONVERTER_Allocate();

    // do a file conversion
    APIResult = CONVERTER_DoFileConvert (Handle, CT_NORMAL, inputFile, outputFile, Result);

    // test for success
    if (APIResult == CONV_OK && Result == R_SUCCESS)
    {
        cout << "Conversion worked okay!" << endl;
    }

    // free the API resource
    CONVERTER_Free(Handle);
The API software is itself written in C++, and so in principle you can link to either the library or DLL forms of the API. This means that all things being equal you can call the string-based versions of the API methods, which are easier to use. Linking statically will make your .exe larger, but will avoid the need to manage the delivery and installation of the DLL.
The API will be delivered including the following files
<API_name>.dll          The API in DLL form
<API_name>.lib          The library file for the DLL version of the API. This will be a comparatively small file, just a few Kb in size.
<API_name>_nodll.lib    The library file for the non-DLL version of the API. This will be a large file, typically a few Mb in size.
To use the DLL, include "converter.h" and "api_defines.h" in your source code, and then link your software against the <API_name>.lib file. This will be the smaller of the two .lib files, as it only contains wrappers for the DLL methods.
Once linked, you will need to ensure the DLL is either in the same folder as your executable file, or in your system directory.
Note, when using the DLL version, using string objects can become a problem as the implementation of the string objects varies from one C++ implementation to the next. In particular C++ inside .NET projects cannot access the string objects inside the supplied DLL because of a binary incompatibility.
(See Passing character data to and from the converter)
To use static linking, include "converter.h" and "api_defines.h" in your source code, and then link your software against the <API_name>_nodll.lib file. This will be the larger of the two .lib files.
Once linked you will be able to run your program independently of the DLL.
An example program TestAPI.cxx is included in the Demonstration package, together with the converter.h and api_defines.h header files to define how the converter should be accessed in C++.
Here's an example of calling the Detagger API to convert an HTML file to text using the string version of the API methods.

    #include "api_defines.h"
    #include "converter.h"

    long ConvertType = CT_CONVERT_TO_TEXT;   // convert file to plain text
    string inFile  = "c:\\temp\\input.html";
    string outFile = "c:\\temp\\output.txt";
    long Result, APIStatus;

    // Allocate the API and get a handle used in subsequent calls
    long APIHandle = CONVERTER_Allocate();

    Result = R_SUCCESS;
    APIStatus = CONVERTER_DoFileConvert (APIHandle, ConvertType, inFile, outFile, Result);
    if (APIStatus != CONV_OK || Result != R_SUCCESS)
    {
        // you could test the value of Result to see what went wrong
        CONVERTER_Free (APIHandle);
        return EXIT_FAILED;
    }

    // Free up the converter
    APIStatus = CONVERTER_Free (APIHandle);
    if (APIStatus != CONV_OK || Result != R_SUCCESS)
        return EXIT_FAILED;
In principle C/C++ code should be callable from .NET projects, but as discussed above, the implementation of the string object varies under .NET, leading to binary incompatibilities. Furthermore it seems the implementation of string within .NET changed between versions, causing yet another binary incompatibility. For this reason unless you get a library or DLL that specifically matches your version of .NET you will get link and/or runtime errors.
For this reason I would advise .NET developers to use the _Ptr variants and pass arguments as (char *) values (see Passing character data to and from the converter).
See calling the API from C/C++.
Some API users have managed to call the DLL version of the API from inside C#. To do this you need to create a wrapper class that contains the API and exposes its methods. In this class you declare a method for each API method you wish to expose, and use a DllImport attribute to associate it with the matching method inside the DLL itself.
In the Demonstration package the folder "C# demos" contains the file DetaggerAPI.cs as an example kindly provided by a user who got this working.
Once you have a wrapper class, you can then use this to invoke the API as required.
When calling the DLL from C#, the _Ptr variant of the API methods must be used (see Passing character data to and from the converter).
Visual Basic can only call the DLL version of the API, and has to pass text data as character pointers (see Passing character data to and from the converter).
Sample VB applications are available in the API demonstration package.
Visual Basic String variables cannot be mapped onto C++ string variables, so instead the VB code has to call the (char *) variants of the API methods (those whose name has "_Ptr" appended).
See When to use string or (char *) pointers to pass character data.
In order to use the API methods, they first must be correctly declared. This is done in declarations such as this

    Public Declare Function CONVERTER_ReadPolicyFile_Ptr _
        Lib "h:\DemoAPI\DLLs\rtfconv_eval" _
        (ByVal handle As Long, _
         ByVal policyfilename As String, _
         ByRef result As Long) As Long

In this example the DLL location "h:\DemoAPI\DLLs\rtfconv_eval" is given explicitly (in this case for the AscToRTF demo DLL). If you copy the DLL to your system directory, the path can be omitted and only the DLL name "rtfconv_eval" need be used.
Note: The actual DLL name will depend on which API you are working with.
Full declarations for the API are contained in the API demonstration package. These contain files such as RTFConv.bas, which is effectively a translation into VB of the C++ header file "converter.h". Only the "_Ptr" variants are defined, as VB has to use these.
Should you want to install the DLL in a non-system folder, you will need to edit this VB file to change all the references to the correct location.
To call the API you must first make sure it is properly defined (see defining the API) and include the API definition in your project.
Once this is done, you are free to call most of the API methods, using the "_Ptr" variants to pass text data where they exist.
Here is a snippet of Visual Basic code that calls the API methods. In this case there is an RTFConv object which is the AscToRTF API converter object, declared in a separate module. Converter declaration files are available in the API demonstration package.
    '-- initialise some data values
    ll_on = 1
    ll_off = 0

    '-- Allocate new RTFConverter resources to get a handle (needed in subsequent calls)
    ConverterHandle = RTFConv.CONVERTER_Allocate()

    '-- switch the various API debug modes on/off
    ' we don't want the call-by-call reporting
    RTFConv.CONVERTER_DebugAPI ll_off
    ' ... but we will have a log file, thanks.
    ls_logfile = "c:\temp\debug_API.log"
    RTFConv.CONVERTER_DebugAPILogMessage ll_on, ls_logfile

    '-- set any policies
    ls_policyname1 = "default font"
    ls_policyvalue1 = "Verdana, regular, 12"
    retval = RTFConv.CONVERTER_SetPolicyValue_Ptr(ConverterHandle, ls_policyname1, _
                                                  ls_policyvalue1, result)

    '-- now execute file conversion
    On Error GoTo ShowResult

    ' Do a NORMAL conversion
    Dim il_ConvType As Long
    il_ConvType = RTFConv.CT_NORMAL
    result = 0
    retval = RTFConv.CONVERTER_DoFileConvert_Ptr(ConverterHandle, il_ConvType, _
                                                 ls_inputfilename, ls_outputfilename, result)
    Status.Caption = "Output file is " + ls_outputfilename

    '-- fetch the last API message (only useful if there's an error - and not always then)
    Dim message As String
    Dim messagesize As Long
    messagesize = 150
    message = Space(messagesize)
    retval = RTFConv.CONVERTER_GetLastMessage_Ptr(message, messagesize, result)
    Status.Caption = "<" + message + ">"

    '-- release the API resources
    retval = RTFConv.CONVERTER_Free(ConverterHandle)
Some API users have managed to invoke the DLL versions of the API from inside Java programs. To do this it is necessary to create a C++ class that uses JNIEXPORT to expose its methods in a way that is accessible from Java. This class can then be called from inside Java to access the functionality of the API.
In the Demonstration Package, some samples of this, kindly supplied by an API user, are provided in the "JNI demo" folder.
Because Java Strings are not compatible with C++ string objects, the _Ptr variant of the API methods must be used inside the wrapper class (see Passing character data to and from the converter).
Some users of the API have managed to invoke the DLL version of the API from inside LotusScript. They have kindly provided some sample code which is included in the "LotusNotes demo" folder of the Demonstration Package.
As with most other languages, Lotus Notes has to use the _Ptr variants of the API methods (see Passing character data to and from the converter).
This example was supplied by a user who got the API to work inside Lotus Notes. Note the comment about declaring the result as Long to avoid a type mismatch error.
    Sub ConvertToText
        Dim ConverterHandle As Long
        Dim ls_inputfilename As String
        Dim ls_outputfilename As String

        ls_inputfilename$ = "C:\\Documents and Settings\\user\\Desktop\\table_to_unhtml.htm"
        ls_outputfilename$ = "C:\\Documents and Settings\\user\\Desktop\\table_to_unhtml.txt"

        ConverterHandle = CONVERTER_Allocate()

        Dim result As Long   ' <----- added to eliminate 'type mismatch' error on '"result"
        Dim il_ConvType As Long
        il_ConvType = CT_CONVERT_TO_TEXT
        result = 0
        retval = CONVERTER_DoFileConvert_Ptr ( _
            ConverterHandle, _
            il_ConvType, _
            ls_inputfilename, _
            ls_outputfilename, _
            result )

        retval = CONVERTER_Free(ConverterHandle)
    End Sub
At present the API is only readily available under Windows. However the core code has been successfully built and run under OpenVMS, Windows, Linux and Solaris, and could probably be easily ported to other platforms as it is relatively OS-neutral.
JafSoft Limited can currently only offer to support Windows and OpenVMS versions. To build a version on any other platform, you will need to sign a special agreement to get the source code. This is normally more expensive than the usual API cost, and in some cases may not be granted.
Email JafSoft Limited (info<at>jafsoft.com) with your requirements in this case (replace "<at>" by "@").
When using the API it is necessary to first allocate some API resources. You do this by calling CONVERTER_Allocate which returns a "handle". This is an ID that tells the API which resources are being used. You need to pass this handle into all subsequent API calls.
Once you are finished with this API handle, you should call CONVERTER_Free to release the API resource. Once you've done this you won't be able to continue using the same handle.
Inside the API the CONVERTER_Allocate call creates a new API object. As the conversion proceeds, this object will allocate memory. For example the output of the last conversion is usually held in memory. Calling CONVERTER_Free releases all this resource by causing the API object to be deleted and all its memory released.
If you don't call CONVERTER_Free, you will have a memory leak that will consume an amount of memory comparable to the size of the data converted.
So a typical use of the API would be as follows :-
    // Allocate the API resource
    long Handle = CONVERTER_Allocate();

    // ... use the converter as you wish

    // free the API resource
    CONVERTER_Free(Handle);
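Because a missed CONVERTER_Free leaks memory, C++ callers may prefer to tie the release call to a scope. The following is only a sketch of that idea, not part of the API: the class name ConverterHandleGuard is hypothetical, and the allocate and free functions are passed in by the caller so that the example stands alone without converter.h.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard (not part of the JafSoft API): acquires a handle
// on construction and releases it on destruction, so the release call
// cannot be skipped by an early return or an exception.
class ConverterHandleGuard {
public:
    // allocateFn and freeFn are supplied by the caller. With the real API
    // these would presumably be CONVERTER_Allocate and CONVERTER_Free
    // (an assumption about their exact signatures).
    ConverterHandleGuard(std::function<long()> allocateFn,
                         std::function<long(long)> freeFn)
        : freeFn_(std::move(freeFn)), handle_(allocateFn()) {}

    ~ConverterHandleGuard() { freeFn_(handle_); }

    // Non-copyable: two guards must never free the same handle.
    ConverterHandleGuard(const ConverterHandleGuard&) = delete;
    ConverterHandleGuard& operator=(const ConverterHandleGuard&) = delete;

    // The handle to pass into subsequent API calls.
    long get() const { return handle_; }

private:
    std::function<long(long)> freeFn_;
    long handle_;
};
```

Assuming the API functions convert to these callable types, a guard could be created with `ConverterHandleGuard guard(CONVERTER_Allocate, CONVERTER_Free);` and `guard.get()` passed wherever a handle is required.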
Each converter will accept options that can influence the analysis, or alter the output from the conversion process.
These options are known as "policies" and they may be saved in text files known as policy files.
The API offers several Policy manipulation methods which allow you to load a policy file, or to set individual policies before the conversion.
The API also includes methods which allow you to interrogate the value of a policy, or to dump all current policy values to file. You wouldn't normally want to do that unless you wanted to see how certain policies had been changed during the conversion. For example you might want to check the policy "expect underlined headings" to see if the converter had automatically detected underlined headings. If it hadn't, you might choose to explicitly set this policy before conversion in future.
Policies consist of a "policy name" - basically a text description - and a value. You should read the documentation for the converter you are interested in for more details.
See also the Policy Manual, but be aware that not all policies apply to all converters.
Policy files are plain text files with a .pol extension. They contain one policy per line (i.e. no hard breaks within a policy) as follows
<policy_name> : <policy_value>
Blank lines and comments (lines beginning with "!") are allowed, and there are a number of recognised headings enclosed in brackets that are ignored. The headings are used for convenience to group policies together and to make the file easier to read. In general the order in which policies appear in the file doesn't matter.
The following is a sample fragment of a policy file
    [Added HTML]
    Document Title    : User manual for AscToHTM
    Document Keywords : ASCII, text, HTML, conversion, utility, shareware

    [Contents]
    Add contents list : Yes

    [Frames]
    Header frame depth : 110
    Footer frame depth : 90
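To make the file format concrete, here is a minimal sketch of how a single policy file line could be classified and split. It is not the converter's own parser. The function names are hypothetical, and splitting at the first ':' is an assumption (it presumes policy names never contain a colon).

```cpp
#include <string>

// Remove leading and trailing spaces, tabs and carriage returns.
static std::string trimPolicyText(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split one policy file line into name and value.
// Returns false for blank lines, "!" comments, "[...]" headings, and
// lines with no ':' separator or an empty name.
bool parsePolicyLine(const std::string& line,
                     std::string& name, std::string& value) {
    const std::string t = trimPolicyText(line);
    if (t.empty() || t[0] == '!' || t[0] == '[') return false;

    // Assumption: the first ':' separates the name from the value.
    const std::string::size_type colon = t.find(':');
    if (colon == std::string::npos) return false;

    name = trimPolicyText(t.substr(0, colon));
    value = trimPolicyText(t.substr(colon + 1));
    return !name.empty();
}
```

A caller would read the file line by line and pass each accepted name and value pair on, for example to CONVERTER_SetPolicyValue. In practice CONVERTER_ReadPolicyFile already does this work, so a parser like this is only useful for inspecting or generating policy files.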
Policies come in a number of types, with the value formatted accordingly
Integer     integer value
Boolean     "yes", "no"
Text        any free text
Alignment   "left", "right", "centered", "justified"
Colour      any valid HTML colour hex value, or one of the 16 standard colour names
Font        Format liable to change, but currently compatible with the MFC FontDialog control,
            using "font name, weight, point size", e.g.
                Arial, regular, 12
                Verdana, bold, 10
The special value "(none)" can be taken to mean "not set". See the converter documentation and the Policy Manual for details of individual policies
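As an illustration of the Font format, the sketch below splits a value such as "Verdana, bold, 10" into its three parts and treats "(none)" as not set. The FontSpec structure and the function names are hypothetical. Since the format is described as liable to change, this only reflects the form shown above and assumes font names contain no commas.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the three parts of a Font policy value.
struct FontSpec {
    std::string name;
    std::string weight;
    int pointSize;
};

// Remove leading and trailing spaces and tabs.
static std::string trimFontText(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Parse "font name, weight, point size". Returns false for "(none)" or
// for any text that is not in that three-part form; out is only written
// on success.
bool parseFontPolicy(const std::string& text, FontSpec& out) {
    const std::string t = trimFontText(text);
    if (t == "(none)") return false;              // policy is not set

    const std::string::size_type c1 = t.find(',');
    if (c1 == std::string::npos) return false;
    const std::string::size_type c2 = t.find(',', c1 + 1);
    if (c2 == std::string::npos) return false;

    const std::string name   = trimFontText(t.substr(0, c1));
    const std::string weight = trimFontText(t.substr(c1 + 1, c2 - c1 - 1));
    const std::string size   = trimFontText(t.substr(c2 + 1));
    if (name.empty() || weight.empty() || size.empty()) return false;

    // The point size must be a positive whole number with nothing after it.
    char* end = nullptr;
    const long points = std::strtol(size.c_str(), &end, 10);
    if (*end != '\0' || points <= 0) return false;

    out.name = name;
    out.weight = weight;
    out.pointSize = static_cast<int>(points);
    return true;
}
```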
The use of "policies" is the same for all converters, but the actual policies supported will vary from converter to converter.
You should download and check the documentation for the converter you are interested in.
You should also review the Policy Manual. If you've downloaded the Windows version of the converter, this was probably included in the download. If not you can find it online at
Some useful policies, common to most converters, are below
Diagnostics
    Generate diagnostics files | Yes/No

Error messages
    Display messages | Yes/No
    Error reporting level | 1-10 (10 is high, shows only important messages)
    Suppress INFO messages | Yes/No
    Suppress TAG ERROR messages | Yes/No
    Suppress URL messages | Yes/No
    Suppress WARNING messages | Yes/No
    Suppress program ERROR messages | Yes/No

Contents List
    Add contents list | Yes/No

Fonts
    Default Font | "Times New Roman, regular, 10"
    Fixed Font | "Courier, regular, 8"
    Heading Font | "Arial, bold, 10"

Analysis (headings)
    Expect Capitalised Headings | Yes/No
    Expect Embedded Headings | Yes/No
    Expect Numbered Headings | Yes/No
    Expect Underlined Headings | Yes/No

Analysis (various)
    Attempt TABLE generation | Yes/No
    Look for MAIL and USENET headers | Yes/No
    Look for bullets | Yes/No
    Look for character encoding | Yes/No
    Look for diagrams | Yes/No
    Look for horizontal rulers | Yes/No
    Look for hanging paragraphs | Yes/No
    Look for indentation | Yes/No
    Look for preformatted text | Yes/No
    Look for quoted text | Yes/No
    Look for short lines | Yes/No
    Look for white space | Yes/No

Line/paragraph formatting
    Preserve file structure using <PRE> | Yes/No
    Preserve line structure | Yes/No
    Preserve new paragraph offset | Yes/No
The ConvType argument passed into the various Conversion methods is interpreted as follows. The default conversion type for most converters is CT_NORMAL.
CT_NORMAL | 1 | input is normal ASCII text
CT_TEXT_WITH_TAGS | 2 | input contains added HTML hyperlinks that should be preserved if possible (HTML conversion only)

Table types
CT_TEXT_TABLE | 3 | input is a plain text table. The converter will attempt to analyse the text into tables and rows
CT_TAB_DELIMITED_TABLE | 4 | input is tab-delimited text in a table. Each line will be treated as a table row, and each value placed in a cell by itself
CT_COMMA_DELIMITED_TABLE | 5 | input is comma-delimited text in a table. Each line will be treated as a table row, and each value placed in a cell by itself

Detagger types
CT_REMOVE_MARKUP | 6 | Detagger option. Markup will be selectively removed from a markup file
CT_CONVERT_TO_TEXT | 7 | Detagger option. Markup file will be converted to text

AscToTab types (output to RTF)
CT_TEXT_TABLE_RTF | 8 | Same as CT_TEXT_TABLE, but specifies RTF output (instead of HTML)
CT_TAB_DELIMITED_TABLE_RTF | 9 | Same as CT_TAB_DELIMITED_TABLE, but specifies RTF output (instead of HTML)
CT_COMMA_DELIMITED_TABLE_RTF | 10 | Same as CT_COMMA_DELIMITED_TABLE, but specifies RTF output (instead of HTML)
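When logging or debugging, it can help to turn a ConvType value back into its name. The helper below is a hypothetical sketch, not an API method. It uses the numeric values from the table above rather than the constants in api_defines.h so that it compiles on its own.

```cpp
#include <string>

// Hypothetical logging helper: map a ConvType value to its constant name.
// The numbers are copied from the conversion type table; real code would
// normally use the named constants from api_defines.h.
std::string conversionTypeName(long convType) {
    switch (convType) {
        case 1:  return "CT_NORMAL";
        case 2:  return "CT_TEXT_WITH_TAGS";
        case 3:  return "CT_TEXT_TABLE";
        case 4:  return "CT_TAB_DELIMITED_TABLE";
        case 5:  return "CT_COMMA_DELIMITED_TABLE";
        case 6:  return "CT_REMOVE_MARKUP";
        case 7:  return "CT_CONVERT_TO_TEXT";
        case 8:  return "CT_TEXT_TABLE_RTF";
        case 9:  return "CT_TAB_DELIMITED_TABLE_RTF";
        case 10: return "CT_COMMA_DELIMITED_TABLE_RTF";
        default: return "(unknown conversion type)";
    }
}
```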
The API can support both external files and internal string buffers as input sources and output targets. If you are converting a file into a file, or a buffer into a buffer, then you can do so directly by calling the correct conversion method (CONVERTER_DoFileConvert and CONVERTER_DoStringConvert respectively).
If you want to convert mixed types (file to buffer or vice versa) then you will need to call the Input and output specification methods to set up the input source and output target before calling the general purpose CONVERTER_DoConversion method.
You can convert files by calling the CONVERTER_DoFileConvert method.
The input filespec may include wildcards, and the output filespec may be just a directory name (or even blank). When converting files, by default the output file will be placed in the same folder, with the same name but with an extension suited to the output format.
You can convert between string buffers by calling the CONVERTER_DoStringConvert method.
If you're calling the DLL version of the API (e.g. from Visual Basic), then you'll need to call the "_Ptr" variant. If you do this, make sure you test the Result to check that the output buffer you supplied was large enough.
See comments in "Passing character data to and from the converter".
It's possible to convert from source files to string buffers, or to convert a string buffer into an output file. To do this you must first make calls to the desired Input and output specification methods and then call the general purpose method CONVERTER_DoConversion.
You should test the Result to ensure adequate inputs and outputs have been supplied.
If you want to do multiple conversions you may need to reset the input and output between calls.
When using the API an initial call must be made to CONVERTER_Allocate. This returns a new handle that is required to be passed to all subsequent API calls.
All calls to subsequent API methods return a success code (see API return values). This code indicates only whether or not the call to the API is valid. Normally you would expect this to return the value CONV_OK (i.e. 0).
For those API methods that could fail, the argument list contains a writable Result field. On exit the value of the Result will be set to one of the API result codes. When no error is encountered, this will be returned as R_SUCCESS (i.e. 0). The possible error values vary from method to method.
So calling software should first test the return value to check the API call was okay, and then test the Result code variable to see what error (if any) has occurred.
e.g.
    long Result    = R_SUCCESS;
    long APIStatus = CONV_OK;
    long Handle    = 0;
    ...
    Handle = CONVERTER_Allocate();

    APIStatus = CONVERTER_<method> (Handle, ..., Result, ...);
    if (APIStatus == CONV_OK && Result == R_SUCCESS)
    {
        cout << "It worked!" << endl;
    }
    ...
    APIStatus = CONVERTER_Free(Handle);
All of the API methods (except CONVERTER_Allocate which returns a handle) return a code indicating success or failure as follows
Status code       Value   Meaning
CONV_OK           0       Call to API was made. Check any API result codes to see whether it worked or not.
CONV_FAILED       1       Call to API failed
CONV_INVHANDLE    2       Invalid API handle passed in
Several of the API methods (especially the conversion methods) accept a "Result" variable, into which a result code is written. This result value is set as follows :-
Result code          Value   Meaning
R_SUCCESS            0       API call succeeded
R_NOTEXECUTED        1       API call not made. Usually indicates CONVERTER call was bad (e.g. invalid handle passed in)
R_NULLARG            2       Null or empty argument passed where not expected
R_BUFFERTOOSMALL     3       Write-back buffer is too small to receive result
R_POLICYLOADERROR    4       Failed to load policy
R_CANTFINDFILE       5       Can't find input file
R_CANTOPENFILE       6       Can't open output file
R_CONVERSIONFAILED   7       Error during conversion
R_NOINPUTDEFINED     8       No input file or data buffer supplied
R_NOOUTPUTDEFINED    9       No output file or data buffer supplied
Here are some suggestions on how to handle the various error codes :-
R_NOTEXECUTED
The API call was not executed. This usually indicates that the converter has detected that some or all of the calling arguments were passed incorrectly. Try using CONVERTER_DebugAPI and CONVERTER_DebugAPILogMessage to identify the error. If calling from Visual Basic, check that the correct argument types are defined and passed.
R_NULLARG
A NULL or empty argument was passed where one was expected. Treat as for R_NOTEXECUTED above.
R_BUFFERTOOSMALL
The supplied string buffer is too small to receive the requested data. Try again with a larger buffer. If you are attempting to read back the results of a conversion see Checking the conversion results when using (char *) pointers.
R_POLICYLOADERROR
Failed to load policy value. Either the policy name was incorrect (check with the documentation), or the value was invalid. Check for any error messages generated by the converter - see Capturing error messages.
R_CANTFINDFILE
The specified file couldn't be found.
R_CANTOPENFILE
The specified file couldn't be opened. For output files this could be because the directory doesn't exist, or because the output file already exists and is currently open in another application. This last error is quite common with RTF files if you are looking at the previous results in Word.
R_CONVERSIONFAILED
Some major error has been detected during conversion. Check for any error messages generated by the converter - see Capturing error messages.
R_NOINPUTDEFINED
You haven't yet specified an input file or supplied an input string buffer for the conversion. See Setting up the input and output destinations.
R_NOOUTPUTDEFINED
You haven't yet specified an output file or supplied an output string buffer for the conversion. See Setting up the input and output destinations.
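The two-stage test described above (return value first, then Result) can be wrapped in a pair of small helpers. This is a sketch under stated assumptions rather than anything supplied with the API: the function names are hypothetical, and the numeric values are copied from the two tables above so the example compiles without api_defines.h.

```cpp
#include <string>

// Hypothetical helper: true only when both the API return value and the
// Result argument report success (CONV_OK == 0 and R_SUCCESS == 0).
bool callSucceeded(long apiStatus, long result) {
    return apiStatus == 0 && result == 0;
}

// Hypothetical helper: a short description of the outcome for logging.
// The return value is examined first, then the Result code.
std::string describeOutcome(long apiStatus, long result) {
    if (apiStatus == 2) return "CONV_INVHANDLE: invalid API handle passed in";
    if (apiStatus != 0) return "CONV_FAILED: call to API failed";
    switch (result) {
        case 0:  return "R_SUCCESS: API call succeeded";
        case 1:  return "R_NOTEXECUTED: check the calling arguments";
        case 2:  return "R_NULLARG: null or empty argument passed";
        case 3:  return "R_BUFFERTOOSMALL: retry with a larger buffer";
        case 4:  return "R_POLICYLOADERROR: check the policy name and value";
        case 5:  return "R_CANTFINDFILE: input file not found";
        case 6:  return "R_CANTOPENFILE: output file could not be opened";
        case 7:  return "R_CONVERSIONFAILED: check the converter's error messages";
        case 8:  return "R_NOINPUTDEFINED: no input file or buffer supplied";
        case 9:  return "R_NOOUTPUTDEFINED: no output file or buffer supplied";
        default: return "unrecognised result code";
    }
}
```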
Many of the API methods require character data to be passed into and out of the methods. The converter code has been written in C++ and so using C++ string variables is the most natural and easy way to pass this data.
Unfortunately there are a number of situations in which using C++ string variables is not possible.
In these cases it is not possible to call API methods that have string arguments. To get round this, the API has two variants of any method that passes text data. The alternative function has the same name, but with "_Ptr" appended, because the non-string version uses character pointers instead of string as follows :-
Example:-

    DLL_DECLARE CONVERTER_ReadPolicyFile (long Handle,
                                          string PolicyFileName,
                                          long &Result);
becomes
    DLL_DECLARE CONVERTER_ReadPolicyFile_Ptr (long Handle,
                                              char *pPolicyFileName,
                                              long &Result);

Example:-

    DLL_DECLARE CONVERTER_GetPolicyValue (long Handle,
                                          string PolicyName,
                                          string &PolicyValue,
                                          long &Result);
becomes
    DLL_DECLARE CONVERTER_GetPolicyValue_Ptr (long Handle,
                                              char *pPolicyName,
                                              char *pPolicyValue,
                                              long &ValueBufferSize,
                                              long &Result);
Note in the above example that PolicyName is a read-only argument, while PolicyValue is an output argument, and so requires a buffer size passed.
The following code fragment shows how to set a policy value, and how to interrogate it again, using string variables.
    string PolicyName, PolicyValue;
    long APIStatus, Result;

    PolicyName  = "Default font";
    PolicyValue = "Arial, regular, 10";

    // set the policy value
    APIStatus = CONVERTER_SetPolicyValue (APIHandle, PolicyName, PolicyValue, Result);
    ...
    // read back a policy value
    string Value;
    APIStatus = CONVERTER_GetPolicyValue (APIHandle, "Page width", Value, Result);
    cout << "Page Width = " << Value.c_str() << endl;
Here's the same code using (char *) pointers and the "_Ptr" variants
    #define MAX_POLICYNAME_LEN 255
    #define MAX_POLICYVALUE_LEN 255

    char *pPolicyName  = new char [MAX_POLICYNAME_LEN];
    char *pPolicyValue = new char [MAX_POLICYVALUE_LEN];
    long APIStatus, Result;

    strcpy (pPolicyName, "Default font");
    strcpy (pPolicyValue, "Arial, regular, 10");

    // set the policy value
    APIStatus = CONVERTER_SetPolicyValue_Ptr (APIHandle, pPolicyName, pPolicyValue, Result);
    ...
    // read back a policy value
    strcpy (pPolicyName, "Page width");
    strcpy (pPolicyValue, "");
    long PolicyBufferSize = MAX_POLICYVALUE_LEN;
    APIStatus = CONVERTER_GetPolicyValue_Ptr (APIHandle, pPolicyName, pPolicyValue, PolicyBufferSize, Result);

    // need to add extra checks on _Result_ to see if buffer was big
    // enough
    cout << "Page Width = " << pPolicyValue << endl;

    // release the buffers once finished with them
    delete [] pPolicyName;
    delete [] pPolicyValue;
When using the "_Ptr" variant of the method to set up an output buffer, there is the possibility that the buffer you supply will turn out to be too small when you come to do the conversion.
When this situation arises, the Result returned by the conversion method will be R_BUFFERTOOSMALL.
Rather than requiring you to do the conversion a second time, with a bigger buffer, the API will hold onto an internal copy of the results, which you can retrieve any time up until you start on the next conversion.
To access this you first make a call to CONVERTER_GetOutCharArraySize to find out how large a buffer is required to receive this data. Create a buffer of the required size, and then call CONVERTER_GetOutCharArray_Ptr to actually retrieve the conversion results.
    APIStatus = CONVERTER_DoConversion (APIHandle, ConvertType, Result);
    if (APIStatus != CONV_OK)
        return EXIT_FAILED;

    // Conversion worked, but the output buffer may be too small. Check
    // this, and if necessary re-allocate the buffer. The converter will
    // internally still hold onto a copy of the output until you call the
    // free function, so you will be able to simply ask for the result once
    // you supply a big enough buffer
    if (Result == R_BUFFERTOOSMALL)
    {
        long Length = 0;

        // Find out what size buffer is required
        APIStatus = CONVERTER_GetOutCharArraySize (APIHandle, Length, Result);
        if (Result == R_SUCCESS)
        {
            char *pBigBuffer = new char [Length];

            // read back the result into the new, big enough, buffer
            APIStatus = CONVERTER_GetOutCharArray_Ptr (APIHandle, pBigBuffer, Length, Result);
            if (Result == R_SUCCESS)
                cout << pBigBuffer << endl;

            delete [] pBigBuffer;
        }
    } // if buffer was too small
New in version 2.3.2
The API was not originally designed with Unicode in mind, and support for
Unicode text has been added gradually over time. As a result, earlier
versions of the API may not support all the features described in this
manual. If in doubt, please contact JafSoft for details.
New in version 2.3.2
Traditional single-byte character sets interpret the 8-bit
character values (128-255) as special characters. On a Russian
machine these would be interpreted as Cyrillic, but on a different
machine they could be read (wrongly) as Arabic, and vice versa. On
most English-based PCs, the 8-bit characters are used for the accented
characters found in certain European languages, so a Russian text would
appear to have lots of accented 'i's, 'e's and 'a's.
Unicode is a way of implementing text that supports multiple types of character sets at the same time so that - for example - it is possible to display Chinese and Cyrillic on the same page unambiguously. It does this by allocating each character in each language a unique code value, so that codes used for Cyrillic characters no longer overlap and conflict with those assigned to Arabic.
However, these code values are in most cases larger than can be represented in a single byte. As a result a way has to be chosen to represent each character by one or more bytes.
The following Unicode representations are commonly used:
UTF-8
Each character is represented by 1 to 4 bytes, depending on which range its Unicode code value falls into (most characters in everyday use need at most 3). This has the advantage that all ASCII characters are a single byte, so for example every character of the HTML tags in a document is represented by a single byte. It also means there are no null bytes contained in the text, which makes it easier to write software that works with this text.
UTF-16
Each character is represented by a 2-byte pair (characters outside the Basic Multilingual Plane require two such pairs). For most characters the 2-byte pair is just the numerical representation of the Unicode value of the character. This makes the files easier to interpret, but also means that the byte order depends on how the machine stores its bytes - i.e. is the machine big-endian or little-endian. Because ASCII characters have a Unicode value less than 128, they map onto byte pairs in which one of the bytes is null. Because each character requires two bytes, a single byte wrongly inserted into a UTF-16 stream will render all the text that follows as gibberish.
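The variable-length nature of UTF-8 can be seen in a minimal C++ sketch of the standard encoding rules. This is not JafSoft code, and the function name EncodeUtf8 is invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a single Unicode code point as UTF-8.
// The number of bytes produced depends on the range the value falls into.
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);   // values above this are outside Unicode

    std::string out;
    if (cp < 0x80) {                       // ASCII: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte of a multi-byte sequence has its top bit set, which is why UTF-8 text never contains a null byte unless the text itself contains the null character.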
Files that contain Unicode can identify themselves by inserting a "Byte Order Mark" (BOM) at the top of the file. This is a two-byte marker for UTF-16 files and a three-byte marker for UTF-8 files. Modern applications will test for this byte marker and if present will then know how to interpret the contents of the file. For example Notepad as supplied with Windows XP can do this, whereas Notepad as supplied with Windows 98 could not.
In UTF-16 each character is represented by two bytes, and computers can store a two-byte value in different ways (known as "big-endian" and "little-endian"). Each operating system uses one method or another and it isn't usually an issue, but when Unicode files get passed from one machine to another, this becomes important. The BOM allows the two forms of UTF-16 (known as "UTF-16BE" and "UTF-16LE") to be distinguished.
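A BOM check of the kind described above could look like the following C++ sketch. The byte values are the standard Unicode BOMs; the names Bom and DetectBom are invented for illustration, and this is not a description of how the converter itself performs the test.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16BE, Utf16LE };

// Inspect the first bytes of a buffer for a Byte Order Mark.
// Note: this sketch does not consider UTF-32, whose little-endian BOM
// also starts with the bytes FF FE.
Bom DetectBom(const std::string &data)
{
    auto byteAt = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    if (data.size() >= 3 &&
        byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return Bom::Utf8;        // three-byte UTF-8 marker
    if (data.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return Bom::Utf16BE;     // two-byte marker, big-endian order
    if (data.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return Bom::Utf16LE;     // two-byte marker, little-endian order
    return Bom::None;
}
```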
New in version 2.3.2
Internally the API makes extensive use of the C runtime library, and
so effectively assumes that the text it is processing is free from null
characters. This means that the API cannot handle UTF-16 internally
in its native form, as the two-byte implementation contains nulls in one
of the bytes for each ASCII character present.
This means that the API will convert any detected Unicode characters into UTF-8.
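The following C++ sketch shows why null bytes are a problem for C runtime string functions, and what a UTF-16 to UTF-8 conversion involves. It is only an illustration of the standard encodings, not the converter's internal code; all function names are invented, and unpaired surrogates are not validated.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Length as seen by the C runtime: counting stops at the first null byte.
std::size_t CRuntimeLength(const std::string &bytes)
{
    return std::strlen(bytes.c_str());
}

// Append one code point to a UTF-8 string.
static void AppendUtf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw UTF-16 bytes (BOM already removed) to UTF-8.
std::string Utf16BytesToUtf8(const std::string &bytes, bool bigEndian)
{
    // Read the 16-bit unit starting at byte offset i.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
        return bigEndian ? ((b0 << 8) | b1) : ((b1 << 8) | b0);
    };

    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        // A high surrogate followed by a low surrogate forms one
        // character above U+FFFF.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t lo = unit(i + 2);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;   // the low surrogate has been consumed
            }
        }
        AppendUtf8(out, cp);
    }
    return out;
}
```

The text "AB" held as little-endian UTF-16 occupies four bytes, but CRuntimeLength reports 1 because the second byte is null. After conversion to UTF-8 the same text is two bytes with no nulls.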
New in version 2.3.2
The API considers that the input text is Unicode under the following
circumstances
New in version 2.3.2
For file-to-file conversion, the API will normally detect the presence
of Unicode by spotting the Byte Order Marks (BOM) at the top of the
input file.
Alternatively, if the input file is an HTML file, any HTML entities that map onto Unicode characters will mark the input as being Unicode.
Internally the output text will be calculated as UTF-8 encoded text. When this is output to file, the UTF-8 BOM is added to the output file.
Thus any type of properly identified Unicode file on input will result in a valid UTF-8 file being created as output.
New in version 2.3.2
When calling the API to do string-to-string conversions, it is unlikely
that the Byte Order Marks (BOM) that identify files as being Unicode
will be present. This means you will probably have to "tell" the API
that the text is Unicode. How you do this depends on the way the text
is encoded.
See Using the "input text encoding" policy and Using the "output text encoding" policy
New in version 2.3.2
The program has the ability to detect Unicode files on input if a Byte
Order Mark (BOM) is present. The Detagger API also has the ability -
under some circumstances - to detect that Unicode HTML entities are present
in the input text.
However in files without the BOMs, or when passed string data as input, the software may fail to detect the input is Unicode.
In such circumstances this policy allows you to tell the software that the input should be treated as Unicode. The possible values for this policy are
auto        automatic detection (the default)
UTF8        UTF-8
UTF16-BE    UTF-16 "Big Endian"
UTF16-LE    UTF-16 "Little Endian"
New in version 2.3.2
When outputting to file the API will create a Unicode (UTF8) file
whenever it detects (or is told) that the input contains Unicode.
However under some circumstances it may be necessary to use the API to output to a UTF16 string, as opposed to a UTF8 or ASCII string.
In those circumstances this policy - which is only meant for use with APIs - allows you to specify the output encoding of the text returned by the API. As with the "input text encoding" policy the possible values are
auto        automatic detection (the default)
UTF8        UTF-8
UTF16-BE    UTF-16 "Big Endian"
UTF16-LE    UTF-16 "Little Endian"
New in version 2.3.2
This table summarises how you should use the API when specifying
the input and/or output locations of Unicode text.
| | UTF-8 | UTF-16 |
| Input is file with BOM | Just pass in the file name. | Just pass in the file name. |
| Input is file without BOM | Pass in file name and set the "input text encoding" policy to be "UTF-8". | Pass in file name and set the "input text encoding" policy to be either "UTF-16LE" or "UTF-16BE" according to the endian-ness. |
| Input is a string | Call string or "_Ptr" method and set the "input text encoding" policy to "UTF-8". | Call the "_utf16" method and set the "input text encoding" to "UTF-16LE" or "UTF-16BE" according to the endian-ness. |
| Output to file | Just pass in file name. Output will be a UTF-8 file. | Just pass in the file name. Output will be a UTF-8 file. |
| Output to string | Call string or "_Ptr" method to get the result. Output will be UTF-8 text. | Call the "_utf16" method to get the result. Output will be UTF16 with the endian-ness you requested. |
The API can generate a number of progress messages, as well as error messages that will help diagnose any problems.
When calling the API from C++, it is possible to establish some callback
routines that get called each time a message would be output to the output
or error streams.
See Error reporting methods
When calling the API from other languages, such as Visual Basic, this
level of integration isn't possible. In that situation you might want to
use the debug options to switch on logging. In this way the output can
be diverted into a log file.
See Debugging methods.
Finally, after the conversion is complete, you can fetch the last error
message displayed. This isn't always useful as the last error message
isn't always the most significant, but it may help.
See CONVERTER_GetLastMessage
Evaluation copies of all of the APIs are available online at http://www.jafsoft.com/developers/api_demos.html
There you can download an evaluation copy, which also contains a demonstration kit (DemoAPI.zip). The demonstration kit includes sample code for C++ and Visual Basic, showing how the converter can be called from your code. It also contains example files for other languages supplied by users who have managed to integrate the APIs into their systems. These other files are supplied on an "as is" basis, and may not always be up to date with the current API implementation.
These evaluation copies include DLLs that are not time-limited, but which have other limitations, e.g. limits on how many files can be converted in a wildcard operation, and watermarking the output data and converting occasional words or lines into UPPER case. It is hoped that these limitations should not overly interfere with your evaluation of the API. If you feel they do, please email info<at>jafsoft.com indicating your reasons, and we will see what we can do (replace "<at>" by "@").
Should you decide to register the API, you will be supplied with full versions of the .DLL and .LIB files which do not have these built-in restrictions.
Before the converter is used, a call must be made to CONVERTER_Allocate. This will create a new converter object and return a Handle that must be passed to all subsequent API calls, so that they know which converter object is to be used.
Once you have finished, you should call CONVERTER_Free to release the converter object. This will free the memory and other resources allocated to the API object.
DLL_DECLARE CONVERTER_Allocate ();
This method must be called first to allocate an API resource. It should return a non-zero Handle if it succeeds, and that Handle should be passed in to all remaining API calls.
DLL_DECLARE CONVERTER_Free (long &Handle);
After all conversions are complete, this method should be called to release the resource. The resource is freed, and all memory allocated during the conversion will be released. Since the API typically keeps a copy of the last conversion, this can be a variable amount of memory, comparable to the size of the largest file converted.
On exit the Handle will have been reset to 0, preventing its reuse in later API calls.
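In C++ the allocate/free pairing can be tied to an object's lifetime so that CONVERTER_Free is never forgotten. The sketch below is one possible approach, not something supplied with the API. To keep it compilable without the DLL it defines stand-in stubs for CONVERTER_Allocate and CONVERTER_Free; their return types, the handle value and the g_liveConverters counter are assumptions made for this illustration. In real code you would remove the stubs and include the JafSoft header instead. The class name ConverterHandle is also invented.

```cpp
// --- Stand-in stubs (assumptions, NOT the real JafSoft implementation) ---
static long g_liveConverters = 0;          // hypothetical counter for this demo

long CONVERTER_Allocate()
{
    ++g_liveConverters;
    return 1;                              // any non-zero value means success
}

long CONVERTER_Free(long &Handle)
{
    if (Handle != 0) {
        --g_liveConverters;
        Handle = 0;                        // the manual says the Handle is reset to 0
    }
    return 0;
}
// --- End of stubs ---------------------------------------------------------

// Owns one converter object and releases it automatically.
class ConverterHandle
{
public:
    ConverterHandle() : handle_(CONVERTER_Allocate()) {}

    ~ConverterHandle()
    {
        if (handle_ != 0)
            CONVERTER_Free(handle_);
    }

    // A handle identifies a single converter object, so copying is disabled.
    ConverterHandle(const ConverterHandle &) = delete;
    ConverterHandle &operator=(const ConverterHandle &) = delete;

    bool valid() const { return handle_ != 0; }   // did allocation succeed?
    long get() const { return handle_; }          // pass this to other API calls

private:
    long handle_;
};
```

A ConverterHandle declared inside a function is released when the function returns, including on early error returns, so the memory held for the last conversion does not linger.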
The conversion process can be fine tuned using "policies". Policies are program options that can be used to influence the conversion. Which policies are available varies from converter to converter, although some policies are supported by multiple converters.
You should see the program's documentation and the Policy Manual for details of individual policies.
In each case a policy consists of a "policy phrase" and a value. Policies can be placed in a text file, one per line, known as a policy file. The API supports the loading of existing policy files, and/or the setting of individual policies.
DLL_DECLARE CONVERTER_ResetPolicies (long Handle);
When called this will reset all policies back to default values. You might want to call this between conversions using the same API if you wanted to apply different policies each time. It wouldn't be necessary if you wanted to apply the same policies each time.
DLL_DECLARE CONVERTER_ReadPolicyFile (long Handle, string PolicyFileName, long &Result);
DLL_DECLARE CONVERTER_ReadPolicyFile_Ptr (long Handle, char *pPolicyFileName, long &Result);
These methods accept the name of a policy file, and will load the policies in that file into the API object. You should test the Result to check that the file was found okay.
DLL_DECLARE CONVERTER_WritePolicyFile (long Handle, string PolicyFileName, long ShowAllPolicies, long &Result);
DLL_DECLARE CONVERTER_WritePolicyFile_Ptr (long Handle, char *pPolicyFileName, long ShowAllPolicies, long &Result);
These methods allow you to dump the actual policies used during a conversion. This can be used to check that the policies you set were indeed used, or to see what values the analysis policies (such as page width) were set to by the API. Sometimes looking at post-conversion policies helps diagnose problematic conversions.
The ShowAllPolicies value should be set as follows
| Symbol | Value | Explanation |
| INCREMENTAL_POLICY_FILE | 0 | save only those policies that were loaded or changed to file |
| FULL_POLICY_FILE | 1 | save all policies to file. Only recommended for diagnostic and documentation purposes |
You can elect to show (almost) all policy values, or only those which have been "Loaded" and "Edited". The "almost" refers to the fact that only policies which may be meaningfully re-loaded from file are saved.
DLL_DECLARE CONVERTER_SetPolicyValue (long Handle, string PolicyName, string TextValue, long &Result);
DLL_DECLARE CONVERTER_SetPolicyValue_Ptr (long Handle, char *pPolicyName, char *pTextValue, long &Result);
Sets an individual policy by name. You should test the value of Result to see if the call worked. The commonest cause of failure would be a typo in the policy name.
See Customising the conversion using policies
DLL_DECLARE CONVERTER_GetPolicyValue (long Handle, string PolicyName, string &PolicyValue, long &Result);
DLL_DECLARE CONVERTER_GetPolicyValue_Ptr (long Handle, char *pPolicyName, char *pValue, long &ValueBufferSize, long &Result);
Interrogates the current value of a named policy. You might use this, for instance, to ask the program what it calculated the page width to be after the conversion.
Check the value of Result to ensure the PolicyName was valid. If an error is detected the value is set to
"*** GetPolicyValue Error ***";
to distinguish it from any other value.
See Customising the conversion using policies
The API can accept input from either file or a passed string, and can output the results to either a file or a string buffer. You can use any combination you wish, but as a special case if you only supply an input filename, the converter will default to creating an output file in the same folder, with the same name, but a different extension (one more suited to the output format).
Depending on the conversion method called, you can either pass in filenames or string buffers to the conversion method directly, or you can set these up before conversion.
If you want to do mixed conversion, (e.g. from file into string), then you'll need to call these methods first to set up the input and output options.
If you are doing multiple conversions with the same API object, you may need to reset the input source and output targets between conversions.
DLL_DECLARE CONVERTER_ResetSources (long Handle, long &Result);
When called this will reset to null the input source and output targets for the API. This means you will either have to set up new locations before the next conversion, or choose a conversion method which allows you to specify those sources. Failure to do so will result in an error message.
DLL_DECLARE CONVERTER_ResetInputSource (long Handle, long &Result);
When called this will nullify the input source. You will need to specify a new source before the next conversion, or choose a conversion method that allows you to specify a source.
DLL_DECLARE CONVERTER_ResetOutputSource (long Handle, long &Result);
When called this will nullify the output target. You will need to specify a new target before the next conversion, or choose a conversion method that allows you to specify a target.
The exception is file conversion, where a default output file can be inferred from the input file name (same folder, same name, different extension).
DLL_DECLARE CONVERTER_SetInputString (long Handle, string Instring, long &Result);
DLL_DECLARE CONVERTER_SetInputString_Ptr (long Handle, char *pInstring, long &Result);
When called this sets up the input for the next conversion to be the passed string data. Once you have also set up the output target, you can then call CONVERTER_DoConversion.
If the output is also a string buffer, you should consider calling CONVERTER_DoStringConvert which negates the need to call this method first.
DLL_DECLARE CONVERTER_SetOutputString (long Handle, string &Outstring, long &Result);
DLL_DECLARE CONVERTER_SetOutputString_Ptr (long Handle, char *pOutputString, long OutputBufferSize, long &Result);
When called this sets up the output target for the next conversion to be the passed string buffer. Once you have also set up the input source, you can then call CONVERTER_DoConversion.
If the input is also a string buffer, you should consider calling CONVERTER_DoStringConvert which negates the need to call this method first.
When calling the "_Ptr" version of this method, be aware that the passed
buffer may end up being too small.
See the discussion in Passing character data to and from the converter
DLL_DECLARE CONVERTER_SetInputFilename (long Handle, string Filename, long &Result);
DLL_DECLARE CONVERTER_SetInputFilename_Ptr (long Handle, char *pFilename, long &Result);
When called this sets up the input for the next conversion to be the specified file. Once you have also set up the output target, you can then call CONVERTER_DoConversion.
If the output is also a file, you should consider calling CONVERTER_DoFileConvert which negates the need to call this method first.
DLL_DECLARE CONVERTER_SetOutputFilename (long Handle, string Filename, long &Result);
DLL_DECLARE CONVERTER_SetOutputFilename_Ptr (long Handle, char *pFilename, long &Result);
When called this sets up the output target for the next conversion to be the specified file. Once you have also set up the input source, you can then call CONVERTER_DoConversion.
If the input is also a file, you should consider calling CONVERTER_DoFileConvert which negates the need to call this method first.
See discussion in performing conversion between files
DLL_DECLARE CONVERTER_GetOutCharArraySize ( long Handle, long &Size, long &Result );
When using (char *) buffers with the API there is the possibility that the buffer passed to the API may be too small. This method can be called after the conversion to determine the size of buffer required to receive the results.
See Checking the conversion results when using (char *) pointers
DLL_DECLARE CONVERTER_GetOutCharArray_Ptr ( long Handle, char *pArray, long &OutArraySize, long &Result);
When using (char *) buffers with the API there is the possibility that the buffer passed to the API may be too small. This method can be called after the conversion to retrieve the results of the last conversion. A call should first be made to CONVERTER_GetOutCharArraySize to determine how large the buffer passed into this method should be, otherwise the Result may again be R_BUFFERTOOSMALL.
See Checking the conversion results when using (char *) pointers
There are a number of methods to actually perform the conversion, depending on whether or not you want to set up the input source and output destination before calling the execution method.
| CONVERTER_DoFileConvert | Call this method if you want to convert an input file into an output file |
| CONVERTER_DoStringConvert | Call this method if you want to convert from one string buffer into another |
| CONVERTER_DoConversion | For all other conversions, use this method. You will need to set up the input and output locations by calling other methods before calling this one. |
See Setting up the input and output destinations
In each case you should test that the API method returns the value CONV_OK (see API return values), and that the Result argument is returned as R_SUCCESS (see API result codes).
If you are using a string buffer as the output location, and are using the
"_Ptr" variants of methods, then bear in mind that the buffer may prove
to be too small.
See the discussion in Checking the conversion results when using (char *) pointers
Bear in mind that while the conversion may appear to work, there may still be aspects of the conversion which are reported as conversion problems during the conversion. These will be reported as errors and warnings via the error reporting methods. To see those messages you will either need to establish error reporting callback functions (available via C++ only), or enable some debugging.
See Error reporting methods and Debugging methods.
DLL_DECLARE CONVERTER_DoConversion ( long Handle, long ConvType, long &Result);
This method should be called to execute a "mixed mode" conversion, i.e. one in which the input is a file, and the output is a string buffer or vice versa.
See also the discussion in Conversion methods.
DLL_DECLARE CONVERTER_DoFileConvert ( long Handle, long ConvType, string InFilename, string OutFilename, long &Result);
DLL_DECLARE CONVERTER_DoFileConvert_Ptr ( long Handle, long ConvType, char *pInFilename, char *pOutFilename, long &Result);
This method should be called to execute a file conversion, i.e.
one in which both the input and outputs are files.
See performing conversion between files
See also the discussion in Conversion methods.
DLL_DECLARE CONVERTER_DoStringConvert( long Handle, long ConvType, string InText, string OutText, long &Result);
DLL_DECLARE CONVERTER_DoStringConvert_Ptr ( long Handle, long ConvType, char *pInText, char *pOutText, long &OutTextSize, long &Result);
This method should be called to execute a string conversion, i.e. one in which both the input and outputs are string buffers. If you are using the _(char *)_ method for passing text (using the "_Ptr" variant), you'll need to check the output buffer was large enough (see Checking the conversion results when using (char *) pointers).
See also the discussion in Conversion methods.
During the conversion the API will generate a number of messages indicating progress and problems with the conversion itself. These messages won't normally represent a total failure of conversion, but may act as warnings that some aspects of the conversion may not have proceeded as expected.
In C++, it is possible to establish callback functions to capture and report these messages.
When calling the API from other programming languages these techniques cannot be used, and you would need to use the various Debugging methods that are available instead.
DLL_DECLARE CONVERTER_SetErrorFn (long Handle, void (*pErrorFn) (const char *));
This method can be used when calling the API from C++ to capture messages that would be sent to the "error" stream. The supplied callback routine will be called each time that an "error" message is generated.
DLL_DECLARE CONVERTER_SetOutFn (long Handle, void (*pErrorFn) (const char *));
This method can be used when calling the API from C++ to capture messages that would be sent to the "output" stream. The supplied callback routine will be called each time that an "informational" message is generated.
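Because the callbacks are plain function pointers taking a (const char *), they cannot carry state of their own, so a common approach is to have them append to storage that the rest of the program can inspect after the conversion. The sketch below illustrates this; the names OnConverterError, OnConverterOutput and the two message vectors are invented for the example, and any calling-convention requirements of the DLL are not covered here. The registration calls are shown only as comments because they need the JafSoft header and library.

```cpp
#include <string>
#include <vector>

// Storage for messages captured during a conversion (illustrative names).
static std::vector<std::string> g_errorMessages;
static std::vector<std::string> g_infoMessages;

// Matches the callback type void (*)(const char *) used by CONVERTER_SetErrorFn.
void OnConverterError(const char *pMessage)
{
    // Guard against a null pointer before building the string.
    g_errorMessages.push_back(pMessage ? pMessage : "");
}

// Matches the callback type void (*)(const char *) used by CONVERTER_SetOutFn.
void OnConverterOutput(const char *pMessage)
{
    g_infoMessages.push_back(pMessage ? pMessage : "");
}

// Registration, once a Handle has been allocated (not compiled in this sketch):
//
//   CONVERTER_SetErrorFn (APIHandle, OnConverterError);
//   CONVERTER_SetOutFn   (APIHandle, OnConverterOutput);
```

After the conversion returns, g_errorMessages holds every "error" message in the order it was generated, which avoids relying on the last message alone.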
A number of methods exist to help you debug your use of the API, and to direct the output of the API to a log file.
DLL_DECLARE CONVERTER_DebugAPI (long Value);
This method is used to switch on/off the generation of debug messages each time an API method is called. These messages will show calls to the API, and the arguments passed. Some API calls will produce multiple entries, for example the "_Ptr" variants of methods often call their _string_ based equivalents.
This call can be useful to help diagnose problems with the API, often caused by the incorrect passing of data, especially text arguments.
A _Value_ of 1 switches on the messages; 0 switches them off. They are off by default.
DLL_DECLARE CONVERTER_DebugAPILogMessage (long Value, char *pLogName);
This method can be used to direct messages generated by the API into a log file. If enabled all messages generated by the API (including any Debug messages if CONVERTER_DebugAPI has been called) will be output to a log file. You may need to specify a complete directory path in the filename, as relative filenames may not work.
A _Value_ of 1 switches on the logging; 0 switches it off. It is off by default.
DLL_DECLARE CONVERTER_GetLastMessage (string &Message, long &Result);
DLL_DECLARE CONVERTER_GetLastMessage_Ptr (char *pMessage, long &MessageSize, long &Result);
These methods may be used to retrieve the last message generated by the API. This can be useful in diagnosing problems, although sometimes the last message may not be the most important, and you may need to use some other techniques to capture all error messages generated during the conversion.
Converted from a single text file by AscToHTM © 2001-2004 John A Fotheringham