NRSI: Computers & Writing Systems
TECkit: Notes for Developers
For mapping authors
Question: Why do all my spaces (or line ends) get mangled?
Answer: When mapping between bytes and Unicode, every character code that you are interested in needs to be mapped appropriately by the table. If you map only the visible characters, or worse still, only those where your legacy encoding differed from standard ASCII, everything else will be mapped to the default replacement character, typically U+FFFD. This applies to characters such as space, tab, carriage-return and line-feed just as to printable characters.
Note that byte-only or Unicode-only mappings (or passes within multi-pass mapping) work differently: they will pass unmapped characters through unchanged. But in byte-Unicode mappings, which are the major focus of TECkit, anything that is not explicitly mapped will be replaced by the default code.
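The effect can be reproduced with a toy model. The sketch below is not TECkit code and does not use the TECkit mapping language; it only assumes that a byte-to-Unicode table behaves like a lookup in which every missing entry yields the default replacement character. The function and table names are invented for illustration.

```python
# Toy model of a byte-to-Unicode lookup; this is not TECkit code.
REPLACEMENT = "\uFFFD"  # default replacement character when mapping to Unicode


def map_bytes_to_unicode(data, table):
    """Map each byte through `table`; unmapped bytes become REPLACEMENT."""
    return "".join(table.get(b, REPLACEMENT) for b in data)


# A hypothetical table that covers only the ASCII letters.
letters_only = {b: chr(b) for b in range(0x41, 0x5B)}
letters_only.update({b: chr(b) for b in range(0x61, 0x7B)})

# The same table with tab, line-feed, carriage-return and space added.
with_whitespace = dict(letters_only)
for b in (0x09, 0x0A, 0x0D, 0x20):
    with_whitespace[b] = chr(b)

sample = b"two words\n"
mangled = map_bytes_to_unicode(sample, letters_only)
intact = map_bytes_to_unicode(sample, with_whitespace)

print(repr(mangled))
print(repr(intact))
```

With the letters-only table, the space and the line-feed both come out as U+FFFD; once the four whitespace codes are in the table, the text survives unchanged.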
Question: The compiler reports "code space mismatch"; what does that mean?
Answer: This (rather cryptic) error message means that the mapping description includes multiple passes, and the output of one pass is not in the same "code space" (either bytes or Unicode) as the input of the next pass.
The compiler reports this error when it reaches the end of the second of the incompatible passes (which may be the very end of the file); the actual problem lies at the beginning of the pass, where it is chained with the preceding one.
One subtle way this error can arise is if you intend to have a single pass in the mapping, and use an explicit pass(Byte_Unicode) statement, but accidentally place some part of the mapping content before the pass statement. Any Class definitions or mapping rules found before any pass statement will implicitly begin a Byte/Unicode mapping pass. (This is a legacy of the original, single-pass TECkit system.) When your explicit pass(Byte_Unicode) statement is read, this begins a second pass, and you can't chain two Byte/Unicode passes: the Unicode output of the first can't become Byte input to the second.
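The chaining rule itself is simple enough to model. The following sketch is only an illustration of the check described above, not the compiler's actual logic; the labels "Byte" and "Unicode" stand for byte-only and Unicode-only passes, and the function name is hypothetical.

```python
# Toy model of the pass-chaining rule; not the TECkit compiler's code.
# Each pass kind is described by its (input, output) code space.
PASS_SPACES = {
    "Byte": ("bytes", "bytes"),            # byte-only pass
    "Unicode": ("unicode", "unicode"),     # Unicode-only pass
    "Byte_Unicode": ("bytes", "unicode"),  # byte-to-Unicode pass
}


def first_mismatch(passes):
    """Return the index of the first pass whose input code space differs
    from the previous pass's output, or None if the chain is consistent."""
    for i in range(1, len(passes)):
        previous_output = PASS_SPACES[passes[i - 1]][1]
        this_input = PASS_SPACES[passes[i]][0]
        if previous_output != this_input:
            return i
    return None


# Rules placed before an explicit pass(Byte_Unicode) statement start an
# implicit Byte/Unicode pass, so the file ends up with two such passes.
accidental = ["Byte_Unicode", "Byte_Unicode"]
intended = ["Byte", "Byte_Unicode", "Unicode"]

print(first_mismatch(accidental))
print(first_mismatch(intended))
```

In the accidental case the mismatch is found at the second pass, which matches where the problem really lies: at the point where that pass is chained to the one before it.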
For application developers
Question: How big of an output buffer should I use when calling Convert or Flush?
Answer: In general, you can't be sure; mappings are not necessarily one-to-one. Unless the input is ridiculously large, it's probably best to allocate a buffer that would allow for a 50% or even 100% increase in the number of character codes; however, you must still be prepared for the possibility that the engine will return kStatus_OutputBufferFull. If this happens, either enlarge your buffer or clear out the output that has been generated so far—write it out, send it to the next process, or whatever is appropriate—so that you can restart at the beginning of your buffer.
If you can't afford such a generous buffer, you can use a smaller one and expect to do more looping. But your buffer must be at least big enough for the engine to perform a complete unit of conversion work, and this may result in a sequence of characters being output, not just a single code.
Question: Why do I get kStatus_OutputBufferFull, when it isn't?
Answer: When you call Convert or Flush, the TECkit engine does not necessarily use all the space in your output buffer. It may return kStatus_OutputBufferFull even though there is some space remaining.
There are two reasons for this. First, the engine never puts a partial Unicode character into the output buffer. A single Unicode character may require up to 4 bytes, depending on the encoding form and the particular character, so if fewer than 4 bytes are available, the engine may report that the buffer is full because the next character it wants to write won't fit in its entirety.
Second (and this applies even when mapping to bytes), the engine does not like to return with an input code partially processed. Processing a single input code may produce multiple characters of output, either because the input code itself maps to a sequence, or because it supplies the context needed to determine the mapping of preceding codes, which the engine had buffered because their mappings depended on following context.
So the engine may report kStatus_OutputBufferFull even when a considerable number of bytes remain unused. In extreme cases, unlikely in real-life mappings, this could be several hundred bytes, but cases where a dozen or more bytes are needed in the output buffer to process a single input code definitely occur. This status code always means that you need to create more output buffer space, either by enlargement or by clearing previous output, even if your buffer was not completely full.
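The client-side loop this implies can be sketched with a stand-in converter. Nothing below calls the real TECkit API: toy_convert, convert_all, TABLE and the two status constants are invented for illustration, and the only assumption carried over from the engine is that it never writes part of a conversion unit and reports "buffer full" when the next unit does not fit.

```python
# Toy stand-in for a converter; none of these names are TECkit APIs.
STATUS_NO_ERROR = 0
STATUS_OUTPUT_BUFFER_FULL = 1  # plays the role of kStatus_OutputBufferFull

# Hypothetical table: one input code may produce several output bytes.
TABLE = {ord("a"): b"A", ord("x"): b"XYZ"}


def toy_convert(table, data, start, out_capacity):
    """Convert data[start:] until finished or the next unit will not fit.
    Returns (status, number of input codes consumed, output bytes)."""
    out = bytearray()
    pos = start
    while pos < len(data):
        unit = table[data[pos]]
        if len(out) + len(unit) > out_capacity:
            # Never emit a partial unit; stop with space possibly left over.
            return STATUS_OUTPUT_BUFFER_FULL, pos - start, bytes(out)
        out += unit
        pos += 1
    return STATUS_NO_ERROR, pos - start, bytes(out)


def convert_all(table, data, out_capacity):
    """Call toy_convert repeatedly, draining the output after each call and
    enlarging the buffer only when a call makes no progress at all."""
    chunks = []
    pos = 0
    while True:
        status, in_used, out = toy_convert(table, data, pos, out_capacity)
        if out:
            chunks.append(out)  # "write it out", then reuse the buffer
        pos += in_used
        if status == STATUS_NO_ERROR:
            return b"".join(chunks), chunks
        if in_used == 0 and not out:
            out_capacity *= 2  # too small for even one unit of work


result, chunks = convert_all(TABLE, b"aaxa", 4)
print(result, chunks)
```

With a 4-byte buffer, the first call stops after writing only 2 bytes, because the 3-byte unit for "x" will not fit in the remaining space. That is the "full when it isn't" situation described above. The same loop also copes with a 2-byte buffer, which it has to enlarge once before the "x" can be converted.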
Question: How can I detect if there were characters TECkit couldn't map?
Answer: By default, the TECkit engine maps all input characters to something in the output; characters for which no explicit mapping was given in the table will result in the "default replacement character". (This is 0x3F ASCII '?' by default when mapping to bytes, and U+FFFD REPLACEMENT CHARACTER by default when mapping to Unicode, but the mapping table author can change these values.)
Beginning with TECkit version 2.1, released 29 March 2004, the engine has new conversion APIs (the TECkit_ConvertBufferOpt and TECkit_FlushOpt functions; see the TECkit_Engine.h header file). These allow the client application to control the behavior when unmappable input is encountered. The options are: