This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. Please consult our other sites for more current information: software.sil.org, ScriptSource, FDBP, and silfontdev



Short URL: https://scripts.sil.org/TECkitDevNotes

TECkit: Notes for Developers

Jonathan Kew, 2004-03-30

For mapping authors

Question: Why do all my spaces (or line ends) get mangled?

Answer: When mapping between bytes and Unicode, every character code that you are interested in needs to be mapped appropriately by the table. If you map only the visible characters, or worse still, only those where your legacy encoding differed from standard ASCII, everything else will be mapped to the default replacement character, typically U+FFFD. This applies to characters such as space, tab, carriage-return and line-feed just as to printable characters.

Note that byte-only or Unicode-only mappings (or passes within a multi-pass mapping) work differently: they will pass unmapped characters through unchanged. But in byte-Unicode mappings, which are the major focus of TECkit, anything that is not explicitly mapped will be replaced by the default code.
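To see why unmapped spaces turn into U+FFFD, here is a toy model in C of a byte-to-Unicode table. It is only a sketch of the behavior described above, not TECkit's implementation; the function names (table_init, table_map_identity, table_convert) are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu  /* default replacement character, U+FFFD */

/* Toy model only: every entry starts out unmapped, which in a
   byte-to-Unicode mapping means "becomes the replacement character". */
void table_init(uint32_t table[256]) {
    for (size_t i = 0; i < 256; i++) table[i] = REPLACEMENT;
}

/* Map the bytes first..last (inclusive) to the same Unicode values. */
void table_map_identity(uint32_t table[256], uint8_t first, uint8_t last) {
    assert(first <= last);
    for (unsigned b = first; b <= last; b++) table[b] = b;
}

/* Convert n bytes into out[0..n-1]; returns how many bytes were
   unmapped and therefore came out as the replacement character. */
size_t table_convert(const uint32_t table[256], const uint8_t *in,
                     size_t n, uint32_t *out) {
    size_t replaced = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = table[in[i]];
        if (out[i] == REPLACEMENT) replaced++;
    }
    return replaced;
}
```

If only 'a' to 'z' are mapped, converting "a b" replaces the space with U+FFFD. Adding entries for 0x20 and the control codes 0x09 to 0x0D makes spaces, tabs and line ends survive.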

Question: The compiler reports "code space mismatch"; what does that mean?

Answer: This (rather cryptic) error message means that the mapping description includes multiple passes, and the output of one pass is not in the same "code space" (either bytes or Unicode) as the input of the next pass.

The compiler reports this error when it reaches the end of the second of the incompatible passes (which may be the very end of the file); the actual problem lies at the beginning of that second pass, where it is chained with the preceding one.

One subtle way this error can arise is if you intend to have a single pass in the mapping, and use an explicit pass(Byte_Unicode) statement, but accidentally place some part of the mapping content before the pass statement. Any Class definitions or mapping rules found before any pass statement will implicitly begin a Byte/Unicode mapping pass. (This is a legacy of the original, single-pass TECkit system.) When your explicit pass(Byte_Unicode) statement is read, this begins a second pass, and you can't chain two Byte/Unicode passes: the Unicode output of the first can't become Byte input to the second.
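The chaining rule itself is simple enough to sketch in C. This is a hypothetical model, not the compiler's actual code: the CodeSpace and Pass types and the first_mismatch function are invented here to show what "code space mismatch" is checking.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SPACE_BYTES, SPACE_UNICODE } CodeSpace;

/* Hypothetical description of one pass: what it reads and what it writes.
   A Byte/Unicode pass would be { SPACE_BYTES, SPACE_UNICODE }. */
typedef struct {
    CodeSpace in;
    CodeSpace out;
} Pass;

/* Returns the index of the first pass whose input code space differs
   from the output code space of the pass before it, or -1 if the whole
   chain is consistent. */
int first_mismatch(const Pass *passes, size_t n) {
    assert(passes != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (passes[i - 1].out != passes[i].in) return (int)i;
    }
    return -1;
}
```

In this model, the accidental case described above is a chain of two Byte/Unicode passes: the implicit one started by the stray rules, then the explicit one. The check reports a mismatch at the second pass.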

For application developers

Question: How big an output buffer should I use when calling Convert or Flush?

Answer: In general, you can't be sure; mappings are not necessarily one-to-one. Unless the input is ridiculously large, it's probably best to allocate a buffer that would allow for a 50% or even 100% increase in the number of character codes; however, you must still be prepared for the possibility that the engine will return kStatus_OutputBufferFull. If this happens, either enlarge your buffer or clear out the output that has been generated so far (write it out, send it to the next process, or whatever is appropriate) so that you can restart at the beginning of your buffer.

If you can't afford such a generous buffer, you can use a smaller one and expect to do more looping. But your buffer must be at least big enough for the engine to perform a complete unit of conversion work, and this may result in a sequence of characters being output, not just a single code.
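As a rough illustration of the "clear out and restart" loop, here is a sketch in C. It does not call TECkit: mock_convert, MockStatus and convert_all are invented stand-ins with a simplified parameter list, and MOCK_OUTPUT_FULL merely plays the role of kStatus_OutputBufferFull. Check TECkit_Engine.h for the real function signatures. The final flush step is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { MOCK_OK, MOCK_OUTPUT_FULL } MockStatus;

/* Stand-in for the engine: every input byte becomes two output bytes.
   It only consumes whole input codes, and reports how much input it
   used and how much output it wrote. */
MockStatus mock_convert(const unsigned char *in, size_t in_len, size_t *in_used,
                        unsigned char *out, size_t out_cap, size_t *out_used) {
    *in_used = 0;
    *out_used = 0;
    while (*in_used < in_len) {
        if (out_cap - *out_used < 2) return MOCK_OUTPUT_FULL;
        out[(*out_used)++] = in[*in_used];
        out[(*out_used)++] = in[*in_used];
        (*in_used)++;
    }
    return MOCK_OK;
}

/* Convert all of `in` through a deliberately small work buffer, draining
   it into `sink` each time the engine reports that it is full. Returns
   the total bytes produced, or (size_t)-1 if `sink` is too small. */
size_t convert_all(const unsigned char *in, size_t in_len,
                   unsigned char *sink, size_t sink_cap) {
    unsigned char buf[8];
    size_t total = 0;
    for (;;) {
        size_t used = 0, wrote = 0;
        MockStatus st = mock_convert(in, in_len, &used, buf, sizeof buf, &wrote);
        /* Clear out what has been generated so far, then reuse buf. */
        if (total + wrote > sink_cap) return (size_t)-1;
        memcpy(sink + total, buf, wrote);
        total += wrote;
        in += used;
        in_len -= used;
        if (st == MOCK_OK) return total;
        /* "Full" with no progress at all would mean the buffer cannot
           hold even one unit of conversion work. */
        assert(used > 0 || wrote > 0);
    }
}
```

The 8-byte buffer is far smaller than anything you would use in practice; it is there only to force several trips around the loop.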

Question: Why do I get kStatus_OutputBufferFull, when it isn't?

Answer: When you call Convert or Flush, the TECkit engine does not necessarily use all the space in your output buffer. It may return kStatus_OutputBufferFull even though there is some space remaining.

There are two reasons for this. First, the engine never puts a partial Unicode character into the output buffer. A single Unicode character may require up to 4 bytes, depending on the encoding form and the particular character, so if fewer than 4 bytes are available, the engine may report that the buffer is full because the next character it wants to write won't fit in its entirety.

Second (and this applies even when mapping to bytes), the engine does not like to return with an input code partially processed. And processing a single input code may result in multiple characters of output, either because the input code itself maps to a sequence or because it provides the context needed to determine the mapping of preceding codes, which the engine has buffered because their mappings depended on following context.

So the engine may report kStatus_OutputBufferFull even when a considerable number of bytes remain unused. In extreme cases, unlikely in real-life mappings, this could be several hundred bytes, but cases where a dozen or more bytes are needed in the output buffer to process a single input code definitely occur. This status code always means that you need to create more output buffer space, either by enlargement or by clearing previous output, even if your buffer was not completely full.
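The first reason (no partial characters) can be illustrated with a small UTF-8 writer in C. This is a sketch under assumptions, not the engine's code: utf8_len and put_utf8_whole are invented names, UTF-8 is just one of the possible encoding forms, and surrogate code points are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for one Unicode code point. */
size_t utf8_len(uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Writes whole characters only. Returns how many code points were
   written and stores the bytes used in *out_used. A result smaller
   than n means the buffer counts as "full" for the next character,
   even though a few bytes may remain unused. */
size_t put_utf8_whole(const uint32_t *cps, size_t n,
                      unsigned char *out, size_t out_cap, size_t *out_used) {
    static const unsigned char lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t i = 0, pos = 0;
    for (; i < n; i++) {
        uint32_t cp = cps[i];
        size_t len = utf8_len(cp);
        assert(cp <= 0x10FFFF);
        if (out_cap - pos < len) break;  /* would be partial: stop here */
        if (len == 1) {
            out[pos] = (unsigned char)cp;
        } else {
            /* Lead byte carries the top bits, then 6 bits per trail byte. */
            out[pos] = (unsigned char)(lead[len] | (cp >> (6 * (len - 1))));
            for (size_t k = 1; k < len; k++) {
                out[pos + k] = (unsigned char)
                    (0x80 | ((cp >> (6 * (len - 1 - k))) & 0x3F));
            }
        }
        pos += len;
    }
    *out_used = pos;
    return i;
}
```

With a 6-byte buffer and the code points U+0041, U+20AC and U+1F600, the first two characters take 4 bytes and the third needs 4 more, so the writer stops with 2 bytes still free. That is the situation in which the real engine would report kStatus_OutputBufferFull.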

Question: How can I detect if there were characters TECkit couldn't map?

Answer: By default, the TECkit engine maps all input characters to something in the output; characters for which no explicit mapping was given in the table will result in the "default replacement character". (This is 0x3F ASCII '?' by default when mapping to bytes, and U+FFFD REPLACEMENT CHARACTER by default when mapping to Unicode, but the mapping table author can change these values.)

Beginning with TECkit version 2.1, released 29 March 2004, the engine has new conversion APIs (the TECkit_ConvertBufferOpt and TECkit_FlushOpt functions; see the TECkit_Engine.h header file). These allow the client application to control the behavior when unmappable input is encountered. The options are:

  • Silently use the replacement character, as in previous versions of the engine.
  • Use the replacement character, but return a warning status to the calling application.
  • Stop converting and return an error code to the calling application.
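The three behaviors listed above can be modeled with a toy converter in C. Everything here is hypothetical (UnmappedMode, Result and toy_convert are made-up names, and "mapped" simply means printable ASCII); the real option and status constants are the ones declared in TECkit_Engine.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented stand-ins for the three options and their outcomes. */
typedef enum { UNMAPPED_SILENT, UNMAPPED_WARN, UNMAPPED_STOP } UnmappedMode;
typedef enum { RESULT_OK, RESULT_USED_REPLACEMENT, RESULT_UNMAPPED } Result;

#define REPLACEMENT 0xFFFDu

/* Toy byte-to-Unicode conversion in which only 0x20..0x7E are mapped.
   *converted receives the number of input bytes actually processed. */
Result toy_convert(const uint8_t *in, size_t n, uint32_t *out,
                   size_t *converted, UnmappedMode mode) {
    Result result = RESULT_OK;
    size_t i;
    assert(in != NULL && out != NULL && converted != NULL);
    for (i = 0; i < n; i++) {
        int mapped = in[i] >= 0x20 && in[i] <= 0x7E;
        if (!mapped && mode == UNMAPPED_STOP) {
            result = RESULT_UNMAPPED;  /* stop before the bad input */
            break;
        }
        if (!mapped && mode == UNMAPPED_WARN) {
            result = RESULT_USED_REPLACEMENT;  /* keep going, but remember */
        }
        out[i] = mapped ? in[i] : REPLACEMENT;
    }
    *converted = i;
    return result;
}
```

An application that must not lose data silently would pick the warning or the error behavior and then decide what to do: log the problem, ask the user, or abandon the conversion.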



Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.

"Michael Ichter", Thu, Mar 8, 2018 11:23 (EST)
Mapping Documentation

Is there documentation concerning the syntax for mapping (e.g. what classes there are and what operators exist)? I'm using TECkit for script conversion. The tutorial on this website is nice, but I'm thinking a reference resource might be useful.

martinpk, Thu, Mar 8, 2018 11:59 (EST) [modified by martinpk on Thu, Mar 8, 2018 12:00 (EST)]
Re: Mapping Documentation

Hi Michael, the TECkit package download contains TECkit_Language.pdf in the Documentation folder, covering the syntax of mapping files. Let us know if that doesn't meet the need. Thanks!

"Bob Hallissy", Thu, Mar 8, 2018 12:01 (EST)
Re: Mapping Documentation

In addition to being packaged with TECkit, the documentation is available on the source repo, for example at https://github.com/silnrsi/teckit/blob/master/docs/TECkit_Language.pdf

"Michael Ichter", Wed, Mar 14, 2018 13:59 (EDT)
Re: Mapping Documentation

Thanks, Martin and Bob! Those are exactly what I was looking for.



© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI). Read our Privacy Policy. Contact us here.