WSTech: Writing Systems Technology (formerly known as NRSI)
Encoding Conversion Frequently Asked Questions and Known Issues
Frequently Asked Questions
Question: When I open a mapping file with the TECkit Unicode Mapping Editor I get an error message "unexpected character: '?' at line 1." What can I do to get rid of this message?
Answer: Check whether you have another (different) version of TECkit_x86.dll and/or TECkit_Compiler_x86.dll located somewhere in the path ahead of the version installed by SILConverters (i.e. C:\Program Files\Common Files\SIL). If so, this interferes with SILConverters finding the one it installed; the last time we encountered this, it was an older version of the TECkit DLL, which prevents SILConverters clients from using the right version.
SILConverters v 3.1 has the latest versions of the TECkit DLLs, and the location where it is installed is also added to the path. So even if you delete the (older?) copies in C:\Windows\System32 or the LinguaLinks folder, the programs that installed them should continue to work (i.e. the interface has not changed, so they should still work fine).
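If you want to see every copy of these DLLs that could be found via the PATH, a small Python sketch can list them (this is just an illustration, not part of SILConverters; the function name is ours):

```python
import os

def find_on_path(name, path=None):
    """Return the full path of every copy of the named file found in the
    directories of PATH, e.g. a stray TECkit_x86.dll shadowing the one
    SILConverters installed."""
    dirs = (path if path is not None else os.environ.get("PATH", "")).split(os.pathsep)
    return [os.path.join(d, name)
            for d in dirs
            if os.path.isfile(os.path.join(d, name))]

# Example usage (on the machine in question):
# for hit in find_on_path("TECkit_x86.dll"):
#     print(hit)
# for hit in find_on_path("TECkit_Compiler_x86.dll"):
#     print(hit)
```

Any copy listed ahead of the SILConverters install directory is a candidate for the conflict described above.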
Question: I am faced with a zillion files that need converting to Unicode. How do I prioritize my work?
Answer: You could divide your files into different types. Take into account the importance of the content related to your particular goals, the likelihood that the material would need to be re-used, etc.
This is how one person prioritized his work:
The documents of value, in order of priority, would be:
Do not bother converting ANY published or unpublished Scripture data unless it falls under #1 above. Scripture data is constantly changing and if it is more than a month old in an active project, it is already outdated. Find out where the most recent copy of the data is, and if it is safe and in standard format, that is worth more than 20 Word or Publisher files.
Question: I am using SILConverters 4.0 and would like to know how I can manually add a TECkit converter that does not have an installer.
Answer: These steps are for those who have an existing TECkit converter. They should also work, with reasonable modifications, for other types of converters supported by SILConverters 4.0.
Question: I know that when converting legacy text to Unicode (with TECkit) I can request either decomposed (NFD) or composed (NFC) result. But if the text is already Unicode, how do I convert it to NFD or NFC?
Answer: For text documents, you can do this with txtconv.exe (from the TECkit package); just run it with no mapping file, specifying a Unicode encoding form for both input and output, and give it the normalization option you want.
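TECkit aside, any Unicode-aware library can perform this normalization step. As an illustration (not part of the TECkit package; the function names are ours), Python's standard unicodedata module exposes both forms:

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Return text in the requested Unicode normalization form:
    "NFC" for composed, "NFD" for decomposed."""
    return unicodedata.normalize(form, text)

def normalize_file(src, dst, form="NFC"):
    """Rewrite a UTF-8 text file in the given normalization form."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(normalize_text(text, form))

# e + combining acute (U+0065 U+0301) composes to U+00E9, and back:
composed = normalize_text("e\u0301", "NFC")
decomposed = normalize_text("\u00e9", "NFD")
```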
With EncConverters installed (including the ICU Plug-in), you could do the following steps in Word:
Then you can use the main dialog interface to convert your word document according to your need using either of these two converters.
You can also write Word macros (after following the above steps) to do the conversion, but the details depend on the version of EncConverters you are using (the 1.5 version uses a slightly different interface than the 2.0 version; the 1.5 version should work, but the actual syntax of the VB code will differ from the following). In the Visual Basic editor, open Tools, References, check the box that looks like "EncCnvtrs" or "EncConverters", then add the macros:
    ' Add a reference to the SILConverters type lib (i.e. in the VBA editor
    ' (Alt+F11), click 'Tools', 'References' and then browse for the file
    ' "C:\Program Files\Common Files\SIL\<version>\SilEncConverters22.tlb")

    Sub Decompose()
        Selection.Text = GetEncConverter("NFD").Convert(Selection.Text)
    End Sub

    Sub Compose()
        Selection.Text = GetEncConverter("NFC").Convert(Selection.Text)
    End Sub

Where the "GetEncConverter" function is defined as follows:

    Function GetEncConverters() As EncConverters
        Set GetEncConverters = CreateObject("SilEncConverters22.EncConverters")
    End Function

    Function GetEncConverter(sName As String) As IEncConverter
        Set GetEncConverter = GetEncConverters().Item(sName)
    End Function
Question: Is there any advantage in generating the BOM?
Answer: The Byte Order Mark (U+FEFF) was invented back in the days when Unicode was thought of as a 16-bit standard. The problem was that some systems wanted to store Unicode little endian, with the least significant byte of a pair coming first, and some wanted to store it big endian, with the most significant byte coming first. In order to resolve this, the UTC allocated two codes: U+FEFF as a zero-width no-break space (which basically has no effect at the start of a file), and U+FFFE (the byte reversal of U+FEFF) as an unassigned and therefore illegal code. Therefore, to determine the byte order of a stream of data, a processor can examine the initial BOM: if it encounters U+FFFE, this indicates that the order of the bytes needs to be reversed (which will produce the valid code U+FEFF); if it finds U+FEFF then no change is needed.
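This detection logic is only a comparison of the first two bytes; a Python sketch (an illustration only; the function name is ours):

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Inspect the first bytes of a UTF-16 stream for a BOM and report
    the byte order it implies."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff' = U+FEFF big endian
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe' = U+FEFF little endian
        return "little-endian"
    return "unknown (no BOM)"
```

Reading the bytes FF FE as big endian would yield the illegal U+FFFE, which is exactly the signal that the byte order must be reversed.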
With the advent of the various Unicode Transformation Formats (UTFs), the BOM concept was extended to UTF-32. UTF-8, on the other hand, does not strictly need an initial BOM because there are no byte order issues to deal with. But U+FEFF as encoded in the various UTFs does end up being a byte sequence that is very unlikely to occur in almost any other encoding. As such, U+FEFF can be used not only as a byte-order mark but also to indicate that the data in question is in fact Unicode text. Microsoft, for example, uses this extensively in its data.
There is a problem with using the BOM when storing data in UTF-8. Many programs that were not designed to be Unicode-aware work very well with UTF-8 encoded data: all codepoints up to 0x7F take their standard ASCII meanings, and all bytes from 0x80 up are considered just to be odd characters with no real interpretation required. Thus processors of certain kinds of ASCII text, such as programming source code, configuration files, etc., which assume that any significant characters will be in the range 0x00-0x7F, can work fine with UTF-8, since all characters in the higher range are just treated as unprocessed data to be handled by some other process. But if a file starts with a BOM (U+FEFF = 0xEF 0xBB 0xBF in UTF-8), those bytes are all 'upper ASCII', not ASCII whitespace that can be ignored. This can lead to various types of failure (errors, warnings, crashes). Increasingly, as such applications are updated, they are designed to ignore any initial BOM, but care should be taken and users should be aware that an initial BOM may cause problems in certain situations. Unfortunately, the chief culprits in creating such BOMs in the first place are the least able to remove them.
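Stripping a leading UTF-8 BOM is a one-comparison job; a Python sketch (the function name is ours):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (0xEF 0xBB 0xBF) if present;
    otherwise return the data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```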
Another problem arises when files are blindly concatenated. U+FEFF is actually a ZERO WIDTH NO-BREAK SPACE, and as such has specific semantics when it occurs in the middle of a file. But if a file with a BOM is simply concatenated onto another file, then what was a BOM may suddenly occur in the middle of a file and take on those semantics. For this reason, the UTC created the U+2060 WORD JOINER character to take over the no-break semantics of U+FEFF. But it is only safe to strip U+FEFF from within a text if one is sure that the data conforms to Unicode 5.0 or later. (This means that ZERO WIDTH NO-BREAK SPACE is now very poorly named, and unfortunately will remain so, since Unicode character names can never be changed.)
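The concatenation hazard is easy to demonstrate; in Python, for instance:

```python
import codecs

# Two well-formed UTF-8 files, each beginning with a BOM.
part1 = codecs.BOM_UTF8 + "First file\n".encode("utf-8")
part2 = codecs.BOM_UTF8 + "Second file\n".encode("utf-8")

# Blind concatenation leaves the second BOM in the middle of the text,
# where it reads as a ZERO WIDTH NO-BREAK SPACE, not as a signature.
merged = (part1 + part2).decode("utf-8")
assert merged.count("\ufeff") == 2
assert merged[0] == "\ufeff" and "\ufeff" in merged[1:]
```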
So what to do? Should one always insert a BOM even in UTF-8 text so that applications can identify the encoding? Or should one always remove them because they might cause problems? In the case of UTF-16 and UTF-32 data, the answer is obvious: a BOM is essential to give byte ordering information. But in the case of UTF-8 data the situation is more ambiguous.
BOMs should be inserted if they are needed and removed if they cause problems. Applications should be written to ignore a BOM everywhere (if they are Unicode-aware), and especially file-initially (even if the application is not Unicode-aware). Files may or may not include a BOM, and no strong efforts should be made either way, unless the data is being used in an application that requires either the presence or absence of a BOM. In other words, it does not matter what you do unless it matters, in which case do the right thing! As applications become BOM-insensitive, it matters less and less whether UTF-8 data has a BOM or not.
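As one concrete example of this BOM-insensitive behavior, Python's standard 'utf-8-sig' codec decodes UTF-8 and silently drops a leading BOM if one is present, so the same code reads both kinds of files:

```python
# 'utf-8-sig' means "UTF-8, with an optional signature": a leading BOM
# is consumed if present and the rest decodes as ordinary UTF-8.
with_bom = b'\xef\xbb\xbfhello'
without_bom = b'hello'
assert with_bom.decode('utf-8-sig') == 'hello'
assert without_bom.decode('utf-8-sig') == 'hello'
```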
Question: When I try to convert my file using the SILConverters 4.0 package some of the characters are not being converted. If I select the text that was not converted correctly and run the Unicode Word Macro “Show Unicode” I see that Word thinks every one of these is U+0028 LEFT PARENTHESIS. What is happening?
Answer: Use the Bulk Word document converter; it does convert these occurrences correctly. See here for instructions.
Question: I noticed in conversion that the IPA in the footnotes did not convert. And when I tried to do the conversion piece-by-piece, even stranger things happened: I selected one SILIPA93 character in a footnote, tried the Data Conversion macro, and the font fields came up blank. Any ideas what else I can do?
Answer: Use the Bulk Word document converter. See here for instructions.
Question: How do I use the Bulk Word Document Converter?
Question: How do I use SILConverters to convert a Microsoft Publisher document to Unicode?
Answer: Hopefully you installed "SILConverters for Office" when you installed SIL Converters.
IPA data conversion
Question: I have MS Word documents which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?
Answer: As you have implied, documents which were created (encoded) with legacy fonts are not compatible with Unicode fonts. You not only need to use a Unicode font, you will need to convert your data to Unicode.
Since you have MS Word documents it might be a straightforward problem. These instructions should work if you have MS Word 2003:
SIL Converters Word Macro
We hope your data will have converted correctly! These instructions are for converting your data in one document. Once you have tried it for one document and understand the concepts, you might want to use the Bulk Word Document Converter to do all your documents in one go. At this point we do not have step-by-step instructions, but basically you select your documents, then choose the converter and the font to apply.
If you do not have MS Word 2003 and want to try an older version of SIL Converters 2.5, you can try: SILConverters — Obsolete version. However, this version does not always work correctly and we are not providing any support for this product.
Converting data to Unicode is not always straightforward and we do not have all the answers.
Question: I have plain text files which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?
Question: I have Standard Format Marker (sfm) text files and want to convert only one or two of the sf markers from an SIL IPA93 encoding to Unicode. Can I do that?
Answer: the Bulk SFM Converter will easily handle doing this.
SIL Converters 3.0 (and any earlier version) will not work with Vista x64. Version 3.1 (to be released) will work with Vista x64.
2009-05-15 LP: added Known Issues