Home

Contact Us

General

Initiative B@bel

WSI Guidelines

Encoding

Principles

Unicode

Training

Tutorials

PUA

Conversion

Resources

Utilities

TECkit

Maps

Resources

Input

Principles

Utilities

Tutorials

Resources

Type Design

Principles

Design Tools

Formats

Resources

Font Downloads

Gentium

Doulos

IPA

Rendering

Principles

Technologies

OpenType

Graphite

Resources

Font FAQ

Links

Glossary


NRSI: Computers & Writing Systems

SIL HOME | SIL SOFTWARE | SUPPORT | DONATE

You are here: Encoding > Conversion
Short URL: http://scripts.sil.org/EncConvFAQ

Encoding Conversion Frequently Asked Questions and Known Issues

Lorna A. Priest, 2009-05-15

Contents of this page

  1. Frequently Asked Questions
    1. Q: When I open a mapping file with the TECkit Unicode Mapping Editor I get an error message "unexpected character: "?" at line 1." What can I do to get rid of this message?
    2. Q: I am faced with a zillion files that need converting to Unicode. How do I prioritize my work?
    3. Q: I am using SILConverters 4.0 and would like to know how I can manually add a TECkit converter that does not have an installer.
    4. Q: I know that when converting legacy text to Unicode (with TECkit) I can request either decomposed (NFD) or composed (NFC) result. But if the text is already Unicode, how do I convert it to NFD or NFC?
    5. Q: Is there any advantage in generating the BOM?
    6. Q: When I try to convert my file using the SILConverters 4.0 package some of the characters are not being converted. If I select the text that was not converted correctly and run the Unicode Word Macro “Show Unicode” I see that Word thinks every one of these is U+0028  LEFT PARENTHESIS. What is happening?
    7. Q: I noticed in conversion that the IPA in the footnotes did not convert. And when I tried to do the conversion piece-by-piece funnier things happened. I selected one SILIPA93 character in a footnote, tried the Data Conversion macro, and the font fields came up blank. Any ideas what else I can do?
    8. Q: How do I use the Bulk Word Document Converter?
    9. Q: How do I use SILConverters to convert a Microsoft Publisher document to Unicode?
  2. IPA data conversion
    1. Q: I have MS Word documents which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?
    2. Q: I have plain text files which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?
    3. Q: I have Standard Format Marker (sfm) text files and want to convert only one or two of the sf markers from an SIL IPA93 encoding to Unicode. Can I do that?
  3. Known Issues
  4. Page History

Frequently Asked Questions

Question: When I open a mapping file with the TECkit Unicode Mapping Editor I get an error message "unexpected character: "?" at line 1." What can I do to get rid of this message?

Answer: Check whether you have another (different) version of the TECkit_x86.dll and/or TECkit_Compiler_x86.dll located somewhere in the path ahead of the version installed by SILConverters (i.e. C:Program FilesCommon FilesSIL). If so, this interferes with SILConverters finding the one it installed and the last time I encountered this, it was an older version of the TECkit DLL which prevents SILConverters clients from using the right version.

SILConverters v 3.1 has the latest versions of the TECkit DLLs and the location it is installed is also added to the path, so even if you deleted the (older?) ones in C:WindowsSystem32 or lingualinks folder, the programs that installed them should continue to work (i.e. the interface has not changed, so they should still work fine).

Question: I am faced with a zillion files that need converting to Unicode. How do I prioritize my work?

Answer: You could divide your files into different types. Take into account the importance of the content related to your particular goals, the likelihood that the material would need to be re-used, etc.

This is how one person prioritized his work:

The documents of value, in order of priority would be1:

  • Scripture portions for abandoned projects.
  • Published documents other than Scripture (which may need to be reprinted)
  • Natural language texts: tales, stories, interviews.
  • Dictionary data
  • Unpublished documents such as primers, health guides, etc. Value on these is low because it is unlikely that any new team will pick up on them and complete them anyhow.

Do not bother converting ANY published or unpublished Scripture data unless it falls under #1 above. Scripture data is constantly changing and if it is more than a month old in an active project, it is already outdated. Find out where the most recent copy of the data is, and if it is safe and in standard format, that is worth more than 20 Word or Publisher files.

Question: I am using SILConverters 4.0 and would like to know how I can manually add a TECkit converter that does not have an installer.

Answer: These steps are for those who have an existing TECkit converter. They should also work, with reasonable modifications, for other types of converters supported by SILConverters 4.0.

  • Download and unzip your TECkit converter (some mapping files are found here: Mapping Files
  • Either run the “Data Conversion” macro
    • (Tools / Data Conversion or in Word 2007 Add-Ins / Data Conversion)
    • Click the  Select  button to add your converter.
    • Click  Add New 
  • Or run the Clipboard EncConverter (which then doesn't require having/starting Word and the complications of loading the document template)
    • Start / All Programs / SIL Converters / Clipboard EncConverter
    • When it shows up in the system tray, right-click on its icon and choose Add Converter.
  • Select “TECkit map” and click on  Add 
  • Select the Setup tab
  • Select the  ...  button and browse to the folder where you ut your TECkit converter.
  • Select your converter and click  Open .
  • Select  Save in System Repository 
  • Where it says to “Enter Converter Name” type in a name that will help you remember what the converter will do. Usually people choose a name that expresses both sides of the equation, for instance SILIPA93<>Unicode.
  • Click  Advanced  and ensure that appropriate radio buttons are selected. Usually the default if OK.
  • Click  OK ,  OK .  OK  and  OK 

Question: I know that when converting legacy text to Unicode (with TECkit) I can request either decomposed (NFD) or composed (NFC) result. But if the text is already Unicode, how do I convert it to NFD or NFC?

Answer: For text documents, you can do this with txtconv.exe (from the TECkit package); just run it with no mapping file, specifying a Unicode encoding form for both input and output, and give it the normalization option you want.

OR

With EncConverters installed (including the ICU Plug-in), you could do the following steps in Word:

  • Run the “Data Conversion” macro
  • Click the  Select  button to add a converter.
  • Click  Add New 
  • Select “ICU transliterator” and click on  Add 
  • Select the Setup tab
  • At the “Built-in transliterator” radio button, select Any to NFD
  • Select  Save in System Repository 
  • Where it says to “Enter Converter Name” type: Any to NFD
  • Click  Advanced  and ensure that “ICU Transliteration” is selected.
  • Click  OK ,  OK .  OK  and  OK 
  • Repeat these steps to add a “NFC” converter.

Then you can use the main dialog interface to convert your word document according to your need using either of these two converters.

You can also write Word macros (after following the above steps) to do the conversion, but it depends on the version of the EncConverters you are using (the 1.5 version uses a slightly different interface than the 2.0 version). The 1.5 version should work, but the actual syntax of the VB code will be different from the following. In the Visual Basic editor, you have to do Tools / References, and check the box that looks like “EncCnvtrs” or “EncConverters”, then add the macros like the following:

; Add a reference to the SILConverters type lib (i.e. in the VBA editor (Alt+F11), 
; click 'Tools', 'References' and then browse for file 
; "C:Program FilesCommon FilesSIL<version>SilEncConverters22.tlb")
 
Sub Decompose()
    Selection.Text = GetEncConverter("NFD").Convert(Selection.Text)
End Sub
 
Sub Compose()
    Selection.Text = GetEncConverter("NFC").Convert(Selection.Text)
End Sub
 
Where the "GetEncConverter" function is defined as follows:
Function GetEncConverters() As EncConverters
    Set GetEncConverters = CreateObject("SilEncConverters22.EncConverters")
End Function
 
Function GetEncConverter(sName As String) As IEncConverter
    Set GetEncConverter = GetEncConverters().Item(sName)
End Function

Question: Is there any advantage in generating the BOM?

Answer: The Byte Order Mark (U+FEFF) was invented back in the days when Unicode was thought of as a 16-bit standard. The problem was that some systems wanted to store Unicode little endian with the least significant byte first, and some wanted to store it big endian with the most significant byte in a pair coming first. In order to resolve this, the UTC allocated two codes: U+FEFF as a zero-width non-breaking space (which basically has no effect at the start of the file), and U+FFFE (the byte reversal of U+FEFF) as an unassigned and therefore illegal code. Therefore, to determine the byte order of a stream of data, a processor can examine the inital BOM: if it encounters U+FFFE, this indicates that the order of the bytes needs to be reversed (which will produce the valid code U+FEFF); if it finds U+FEFF then no change is needed.

With the advent of the various Unicode Transfer Formats (UTFs), the BOM concept was extended to UTF-32. UTF-8, on the other hand, does not strictly need an initial BOM because there are no byte order issues to deal with. But U+FEFF as encoded in the various UTFs does end up being a sequence that is very unlikely to occur in almost any other encoding. As such, U+FEFF can be used not only as a byte-order mark but also to indicate that the data in question is in fact Unicode text. Microsoft, for example, uses this extensively in its data.

There is a problem with using the BOM when storing data in UTF-8. Many programs that were not designed to be Unicode-aware work very well with UTF-8 encoded data. All codepoints up to x7F take their standard ASCII meanings, and all codes above x80 are considered just to be odd characters with no real interpretation required. Thus processors of certain kinds of ASCII text, such as programming source code, configuration files, etc., which assume that any significant characters will be in the range 0x00-0x7F, can work fine with UTF-8, since all characters in the higher range are just treated as unprocessed data that are handled by some other process. But if a file starts with a BOM ((U+FEFF = xEF xBB xBF in UTF-8), those codes are all 'upper ASCII', not ASCII spaces that can be ignored. This can lead to various types of failure (errors, warnings, crashes). Increasingly, as such applications are updated, they are designed to ignore any initial BOM, but care should be taken and users should be aware that an initial BOM may cause problems in certain situations. Unfortunately, the chief culprits in creating such BOMs in the first place are the least able to remove it.

Another problem arises when files are blindly concatenated. U+FEFF is actually a ZERO WIDTH NO-BREAK SPACE and as such has specific semantics when it occurs in the middle of a file. But if a file with a BOM is simply concatenated to another file, then what was a BOM may suddenly occur in the middle of a file and have a change in semantics. For this reason, the UTC created the U+2060 WORD JOINER character to take over the nobreak space semantics of U+FEFF. But it is only safe to strip U+FEFF from within a text if one is sure that the data is conforming to Unicode 5.0 or later. (This means that the ZERO WIDTH NO-BREAK SPACE is now very poorly named, and unfortunately will remain so, since Unicode character names can never be changed.)

So what to do? Should one always insert a BOM even in UTF-8 text so that applications can identify the encoding? Or should one always remove them because they might cause problems? In the case of UTF-16 and UTF-32 data, the answer is obvious: a BOM is essential to give byte ordering information. But in the case of UTF-8 data the situation is more ambiguous.

BOMs should be inserted if they are needed and removed if they cause problems. Applications should be written to ignore BOM everywhere (if they are Unicode-aware) and especially file-initially (even if the application is not Unicode-aware). Files may or may not include a BOM and no strong efforts should be made either way, unless that data is being used in an application that requires either the presence or absence of a BOM. In other words, it does not matter what you do unless it matters, in which case do the right thing! As applications become BOM insensitive it matters less and less whether UTF-8 data has a BOM or not.

Question: When I try to convert my file using the SILConverters 4.0 package some of the characters are not being converted. If I select the text that was not converted correctly and run the Unicode Word Macro “Show Unicode” I see that Word thinks every one of these is U+0028  LEFT PARENTHESIS. What is happening?

Answer: Use the Bulk Word document converter. It does convert Insert / Symbol occurrences. See here for instructions.

Question: I noticed in conversion that the IPA in the footnotes did not convert. And when I tried to do the conversion piece-by-piece funnier things happened. I selected one SILIPA93 character in a footnote, tried the Data Conversion macro, and the font fields came up blank. Any ideas what else I can do?

Answer: Use the Bulk Word document converter. See here for instructions.

Question: How do I use the Bulk Word Document Converter?

Answer:

  • Make sure Microsoft Word is closed
  • Go to Start / All Programs / SIL Converters / Bulk Word Document Converter
  • Click on File / Open and navigate to where your Word documents are.
  • You may select one file or quite a few files.
  • A dialog box will appear that shows you all the fonts used in those documents


  • For each custom-encoded font you should choose an appropriate converter (click on "Select a converter" and choose the converter)
    • If you wish to use the same converter for the next font selection, you can just RIGHT-click your mouse button on "Select a converter" and it will automatically apply the last converter you selected. Otherwise, follow the same procedure as above to choose a different converter.
  • For other fonts (like Times New Roman), you should not need to choose a converter
  • For each custom-encoded font you should choose a Unicode font for "Apply font" (click on "Apply Font" and choose an appropriate font)
    • If you wish to use the same font for the next font selection, you can just RIGHT-click your mouse button on "Select a font to apply" and it will automatically apply the last font you selected. Otherwise, follow the same procedure as above to choose a different font.


  • Notice the "Example Data" window and the "Example Results" window. If the correct converter and correct font is applied, the results should look almost identical (at least they should be the same characters, if you chose a very different font, it could look different)
  • Then click on the little floppy disks to convert the files and save
    • You can choose to rename the converted files, or you could keep the same name if you put them in a different directory. It is probably better to give them a different name until you've confirmed the conversion was done properly.
  • You may wish to save your settings file. This will save your information tieing custom-encoded fonts with an appropriate TECkit mapping and the Unicode font you wish to use. This will allow you to reuse these settings for another project.
    • Select Converter Mappings and click on Save and Save.

Question: How do I use SILConverters to convert a Microsoft Publisher document to Unicode?

Answer: Hopefully you installed "SILConverters for Office" when you installed SIL Converters.

  • Open your document in Publisher
  • You will see a new toolbar that looks like this:


  • If you want to convert your whole document you should click on  Convert Whole Document . Otherwise choose  Convert Selected Story  or  Convert Selection 
  • Once you have clicked on one of the buttons, you will see a dialog box called "Choose fonts to convert". This will list all the fonts in your document


  • For the first font you wish to run the converter on, click on "Select a converter". If you've already installed your converter, you can just click on your converter


  • Then click on  OK 
  • Next, click on "Select a font to apply" and choose the font you wish to use.


  • If you wish to use the same converter for the next font selection, you can just RIGHT-click your mouse button on "Select a converter" and it will automatically apply the last converter you selected. Otherwise, follow the same procedure as above to choose a different converter.
  • If you wish to use the same font for the next font selection, you can just RIGHT-click your mouse button on "Select a font to apply" and it will automatically apply the last font you selected. Otherwise, follow the same procedure as above to choose a different font.
  • SIL Converters will take you to a dialog box which will show you how it will convert the first text. If it looks okay, you can choose  Replace All  and continue to choose that until you are finished converting your document.
  • You will find that different fonts have a different amount of linespacing and your "stories" may not always fit in the same box you had them in. You can either adjust the linespacing for your story or adjust the size of the box.

IPA data conversion

Question: I have MS Word documents which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?

Answer: As you have implied, documents which were created (encoded) with legacy fonts are not compatible with Unicode fonts. You not only need to use a Unicode font, you will need to convert your data to Unicode.

Since you have MS Word documents it might be a straightforward problem. These instructions should work if you have MS Word 2003:

  • Download and install Doulos SIL — Download or Charis SIL — Download.
  • Download and install SILConverters 4.0 (follow the installation instructions on that page).
    • When you get to the SIL Converters Install Options dialog box:
      • Under Conversion Maps and Tables:
        • Make sure SIL IPA93<>UNICODE is selected. If you plan to convert SIL-IPA-1990 data to Unicode you should also make sure SIL-IPA-1990<>UNICODE is selected.
        • Click  Apply 
      • Continue on with the installation.
  • If multiple users on the machine want to use the document template (found in: (C:Documents and Settings<user>Application DataMicrosoftTemplates), you need to manually move the .DOT files to some common location and each user will need to browse for them individually in Tools / Templates and Add-Ins. If you want one or more of these document templates to start up automatically when Word starts, move them either to the current user’s Startup folder (i.e. C:Documents and Settings<user>Application DataMicrosoftWordSTARTUP). For all users, put it in the global startup folder (e.g. C:Program FilesMicrosoft OfficeOFFICE11STARTUP).
  • Open your document in MS Word
  • If you did not load the template in the manner specified above, you should follow these steps to activate the Conversion Macro Template:
    • Click Tools / Templates and Add-ins...
    • Click the  Add  button to browse for the Data Conversion Macro dot file (in, C:Documents and SettingsyourloginnameApplication DataMicrosoftTemplates by default).
    • Click  OK  to return to Word.
  • Go to Tools / Data Conversion or in Word 2007 Add-Ins / Data Conversion

SIL Converters Word Macro



  • Under Conversion table details, click on  Select... 
    • Under Select Converter, select SIL IPA93<>UNICODE
    • Select  OK 
  • If you have footnotes, choose Include footnotes
  • Under Scope of change, select A specific regular font:
    • In the pull-down menu choose SILDoulos IPA93
  • Under Target Data choose Apply specific font:
  • Select  OK 
  • Be patient. With some documents the conversion process can take awhile. Do not assume that Word, or your computer, has hung.
  • Follow these same instructions for converting your IPA-1990 (SILDoulosIPA font) data. However, instead of selecting the SIL IPA93<>UNICODE converter, you should select the SIL-IPA-1990<>UNICODE converter.

We hope your data will have converted correctly! These instructions are for converting your data in one document. Once you have tried it for one document and understand the concepts, you might want to use the Bulk Word Document Converter. (Start / Programs / SIL Converters / Bulk Word Document Converter) to do all your documents in one go. At this point we do not have step-by-step instructions, but basically you select your documents, choose your fonts, converter and font to apply.

If you do not have MS Word 2003 and want to try an older version of SIL Converters 2.5, you can try: SILConverters — Obsolete version. However, this version does not always work correctly and we are not providing any support for this product.

Disclaimer:

Converting data to Unicode is not always straightforward and we do not have all the answers.

Question: I have plain text files which use your SIL IPA93 fonts, but I want to begin using Unicode. Can you tell me how to convert my data to Unicode?

Answer:

  • If you have not already installed SILConverters 4.0, follow the instructions above.
  • Click on Start / Programs / SIL Converters / TECkit / DropTEC
  • In Windows Explorer, navigate to C:Documents and SettingsAll UsersApplication DataSILSILConverters30MapsTables
  • From Windows Explorer, drag the compiled mapping file silipa93.tec to the box at the top labeled “Mapping file:”. (Alternatively, you can use the adjacent  Browse  button and navigate to the file.)
  • From Explorer, drag your text file to the box on the left labeled “Legacy text file:” (alternatively, you can use the adjacent  Browse  button and navigate to the file). When you have loaded the file, DropTEC prompts you for a destination file. Click  Save  and accept the default name.
  • Once you supply the name, the file is created, so if you want to select a different “Unicode output form”, you'll need to do it before specifying the input file (and giving the name for the output file).
  • Click on File / Exit of the “DropTEC” window to close the program.
  • You should have a new Unicode-encoded text file.

Question: I have Standard Format Marker (sfm) text files and want to convert only one or two of the sf markers from an SIL IPA93 encoding to Unicode. Can I do that?

Answer: the Bulk SFM Converter will easily handle doing this.

  • If you have not already installed SILConverters 4.0, follow the instructions above.
  • Click on Start / Programs / SIL Converters / Bulk SFM Converter
  • Click on File / Open SFM Documents / Non-Unicode (Legacy)
  • Select your files and click on  Open 
  • For each SFM which uses the IPA93 fonts, click on Click here to define a converter
  • Select  SIL IPA93<>UNICODE  (note that you may select a different converter for each SFM — if needed)
  • Click on  OK 
  • If you wish to preview whether the converter will work or not, you may select appropriate fonts
    • For each SFM which uses the IPA93 fonts
      • Right-click on Example Data and choose SILDoulos IPA93
      • Right-click on Example Results and choose Doulos SIL (or another appropriate Unicode font)
  • If the example results look right, move on to the next step
  • Click on File / Convert and Save SFM Documents / Unicode (UTF-8)
  • Verify that the file name and location are appropriate and click on  Save 

Known Issues

SIL Converters 3.0 (and any version less) will not work with Vista x64. Version 3.1 (to be released) will work with Vista x64.

Page History

2009-05-15 LP: added Known Issues
2009-02-17 LP: added error message for TECkit mapping editor
2007-07-05 LP: updated answers for ver. 2.6
2007-05-21 LP: added info for Word 2007
2007-04-20 LP: added BOM question
2006-03-15 LP: added priority question
2006-03-14 LP: added OO solution to Insert/Symbol problem2006-02-22 JW: added Publisher conversion2005-10-06 LP: added IPA93 conversion question


1 We are grateful to Doug Higby for this list of priorities.

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.