WSTech: Writing Systems Technology (formerly known as NRSI)
Unicode Issues in Word 97/2000/2002 – FAQ
Question: Under Word 97/Win95, embedded SIL IPA characters are saved as ‘?’ marks when a file is saved in any .txt format, i.e., their individual identity is lost. Is there any known workaround for this?
Answer: In order to understand the nature of the problem you’re having, you need to understand something about font encoding, Unicode, and codepages. For all the details, read on.
This problem is due to the fact that, unlike previous versions of Word, Word 97 and Word 2000 store the text in a file using 16-bit Unicode values rather than the 8-bit values (e.g., ASCII, ANSI) used in the past. Windows 95/98 is still partially an 8-bit operating system, and the fonts used to display the characters are 8-bit fonts, so Word 97/2000 converts the Unicode values stored in the file to 8-bit values to display and print the characters.
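As a rough illustration of the difference between 16-bit and 8-bit storage, the Python sketch below encodes the same short string both ways. It assumes UTF-16 little-endian as a stand-in for "16-bit Unicode values" and Windows codepage 1252 as a stand-in for the 8-bit ANSI encoding; the actual on-disk layout of a Word file is more involved and is not modelled here.

```python
# Illustration only: the same text as 16-bit Unicode code units
# and as 8-bit codepage values.
text = "caf\u00e9"  # "café"

# Assumption: UTF-16 little-endian stands in for "16-bit Unicode values".
as_unicode = text.encode("utf-16-le")

# Assumption: codepage 1252 stands in for the 8-bit "ANSI" encoding.
as_ansi = text.encode("cp1252")

print(len(as_unicode))  # two bytes per character
print(len(as_ansi))     # one byte per character
print(list(as_unicode))
print(list(as_ansi))
```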
This conversion is controlled by the codepage in use. Text data is always stored in a computer as a sequence of numbers. Each number represents a character, and in Windows, the character represented by each number is defined by a codepage. (Versions of Windows for different languages around the world use different codepages. It is also possible to install several codepages in Windows, with each codepage defining a different character set. This happens if you install the Multilingual Extensions, for example.)
Word 97/2000 keeps track of what font a given run of text is formatted with, and it also knows what codepage to associate with a given font. Whenever text is converted from one format (e.g., Word 97/2000) to another (e.g. plain text), the codepage associated with the text is used to do the conversion.
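The role of the codepage can be sketched in a few lines of Python. The example uses Python's built-in `cp1252` (Western European) and `cp1251` (Cyrillic) codecs as stand-ins for Windows codepages; the helper names `decode_byte` and `to_codepage` are hypothetical and are not part of Word or Windows.

```python
# Illustration only: the same 8-bit value means different characters
# under different codepages, and a character with no slot in the
# target codepage is replaced on conversion.

def decode_byte(value: int, codepage: str) -> str:
    """Interpret one 8-bit value according to the given codepage."""
    return bytes([value]).decode(codepage)

def to_codepage(text: str, codepage: str) -> bytes:
    """Convert Unicode text to 8-bit values; unmappable characters become '?'."""
    return text.encode(codepage, errors="replace")

# The value 0xE9 is a different character in each codepage.
print(decode_byte(0xE9, "cp1252"))  # Latin small letter e with acute
print(decode_byte(0xE9, "cp1251"))  # Cyrillic small letter short i

# U+00E9 exists in cp1252 but has no equivalent in cp1251.
print(to_codepage("\u00e9", "cp1252"))
print(to_codepage("\u00e9", "cp1251"))
```

The last line shows the general mechanism behind the question marks: when the codepage used for the conversion has no entry for a character, a substitute is written instead.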
To further complicate the picture, there are two different ways to encode 8-bit fonts: as normal text fonts, called UGL, or as symbol fonts. Most fonts containing alphabetic characters (e.g., Times New Roman, Arial) are encoded as UGL fonts. Fonts containing symbols (e.g., Wingdings) are typically encoded as symbol fonts.
Word 97/2000 uses two different translation schemes between Unicode values and 8-bit values, depending on whether the font used for the text in question is a UGL font or a symbol font. If the font is a UGL font, Word 97/2000 converts the characters between the 8-bit values and the standard Unicode values defined by the active codepage. No such standard conversion exists for symbols, however, so if the font is a symbol font, Word 97/2000 converts the characters to a different set of Unicode values in what is called the “Private Use Area” (PUA) of Unicode. The PUA values thus assigned are meaningful only in relation to the symbol font; they have no meaning for a UGL font. When the text is exported as plain text, the font information is lost, the PUA values have no equivalent in the codepage, and Word 97/2000 converts each of them to a question mark.
The SIL IPA and IPA93 fonts are encoded as symbol fonts, and thus fall prey to the problem described above.
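The symbol-font behaviour described above can be imitated in Python. This is a sketch under stated assumptions, not a description of Word's internals: it assumes that each 8-bit value in a symbol font is stored as that value plus 0xF000 (the range U+F020 to U+F0FF commonly described for Windows symbol fonts; the FAQ itself does not give the exact values), and it uses `cp1252` as the export codepage. The function names are hypothetical.

```python
# Illustration only: why symbol-font text turns into question marks
# when exported as plain text.

# Assumption: symbol-font values are shifted into the Private Use Area
# by this offset (giving U+F020..U+F0FF for values 0x20..0xFF).
PUA_OFFSET = 0xF000

def symbol_byte_to_unicode(value: int) -> str:
    """Map one 8-bit symbol-font value to a Private Use Area character."""
    return chr(PUA_OFFSET + value)

def export_plain_text(text: str, codepage: str = "cp1252") -> bytes:
    """Convert stored Unicode text to 8-bit plain text via a codepage."""
    return text.encode(codepage, errors="replace")

# Three 8-bit values typed in a (hypothetical) symbol-encoded font.
typed = bytes([0x61, 0x62, 0x63])
stored = "".join(symbol_byte_to_unicode(b) for b in typed)

print([hex(ord(ch)) for ch in stored])  # PUA code points
print(export_plain_text(stored))        # no codepage entry, so '?' for each
print(export_plain_text("abc"))         # ordinary text survives
```

Because the codepage has no entries for Private Use Area code points, all three characters collapse to the same substitute, which is the loss of "individual identity" described in the question.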
We know of two solutions at the moment:
If you have understood all of the above explanation and feel comfortable editing your Registry, a modified codepage is available. Also, if you would like more information on this topic, read the white paper, “Unicode Issues in Microsoft Word 97 and Word 2000,” which is in Acrobat PDF format.
Question: What are the Unicode issues in Word 97 and Word 2000?
Answer: With the introduction of Unicode support in Microsoft Office 97, some users began to encounter unexpected behaviour with fonts and their data, such as text displaying as boxes. This paper looks at several such issues, discussing the causes of the problems people encounter and suggesting some fixes. It also considers the direction in which the software industry has been heading with regard to Unicode and argues that users should adopt a Unicode strategy to avoid recurrences of the kind of problems discussed here.