NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/FontFAQ_UnicodeWord

Unicode Issues in Word 97/2000/2002 – FAQ

NRSI staff

Question: Under Word 97 on Windows 95, embedded SIL IPA characters are saved as question marks (‘?’) when a file is saved in any .txt format; that is, their individual identity is lost. Is there a known workaround for this?

Answer: In order to understand the nature of the problem you’re having, you need to understand something about font encoding, Unicode, and codepages. For all the details, read on.

This problem is due to the fact that, unlike previous versions of Word, Word 97 and Word 2000 store the text in a file using 16-bit Unicode values rather than the 8-bit values (e.g., ASCII, ANSI) used in the past. Windows 95/98 is still partially an 8-bit operating system, and the fonts used to display the characters are 8-bit fonts, so Word 97/2000 converts the Unicode values stored in the file to 8-bit values to display and print the characters.
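As a rough illustration of the difference between the two kinds of storage, the sketch below encodes one character both ways. It uses Python's built-in `utf-16-le` and `cp1252` codecs as stand-ins for Word's internal storage and for the Windows codepage; the helper names are hypothetical and are not part of Word or Windows.

```python
# A minimal sketch: the same character as a 16-bit Unicode value
# and as an 8-bit codepage value. Helper names are illustrative only.

def as_unicode_storage(text: str) -> bytes:
    """Store each character as a 16-bit value (little-endian UTF-16)."""
    return text.encode("utf-16-le")

def as_eight_bit(text: str, codepage: str = "cp1252") -> bytes:
    """Convert the characters to 8-bit values using a codepage."""
    return text.encode(codepage)

euro = "\u20ac"  # EURO SIGN, Unicode value U+20AC

stored = as_unicode_storage(euro)   # two bytes: 0xAC 0x20
displayed = as_eight_bit(euro)      # one byte: 0x80 in codepage 1252

print(stored.hex(), displayed.hex())
```

The Unicode value (U+20AC) and the 8-bit value (0x80) are different numbers for the same character, which is why a conversion step is needed every time the text is displayed, printed, or exported.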

This conversion is controlled by the codepage in use. Text data is always stored in a computer as a sequence of numbers. Each number represents a character, and in Windows, the character represented by each number is defined by a codepage. (Versions of Windows for different languages around the world use different codepages. It is also possible to install several codepages in Windows, with each codepage defining a different character set. This happens if you install the Multilingual Extensions, for example.)
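To make the role of the codepage concrete, the following sketch decodes one and the same 8-bit value under two different codepages. Python's `cp1252` (Western European) and `cp1251` (Cyrillic) codecs are used here purely as examples; the function name is hypothetical.

```python
# A minimal sketch: one stored number, two codepages, two characters.

def character_for(value: int, codepage: str) -> str:
    """Return the character that a codepage assigns to an 8-bit value."""
    return bytes([value]).decode(codepage)

value = 0xE9

western = character_for(value, "cp1252")   # LATIN SMALL LETTER E WITH ACUTE
cyrillic = character_for(value, "cp1251")  # CYRILLIC SMALL LETTER SHORT I

print(f"U+{ord(western):04X}", f"U+{ord(cyrillic):04X}")
```

The number stored in the file is identical in both cases; only the codepage decides which character it stands for.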

Word 97/2000 keeps track of what font a given run of text is formatted with, and it also knows what codepage to associate with a given font. Whenever text is converted from one format (e.g., Word 97/2000) to another (e.g. plain text), the codepage associated with the text is used to do the conversion.

To further complicate the picture, there are two different ways to encode 8-bit fonts: as normal text fonts, called UGL, or as symbol fonts. Most fonts containing alphabetic characters (e.g., Times New Roman, Arial) are encoded as UGL fonts. Fonts containing symbols (e.g., Wingdings) are typically encoded as symbol fonts.

Word 97/2000 uses two different translation schemes between Unicode values and 8-bit values, depending on whether the font used for the text in question is a UGL font or a symbol font. If the font is a UGL font, Word 97/2000 converts the characters between the standard 8-bit and Unicode values defined by the active codepage. No such standard conversion exists for symbols, however, so if the font is a symbol font, Word 97/2000 converts the characters to a different set of Unicode values in what is called the “Private Use Area” (PUA) of Unicode. The PUA values thus assigned relate to the symbol font, but have no meaning for a UGL font. When the text is exported as plain text, the PUA values are meaningless, and Word 97/2000 converts them to question marks.
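The two translation schemes, and the resulting question marks, can be sketched as follows. The sketch assumes that a symbol-font character is placed in the PUA at U+F000 plus its 8-bit value, which is the usual Windows convention for symbol fonts but is not stated in this FAQ. All function names are hypothetical, and Python's `cp1252` codec stands in for the active codepage.

```python
# A minimal sketch of the two schemes, under the assumptions stated above.

SYMBOL_PUA_BASE = 0xF000  # assumed offset for symbol-font characters

def ugl_to_unicode(value: int, codepage: str = "cp1252") -> str:
    """UGL font: use the standard mapping defined by the codepage."""
    return bytes([value]).decode(codepage)

def symbol_to_unicode(value: int) -> str:
    """Symbol font: no standard mapping, so use a Private Use Area value."""
    return chr(SYMBOL_PUA_BASE + value)

def export_plain_text(text: str, codepage: str = "cp1252") -> bytes:
    """Export to 8-bit text; characters the codepage lacks become '?'."""
    return text.encode(codepage, errors="replace")

value = 0x41  # the same 8-bit value typed in two different fonts

print(export_plain_text(ugl_to_unicode(value)))     # survives the export
print(export_plain_text(symbol_to_unicode(value)))  # replaced by '?'
```

Because codepage 1252 has no entry for any PUA value, every symbol-font character collapses to the same question mark on export, and the original 8-bit value cannot be recovered from the exported file.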

The SIL IPA and IPA93 fonts are encoded as symbol fonts, and thus fall prey to the problem described above.

We know of two solutions at the moment:

  • Use an earlier version of Word. One work-around is to save the file in WinWord 2 format. That will force the characters to be stored with 8-bit values rather than Unicode values, and should help things work more smoothly.
  • Change the codepage installed in Windows. The default codepage for the US version of Win95/98/Me is 1252. We have a modified version of codepage 1252 available that maps PUA characters to 8-bit values. Installing this modified codepage requires copying the codepage file to your Windows directory, and editing the Registry to access it.
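The idea behind the second solution can be sketched in a few lines. This is not the modified codepage file itself, whose contents are not described here; it is an illustrative function that behaves like codepage 1252 extended with a PUA mapping. It assumes the PUA values are U+F020 through U+F0FF and that each maps back to the 8-bit value obtained by subtracting 0xF000. The function name is hypothetical.

```python
# A minimal sketch of a codepage-1252-like conversion that also maps
# symbol-font PUA values back to 8-bit values (assumed range and offset).

PUA_FIRST, PUA_LAST = 0xF020, 0xF0FF  # assumed symbol-font PUA range
PUA_BASE = 0xF000                     # assumed offset

def encode_with_pua(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text to 8-bit values, keeping symbol-font characters intact."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if PUA_FIRST <= code <= PUA_LAST:
            # PUA character: recover the original 8-bit value.
            out.append(code - PUA_BASE)
        else:
            # Ordinary character: use the standard codepage mapping.
            out += ch.encode(codepage, errors="replace")
    return bytes(out)

mixed = "IPA: " + "\uf041\uf0e9"  # plain text followed by two PUA characters
print(encode_with_pua(mixed))
```

With a mapping of this kind, the exported bytes still need to be viewed in the original symbol font to look right, but each character keeps its own 8-bit value instead of becoming a question mark.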

If you have understood all of the above explanation and feel comfortable editing your Registry, click here for the modified codepage. Also, if you would like more information on this topic, read the white paper, “Unicode Issues in Microsoft Word 97 and Word 2000,” which is in Acrobat PDF format.

Question: What are the Unicode issues in Word 97 and Word 2000?

Answer: With the introduction of Unicode support in Microsoft Office 97, some users began to encounter unexpected behaviour with fonts and their data, such as text displaying as boxes. This paper looks at several such issues, discussing the causes of the problems people encounter and suggesting some fixes. It also considers the direction in which the software industry has been heading with regard to Unicode and argues that users should adopt a Unicode strategy to avoid recurrences of the kind of problems discussed here.

Unicode Issues in Word 97 and Word 2000
Peter Constable, 2000-10-20
Download "unicodeissuesinword97-2000.pdf", Acrobat PDF document, 532KB


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.