WSTech: Writing Systems Technology (formerly known as NRSI)
NRSI Update #13 – June 2000
In this issue:
WorldPad: WordPad For the Rest of Us
by Sharon Correll
One of the applications scheduled to be included with the first FieldWorks release is a basic text editor called WorldPad. WorldPad will feature basic word-processing capabilities, but will also include the ability to use Graphite as its renderer for complex scripts.
Many of WorldPad’s features are standard aspects of the FieldWorks text-editing framework, such as adjusting paragraph indentation and alignment, creating a simple stylesheet, handling multilingual text, and creating keyboard short-cuts. There is also the possibility of including some features that will be useful for those working with Graphite fonts, such as the ability to simultaneously view data using a Graphite renderer and a “dumb” renderer.
Graphite Project Update
by Sharon Correll
The Graphite (previously WinRend) project continues to make progress toward our goal of providing complex rendering for those working with non-Roman and complex Roman scripts. The engine is complete in terms of its ability to perform complex transformations and is undergoing initial testing by several field members.
A demo of the Graphite renderer was given at the Language Software Board meetings in May. Several of the board members commented on the fact that Graphite will be useful not only for non-Roman scripts, but for more general needs in SIL such as IPA. As an example, Graphite’s smart rendering capabilities eliminate the need for multiple versions of a diacritic depending on the visual characteristics of the character it is placed over, thus reducing the range of characters needed to represent the underlying data. Using Graphite to render IPA will result in a representation of phonological data that is not only simpler but also more linguistically consistent.
We believe we are on target for a more general field test during the months of July - September, 2000. Members of the testing team are expected to be knowledgeable in a complex writing system that might be usefully implemented in Graphite. Responsibilities include: learning the Graphite Description Language, using it to implement an interesting writing system, testing and debugging the program, and providing feedback on the language, compiler, behavior, and performance. A commitment of about a person-month is needed. If you would like to be involved in this test effort, please contact me at .
Life after W2K?
by Martin Hosken
Y2K is past, and with it we face a new hurdle: W2K. Windows 2000 is the new flagship operating system from Microsoft which is based on both Windows 9x and Windows NT. At its core it is Windows NT, but it has added the usability aspects of Windows 9x. The result is a monumental 5 million lines of code with a bug database of 65000 entries on the release date. This was widely reported in the press as a recipe for disaster. But with Microsoft betting its future profits and survival on Windows 2000, Microsoft cannot allow Windows 2000 to fail, and that means that it has to make it work. The wisdom, therefore, was that the big corporations were going to move slowly on switching to the new OS and wait for a few updates to come out and for the major bugs to be identified by others and fixed before making any real decisions.
But there are some very significant technological improvements that come with Windows 2000 that are of direct interest to us. The primary technology of interest to us is the multilingual support, based around Unicode, and the ability to do direct Unicode keying. Windows 2000 comes with a number of standard locales allowing one to type in Hindi, Thai, Chinese, Arabic, etc.—all within Notepad just by changing the keyboard. This has come about because Windows 2000 is the first OS from Microsoft which has one global binary: there is only one version of Windows 2000 for the whole world. Yes, there are language kits to localise user interfaces, etc. but at its core there is only one version. (A multi-language product will be appearing that includes all of the localised user interface resources, though it will initially be available only to large corporations.)
Thus, for those working in more than one national script supported by Windows 2000, this new OS could be of great benefit just out of the box, with no additional work. But is it too buggy to use?
After a long battle to do a full backup of my machine onto CD-ROM, I began, with some trepidation, to install Windows 2000.
The first step was to ensure that my machine had the recommended 128 MB of memory and the 1-2 GB of spare hard disk space necessary for the installation. I then ran a “check to see if my hardware can run Windows 2000” program from Microsoft, which reported where I might have difficulties. While my machine was not perfect, I wasn’t going to lose anything too significant.
Then I checked for some upgraded drivers for my Toshiba laptop from the Toshiba site. This involved a BIOS upgrade and some drivers to allow me to access my hardware from within Windows 2000.
OK. Now I was ready to insert the CD. The software installation went very smoothly, and after keeping my disk as FAT32, I found that I could boot either Windows 98 or Windows 2000. I could then install the other goodies I had downloaded from Toshiba: hardware and power management support and video drivers.
Since then I have been using Windows 2000 very happily. The main problems I have found are:
And what side-effect gains have I made?
No software is perfect and Windows 2000 is not perfect. But it seems to me as good as Windows 98, and with its far better multilingual features it seems to be worth the effort. My conclusion, therefore, is to encourage people, when they upgrade their machine (usually by buying a new laptop), to make that new machine Windows 2000 capable and pay the extra to go Windows 2000. This is especially true for people working in countries for which there is a localised version of Windows 9x already on the market. The single binary of Windows 2000 allows them to have the universality of a US version with the facility of their national version in the same box. You will also know that any solutions you provide in your version of Windows 2000 will also work for any national using a version of Windows 2000 they bought locally.
But I would not say that there is a burning need for people to rush out and upgrade their machines, especially if they are relatively happy with the status quo. I can imagine various users switching pretty rapidly, especially with support for Devanagari (Hindi), Arabic, Chinese, Thai, Modern Greek, Hebrew and various Eastern European locales being provided out of the box. Another class of user I would encourage to upgrade is members from countries whose languages are written in non-Roman scripts: for example, Korean, Japanese, Chinese, etc. Now they have the ability to use their mother tongue as well as English in an environment effectively identical to that of their colleagues.
Editor’s Note: We had intended to include several articles from varying perspectives on this issue, but find that the latest “Notes on Computing (January-April 2000)” has two good articles on the subject. All three of these articles come at the issue from a different perspective. We hope they help you in your decision!
I will include a few items from the “Computer Support Technical Bulletin #2.”
“We recommend that users NOT install Windows 2000 at this time unless they are already using NT or will be using it in a networking environment with adequate administrative support.”
And from “A First Look: Windows 2000 is Here”:
“This review is for your information only and does not constitute an endorsement or promise of support by CCS.”
MS Windows ME available soon
by Peter Constable
Microsoft Windows ME (Millennium Edition) has just progressed to the manufacturing stage and will be available in retail stores in the US on September 14. This product is the next update in the Windows 9x product line, and the primary changes are targeted at home users. We should expect to see this update beginning to replace Windows 98 SE on new systems during the fall.
We have not yet had a chance to examine multilingual or script-related capabilities of this new version of Windows, and so don’t yet know if it offers any benefits over Windows 98 or Windows 98 SE. (Note, however, that it is not likely to provide as high a level of multilingual and multi-script support as Windows 2000, which is the latest product in the Windows NT line.)
Font Issues in MS Windows & Office
by Peter Constable
This is a quick overview of some potential problems users might encounter with fonts in MS Windows and MS applications such as those in the Office suite. The issues described here are particularly relevant for custom-encoded fonts, though the “font-linking” issue can also affect Unicode-conformant fonts for certain scripts. Many of the issues have been discussed more fully elsewhere [1, 2], so this is a summary review and update. I will try to keep technical details to a minimum. For further details on most of the issues mentioned here, see the two articles cited in the references.
MS Windows and applications designed to run in that environment implement certain character encoding standards, such as the so-called “ANSI” character set (also known as “Windows Latin 1”, “Western” or MS codepage 1252; hereafter, I will refer to it as the Western character set, or as codepage 1252). For years, however, we have been accustomed to working with custom-encoded fonts - fonts that have special characters (Roman or non-Roman) not defined in the character sets supported by MS Windows. In effect, we have been creating hacked fonts that violate the standard encoding. In the past, this usually wasn’t a problem: the software we worked with didn’t do much for which following the standard was critically important. As long as we avoided a few critical codepoints, the worst problem we might typically encounter was having to set some line breaks manually.
In recent years, Microsoft has made some changes pertaining to character encoding that have begun to affect us. The biggest of these has been the adoption of Unicode as the character encoding used within their business applications. A second change was the addition of three new characters to the Western character set. These and related changes have in many cases broken the custom-encoded fonts we have been using.
We might be inclined to get upset with Microsoft for making these changes and causing us problems. In fact, we’re responsible for most of the problems: by breaking the standards, we’ve been taking risks. We got away with it for several years, but now that software is starting to enforce the character encoding standards more rigorously, we’re getting caught. We need to change our ways and start following the standards. Fortunately, the advent of Unicode and related technologies (particularly, “smart” fonts) has meant that standards are at last beginning to meet our needs.
A brief note on Unicode: Unicode is a single character set designed to support (eventually) every character used in every script throughout the world. Unicode is based on 16 bits rather than 8 bits, and so software has to be designed specially to support it. Unicode first appeared on our desktops with the introduction of TrueType fonts in Windows 3.1, but was used only internally to TrueType fonts; most of Windows 3.x used 8-bit text. Since Windows NT first appeared, Unicode has been the native encoding on that line of operating systems. Some support for Unicode was introduced in Windows 95, enough so that true Unicode applications could be built to run on that platform. The first Unicode application most of us were exposed to was MS Word 97. Unicode is quickly becoming the industry standard, and it is the standard that all future SIL language software will be built on.
As mentioned, TrueType fonts have always used Unicode internally, and Unicode is based on 16 bits. On the other hand, applications such as Word 95 and all existing SIL software have worked with 8-bit text. To display text using a TrueType font, a translation from 8-bit encodings to Unicode has been necessary. This is handled by Windows using a codepage file. So, for example, codepage 1252 maps the 8-bit codepoint d225 to the Unicode character U+00E1, LATIN SMALL LETTER A WITH ACUTE. Because TrueType fonts use Unicode internally, when someone creates a custom-encoded font, they are violating both the character set defined in codepage 1252 and the character definitions in Unicode. So, for example, in a font for which the 8-bit codepoint d225 is to display as a Hebrew patah diacritic, the glyph for that Hebrew character is assigned to U+00E1, going against Unicode.
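The codepage translation described above can be illustrated with a minimal Python sketch. It relies on Python's built-in cp1252 codec as a stand-in for the Windows codepage file; the function name is ours, chosen only for illustration.

```python
import unicodedata


def codepage_to_unicode(byte_value: int, codepage: str = "cp1252") -> str:
    """Return the Unicode code point (as 'U+XXXX') that a codepage
    assigns to a single 8-bit value."""
    char = bytes([byte_value]).decode(codepage)
    return f"U+{ord(char):04X}"


# d225 in codepage 1252 is the example used in the text.
print(codepage_to_unicode(225))
print(unicodedata.name(bytes([225]).decode("cp1252")))
```

A custom-encoded font that shows a Hebrew patah for d225 still receives U+00E1 from this translation, which is why the glyph ends up attached to the wrong Unicode character.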
One final background note: often custom-encoded fonts are created as “symbol” fonts. These have some special properties, the most important of which is that internally the characters are assigned to the Unicode range U+F020 to U+F0FF. This is part of a special area in Unicode known as the Private Use Area (PUA), which is a set of codepoints permanently set aside for custom end-user definitions.
The euro problem
The problem that appears to have affected most people relates to changes that were made to support the euro currency symbol. In the spring of 1998, Microsoft revised various codepages, including the codepage for the Western character set (codepage 1252), to add support for the euro. In the case of codepage 1252, two other additional characters used for some European languages - upper and lower case z with caron (hacek) - were also added. This was first introduced as a patch to Windows 95 (W95Euro.exe) and as part of service pack 4 for Windows NT 4, and it became a standard part of all subsequent versions of Windows. The crucial effect of this update to codepage 1252 was to change the 8-bit-to-Unicode mapping for three codepoints, d128, d142 and d158. Custom-encoded fonts created before this change may have had glyphs at the old Unicode values, but would not have glyphs at the new Unicode values. The net effect is that when these fonts are used on a system that has the euro update, codepoints d128, d142 and d158 will most likely appear as empty boxes rather than the desired glyphs. This problem affects all applications.
There are a couple of workarounds for this, but the preferred fix is to modify the font so that both the old and new Unicode values present the same glyph. This way, the font works as desired regardless of whether euro support is installed or not. This modification can be done using Typecaster 3 or Martin Hosken’s EuroFix Perl script. (See  for further details on how to make this modification.)
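The idea behind the preferred fix can be sketched in Python. This is not how Typecaster or the EuroFix script is implemented; it only models a font's character-to-glyph table as a plain dictionary. It assumes that, before the update, the three codepoints were passed through to the same-numbered Unicode values (U+0080, U+008E, U+009E), and it uses Python's cp1252 codec, which reflects the post-euro mapping. All names are hypothetical.

```python
# The three 8-bit codepoints whose Unicode mapping changed with the euro update.
EURO_UPDATE_BYTES = (128, 142, 158)


def euro_code_point_pairs():
    """Return (old, new) Unicode values for each affected 8-bit codepoint."""
    pairs = []
    for byte_value in EURO_UPDATE_BYTES:
        old = byte_value  # assumption: pre-update mapping was the same-numbered code point
        new = ord(bytes([byte_value]).decode("cp1252"))  # post-update mapping
        pairs.append((old, new))
    return pairs


def add_euro_aliases(cmap):
    """Given a {code point: glyph name} table, make the new Unicode values
    point at the same glyphs as the old ones, without overwriting any
    entry the font already defines."""
    patched = dict(cmap)
    for old, new in euro_code_point_pairs():
        if old in patched:
            patched.setdefault(new, patched[old])
    return patched


custom_font = {0x80: "glyph_a", 0x8E: "glyph_b", 0x9E: "glyph_c"}
print(add_euro_aliases(custom_font))
```

With both sets of entries present, the same glyph is found whether or not the system has the euro update installed.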
Symbol fonts and text appearing as boxes
There is another situation in which text can appear as boxes. It arises when text that was formatted with a symbol font is re-formatted to use a normal font, e.g. when changing the font from SILDoulos IPA93 to Times New Roman. This problem is specific to Word 97 and later.
Recall that symbol fonts are encoded using a special Unicode range whose codepoints are permanently left undefined by the standard. What happens is that, as the user enters text while a symbol font is selected, the character codes from the keystrokes get converted to that special range of Unicode rather than being converted in accordance with the codepage that is in effect. So a keystroke that enters d225 would result in the Unicode character U+00E1 being added to the document if a normal font such as Times New Roman is selected, but would result in the character U+F0E1 if a symbol font such as SILDoulos IPA93 is selected.
When such text is reformatted to a non-symbol font, the character codes stored in the document are left unchanged. Since the non-symbol font most likely does not have glyphs associated with the PUA character codes, empty boxes appear. Because the glyphs in a symbol font can be anything, there is no way for software to know how to map the PUA character codes used for the symbol font to regular Unicode character codes. So, there isn’t anything else the application can really do in this situation.
A workaround is available that assumes that the PUA character codes used for symbols are to be mapped to those that would have resulted had a non-symbol font been used when the text was entered. See  for details.
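The assumption behind this workaround can be sketched as follows. This is a minimal illustration in Python, not the actual workaround; the function names are hypothetical, and the U+F020 to U+F0FF range and the d225 example come from the text above.

```python
SYMBOL_PUA_START = 0xF020
SYMBOL_PUA_END = 0xF0FF
PUA_OFFSET = 0xF000


def to_symbol_pua(byte_value: int) -> str:
    """Model what happens when an 8-bit code is typed with a symbol font selected."""
    if not 0x20 <= byte_value <= 0xFF:
        raise ValueError("symbol fonts use only 8-bit codes d32 to d255")
    return chr(PUA_OFFSET + byte_value)


def pua_to_codepage_char(ch: str, codepage: str = "cp1252") -> str:
    """Map a symbol-font PUA character to the character that the same
    keystroke would have produced with a non-symbol font selected."""
    cp = ord(ch)
    if SYMBOL_PUA_START <= cp <= SYMBOL_PUA_END:
        # Undefined codepage slots come back as U+FFFD rather than raising.
        return bytes([cp - PUA_OFFSET]).decode(codepage, errors="replace")
    return ch  # characters outside the symbol range are left alone


typed = to_symbol_pua(225)
print(f"U+{ord(typed):04X}", "->", f"U+{ord(pua_to_codepage_char(typed)):04X}")
```

The mapping is only meaningful if the replacement font uses the same custom encoding; for an arbitrary symbol font there is no way to know what the glyphs represent.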
There are a couple of other minor issues with symbol fonts. Because PUA character codes are used, and because an application has no way of knowing how they are defined in the given font, nothing is assumed. This means that features such as upper- or lower-case conversions and line breaking will not work. For example, in Word 97 and later, text that is formatted with a symbol font is treated as though every character (including spaces) is a word-building character, and so lines will not break unless the entire run will not fit on a line. A workaround for the line-breaking problem is also available (see  for details).
Exporting text formatted with a symbol font
There is a serious problem related to the use of symbol fonts in Word 97 and later that can result in data loss. If a document contains text that is formatted with a symbol font and the document is exported to an 8-bit text file, all of the characters that were formatted with a symbol font will be converted to question marks (character code d63). Again, recall that, in Word, text formatted with a symbol font is encoded using certain PUA characters. Since PUA codepoints have no standard definition, there is no codepage that corresponds to symbol fonts and the PUA codepoints that they use, and so no mapping from these PUA character codes to 8-bit codes is defined. There is no better choice that software could have made than to convert to question marks. (Note that this problem doesn’t occur when exporting to a Unicode-encoded text file.)
A workaround is available that assumes that the PUA character codes used for symbols are to be mapped to 8-bit codes that would have been entered in the text if Word weren’t using Unicode. See  for details. This workaround may not always be available in the future, however; e.g. it will likely not be available in Windows 2000. In the long term, the best option will be to discontinue using symbol fonts for language data.
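Both the data loss and the assumption behind the workaround can be demonstrated with a short Python sketch. It does not reproduce Word's export code; it uses Python's cp1252 codec with replacement of unmappable characters as an analogue, and the function names are hypothetical.

```python
def export_8bit(text: str, codepage: str = "cp1252") -> bytes:
    """Export to an 8-bit codepage, replacing unmappable characters with '?'."""
    return text.encode(codepage, errors="replace")


def export_8bit_with_pua_fold(text: str, codepage: str = "cp1252") -> bytes:
    """Export to 8-bit text, but first fold symbol-font PUA characters
    (U+F020 to U+F0FF) back to the 8-bit codes that were originally typed."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xF020 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)
        else:
            out += ch.encode(codepage, errors="replace")
    return bytes(out)


sample = "a\uf0e1"  # 'a' followed by d225 typed with a symbol font selected
print(export_8bit(sample))
print(export_8bit_with_pua_fold(sample))
```

The first function loses the symbol-font character entirely; the second recovers the original 8-bit code, which is all that a custom-encoded workflow needs.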
Export-to-text bugs in Word
There are some known bugs in Word 97 and Word 2000 when exporting a document to a text file: certain character codes get changed. For example, the fraction character ½ gets converted to a sequence of three characters, “1/2”. We have reported these to Microsoft, and all such export bugs are expected to be fixed in the next version of Word.
The font-linking problem
One of the consequences of using Unicode is that it is now much easier to provide true multilingual - and multi-script - support in applications. Nearly 50,000 different characters are defined in version 3 of Unicode, and documents created in programs like Word 2000 can potentially contain any number of them. This raises a potential concern, however: there are several problems with creating and using fonts that support the complete range of Unicode characters, and font developers have very little interest in doing so. But then, a user might receive a document or view a web page containing characters that are not supported in any of the fonts installed on their system. In that situation, text displayed with a default font would be illegible, e.g. as nothing but empty boxes. Companies like Microsoft and Apple consider that to be an unacceptable user experience, and are working to find solutions to this problem.
The solution Microsoft has adopted involves a technique they refer to as font linking. They distribute their software with a collection of fonts that together cover a significant portion of Unicode (all of those ranges of Unicode that they have chosen to support). The idea is that, if the user’s fonts don’t support all of the characters in a document, then this collection of fonts can be used to at least make the document legible. When the software gets some text to be displayed, it examines the characters in the text and tries to ascertain whether the font(s) that the user has chosen for displaying the text support those characters. It applies various heuristic tests for this purpose, and if these tests conclude that the selected font does not support the given characters, it links in one of the fonts distributed with the software to display the text.
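The general shape of font linking can be sketched in Python. Microsoft's actual heuristics are not described here, so this sketch substitutes a simple coverage lookup for them; the font names, the coverage table and the function name are all hypothetical.

```python
def pick_font(ch, chosen_font, coverage, linked_fonts):
    """Return the font used to display one character: the user's chosen
    font if it is judged to support the character, otherwise the first
    linked (stock) font that does, otherwise None (an empty box)."""
    if ord(ch) in coverage.get(chosen_font, set()):
        return chosen_font
    for font in linked_fonts:
        if ord(ch) in coverage.get(font, set()):
            return font
    return None


# Hypothetical coverage data: each font lists the code points it claims to support.
coverage = {
    "UserFont": set(range(0x20, 0x7F)),
    "StockArabic": set(range(0x600, 0x700)),
}

for ch in "a\u0628\u4e00":
    print(f"U+{ord(ch):04X}", pick_font(ch, "UserFont", coverage, ["StockArabic"]))
```

In the real software the first test is heuristic rather than an exact lookup, which is why a font can be replaced even when it does contain a glyph for the character.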
What this means is that text will not always be displayed with the font you want. It is a particular problem where custom-encoded fonts have been used. Because these are custom fonts, they are more likely to fail the heuristic tests for some portions of the text. Those portions will then be displayed using a stock font that does not have the custom character set, and the resulting text will not be completely legible. This is a feature that benefits 99.9% of Microsoft’s users, but it can be a show-stopper for someone using custom-encoded fonts. What makes this problem particularly serious is that the only workaround for it is to revert to earlier versions of software, e.g. to stop using Word 2000 and use Word 95 instead.
Font linking can also affect some Unicode-conformant fonts, specifically fonts for ranges in Unicode that were not part of the feature set for the given program. For example, in recent work on fonts for Yi and Ethiopic scripts using Unicode encoding, Word 2000 was substituting other fonts. These scripts are part of Unicode, but they were not specifically taken into consideration in the development of Word. In these cases, the fonts that Word was using in place of ours didn’t even support the characters (these scripts aren’t officially supported by Microsoft, so no fonts for them are provided), and so the text would display as empty boxes. (We found a workaround by tricking Word into thinking that these fonts support Japanese and Central European languages, but that creates other concerns.)
Font linking has been used in MS Office since Office 97, in Publisher at least since Publisher 98 (I think it was also used in Publisher 97), and in Internet Explorer at least since version 5.0. Thanks to influential contacts that we have in the MS Office development group, it will be possible to disable font linking in the next version of Office using a registry setting. Also, we know that testing will be done for all Unicode ranges in the next version of Office, and so the problem should not arise in the case of fonts for Unicode ranges such as Ethiopic and Yi that are not yet fully supported by Microsoft.
Starting with Windows 2000, however, font linking has been introduced into Windows itself. Since the feature has been developed independently for Windows, disabling it for Office may not be enough if Windows itself is making font substitutions. It would still be an issue as well for Internet Explorer running on any version of Windows. This problem will not likely go away entirely, though hopefully as Unicode support improves and as users adopt Unicode-based solutions, incidents will become infrequent.
Selecting fonts in Word 2000
Users may find that Word 2000 does not allow text to be formatted with certain fonts. This behaviour is a minor issue and is related to font linking. It primarily affects Unicode-encoded fonts that support ranges of Unicode that are not specifically supported in Office 2000.
The various scripts supported in Word are of a few different types, each having particular characteristics, and Word provides some specific behaviours for each type. Depending upon what language support is enabled (set in the separate MS Office Language Settings applet), the Format/Font dialog may show three different list boxes to select fonts, one each for Latin script fonts, Asian fonts, and complex (right-to-left) script fonts. Thus, text can actually be formatted with three different fonts simultaneously, though each applying to distinct character ranges.
Word uses information inside fonts to determine which category a font fits in, and compares this with the selected text. When selecting a font from the font list on the toolbar, if Word’s heuristic tests conclude that the font is for a different category than the selected text, it will not apply the change to the text. This may occur even if the font that was chosen does support the characters in the selected text. If you look in the font dialog, however, you will find that the font that was chosen has been applied for a different category of characters.
For example, the SIL Yi font supports Yi script but also the Latin characters of ASCII. Word categorises this font as an Asian font. If a string of English words is selected (i.e. the string contains only ASCII characters), we would expect to be able to apply that font to the text. Word will apply that font, but only for Asian characters. The English (Latin) characters will still be formatted with the original font. To succeed in actually formatting the Latin text with the selected font, it is necessary to open the font dialog, select the SIL Yi font as the Asian font, and set the Latin font setting to “(Use Asian text font)”.
This behaviour is not likely to be a serious problem for many users, and behaviour is likely to improve as Unicode support in Word matures with future versions.
A Word 2000 bug with right-to-left support enabled
Bob Hallissy just discovered an issue that can affect users who work with Arabic or Hebrew scripts and who also use custom fonts or characters from the upper half of the Western character set (codepage 1252). If Arabic or Hebrew support is enabled in Word 2000 (set using the Microsoft Office Language Settings applet) and you are working with Western (codepage 1252) settings (which would apply if using a keyboard layout for English and other Western European languages or if using Keyman), then Word will convert the 8-bit codes d157, d253 and d254 as it receives them from the keyboard according to the Arabic codepage (codepage 1256) rather than codepage 1252. The mappings to Unicode that should result with codepage 1252 and the actual mappings that result in this situation are as follows:
The implication of this is that a user with a custom font that is designed to use d157, d253 or d254 (i.e. has glyphs mapped from U+009D, U+00FD or U+00FE) will not see the glyphs that they expect (unless the custom font was designed to work on an Arabic system using codepage 1256). This problem can affect users even if they are not using custom fonts: if you need characters d253 “ý” or d254 “þ” from codepage 1252 (in codepage 1252, d157 is undefined), these cannot be entered into a document.
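The difference between the two codepages for these three codes can be checked with a small Python sketch. It uses Python's built-in cp1252 and cp1256 codecs and assumes that Word's behaviour matches the standard codepage 1256 table. Since d157 is undefined in codepage 1252, the sketch falls back to the same-numbered code point (U+009D), which is an assumption consistent with the values quoted above. The function names are hypothetical.

```python
def decode_byte(byte_value: int, codepage: str) -> str:
    """Return the code point (as 'U+XXXX') a codepage assigns to one 8-bit value."""
    try:
        ch = bytes([byte_value]).decode(codepage)
    except UnicodeDecodeError:
        # Assumption: an undefined slot passes through as the same-numbered code point.
        ch = chr(byte_value)
    return f"U+{ord(ch):04X}"


def compare_codepages(byte_values=(157, 253, 254)):
    """Map each 8-bit code to its (codepage 1252, codepage 1256) code points."""
    return {b: (decode_byte(b, "cp1252"), decode_byte(b, "cp1256")) for b in byte_values}


for byte_value, (western, arabic) in compare_codepages().items():
    print(f"d{byte_value}: cp1252 {western}, cp1256 {arabic}")
```

In codepage 1256 all three codes are invisible formatting characters (zero width non-joiner, left-to-right mark and right-to-left mark), which explains why the expected glyphs do not appear.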
The erroneous mapping appears to happen regardless of how the characters are input from the keyboard. In testing, it happened both when using the United States 101 keyboard layout and entering the characters with Alt-key combinations, and also when using the United States - International keyboard layout, which defines keystrokes for d253 and d254 (d157 is undefined in codepage 1252). On the other hand, the error appears not to happen when codepage 1252 is not active. In testing, it did not happen when using keyboard layouts for Polish or Vietnamese, which activate codepages 1250 and 1258 respectively. Unfortunately, that does not necessarily help users: codepages other than codepage 1252 will not provide the mappings to Unicode that are usually assumed by custom fonts. Also, for those who require the “ý” or “þ” characters in codepage 1252, using another codepage will not necessarily provide the combination of characters that they need.
This problem pertains to character input from the keyboard only. It does not affect importing of data from text files, nor does it affect exporting to text files.
The possible workarounds at this point are
It should be reiterated that this problem will affect you only if Arabic or Hebrew support is enabled. Thus, most users can disregard this issue entirely.
The long-term solution to font problems
The most serious problems users are facing in using custom fonts in recent versions of Word and other Microsoft software are the result of going against the encoding standards assumed by this software. For the computer industry, Unicode and related technologies are seen to be the way of the future and will not go away. There is no possible long-term solution to the symbol font problems, and there is no incentive for Microsoft to abandon font-linking techniques, which address real concerns in a way that helps the vast majority of their users. The only long-term solution for us to avoid these serious problems will be to abandon the non-conformant practices we have come to depend upon and to begin adopting the same standards and technologies.
Fortunately, this course of action holds many more benefits for us than it does concerns. The new technologies available in Windows - Unicode and smart fonts - are some of the very things we have needed all along. They alone don’t meet the needs of all minority languages, but they do provide the basis for effective and long-term solutions to our needs, and we are recommending that these technologies be adopted throughout the organisation.
References
[1] Constable, Peter. 1998. Unicode Issues in Microsoft Word 97 and Word 98. SIL IPub resource collection 98.
[2] Hallissy, Bob. 1998. The BoxChar Mysteries presents: The Euro Case. SIL IPub resource collection 98.
Developing and using OpenType fonts
by Bob Hallissy
OpenType, a joint effort by Adobe and Microsoft to implement a “smart font” technology, is finally taking off. Microsoft is fully committed to and dependent on OpenType for rendering complex scripts, both at the operating system level and at the application level. For its part, Adobe utilizes OpenType for its advanced typography abilities in the latest version of their professional publishing application, InDesign. At the International Unicode Conference 16 in Amsterdam last month, a lot of presenters made mention of OpenType when issues of Unicode rendering were brought up.
Recently I’ve had some success at implementing a cursive RTL script using OpenType, and seeing the text properly rendered by Microsoft Word 2000 running on stock English versions of Windows 95, Windows 98, and Windows 2000.
OS and application support – when is OpenType useful?
In order to take advantage of OpenType, you need two things: an application or OS that understands OpenType, and fonts that implement the features that the application or OS is expecting. It is important to understand that successful use of OpenType requires fonts and software that are built to cooperate with each other. This is noticeably different from Apple’s AAT (formerly GX font) technology, which allows applications to take advantage of new features that a font vendor might invent even though the application was not designed with those features in mind.
Version 1.5 of Adobe’s high-end publishing platform, InDesign, utilizes OpenType to provide Latin typography features such as ligatures, swash variants, etc. InDesign is pre-programmed to recognize a fixed set of features that may be present in a font and to give the user access to those features. At this time, however, InDesign does not support any complex non-Roman scripts. More information about InDesign and its use of OpenType can be found on Adobe’s website.
Microsoft has bet the farm on OpenType. Windows 2000, Office 2000, Rich Text Edit, and Internet Explorer 5 all depend on OpenType for rendering of complex scripts. In Windows 2000, even “dumb” applications like Notepad can handle complex scripts via OpenType. This means that in every market that requires complex scripts, Microsoft is depending on OpenType to provide the necessary rendering capability.
The core component that implements Microsoft’s OpenType support, called Uniscribe, is bundled into USP10.DLL, which was first supplied with IE 5 and more recently with Office 2000 (see Errata) and Windows 2000. Note that Windows 2000 is not required to use Uniscribe – it works perfectly well on Windows 9x.
Uniscribe gets its name from the phrase “Unicode Script Processor”. In brief, Uniscribe assists a Unicode-based application by splitting a string of Unicode characters into script runs, and then passing each run in turn to a script-specific rendering engine for “shaping” (the process of selecting and positioning glyphs to correctly represent the underlying text). There is one specific shaping engine in Uniscribe for Arabic, another for Hebrew, another for Thai, and more for each of the various Indic scripts (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala), etc. A majority of the Unicode range, however, is considered non-complex and does not have its own OpenType shaping engine. As a result, you can take advantage of OpenType through Uniscribe only if your script is one of the select few with shaping engines implemented inside USP10.DLL.
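As a rough illustration of the run-splitting step described above, the following Python sketch breaks a string into script runs. It is not Uniscribe's actual algorithm: the function names are hypothetical, and guessing a script from the first word of the Unicode character name is a simplifying assumption that works only for straightforward cases.

```python
import unicodedata


def script_of(ch):
    """Guess a script from the Unicode character name.

    Assumption for illustration only: the first word of the name
    (e.g. "ARABIC", "HEBREW", "LATIN") stands in for the script.
    Spaces and punctuation are treated as neutral (None).
    """
    if ch.isspace() or unicodedata.category(ch).startswith("P"):
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def split_into_script_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters stay with the run that precedes them; leading
    neutrals are absorbed by the first real script that follows.
    """
    runs = []  # each entry is a mutable [script, substring] pair
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            # The run so far held only neutrals: let it adopt this script.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script or "NEUTRAL", chunk) for script, chunk in runs]
```

Each resulting run would then be handed to the shaping engine for its script, or treated as non-complex text if no such engine exists.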
Each shaping engine is, in essence, hard-coded to understand one specific script. More specifically, each engine implements a process that scans its run of text and, based on what it sees, selects various OpenType features to be enabled for each character. The Arabic engine, for example, recognizes which characters need to be in initial, medial, final or isolate form and thus it “applies” (or enables) the OpenType “init”, “medi”, “fina”, or “isol” feature to each character that needs it. It is subsequently up to the font to implement these features and so effect the correct glyph selection. Notice the cooperation between application (in this case the Uniscribe DLL) and the font: the engine will invoke only the features it knows, and the font must implement exactly these features to be successful.
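The kind of decision the Arabic engine makes can be sketched in a few lines of Python. This is a simplified illustration, not the engine's real logic: the `JOINING_TYPE` table is a tiny hand-picked subset of Unicode joining types (D = dual-joining, R = right-joining, T = transparent mark), and the function names are hypothetical.

```python
# Tiny, hand-picked subset of Unicode joining types, for illustration only.
JOINING_TYPE = {
    "\u0628": "D",  # BEH   (joins on both sides)
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u0627": "R",  # ALEF  (joins only to the preceding letter)
    "\u062F": "R",  # DAL
    "\u064E": "T",  # FATHA (vowel mark, transparent to joining)
}


def _neighbour_type(text, i, step):
    """Joining type of the nearest non-transparent character in one direction."""
    j = i + step
    while 0 <= j < len(text):
        jt = JOINING_TYPE.get(text[j], "U")
        if jt != "T":
            return jt
        j += step
    return "U"  # nothing there: behaves like a non-joining character


def joining_features(text):
    """Return 'init', 'medi', 'fina', 'isol' or None for each character."""
    features = []
    for i, ch in enumerate(text):
        jt = JOINING_TYPE.get(ch, "U")
        if jt not in ("D", "R"):
            features.append(None)  # marks and non-joining characters
            continue
        # Only a dual-joining letter can connect forward to what follows it.
        joins_prev = _neighbour_type(text, i, -1) == "D"
        joins_next = jt == "D" and _neighbour_type(text, i, +1) in ("D", "R")
        if joins_prev and joins_next:
            features.append("medi")
        elif joins_prev:
            features.append("fina")
        elif joins_next:
            features.append("init")
        else:
            features.append("isol")
    return features
```

The engine's job ends with a list of feature tags like this one; choosing the actual glyph for each tag is left to the substitution rules in the font.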
Uniscribe, Uniscribe, Uniscribe. Which art thou, Uniscribe?
It should be noted that Uniscribe is a work in progress. With each release of one of these products, a newer and more capable version of Uniscribe has been released. It is important to understand what “more capable” means. The Uniscribe engine has grown in two ways: knowledge of additional scripts, and ability to handle additional OpenType rules.
Microsoft has shipped at least three different versions of Uniscribe over the past couple of years. The primary improvement has been knowledge of additional scripts. Initial releases understood only those characters that were in Unicode 2.1. More recently Microsoft has been enhancing Uniscribe to support characters and scripts that have been added in Unicode 3.0.
As for OpenType rules, the currently released versions of Uniscribe implement only the glyph substitution rules of OpenType and not the glyph positioning rules. Font developers have access to a version of Uniscribe that handles glyph positioning, so this capability is definitely in the pipeline for release within the next year.
Tools for developing fonts – what do I need?
In order to take advantage of the OpenType support in Uniscribe, you need fonts that implement the features that Uniscribe is going to look for. Such fonts are beginning to appear; Microsoft, for one, is working on such fonts in order to provide support in their products for major languages. But if the languages you are interested in are not major languages of economic interest, you may need to rely on custom solutions, either provided by third-party vendors or developed yourself.
Suppose you wanted to build your own OpenType fonts, say to support some minority language rendering need. Where do you start?
The first requirement is to know how OpenType works. Microsoft has a lot of OpenType documentation on their typography website.
Next, you will need to know what OpenType font features the specific shaping engine is going to look for and exactly what Unicode characters will be passed to that shaping engine to be processed. The typography site already contains information for some of the shaping engines, e.g., Indic. Arabic and Syriac documentation is in preliminary form, and should be available on request from Microsoft. I am unaware of documentation for other scripts, though it may be available.
Next, you will need TrueType fonts that you may modify. If you have not designed the outlines yourself, be sure to check the license agreement governing your use of the fonts to make sure you have appropriate permissions. If the fonts do not have Windows platform Unicode cmaps, you may need to re-encode the fonts using your favorite TrueType font editing software.
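One way to check whether a font already has a Windows platform Unicode cmap is to read the encoding records at the start of its cmap table. The Python sketch below does this with the standard library only; the function names are hypothetical, and it assumes a plain single-font TrueType file (not a collection) with no error handling for damaged data.

```python
import struct


def cmap_encodings(font_bytes):
    """Return the (platformID, encodingID) pairs listed in the cmap table."""
    # sfnt header: 4-byte version, then a 2-byte table count at offset 4.
    num_tables = struct.unpack(">H", font_bytes[4:6])[0]
    for i in range(num_tables):
        # Table directory entries are 16 bytes each, starting at offset 12:
        # tag, checksum, offset, length.
        start = 12 + 16 * i
        tag, _checksum, offset, _length = struct.unpack(
            ">4sLLL", font_bytes[start:start + 16]
        )
        if tag == b"cmap":
            # cmap header: version, number of encoding records.
            _version, count = struct.unpack(">HH", font_bytes[offset:offset + 4])
            pairs = []
            for k in range(count):
                # Each encoding record: platformID, encodingID, subtable offset.
                rec = offset + 4 + 8 * k
                pairs.append(struct.unpack(">HH", font_bytes[rec:rec + 4]))
            return pairs
    return []


def has_windows_unicode_cmap(font_bytes):
    """True if the font has a platform 3 (Windows), encoding 1 (Unicode) cmap."""
    return (3, 1) in cmap_encodings(font_bytes)
```

To use it, read the font file in binary mode and pass the bytes to `has_windows_unicode_cmap`. A font that reports only (3, 0), the Windows symbol encoding, or only Macintosh platform entries is a candidate for re-encoding.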
Finally, you need tools to build the OpenType tables and add them to your fonts. For the last several years the only tool available was Microsoft’s free “assembler” tool (no longer available). Using it is akin to writing programs by coding assembly language in hex, and is therefore not recommended.
More recently Microsoft has been beta-testing their Visual OpenType Layout Tool (VOLT). The first release is due out soon, and the tool will be freely downloadable from the web, with a click-to-accept license. Preliminary documentation including a tutorial and screenshots is viewable.
Does it work?
As a test of the technology, I have developed a set of OpenType fonts that implement a variant of Arabic. The particular language in question needs several characters that, although present in Unicode, are not in the Microsoft Arabic codepage. Additionally, this language requires more complete vowel marking than standard Arabic, including combinations that would not be implemented in the Arabic fonts supplied by Microsoft.
Starting with a font derived from Jonathan Kew’s Scheherazade outlines, I used VOLT to implement the features that Uniscribe would be looking for (based on documentation obtained from Microsoft). I used VOLT’s proofing window to see if the features were working correctly. Once a feature worked there, I could install the font into Windows and see if Word 2000 would correctly display text. (If you try this at home, be sure to enable the appropriate language using the applet).
At various times in the project I ran into difficulties that were not adequately dealt with by VOLT. For example, when I discovered I needed some additional glyphs to be added to the font, it turned out there was no way to port my VOLT code to the new font. To solve this problem I resorted to writing some custom Perl scripts.
The results are very promising, and the process does work. VOLT makes the work easier, but not easy. In addition to setting up the glyph substitution rules to pick the correct one in each context (including ligatures) from a palette of about 500 glyphs, I also had to manually position at least two attachment points (one above, one below) on each of the 500 glyphs so that the vowel marks would be correctly positioned by Uniscribe.
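To show why every glyph needs its own attachment points, here is a minimal Python sketch of anchor-based mark positioning: the mark is moved so that its own anchor lands on the matching anchor of the base glyph. The glyph names and coordinates are invented for illustration and do not come from the fonts described here.

```python
# Hypothetical anchor data in font units; real values are set per glyph.
BASE_ANCHORS = {
    "beh.init": {"above": (250, 600), "below": (250, -300)},
    "alef.fina": {"above": (100, 1400), "below": (100, -100)},
}

# Each mark has one anchor and attaches either above or below the base.
MARK_ANCHORS = {
    "fatha": ("above", (120, 450)),
    "kasra": ("below", (120, -50)),
}


def mark_origin(base_glyph, mark_glyph):
    """Where to place the mark's origin, relative to the base glyph's origin.

    The result is the base anchor minus the mark anchor, so that the two
    anchor points coincide once the mark is drawn.
    """
    position, (mark_x, mark_y) = MARK_ANCHORS[mark_glyph]
    base_x, base_y = BASE_ANCHORS[base_glyph][position]
    return (base_x - mark_x, base_y - mark_y)
```

Because the anchor on a tall alef sits far higher than the one on a low beh, the same fatha glyph ends up at a different height over each, which is why the anchors cannot be shared across the roughly 500 glyphs.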
Application support for OpenType is not widespread. Adobe’s InDesign has no complex script support, and Microsoft Office 2000 utilizes the non-extensible script engines wrapped by Uniscribe.
Uniscribe implements only those scripts and characters defined by Unicode, and only a few Unicode ranges have specific shaping engines that invoke OpenType functionality. If your script requires characters in the Private Use Area, or if it is not based on one of the select few with shaping engines, Uniscribe will not help you.
Uniscribe is a work in progress, and newer and more capable versions will be coming out regularly.
OpenType fonts have to be designed in coordination with the software that is going to use them.
While VOLT makes the job of building OpenType tables somewhat easier, it is not a trivial task. Certain limitations of VOLT can be worked around using custom programming, e.g., in Perl.
Although VOLT is a friendly “visual” tool, it does not eliminate the need to understand the relatively complex world of Unicode, Uniscribe, and OpenType technologies.
Within the constraints mentioned above, there are language projects that will be able to benefit from OpenType technology. As a result of the pilot project, several projects in the Eurasia Area are eager to deploy OpenType fonts in Office 2000 in order to obtain a world-class word processing/DTP application that supports their complex script.
Previous articles in NRSI Updates which are related to this topic:
Errata
NRSI Update 11 said “Microsoft Office 2000 is able to make use of Uniscribe/OpenType if these are installed on a machine. Uniscribe doesn’t come with Office 2000, but it can be added as an option when installing Internet Explorer 5.” In fact, Uniscribe is packaged with Office 2000. However, it will not be installed unless you enable, using the applet, a language that requires complex rendering.
Circulation & Distribution Information
The purpose of this periodic e-mailing is to keep you in the picture about current NRSI research, development and application activities.