This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. Please consult our other sites for more current information: software.sil.org, ScriptSource, FDBP, and silfontdev



Short URL: https://scripts.sil.org/NRSIUpdate08

NRSI Update #8 – December 1997

NRSI staff, 1997-12-01

In this issue:

WinRend: Development Report

by Margaret Swauger

Since our last report, Martin Hosken has indeed arrived for a three-month stay in Dallas, working full-time on WinRend research. We are grateful for the granting of a visa so Martin can bring to the team his experience in SE Asian languages along with in-depth knowledge of Windows NLS (Native Language Support) facilities such as codepage and locale.

Martin has already applied his expertise and provided a write-up (see Windows and CodePages in this issue) of the details and pitfalls of Windows codepage architecture. He and Joel Lee will be concentrating on outlining requirements for tools needed by WinRend users.

As is typical in the early phase of a project, it seems that the number of options for achieving WinRend goals (see WinRend Development Report in issue #7 of the NRSI Update), and thus the number of areas that need in-depth research, is increasing. We are still asking more questions than we are answering! Foremost in this list are Rhapsody and Yellow Box from Apple (see “WinRend: Rhapsody and Yellow Box”, below), which hold promise of a cross-platform development environment that has sophisticated non-Roman facilities built in.

The team has made a preliminary ranking of the non-Roman script behaviors that we believe WinRend needs to support. Behaviors listed in Gaultney and Kew’s Writing System Behaviors paper were prioritized into Musts, High wants, and Low wants. We invite field comment on the outcome of this process, summarized below in “WinRend: Writing System Behaviors to be supported”.

Finally, research into Word 97 Unicode support has taken a higher priority, not so much for WinRend (though WinRend might benefit from additional research into the Unicode support provided in Office 97) but because of the problems users are having with SIL fonts such as Encore, IPA, Greek, and Hebrew. Look for a report in the next NRSI Update.

WinRend: Writing System Behaviors to be supported

by Bob Hallissy

On November 7, 1997, the WinRend team reviewed Writing Systems Behavior Listing (abbreviated and illustrated) (J. Victor Gaultney, Jonathan Kew) to determine what behaviors we believe WinRend (WR) will need to support; that paper is available on request. Here is the summary of that discussion.

It should be noted that when starting to think about the implementation of these behaviors, it is natural to think about the mechanisms used in that implementation. Listings makes no reference to mechanisms, and it may be that one behavior requires a number of mechanisms or, more importantly, that one mechanism added to a system will implement a number of behaviors. In certain cases, where a number of behaviors are implemented using one mechanism, that mechanism will be indicated so that later prioritization of work can be better informed.

Behaviors are marked with M, H, or L to denote our conclusion as to whether support for that behavior was a Must, High want, or Low want.

Directionality and Baselines

M Horizontal direction types (L-to-R, R-to-L, mixed)
H Vertical text
L All baseline types

Listing separates Directionality from Baseline Type, and then includes vertical issues only with the Baseline discussion. We determined that vertical issues belong in both categories.

WR must support left-to-right, right-to-left, and mixed direction horizontal text. Vertical text, however, is seen as a high want, with the caveat that many applications will be unable to support it due to layout issues. However, for applications that require vertical text, WR would need to supply correct positioning information.

The correct alignment of various horizontal and vertical baselines is a typographic finesse that is required of a DTP application, but DTP is outside the scope of the WR mandate.

Rearrangement

M Indic-style
M Noncontiguous characters

Reordering is a core functionality, especially important in South Asian scripts.

Justification

L Word or phrase spacing
L Intercharacter spacing
L Kashidas (contextual or not)
L Glyph alteration (decomposition)
L Glyph alteration (ductility)
L Glyph alteration (stretching)
L Glyph alteration (replacement)

The WR mandate is for SIL applications. We do not foresee (at least in the near future) SIL developing DTP applications, and justification facilities are primarily needed for publishing solutions where typographical finesse is a requirement. Therefore we give these facilities low priority.

Spacing

M Kerning (standard)
H Kerning (Cross-stream)
M Collision Avoidance

Listings does not distinguish kerning from position adjustment. The former is a change of position of a glyph that is accompanied by a change in the print position (a.k.a. escapement or advance width). The Standard Kerning example (“WAVE”) is, in our view, true kerning, while the Cross-stream Kerning example (Roman stacked diacritics) is not true kerning since the advance width doesn’t change.

Roman stacked diacritics are best thought of as a form of diacritic placement requiring the x,y adjustment of a glyph relative to another glyph without changing the advance width.
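The distinction can be made concrete with a minimal sketch. Everything here is hypothetical (the `Glyph` class, the `layout` function, and the font-unit numbers are illustrative only and are not part of any WinRend design): a kerning pair moves the pen for the following glyph, whereas a diacritic offset moves only the drawn glyph.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    advance: int   # advance width in font units
    dx: int = 0    # drawing offset in x; does not move the pen
    dy: int = 0    # drawing offset in y; does not move the pen


def layout(glyphs, kern_pairs):
    """Return a list of (name, x, y) draw positions and the total advance."""
    positions = []
    pen_x = 0
    for i, glyph in enumerate(glyphs):
        # Position adjustment: shift the drawn glyph, leave the pen alone.
        positions.append((glyph.name, pen_x + glyph.dx, glyph.dy))
        pen_x += glyph.advance
        # Kerning: the pair value changes where the next glyph starts.
        if i + 1 < len(glyphs):
            pen_x += kern_pairs.get((glyph.name, glyphs[i + 1].name), 0)
    return positions, pen_x


# Kerned pair: the total advance of the line shrinks.
kerned = layout([Glyph("W", 1000), Glyph("A", 700)], {("W", "A"): -100})

# Stacked diacritics: zero-advance marks are offset, total advance unchanged.
stacked = layout(
    [Glyph("a", 500), Glyph("acute", 0, dx=-250), Glyph("grave", 0, dx=-250, dy=300)],
    {},
)
```

In the first call the line becomes shorter than the sum of the advance widths; in the second the marks are repositioned but the line length is exactly the advance of the base glyph.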

Substitutions

L Transliteration
L Number Substitutions
L Vertical Substitution
L Other Substitution

All the substitutions will be available if a features mechanism is implemented. This mechanism allows an application to identify what optional behaviors a font supports and to identify which behaviors are to be enabled for particular runs of text. A features mechanism is required if WR is to be able to achieve results that are similar to the NRSI Greek and Hebrew fonts. However, at this time we are not making this a high priority.

Contextual Forms

M Word Position Contextuals
L Line Position Contextuals
L Cursive Connection
M Ligatures
M Ligatures with diacritics between components

Word position contextual forms and ligatures are required core functionality. Line position and cursive connection (except as can be achieved by simple contextual substitution) are seen as typographical finesse.

Diacritic Placement

M Centered Diacritics
M Relative Diacritics
M Stacking Diacritics
M Contextual Diacritics
L Diacritic Stripping

The correct selection and positioning of diacritics is required core functionality. Diacritic stripping is viewed as a type of Substitution and would thus depend on the availability of the features mechanism.

WinRend: Rhapsody and Yellow Box

by Wes Cleveland

With the introduction of its Rhapsody operating system, and the Yellow Box library for Windows applications, Apple has promised to provide full GX Typography support (like that found in the Mac Operating System) for Rhapsody and Yellow Box applications. The WinRend development team sees this as a possible solution for non-Roman rendering in the Windows operating system. We currently have Rhapsody and Yellow Box installed, and are investigating the possibility of using Yellow Box to provide a large portion of the rendering work.

SIL Hebrew Update

by Joan Wardell

We are pleased to announce that the SIL Hebrew Font System for Windows, Version 1.0, is now available, following the completion of the KeyMan keyboard definitions. See the original Announcement in NRSI Update #6.

Further information, and the full Windows and Macintosh packages are available here.

OpenType Layout (formerly TrueType Open)

by Bob Hallissy

The OpenType Font Jamboree conference, jointly sponsored by Microsoft and Adobe, was held in Redmond, Washington, October 27-28, 1997. It was followed by a day of training on a new font hinting tool, Visual TrueType. In the early planning for the conference, Microsoft expected 40 or 50 attendees. As the conference approached, however, they had to scramble to make room for the 120+ interested people. Peter Martin and I were privileged to represent SIL interests at the conference.

What is OpenType? It is impossible to come up with a one-sentence answer to this question. Among other things, OpenType is

  • An extension of TrueType that encompasses the advanced typographic (“smart-font”) capabilities of what used to be called TrueType Open
  • A joint venture from Microsoft and Adobe to support a unified font format that incorporates both TrueType and Type 1 (PostScript) outlines
  • Tools and technology to improve performance of Web content that uses special fonts
  • Improved protection of fonts through the use of digital signatures and authentication technologies
  • An API (OpenType Layout Services Library) to shield applications from the intricacies of advanced typographic features of OT fonts

For more information about OpenType, visit  http://www.microsoft.com/typography/faq/faq9.htm.

Who was present? Both the presenter and attendee lists included many big names in the type industry. In addition to Microsoft and Adobe, presentations were made by Monotype, Agfa, and Hewlett Packard. Attendees included type foundries (e.g., Bitstream), tool developers (e.g., Macromedia), and application developers. Peter and I had many opportunities to network with the industry’s movers and shakers.

What did we learn? As far as the non-Roman needs within SIL are concerned, OpenType is the same as its marketing predecessor, TrueType Open. Refer to NRSI Updates #1 and #2 for our detailed analysis and concerns about this technology. Microsoft has done nothing to alleviate these concerns, and this conference reinforced our belief that anything we might do to utilize OpenType technology for minority script solutions will be restricted to SIL applications only, even though OpenType itself will be utilized with future releases of the Windows operating system.

It was reassuring, however, to find out that we were not alone in our concerns: several font vendors expressed the same concerns to Microsoft during the conference. Just this month Microsoft initiated an e-mail discussion list to facilitate dialogue between font vendors, application developers, and Microsoft. The list is still just getting started, but we are hoping to use it to generate industry support for solutions to our concerns.

The cooperation between Microsoft and Adobe is interesting and surprising. They appear to be working hard to bury the hatchet in their font war, hopefully to the benefit of customers and application vendors. As evidence of their new-found good will toward each other, Microsoft is including a PostScript Type 1 rasterizer in Windows, and Adobe is creating and distributing fonts in TrueType format. A fundamental difference still exists, however, in that Adobe sells fonts to make a living, and Microsoft gives them away to make a living. At times, it seemed that the truce was tenuous at best.

Another crucial component of the OpenType strategy is the application support which Microsoft calls the OpenType Layout Services Library (OTLSL). This package hides the lower-level OpenType details for applications that wish to take advantage of OpenType fonts. We have had access to the specification for this library since mid-1996, though the library is not available yet, nor is Microsoft making any commitments about availability.

Font hinting: Probably the most fascinating presentation was a demonstration of Visual TrueType, Microsoft’s new font hinting tool. This excellent presentation gave us novices an introduction to the wild and woolly world of TrueType programming. I have a whole new level of respect for those who craft really good-looking TrueType fonts. For more information, see Visual TrueType in this issue.

Font security: Probably the most startling revelation came in the discussion of font security. In order to protect font vendors, OpenType supports the addition of digital signatures in OpenType fonts. The idea is that through an authentication trail it is possible to confirm that a given copy of a font was in fact authored by the type designer named in the font. This means it is possible for systems to refuse to accept fonts which cannot be certified to be authentic, thus protecting type designers from being ripped off. While the benefits of this seem obvious, the drawbacks are subtle: when authentication is fully implemented and enforced, any TrueType font that does not contain a valid signature will fail to render. This means that all fonts you use today will no longer work. As you might guess, this raised a bit of a ruckus in the conference.

Overall conference evaluation: OpenType is tantalizing. It has the potential to do what we need for non-Roman scripts, yet the concerns we have expressed in the past have not been alleviated. While there is some consolation in knowing we are not the only ones with concerns, it is disappointingly clear that Microsoft is going full speed ahead with their own plans. They said they wanted input and feedback from the font industry, but their consistent self-defensive responses to concerns and issues seem to belie their words.

The networking with industry movers and shakers was well worth it. And even though Microsoft does not yet appear to be responsive to our needs, who knows whether the relationships that were initiated or enhanced during the conference may not one day yield fruit.

For more information about the conference, including an outline of the presentations, visit  http://www.microsoft.com/typography/jamboree/default.htm.

HebrewSDF: Pushing the Envelope

by Peter Constable

When I returned to work in October after an extended absence, it was long overdue that one of us on the NRSI team should get acquainted with and evaluate SDF. At the same time, there was interest being expressed in seeing a Hebrew SDF definition to work with the SIL Ezra package. That seemed to me like a good test case to work on.

Various people have successfully implemented SDF definitions for the writing systems they use, and some of their experiences have been published in previous Updates. So, how does Hebrew bring something new to the discussion? Hebrew requires a relatively large number of rules, significantly more than any other implementation that I have seen.

The challenge with Hebrew is that it has various long-distance dependencies. For example, the accent character has two glyphs that are positional variants. A rule is needed for each case when the non-default variant is required. The position of the accent glyphs is mainly contingent upon the consonant which it goes over. So, a rule is needed for each consonant that takes the non-default accent glyph. This, of itself, is something that is very commonly handled in SDF implementations. What is uncommon in the Hebrew case, however, is that there may be several characters between the consonant and the accent.

The full syntax for the character encoding of a syllable in the SIL Ezra Hebrew package is

consonant ( [ dagesh | rafe ] ) (vowel) (accent) (cantillation mark) ( [ asterisk | circellus ] ) 

where ( X ) means X is optional and [ A | B ] means that either A or B may occur (but not both). To select the correct accent glyph, a separate rule is required for each combination of consonant, dagesh, rafe and vowel that requires the non-default accent glyph. That turns out to be about 140 rules just to select the correct positional variant for the accent character!
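The syllable syntax above can be written as a regular expression over character classes. The following is only a sketch: the one-letter class codes and the function names are hypothetical shorthand introduced here, not part of SDF or of the SIL Ezra encoding. It also enumerates the class sequences that can stand between the start of the syllable and an accent, which is why rules multiply as one moves rightward through the syllable.

```python
import re

# Hypothetical one-letter codes for the character classes:
#   C consonant, D dagesh, R rafe, V vowel, A accent,
#   M cantillation mark, X asterisk, O circellus
SYLLABLE = re.compile(r"C[DR]?V?A?M?[XO]?")


def is_valid_syllable(classes: str) -> bool:
    """True if the sequence of class codes matches the syllable syntax."""
    return SYLLABLE.fullmatch(classes) is not None


def contexts_before_accent():
    """Every class sequence that may precede an accent within a syllable."""
    return ["C" + mark + vowel for mark in ("", "D", "R") for vowel in ("", "V")]


print(contexts_before_accent())
```

Each of these context shapes must then be expanded over the actual consonants and vowels that call for the non-default accent glyph, since a rule has to spell out the characters that intervene between the consonant and the accent.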

Accent is not the only character for which glyph selection is dependent upon the consonant. Of the characters that can follow the consonant, dagesh, vowels, accent and cantillation marks, all have positional variants that are (at least) dependent upon the consonant. For each of these characters, separate rules are needed for each combination of the characters that precede it. As we move through the positions in the syllable, an increasingly large number of rules is required.

I presently have a prototype SDF file that supports all combinations involving consonants, dagesh, rafe, vowels and accent. It does not include support for cantillation marks, nor does it handle a number of special cases (e.g. weak aleph, furtive patah). The current version of this SDF file has a total of 541 rules. I anticipate that adding support for cantillation marks would more than double the number of rules required. (A note on the size: the largest SDF file I had otherwise encountered has 247 rules.)

How does SDF manage with this many rules? Informal testing with Shoebox and LinguaLinks suggests that SDF can manage large numbers of rules fairly well. There are some caveats to be considered, however.

With the 16-bit variant of Shoebox 3.07 (a test version, I don’t think it has been widely distributed) and RENDER16.DLL dated Oct. 22, I was encountering system crashes after the SDF file grew beyond a certain size. After changing to the 32-bit variant of Shoebox 3.09 (a candidate for release) together with a newer version of RENDER32.DLL dated Nov. 8, I was able to use the largest of my SDF files without incident. (The largest is an unoptimized version containing 1672 rules.) I encountered some performance problems, however. (Note: I am using a 200 MHz Pentium Pro.)

Using an SDF file containing 150 rules, I experienced a noticeable delay in scrolling and editing. With 150 rules, performance is just tolerable. At 275 rules, performance was starting to become intolerable. As I increased the number of rules, performance continued to drop. After pointing this out to the Shoebox development team, they explained that, as an edit is made, Shoebox redraws from the beginning of the field, with all the text passed through SDF each time. My test file contained a page of text in a single field. This is not how Shoebox is typically used, however, so it has not been optimized for this situation. When my text was divided into multiple, smaller fields (paragraphs), performance did improve significantly. With 541 rules, performance is very tolerable. With 1672 rules, there was a delay of perhaps 0.5 seconds as each character was typed. This performance is not great, but probably could be put up with. Of course, it remains to be seen if any script would ever need that many rules.

To ensure good performance in Shoebox when using SDF, keep the contents of fields small.

With LinguaLinks, it appears that some optimization has been done in screen-drawing routines, so performance was not a major issue, but still a factor. Stability, however, is a problem in version 2.0. When using an SDF-transduced font in LinguaLinks, under certain circumstances LinguaLinks will report an “SDF encoding error”. In spite of the message, the problem appears to lie with LinguaLinks and not with SDF.

In summary, my experimentation with SDF indicates that the product is working quite well. I have made suggestions to Timm Erickson regarding functional improvements, and he has been quite responsive. (See “Some Recent Changes to SDF” in this issue.) As I write, Timm and the Shoebox team are each giving further consideration to the performance issues in their respective products.

Some Recent Changes to SDF

by Peter Constable

As mentioned in my other article, “HebrewSDF: Pushing the Envelope”, I have recently been doing some testing with SDF. After working with SDF and the SDF editor for a while, I wrote to Timm Erickson with some suggestions. As a result, Timm has made the following improvements:

  • In the SDF file format, text to the right of a pipe character “|” in a rule is treated as a description. This could be of any length. The SDF editor, however, would truncate the description to 20 characters. Beginning in version 1.0c of the editor, the description can now be up to 30 characters long.
  • For scripts which involve contextualization that is dependent upon word position, SDF requires some way of knowing whether a character is punctuation, a diacritic, a 2-form non-connector, or a 4-form connector. In some applications (e.g. Hebrew), this would mean that a rule would have to be written giving 4 forms for every consonant, even if the consonant character has only one glyph variant. Likewise, for every diacritic a rule would be needed that made no change but simply told SDF that the character is a diacritic. (SDF treats characters as punctuation unless told otherwise, so it is not ever necessary to create rules that simply declare a character as punctuation.)
  • Timm has now implemented changes to the editor (version 1.1) and the rendering DLLs that permit writing one-line declarations of diacritics, 2-form non-connectors, and 4-form connectors. The following syntax is used in the SDF file: In the header section, after the line that begins with the marker “abc ”, any of the following lines may be added:

define_diacr
define_char2
define_char4

where the characters to be declared for the given class are listed after the marker (in similar fashion to the abc field). For example, the following lines are used for Hebrew:

abc ‘bgdhwzxXyklmnsvpcqrHWSt�?&aAoieEOuüáóéø´
define_diacr �?&aAoieEOuüáóéø´
define_char4 ‘bgdhwzxXylsvqrHWSt

In the editor, the properties dialog now has additional text boxes where these declarations can be made.

To make use of this capability, you must have a copy of RENDER32.DLL dated 11/8/97 or later. I have not yet seen a 16-bit version that supports this change.
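As an illustration of what these one-line declarations convey, here is a minimal sketch that reads header lines in the form shown above and reports the class of each character. It is not SDF's own parser: the function names, the class labels, and the shortened sample header are all hypothetical, and the markers are taken exactly as printed in this article. It follows the stated default that any character not declared otherwise is treated as punctuation.

```python
# Marker names as printed in this article; class labels are illustrative.
MARKER_TO_CLASS = {
    "define_diacr": "diacritic",
    "define_char2": "2-form non-connector",
    "define_char4": "4-form connector",
}


def classify_header(lines):
    """Map each declared character to its class, from header lines."""
    classes = {}
    for line in lines:
        marker, _, chars = line.partition(" ")
        if marker in MARKER_TO_CLASS:
            for ch in chars:
                classes[ch] = MARKER_TO_CLASS[marker]
    return classes


def char_class(classes, ch):
    """Characters with no declaration default to punctuation."""
    return classes.get(ch, "punctuation")


# A shortened, made-up header (not the real Hebrew definition).
header = ["abc bgdaA.", "define_diacr aA", "define_char4 bgd"]
classes = classify_header(header)
print(char_class(classes, "a"), char_class(classes, "b"), char_class(classes, "."))
```

Without such declarations, each of these facts would have to be expressed by a separate rule for every consonant and diacritic.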

There are a few other changes I hope to see made in future versions of the editor:

  • a most-recently-used file list
  • the ability to specify a default location for SDF files
  • a search mechanism
  • a filtering mechanism (e.g. show only rules that contain ... in the final position of the output)

It would also be nice to see support in the editor and rendering engine for operations on classes of characters.

In the meantime, I have been very pleased with the changes Timm has made, and at how quickly he responded to my suggestions. Thanks very much, Timm!

Microsoft Visual TrueType

by Peter Martin

At the joint Microsoft and Adobe OpenType Font Jamboree held in Redmond, Washington in October, we saw Microsoft’s first hinting tool for use outside of the company. Bob Hallissy and I were able to attend a one-day training event for Visual TrueType (VTT) 4.0, and experience firsthand the intimidating complexity of TrueType hinting.

TrueType fonts can include hinting information which improves the appearance of type rendered at low resolutions and small point sizes. This information is expressed in terms of instructions for the rendering virtual machine; the hints for a given character resemble chunks of assembler source code. Macromedia’s Fontographer will generate hints automatically and provides basic support for manual hinting, and, while this is better than no hinting at all, there is room for improvement. I noted the attendance of one of the Macromedia developers at this training session.

Visual TrueType is a sophisticated, cross-platform tool for embedding hinting information in a generated TrueType font. The hints can be created visually, or by editing either the TypeMan Talk hinting language or the lower-level TrueType assembly language. In its present form it works best as a post-production tool, for applying hints to a finished font. If the font is regenerated for any reason, the hints have to be recreated from scratch, or glyphs from the new font imported into the old. In principle, VTT should provide excellent hinting for non-Roman glyphs; the only Roman bias we saw appeared in naming conventions for Control Value Table entries; the lead developer will consider generalizing this mechanism.

This tool certainly improves on previous approaches to hinting which involved typing arcane assembler followed by testing, in an iterative loop. It still requires a great deal of understanding of the TrueType rendering engine, and of hinting strategies and type design techniques. It involves significant analysis of glyph paths and meticulous preparation for the hinting. In other words, the visual interface is appealing, but a huge amount of detailed manual work remains. The release of VTT is tantalizing because it simplifies much, but still requires a serious commitment of personnel in order to bring any real benefit.

Windows and Codepages

by Martin Hosken

Abstract

This document examines how Windows 95 handles multi-lingual computing. It looks at Languages, Codepages, Locales, Unicode and Fonts with particular reference to their support in Windows 95.

An alternative title for this document might be: “How to add a new script to Windows 95 and fail”.

Introduction

For those people requiring the availability of different scripts on their computers, a number of tools and approaches are available for Windows 3.1. How appropriate are such tools and approaches in Windows 95 which has better support for multilingual computing?

Here we introduce some of the basic concepts used in the rest of this discussion.

Unicode

Unicode is a 16-bit character set. Its primary purpose is for data interchange, just like ASCII. Whilst it aims to support every language, as we shall see, care should be taken in assuming that something which supports Unicode will necessarily support your language.

Language ID

A Language ID is a 16-bit number used to identify a particular language. Amongst other things, a particular language has one sort order associated. For this reason, the language ID is broken into two parts: a 10-bit primary language ID and a 6-bit sub-language ID. For example, US English = 0x0409, whilst UK English = 0x0809. Spanish (Traditional Sort) = 0x040a and Spanish (Modern Sort) = 0x0c0a.
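The split can be checked against the four example values with a short sketch. The function names are hypothetical; the layout assumed here, with the primary language in the low 10 bits and the sub-language in the high 6 bits, is the one used by the Win32 PRIMARYLANGID and SUBLANGID macros.

```python
def split_langid(langid: int):
    """Split a 16-bit language ID into (primary, sub-language)."""
    primary = langid & 0x3FF          # low 10 bits
    sublang = (langid >> 10) & 0x3F   # high 6 bits
    return primary, sublang


def make_langid(primary: int, sublang: int) -> int:
    """Combine a primary language and sub-language into a language ID."""
    return ((sublang & 0x3F) << 10) | (primary & 0x3FF)


for langid in (0x0409, 0x0809, 0x040A, 0x0C0A):
    primary, sublang = split_langid(langid)
    print(f"0x{langid:04x}: primary=0x{primary:02x} sub=0x{sublang:02x}")
```

US and UK English share primary language 0x09 and differ only in the sub-language; likewise the two Spanish sort orders share primary language 0x0a.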

Locale

Each language has a locale identified by the language ID. A locale specifies how to represent certain information, e.g. dates, monetary values, month names, in the language. It contains no information on how data is stored or sorted in any scripts of the language. In Windows 95 and Windows NT all locale information is stored in Unicode.

Codepage

Each different encoding in a system needs to have information describing how to map to and from Unicode, to describe the semantics of each character (e.g. upper to lowercase mapping, identifying numbers) and to give default and language specific sorting information. Each encoding, or codepage, is given a 16-bit number identifying it.

The rest of this document is a discussion of how all this is implemented in Windows 95 and associated products. We start by examining how we might expect it all to work, and then look at the problems of this and resulting realities. Finally we take a quick peek into the fog trying to guess what the future might hold.

The Windows 95 Solution

Files

Windows 95 has a single locale file, Windows\System\Locale.nls, which holds all the locale information for every language.

Windows 95 also holds one file for each codepage. The names of these files are usually of the form Windows\System\cp_nnnn.nls where nnnn is the codepage number, in decimal. The particular file for a codepage is referenced via the registry at key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage

Within this key, each codepage number has an entry which references a file relative to Windows\System.

Fonts

TrueType fonts are a technology in themselves. Each font consists of a number of tables holding various pieces of information pertaining to rendering. One of the tables (cmap) is used to map between the external codepoints and the internal glyphs. In Windows (all versions) this mapping is from a 16-bit value (assumed to be Unicode) to a glyph in the font.

Applications normally store data in an 8-bit form, requiring a mapping from the 8-bit form to the 16-bit form used by a font. This is where codepages come into play. They hold the 8-bit to 16-bit mapping information. In Windows 3.11 there is one, de facto, mapping. The precise nature of the mapping is dictated by the national version of Windows that you have. So, for example, US Windows supports codepage 1252; Thai Windows supports codepage 874; and so on. This also corresponds to the default codepage provided with a particular national version of Windows 95. Windows also supports a few other minor mappings: Symbol and OEM (corresponding to DOS), but again, these are fixed and not extensible. Windows 95, on the other hand, theoretically, can support any number of codepages. This is particularly useful when doing multilingual computing.
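The effect of a codepage as an 8-bit to 16-bit mapping can be seen with a small sketch. It uses Python's built-in `cp1252` and `cp874` codecs as stand-ins for the Windows codepage files; those codecs reflect present-day mapping tables, so treating them as equivalent to the Windows 95 files is an assumption. The helper name `to_unicode` is hypothetical.

```python
def to_unicode(data: bytes, codepage: int) -> str:
    """Decode 8-bit text using the named Windows codepage."""
    return data.decode(f"cp{codepage}")


# The same stored byte means different characters under different codepages.
byte = b"\xe9"
for codepage in (1252, 874):
    char = to_unicode(byte, codepage)
    print(f"codepage {codepage}: byte 0xE9 -> U+{ord(char):04X}")
```

Under codepage 1252 the byte is a Latin letter (U+00E9); under codepage 874 it is a Thai tone mark (U+0E49). The font's cmap only ever sees the 16-bit result, which is why the choice of codepage matters before rendering begins.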

Once you get into the realm of multiple codepages, a font needs to indicate which codepages it supports. This is done, to a limited extent, within a TrueType font. Details of how this works, and the limitations it imposes, are covered in the next section.

Windows 95 Implementation

Introduction

Given the solution presented by Windows 95, adding a new orthography to Windows 95 would consist merely in producing a codepage for the encoding and any locale entries for the languages which use that codepage. These could then be inserted into the appropriate locations, and, hey presto! we can work with the new orthography in all our applications.

Unfortunately Windows 95 has various problems to overcome, and the solution to these problems results in a severe limiting of the openness of the system.

Let us return to the problem of a font indicating which codepages it supports. This is a necessary activity in order to provide scripting support by choice of font. In a program such as Word, each font is listed with the scripts it supports. Thus, if you install the multilingual extensions to Windows 95, you will have large versions of such fonts as Times New Roman. When you pull down a font selection list in, say, Word, you will see that Times New Roman can be selected in various forms, including Central European, Greek, etc. In order for Windows to give you this list, it is necessary for it to be able to interrogate the font in question to see which scripts (or codepages) it supports.

The information is provided by means of a 64-bit bitfield, stored in the OS/2 table of the TrueType font file, in which each codepage in question is allocated one of the bits. If the font supports that codepage, then the corresponding bit is set.

Toward our end of adding a new script to Windows, therefore, all we need do is allocate one of those bits to a codepage of our choice and everything is fine. The difficulty is in how to do this. Due to the small number of bits, Windows may just as well hard-code the allocation of the bits to codepage numbers, which is what it does. There is no way to add a bitfield entry to codepage number mapping to the system. Thus, if Windows does not know about your codepage at design time, it cannot be properly integrated.
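The hard-coded allocation can be pictured as a fixed lookup table. The sketch below is illustrative only: the function name is hypothetical, and the table is a partial list of bit assignments as given in the published OpenType specification for the OS/2 table's codepage range fields, not a reproduction of Appendix A or of Windows internals.

```python
# Partial bit-to-codepage table (from the OpenType OS/2 table documentation).
CODEPAGE_BITS = {
    0: 1252,   # Latin 1
    1: 1250,   # Latin 2 (Eastern Europe)
    2: 1251,   # Cyrillic
    3: 1253,   # Greek
    4: 1254,   # Turkish
    5: 1255,   # Hebrew
    6: 1256,   # Arabic
    7: 1257,   # Baltic
    16: 874,   # Thai
}


def supported_codepages(range1: int, range2: int = 0):
    """List known codepages whose bit is set in the 64-bit bitfield.

    range1 holds bits 0-31 and range2 holds bits 32-63.
    """
    bitfield = (range2 << 32) | range1
    return [cp for bit, cp in sorted(CODEPAGE_BITS.items()) if (bitfield >> bit) & 1]


print(supported_codepages(0x00000001))   # a plain ANSI font
print(supported_codepages(0x00010009))   # Latin 1, Greek and Thai
print(supported_codepages(0, 0x100))     # a bit with no table entry
```

The last call returns an empty list: a font can set any bit it likes, but a bit that the fixed table does not know about corresponds to no codepage, which is the limitation described above.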

The overall upshot of this is that such applications as Word and WordPad do not support codepages beyond a restricted set.

Good News

Thankfully, this lack of a mapping is not insurmountable. At the API level (the level at which programs interact with Windows internally) any codepage is referenceable and usable, if care is taken. Programmers are referred to the MultiByteToWideChar() family of function calls, which map from 8-bit to Unicode.

Porting from Windows 3.1 to Windows 95

In order to support fonts indicating which codepages they support, the TrueType specification underwent a quiet change between Windows 3.1 and Windows 95. As a result it is possible that a font may work perfectly adequately in Windows 3.1 but not at all in Windows 95. This is because it has not got the codepage information in it. For much of the time Windows 95 guesses quite happily, but this should not be relied upon.

Another change made at the same time was that a font can indicate which Unicode ranges it supports. I am not sure what this is used for yet, but I have my suspicions. For a table of which bit means which codepage, see Appendix A: Codepage Bitfields, and for a table of Unicode ranges, see Appendix B: Unicode Subset Bitfields.

If it is necessary to add any of this information to a font, there are a number of tools to help. Typecaster, from version 3, supports two commands at the start of a .cst file. codepage_range is followed by two 32-bit hex values separated by commas, and indicates the codepage bitfield to be included in the font. unicode_range is followed by four 32-bit hex values separated by commas, and indicates the Unicode ranges supported by the font. For example:

code 1
uni 3

This is the default, used for an ANSI font and indicates codepage 1252. Notice that missing values are assumed to be 0.
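The arithmetic behind these values can be checked with a short Python sketch. It assumes that bit 0 is the least significant bit of the first 32-bit value and that bit 32 is the least significant bit of the second, which is consistent with the value 1 meaning codepage 1252 above. The helper names are illustrative and are not part of Typecaster.

```python
def bits_to_words(bits, word_count):
    """Pack bit numbers into word_count 32-bit values, lowest bits first."""
    words = [0] * word_count
    for bit in bits:
        words[bit // 32] |= 1 << (bit % 32)
    return words


def format_words(words):
    """Render the values as comma-separated hex."""
    return ",".join(f"{word:X}" for word in words)


# Codepage bit 0 (codepage 1252) gives the two-value codepage field.
print(format_words(bits_to_words([0], 2)))     # 1,0
# Unicode bits 0 and 1 (Basic Latin, Latin-1 Supplement) give the four-value field.
print(format_words(bits_to_words([0, 1], 4)))  # 3,0,0,0
```

On this reading, the Unicode value 3 in the example corresponds to bits 0 and 1 of Appendix B, that is, Basic Latin plus Latin-1 Supplement.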

Fontographer version 4.1 and beyond allows the insertion of the necessary information.

An interim Perl 4 program called hackos2 also exists. It allows manipulation of the OS/2 table, which is the table in a TrueType font that contains the appropriate bitfields.
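For readers who want to inspect these fields themselves, here is a minimal Python sketch that reads them from the raw bytes of a TrueType file. It is not hackos2, whose internals are not described here. The byte offsets (Unicode range values at offset 42 of the OS/2 table, codepage values at offset 78, present only from table version 1) are taken from the published OS/2 table layout as best understood and should be treated as assumptions to check against the specification.

```python
import struct


def read_os2_ranges(font_bytes):
    """Return (unicode_ranges, codepage_ranges) from a TrueType font's OS/2 table.

    codepage_ranges is None when the table is too old (version 0) to
    carry the codepage bitfield.
    """
    # Offset table: sfnt version (4 bytes), numTables (2 bytes), then three
    # 2-byte search fields; 16-byte table records follow at offset 12.
    (num_tables,) = struct.unpack_from(">H", font_bytes, 4)
    for i in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sLLL", font_bytes, 12 + 16 * i
        )
        if tag == b"OS/2":
            break
    else:
        raise ValueError("font has no OS/2 table")

    (version,) = struct.unpack_from(">H", font_bytes, offset)
    unicode_ranges = struct.unpack_from(">4L", font_bytes, offset + 42)
    codepage_ranges = None
    if version >= 1 and length >= 86:
        codepage_ranges = struct.unpack_from(">2L", font_bytes, offset + 78)
    return unicode_ranges, codepage_ranges
```

Given the contents of a font file read in binary mode, read_os2_ranges(data) would return the four Unicode range values and the two codepage values. A result of None for the codepage values suggests a font without the codepage information.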

To mimic the behaviour of Windows 3.1, it is most likely that a user will want to make their font an ANSI font and indicate that it supports codepage 1252. Symbol fonts, whilst not having a codepage file, do have a codepage bit associated with them.

Multilingual Extensions

As another example of what is going on, we can look at the multilingual extensions supplied with Windows 95. To install them, go to Add/Remove Programs in the control panel and click on the Windows Setup tab. From there, select Multilingual Extensions and click OK. You will have to restart Windows to gain the full benefits.

Here is what installing these extensions does.

  • Adds a bunch of codepage files to your Windows\System directory and updates the registry accordingly.
  • Replaces your system fonts (Times New Roman, Arial, Lucida Sans, Courier New, etc.) with large fonts that encompass all the codepages added.
  • Adds different language keyboards. Setting a particular language keyboard indicates to the application the language associated with that keyboard. This is used in such applications as Word 97.

Overall this is probably a worthwhile thing to do if you are intending to work with any scripts beyond Western European.

Unicode: The Future

As far as Windows and NT are concerned, the future is Unicode. This means that underlying storage will increasingly be Unicode. For example, Word 97 stores its data in Unicode, as will WordPad and other applications, and uses conversion techniques to generate 8-bit data when necessary.

Example: Word 97

One of the difficulties encountered with Word 97 sometimes occurs with a font change, when data unpredictably either disappears into little boxes or, when saving, converts to question marks. What is going on?

Word 97 keeps track of which codepage data is entered with. In the case of a Symbol font, there is no associated codepage, due to the vagaries of Unicode. Thus Word 97 converts the data directly into Unicode (and incidentally gives it the system codepage).

Then a user decides to change font to one with a different encoding. In the case of a supported codepage, Word will not allow the user to change the encoding. In the case of Symbol encoding, Word allows you to change the font to one which supports the system codepage. But that font need not support the Unicode values used by the Symbol encoding (U+F020-U+F0FF), and so those characters are converted into boxes.
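The failure can be pictured with a small Python sketch. It assumes, from the range quoted above, that a Symbol-encoded byte b is stored as the code point U+F000 + b. The function names and the stand-in "font coverage" sets are purely illustrative and do not describe Word's internals.

```python
def symbol_to_unicode(raw):
    """Map Symbol-encoded bytes 0x20-0xFF into U+F020-U+F0FF."""
    return "".join(chr(0xF000 + byte) for byte in raw)


def missing_from_font(text, covered):
    """Return the code points in text that the font has no glyph for."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) not in covered]


stored = symbol_to_unicode(b"abc")

# Stand-in for a font that covers only the Latin 1 code points.
latin1_font = set(range(0x20, 0x100))
# Stand-in for a Symbol-encoded font.
symbol_font = set(range(0xF020, 0xF100))

print(missing_from_font(stored, symbol_font))  # nothing missing
print(missing_from_font(stored, latin1_font))  # every character missing
```

Every character that ends up in the "missing" list is one that would be displayed as a box after the font change.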

There is a mechanism in later versions of Word 97 (Service Release 1) to allow conversion from fonts using alien (to the bitfield system) codepages into fonts with known codepages. But then there is a problem with typing since the 8-bit key-codes are converted using the known, converted, codepage, rather than the alien codepage. So we cannot fully support a new codepage that way.

A second problem arises when storing as 8-bit ASCII text. Word 97 converts the data to ASCII via the system codepage (see the ACP entry in the codepage section of the registry). This conversion, from one codepage to another via Unicode, makes a best approximation to an 8-bit form of each character. The result is, for example, the letter a being output rather than a hooked a or, when there is no good approximation, a question mark. Since the system has no idea what Symbols are, they all get converted to question marks.
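The behaviour described here can be imitated in Python, with one important caveat: Python has no equivalent of the Windows approximation tables, so the "best approximation" step below is simulated by decomposing the character and dropping its combining marks. That substitution is an assumption made for illustration and is not Word 97's actual algorithm. An a with ogonek (U+0105) stands in for the hooked a, and cp1252 stands in for the system codepage.

```python
import unicodedata


def to_8bit(text, codec="cp1252"):
    """Convert text to 8-bit data, approximating where no exact match exists."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
            continue
        except UnicodeEncodeError:
            pass
        # Simulated best approximation: drop diacritics, try the base letter.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        try:
            out += base.encode(codec) if base else b"?"
        except UnicodeEncodeError:
            out += b"?"  # no approximation available
    return bytes(out)


# The accented a approximates to "a"; a Symbol character in U+F0xx becomes "?".
print(to_8bit("\u0105\uf061"))
```

The Symbol character has no decomposition and no place in the target codepage, so the question mark is the only possible outcome for it.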

This, at least, is what we think Word 97 is up to. Its handling of codepages, and especially Symbol fonts, is consistent in that the same thing happens every time, but not necessarily logical when compared with behaviour in other parts of the program. (For example, try converting some text from Times New Roman to Symbol and back again.)

Conclusion

The future trend towards Unicode support has major implications for those wishing to work with scripts not specified in the version of Unicode that is implemented.

Firstly, there is more information held about characters than just how to render them. There are all sorts of semantic information to do with case, directionality, diacritics, etc. At the moment, this is stored in the codepage, thus allowing one codepage to effectively give a different semantic meaning to a Unicode character than another codepage. NT and probably Windows will tend towards a centralised semantic database for the whole of Unicode. As it is, this is achievable, through compression, in about 9K bytes.
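The kind of per-character information in question can be seen in the Unicode Character Database as exposed by Python's standard unicodedata module. This is offered only as an illustration of a centralised property database; it says nothing about how Windows or NT store the data, and the describe() helper is hypothetical.

```python
import unicodedata


def describe(ch):
    """Collect a few semantic properties of one character."""
    return {
        "category": unicodedata.category(ch),    # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),   # directionality class
        "combining": unicodedata.combining(ch),  # non-zero for combining marks
        "has_case": ch.lower() != ch.upper(),    # does the letter have case forms?
    }


# Latin small a, Hebrew alef, combining acute accent.
for ch in ("a", "\u05d0", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Since these properties are attached to the code point itself, a font that draws something else at that code point does not change how software treats the character.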

The implication for multilingual users is that it will be increasingly difficult to reinterpret characters to our own ends. Our existing technique of saying that an A acute looks like a high tone diacritic in another font is not going to work so well.

Secondly, Unicode is a data transfer standard, as ASCII was, and rendering directly from Unicode is sometimes very difficult. Our fonts are going to have to become smarter, as will our rendering technology. Scripting issues will increasingly have to become a speciality rather than something that OWLs can necessarily deal with unaided.

Having said all this, as an organisation we are not in an unhealthy position and if we keep working at it, we can stay that way.

Appendix A: Codepage Bitfields

ANSI

Bit      Code page  Description
0        1252       Latin 1
1        1250       Latin 2: Eastern Europe
2        1251       Cyrillic
3        1253       Greek
4        1254       Turkish
5        1255       Hebrew
6        1256       Arabic
7        1257       Baltic
8 - 15              Reserved for ANSI

ANSI and OEM

Bit      Code page  Description
16       874        Thai
17       932        Japanese, Shift-JIS
18       936        Chinese: Simplified chars, PRC and Singapore
19       949        Korean Unified Hangeul Code (Hangeul TongHabHyung Code)
20       950        Chinese: Traditional chars-Taiwan and Hong Kong
21       1361       Korean (Johab)
22 - 29             Reserved for alternate ANSI and OEM
30 - 31             Reserved by system. (Bit 31 is used for Symbol Fonts)

OEM

Bit      Code page  Description
32 - 47             Reserved for OEM
48       869        IBM Greek
49       866        MS-DOS Russian
50       865        MS-DOS Nordic
51       864        Arabic
52       863        MS-DOS Canadian French
53       862        Hebrew
54       861        MS-DOS Icelandic
55       860        MS-DOS Portuguese
56       857        IBM Turkish
57       855        IBM Cyrillic; primarily Russian
58       852        Latin 2
59       775        Baltic
60       737        Greek; former 437 G
61       708        Arabic; ASMO 708
62       850        Western European/Latin 1
63       437        US

Appendix B: Unicode Subset Bitfields

Bit Description
0 Basic Latin
1 Latin-1 Supplement
2 Latin Extended-A
3 Latin Extended-B
4 IPA Extensions
5 Spacing Modifier Letters
6 Combining Diacritical Marks
7 Basic Greek
8 Greek Symbols and Coptic
9 Cyrillic
10 Armenian
11 Basic Hebrew
12 Hebrew Extended
13 Basic Arabic
14 Arabic Extended
15 Devanagari
16 Bengali
17 Gurmukhi
18 Gujarati
19 Oriya
20 Tamil
21 Telugu
22 Kannada
23 Malayalam
24 Thai
25 Lao
26 Basic Georgian
27 Georgian Extended
28 Hangul Jamo
29 Latin Extended Additional
30 Greek Extended
31 General Punctuation
32 Subscripts and Superscripts
33 Currency Symbols
34 Combining Diacritical Marks for Symbols
35 Letter-like Symbols
36 Number Forms
37 Arrows
38 Mathematical Operators
39 Miscellaneous Technical
40 Control Pictures
41 Optical Character Recognition
42 Enclosed Alphanumerics
43 Box Drawing
44 Block Elements
45 Geometric Shapes
46 Miscellaneous Symbols
47 Dingbats
48 Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49 Hiragana
50 Katakana
51 Bopomofo
52 Hangul Compatibility Jamo
53 CJK Miscellaneous
54 Enclosed CJK
55 CJK Compatibility
56 Hangul
57 Reserved for Unicode Subranges
58 Reserved for Unicode Subranges
59 CJK Unified Ideographs
60 Private Use Area
61 CJK Compatibility Ideographs
62 Alphabetic Presentation Forms
63 Arabic Presentation Forms-A
64 Combining Half Marks
65 CJK Compatibility Forms
66 Small Form Variants
67 Arabic Presentation Forms-B
68 Halfwidth and Fullwidth Forms
69 Specials
70-127 Reserved for Unicode Subranges

Circulation & Distribution Information

The purpose of this periodic e-mailing is to keep you in the picture about current NRSI research, development and application activities.


© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI).