WSTech: Writing Systems Technology (formerly known as NRSI)
NRSI Update #8 – December 1997
In this issue:
WinRend: Development Report
by Margaret Swauger
Since our last report, Martin Hosken has indeed arrived for a three-month stay in Dallas, working full-time on WinRend research. We are grateful for the granting of a visa so Martin can bring to the team his experience in SE Asian languages along with in-depth knowledge of Windows NLS (Native Language Support) facilities such as codepage and locale.
Martin has already applied his expertise and provided a write-up (see Windows and CodePages in this issue) of the details and pitfalls of Windows codepage architecture. He and Joel Lee will be concentrating on outlining requirements for tools needed by WinRend users.
As is typical in the early phase of a project, the number of options for achieving WinRend goals (see WinRend Development Report in issue #7 of the NRSI Update), and thus the number of areas that need in-depth research, seems to be increasing. We are still asking more questions than we are answering! Foremost on this list are Rhapsody and Yellow Box from Apple (see “WinRend: Rhapsody and Yellow Box”, below), which hold the promise of a cross-platform development environment with sophisticated non-Roman facilities built in.
The team has made a preliminary ranking of the non-Roman script behaviors that we believe WinRend needs to support. Behaviors listed in Gaultney and Kew’s Writing System Behaviors paper were prioritized into Musts, High wants, and Low wants. We invite field comment on the outcome of this process, summarized below in “WinRend: Writing System Behaviors to be supported”.
Finally, research into Word 97 Unicode support has taken a higher priority, not so much for WinRend (though WinRend might benefit from additional research into the Unicode support provided in Office 97) but because of the problems users are having with SIL fonts such as Encore, IPA, Greek, and Hebrew. Look for a report in the next NRSI Update.
WinRend: Writing System Behaviors to be supported
by Bob Hallissy
On November 7, 1997, the WinRend team reviewed Writing Systems Behavior Listing (abbreviated and illustrated) (J. Victor Gaultney, Jonathan Kew) to determine what behaviors we believe WinRend (WR) will need to support; that paper is available on request. Here is the summary of that discussion.
It should be noted that when starting to think about the implementation of these behaviors, it is natural to think about the mechanisms used in that implementation. Listing makes no reference to mechanisms, and it may be that one behavior requires a number of mechanisms or, more importantly, that one mechanism added to a system will implement a number of behaviors. In certain cases, where a number of behaviors are implemented using one mechanism, that mechanism will be indicated so that later prioritization of work can be better informed.
Behaviors are marked with M, H, or L to denote our conclusion as to whether support for that behavior was a Must, High want, or Low want.
Directionality and Baselines
Listing separates Directionality from Baseline Type, and then includes vertical issues only with the Baseline discussion. We determined that vertical issues belong in both categories.
WR must support left-to-right, right-to-left, and mixed direction horizontal text. Vertical text, however, is seen as a high want, with the caveat that many applications will be unable to support it due to layout issues. However, for applications that require vertical text, WR would need to supply correct positioning information.
The correct alignment of various horizontal and vertical baselines is a typographic finesse that is required of a DTP application, but DTP is outside the scope of the WR mandate.
Reordering is core functionality, especially important in South Asian scripts.
The WR mandate is for SIL applications. We do not foresee (at least in the near future) SIL developing DTP applications, and justification facilities are primarily needed for publishing solutions where typographical finesse is a requirement. Therefore we give these facilities low priority.
Listing does not distinguish kerning from position adjustment. The former is a change of position of a glyph that is accompanied by a change in the print position (a.k.a. escapement or advance width). The Standard Kerning example (“WAVE”) is, in our view, true kerning, while the Cross-stream Kerning example (Roman stacked diacritics) is not true kerning, since the advance width doesn’t change.
Roman stacked diacritics are best thought of as a form of diacritic placement requiring the x,y adjustment of a glyph relative to another glyph without changing the advance width.
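The distinction between kerning and position adjustment can be made concrete with a small layout sketch. The `Glyph` record, the `layout` function, and all the numbers below are hypothetical illustrations, not taken from Listing or from any font; they only show that kerning moves the pen for everything that follows, while diacritic placement shifts one glyph and leaves the advance alone.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    advance: int   # advance width in font units (hypothetical values)
    dx: int = 0    # x offset applied when drawing; does not move the pen
    dy: int = 0    # y offset applied when drawing


def layout(glyphs):
    """Return the (name, x, y) draw positions and the final pen position."""
    pen_x = 0
    placed = []
    for g in glyphs:
        placed.append((g.name, pen_x + g.dx, g.dy))
        pen_x += g.advance
    return placed, pen_x


# True kerning: the advance of "W" is reduced, so "A" and the line end move left.
unkerned = [Glyph("W", 900), Glyph("A", 700)]
kerned = [Glyph("W", 900 - 120), Glyph("A", 700)]

# Diacritic placement: each mark is shifted in x,y; no advance width changes.
stacked = [
    Glyph("a", 500),
    Glyph("acute", 0, dx=-250, dy=150),
    Glyph("acute", 0, dx=-250, dy=400),
]
```

In this sketch the kerned line ends 120 units earlier than the unkerned one, whereas stacking a second diacritic leaves the line length exactly where the base letter put it.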
All the substitutions will be available if a features mechanism is implemented. This mechanism allows an application to identify what optional behaviors a font supports and to identify which behaviors are to be enabled for particular runs of text. A features mechanism is required if WR is to be able to achieve results that are similar to the NRSI Greek and Hebrew fonts. However, at this time we are not making this a high priority.
Word position contextual forms and ligatures are required core functionality. Line position and cursive connection (except as can be achieved by simple contextual substitution) are seen as typographical finesse.
The correct selection and positioning of diacritics is required core functionality. Diacritic stripping is viewed as a type of Substitution and would thus depend on the availability of the features mechanism.
WinRend: Rhapsody and Yellow Box
by Wes Cleveland
With the introduction of its Rhapsody operating system, and the Yellow Box library for Windows applications, Apple has promised to provide full GX Typography support (like that found in the Mac Operating System) for Rhapsody and Yellow Box applications. The WinRend development team sees this as a possible solution for non-Roman rendering in the Windows operating system. We currently have Rhapsody and Yellow Box installed, and are investigating the possibility of using Yellow Box to provide a large portion of the rendering work.
SIL Hebrew Update
by Joan Wardell
We are pleased to announce that the SIL Hebrew Font System for Windows, Version 1.0, is now available, following the completion of the KeyMan keyboard definitions. See the original Announcement in NRSI Update #6.
Further information, and the full Windows and Macintosh packages are available here.
OpenType Layout (formerly TrueType Open)
by Bob Hallissy
The OpenType Font Jamboree conference, jointly sponsored by Microsoft and Adobe, was held in Redmond, Washington, October 27-28, 1997. It was followed by a day of training on a new font hinting tool, Visual TrueType. In the early planning for the conference, Microsoft expected 40 or 50 attendees. As the conference approached, however, they had to scramble to make room for the 120+ interested people. Peter Martin and I were privileged to represent SIL interests at the conference.
What is OpenType? It is impossible to come up with a one-sentence answer to this question. Among other things, OpenType is
For more information about OpenType, visit http://www.microsoft.com/typography/faq/faq9.htm.
Who was present? Both the presenter and attendee lists included many big names in the type industry. In addition to Microsoft and Adobe, presentations were made by Monotype, Agfa, and Hewlett Packard. Attendees included type foundries (e.g., Bitstream), tool developers (e.g., Macromedia), and application developers. Peter and I had many opportunities to network with the industry’s movers and shakers.
What did we learn? As far as the non-Roman needs within SIL are concerned, OpenType is the same as its marketing predecessor, TrueType Open. Refer to NRSI Updates #1 and #2 for our detailed analysis and concerns about this technology. Microsoft has done nothing to alleviate these concerns, and this conference reinforced our belief that anything we might do to utilize OpenType technology for minority script solutions will be restricted to SIL applications only, even though OpenType itself will be utilized in future releases of the Windows operating system.
It was reassuring, however, to find out that we were not alone in our concerns: several font vendors expressed the same concerns to Microsoft during the conference. Just this month Microsoft initiated an e-mail discussion list to facilitate dialogue between font vendors, application developers, and Microsoft. The list is still just getting started, but we are hoping to use it to generate industry support for solutions to our concerns.
The cooperation between Microsoft and Adobe is interesting and surprising. They appear to be working hard to bury the hatchet in their font war, hopefully to the benefit of customers and application vendors. As evidence of their new-found good will toward each other, Microsoft is including a PostScript Type 1 rasterizer in Windows, and Adobe is creating and distributing fonts in TrueType format. A fundamental difference still exists, however, in that Adobe sells fonts to make a living, and Microsoft gives them away to make a living. At times, it seemed that the truce was tenuous at best.
Another crucial component of the OpenType strategy is the application support which Microsoft calls the OpenType Layout Services Library (OTLSL). This package hides the lower-level OpenType details from applications that wish to take advantage of OpenType fonts. We have had access to the specification for this library since mid-1996, though the library is not available yet, nor is Microsoft making any commitments about availability.
Font hinting. Probably the most fascinating presentation was a demonstration of Visual TrueType, Microsoft’s new font hinting tool. This excellent presentation gave us novices an introduction to the wild and woolly world of TrueType programming. I have a whole new level of respect for those who craft really good-looking TrueType fonts. For more information, see Visual TrueType in this issue.
Font security. Probably the most startling revelation came in the discussion of font security. In order to protect font vendors, OpenType supports the addition of digital signatures in OpenType fonts. The idea is that through an authentication trail it is possible to confirm that a given copy of a font was in fact authored by the type designer named in the font. This means it is possible for systems to refuse to accept fonts which cannot be certified to be authentic, thus protecting type designers from being ripped off. While the benefits of this seem obvious, the drawbacks are subtle: when authentication is fully implemented and enforced, any TrueType font that does not contain a valid signature will fail to render. This means that all fonts you use today will no longer work. As you might guess, this raised a bit of a ruckus in the conference.
Overall conference evaluation. OpenType is tantalizing. It has the potential to do what we need for non-Roman scripts, yet the concerns we have expressed in the past have not been alleviated. While there is some consolation in knowing we are not the only ones with concerns, it is disappointingly clear that Microsoft is going full speed ahead with its own plans. They said they wanted input and feedback from the font industry, but their consistently self-defensive responses to concerns and issues seem to belie their words.
The networking with industry movers and shakers was well worth it. And even though Microsoft does not yet appear to be responsive to our needs, who knows whether the relationships that were initiated or enhanced during the conference may not one day yield fruit.
For more information about the conference, including an outline of the presentations, visit http://www.microsoft.com/typography/jamboree/default.htm.
HebrewSDF: Pushing the Envelope
by Peter Constable
When I returned to work in October after an extended absence, it was long overdue that one of us on the NRSI team should get acquainted with and evaluate SDF. At the same time, there was interest being expressed in seeing a Hebrew SDF definition to work with the SIL Ezra package. That seemed to me like a good test case to work on.
Various people have successfully implemented SDF definitions for the writing systems they use, and some of their experiences have been published in previous Updates. So, how does Hebrew bring something new to the discussion? Hebrew requires a relatively large number of rules, significantly more than any other implementation that I have seen.
The challenge with Hebrew is that it has various long-distance dependencies. For example, the accent character has two glyphs that are positional variants. A rule is needed for each case when the non-default variant is required. The position of the accent glyphs is mainly contingent upon the consonant which it goes over. So, a rule is needed for each consonant that takes the non-default accent glyph. This, of itself, is something that is very commonly handled in SDF implementations. What is uncommon in the Hebrew case, however, is that there may be several characters between the consonant and the accent.
The full syntax for the character encoding of a syllable in the SIL Ezra Hebrew package is
consonant ( [ dagesh | rafe ] ) (vowel) (accent) (cantillation mark) ( [ asterisk | circellus ] )
where ( X ) means X is optional and [ A | B ] means that either A or B may occur (but not both). To select the correct accent glyph, a separate rule is required for each combination of consonant, dagesh, rafe and vowel that requires the non-default accent glyph. That turns out to be about 140 rules just to select the correct positional variant for the accent character!
Accent is not the only character for which glyph selection is dependent upon the consonant. Of the characters that can follow the consonant, dagesh, vowels, accent and cantillation marks, all have positional variants that are (at least) dependent upon the consonant. For each of these characters, separate rules are needed for each combination of the characters that precede it. As we move through the positions in the syllable, an increasingly large number of rules is required.
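The way the rule count grows from one position to the next can be sketched by enumerating the contexts that may precede each mark. The inventories below are placeholders with made-up sizes, not the SIL Ezra character set, and the sketch counts every possible context rather than only those that need a non-default glyph, so the totals are an upper bound for this toy inventory and are not meant to reproduce the figures quoted in this article.

```python
from itertools import product

# Placeholder inventories (hypothetical sizes, not the SIL Ezra character set).
SLOTS = [
    ("consonant", [f"C{i}" for i in range(1, 6)]),        # 5 consonants
    ("dagesh/rafe", [None, "dagesh", "rafe"]),            # optional, one of two
    ("vowel", [None] + [f"V{i}" for i in range(1, 10)]),  # optional, 9 vowels
    ("accent", [None, "accent"]),                         # optional
]


def contexts_before(slot_index):
    """All character sequences that can precede the mark in the given slot."""
    preceding = [choices for _, choices in SLOTS[:slot_index]]
    return [
        tuple(c for c in combo if c is not None)
        for combo in product(*preceding)
    ]


for index in range(1, len(SLOTS) + 1):
    target = SLOTS[index][0] if index < len(SLOTS) else "cantillation mark"
    print(f"{target}: {len(contexts_before(index))} contexts")
```

With these placeholder sizes the counts are 5, 15, 150 and 300: each optional slot multiplies the number of contexts that a rule set without class operations has to spell out one by one.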
I presently have a prototype SDF file that supports all combinations involving consonants, dagesh, rafe, vowels and accent. It does not include support for cantillation marks, nor does it handle a number of special cases (e.g. weak aleph, furtive patah). The current version of this SDF file has a total of 541 rules. I anticipate that adding support for cantillation marks would more than double the number of rules required. (A note on the size: the largest SDF file I had otherwise encountered has 247 rules.)
How does SDF manage with this many rules? Informal testing with Shoebox and LinguaLinks suggests that SDF can manage large numbers of rules fairly well. There are some caveats to be considered, however.
With the 16-bit variant of Shoebox 3.07 (a test version, I don’t think it has been widely distributed) and RENDER16.DLL dated Oct. 22, I was encountering system crashes after the SDF file grew beyond a certain size. After changing to the 32-bit variant of Shoebox 3.09 (a candidate for release) together with a newer version of RENDER32.DLL dated Nov. 8, I was able to use the largest of my SDF files without incident. (The largest is an unoptimized version containing 1672 rules.) I encountered some performance problems, however. (Note: I am using a 200 MHz Pentium Pro.)
Using an SDF file containing 150 rules, I experienced a noticeable delay in scrolling and editing. With 150 rules, performance is just tolerable. At 275 rules, performance was starting to become intolerable. As I increased the number of rules, performance continued to drop. After pointing this out to the Shoebox development team, they explained that, as an edit is made, Shoebox redraws from the beginning of the field, with all the text passed through SDF each time. My test file contained a page of text in a single field. This is not how Shoebox is typically used, however, so it has not been optimized for this situation. When my text was divided into multiple, smaller fields (paragraphs), performance did improve significantly. With 541 rules, performance is very tolerable. With 1672 rules, there was a delay of perhaps 0.5 seconds as each character was typed. This performance is not great, but probably could be put up with. Of course, it remains to be seen if any script would ever need that many rules.
To ensure good performance in Shoebox when using SDF, keep the contents of fields small.
With LinguaLinks, it appears that some optimization has been done in the screen-drawing routines, so performance was still a factor but not a major issue. Stability, however, is a problem in version 2.0. When using an SDF-transduced font in LinguaLinks, under certain circumstances LinguaLinks will report an “SDF encoding error”. In spite of the message, the problem appears to lie with LinguaLinks and not with SDF.
In summary, my experimentation with SDF indicates that the product is working quite well. I have made suggestions to Timm Erickson regarding functional improvements, and he has been quite responsive. (See “Some Recent Changes to SDF” in this issue.) As I write, Timm and the Shoebox team are each giving further consideration to the performance issues in their respective products.
Some Recent Changes to SDF
by Peter Constable
As mentioned in my other article, “HebrewSDF: Pushing the Envelope”, I have recently been doing some testing with SDF. After working with SDF and the SDF editor for a while, I wrote to Timm Erickson with some suggestions. As a result, Timm has made the following improvements:
where the characters to be declared for the given class are listed after the marker (in similar fashion to the abc field). For example, the following lines are used for Hebrew:
abc ‘bgdhwzxXyklmnsvpcqrHWSt�?&aAoieEOuüáóéø´
define_diacr �?&aAoieEOuüáóéø´
define_char4 ‘bgdhwzxXylsvqrHWSt
In the editor, the properties dialog now has additional text boxes where these declarations can be made.
To make use of this capability, you must have a copy of RENDER32.DLL dated 11/8/97 or later. I have not yet seen a 16-bit version that supports this change.
There are a couple of other changes I hope to see made in future versions of the editor:
It would also be nice to see support in the editor and rendering engine for operations on classes of characters.
In the meantime, I have been very pleased with the changes Timm has made, and with how quickly he responded to my suggestions. Thanks very much, Timm!
Microsoft Visual TrueType
by Peter Martin
At the joint Microsoft and Adobe OpenType Font Jamboree held in Redmond, Washington in October, we saw Microsoft’s first hinting tool for use outside of the company. Bob Hallissy and I were able to attend a one-day training event for Visual TrueType (VTT) 4.0, and experience at first hand the intimidating complexity of TrueType hinting.
TrueType fonts can include hinting information which improves the appearance of type rendered at low resolutions and small point sizes. This information is expressed in terms of instructions for the rendering virtual machine; the hints for a given character resemble chunks of assembler source code. Macromedia’s Fontographer will generate hints automatically and provides basic support for manual hinting, and, while this is better than no hinting at all, there is room for improvement. I noted the attendance of one of the Macromedia developers at this training session.
Visual TrueType is a sophisticated, cross-platform tool for embedding hinting information in a generated TrueType font. The hints can be created visually, or by editing either the TypeMan Talk hinting language or the lower-level TrueType assembly language. In its present form it works best as a post-production tool, for applying hints to a finished font. If the font is regenerated for any reason, the hints have to be recreated from scratch, or glyphs from the new font imported into the old. In principle, VTT should provide excellent hinting for non-Roman glyphs; the only Roman bias we saw appeared in the naming conventions for Control Value Table entries, and the lead developer will consider generalizing this mechanism.
This tool certainly improves on previous approaches to hinting which involved typing arcane assembler followed by testing, in an iterative loop. It still requires a great deal of understanding of the TrueType rendering engine, and of hinting strategies and type design techniques. It involves significant analysis of glyph paths and meticulous preparation for the hinting. In other words, the visual interface is appealing, but a huge amount of detailed manual work remains. The release of VTT is tantalizing because it simplifies much, but still requires a serious commitment of personnel in order to bring any real benefit.
Windows and Codepages
by Martin Hosken
This document examines how Windows 95 handles multi-lingual computing. It looks at Languages, Codepages, Locales, Unicode and Fonts with particular reference to their support in Windows 95.
An alternative title for this document might be: “How to add a new script to Windows 95 and fail”.
For those people requiring the availability of different scripts on their computers, a number of tools and approaches are available for Windows 3.1. How appropriate are such tools and approaches in Windows 95 which has better support for multilingual computing?
Here we introduce some of the basic concepts used in the rest of this discussion.
Unicode is a 16-bit character set. Its primary purpose is for data interchange, just like ASCII. Whilst it aims to support every language, as we shall see, care should be taken in assuming that something which supports Unicode will necessarily support your language.
A Language ID is a 16-bit number used to identify a particular language. Amongst other things, a particular language has one sort order associated. For this reason, the language ID is broken into two parts: a 10-bit primary language ID and a 6-bit sub-language ID. For example, US English = 0x0409, whilst UK English = 0x0809. Spanish (Traditional Sort) = 0x040a and Spanish (Modern Sort) = 0x0c0a.
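The split into primary and sub-language parts can be checked against the four example values with a few lines of bit arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes the usual Windows layout in which the primary language ID occupies the low 10 bits and the sub-language ID the high 6 bits.

```python
def split_langid(langid):
    """Split a 16-bit language ID into (primary, sub-language) parts."""
    primary = langid & 0x3FF       # low 10 bits (assumed layout)
    sub = (langid >> 10) & 0x3F    # high 6 bits
    return primary, sub


def make_langid(primary, sub):
    """Combine a primary and a sub-language ID into a 16-bit language ID."""
    return (sub << 10) | primary


for name, langid in [
    ("US English", 0x0409),
    ("UK English", 0x0809),
    ("Spanish (Traditional Sort)", 0x040A),
    ("Spanish (Modern Sort)", 0x0C0A),
]:
    primary, sub = split_langid(langid)
    print(f"{name}: primary=0x{primary:02x} sub={sub}")
```

Both English entries share primary ID 0x09 and both Spanish entries share 0x0a; only the sub-language part differs, which is where the choice of sort order is carried.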
Each language has a locale identified by the language ID. A locale specifies how to represent certain information, e.g. dates, monetary values, month names, in the language. It contains no information on how data is stored or sorted in any scripts of the language. In Windows 95 and Windows NT all locale information is stored in Unicode.
Each different encoding in a system needs to have information describing how to map to and from Unicode, to describe the semantics of each character (e.g. upper to lowercase mapping, identifying numbers) and to give default and language specific sorting information. Each encoding, or codepage, is given a 16-bit number identifying it.
The rest of this document is a discussion of how all this is implemented in Windows 95 and associated products. We start by examining how we might expect it all to work, and then look at the problems of this and resulting realities. Finally we take a quick peek into the fog trying to guess what the future might hold.
The Windows 95 Solution
Windows 95 has a single locale file, Windows\System\Locale.nls, which holds all the locale information for every language.
Windows 95 also holds one file for each codepage. The names of these files are usually of the form Windows\System\cp_nnnn.nls where nnnn is the codepage number, in decimal. The particular file for a codepage is referenced via the registry at key:
Within this key, each codepage number has an entry which references a file relative to Windows\System.
TrueType fonts are a technology within themselves. Each font consists of a number of tables holding various pieces of information pertaining to rendering. One of the tables (cmap) is used to map between the external codepoints and the internal glyphs. In Windows (all versions) this mapping is from a 16-bit value (assumed to be Unicode) to a glyph in the font.
Applications normally store data in an 8-bit form, requiring a mapping from the 8-bit form to the 16-bit form used by a font. This is where codepages come into play. They hold the 8-bit to 16-bit mapping information. In Windows 3.11 there is one, de facto, mapping. The precise nature of the mapping is dictated by the national version of Windows that you have. So, for example, US Windows supports codepage 1252; Thai Windows supports codepage 874; and so on. This also corresponds to the default codepage provided with a particular national version of Windows 95. Windows also supports a few other minor mappings: Symbol and OEM (corresponding to DOS), but again, these are fixed and not extensible. Windows 95, on the other hand, theoretically, can support any number of codepages. This is particularly useful when doing multilingual computing.
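The effect of choosing a codepage can be seen by decoding the same 8-bit value under several of them. The sketch below uses Python's built-in `cp1252`, `cp1253` and `cp874` codecs as stand-ins for the Windows codepage files; it is not a call into the Windows API, and the helper name `to_unicode` is made up for illustration.

```python
def to_unicode(data, codepage):
    """Decode 8-bit data using Python's codec for the given Windows codepage."""
    return data.decode(f"cp{codepage}")


raw = bytes([0xE1])  # one stored byte, with no record of its codepage

for codepage in (1252, 1253, 874):
    ch = to_unicode(raw, codepage)
    print(f"codepage {codepage}: 0x{raw[0]:02X} -> U+{ord(ch):04X}")
```

The same byte comes out as U+00E1 (a with acute) under codepage 1252, U+03B1 (Greek alpha) under 1253, and U+0E41 (a Thai vowel sign) under 874, which is why 8-bit data is only meaningful together with its codepage.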
Once you get into the realm of multiple codepages, a font needs to indicate which codepages it supports. This is done, to a limited extent, within a TrueType font. Details of how this works, and the limitations it imposes, are covered in the next section.
Windows 95 Implementation
Given the solution presented by Windows 95, adding a new orthography to Windows 95 would consist merely in producing a codepage for the encoding and any locale entries for the languages which use that codepage. These could then be inserted into the appropriate locations, and, hey presto! we can work with the new orthography in all our applications.
Unfortunately Windows 95 has various problems to overcome, and the solution to these problems results in a severe limiting of the openness of the system.
Let us return to the problem of a font indicating which codepages it supports. This is a necessary activity in order to provide scripting support by choice of font. In a program such as Word, each font is listed with the scripts it supports. Thus, if you install the multilingual extensions to Windows 95, you will have large versions of such fonts as Times New Roman. When you pull down a font selection list in, say, Word, you will see that Times New Roman can be selected in various forms, including Central European, Greek, etc. In order for Windows to give you this list, it is necessary for it to be able to interrogate the font in question to see which scripts (or codepages) it supports.
The information is provided by means of a 64-bit bitfield, stored in the OS/2 table of the TrueType font file, in which each codepage in question is allocated one of the bits. If the font supports that codepage, then the corresponding bit is set.
Toward our end of adding a new script to Windows, therefore, all we need do is allocate one of those bits to a codepage of our choice and everything is fine. The difficulty is in how to do this. Because there are so few bits, Windows may just as well hard-code the allocation of bits to codepage numbers, which is what it does. There is no way to add a new bit-to-codepage-number mapping to the system. Thus, if Windows does not know about your codepage at design time, it cannot be properly integrated.
The overall upshot of this is that such applications as Word and WordPad do not support codepages beyond a restricted set.
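The hard-coded nature of the bitfield can be sketched as a fixed lookup table. The bit assignments below are a subset taken from the published OS/2 table specification for TrueType/OpenType fonts (the `ulCodePageRange1` field), not from this article's appendix, and the function names are made up for illustration.

```python
# Subset of the fixed bit-to-codepage allocation (from the OS/2 table spec).
CODEPAGE_BITS = {
    0: 1252,   # Latin 1 (ANSI)
    1: 1250,   # Latin 2, Central European
    2: 1251,   # Cyrillic
    3: 1253,   # Greek
    4: 1254,   # Turkish
    5: 1255,   # Hebrew
    6: 1256,   # Arabic
    7: 1257,   # Baltic
    16: 874,   # Thai
}
SYMBOL_BIT = 31  # Symbol fonts have a bit but no codepage file


def codepages_supported(range1, range2=0):
    """List the known codepages flagged in the two 32-bit halves of the bitfield."""
    bits = (range2 << 32) | range1
    return [cp for bit, cp in sorted(CODEPAGE_BITS.items()) if (bits >> bit) & 1]


def is_symbol(range1):
    """True if the Symbol bit is set in the first 32-bit half."""
    return bool((range1 >> SYMBOL_BIT) & 1)


print(codepages_supported(0x00000001))  # an ANSI font
print(codepages_supported(0x0000000B))  # Latin 1, Latin 2 and Greek
print(codepages_supported(0, 0x100))    # bit 40 set: nothing known to report
```

A bit that is not in the table simply yields nothing, which mirrors the problem described above: a font can set a spare bit, but nothing on the system side can be told which codepage that bit is supposed to mean.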
Thankfully, this lack of a mapping is not insurmountable. At the API level (the level at which programs interact with Windows internally) any codepage is referenceable and usable, if care is taken. Programmers are referred to MultiByteToWideChar() and related function calls, which map from 8-bit data to Unicode.
Porting from Windows 3.1 to Windows 95
In order to support fonts indicating which codepages they support, the TrueType specification underwent a quiet change between Windows 3.1 and Windows 95. As a result it is possible that a font may work perfectly adequately in Windows 3.1 but not at all in Windows 95. This is because it has not got the codepage information in it. For much of the time Windows 95 guesses quite happily, but this should not be relied upon.
Another change that was added at the same time was that a font can indicate which Unicode ranges it supports. I am not sure what this is used for yet, but I have my suspicions. For a table of which bit means which codepage, see Appendix A: CodePage Bitfields, and for a table of Unicode ranges, see Appendix B: Unicode Bitfields.
If it is necessary to add any of this information to a font, there are a number of tools to help. Typecaster, from version 3, supports two commands at the start of a .cst file. codepage_range is followed by two 32-bit hex values separated by commas, and indicates the codepage bitfield to be included in the font. unicode_range is followed by four 32-bit hex values separated by commas, and indicates the Unicode ranges supported by the font. For example:
code 1 uni 3
This is the default, used for an ANSI font and indicates codepage 1252. Notice that missing values are assumed to be 0.
Fontographer version 4.1 and beyond allows the insertion of the necessary information.
An interim PERL v4 program exists called hackos2 which allows the manipulation of the OS/2 table in a TrueType font which contains the appropriate bitfields.
To mimic the behaviour of Windows 3.1, it is most likely that a user will want to make their font an ANSI font and indicate that it supports codepage 1252. Symbol fonts, whilst not having a codepage file, do have a codepage bit associated with them.
As another example of what is going on, we can look at the multilingual extensions supplied with Windows 95. To install them, go toin the control panel and click on the tab. From there, select and click . You will have to restart Windows to gain the full benefits.
Here is what installing these extensions does.
Overall this is probably a worthwhile thing to do if you are intending to work with any scripts beyond Western European.
Unicode: The Future
As far as Windows and NT are concerned, the future is Unicode. This means that underlying storage will be increasingly Unicode. For example, Word 97 uses Unicode to store its data as will WordPad, etc. and will use conversion techniques to generate 8-bit data when necessary.
Example: Word 97
One of the difficulties encountered with Word 97 sometimes occurs with a font change, when data unpredictably either disappears into little boxes or, when saving, converts to question marks. What is going on?
Word 97 keeps track of which codepage data is entered with. In the case of a Symbol font, there is no associated codepage, due to the vagaries of Unicode. Thus Word 97 converts the data directly into Unicode (and incidentally gives it the system codepage).
Then a user decides to change font to one with a different encoding. In the case of a supported codepage, Word will not allow the user to change the encoding. In the case of Symbol encoding, Word allows you to change the font to one which supports the system codepage. But that font need not support the Unicode values used by the Symbol encoding (U+F020-U+F0FF), and so those characters are converted into boxes.
There is a mechanism in later versions of Word 97 (Service Release 1) to allow conversion from fonts using alien (to the bitfield system) codepages into fonts with known codepages. But then there is a problem with typing since the 8-bit key-codes are converted using the known, converted, codepage, rather than the alien codepage. So we cannot fully support a new codepage that way.
A second problem arises when storing as 8-bit ASCII text. Word 97 converts the data to ASCII via the system codepage (see the ACP entry in the codepage section of the registry). This conversion, from one codepage to another via Unicode, makes a best approximation to an 8-bit form of the characters, resulting in, for example, the letter a being output rather than a hooked-a or, when there is no good approximation, a question mark. Since the system has no idea what Symbols are, they all get converted to question marks.
This, at least, is what we think Word 97 is up to. Its handling of codepages, and especially Symbol fonts, is consistent in that the same thing happens every time, but not necessarily logical when compared with behaviour in other parts of the program. (For example, try converting some text from Times New Roman to Symbol and back again.)
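Both failure modes in this example can be imitated in a few lines. This is a sketch of the behaviour as described here, not of Word's actual code: it assumes that a Symbol-encoded byte b is stored as U+F000 + b (consistent with the U+F020-U+F0FF range mentioned above), and it uses Python's `cp1252` codec as a stand-in for the system codepage. Python's codec does no best-fit approximation, so anything outside the codepage becomes a question mark.

```python
def symbol_to_unicode(data):
    """Map Symbol-encoded bytes into the U+F0xx range (assumed: U+F000 + byte)."""
    return "".join(chr(0xF000 + b) for b in data)


def save_as_8bit(text, codec="cp1252"):
    """Convert Unicode text to 8-bit data, substituting '?' where no mapping exists."""
    return text.encode(codec, errors="replace")


stored = symbol_to_unicode(b"abc")           # what a Symbol font run becomes
print([f"U+{ord(c):04X}" for c in stored])   # values no ordinary font covers
print(save_as_8bit(stored))                  # every Symbol character is lost
print(save_as_8bit("caf\u00e9"))             # ordinary text survives
```

Characters in the U+F0xx range have no entry in any ordinary codepage, so a font without glyphs there shows boxes, and a save through the system codepage has nothing to map them to.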
The future trend towards Unicode support has major implications for those wishing to work with scripts not specified in the version of Unicode that is implemented.
Firstly, there is more information held about characters than just how to render them. There is all sorts of semantic information to do with case, directionality, diacritics, etc. At the moment, this is stored in the codepage, thus allowing one codepage to effectively give a different semantic meaning to a Unicode character than another codepage. NT and probably Windows will tend towards a centralised semantic database for the whole of Unicode. As it is, this is achievable, through compression, in about 9K bytes.
The implication for multilingual users is that it will be increasingly difficult to reinterpret characters to our own ends. Our existing technique of saying that an A acute looks like a high tone diacritic in another font is not going to work so well.
Secondly, Unicode is a data transfer standard, as ASCII was, and rendering directly from Unicode is sometimes very difficult. Our fonts are going to have to become smarter, as will our rendering technology. Scripting issues will increasingly have to become a speciality rather than something that OWLs can necessarily deal with unaided.
Having said all this, as an organisation we are not in an unhealthy position and if we keep working at it, we can stay that way.
Appendix A: Codepage Bitfields
ANSI and OEM
Appendix B: Unicode Subset Bitfields
Circulation & Distribution Information
The purpose of this periodic e-mailing is to keep you in the picture about current NRSI research, development and application activities.