NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/NRSIUpdate10

NRSI Update #10 – June 1999

Peter Martin (ed.), 1999-06-01

In this issue:

Keyman Update

by Peter Constable

[May 1999] There have been recent developments in Keyman to report. The most significant of these is that a new version is available! Marc Durdin has released Keyman 4.0, which you can download from his web site, http://www.tavultesoft.com.

Over the past couple of years, we have been aware of needs for further improvements in keyboarding utilities. A year ago, Marc had begun work on a Windows 95 version of Keyman, but had to put a stop to this work in order to focus on his university studies. At that time, Marc decided to discontinue his commitments to Keyman, and he approached SIL with an offer to sell the rights to the software to us. Darrel Eppler of IPub began negotiations with Marc, a funding proposal was prepared, and the Language Software Board gave prioritisation to work on keyboarding enhancements. In anticipation of the purchase of Keyman being finalised, the Software Development Management Team allocated personnel for work on keyboarding enhancements (Darrell Zook), and work began on a requirements document.

However, Marc’s situation has now changed considerably and he has decided to retain ownership of Keyman and to continue development. This has led to the recent release of version 4.0, with plans for a version 4.1 update in the near future, and a Unicode-capable version after that.

The most significant changes in version 4.0 are that it now works on Windows NT 4.0 and makes use of new multilingual support features introduced in Windows 95. This version also makes use of compiled keyboards, so existing KBD files need to be converted to the new, compiled KMX files (not a difficult process) before they can be used.

Another change in version 4.0 is that it is now shareware, and requires a registration key. SIL has a corporate license, so the program is available to SIL personnel without charge. An announcement with registration details will be forthcoming in NOC. If you have questions regarding licensing issues, please contact us using the contact form.

Since the release of version 4.0, we have had some discussion with Marc regarding future enhancements. Marc is very interested in responding to our keyboarding needs, and we have begun discussing our requirements document with him. We anticipate on-going cooperation with Marc in further development of Keyman.

There is one aspect to the implementation of version 4.0 that is an area of concern for us. In making use of multilingual support features in Windows, Keyman makes use of LANGIDs, a language identification parameter used within Windows. The concern with LANGIDs is that there are only so many available: 512 are reserved for assignment by Microsoft (a number of which have already been allocated); another 512 are available for custom use, though Microsoft and commercial software will not recognise these custom IDs. Marc has had in mind that people developing Keyman keyboard layouts would register custom LANGIDs with him. We have not yet had a chance to consider the implications of this use of LANGIDs, and I think this needs to be given further consideration before we get very far along in developing a lot of keyboards with various custom LANGIDs assigned. Look for more information on this in the future.
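The arithmetic behind the "512 plus 512" figure can be sketched as follows. This assumes the LANGID layout documented in the Windows SDK headers (a 16-bit value with a 10-bit primary language ID and a 6-bit sublanguage ID, with primary IDs 0x200 to 0x3FF left for user definition); the function names are illustrative Python stand-ins, not part of Keyman or Windows.

```python
# Sketch of the LANGID bit layout as documented in the Windows SDK headers
# (an assumption here; the article itself does not spell out the layout).
PRIMARY_MASK = 0x3FF   # low 10 bits: primary language ID (1024 values)
SUB_SHIFT = 10         # high 6 bits: sublanguage ID

def make_langid(primary: int, sub: int) -> int:
    """Combine a primary language ID and a sublanguage ID into a LANGID."""
    return (sub << SUB_SHIFT) | primary

def primary_langid(langid: int) -> int:
    """Extract the primary language ID from a LANGID."""
    return langid & PRIMARY_MASK

def is_custom_langid(langid: int) -> bool:
    """True if the primary ID falls in the 512 values left for custom use."""
    return 0x200 <= primary_langid(langid) <= 0x3FF

us_english = make_langid(0x09, 0x01)   # the familiar 0x0409
custom = make_langid(0x200, 0x01)      # hypothetical custom keyboard language
```

Because the custom range is shared by everyone who defines their own IDs, two independently developed keyboards could easily pick the same value, which is why some form of registration would be needed.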

Type Design Apprenticeship Levels

by Victor Gaultney

[March 1999] One of the NRSI’s strategies for meeting non-Roman computing needs is to identify and train personnel in specific tasks that could be used in a field setting. Last fall we announced our intention to formalize our existing type designer training into a Type Design Apprenticeship program. We are now in the midst of our first such program with Karen Block (MSEAG) as our first apprentice.

Before the apprenticeship began we planned out a general set of achievement levels intended to reflect a possible grouping of skills and accomplishments in the training of a Type Designer. This gives prospective apprentices a summary of what they would learn and how much of a time commitment each level requires.

Novice

To become a candidate for an SIL Type Design Apprenticeship, the person must have an interest in type design, a good eye for type/calligraphy and the motivation to follow through with an apprenticeship plan, and must demonstrate a significant time commitment to this course of study. At that point the candidate is automatically given the rank of Novice.

Apprentice

The next achievement level—Apprentice—is awarded to the student who has demonstrated:

  • a thorough knowledge of basic type design tools (Fontographer, in particular)
  • a clear understanding of the artistic aspects of letter design
  • familiarity with the type design process
  • basic knowledge of font formats and technical issues

We also require that the student has made a significant contribution to a current type design project.

It would normally take 3-6 months of apprenticeship with a recognized Master level designer to reach this level (this currently can only be done in Dallas). Once it has been completed the apprentice could be expected to prepare font designs from various sources, test and refine the typeface, and deliver it on a single computer platform (either Macintosh or Windows) with some consulting assistance. They would not be able to work completely independently—particularly in the areas of cross-platform fonts and technical issues—but could make a major contribution to a type design project.

Journeyman

After completion of the Apprentice level and proven experience working in a variety of type design projects afterwards, the student may begin work on the Journeyman level. The emphasis for this level is on becoming more efficient, broadening the range of tools used, deepening the understanding of technical font topics (on both Macintosh and Windows) and adding a thorough background in script project planning and management. The time required for this study can vary and may typically not be a single focused time period. Some topics can be studied in field locations and/or by correspondence through a specific study program, but the student will have to spend a total of 3–6 months working alongside a Master (currently only in Dallas).

The end goal of the Journeyman level is enabling the student to work on type design projects completely independent of other type designers and technical personnel. They should be able to work anywhere in the world and have the experience, knowledge and ability to be successful. Most SIL type designers ought to eventually function at this level. At this level the apprentice can also begin to give consulting help to less experienced type designers, but would not be prepared to fully train apprentices until they have gained more experience.

Master

The Master level represents a very high level of achievement gained through years of experience at the Journeyman level. It represents someone who has learned the craft of type design well enough to manage multi-person projects, train others and contribute to the craft. The emphasis is on proven achievements rather than on specific knowledge, as a Master will likely have greatest depth in a unique part of the craft. As a result, there is no specific training program for Masters. There are some specific achievements, though, that a Master candidate needs to have completed. Some of these achievements might be:

  • successful management of multiple type design projects
  • original font designs of both roman and non-Roman fonts
  • proven depth in technical font issues (such as hinting)
  • published or presented papers on font topics
  • the training of an apprentice

Recognition of these achievements would be through peer review. Once competence has been demonstrated the Journeyman would be awarded the rank of Master.

Practical Application

Apprenticeships based on this model are not intended to be restrictive, but rather to provide encouraging milestones for the student. In practice an apprentice will not usually restrict their learning to one specific level, but choose from multiple levels to gain the knowledge needed for their assignment. If you know of someone who meets the initial candidate requirements for a Type Design Apprenticeship and could contribute to the work in this area please contact me.

Microsoft Office 2000 Beta 2 Preview

by Peter Constable

[January 1999] After waiting over 2 months, I finally received my copy of Microsoft Office 2000 Beta 2 preview. I’ve just spent the past week doing an evaluation of its multilingual capabilities. I thought you’d all be interested in what I found. Please bear in mind that this is an evaluation of beta software and that the product that is finally released may differ considerably.

Overview of Office 2000 Preview package

The preview package that MS sent out is the Premium Edition. This is a new member of the Office product line which includes:

  • Word 2000
  • Excel 2000
  • PowerPoint 2000
  • Access 2000
  • Outlook 2000
  • FrontPage 2000
  • Publisher 2000
  • PhotoDraw 2000

The preview came on a whopping seven CDs! Before you run out to buy new hard disks, I should add that the first six apps in the list above together with a pre-release version of Internet Explorer 5 all came on the first CD. Consider, too, that a beta version probably contains a lot of debug code, so the shipping version should be smaller. The remaining CDs contain the following:

  • CD2: Publisher 2000 and shared clip art
  • CD3: PhotoDraw 2000, MS Office Server Extensions, MS FrontPage Server Extensions
  • CD4: PhotoDraw 2000 content
  • CD5 and CD6: MS Office Language Pack - user interface and proofing tool support for German, Arabic, Japanese, Korean and Chinese (traditional & simplified)
  • CD7: MS Office 2000 Resource Kit and other documentation

I limited my testing to Word. Presumably this app would be the most advanced in its multilingual text support, except perhaps for FrontPage.

Some problems I encountered reminded me that this was a beta version. I wanted to install only the fewest components I’d need to do the testing I wanted to do, but it took me a while to figure out how to do that. The setup program on the first CD installed Word, but the items on CDs 5 and 6 had their own setup program. I struggled to get the help files working and finally figured out that I needed to install the pre-release version of IE 5. (I installed it with a certain amount of anxiety but was relieved to find that they provided an option to keep earlier versions of IE installed on the system.) Not all of the functionality I expected was available (more on that below), and CD 5 appeared to contain resources for Hebrew but the setup utility did not give the option of installing these. Once I got everything I wanted installed, however, all of the features I wanted to examine were functioning, and I encountered fewer problems than I had been led to expect after reading a review in the press.

Multilingual capabilities

It was clear to me that this version represents a very big step forward for Office in many respects, but particularly in terms of multilingual capabilities. In a nutshell, there will be a single version of Office for all world-wide markets that is able to support entry, display and proofing of text in multiple languages using several different scripts, and that is able to support user interfaces in several languages. These enhancements will be beneficial for SIL members in many situations. It still falls short of what is needed for most minority language projects, though, since the multilingual capabilities are limited to what is provided by Microsoft for a limited number of major languages and are not extensible, yet it is an important step in the right direction.

Some qualifications should be added to the general statement I just made about the extent of the multilingual support. First, in this version of Office, or at least for Word, there will actually be a single binary for the entire world except for Southeast Asia, and a separate binary for Southeast Asia. The reason for this is that support for Thai was not available in time for the planned development schedule and so MS opted to leave that for a separate version to be shipped a little later. Secondly, while their design now is based upon a single worldwide binary, MS will continue to market localized versions which will have some features specific to the given locale (e.g. selection of fonts provided). Thirdly, the number of languages and scripts that are supported is dependent upon the operating system on which the program is running. Let me elaborate on that.

In the documentation that is provided on CD7, a distinction is made between four language-related capabilities:

  • main document display
  • main document input
  • user interface display
  • user interface input

For each of these, a spreadsheet is provided showing what languages (from a list of 62) are supported for that capability when running on a particular language version of a given operating system (Win 95, Win 98, NT 4 or NT 5). For example, the documentation shows that, on English Win 9x or NT 4 systems, Word can display document text in 53 languages - all but Arabic, Farsi, Urdu, Hebrew, Thai, Vietnamese, Indic, Armenian and Georgian. The RTL languages can only be displayed when running on RTL language versions of these operating systems, and Thai and Vietnamese text can only be displayed when running on those versions of these operating systems. Indic, Armenian and Georgian text cannot be displayed when running on any version of these operating systems. On the other hand, text in all 62 languages can be displayed when Word 2000 is running on any language version of NT 5 (Windows 2000).

The documentation indicates that the distribution of main document input and user interface display is the same as I’ve just described for main document display. For user interface input, there is greater limitation for Far East languages and Far East language versions of Win 9x: user interface input of Far East languages is supported only on Far East versions of Win 9x, and Far East versions of Win 9x support user interface input of only the corresponding Far East language or English. Again, NT 5 has the greatest level of support: any language version of NT 5 supports user interface input in all languages except Indic, Armenian and Georgian.

Now, I’ve told you what the documentation in the package says. What I found was close to that, though better in some respects. First, the documentation suggests that you should not be able to work with Arabic at all on an English Win 95 system, but in fact Arabic text in a document displayed with proper contextualization. I don’t know enough about Arabic to be able to say if all behaviour for that script is correctly supported, though. I was also able to paste Arabic text into a text box, and it appeared correctly, as far as I could tell, contextual forms and all.

In the case of Hebrew, there really was no level of support on a Win 95 system, as the documentation suggests. There is no way to enter Hebrew into a document or into UI controls, and Hebrew characters do not contextualize as needed.

The documentation indicates that you should be able to enter and display Far East text in a document and input Far East text in the user interface, but not display Far East text in the user interface. I’m not actually sure what is meant by “user interface display”, but here is what I found: I was not able to get Far East user interfaces running on English Win 95. (The installer for CDs 5 and 6 was willing to install items for the simplified Chinese interface, so I was initially hopeful.) I was able to use an IME to enter Chinese, Japanese or Korean text into dialog text boxes, however, and the text appeared in the text boxes as desired.

Thai fonts were installed on the system, but these were part of Internet Explorer 5 and not Office 2000. While IE 5 can correctly display Thai text, Word does not, nor does it provide any way to enter Thai text. The same appears to be true for Vietnamese.

The documentation indicates that there is no support for “Indic”, Armenian and Georgian on Win 9x or NT 4, but that there is support of these languages on NT 5 except for UI display. There certainly was nothing provided for any of these languages and scripts in the package. Armenian and Georgian (as far as I know) do not require any complex rendering behaviour, however, and so it should be possible to display text in these languages on any system given an appropriate font.

Language/script-related files installed

There are a number of fonts distributed with the package, 61 in all not counting the 24 (mostly Thai and Hebrew) that are included with IE 5. The fonts that ship with Office include new versions of Arial and Times New Roman that contain 1296 glyphs each, CJK fonts that contain over 20,000 glyphs, and the Arial Unicode font, which contains over 40,000 glyphs and takes up around 23MB of disk space. (That font had no difficulty bringing our P75 test machine to its knees.)

A basic installation of Word, without any particular multilingual features, added 640 files to the system. These included about 60 KBD and NLS files, but nothing else that I noticed relating to language or script support. When I added some basic language support, more KBDs and TTFs were added, as well as a small handful of EXEs and DLLs:

Setlang.exe
Bidi32.dll
Intldate.dll
Metrtxtw.dll
Meuedit.dll
Meunicod.dll
Ucscribe.dll

Setlang.exe is used to determine what languages are available for text and what language is used for the user interface. Most of the DLLs appear to be related to Arabic rendering. Note that Ucscribe.dll is not the same as Uniscribe, the script-support engine being incorporated into NT5.

Adding the German and Arabic language packs added another 72 files; the CJK language packs added 158 files. Most of these appear to be language-specific proofing tools and help files. I did not notice any files that appear to be related to script support.

Finally, adding input method editors for CJK added around 35 more files. (I’m not sure of the exact number since I ended up also having to install IE5 at this point.) These included a few files in the System folder that appear to be adding general IME capability to Win95 (something must have been added to provide IME capability since the standard IME APIs are supposedly stubbed out in US versions of Win95):

Aimminst.exe
Dimm.dll
Indicdll.win
Internat.win

New language-related features

There are a number of menu and dialog box items that are new, at least outside of Far East versions. Here are some of the new menu items:

Format / Asian Layout / Phonetic Guide
Format / Asian Layout / Enclosed Characters
Format / Asian Layout / Horizontal in Vertical
Format / Asian Layout / Combine Characters
Format / Asian Layout / Two Lines in One
Format / Text Direction
Tools / Japanese Postcard Wizard
Tools / Chinese Envelope Wizard
Tools / Language / Japanese Consistency Checker
Tools / Language / Hangul Hanja Conversion
Tools / Language / Chinese Translation

The Format Paragraph dialog has a new tab labelled “Asian Typography”. Not being sufficiently familiar with Chinese, Japanese or Korean, I couldn’t really evaluate most of the Asian features. I did look at the Text Direction settings, but was told that my version of Windows couldn’t support vertical text.

The Format Font dialog has some new controls on the “font” tab: Earlier versions have one control each for font, font style and size. In Word 2000, there is a control for “Latin text font” and another for “Asian text font”, these two going together with font style and size controls; then there is a separate section on the tab called “Complex scripts” which has its own controls for font, font style and size. This means that it would be possible to define a style and specify font settings that may be different according to whether the text is Latin, Asian or “Complex”. The three categories appear to be determined by Unicode range. A little testing suggested that, in the current version, “Complex” means Arabic and Hebrew, “Asian” means CJK, and “Latin” means everything else.
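A rough sketch of this kind of range-based categorization is shown below. The specific ranges are illustrative assumptions (the Hebrew and Arabic blocks for “Complex”; kana, CJK unified ideographs and Hangul syllables for “Asian”), not the ranges Word actually uses, which were not documented in the preview.

```python
# Illustrative ranges only; Word's real tables are not known.
COMPLEX_RANGES = [(0x0590, 0x05FF),   # Hebrew block
                  (0x0600, 0x06FF)]   # Arabic block
ASIAN_RANGES = [(0x3040, 0x30FF),     # Hiragana and Katakana
                (0x4E00, 0x9FFF),     # CJK unified ideographs
                (0xAC00, 0xD7A3)]     # Hangul syllables

def font_category(ch: str) -> str:
    """Pick which of the three font settings would apply to a character."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in COMPLEX_RANGES):
        return "Complex"
    if any(lo <= cp <= hi for lo, hi in ASIAN_RANGES):
        return "Asian"
    return "Latin"   # "Latin" simply means everything else
```

Under this scheme a Greek or Cyrillic letter falls into the “Latin” category, which matches what the testing suggested.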

Another new feature is the ability to automatically detect language. When Word detects a change in language, it will tag the text as being in that language. (This has the effect of specifying which proofing tools to use.) Again, this is being determined according to Unicode range. As I was testing the behaviour of the font dialog settings just described, I was entering characters using my EnterUnicodeCharacters macro, and I could see changes in language being reflected in the status bar. This feature isn’t without bugs, however: when I entered either Arabic or Hebrew characters, it specified the language as “Urdu”, and both Chinese characters and Japanese Katakana resulted in a language tag of “Chinese (PRC)”. (Presumably, this will have improved by the release.) Also, we can’t place unreasonable expectations on this ability: I typed in some French, from an English keyboard, and it continued to think I had English text. If I use a French keyboard, though, then it tags the text appropriately.

Bidirectional behaviour

The Format Paragraph dialog also has controls to specify whether the primary line direction of the paragraph is LTR or RTL. (This appears to be unrelated to the Format / Text Direction dialog mentioned above, the latter relating to CJK only.) There are also buttons on the formatting toolbar for setting primary line direction.

Word uses Unicode ranges to determine what script behaviour is applicable. One example of this at work is that, when entering text in a LTR paragraph, if Arabic or Hebrew characters are entered (which I did using my EnterUnicodeCharacters macro), a RTL run is automatically inserted within the LTR text. When characters other than Arabic or Hebrew are entered, Word ends the RTL run of text and resumes entering text in LTR order after the RTL run.
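The run-splitting behaviour just described can be approximated with a few lines of code. This is a simplified sketch, not Word's algorithm: it uses the bidirectional class from Python's `unicodedata` module rather than raw Unicode ranges, and it treats every character that is not strongly right-to-left (including spaces and digits) as part of an LTR run.

```python
import unicodedata
from itertools import groupby

def direction(ch: str) -> str:
    """Classify a character as RTL (Hebrew 'R' or Arabic 'AL') or LTR."""
    return "RTL" if unicodedata.bidirectional(ch) in ("R", "AL") else "LTR"

def split_runs(text: str):
    """Split stored (logical-order) text into consecutive directional runs."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=direction)]

# Latin text with two Hebrew letters embedded in it.
runs = split_runs("abc\u05d0\u05d1def")
```

Each tuple in `runs` pairs a direction with the characters of that run in stored order; a renderer would then lay out the RTL runs right to left within the LTR paragraph.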

The arrow key and selection behaviours reflect the order of the stored data. So, for example, if using the right arrow key to move through a LTR paragraph (or to extend a selection), when an embedded RTL run is encountered, the caret (or end of selection) will go to the right end of the RTL run and proceed leftward to the left end of the RTL run, and then will resume in the matrix LTR run in a rightward direction.

At bidi boundaries, a split caret was not implemented. Insertion and delete behaviours can depend upon how the current caret position was arrived at. If you have just gone through a RTL run using the right arrow key and are at the direction boundary with the following LTR text (so the caret is at the right end of the RTL run), hitting the backspace key will remove the character at the left end of the RTL text. On the other hand, I used the mouse to drop the insertion point into that same location, and when I typed a Latin character, it was entered at the end of the matrix LTR run that preceded the RTL run.

Issues related to symbol fonts and Unicode

I have reported elsewhere on a variety of problems people encountered as they began using fonts with non-standard character sets on Word 97. The same issues still occur with Office 2000:

  • If text is formatted with a symbol-encoded font and the formatting is changed to a non-symbol-encoded font, the text may appear as empty boxes.
  • If a run of text is formatted with a symbol-encoded font, lines will wrap at any point in the run, not only at word boundaries.
  • If text is formatted with a symbol-encoded font and the file is saved to Word 6.0/95 or text formats, when the file is reopened, that text will have been changed to question marks.

It should not be expected that this type of behaviour will ever change in future versions of Word. The problem is the result of using symbol-encoded fonts for orthographic characters. The only solution to this problem will be to handle special characters and non-Roman text by using the Private Use Area of Unicode or standard Unicode allocated characters. The Word 97 macros that provide workarounds to these problems will still work in Word 2000, however. (If you need more information about these problems or the workarounds to them, contact me.)

Exporting text and encoding of data

With Word 97, when text that is formatted with a non-symbol font is exported using “save as text”, some translations will occur. Characters U+0100 and above must get translated into 8-bits some way, and it is done by mapping characters into their nearest ASCII equivalents, where this makes sense, or into “?” if there is no similar ASCII character. For characters in the range U+0020 to U+00FF, you would expect that these would be translated into the 8-bit numerical equivalents, but certain translations occur:

00A0 > x20  (non-breaking space > space)
00A9 > x28 x63 x29  (copyright symbol > “(c)”)
00AE > x28 x72 x29  (registered symbol > “(r)”)
00BC > x31 x2f x34  (fraction one quarter > “1/4”)
00BD > x31 x2f x32  (fraction one half > “1/2”)
00BE > x33 x2f x34  (fraction three quarters > “3/4”)

The preview version of Word 2000 behaved exactly the same in these regards.
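The translations listed above can be modelled with a small lookup table. The sketch below is only an approximation: `save_as_text` is a hypothetical helper, the target code page is assumed to be cp1252, and characters that cannot be encoded simply become “?” (the nearest-ASCII-equivalent mapping that Word applies to characters above U+00FF is not reproduced).

```python
# Replacements taken from the list above for "save as text".
SAVE_AS_TEXT = {
    0x00A0: " ",     # non-breaking space > space
    0x00A9: "(c)",   # copyright symbol
    0x00AE: "(r)",   # registered symbol
    0x00BC: "1/4",   # fraction one quarter
    0x00BD: "1/2",   # fraction one half
    0x00BE: "3/4",   # fraction three quarters
}

def save_as_text(text: str, codepage: str = "cp1252") -> bytes:
    """Apply the listed replacements, then encode; unencodable chars become '?'."""
    return text.translate(SAVE_AS_TEXT).encode(codepage, errors="replace")

exported = save_as_text("\u00a9 2\u00bd \u4e10")
```

Note that the replacements are lossy: once the copyright symbol has become “(c)”, nothing in the exported file distinguishes it from a literal “(c)” in the original text.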

One of the “save as” formats offered by Word 97 is “Unicode Text”. This has been replaced in Word 2000 by “Encoded Text”. When you choose this option, a second dialog will appear that gives you a number of encoding options. These options are shown here along with the corresponding codepages (when I was able to identify them):

Arabic (ASMO 708) CP 708
Arabic (DOS) CP 864
Arabic (ISO) CP 28596
Arabic (Windows) CP 1256
Baltic (ISO) CP 28594
Baltic (Windows) CP 1257
Central European (DOS) CP 852
Central European (ISO) CP 28592
Central European (Windows) CP 1250
Chinese Simplified (EUC) CP
Chinese Simplified (GB2312) CP 936
Chinese Simplified (HZ) CP
Chinese Traditional (Big5) CP 950
Cyrillic (DOS) CP 855
Cyrillic (ISO) CP 28595
Cyrillic (KOI8-R) CP 20866
Cyrillic (KOI8-U) CP
Cyrillic (Windows) CP 1251
Greek (ISO) CP 28597
Greek (Windows) CP 1253
Hebrew (DOS) CP 862
Hebrew (ISO-Logical) CP 28598
Hebrew (ISO-Visual) CP 28598
Hebrew (Windows) CP 1255
Japanese (EUC) CP
Japanese (JIS-Allow 1 byte Kana - 201) CP
Japanese (JIS-Allow 1 byte Kana - 202) CP
Japanese (Shift-JIS) CP 932
Korean CP
Korean (EUC) CP
Korean (ISO) CP
Korean (Johab) CP 1361
Latin 3 (ISO) CP 28593
Latin 9 (ISO) CP
OEM United States CP 437
Thai (Windows) CP 874
Turkish (ISO) CP 28599
Turkish (Windows) CP 1254
Unicode CP 1200
Unicode (Big-Endian) CP
Unicode (UTF-7) CP
Unicode (UTF-8) CP
Vietnamese (Windows) CP 1258
Western European (ISO) CP 28591
Western European (Windows) CP 1252

I did some tests exporting text to the following encodings:

  • Unicode
  • Unicode (UTF-8)
  • Western European (ISO)
  • Western European (Windows)

These tests were done using a document that contained every character in the range U+0020 to U+FFFF.
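A test document of this kind is easy to generate. The sketch below is one way to do it, not the method actually used for these tests; it skips the surrogate code points (U+D800 to U+DFFF), which cannot be encoded on their own, and that omission is an assumption on my part.

```python
def build_test_text(first: int = 0x20, last: int = 0xFFFF) -> str:
    """Return a string with every code point in the range, minus surrogates."""
    return "".join(
        chr(cp)
        for cp in range(first, last + 1)
        if not 0xD800 <= cp <= 0xDFFF   # lone surrogates cannot be encoded
    )

doc = build_test_text()
as_utf16 = doc.encode("utf-16-le")   # two bytes per character, no BOM
as_utf8 = doc.encode("utf-8")        # one to three bytes per character
```

Writing `as_utf16` or `as_utf8` to a file gives a fixture that can be opened in the application under test and compared character by character with the original string.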

The results with Western European (Windows) were identical to those for “save as text” described above.

On exporting to Unicode, 11 characters were translated into ASCII characters:

U+00A0 > ‘ ’  (x0020)
U+00AB > ‘"’  (x0022)
U+00BB > ‘"’  (x0022)
U+2005 > ‘ ’  (x0020)
U+2013 > ‘-’  (x002D)
U+2014 > ‘-’  (x002D)
U+2018 > ‘'’  (x0027)
U+2019 > ‘'’  (x0027)
U+201C > ‘"’  (x0022)
U+201D > ‘"’  (x0022)
U+201E > ‘"’  (x0022)

Only the first of these was among the translations that occurred with “save as text”.

When exporting to UTF-8, there were a total of 16 translations: the 11 that occurred when exporting to Unicode, plus the other five that occurred using “save as text”.

Finally, I tested saving the same document as RTF. The RTF file stored some characters in ways that did not make complete sense to me, but it appeared not to have made any translations. Whenever possible, a character was stored by setting a font (i.e. logical font, therefore specifying a charset, therefore specifying a codepage) and storing an 8-bit codepoint. For example, U+03A0 was stored as

\f183 \'d0

where \f183 points to TimesNewRoman Greek, and \'d0 refers to xD0 in the Greek codepage, which is capital pi. If the character was found in a DBCS codepage, then the Unicode value was given (in decimal representation) along with the pair of 8-bit codepoints. For example, U+4E10 was stored as

\uc2\u19984\'d8\'a4

When a character could not be found in any codepage, the Unicode value is stored along with the nearest translation into the default codepage (cp1252 in this case) or a question mark if there is no similar character in the default codepage. So, for example, U+0189 and U+018A were stored as

\u393\'d0\u394\'3f
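The pattern of a decimal Unicode value followed by fallback bytes can be sketched as follows. This is an illustration rather than Word's actual logic: `rtf_escape` is a hypothetical helper, it handles single BMP characters only, and it falls back to a question mark instead of attempting the nearest-character translation (so U+0189 would get \'3f here rather than \'d0).

```python
def rtf_escape(ch: str, codepage: str = "cp1252") -> str:
    """Hypothetical helper: RTF Unicode escape plus fallback bytes for one char."""
    # Bytes a non-Unicode reader would see; '?' if the code page lacks the char.
    fallback = ch.encode(codepage, errors="replace")
    code = ord(ch)
    if code > 0x7FFF:
        code -= 0x10000   # RTF writes the value as a signed 16-bit decimal
    hex_bytes = "".join("\\'%02x" % b for b in fallback)
    # \uc2 announces that two fallback bytes follow instead of the default one.
    prefix = "\\uc2" if len(fallback) == 2 else ""
    return prefix + "\\u" + str(code) + hex_bytes

single = rtf_escape("\u018a")             # no cp1252 equivalent
double = rtf_escape("\u4e10", "gb2312")   # found in a DBCS code page
```

The fallback bytes let an older reader that ignores the Unicode escape still show something, which is why a question mark appears when no similar character exists.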

All of this is easy to follow. What I could not make sense of were characters (not CJK) for which the Unicode value was stored along with a pair of bytes. For example, U+01D0 (LATIN SMALL LETTER I WITH CARON) was stored as

\uc2\u464\'a8\'ab

There isn’t a codepage I know of in which xA8 is anything related to an i with caron, and I’m not aware of any DBCS system for anything other than CJK. (If anybody can make sense of this, I’d be interested in knowing what’s going on.)
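One way to chase down a byte pair like this is to try it against a list of candidate code pages. The sketch below does that with Python's standard codecs; the candidate list is an arbitrary selection for illustration. Such a check points at the simplified Chinese code page, since GB2312 includes a row of accented pinyin letters, and the same code page also accounts for the pair stored with U+4E10 above. It does not explain why Word would pick that code page for a Latin letter.

```python
# Arbitrary selection of code pages to try; extend as needed.
CANDIDATES = ["cp1252", "cp1253", "shift_jis", "gb2312", "big5", "euc_kr"]

def codecs_matching(ch: str, raw: bytes) -> list:
    """Return the candidate codecs that encode `ch` as exactly `raw`."""
    hits = []
    for name in CANDIDATES:
        try:
            if ch.encode(name) == raw:
                hits.append(name)
        except UnicodeEncodeError:
            pass   # this code page does not contain the character
    return hits

i_caron_hits = codecs_matching("\u01d0", b"\xa8\xab")
cjk_hits = codecs_matching("\u4e10", b"\xd8\xa4")
```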

Importing text and encoding of data

I ran similar tests importing from various formats and encodings. I began with files containing every 16-bit character from U+0020 to U+FFFF or every 8-bit character from x20 to xFF, as applicable.

Opening a Unicode-encoded file, there were no translations; everything came across as expected. Opening the UTF-8 file was a little problematic in that it did not recognize it as a UTF-8 file at first, and it did not present the dialog to select the encoding. Adding the UTF-8 form of the byte order mark (U+FEFF, UTF-8 = xEF xBB xBF) at the beginning of the file didn’t help, but adding the bytes xFF xFE caused the dialog that gives encoding options to appear, at which point I simply selected UTF-8. (The first two bytes were translated into a space.) Once I did this, the entire file was interpreted as expected.
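For comparison, this is roughly what signature-based detection looks like when the UTF-8 byte order mark is honoured. It is a minimal sketch using the constants in Python's `codecs` module; `sniff_bom` is a hypothetical helper and UTF-32 signatures are deliberately not considered.

```python
import codecs

def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte order mark, or return None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None   # no signature: the caller has to ask the user or guess

with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
guess = sniff_bom(with_bom)
text = with_bom.decode(guess)   # "utf-8-sig" strips the signature on decode
```

A file with no signature returns `None`, which corresponds to the case where the application should fall back to prompting for an encoding.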

Opening the 8-bit text file, I was surprised to find several serious problems. Even some ASCII characters were handled incorrectly. Presumably these bugs will be fixed before the release version.

As mentioned above, the RTF file I had saved appeared to keep every character intact, so I tried opening this file. I was surprised to find 55 errors, all but two in the range U+00A0 to U+00FF. Most of these came across as Arabic! Again, this is a bug which, hopefully, will be fixed before the release. (I will attempt to notify people at Microsoft concerning these problems.)

Unicode characters in WM_CHAR messages

One need we have is to be able to use a program like KeyMan to enter Unicode characters. The way this would work would be to use Windows’ messaging system to send WM_CHAR messages containing Unicode characters. On Win9x systems, however, this is not supposed to be possible: the system itself will only send 8-bit codepoints, and apps should only expect 8-bit codepoints. Tests with Word 97 revealed that, if a Unicode character was sent in a WM_CHAR message, Word would only see the lower byte. Martin Hosken found that FrontPage Express on Win95 would see the Unicode characters, however. Because of Martin’s results, I thought I would check what would happen with Word 2000. Not surprisingly, it only recognised the lower byte, like Word 97. This means that if we really want to use KeyMan to enter Unicode characters into Word on Win9x systems, we’ll need to find a side door that we can break open. On NT systems, however, we shouldn’t have any problems entering Unicode characters from a Unicode-enabled version of KeyMan.
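The effect of an application seeing only the lower byte is easy to illustrate. The snippet below is just arithmetic on the character code, not a reproduction of the Windows messaging API; `seen_by_8bit_app` is a hypothetical name.

```python
def seen_by_8bit_app(ch: str) -> int:
    """Return the value left when only the low byte of a 16-bit code survives."""
    return ord(ch) & 0xFF

sent = "\u03a0"                      # GREEK CAPITAL LETTER PI
received = seen_by_8bit_app(sent)    # only xA0 of x03A0 gets through
```

The application then interprets that byte in its current 8-bit code page, so the character that appears is unrelated to the one that was sent.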

Some new features in VBA 6.0

One way that might have worked to get Unicode characters into Word would be to write macros that respond to something like an application.keypress event, but no such event existed in Word 97. I looked for one in Word 2000, but there still isn’t anything like it.

There were a couple of new features in VBA 6.0 I was very glad to see, though they aren’t specifically related to languages or scripts: VBA now supports callback functions (useful with Windows APIs) and modeless dialogs (I’ve already made the change to my EnterUnicodeCharacters macro).

Conclusion

There is more that could be said about multilingual features in Word 2000. Not knowing enough about the languages and scripts involved, I wasn’t able to test proofing tools and many of the features, such as those related to Asian typography. There are still some serious bugs in this beta code, and there is plenty of room for improvement. All in all, though, I thought that this version of Word provided some important and valuable advances in the area of multilingual support. We still need to find ways to lobby Microsoft to make language- and script-related functionality open and extensible so that we can get Office apps to work for the minority languages and the scripts that we need to work with. They have made a step in the right direction, though, and some of us will now be able to do things that we need to do which previously weren’t possible.

While work on Office 2000 continues, Microsoft has already begun work on the specifications for the next version. A contact at Microsoft working on the new spec has indicated that the next version will probably make use of the Uniscribe rendering engine being developed as part of NT5. It will likely add several South Asian scripts to the list of scripts that are supported. As they work on this, we still have some opportunity to provide input regarding what is needed for minority languages and scripts. Due to marketing issues, we may never see all the multilingual capabilities we might dream of seeing in Office, but we can still pursue that goal in hope. In the meantime, we can’t complain that Microsoft has been standing still.

Addendum

Here are some observations made after the above article was submitted:

1) New encoding translations:

  • Whereas entering alt+0157 (x9D) used to give character U+009D, it now gives U+200C (ZWNJ).
  • Whereas entering alt+0158 (x9E) used to give character U+009E, it now gives U+200D (ZWJ).

2) I should have mentioned the following translations that occur when saving as text since these characters are accessible via cp1252:

201E  (x84 in cp1252)  > x22  (dbl low-9 quotation mk > straight quotation mk)
2026  (x85 in cp1252)  > x2E x2E x2E  (ellipsis > “...”)
2018  (x91 in cp1252)  > x27  (left single curly quote > single straight quote)
2019  (x92 in cp1252)  > x27  (right single curly quote > single straight quote)
201C  (x93 in cp1252)  > x22  (left dbl curly quote > dbl straight quote)
201D  (x94 in cp1252)  > x22  (right dbl curly quote > dbl straight quote)
2013  (x96 in cp1252)  > x2D  (en dash > hyphen)
2014  (x97 in cp1252)  > x2D  (em dash > hyphen)
2122  (x99 in cp1252)  > x28 x74 x6D x29  (trademark symbol > “(tm)”)
200C  (not in cp1252, but obtained by entering d157 = x9D from keypad)  > x3F  (ZWNJ > “?”)
200D  (not in cp1252, but obtained by entering d158 = x9E from keypad)  > x3F  (ZWJ > “?”)
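The translations listed above can be expressed as a simple lookup table. The following Python sketch reproduces only the mappings in this list; the names `SAVE_AS_TEXT` and `flatten` are hypothetical, and the code does not claim to model anything else Word does when saving as text.

```python
# Mappings observed when saving as text, taken from the list above.
SAVE_AS_TEXT = {
    "\u201E": '"',     # double low-9 quotation mark > x22
    "\u2026": "...",   # ellipsis > x2E x2E x2E
    "\u2018": "'",     # left single curly quote > x27
    "\u2019": "'",     # right single curly quote > x27
    "\u201C": '"',     # left double curly quote > x22
    "\u201D": '"',     # right double curly quote > x22
    "\u2013": "-",     # en dash > x2D
    "\u2014": "-",     # em dash > x2D
    "\u2122": "(tm)",  # trademark symbol > x28 x74 x6D x29
    "\u200C": "?",     # ZWNJ > x3F
    "\u200D": "?",     # ZWJ > x3F
}

_TABLE = str.maketrans(SAVE_AS_TEXT)


def flatten(text: str) -> str:
    """Apply the listed translations; all other characters pass through."""
    return text.translate(_TABLE)


sample = flatten("\u201CHi\u201D \u2014 ok\u2026\u2122")
```

Because several distinct characters collapse onto the same output, the translation cannot be reversed: a straight quote in the saved file may have started out as any of three different characters.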

3) I mentioned new menu / dialog items related to multilingual support. I forgot to mention that the Tools / Options dialog has several new tabs:

  • Japanese Find
  • Hangul Hanja Conversion
  • Right-to-Left
  • Asian Typography

Some previously existing tabs also have some new controls. For example, the Edit tab has an option button for auto-keyboard switching (if you drop the caret into existing text, it will detect the language and pick the appropriate font, and can also change the keyboard if this option is set).

4) With regard to bidi behaviour, I mentioned that arrow keys move the caret in logical order. In the RTL tab of the Tools / Options dialog, there are controls that allow you to specify the arrow key behaviour. Arrow keys can move the caret either in logical order or in visual order. When extending the selection of text, however, the arrow keys always work in logical order.

14th Unicode Conference

by Peter Constable

The 14th semi-annual Unicode conference was held in Boston in March. As is the usual pattern, the conference was two days long, and was preceded by a day and a half of pre-conference tutorials. Through most of the conference and tutorials, sessions were held in three parallel tracks. As a result, it was impossible to attend most of them. It would take too long to mention even all those that I attended, so I will mention just a few highlights. The complete program can be found at  http://www.unicode.org/unicode/iuc14/program.html.

The pre-conference tutorials were on a variety of topics. While they were aimed pretty much at the beginner, a few were, nonetheless, quite interesting. The most notable of these for me was a fascinating presentation by Thomas Milo of DecoType on the intricacies of Arabic script. I also got to the latter half of a panel discussion on font-related issues. (This wasn’t a tutorial, but that was the only time available for it in the schedule.) The panel included representatives from Monotype, Adobe, Microsoft, Apple, Macromedia (makers of Fontographer) and others. I missed the beginning of this while I was in another session, but I had the chance in the second half of this discussion to raise several concerns regarding font-related issues. I found this quite useful.

There were papers presented on a number of implementation issues: rendering engines, input methods, efficient storage of character property data, ways of doing boundary analysis (word breaks, line breaks, sentence breaks), dealing with problems of string comparison. Many papers focused on technologies from certain vendors and developers: Microsoft (Win32), Apple (ATSUI), Sun and IBM (Java), New Mexico State University’s Computer Research Lab, and others.

The papers I found to be most interesting were those describing efforts to develop technologies for input and rendering. These were of particular interest to me since these are technologies that we are currently working on ourselves. Some were taking approaches that seemed familiar. For example, one of the two rendering projects at Sun is pursuing a general-purpose rendering engine to which script-specific modules can be added. This is not unlike our design for WinRend. In every case, however, people were doing things somewhat differently from the approaches we are taking. So, for example, in the case of this same rendering project at Sun, no high-level language has been proposed for specifying the behaviour of an individual script which is to be added to the system.

I found it particularly interesting to see the number of people working on addressing issues that we are working on ourselves. Beyond those companies that gave presentations, I could tell from certain people’s questions that they were working on creating their own implementations. At the same time, I also found it extremely interesting to see how our implementations go beyond the work of others. For one thing, nobody else (apart from Apple) is trying to make all of their software work with as wide a variety of scripts as we are. Nobody else (apart from Apple) is working on general purpose engines to which script-specific modules can be added for both input and rendering. Nobody else is working on providing a way to compile high-level descriptions of rendering or keyboard behaviour into an installable module. There are clearly ways in which we are ahead of the game.

It was also interesting for me to see the extent to which Chinese, Japanese and Korean scripts are of particular concern to so many. In this regard, there was a very interesting paper presented Thursday afternoon regarding problems involved in converting between Traditional and Simplified Chinese. The issues involved were both fascinating and challenging.

In addition to the various presentations, the conference gave me a good opportunity to meet and build relationships with many people from various companies and organisations, including Apple, Microsoft, Monotype, Sun Microsystems, and others.

