NRSI: Computers & Writing Systems
NRSI Update #12 – April 2000
In this issue:
WinRend has a New Name: Graphite
by Sharon Correll
Graphite is the new name of what was previously known as the “WinRend” project. The name reflects the orientation of the package to drawing or rendering, and the fact that we hope it will “lubricate” the process of working with non-Roman scripts!
The project has made good progress over recent months. The compiler for the “Graphite Description Language,” the rule-based programming language for describing writing system behavior, is almost complete. In the rendering engine itself, most of the sophisticated features for performing complex rendering are fully operational.
In March, I presented a paper on Graphite at the Sixteenth International Unicode Conference in Amsterdam. The presentation included a demo showing the current functionality of the system and was attended by representatives of several leading companies in the industry. We received some very positive feedback. A number of individuals were impressed with the capabilities of the system and said that we are moving in the right direction to meet a significant technological need. There are also possibilities of collaboration with related development efforts going on in other organizations.
A field test period is planned for the summer of 2000. Members of the testing team are expected to be knowledgeable in a complex writing system that might usefully be implemented in Graphite. Responsibilities include: learning the Graphite Description Language, using it to implement an interesting writing system, testing and debugging the program, and providing feedback on the language, compiler, behavior, and performance. A commitment of about one person-month is needed over the period of July - September. If you would like to be involved in this test effort, please contact me at: . More information can be found at this web site.
SIL Releases Three New Asian Font Packages
by Victor Gaultney
SIL International is very pleased to announce the availability of three new font packages for scripts of Asia: SIL Dai Banna, SIL Tai Dam and SIL Yi.
As a service to the general academic community, we are happy to make these packages available at no charge. You may (and are encouraged to) share these packages with your friends and co-workers, as long as you abide by certain distribution conditions detailed on the web pages for each package.
SIL Dai Banna
The SIL Dai Banna Fonts are a new rendering of the New Tai Lue (Xishuangbanna Dai) script. Two font families, differing only in weight, allow for a wide range of uses. The fonts are available for both Macintosh and Windows systems and include keyboard definitions for SIL Key and KeyMan respectively.
The New Tai Lue script is used by approximately 700,000 people who speak the Tai Lue language in Yunnan, China. It is a simplification of the original Lanna script as used for the Tai Lue language for hundreds of years.
For additional information or to download the SIL Dai Banna Fonts package, click here.
We particularly thank the staff of the Research Center for the Minority Languages of China for their assistance in developing this font package. The Research Center for the Minority Languages is part of the Chinese Academy of Social Sciences, Institute for Nationalities Studies located in Beijing, China.
SIL Tai Dam
The SIL Tai Dam Fonts are regular and bold versions of the traditional Tai Dam script and are closely based on handwritten letters. The fonts are available for both Macintosh and Windows systems and include keyboard definitions for SIL Key and KeyMan respectively.
Over half a million Tai Dam people (also known as Black Tai or Tai Noir) live in northwestern Vietnam and northern Laos. Their language is a member of the Tai-Kadai language family and is closely related to Laotian and Standard Thai. The Tai Dam have a long tradition of literacy in the script rendered by the TaiHeritage font family. In more recent years, other orthographies, including romanizations, have come into use as well.
For additional information or to download the SIL Tai Dam Fonts package, click here.
Special thanks are due to Mr. Faah Baccam, whose drawings have served as the basis for the development of these fonts. Any shortcomings in the final digital designs, however, remain our responsibility. Suggestions for design improvements in any future edition, especially from Tai Dam users of the fonts, are welcome.
SIL Yi
The SIL Yi Font is a single Unicode font for the standardized Yi script used by a large ethnic group in southwestern China. It can be used in certain Windows applications that support Unicode.
The traditional Yi scripts have been in use for centuries, and have a tremendous number of local variants. The script was standardized in the 1970s by the Chinese government. In the process of standardization, 820 symbols from the traditional scripts of the Liangshan region were chosen to form a syllabary. The syllable inventory of a speech variety from Xide County, Sichuan was used as the phonological basis for standardization. For the most part there is one symbol per phonologically-distinct syllable and vice-versa. The direction of writing and reading was standardized as left-to-right. Punctuation symbols were borrowed from Chinese, and a diacritic was incorporated into the system to mark one of the tones.
This font includes a complete set of Yi syllables and radicals (as defined in The Unicode Standard, Version 3.0), a basic set of Roman glyphs and various punctuation.
For additional information or to download the SIL Yi Font package, click here.
eXtensible Markup Language—More Than Just Publishing
by Dennis Drescher
My work in the NRSI this past year has revolved around the study of XML and how we might implement it in future software products. To say that I have learned a lot is an understatement. It has totally changed the way I look at data and formatting. To sum up what I’ve learned so far, I have written an article with the above title. It is about six pages long, so to avoid information overload in our NRSI Update we are publishing only the abstract for the article. Here it is:
On February 10, 1998 the World Wide Web Consortium (W3C) released version 1.0 of the eXtensible Markup Language. In the abstract of the specification document it says this:
“The Extensible Markup Language (XML) is a subset of SGML... Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.”
That all sounds nice but what does this mean for us? In this article we will explore the potential of XML and how it might affect our work.
Highlights from XML ’99
by Dennis Drescher
Last December I had the opportunity to attend the Markup Technologies ’99 and XML ’99 Expo held in Philadelphia. At the end of the week-long event I took some time to record what I had experienced. Following is the report I submitted:
It would be impossible for me to describe all I’ve seen here in Philadelphia this week at XML ’99. In many ways this has been more of a philosophical awakening for me than a technical one. There are many ways that we can apply this technology to what we do, and I am not just talking about publishing.
As the week went on I found myself drawn to several areas of XML-based applications. There are more but these were the ones that I could see that would solve some of our immediate needs in various facets of our work. They were: editing, DTP, text/data conversion, page rendering and live web applications.
The ideas that have been sparked in my mind are almost too numerous to describe here. However, I would like to just briefly describe some of the things that I’ve seen as highlights of my time here. Here they are:
Being here and experiencing this has given me a much broader base to work from concerning XML. It has been a good addition to the foundation of research I have been doing the past six months on XML.
I am of the opinion now that most of the data problems we have today in our organization can be solved by looking at them with an XML mind set. Our biggest problem will be defining the problems and until we can do that XML will do us no good. However, if a data flow problem can be described, an XML-based solution is possible.
That said, I would like to summarize with a statement. If you get nothing else out of this, please read and carefully consider what I’m saying. If you are a developer, IT manager or anyone who is actively seeking solutions to data flow problems in our organization, this statement is aimed at you. If you are an end user, don’t worry, XML will come to you. Here’s the statement:
“If you are not going to use, or even seriously consider using, an XML-based solution to your data flow problems or needs, your data will become very lonely very soon.”
Six months ago no one would have been able to substantiate that statement. Today no one needs to. If this statement bothers you, please contact me—I’d be glad to talk to you about it.
Shoebox, XML & Perl
A Resource review by Dennis Drescher
Converting Shoebox Data to XML
As we are all aware, new technologies do not just show up on the average user’s computer: early adopters and power users sift out the bugs first, so that those of us at the bottom of the “software food chain” don’t have to. It’s kind of a trickle down effect.
It will be a while before XML is the norm for the average user. Right now it is definitely a power user’s toy. One of our power users, Martin Hosken, has been researching practical solutions for converting legacy data to XML so we can begin to take advantage of the power, flexibility and longevity of XML. (Martin has written a 50 page document entitled “Shoebox, XML & Perl,” and this is a review of that document.) One area in particular is that of converting Shoebox Data into XML format.
Once in an XML markup system, data can flow freely to the Web or print without having to touch the source. Imagine being able to take a Shoebox database and convert it into HTML for a Web presentation without having to spend a lot of time creating CC (Consistent Changes) tables and batch files for a one-of-a-kind job.
Though Martin’s paper does not really cover the XML to HTML conversion process, it does cover the much more difficult process of converting Shoebox data to XML. Here is the abstract from his paper:
This paper provides some sample Perl code for converting Standard Format Marked text into XML. Four programs are introduced and explained and there are four appendices giving similar explanations of the modules which the programs use. The four programs cover: simple conversion, encoding correction, template layout and auto DTD creation. Each program provides a solution to one of the different types of problems of data conversion from SFM to XML.
The paper is targeted towards those at ease with other programming languages, but who may not be highly conversant with Perl. As such the paper provides a good introduction to software development with Perl and would act as a useful follow-on to one of the many Perl courses readily available in book form.
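To give a feel for what the "simple conversion" step involves, here is a minimal sketch. It is written in Python rather than Perl and is not taken from Martin's paper. The function name `sfm_to_xml`, the `database` and `record` element names, and the choice of `\lx` as the record marker are all assumptions made for illustration. It also assumes every marker is already a legal XML element name, which real Shoebox data does not guarantee.

```python
import xml.etree.ElementTree as ET


def sfm_to_xml(text, record_marker="lx"):
    """Turn Standard Format Marked text into a simple XML tree.

    Each line starting with a backslash becomes an element named after
    its marker; a new <record> begins at every record marker.
    """
    root = ET.Element("database")
    record = None
    field = None
    for line in text.splitlines():
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            if not parts:          # a lone backslash: nothing to convert
                continue
            marker = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            if marker == record_marker or record is None:
                record = ET.SubElement(root, "record")
            field = ET.SubElement(record, marker)
            field.text = value
        elif line.strip() and field is not None:
            # An unmarked line continues the previous field.
            field.text = (field.text + " " + line.strip()).strip()
    return root


# Hypothetical two-entry lexicon, for illustration only.
sample = "\\lx kitab\n\\ps n\n\\ge book\n\n\\lx rumah\n\\ge house\n"
print(ET.tostring(sfm_to_xml(sample), encoding="unicode"))
```

Real data needs more care than this (encoding correction, nested structure, a DTD), and those are the problems the other three programs in the paper address.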
If you are a power user, work with Shoebox databases, and are interested in XML, you will definitely want to read this paper. In addition to the paper, Martin has also included some practical code so you will have something to start with.
The easiest way to obtain a copy of this material is to contact me. This is a must-have for your XML tool box. Please take advantage of Martin’s good work.
Editor's note: When Shoebox 5.0 is released it will have the option of outputting to XML.
IPA: The Future
by Martin Hosken
For years now we have been relatively successfully using an 8-bit-based IPA font which has met many of our IPA needs. With the advent of Unicode and its supporting technologies, we need to develop a Unicode-based IPA font. The purposes of such a font, in the long run, will be:
In order to create a useful font we need to work on both areas. The second area is primarily technological, and while not being at all easy, is the job of the script engineer. The first area is where help is needed from members and entities throughout the organisation.
What do we want our IPA font to be able to do?
Perhaps one way of addressing this question is to look at what the current IPA font is not giving us.
Please note that the purpose of such a font is not to solve any orthographic needs but to provide a font for analysis work and for technical papers, etc. This is a “linguist’s only” font! And for that reason we would ask that you ask these questions of the linguists around you. It is understandable that computer support people or administrators are not, themselves, experts in this area, but we are not finding it easy to contact the people who know the answers to these questions and who have the biggest interest in our getting this project right.
Where are we?
The current state of the project is that we continue to gather information from people like yourselves and your colleagues. We are also beginning to do some glyph work to add attachment points. Adding these points will allow us to position diacritics correctly every time without having to have different codes for the same diacritic in a different position. With this information we will be able to start work on a Graphite-based font and also add some OpenType tables, which Microsoft have indicated they will try to support in either the next or a later version of Office.
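The idea behind attachment points can be sketched in a few lines. This is only an illustration of the general technique, not Graphite or OpenType code. The glyph names, the coordinates, and the `attach` function are hypothetical.

```python
def attach(base_anchor, mark_anchor, base_origin=(0, 0)):
    """Return where to draw the diacritic so that its attachment
    point lands exactly on the base glyph's attachment point."""
    bx, by = base_origin
    return (bx + base_anchor[0] - mark_anchor[0],
            by + base_anchor[1] - mark_anchor[1])


# Hypothetical attachment points, in font units.
UPPER_POINT = {"a": (250, 480), "l": (140, 700)}   # on base glyphs
TILDE_POINT = (200, 0)                             # on the diacritic

for base in ("a", "l"):
    print(base, attach(UPPER_POINT[base], TILDE_POINT))
```

The same tilde glyph, with a single code, sits at a different offset over a short letter and a tall one, so no separate "high" and "low" copies of the diacritic are needed.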
If there is a definite GX need for the Mac OS, then I am sure that GX support will follow.
We already have a test font which has no smart capabilities, but is a basis that we can use to think ideas through. We are hoping to have at least an early Graphite-capable version of IPA ready for CTC, but we have a long way to go to get there with many critical elements needing to converge before then.
Throughout the process, we will also be interacting with the Unicode Consortium to try to add characters that are missing from Unicode to a future version of the standard.
All in all, the IPA project is a key project for the following reasons:
We would appreciate having responses sent to Larry Hayashi, Peter Constable, Martin Hosken, and Victor Gaultney.
Future of Keyman
by Martin Hosken
Microsoft gradually continues to improve their keyboarding support for a wider number of languages, including languages that use non-Roman scripts. The latest improvements have come in Windows 2000 with its ability to allow direct keying of Unicode data.
Windows 9x does not currently support keyboard handlers to generate Unicode characters directly. Users are therefore stuck with the application converting a keystroke, which has some arbitrary 8-bit value, through some codepage into Unicode. The problem with this is that the relationship between the keyboard and the codepage is hardwired into the application and therefore not extensible.
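A small sketch shows why this matters. Here Python's built-in codecs stand in for the application's hardwired conversion. The function name `keystroke_to_unicode` and the sample byte value are assumptions for illustration, not anything taken from Windows itself.

```python
import unicodedata


def keystroke_to_unicode(byte_value, codepage):
    """Interpret one 8-bit keystroke value through a codepage."""
    return bytes([byte_value]).decode(codepage)


# The same 8-bit value becomes a different character in each codepage.
for codepage in ("cp1252", "cp1251", "cp1253"):
    ch = keystroke_to_unicode(0xE9, codepage)
    print(codepage, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The keyboard can only deliver characters that the chosen codepage happens to contain, so a script without a codepage of its own cannot be keyed this way.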
We have had very fruitful interaction with some Microsoft engineers, which has allowed us to suggest an approach that may, in the future, allow direct keying of Unicode in Windows 9x, etc., thus perhaps somewhat alleviating the need for people to move to Windows 2000 quite so quickly.
As to our own solutions, we continue to see Keyman at the centre of our solutions for the following reasons:
Where are we?
Keyman 5 is in very active development at the moment and we are expecting an early beta test version at any moment. Keyman 5 is primarily aimed at Unicode-based keying, initially in Windows 2000. But Keyman 5 is also aimed at working on older technologies like Windows 9x to the best of their ability too. Thus Keyman 5 will take the place of Keyman 4, which is a fine tool, but has its weaknesses. Marc is recommending that users hold off from using Keyman 4 if they can, to wait for a much better Keyman 5, and advises that the transition from Keyman 3.2 to Keyman 5 will be much easier than the incremental transition via Keyman 4.
But Keyman 5 is not the easiest program to write. The core engine is trivial in comparison to the problem of correctly integrating Keyman 5 into the various targeted OSes and their varying multilingual support. We will announce Keyman 5 availability as soon as we can.
Entity-specific versus Corporation-wide PUA Sub-areas
by Peter Constable
The Private Use Area (PUA) in plane 0 of Unicode is a somewhat limited resource. In order to manage it to the best benefit for all SIL entities, the 1998 CTC passed a motion requesting the NRSI to develop a plan for entities to follow. A draft of the NRSI’s recommendations was presented at that conference (Hosken 1998) and is available on the IPub Resource Collection 98 CD-ROM (International Publishing Services 1998). That proposal allows entities to make free use of the lower portion of the PUA range while NRSI manages the upper portion for corporation-wide use.
The original NRSI proposal for overall management of the plane 0 PUA was to allow entities free use of the range U+E000–U+EFFF and for NRSI to manage U+F000–U+F8FF for corporation-wide use. This would provide 4,096 codepoints for field-entity-specific use and 2,304 for corporation-wide use.
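The counts follow directly from the range boundaries, as this short check shows. The helper `range_size` and the constant names are just illustrative.

```python
def range_size(first, last):
    """Number of codepoints in an inclusive range."""
    return last - first + 1


ENTITY_RANGE = (0xE000, 0xEFFF)      # free use by entities
CORPORATE_RANGE = (0xF000, 0xF8FF)   # managed by NRSI

print(range_size(*ENTITY_RANGE))     # 4096
print(range_size(*CORPORATE_RANGE))  # 2304
```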
In fact, a portion of the upper range may not be safe for our use due to use by major software developers, such as Microsoft, Apple and Adobe; i.e. codepoints may exhibit undesirable behaviour in software developed by those companies due to proprietary use of those codepoints. (This is perfectly legal use of the PUA, though in the interest of end users one hopes that it will be kept to a minimum.) So, for example, NRSI already plans to avoid using U+F000–U+F0FF since these codepoints are used for “symbol-encoded” fonts in Microsoft software.
In addition to the U+F000–U+F0FF range, Apple has documented assignments that they have made in the range U+F800 through U+F8FF. Adobe has likewise documented assignments in the range U+F600 through U+F7FF. We also know that Microsoft has made use of other codepoints in the upper PUA range in some of their fonts for presentation forms, though these are probably less of a concern. (Actually, they’ve also used some of the U+Exxx range in some fonts, but again I don’t think this use is a particular concern.) This is not a particular concern because the only software that would be aware of these assignments is Microsoft’s Uniscribe rendering engine, and it would only use these codepoints as output, never as input. So, text that contains characters that use one of these codepoints would be unaffected. Furthermore, such use by Microsoft is likely only to be temporary as they implement OpenType fonts, and so there is little likelihood that software will begin to appear that assumes definitions for those codepoints based upon their use in Microsoft fonts.
What is of greater concern is that Adobe has not only documented their use of codepoints but that they have also suggested that other major vendors cooperate with them in making assignments in the upper PUA range, and they have had some vendors such as HP interact with them in this way. (Such quasi-standardisation is not condoned by the Unicode standard.) The concern I have with this is that it could lead to commercial software making assumptions about the definition of these codepoints, which would mean that other uses of those codepoints could meet with problems.
In addition, we know that software vendors including Microsoft have used some upper PUA codepoints for internal processing purposes within their code. Such uses make those codepoints particularly unreliable for encoding text. (Unfortunately, I do not know which codepoints have actually been used in this way. I do know that most or all such uses have been at the very upper end of the PUA range, however.)
As a result, the range that NRSI has to work with for corporation-wide characters is actually somewhat smaller than 2,304 codepoints. To be safe, we should probably try to avoid the ranges U+F000–U+F0FF and U+F700–U+F8FF. The remainder leaves us with 1,536 codepoints. This may be entirely adequate to meet the pan-corporation PUA needs, but it is hard to make any attempt at guessing whether that will be the case or not.
Because of this restriction, I would like to encourage entities to avoid using the range U+EC00–U+EFFF if they can at all avoid it. That would leave 1,024 codepoints available to NRSI for future expansion for corporation-wide characters should the need arise. Restricting use to U+E000–U+EBFF still provides 3,072 codepoints. This should be more than adequate for most entities’ needs.
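For anyone who wants to check their own codepoint assignments against these recommendations, the plan can be written down as a small table. This is only a sketch of the ranges discussed above. The labels, the `PUA_PLAN` table, and the `classify` function are my own shorthand, not an official NRSI tool.

```python
# (first, last, label) for each sub-area of the plane 0 PUA.
PUA_PLAN = [
    (0xE000, 0xEBFF, "entity-specific"),
    (0xEC00, 0xEFFF, "please avoid: held for NRSI expansion"),
    (0xF000, 0xF0FF, "avoid: Microsoft symbol-encoded fonts"),
    (0xF100, 0xF6FF, "corporation-wide (NRSI)"),
    (0xF700, 0xF8FF, "avoid: vendor use"),
]


def classify(codepoint):
    """Return the label of the sub-area holding a codepoint,
    or None if the codepoint lies outside the plane 0 PUA."""
    for first, last, label in PUA_PLAN:
        if first <= codepoint <= last:
            return label
    return None


for first, last, label in PUA_PLAN:
    print(f"U+{first:04X}-U+{last:04X} {last - first + 1:5d}  {label}")
```

The sizes printed for the five sub-areas are 3,072, 1,024, 256, 1,536 and 512, which together account for all 6,400 codepoints of the plane 0 PUA.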
Hosken, Martin. 1998. “PUA corporate strategy: A discussion on the organization of the PUA.” SIL IPub Resource Collection 98 (CD-ROM). Dallas: SIL International.
International Publishing Services. 1998. Resource Collection 98 CD. Dallas: SIL International.
New Multilingual OCR Product from IRIS
by Peter Constable
IRIS has recently released version 5.0 of their OCR product, Readiris Pro. It reportedly works for 55 languages using Roman, Greek and Cyrillic scripts. The list of languages largely consists of typical European and European-derived suspects, as well as Indonesian, Malay, Swahili and Tagalog. Support for Simplified Chinese and Japanese is available as an option. It is designed to interoperate well with Microsoft Word 97, Word 2000, Excel 97 and Excel 2000, and generates Unicode-encoded text in those apps. (It can also be used with other apps, but may provide only 8-bit, “ANSI” codepage-encoded text.) I tested it on some Russian and Greek text, and the appropriate Unicode characters were generated in Word.
The software also supports user interfaces in Dutch, English, French, German, Italian and Spanish, and the UI language can be changed without restarting.
The program runs on Windows 95 and later, and Windows NT 3.51 and later. It works with many common scanners, and integrates well with Microsoft Word and Excel. List pricing: US$490 for a full version, or $196 as a competitive upgrade or upgrade from a previous version.
For more information, see http://www.irislink.com/c2-480/Readiris-Pro-11-OCR-software.aspx.
Circulation & Distribution Information
The purpose of this periodic e-mailing is to keep you in the picture about current NRSI research, development and application activities.