WSTech: Writing Systems Technology (formerly known as NRSI)
Guidelines for Writing System Support: Components of a Writing System Implementation
UNESCO project Initiative B@bel
Communication is a basic component of society. The ability of a group of people to communicate with one another and with the rest of the world is critical to community and nation development. Economic growth, education and cultural development depend on rich spoken and written communication.
Computers have enabled the diverse peoples of the world to communicate in powerful ways, whether that be in printed media or digital packets of information. Language communities can potentially communicate faster, farther, and in much greater volume. But there is a problem — most computer applications and operating systems were initially designed to handle English. They soon broadened enough to support the major languages of Europe and the Pacific Rim.
There are still, however, millions of people who are cut off from global communication. Their languages and writing systems are not supported on standard computers. The long-term viability of their culture and language is increasingly dependent on the ability to communicate through the digital medium, but many groups are hindered from even print publishing because there are no broadly available means to type and print their language in their alphabet.
This paper is intended as a guide to the planning and management of projects that seek to solve this problem. It provides a basic framework for the development of computer software components that support the diverse languages of the world. It is written primarily for policy makers and professionals, but includes some introductory technical material in later sections.
The scope of this document is necessarily limited. It does not seek to recommend particular software applications, or prescribe a single type of solution. Rather, it is intended to stimulate creative approaches to writing system implementations and encourage cooperation between commercial industries, governmental bodies and non-governmental organizations (NGOs).
1.1 Writing system implementations
Language is primarily a spoken means of communication. Writing systems (also called orthographies) are ways of communicating on printed or visual media, and are usually, but not always, based on spoken languages. A script, sometimes referred to using terms such as alphabet and syllabary, is a collection of symbols (typically with accompanying rules of behavior) that forms the basis for writing many languages. This leads to a basic definition of a writing system, as used in this paper: the use of one or more scripts to form a complete system for writing a particular language. Note that a writing system is unique to a specific language, or language family. Although Russian and Ukrainian share the same Cyrillic script, they represent two distinct writing systems.
A writing system implementation (WSI) refers to a set of software components that allow computer users to process textual data in that script and language, making it possible, for instance, to enter data using a keyboard and display it on the screen. Because writing systems are language-specific, there is no guarantee that an implementation of a certain script on a computer will work for all languages that use that script. A WSI that works for Farsi, for example, may not work for Sindhi, although they share the same Arabic script.
Common examples of WSIs would be those computer systems found in newspaper publishing houses around the world. WSIs can be very simple, such as for English, for which only a simple font is needed. They can also be complex and include expensive, dedicated software applications. For instance, to use Chinese on a computer requires a combination of fonts, input systems, and publishing software that can write both horizontally and vertically.
1.2 Goals for writing system support
Language communities need WSIs in order for them to express their languages using computers. A WSI is needed to do even the most basic communications, such as printing out a letter for mass reproduction. Because of the growing importance of cyberspace — the realm of digital communications — WSIs will continue to be critical.
Every language deserves to have its writing system working on computers. But what does this mean? WSIs can differ in their breadth and depth of support. So what level of support does a WSI need to provide for a language in order to be sufficient? At what point can the need be considered to be met? And what should be the goal of a WSI development program?
1.2.1 Minimal — type, see, print, store
At a foundational, minimal level, people must be able to type, see, print and store text in their language.
This allows for basic functionality, particularly for publishing on paper. If any one of these four attributes is missing, the WSI should be considered foundationally inadequate.
1.2.2 Reasonable — interchange, process, analysis
In today’s digital world, print publishing is not really enough. People need to be able to share information in their language with others — through email, Web sites, portable documents, even interactive presentations. It is reasonable to expect that every language needs the ability to communicate in a wider world.
A WSI that enables broader communication would need to support:
There are also other means of electronic communication (Web logs, chat systems) that should be considered, as well as the broad realm of interactive documents.
In addition to information sharing, a WSI ought to provide for an appropriate level of language processing or analysis. This can include word sorting (for dictionaries and directories), spell-checking, hyphenation and other algorithmic processes.
The technology needed to support these activities already exists in some cases, but is often in an early state of development.
1.2.3 Ideal — parity
Ideally, computing in any language should be as easy as it currently is for English or other European languages. This is sometimes technologically or economically unfeasible. Nevertheless, there is no reason to be satisfied with a minimal solution for a writing system. As a language community grows in its technical ability and desire for communication, there should be no artificially imposed barriers on their use of computers.
There remains, however, a question of completeness. For many activities, there may be a range of WSI support that is sufficient. In the end, WSIs can only be deemed to be complete if they fully meet the need for which they were intended.
Publishing, for example, is a term that covers the production of everything from a leaflet to a glossy airline magazine. For anything but the most basic publishing, a wider range of support is needed. For example, people typesetting Hebrew text (which runs right-to-left on the page) often need to mix words or phrases from left-to-right writing systems into their text. There is a whole set of paragraph and line construction rules that need to be applied when mixing such diverse scripts. A WSI may be sufficient for basic Hebrew publishing without supporting these rules, but more complex publishing requires them.
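To make the mixed-direction problem concrete, the following Python sketch splits a line into right-to-left and left-to-right runs using the directional class that Unicode assigns to each character. It is only an illustration: the function name `directional_runs` is hypothetical, neutral characters are simply attached to the preceding run, and the full paragraph and line construction rules mentioned above are far more involved than this.

```python
import unicodedata

def directional_runs(text):
    """Split text into runs of right-to-left ('R') or left-to-right ('L') characters.

    Simplification: characters without a strong direction (spaces, digits,
    punctuation) stay with the run that precedes them.
    """
    runs = []
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):       # Hebrew and Arabic letters
            direction = "R"
        elif bidi_class == "L":             # Latin and most other letters
            direction = "L"
        else:                               # neutral: inherit, default to 'L'
            direction = runs[-1][0] if runs else "L"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [tuple(run) for run in runs]

# Hebrew word followed by a Latin-script word.
mixed = "\u05E9\u05DC\u05D5\u05DD Unicode"
for direction, run in directional_runs(mixed):
    print(direction, repr(run))
```

A layout engine would then order and place each run on the line according to its direction, which is the part a basic WSI may leave unsupported.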
Finally, a WSI needs to be accessible to those who need it. There are a number of barriers to this:
These three issues need consideration in the planning, development and evaluation of WSIs.
1.3 Components of a WSI — the SIL NRSI Model
Successful WSIs depend on a wide variety of software, including operating systems, standalone applications, fonts, conversion tools and other utilities. Some of these, such as the underlying operating systems themselves, are outside the control of most WSI developers, and cannot be altered. Others are fully in the hands of independent companies and individuals. Despite these differences, it is possible to construct a general architectural model for WSIs.
[Figure: NRSI Text Encoding Model]
SIL International, through its Non-Roman Script Initiative (NRSI) department, has developed a model for implementing writing systems that encompasses text input, storage, processing, and output. Because of the variety and breadth of locations where it works, SIL has one of the greatest needs for flexibility in these various functions. During its early days a model was developed for dealing with complex scripts that encompassed all of these needed functions.
The model begins with data encoding — how language information is stored on the computer. Any information within a computer system is given a digital representation, meaning that it is encoded in terms of binary numerical values. For text data, the encoding refers to a set of rules by which the sequence of characters that make up a text are represented in terms of a sequence of numbers.
All WSIs are based upon an encoding, whether public or private. An adequate encoding needs to store all the information needed to represent the text in a given writing system. For example, an encoding for English needs to maintain a difference between capital A and lowercase a, because capitalization is an important writing system feature.
The current international standard for encoding is Unicode, a system with almost unanimous support from software companies and government bodies. This standard assigns a distinct numerical range to individual scripts, and unique numeric identifiers to each character of each script. This allows data in a variety of languages to be stored together without confusion. For more information, consult the section on Technical Details: Encoding and Unicode. It can be assumed that any adequate WSI stores its data according to The Unicode Standard [UNI2003].
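As a small illustration of what "encoded as numbers" means, the Python sketch below lists the Unicode code point, the standard character name, and the UTF-8 byte sequence for a few characters. The helper name `describe` and the choice of sample characters are assumptions made for this example only; UTF-8 is used here as one common way of storing Unicode text, not as the only one.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name, UTF-8 bytes) for each character."""
    rows = []
    for ch in text:
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")))
    return rows

# Latin capital A, Latin small a, and the Thai letter KO KAI.
sample = "Aa\u0E01"
for row in describe(sample):
    print(row)
```

Capital A and lowercase a receive different numbers, so the capitalization distinction survives storage, and the Thai letter falls in a separate numerical range from the Latin letters.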
Examples of encoding components:
Data needs to be entered into the computer somehow. This process involves input, whether by the computer keyboard or other input methods. The technical process of keyboarding involves translating keystroke sequences into character data in some given encoding. Since most physical computer keyboards are designed around the entry of Roman script data, the routines needed to interpret keystrokes into language data for non-Roman scripts can be very complex, and are commonly built directly into computer operating systems. The section on Technical Details: Data Entry and Editing addresses these issues in greater detail.
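The translation from keystroke sequences to characters can be sketched in a few lines of Python. The layout table below is entirely hypothetical (it is not taken from any real keyboard package), and real keyboarding components also handle editing, context and operating system integration, which are ignored here.

```python
# Hypothetical layout: sequences typed on a Roman keyboard -> output characters.
LAYOUT = {
    "`a": "\u00E0",   # grave "dead key" followed by a gives a with grave
    "`e": "\u00E8",   # grave "dead key" followed by e gives e with grave
    "ng": "\u014B",   # the sequence n, g gives the letter eng
}

def keystrokes_to_text(keys, layout=LAYOUT):
    """Translate typed keys into text, preferring the longest matching sequence."""
    longest = max(len(seq) for seq in layout)
    out = []
    i = 0
    while i < len(keys):
        for size in range(longest, 0, -1):
            chunk = keys[i:i + size]
            if chunk in layout:
                out.append(layout[chunk])
                i += len(chunk)
                break
        else:
            out.append(keys[i])   # no rule applies: pass the key through
            i += 1
    return "".join(out)

print(keystrokes_to_text("si`a"))
print(keystrokes_to_text("sing"))
```

Keeping the layout as data rather than code means the same routine could, in principle, serve many languages with different tables.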
Examples of keyboarding components:
Rendering takes stored text in a given encoding and presents it visibly on a screen or printed page. If the data stored on the computer exactly parallels the individual letters in a line of text, this is a simple process: one code on the computer per letter. Even in this situation it is important that the letter be shaped correctly, and harmonize with the rest of the alphabet. Computer font design is a subtle but important process, and is briefly addressed in Technical Details: Glyph Design.
In complex Roman and most non-Roman scripts, however, data must be interpreted to display or print properly. In the end, users want to view their data in an easy, trouble-free manner that accurately reflects the spelling and word formation rules of their writing system. This can require ‘smart’ rendering components that perform sophisticated interpretations of the data before sending it to the computer screen; see Technical Details: Smart Rendering.
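The gap between stored codes and what the reader sees can be shown with a short Python sketch. It groups each base character with the nonspacing marks that follow it, using the general category from the Unicode character database. The function name `split_base_and_marks` is an assumption for this example, and the grouping is a deliberate simplification: it does not perform the glyph selection and positioning that a smart rendering component carries out.

```python
import unicodedata

def split_base_and_marks(text):
    """Group each base character with the nonspacing marks (category Mn) after it."""
    clusters = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and clusters:
            clusters[-1] += ch      # mark attaches to the preceding base
        else:
            clusters.append(ch)     # a new base character starts a new group
    return clusters

# Thai: THO THAHAN + SARA II (vowel above) + MAI EK (tone mark above the vowel).
word = "\u0E17\u0E35\u0E48"
clusters = split_base_and_marks(word)
print(len(word), "stored codes,", len(clusters), "visible stack")
```

Three stored codes occupy a single horizontal position here, so the renderer must stack the vowel and tone mark above the consonant rather than draw one letter per code.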
Examples of rendering components:
Many WSIs include analysis components. This refers to a variety of actions in which data is processed or analyzed so that:
Such processes include sorting, word demarcation, hyphenation, and spell-checking, as well as more complex systems such as speech synthesis. Analysis may, at first, seem less important, but can be critical to language support. For example, a means of sorting language data is needed to produce dictionaries and telephone books. The brevity of this document and the breadth of the topic do not allow for further discussion here.
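Sorting is a convenient example of why analysis components are language-specific. The Python sketch below compares a plain sort by stored code values with a sort driven by an alphabet order. The alphabet string, in which the letter a with diaeresis follows a, is a hypothetical order chosen for illustration, and `sort_key` is an assumed helper name; production systems would normally rely on a full collation standard rather than a table this simple.

```python
# Hypothetical alphabet order: a with diaeresis sorts immediately after a.
ALPHABET = "a\u00E4bcdefghijklmnopqrstuvwxyz"
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def sort_key(word):
    """Map each letter to its position in ALPHABET; unknown characters sort last."""
    return [RANK.get(ch, len(ALPHABET)) for ch in word.lower()]

words = ["zebra", "\u00E4hnlich", "apfel"]

by_code_point = sorted(words)               # order of the stored numeric codes
by_alphabet = sorted(words, key=sort_key)   # order a reader of the language expects

print(by_code_point)
print(by_alphabet)
```

Sorting by code value places the accented word after "zebra", while the alphabet-driven key places it next to the other words beginning with a.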
Examples of analysis components:
Basic conversion components transform data from one encoding into another. Until the advent of Unicode, WSIs used hundreds of different encodings for their data. Some of these encodings were official standards, but others were proprietary and unique to a specific application program. Any time more than one encoding exists for a given language, there need to be conversion routines and tools as a ‘bridge’ between them. This is especially relevant in the transition from older encodings to Unicode.
Without reliable conversion, users will hesitate to migrate their data to newer systems. A lack of good conversion tools may hinder the uses that could be made of older data. This is, again, a very important topic, but there is not room for further discussion in this document.
Examples of conversion components:
1.4 WSI example
A real-world example can make this model, and the various components, easier to understand. There is a long history of computing for the Thai writing system. There is now a movement toward use of Unicode, away from old encodings, and that transition requires a variety of components. The diagram below describes, in brief, one system that has been developed as part of this new movement. It is one of many WSIs for Thai.
[Figure: An example WSI for Thai]
This WSI for Thai is based upon not one, but two central encodings: Unicode and an older Thai Industrial Standard (TIS620) with extensions by Microsoft (now known as TIS620-2). This reflects the current transitional state of the industry. Support for dual encodings requires conversion routines that can take text in either encoding and convert it to the other one.
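A two-way conversion of this kind can be sketched in Python. The example assumes that Python's built-in `cp874` codec (Windows code page 874, a Microsoft extension of TIS-620) is an acceptable stand-in for the legacy Thai encoding; the function names are hypothetical, and the conversion routines actually used in the WSI described here are not shown in this document.

```python
LEGACY_CODEC = "cp874"   # assumption: stand-in for the legacy Thai encoding

def legacy_to_unicode(data: bytes) -> str:
    """Decode legacy-encoded Thai bytes into Unicode text."""
    return data.decode(LEGACY_CODEC)

def unicode_to_legacy(text: str) -> bytes:
    """Encode Unicode text as legacy bytes; fails on characters the legacy set lacks."""
    return text.encode(LEGACY_CODEC)

legacy = b"\xA1\xD2"                 # KO KAI + SARA AA in the legacy encoding
text = legacy_to_unicode(legacy)     # the same two letters as Unicode code points
round_trip = unicode_to_legacy(text)

print([f"U+{ord(ch):04X}" for ch in text])
print(round_trip == legacy)

try:
    unicode_to_legacy("\u4E2D")      # a Chinese character has no legacy Thai code
except UnicodeEncodeError as err:
    print("cannot convert:", err.reason)
```

The failing case is worth noting: conversion toward the older encoding can lose or reject data, which is one reason the direction of migration is normally toward Unicode.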
There are two software keyboard mechanisms, one for each encoding. The keyboard layout for both is the de facto typewriter standard for Thai. Users can pick which keyboard to use and switch between them depending on what encoding they wish to produce at the time. One of these is distributed by Microsoft and the other was developed by a third party.
There are many fonts available, especially for the older encoding, that translate text in the encodings into symbols on the screen or page. Because the Unicode encoding is simpler, it relies on ‘smart’ fonts to position diacritics.
Finally, a set of sorting routines has been developed for use with a linguistic database program called Shoebox. This is currently only available for the TIS encoding, but a Unicode version is in development. General sorting routines, built into Thai versions of operating systems, have been available for many years, but are not incorporated into this illustration.
Although this system architecture describes a WSI for Thai, it is very similar to WSIs used for many encodings throughout the world.
The ability to communicate using one’s own language is a basic human right, and use of an accompanying writing system can be an important part of that communication. To do this using a computer requires a writing system implementation (WSI). There is no standard set of requirements for WSIs that is applicable to every situation. Each language will require different capabilities, although the ability to type, see, print and store information is a basic minimum.
Although some WSIs are very simple in design, most involve a wide variety of components that must be coordinated. The choice of encoding is a critical one, as all the other components directly depend on it. A minimum system must include at least encoding, keyboarding and rendering components.
The following three sections focus on the process of developing WSIs, the important roles of companies, organizations and governments in WSI development, and some basic keys to successful WSIs. Later sections go into more detail on technical topics and are particularly appropriate to those initiating or managing WSI development.
[UNI2003] The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA: Addison-Wesley, 2003.
(c) Copyright 2003 UNESCO and SIL International Inc.
Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.
I am pleased to inform you that I have developed 69 keyboards for Pakistani languages, and I would now like to share these keyboards with SIL. Is there any policy for uploading my keyboards under my name? Waiting for your response.
Thank you for contacting us about the development of these keyboards, and for your desire to share these with SIL. If they are Keyman keyboard layouts, the most effective distribution channel would be to contribute them to the Keyman Repository (SIL has acquired Keyman). Instructions on the process can be found here: http://help.keyman.com/developer/keyboards/ If the keyboards are in another format, please use the Contact Us link at the bottom of this page, so that we can correspond with you further. Many thanks for your work in supporting these languages.