NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/WSI_Guidelines_Sec_1

Guidelines for Writing System Support: Components of a Writing System Implementation

Victor Gaultney, Melinda Lyons, 2003-10-20

Communication is a basic component of society. The ability of a group of people to communicate with one another and with the rest of the world is critical to community and nation development. Economic growth, education and cultural development depend on rich spoken and written communication.

Computers have enabled the diverse peoples of the world to communicate in powerful ways, whether that be in printed media or digital packets of information. Language communities can potentially communicate faster, farther, and in much greater volume. But there is a problem — most computer applications and operating systems were initially designed to handle English. They soon broadened enough to support the major languages of Europe and the Pacific Rim.

There are still, however, millions of people who are cut off from global communication. Their languages and writing systems are not supported on standard computers. The long-term viability of their culture and language is increasingly dependent on the ability to communicate through the digital medium, but many groups are hindered from even print publishing because there are no broadly available means to type and print their language in their alphabet.

This paper is intended as a guide to the planning and management of projects that seek to solve this problem. It provides a basic framework for the development of computer software components that support the diverse languages of the world. It is written primarily for policy makers and professionals, but includes some introductory technical material in later sections.

The scope of this document is necessarily limited. It does not seek to recommend particular software applications, or prescribe a single type of solution. Rather, it is intended to stimulate creative approaches to writing system implementations and encourage cooperation between commercial industries, governmental bodies and non-governmental organizations (NGOs).

1.1  Writing system implementations

Language is primarily a spoken means of communication. Writing systems (also called orthographies) are ways of communicating on printed or visual media, and are usually, but not always, based on spoken languages. A script, sometimes referred to using terms such as alphabet and syllabary, is a collection of symbols (typically with accompanying rules of behavior) that forms the basis for writing many languages. This leads to a basic definition of a writing system, as used in this paper: the use of one or more scripts to form a complete system for writing a particular language. Note that a writing system is unique to a specific language or language family. Although Russian and Ukrainian share the same Cyrillic script, they represent two distinct writing systems.

A writing system implementation (WSI) is a set of software components that allows computer users to process textual data in a given script and language, making it possible, for instance, to enter data using a keyboard and display it on the screen. Because writing systems are language-specific, there is no guarantee that an implementation of a certain script on a computer will work for all languages that use that script. A WSI that works for Farsi, for example, may not work for Sindhi, although they share the same Arabic script.

Common examples of WSIs would be those computer systems found in newspaper publishing houses around the world. WSIs can be very simple, such as for English, for which only a simple font is needed. They can also be complex and include expensive, dedicated software applications. For instance, to use Chinese on a computer requires a combination of fonts, input systems, and publishing software that can write both horizontally and vertically.

1.2  Goals for writing system support

Language communities need WSIs in order for them to express their languages using computers. A WSI is needed to do even the most basic communications, such as printing out a letter for mass reproduction. Because of the growing importance of cyberspace — the realm of digital communications — WSIs will continue to be critical.

Every language deserves to have its writing system working on computers. But what does this mean? WSIs can differ in their breadth and depth of support. So what level of support does a WSI need to provide for a language in order to be sufficient? At what point can the need be considered to be met? And what should be the goal of a WSI development program?

1.2.1  Minimal — type, see, print, store

At a foundational, minimal level, people must be able to:

  • Type their language in a manner that is sensible to them. Aspects of the keyboarding experience, such as the key layout or typing order, should usually match up with the way people think about writing their language, and not require a shift in linguistic understanding. For writing systems with hundreds or thousands of letters, this means that the correct symbol should be accessible in a reasonably intuitive manner.
  • See the results of their typing on the screen and view the text as it will be printed. A WSI needs to provide basic feedback to users and allow them to catch and change mistakes.
  • Print their document with text rendered correctly. Lettershapes need to be correct for the language, with accents and diacritics positioned appropriately. Words need to be formed using the right word-formation rules. For writing systems that require ligatures (the joining of two or more letters to form a combined shape that is different from the individual letters), the ligatures need to be present and correct.
  • Store their data in a format that retains all important information. This should include all information needed to present the text content legibly and facilitate common text processes (editing, sorting, spell-checking, etc.) in a manner that is appropriate to that writing system. For example, if a language is written with special tone marks, those marks need to be stored as an integral part of the text, not as separate graphics or manually positioned marks.

This allows for basic functionality, particularly for publishing on paper. If any one of these four attributes is missing, the WSI should be considered foundationally inadequate.

1.2.2  Reasonable — interchange, process, analysis

In today’s digital world, print publishing is not really enough. People need to be able to share information in their language with others — through email, Web sites, portable documents, even interactive presentations. It is reasonable to expect that every language needs the ability to communicate in a wider world.

A WSI that enables broader communication would need to support:

  • Transmission of adequate language data via email. Information needs to be encoded so that it can be sent to another computer without loss of important language information.
  • Preparation of Web pages that display the language correctly. This is, however, highly dependent on individual Web browsers. Development of a WSI that supports Web page design that displays correctly in all browsers may not be feasible, and WSIs may need to be prepared that depend on a single browser platform.
  • Preparation of portable documents. The current standard for this type of document is Adobe’s PDF (portable document format). WSIs need to allow the embedding of language data into documents that can be viewed on a variety of operating system platforms.

There are also other means of electronic communication (Web logs, chat systems) that should be considered, as well as the broad realm of interactive documents.

In addition to information sharing, a WSI ought to provide for an appropriate level of language processing or analysis. This can include word sorting (for dictionaries and directories), spell-checking, hyphenation and other algorithmic processes.

The technology needed to support these activities already exists in some cases, but is often in an early state of development.

1.2.3  Ideal — parity

Ideally, computing in any language should be as easy as it currently is for English or other European languages. This is sometimes technologically or economically unfeasible. Nevertheless, there is no reason to be satisfied with a minimal solution for a writing system. As a language community grows in its technical ability and desire for communication, there should be no artificially imposed barriers on their use of computers.

1.2.4  Completeness

There remains, however, a question of completeness. For many activities, there may be a range of WSI support that is sufficient. In the end, WSIs can only be deemed to be complete if they fully meet the need for which they were intended.

Publishing, for example, is a term that covers the production of everything from a leaflet to a glossy airline magazine. For anything but the most basic publishing, a wider range of support is needed. For example, people typesetting Hebrew text (which runs right-to-left on the page), often need to mix words or phrases from left-to-right writing systems into their text. There is a whole set of paragraph and line construction rules that need to be applied when mixing such diverse scripts. A WSI may be sufficient for basic Hebrew publishing without supporting these rules, but more complex publishing requires them.
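The starting point for these mixed-direction rules is that Unicode assigns every character a directional class. The following is a minimal Python sketch, using only the standard `unicodedata` module, that splits logically ordered text into left-to-right and right-to-left runs. The function name `direction_runs` is hypothetical, and attaching neutral characters (spaces, punctuation) to the preceding run is a simplifying assumption; it is not the full Unicode Bidirectional Algorithm that a real WSI would need.

```python
import unicodedata

def direction_runs(text):
    """Split text into runs of strongly LTR or RTL characters (simplified)."""
    runs = []  # each entry is [direction, characters]
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):      # Hebrew, Arabic and similar letters
            direction = "RTL"
        elif cls == "L":            # Latin, Cyrillic, Thai and similar letters
            direction = "LTR"
        else:                       # spaces, digits, punctuation: treated as neutral here
            direction = None
        if runs and direction in (None, runs[-1][0]):
            runs[-1][1] += ch       # extend the current run
        else:
            runs.append([direction or "LTR", ch])
    return [(d, s) for d, s in runs]

# Hebrew text with an embedded English word, stored in typing (logical) order.
mixed = "\u05e9\u05dc\u05d5\u05dd Unicode \u05e2\u05d5\u05dc\u05dd"
for direction, run in direction_runs(mixed):
    print(direction, [f"U+{ord(c):04X}" for c in run])
```

A layout engine would then reorder and place these runs on the line; that reordering step is exactly what a basic Hebrew WSI might lack.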

1.2.5  Accessibility

Finally, a WSI needs to be accessible to those who need it. There are a number of barriers to this:

  • System complexity. A WSI may meet all the technical needs for a writing system, but be too complex for the general public to use. It may meet the need for a smaller segment of the community, but generally useful WSIs are still needed.
  • Available software platforms. To be adequate for a broad population, WSIs need to be usable on operating system and application platforms that are available to the public. For example, a WSI that only works on a Macintosh computer would not be sufficient in a location where only Windows computers are available.
  • Economic cost. Although commercial investment is often necessary for the development of WSIs, solutions that are affordable by only a small segment of the community may not be sufficient. It will take creativity and cooperation to develop WSIs for the general public.

These three issues need consideration in the planning, development and evaluation of WSIs.

1.3  Components of a WSI — the SIL NRSI Model

Successful WSIs depend on a wide variety of software, including operating systems, standalone applications, fonts, conversion tools and other utilities. Some of these, such as the underlying operating systems themselves, are outside the control of most WSI developers, and cannot be altered. Others are fully in the hands of independent companies and individuals. Despite these differences, it is possible to construct a general architectural model for WSIs.

[Figure: NRSI Text Encoding Model]



SIL International, through its Non-Roman Script Initiative (NRSI) department [1], has developed a model for implementing writing systems that encompasses text input, storage, processing, and output. Because of the variety and breadth of locations where it works, SIL has one of the greatest needs for flexibility in these various functions. During its early days a model was developed for dealing with complex scripts that encompassed all of these needed functions.

1.3.1  Encoding

The model begins with data encoding — how language information is stored on the computer. Any information within a computer system is given a digital representation, meaning that it is encoded in terms of binary numerical values. For text data, the encoding refers to a set of rules by which the sequence of characters that make up a text are represented in terms of a sequence of numbers.

All WSIs are based upon an encoding, whether public or private. An adequate encoding needs to store all the information needed to represent the text in a given writing system. For example, an encoding for English needs to maintain a difference between capital A and lowercase a, because capitalization is an important writing system feature.

The current international standard for encoding is Unicode [2], a system with almost unanimous support from software companies and government bodies. This standard assigns a distinct numerical range to individual scripts, and unique numeric identifiers to each character of each script. This allows data in a variety of languages to be stored together without confusion. For more information, consult the section on Technical Details: Encoding and Unicode. It can be assumed that any adequate WSI stores its data according to The Unicode Standard. [3]

Examples of encoding components:

  • A set of Unicode characters defined for a given script that includes all needed characters (letters).
  • A storage format that allows data to be encoded according to Unicode.
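As a small illustration of these ideas, the Python sketch below lists the numeric identifier (code point) and official Unicode name of a few characters from different scripts, and then stores the text in UTF-8, one common Unicode storage format. The helper name `describe` and the sample characters are chosen here for illustration only.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Capital A, lowercase a, a Cyrillic letter and a Thai letter.
sample = "Aa\u0416\u0e01"
for code_point, name in describe(sample):
    print(code_point, name)

# Capital and lowercase letters keep distinct codes, as the text requires.
assert ord("A") != ord("a")

# UTF-8 stores the same code points as bytes and restores them without loss.
stored = sample.encode("utf-8")
assert stored.decode("utf-8") == sample
```

Because each script occupies its own numerical range, the Latin, Cyrillic and Thai characters in this sample can sit side by side in one file without ambiguity.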

1.3.2  Input

Data needs to be entered into the computer somehow. This process involves input, whether by the computer keyboard or other input methods. The technical process of keyboarding involves translating keystroke sequences into character data in some given encoding. Since most physical computer keyboards are designed around the entry of Roman script data, the routines needed to interpret keystrokes into language data for non-Roman scripts can be very complex, and are commonly built directly into computer operating systems. The section on Technical Details: Data Entry and Editing addresses these issues in greater detail.

Examples of keyboarding components:

  • A simple keyboard definition for the script, developed by Microsoft and included in distributions of Windows.
  • A Macintosh keyboard definition file, created by someone outside of Apple Computer, but that plugs into the standard Mac OS system.
  • A complex keyboard definition created for use with the Keyman program. This would be separately installed, along with Keyman, onto computers running Windows.
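The core idea of translating keystroke sequences into encoded characters can be sketched in a few lines of Python. The layout table below is entirely hypothetical (it is not the format used by Windows, Mac OS or Keyman), and the longest-match strategy is just one reasonable assumption about how multi-key sequences might be resolved.

```python
# Hypothetical layout: keystroke sequences mapped to Unicode characters.
LAYOUT = {
    "'e": "\u00e9",  # apostrophe, then e -> e with acute accent
    "`a": "\u00e0",  # backquote, then a -> a with grave accent
    "~n": "\u00f1",  # tilde, then n -> n with tilde
}

def keystrokes_to_text(keys, layout=LAYOUT):
    """Translate a string of keystrokes into text, preferring the longest match."""
    longest = max(len(sequence) for sequence in layout)
    output = []
    i = 0
    while i < len(keys):
        for size in range(min(longest, len(keys) - i), 0, -1):
            chunk = keys[i:i + size]
            if chunk in layout:
                output.append(layout[chunk])
                i += size
                break
        else:
            # No sequence starts here: pass the key through unchanged.
            output.append(keys[i])
            i += 1
    return "".join(output)

print(keystrokes_to_text("caf'e"))    # café
print(keystrokes_to_text("ma~nana"))  # mañana
```

Real keyboarding systems add context rules, reordering and per-application behavior on top of this kind of table, which is where much of the complexity for non-Roman scripts arises.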

1.3.3  Rendering

Rendering takes stored text in a given encoding and presents it visibly on a screen or printed page. If the data stored on the computer exactly parallels the individual letters in a line of text, this is a simple process: one code on the computer per letter. Even in this situation it is important that the letter be shaped correctly, and harmonize with the rest of the alphabet. Computer font design is a subtle but important process, and is briefly addressed in Technical Details: Glyph Design.

In complex Roman and most non-Roman scripts, however, data must be interpreted to display or print properly. In the end, users want to view their data in an easy, trouble-free manner that accurately reflects the spelling and word formation rules of their writing system. This can require ‘smart’ rendering components that perform sophisticated interpretations of the data before sending it to the computer screen; see Technical Details: Smart Rendering.

Examples of rendering components:

  • A simple font, such as for English, that does not require special ‘smart’ code for rendering.
  • An OpenType font, also intended for English, that has ‘smart’ code in order to display typographic ligatures required for fine typesetting.
  • A non-Roman font to which special SIL Graphite ‘smart’ code has been added to support sophisticated diacritic positioning.
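To see why rendering is not always one code per letter, the sketch below groups stored code points into the units a reader perceives as single written symbols. It uses Python's standard `unicodedata` module; the function name `split_clusters` is hypothetical, and attaching every combining mark to the preceding base character is a simplification of the grapheme cluster rules that real rendering systems follow.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Mn, Mc and Me are the Unicode general categories for combining marks.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# A Thai syllable: one consonant, a vowel sign above it, and a tone mark above that.
thai = "\u0e17\u0e35\u0e48"
print(len(thai), "code points,", len(split_clusters(thai)), "visible cluster")

# Latin "e" followed by a combining acute accent: two codes, one visible letter.
decomposed = "e\u0301"
assert len(split_clusters(decomposed)) == 1
assert unicodedata.normalize("NFC", decomposed) == "\u00e9"
```

In the Thai case the stored data says only which marks are present; stacking the tone mark correctly above the vowel sign is the job of the font and its 'smart' rendering code.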

1.3.4  Analysis

Many WSIs include analysis components. This refers to a variety of actions in which data is processed or analyzed so that:

  • the data can be transformed in some way specific to the writing system,
  • the data can be presented to the user in some useful manner, or
  • operations related to the editing, management or other processing of text data can be performed.

Such processes include sorting, word demarcation, hyphenation, and spell-checking, as well as more complex systems such as speech synthesis. Analysis may, at first, seem less important, but can be critical to language support. For example, a means of sorting language data is needed to produce dictionaries and telephone books. The brevity of this document and the breadth of the topic do not allow for further discussion here.

Examples of analysis components:

  • Standard sorting routines built into operating systems.
  • Word demarcation (word breaking) rules built into a Thai typesetting program.
  • Systems that turn text into a series of phonemes for input into a speech synthesis program.
  • A Spanish spell-checking module that works within a word processor.
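Sorting is a convenient example of why analysis must be writing-system aware. The Python sketch below contrasts a plain sort by numeric code value with a sort that compares base letters first, ignoring case and accents. The key function `simple_sort_key` is a hypothetical stand-in written for this illustration; production systems would normally rely on language-specific collation rules rather than this rough approximation.

```python
import unicodedata

def simple_sort_key(word):
    """Sort by base letters (case and accents ignored), then by the original spelling."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

words = ["zebra", "Zulu", "\u00e9clair", "apple", "eclipse"]

# Sorting by raw code values puts capitals first and accented letters last.
print(sorted(words))
# ['Zulu', 'apple', 'eclipse', 'zebra', 'éclair']

# Sorting with the key gives an order closer to what a dictionary user expects.
print(sorted(words, key=simple_sort_key))
# ['apple', 'éclair', 'eclipse', 'zebra', 'Zulu']
```

Even this small case shows that the order of the stored codes is not the order a language community would use in a dictionary or telephone book.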

1.3.5  Conversion

Basic conversion components transform data from one encoding into another. Until the advent of Unicode, WSIs used hundreds of different encodings for their data. Some of these encodings were official standards, but others were proprietary and unique to a specific application program. Any time more than one encoding exists for a given language, there need to be conversion routines and tools as a ‘bridge’ between them. This is especially relevant in the transition from older encodings to Unicode.

Without reliable conversion, users will hesitate to migrate their data to newer systems. A lack of good conversion tools may hinder the uses that could be made of older data. This is, again, a very important topic, but there is not room for further discussion in this document.

Examples of conversion components:

  • A small program that converts text files back and forth between two encodings (such as the old MacRoman encoding and Unicode).
  • Definition files that document the differences and correspondences between encodings, and that could serve as input for a conversion program.
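The relationship between a definition file and a conversion program can be sketched as a lookup table driving two small functions. Everything in the Python example below is hypothetical: the byte assignments describe an imaginary legacy font encoding invented for illustration, not any real standard or any particular mapping file format.

```python
# Hypothetical mapping definition: legacy byte value -> Unicode character.
LEGACY_TO_UNICODE = {
    0x41: "A",
    0x61: "a",
    0x80: "\u0254",  # LATIN SMALL LETTER OPEN O
    0x81: "\u014b",  # LATIN SMALL LETTER ENG
}

# The reverse table is derived from the same definition, so both directions agree.
UNICODE_TO_LEGACY = {ch: byte for byte, ch in LEGACY_TO_UNICODE.items()}

def legacy_to_unicode(data: bytes) -> str:
    """Convert legacy bytes to Unicode text; unmapped bytes raise KeyError."""
    return "".join(LEGACY_TO_UNICODE[byte] for byte in data)

def unicode_to_legacy(text: str) -> bytes:
    """Convert Unicode text back to legacy bytes; unmapped characters raise KeyError."""
    return bytes(UNICODE_TO_LEGACY[ch] for ch in text)

legacy_data = b"a\x80\x81"
text = legacy_to_unicode(legacy_data)
assert unicode_to_legacy(text) == legacy_data  # round trip loses nothing
```

Failing loudly on unmapped values is a deliberate choice in this sketch: silent substitution during conversion is one of the things that makes users hesitant to migrate older data.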

1.4  WSI example

A real-world example can make this model, and the various components, easier to understand. There is a long history of computing for the Thai writing system. There is now a movement toward use of Unicode, away from old encodings, and that transition requires a variety of components. The diagram below describes, in brief, one system that has been developed as part of this new movement. It is one of many WSIs for Thai.

[Figure: An example WSI for Thai]



This WSI for Thai is based upon not one, but two central encodings: Unicode and an older Thai Industrial Standard (TIS620) with extensions by Microsoft (now known as TIS620-2). This reflects the current transitional state of the industry. Support for dual encodings requires conversion routines that can take text in either encoding and convert it to the other one.
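For readers who want to see such a conversion concretely, Python's standard library happens to include a `tis_620` codec, so a minimal round trip between the older encoding and Unicode can be sketched as follows. The function names are hypothetical, the sample word is chosen only for illustration, and this sketch does not cover the Microsoft extensions mentioned above.

```python
def unicode_to_tis620(text: str) -> bytes:
    """Encode Unicode text as TIS-620 bytes (one byte per Thai character)."""
    return text.encode("tis_620")

def tis620_to_unicode(data: bytes) -> str:
    """Decode TIS-620 bytes into Unicode text."""
    return data.decode("tis_620")

# The seven Thai characters of the word for "Thai language", as Unicode code points.
thai_text = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"

legacy_bytes = unicode_to_tis620(thai_text)
assert len(legacy_bytes) == 7                           # one byte per character
assert tis620_to_unicode(legacy_bytes) == thai_text     # conversion works both ways

# The same text stored in UTF-8 uses three bytes per Thai character.
assert len(thai_text.encode("utf-8")) == 21
```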

There are two software keyboard mechanisms, one for each encoding. The keyboard layout for both is the de facto typewriter standard for Thai. Users can pick which keyboard to use and switch between them depending on what encoding they wish to produce at the time. One of these is distributed by Microsoft and the other was developed by a third party.

There are many fonts available, especially for the older encoding, that translate text in the encodings into symbols on the screen or page. Because the Unicode encoding is simpler, it relies on ‘smart’ fonts to position diacritics.

Finally, a set of sorting routines has been developed for use with a linguistic database program called Shoebox. This is currently only available for the TIS encoding, but a Unicode version is in development. General sorting routines, built into Thai versions of operating systems, have been available for many years, but are not incorporated into this illustration.

Although this system architecture describes a WSI for Thai, it is very similar to WSIs used for many writing systems throughout the world.

1.5  Summary

The ability to communicate using one’s own language is a basic human right, and use of an accompanying writing system can be an important part of that communication. To do this using a computer requires a writing system implementation (WSI). There is no standard set of requirements for WSIs that is applicable to every situation. Each language will require different capabilities, although the ability to type, see, print and store information is a basic minimum.

Although some WSIs are very simple in design, most involve a wide variety of components that must be coordinated. The choice of encoding is a critical one, as all the other components directly depend on it. A minimum system must include at least encoding, keyboarding and rendering components.

The following three sections focus on the process of developing WSIs, the important roles of companies, organizations and governments in WSI development, and some basic keys to successful WSIs. Later sections go into more detail on technical topics and are particularly appropriate to those initiating or managing WSI development.

1.6  References

[UNI2003] The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA: Addison-Wesley, 2003.

Copyright notice

(c) Copyright 2003 UNESCO and SIL International Inc.




[1] SIL International’s Non-Roman Script Initiative (NRSI) is focused on enabling WSI development by making WSI components, linguistic information, development resources and training materials available to the computing community. Website: scripts.sil.org
[2] See [UNI2003]. Details on the current Unicode version can be obtained from www.unicode.org
[3] There are, however, some writing systems that have not yet been incorporated into Unicode.

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.