Short URL: https://scripts.sil.org/WSI_Guidelines_Sec_6_2

Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 2

Peter Constable, 2003-09-05

6.2 An Introduction to Unicode

Unicode is an industry standard character set encoding developed and maintained by The Unicode® Consortium. The Unicode character set has the capacity to support over one million characters, and is being developed with an aim to have a single character set that supports all characters from all scripts, as well as many symbols, that are in common use around the world today or in the past. Currently, the Standard supports over 96,000 characters representing a large number of scripts. The benefits of a single, universal character set and the practical considerations for implementation that have gone into the design of Unicode have made it a success, and it is well on the way to becoming a dominant and ubiquitous standard.
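The "over one million" figure can be checked with a little arithmetic. The short Python sketch below assumes, beyond what is stated above, that the Unicode codespace runs from U+0000 to U+10FFFF, arranged as 17 planes of 65,536 code points each; the variable names are chosen here for illustration only.

```python
import sys

# Assumption (not stated in the text above): the codespace is U+0000..U+10FFFF,
# organised as 17 planes of 65,536 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

codespace_size = PLANES * CODE_POINTS_PER_PLANE

# Python 3 exposes the highest code point it supports as sys.maxunicode.
highest_code_point = sys.maxunicode

print(f"{codespace_size:,} code points in the codespace")
print(f"highest code point: U+{highest_code_point:X}")
```

On a current Python 3 interpreter this reports 1,114,112 code points and a highest code point of U+10FFFF. Not every code point is available for assigning characters, which is consistent with the more cautious "over one million" wording.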

6.2.1 A brief history of Unicode

In order to understand Unicode, it is helpful to know a little about the history of its development. By the early 1980s, the software industry was starting to recognize the need for a solution to the problems involved with using multiple character encoding standards. Inspired by early innovative work by Xerox, the Unicode project began in 1988, with representatives from several companies collaborating to develop a single character set encoding standard that could support all of the world’s scripts. This led to the formation of the Unicode Consortium in January of 1991, and the publication of Version 1.0 of the Unicode Standard in October of the same year.

There were four key original design goals for Unicode:

To create a universal standard that covered all writing systems.
To use an efficient encoding that avoided mechanisms such as code page switching, shift-sequences and special states.
To use a uniform encoding width in which each character was encoded as a 16-bit value.
To create an unambiguous encoding in which any given 16-bit value always represented the same character regardless of where it occurred in the data.
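The third goal, a uniform 16-bit width, can be illustrated with a rough Python sketch. It counts how many 16-bit code units a character occupies in the UTF-16 encoding form. The helper name utf16_code_units and the sample characters are chosen here for illustration, and the last example assumes a fact not covered in this section: characters encoded beyond U+FFFF exist and need two 16-bit values in UTF-16.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies in UTF-16."""
    # "utf-16-be" is used so that no byte order mark is prepended.
    return len(text.encode("utf-16-be")) // 2


# Characters from three different scripts; each fits in one 16-bit value.
samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "GREEK CAPITAL LETTER OMEGA": "\u03a9",
    "DEVANAGARI LETTER KA": "\u0915",
    # Assumption: a character beyond U+FFFF, which a single 16-bit value
    # cannot represent.
    "GOTHIC LETTER AHSA": "\U00010330",
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf16_code_units(ch)} code unit(s)")
```

The first three characters each take one code unit, as the original goal envisaged, while the last takes two. This is offered only as one possible example of how a design goal can end up being qualified as a standard grows.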

How well these goals have been achieved as the Standard has developed is probably a matter of opinion. There is no question, however, that some compromises were necessary along the way. To fully understand Unicode and the compromises that were made, it is also important to understand another, related standard: ISO/IEC 10646.[1]

In 1984, a joint ISO/IEC working group was formed to begin work on an international character set standard that would support all of the world’s writing systems. This became known as the Universal Character Set (UCS). By 1989, drafts of the new international standard were starting to get circulated.

At this point, people became aware that two efforts were underway with the same aim: a comprehensive standard that everyone could use. The last thing anybody wanted was to have two such standards. As a result, in 1991 the ISO/IEC working group and the Unicode Consortium began to discuss a merger of the two standards.

The complete details of the merger were worked out over many years, but the most important issues were resolved early on. The first step was that the repertoires in Unicode and the draft 10646 standard were aligned, and an agreement was reached that the two character sets should remain aligned.

The Unicode Standard has continued to be developed up to the present, and work is still continuing with an aim to make the Standard more complete, covering more of the world’s writing systems, to correct errors in details, and to make it better meet the needs of implementers. The most current version at this time, version 4.0, was published in 2003 (The Unicode Consortium, 2003). As of this version, the Standard includes a total of 96,447 encoded characters.

6.2.2 The Unicode Consortium and the maintenance of the Unicode Standard

The Unicode Consortium is a not-for-profit organization that exists to develop and promote the Unicode Standard. Anyone can be a member of the consortium, though there are different types of membership, and not every member can vote on decisions regarding the Standard. That privilege is given only to those in the category of Full Member. There are two requirements for Full Membership: the category is open only to organizations, not to individuals; and there is an annual fee of US$12,000. At the time of writing, there are 15 Full Members.

The work of developing the Standard is done by the Unicode Technical Committee (UTC). Every Full Member organization is eligible to have a voting position on the UTC, though they are not required to participate.

There are three other categories of membership: Individual Member, Specialist Member, and Associate Member. These offer progressively greater privileges. The Associate and Specialist Member categories allow participation in the regular work of the UTC through an e-mail discussion list, the “unicore” list. All members are eligible to attend meetings.

6.2.3 Unicode and ISO/IEC 10646

The UTC maintains a liaison relationship with the corresponding body within ISO/IEC that develops and maintains ISO/IEC 10646. Any time one body considers adding new characters to the common character set, those proposals need to be evaluated by both bodies. Before any new character assignments can officially be made, approval of both bodies is required. This is how the two standards are kept in synchronization.

Because the process of developing the Unicode Standard involves interaction with ISO/IEC and the international standard ISO/IEC 10646, it is worth mentioning briefly the workings of the international standards body as it relates to Unicode. The Joint Technical Committee 1 (JTC 1) of ISO and IEC is responsible for standards related to information technologies, and the work of this technical committee is divided among multiple sub-committees. Sub-committee 2 (JTC 1/SC 2) is responsible for standards related to character encoding, and that work is divided among various working groups. Among these, working group 2 (JTC 1/SC 2/WG 2) is responsible for the development of ISO/IEC 10646.

The combined standards body ISO/IEC is an international standards organization whose members are national standards bodies from various countries. Standards bodies from any country are potentially eligible to participate in the work of any ISO or ISO/IEC technical committee or sub-committee, including work on ISO/IEC 10646.

In order to ensure quality standards that facilitate domestic and international commerce without providing unfair advantage to certain countries over others, a very formal process is used that includes several stages of review and balloting before something is published as part of an international standard. Thus, if a standards institute of a given country wishes to influence the development of ISO/IEC 10646 (and, in turn, Unicode), they should become a member of ISO/IEC, become a participating member of JTC 1/SC 2, and then actively contribute to the work by voting on ballots, preparing and commenting on draft revisions, and attending meetings of JTC 1/SC 2/WG 2 whenever possible.

6.2.4 Types of information

The Unicode Standard is embodied in the form of three types of information:

Firstly, there is the printed version of the most recent major version. At present, this corresponds to The Unicode Standard (TUS) 4.0 (The Unicode Consortium, 2003).
Secondly, the Unicode Consortium publishes a variety of documents known as Unicode Technical Reports (UTRs) on its Web site. These discuss specific issues relating to implementation of the Standard, and some even become parts of the Standard. A UTR with this status is identified as a Unicode Standard Annex (UAX). These annexes may include documentation of a minor version release, or information concerning specific implementation issues.
Thirdly, the Unicode Standard includes a collection of data files that provide detailed information about semantic properties of characters in the Standard that are needed for implementations. These data files are distributed on a CD-ROM with the printed versions of the Standard, but the most up-to-date versions are always available from the Unicode Web site. Further information on the data files is available at http://www.unicode.org/unicode/onlinedat/online.html.

Thus, the previous version of Unicode, TUS 3.2, consists of the published book for TUS 3.0, plus the UAXes that describe the minor versions TUS 3.1 and TUS 3.2 (UAX #27 and UAX #28, respectively), together with the current versions of the other annexes and data files.
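To give a feel for the kind of character properties the data files carry, here is a minimal Python sketch using the standard library's unicodedata module, which is built from the Unicode data files for whichever version of the Standard ships with the interpreter. The helper name describe and the three sample characters are chosen here for illustration; the version reported will depend on the Python build and will generally be later than the versions discussed above.

```python
import unicodedata

# Version of the Unicode data files bundled with this Python build.
print("Unicode data version:", unicodedata.unidata_version)


def describe(ch: str) -> dict:
    """Collect a few character properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


# A Latin letter, a combining mark, and a right-to-left Hebrew letter.
for ch in ("A", "\u0301", "\u05d0"):
    print(describe(ch))
```

Properties such as the general category, the combining class and the bidirectional class are the sort of information an implementation needs in order to process and display text correctly, which is why the data files are treated as part of the Standard rather than as supplementary material.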

[1] The abbreviations ISO and IEC stand for the International Organization for Standardization and the International Electrotechnical Commission, respectively. Each of these organizations is responsible for the development of international standards. They collaborate in the development of standards for information technology.

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.