
NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/WSI_Guidelines_Sec_6_1

Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 1

Peter Constable, 2003-09-05

6.1   An Introduction to Encodings

Computer systems employ a wide variety of character encodings. The most important of these is Unicode. It is also important for us to understand other encodings, however, and how they relate to Unicode. This section introduces basic encoding concepts, briefly mentions legacy encodings, and gives an introduction to Unicode. It also describes the process of interacting with the Unicode Consortium to get new characters and scripts accepted into the standard.

6.1.1   Text as numbers

Encoding refers to the process of representing information in some form. In computer systems, we encode written language by representing the graphemes or other text elements of the writing system in terms of sequences of characters, units of textual information within some system for representing written texts. These characters are in turn represented within a computer in terms of the only means of representation the computer knows how to work with: binary numbers.

A character set encoding (or character encoding) is such a system for doing this. Any character set encoding involves at least these two components: a set of characters and some system for representing these in terms of the processing units used within the computer.
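The two components can be seen directly in any modern programming language. As a quick sketch (Python is used here purely for illustration; it is not part of the original article), each character is identified by a number, and an encoding turns those numbers into bytes:

```python
# Each character is, to the computer, a number; the encoding determines
# which byte sequence represents that number.
for ch in "café":
    print(ch, ord(ch), f"U+{ord(ch):04X}")

# The same text yields different byte sequences under different encodings.
text = "café"
for enc in ("ascii", "latin-1", "utf-8"):
    try:
        print(enc, text.encode(enc).hex(" "))   # utf-8 gives: 63 61 66 c3 a9
    except UnicodeEncodeError as e:
        print(enc, "cannot represent:", e.object[e.start])
```

Note that ASCII fails on "é" entirely, while Latin-1 and UTF-8 succeed but produce different bytes, which is exactly why the character set and the byte representation must be treated as separate components.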

6.1.2   Industry standard legacy encodings

Encoding standards are important for at least two reasons. First, they provide a basis for software developers to create software that provides appropriate text behaviors. Secondly, they make it possible for data to be exchanged between users.

The ASCII standard was among the earliest encoding standards, and was minimally adequate for US English text. It was not even minimally adequate for British English, however, let alone fully adequate for English-language publishing or for almost any other language. Not surprisingly, it did not take long for new standards to proliferate. These have come from two sources: standards bodies and independent software vendors.

Software vendors have often developed encoding standards to meet the needs of a particular product in relation to a particular market. Among personal computer vendors, Apple created various standards that differed from IBM and Microsoft standards in order to suit the distinctive graphical nature of the Macintosh product line. Similarly, as Microsoft began development of Windows, the needs of the graphical environment led them to develop new codepages—ways of encoding character sets. These are the familiar Windows codepages, such as codepage 1252, alternately known as “Western”, “Latin 1” or “ANSI”.
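The practical consequence of these competing vendor codepages is that a given byte value means different things depending on which codepage is assumed. A small sketch (Python codec names used for illustration; the byte values are standard assignments in each codepage):

```python
# One byte, three codepages, three different characters.
raw = bytes([0xE9])
print(raw.decode("cp1252"))  # é  (Windows "Western" / "Latin 1")
print(raw.decode("cp1251"))  # й  (Windows Cyrillic)
print(raw.decode("cp437"))   # Θ  (original IBM PC codepage)
```

Text exchanged between systems is only interpreted correctly if both ends agree on which codepage the bytes were written in.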

The other main source of encoding standards is national or international standards bodies. A national standards body may take a vendor’s standard and promote it to the level of a national standard, or they may create new encoding standards apart from existing vendor standards. In some cases, a national standard may be adopted as an international standard, as was the case with ASCII.

It is important to understand the relationship between industry standard encodings and individual software products. Any commercial software product is explicitly designed to support a specific collection of character set encoding standards.

Of course, the problem with software based solely on standards is that, if you need to work with a character set that your software does not understand, then you are stuck. This happens because software vendors have designed their products with specific markets in mind, and those markets have essentially never included people groups that are economically undeveloped or are not accessible to the vendor. This is not unfair on the part of software vendors; they can only support something they know about and that is predictable, implying a standard.

When the available software does not support the writing systems they need to work with, linguists and others create their own solutions. They define their own character set and encoding, they "hack" out fonts that support that character set and encoding so that they can view data, they create input methods (keyboards) for it so that they can enter data, and then they go to work.

Such practice is quite a reasonable thing to do from the perspective of doing what it takes to get work done. People who have needed to resort to this have been quite resourceful in creating their own solutions. There is a dark side to this, however. Although the user has defined a custom codepage, the software they are using is generally still assuming that some industry standard encoding is being used.

The most serious problem with custom codepages, which affects data archiving and interchange, is that the data is useless apart from a custom font. Dependence on a particular font creates lots of hassles when exchanging data: you always have to send someone a font whenever you send them a document, or make sure they already have it.
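The failure mode is easy to demonstrate. In the sketch below (a hypothetical example, not from the original article), imagine a custom codepage whose creator decided that byte 0xF0 means "ŋ" and drew an ŋ glyph at slot 0xF0 of a custom font. Software that does not know this convention falls back to a standard codepage and silently misreads the data:

```python
# Data written under a hypothetical custom codepage where 0xF0 = ŋ.
# With the matching hacked font installed, it *displays* as "siŋ".
hacked_bytes = b"si\xf0"

# Any standards-based software assumes an industry codepage instead,
# so the same bytes decode to something else entirely:
print(hacked_bytes.decode("cp1252"))  # sið  -- ð, not the intended ŋ
```

The bytes themselves carry no record of the custom convention; only the font (or out-of-band knowledge) makes the data readable, which is exactly the archiving and interchange hazard described above.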

One context in which this is a particular problem is the Web. People often work around the problem by using Adobe Acrobat (PDF) format, but for some situations, including the Web, this is a definite limitation. If data is being sent to a publisher, who will need to edit and typeset the document, Acrobat is not a good option [1]. Custom codepages are especially a problem for a publisher who receives content from multiple sources since they are forced to juggle and maintain a number of proprietary fonts. Furthermore, if they are forced to use these fonts, they are hindered in their ability to control the design aspects of the publication.

The right way to avoid all of these problems is to follow a standard encoding that includes these characters. This is precisely the type of solution that is made possible by Unicode, which is being developed to have a universal character set that covers all of the scripts in the world.
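With Unicode, each character has a single, standardized code point, so text from many scripts can coexist in one document with no custom-font conventions needed to keep the data meaningful. A brief sketch (Python used for illustration):

```python
import unicodedata

# Characters from several scripts in one string, one standard encoding.
text = "Latin ŋ, Greek Ω, Cyrillic Д, Ethiopic ሀ"
data = text.encode("utf-8")
assert data.decode("utf-8") == text  # round-trips without any side agreement

# Every character carries a standardized identity, independent of any font:
for ch in "ŋΩДሀ":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Because the encoding itself records which character is which, the data remains interpretable by any Unicode-aware software, with no dependence on a particular font.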

Copyright notice

© Copyright 2003 UNESCO and SIL International Inc.






[1] Once a document is typeset and ready for press, however, Acrobat format is generally a good option.

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.