Implementing Writing Systems: An introduction

Character set encoding basics

Understanding character set encodings and legacy encodings

Peter Constable, 2001-06-13

In understanding technologies for working with multilingual and multi-script text data, we need to start with an understanding of character encoding. Systems for working with text involve a collection of processes that work together: processes for creating and editing text, for presenting it, for sorting, for laying out paragraphs and wrapping at line breaks, and so on. Character encoding is what ties all of these processes together.

Computer systems employ a wide variety of character encodings. The most important of these for us is Unicode. It is also important for us to understand other encodings, however, and how they relate to Unicode. In this section, I want to look at some basic concepts that relate to all encodings, and also give an overview of legacy encodings and their importance for us.

1 Text as numbers

Encoding refers to the process of representing information in some form. Human language is an encoding system by which we represent information in terms of sequences of lexical units, and those in terms of sound or gesture sequences. Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up some writing system.

In computer systems, we encode written language by representing the graphemes or other text elements of the writing system in terms of sequences of characters, units of textual information within some system for representing written texts. These characters are in turn represented within a computer in terms of the only means of representation the computer knows how to work with: binary numbers. A character set encoding (or character encoding) is such a system for doing this.

Any character set encoding involves at least these two components: a set of characters and some system for representing these in terms of the processing units used within the computer. (Notice that these correspond to the two levels of representation described in the previous paragraph.) There is, of course, no predetermined way in which this is done. The ASCII standard is one system for doing this, but not the only way, with the result that a number stored in the computer’s data can mean different things depending upon the conventions being assumed:

Figure 1. Data interpreted according to selected codepage
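
The point the figure illustrates is easy to reproduce. Here is a minimal sketch using Python's bundled codecs (all four codepages named below ship with Python), showing one byte interpreted four different ways:

    # One and the same byte value means a different character under each codepage.
    b = b"\xe9"
    for codepage in ("cp1252", "cp437", "cp1251", "cp037"):
        print(codepage, "->", b.decode(codepage))
    # cp1252 -> 'é', cp437 -> 'Θ', cp1251 -> 'й', cp037 (an EBCDIC variant) -> 'Z'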



I will explain in more detail later the relationship between the character set and the system for encoding those characters in terms of the computer’s processing units. For the moment, I will use the terms codepage to mean a specified set of characters with a specified encoded representation for each character, and codepoint to mean the encoded representation used for a given character. These are temporary approximations. I will review the meaning of these terms in Section 7.

2 What the numbers mean: character properties

When we talk about interpretation, or what the representational units mean, we are reversing the encoding process. Thus, if we are assuming the ASCII system, then the number 97 within text data means to us the Latin letter “a” as would be used in spelling the English word “cat”, for example. For the user, the numbers mean the text elements in the writing system they are working with. The user thinks of them in terms of their function within the written language.1

The computer does not understand written language, however. To the computer, these are just numbers. A programmer has the task, however, of making it appear to the user that the computer thinks about these numbers the same way the user does—in terms of the writing system. This is done in the way the various text processes are designed; for instance, in the way words are sorted or line breaks are inserted when laying out paragraphs.

A character set encoding has another important aspect, therefore. As we saw, each character in the character set is assigned some numeric representation. In addition, each character has various properties and relationships to other characters that affect how software treats it in performing various processes.

For example, consider ASCII again. The numbers 65 to 90 and 97 to 122 are used to represent the upper and lowercase letters of the English alphabet, which are all word-forming characters. Any text processing software designed to work with the ASCII encoding knows (or should know) that it must as far as possible avoid breaking lines between any pair of these numbers. Similarly, the number 44 represents a comma, which is word-trailing punctuation, and the software generally is designed to know that it should not break a line between it and a preceding alphabetic character, though it is allowed to break a line after the comma. The software may also be designed to know about certain relationships between characters, such as the case relationship between 65 “A” and 97 “a”.

Note that all of these kinds of properties and relationships are dependent upon the particular encoding standard being used. In ASCII, the number 65 represents “A” and is the case pair to 97, but in IBM’s EBCDIC standard, 65 is undefined and 97 is the forward slash “/”, while “A” is represented using the number 193, with its case pair “a” represented as the number 129.
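
This contrast can be checked directly, since Python's codec library includes cp037, a common EBCDIC variant (one of several; the exact variant is an assumption here):

    # The same numbers carry different characters in ASCII and in EBCDIC (cp037).
    print(bytes([65, 97]).decode("ascii"))      # -> 'Aa'
    print(bytes([0xC1, 0x81]).decode("cp037"))  # 193, 129 -> 'Aa' in EBCDIC
    print(bytes([0x61]).decode("cp037"))        # 97 -> '/' in EBCDIC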

Note also that some properties or relationships may be dependent upon more than just the encoding. For example, sorting and case mappings may vary from language to language. So, for instance, when sorting ASCII-encoded data, the character sequence “ch” comes after “cg” and before “ci” if the data is English, but in traditional Spanish collation “ch” is a distinct grapheme with its own place in the sort order: it sorts after “cz” and before “da”.
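
As a rough illustration of how such a rule lives above the encoding itself, here is a hypothetical sort key for traditional Spanish collation in Python. The function name and the weighting scheme are my own invention for this sketch; real software would use a proper collation library such as ICU:

    # Treat the digraph "ch" as a single letter that sorts after every
    # plain "c" sequence (hence after "cz") and before "d".
    def traditional_spanish_key(word):
        key, i, w = [], 0, word.lower()
        while i < len(w):
            if w[i:i + 2] == "ch":
                key.append((ord("c"), 1))  # weight above any (c, 0) pair
                i += 2
            else:
                key.append((ord(w[i]), 0))
                i += 1
        return key

    print(sorted(["da", "ci", "ch", "cz", "cg"], key=traditional_spanish_key))
    # -> ['cg', 'ci', 'cz', 'ch', 'da']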

3 Industry standard legacy encodings

Encoding standards are important for at least two reasons. First, they provide a basis for software developers to create software that provides appropriate text behaviours, as described above. Secondly, they make it possible for data to be exchanged between users.

The ASCII standard was among the earliest encoding standards, and was minimally adequate for US English text. It was not even minimally adequate for British English, however, let alone fully adequate for English-language publishing or for almost any other language. Not surprisingly, it did not take long for new standards to proliferate. These have come from two sources: standards bodies, and independent software vendors.

Software vendors have often developed encoding standards to meet the needs of a particular product in relation to a particular market. For example, IBM developed codepage 437 for DOS, taking ASCII and adding characters needed for British English and some major European languages, as well as graphical characters such as line-drawing characters for DOS applications to use in creating user interface elements. Other DOS codepages were created for markets in which different languages or scripts were used: codepage 852 for Eastern European languages that use Latin script, codepage 855 for Russian and some other Eastern European languages that use Cyrillic script, and so on.

At the same time, other vendors were creating other encoding standards to meet the needs of their products and their clients. Within the publishing industry, there were special encoding needs even for English that encodings like ASCII and IBM codepage 437 did not meet (to handle the demands of high-quality typesetting). These were generally used internally, however, and did not come into widespread usage. Most other encoding standards came from other hardware vendors, often mainframe developers. For instance, even within IBM, the mainframe division was developing its own codepages that were different from those used for DOS.

Among personal computer vendors, Apple created various standards that differed from IBM and Microsoft standards in order to suit the distinctive graphical nature of the Macintosh product line. Similarly, as Microsoft began development of Windows, the needs of the graphical environment led them to develop new codepages. These are the familiar Windows codepages, such as codepage 1252, variously known as “Western”, “Latin 1” or “ANSI”.2

Occasionally, an encoding developed by an individual vendor for use in their software becomes widely used and, as a result, a de facto industry standard. For example, because of the importance of DOS, the IBM codepage 437 is supported on many other platforms. In fact, even though Windows uses a newer set of codepages, support for DOS was built in at a deep level in order to provide backward compatibility with DOS applications. Similarly, the Big Five standard was developed by a group of five vendors in Taiwan, but has since been adopted by others and become the de facto standard for encoding Traditional Chinese, being used more commonly than even the Taiwanese national standards.3

The other main source of encoding standards is national or international standards bodies. A national standards body may take a vendor’s standard and promote it to the level of a national standard, or it may create new encoding standards apart from existing vendor standards. In some cases, a national standard may be adopted as an international standard. This was the case with ASCII, for example, which began as a US standard (published as ANSI X3.4-1968) but was later adopted by the International Organization for Standardization (ISO) as the International Reference Version of ISO 646. In other cases, standards may originate at the international level, as has been the case with a series of ISO encoding standards known as ISO 8859.

Most of these legacy encoding standards encode each character in terms of a single 8-bit processing unit, or byte. Not all hardware architectures have used an 8-bit byte, however; different architectures have used bytes that range anywhere from 6 bits to 36 bits.4 For virtually all character encoding standards that affect personal computers, however, 8-bit bytes are the norm.

It is not always the case, however, that characters are encoded in terms of just a single 8-bit value. For example, Microsoft codepages for Chinese, Japanese and Korean use so-called double-byte encodings, which use a mixture of one- and two-byte sequences to represent a character. To illustrate, in codepage 950, used for Traditional Chinese, the byte value 0x73 by itself represents LATIN SMALL LETTER S, but certain two-byte sequences ending in 0x73 represent different characters; for example, the two-byte sequence 0xA4 0x73 represents the Traditional Chinese character for ‘mountain’.
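
Python’s cp950 codec mirrors this behaviour, so the example can be checked directly (a minimal sketch):

    # In cp950, byte 0x73 alone is 's', but as the trailing byte of the
    # pair 0xA4 0x73 it is part of the two-byte character for 'mountain'.
    print(b"\x73".decode("cp950"))      # -> 's'
    print(b"\xa4\x73".decode("cp950"))  # -> '山'
    print("山".encode("cp950"))         # -> b'\xa4s' (0x73 displays as 's')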

I will discuss different approaches to encoding in more detail in Section 5.3.

Of course, with all of these different encoding standards, it is not surprising that in many cases a given standard may offer support for certain characters that others do not. None of these legacy standards is comprehensive. It is also true in many cases that multiple encoding standards support the same character sets but in incompatible ways. Both of these problems are being addressed today in the Unicode standard. Even if a software developer designs a product to use Unicode, however, the legacy5 standards cannot simply be ignored: a lot of existing data will continue to be encoded in terms of legacy standards, and the application will have to interact with a lot of existing software implementations that still use them.

Before leaving this discussion of industry standard encodings, it would probably be helpful for me to explain Windows codepages and the term ANSI. When Windows was being developed, the American National Standards Institute (ANSI) was in the process of drafting a standard that eventually became ISO 8859-1 “Latin 1”. Microsoft created codepage 1252 for Western European languages based on an early draft of the ANSI proposal, and began to refer to it as “the ANSI codepage”. Codepage 1252 was finalised before ISO 8859-1, however, and the two are not the same: codepage 1252 is a superset of ISO 8859-1.
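
The difference is confined to the 0x80–0x9F range, which ISO 8859-1 reserves for C1 control codes and codepage 1252 fills with printable characters. A quick check in Python:

    # 0x93 and 0x94 are curly quotation marks in cp1252 but invisible
    # C1 control characters in ISO 8859-1 (latin-1).
    b = b"\x93\x94"
    print(b.decode("cp1252"))   # -> '“”' (U+201C, U+201D)
    print(b.decode("latin-1"))  # -> '\x93\x94' (control characters)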

Later, apparently around the time of Windows 95 development, Microsoft began to use the term “ANSI” in a different sense to mean any of the Windows codepages, as opposed to Unicode. Therefore, currently in the context of Windows, the terms “ANSI text” or “ANSI codepage” should be understood to mean text that is encoded with any of the legacy 8-bit Windows codepages rather than Unicode. It really should not be used to mean the specific codepage associated with the US version of Windows, which is codepage 1252.

4 Industry standards versus custom encodings

It is important to understand the relationship between industry standard encodings and individual software products. Any commercial software product is explicitly designed to support a specific collection of character set encoding standards. All of its text processes will be performed assuming one of these encodings to be active. This has an important bearing on data: if data is not adequately labelled to identify its encoding, or if it is incorrectly labelled, the data will undergo inappropriate processing leading to unexpected results. This has been a particular problem with information on the Web: authors create HTML content, but do not realise that they need to specify the encoding being used.

Of course, the problem with software based on standards is that, if you need to work with a character set that your software does not understand, then you are stuck. For linguists working with minority languages, this has usually been the case. This happens because software vendors have designed their products with specific markets in mind, and those markets have essentially never included people groups that are economically undeveloped or are not accessible to the vendor. This is not unfair on the part of software vendors; they can only support something they know about and that is predictable, implying a standard.

We all know what linguists and others do when the available software does not support the writing systems they need to work with: they create their own solutions. They define their own character set and encoding, they “hack” out fonts that support that character set using that encoding so that they can view data, they create input methods (keyboards) to support that character set and encoding so that they can create data, and then go to work.

Such practice is quite a reasonable thing to do from the perspective of doing what it takes to get work done. People who have needed to resort to this have been quite resourceful in creating their own solutions. This includes building specialised tools to facilitate it, such as the SIL Encore font system and Tavultesoft Keyman 3.2. There is a dark side to this, however. Although the linguist has defined a custom codepage, the software they are using is generally still assuming that some industry standard encoding is being used.

For example, consider someone working with Microsoft Word on a US version of Windows 98, but using a “hacked” font based on a custom codepage such as the one shown in Figure 2:6

Figure 2. A custom-defined codepage



We all know that Word has features that help with things like capitalisation. But how does Word know how to perform a process like that? It does it in terms of the character semantics that it is assuming, which in this case would typically be that for Windows codepage 1252, shown in Figure 3 (with all the characters that were redefined in the codepage in Figure 2 highlighted).

Figure 3. A standard codepage (Windows codepage 1252)



So, consider what will happen when the user enters any of the characters from 0xE0 to 0xEF at the start of a sentence: Word will change the character to the corresponding character from the range 0xC0 to 0xCF.
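In effect, the software decodes the byte under its assumed codepage, applies the case mapping, and re-encodes. A minimal Python sketch of that chain of events:

    # What autocapitalisation amounts to when cp1252 semantics are assumed:
    byte_from_custom_font = b"\xe0"                      # means something else to the user
    as_cp1252 = byte_from_custom_font.decode("cp1252")   # -> 'à'
    print(as_cp1252.upper().encode("cp1252"))            # -> b'\xc0', i.e. 0xC0
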

For the linguist, this behaviour is a nuisance. People who use such custom codepages often complain about Microsoft, saying that its software assumes too much or that it messes up their data. We need to keep in mind, though, that vendors have to design their software around some set of standards, and in this situation the behaviour is a feature from which far more of Microsoft’s customers benefit than suffer. Fortunately for the people using the codepage in Figure 2, Microsoft made it possible to turn this feature off.

There are a number of other things that can go wrong that the user might not be able to control, however, such as the following:

  • Software may use certain characters for special display purposes. For instance, Word uses 0xB0 to display a non-breaking space when in the display-non-printing mode. When using a “hacked” font, the wrong symbol will appear.
  • In certain contexts, software might change the dollar sign to other currency symbols at 0xA3 or 0xA5.
  • Many applications have a “smart quotation mark” feature, which often leads to undesired results when using a “hacked” font.
  • The character at 0xAD is the soft hyphen. Some software might only display this when it occurs as the last character on a line. This happens, for instance, with Macromedia Freehand.
  • Software may apply special behaviour to codepoints for non-word-forming characters. This applies to many applications in relation to line-breaking and editing operations, such as cursor movement using Ctrl+Arrow combinations.
  • Software may apply encoding conversions to data. This creates a major problem with custom-codepage data whenever data is exchanged, for example, between Windows and the Mac. (This works fairly smoothly when you conform to the standards.)
  • Software may perform certain character translations into alternate representations. For instance, when saving to plain text, some versions of Word will convert characters such as the copyright sign (0xA9 in codepage 1252) into equivalent ASCII representations (0x28, 0x63, 0x29). This can result in garbage data when using a custom-encoded font.
  • Codepoints that are undefined in codepage 1252, such as 0x8F, may work on some platforms (e.g. Windows 9x) but not on others (e.g. Windows 2000).

All of these behaviours are the result of the software assuming the standards which it was designed to work with, and can lead to undesired results when going against the standard by using a custom-encoded font.
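
The last item in the list above is easy to demonstrate: a strict decoder simply rejects codepoints that a codepage leaves undefined. A sketch using Python, whose cp1252 codec is strict by default:

    # 0x8F is unassigned in codepage 1252; strict decoders reject it,
    # which is one way the same bytes can "work" on one system and fail on another.
    try:
        b"\x8f".decode("cp1252")
    except UnicodeDecodeError as err:
        print(err)  # 'charmap' codec can't decode byte 0x8f ...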

One way to avoid these problems would be to have software that does not make any assumptions about the codepage or, even better, allows the user to specify what the semantics associated with each codepoint should be. This kind of thing would generally be risky for a commercial vendor since it makes it possible for the user to mess things up, leading to problems in user support. As a result, this is usually made possible only in software developed by linguists or language workers for their own work.

Another way to avoid these problems is to use “symbol-encoded” fonts. These fonts on Windows use a special encoding reserved for only symbol fonts. When a symbol-encoded font is applied to a run of text, an application has no way of knowing what the characters are supposed to be, and so might apply some set of default behaviours. This can lead to problems of its own, however. For example, in Word 97, line breaking behaviour changes dramatically when the font is changed to a symbol-encoded font.

Both of these solutions fail to address what is perhaps the most serious problem with custom codepages, which affects data archiving and interchange: the data is useless apart from a custom font. Since it is easy for data to get separated from the font that documents its meaning, the data inherently lacks robustness, which is critical for archiving. Even today, it is probably easier to read data off punch cards or paper tape than it would be to try to reconstruct what the bytes mean if the encoding and character set are not documented. Usually, however, the documentation consists only of a font.

Dependence on a particular font creates lots of hassles when exchanging data: you always have to send someone a font whenever you send them a document, or make sure they already have it. This is a particular problem on the Web. People often work around it by using Adobe Acrobat (PDF) format, but that is a definite limitation in many situations: if data is being sent to a publisher, who will need to edit and typeset the document, Acrobat is not a good option.7 Custom codepages are especially a problem for a publisher who receives content from multiple sources, since they are forced to juggle and maintain a number of proprietary fonts. Furthermore, if they are forced to use these fonts, they are hindered in their ability to control the design aspects of the publication.

These difficulties are compounded since custom codepages have a tendency to proliferate as different users working on the same language create incompatible codepages for the same writing system. For example, there are a number of different custom codepages that have been developed for Biblical Hebrew. Consider the plight of an editor of a journal on Biblical studies who receives manuscripts using a dozen different codepages.

The right way to avoid all of these problems is to follow a standard encoding that includes these characters. This is precisely the type of solution that is made possible by Unicode, which is being developed to have a universal character set that covers all of the scripts in the world.

5 Character set encoding model

Before leaving this discussion of character encoding basics and legacy encodings, I would like to elaborate on the complete model needed to describe character sets and encodings. Earlier I said that a character set encoding involves at least two components: a set of characters, and a system for their encoded representation in the computer. A more complete model is actually necessary, involving four different levels of representation: the abstract character repertoire, the coded character set, the character encoding form, and the character encoding scheme.8 What will be new to most readers is the distinction between a coded character set and a character encoding form. This distinction is not important for many character encodings that people commonly use, including custom codepages, but it is important for some industry standards and for understanding Unicode in particular.

Note that the terms “codepage” and “codepoint” are commonly used with different meanings. In this section, I will provide meanings for these as they are defined and used in The Unicode Standard. The use of these terms will be revisited in Section 7.

5.1 Abstract character repertoire (ACR)

An abstract character repertoire (ACR) is simply an unordered collection of characters to be encoded. In a given standard, the repertoire may be closed, meaning that it is fixed and cannot be added to, or it may be open, meaning that new characters can be added to it over time (though not necessarily indefinitely: an ACR might be open yet also have some imposed limits regarding the total number of characters that might eventually be added). For example, Unicode has an open repertoire, which is regularly added to in order to make the standard more universal.

The reference to characters being abstract reflects a few things. First, they are not things that exist directly in a computer system. In fact, they are not even concrete objects in the real world. Rather, they are notional objects, like the notion of the letter “f” that corresponds to the last letter in the last word of this sentence, rather than the letter itself. Furthermore, they do not necessarily correspond to graphemes from a given writing system. For example, the tilde in the Spanish letter “ñ” might be treated as a distinct character such that the Spanish grapheme is represented in terms of a sequence of characters, < n, ~ >. Finally, it should also be noted that a character need not necessarily be a graphic object. For instance, an abstract character repertoire might include a zero-width space, which has no graphic representation but rather is a control character that controls the behaviour of runs of text in processes such as line wrapping.
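
Unicode’s normalization forms make the “ñ” example concrete: NFD decomposes the single encoded character into the sequence < n, combining tilde >. A short Python check:

    import unicodedata

    # One abstract grapheme, two possible character sequences.
    composed = "\u00f1"                                # 'ñ' as one character
    decomposed = unicodedata.normalize("NFD", composed)
    print(len(composed), len(decomposed))              # -> 1 2
    print([hex(ord(c)) for c in decomposed])           # -> ['0x6e', '0x303']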

5.2 Coded character set (CCS)

The second level in the model is the coded character set (CCS). A CCS is merely a mapping from some repertoire to a set of unique numeric designators. Typically, these are integers, though in some standards they are ordered pairs of integers.

The numeric designator is known as a codepoint, and the combination of an abstract character and its codepoint is known as an encoded character. It is important to note that these codepoints are not tied to any representation in a computer. The codepoints are not bytes; they are simply integers (or pairs of integers). The range of possible codepoints is typically defined in a standard as being limited. The valid range of codepoints in an encoding standard is referred to as the codespace. The collection of encoded characters in a standard is referred to as a codepage.9

Any properties that a standard defines for characters will be specified in relation to encoded characters in the CCS. Typically, a standard will specify unique names for encoded characters in addition to the unique codepoints, such as “LATIN SMALL LETTER A”. In practice, it makes sense for the character names and any other properties of characters to be considered part of the definition of the encoded characters in the CCS.

Before going on to the third level, it is important to note that some industry standards operate at the CCS level. They standardise a character inventory and perhaps a set of names and properties, but they do not standardise the encoded representation of these characters in the computer. This is the case, for example, with several standards used in the Far East, such as GB2312-80 (for Simplified Chinese), CNS 11643 (for Traditional Chinese), JIS X 0208 (for Japanese) and KS X 1001 (for Korean). These standards depend upon separate standards for encoded representation, which is where the next level in our model fits in.

5.3 Character encoding form (CEF)

The third level is the character encoding form (CEF). It is at this level that we begin to take into consideration actual representation in a computer. A CEF is a mapping from the codepoints in a CCS to sequences of values of a fixed data type. These values are known as code units. In principle, the code units can be of any size: they might be 7-bit values, 8-bit values, 19-bit values, or whatever. The most commonly encountered sizes for code units are 8, 16 and 32 bits, though other sizes are also used.

In some contexts, a CEF applied to a particular coded character set is referred to as a codepage. (For further discussion, see Section 7.)

The mapping between codepoints in the CCS and code units in the CEF does not have to be one-to-one. In many cases, an encoding form may map one codepoint to a sequence of multiple code units. This occurs in so-called “double-byte” encodings, like Microsoft codepages 932 or 950. An encoding form also does not have to map characters to code unit sequences of a consistent length. We saw an example of this earlier in the case of codepage 950: the Roman character “s” is encoded as a single byte 0x73, but the Han character is encoded as a sequence of two bytes, 0xA4 0x73. Moreover, an encoding is not limited to a maximum of two code units. For example, in the EUC-TW standard,10 characters are encoded using sequences of one to four bytes in length. One thing is required of a CEF, however: the mapping for any given codepoint must be a unique code unit sequence.

An encoding form may be stateless, meaning that any of the characters it encodes, or any of the sequences it uses, are available at any time. Some are modal, however, involving alternate states so that in a given state only a portion of the characters can be encoded unless the state is changed. The way this works is that the encoding form will specify a particular code unit sequence to indicate a change of state. Once that sequence is encountered in data, all subsequent data is interpreted in accordance with that particular state until another special code unit sequence is encountered that results in a change to a different state.

For example, the HZ encoding11 works this way. In the default state, bytes are interpreted as ASCII. Certain shift sequences are defined to shift to a different set of characters. The sequence 0x7E 0x7B (in ASCII, “~{”) causes subsequent data up to the next shift sequence or the beginning of the next line to be interpreted in terms of the GB2312-80 character set (for Simplified Chinese). The sequence 0x7E 0x7D (“~}”) shifts back to ASCII.
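
Python happens to ship an “hz” codec, so the shift sequences can be observed directly (a sketch; the exact bytes depend on the characters encoded):

    # The ASCII shift sequences "~{" and "~}" bracket the GB2312 data.
    mixed = "abc \u4f60\u597d abc"     # ASCII text plus 你好
    encoded = mixed.encode("hz")
    print(encoded)                     # shift sequences visible around the Chinese bytes
    print(encoded.decode("hz"))        # round-trips to the original string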

Some coded character sets are generally used with only one encoding form. For all common legacy character sets other than those used for the Far East, the codespace fits within a single-byte range, and so the encoded representation can easily be made identical in value to the codepoint.12 For such character sets there is little incentive to look for another encoding form. Among East Asian standards, the Big Five character set (for Traditional Chinese) is generally encoded using the Big Five encoding.

Similarly, some encoding forms are used only with certain character sets, as is the case with Big Five encoding, or with the UTF-8 encoding form of Unicode.

On the other hand, some character sets are often encoded in various encoding forms. For example, the GB 2312-80 character set can be encoded using the GBK encoding, using ISO 2022 encoding, or using EUC encoding. Also, some encoding forms have been applied to multiple character sets. For example, there are variants of EUC encoding that correspond to the GB 2312-80 character set, CNS 11643-1992, JIS X 0208, and several other character sets.

An encoding form may even be used for multiple character sets at once. The ISO 2022 standard was created with this specifically in mind. ISO maintains a registry of character sets that can be used with ISO 2022 encoding, and assigns to each a particular escape sequence. By default, ISO 2022 data is interpreted in terms of the ASCII character set. When any of the various escape sequences are encountered, subsequent data is interpreted in terms of the corresponding character set until a new escape sequence is encountered or the default state is restored.

The ISO 2022 standard was introduced to provide a single encoding that could potentially support all character sets.13 It uses a number of mechanisms to control the way data is interpreted, and as a result is fairly complicated. Fortunately, there is a better way: Unicode.

Unicode has a single coded character set that can be fully represented in terms of any of three encoding forms. The three encoding forms correspond to three data type sizes: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units. UTF-32 represents each codepoint in terms of a single code unit, but UTF-8 and UTF-16 use sequences of one or more code units. The Unicode character set and encoding forms are described in detail in Whistler and Davis (2000) and in Understanding Unicode.
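
The relationship between codepoints and code units in the three forms can be seen by encoding a few characters of different “sizes” (a sketch; bytes.hex with a separator needs a reasonably recent Python):

    # One codepoint, three different code unit sequences.
    for ch in ("a", "\u20ac", "\U00010437"):   # 'a', '€', '𐐷'
        print(hex(ord(ch)),
              ch.encode("utf-8").hex(" "),        # one to four 8-bit units
              ch.encode("utf-16-be").hex(" ", 2), # one or two 16-bit units
              ch.encode("utf-32-be").hex(" ", 4)) # always one 32-bit unit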

5.4 Character encoding scheme (CES)

The last level in the model is the character encoding scheme (CES). As computer technology has evolved, the 8-bit byte has come to have a particular significance in many systems. In particular, at their lowest levels, many systems for data storage and transmission operate in terms of 8-bit bytes. When 16-bit or 32-bit data units are brought into an 8-bit byte context, the data units can easily be split into 8-bit chunks since their size is an integer multiple of 8 bits. There are two logical possibilities for the order in which those chunks can be sequenced, however: little-endian, meaning the low-order byte comes first; and big-endian, meaning the high-order byte comes first. As a result, for 16-bit and 32-bit encoding forms, there is ambiguity regarding how the data is stored or transmitted.14 A character encoding scheme simply resolves this ambiguity by specifying which byte sequencing order is used for the given encoding form.
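
In Python, the byte order shows up directly in the serialised bytes; the plain “utf-16” codec also writes a byte order mark (BOM) in the machine’s native order:

    # The same 16-bit code unit, serialised in both byte orders.
    print("\u20ac".encode("utf-16-le").hex(" "))  # -> 'ac 20' (little-endian)
    print("\u20ac".encode("utf-16-be").hex(" "))  # -> '20 ac' (big-endian)
    print("\u20ac".encode("utf-16").hex(" "))     # BOM first, then native order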

Apart from Unicode, 16- and 32-bit encoding forms are not that common, though others do exist. As a result, the issue of byte sequence order does not typically come up except in the case of Unicode. Unicode supports both byte orders for the UTF-16 and UTF-32 encoding forms. For further details, see Whistler and Davis (2000) or Constable (2001).

5.5 Character encoding model (summary)

Let me quickly review the four levels of the character encoding model:

  • At the deepest level, we have an abstract character repertoire, which is an unordered set of things we want to represent.
  • At the next level, we have a coded character set, in which each abstract character in the repertoire is assigned a numerical codepoint, resulting in an encoded character. We also specify a codespace, which is the range of possible numerical values that can be used as codepoints.
  • A character encoding form assigns each codepoint in the coded character set to a sequence of code units in a data type of some specific size, usually 8-, 16- or 32-bit integers.
  • In the case of 16- or 32-bit encoding forms, a character encoding scheme resolves the ambiguity in byte order by specifying whether big-endian or little-endian order is used.

6 Know your standards

We have seen that there are a number of industry standards for character sets and for encodings. We have seen the importance of character semantics in determining how software processes data, and how this can create issues for custom-encoded data. We have also had a sneak preview of Unicode and seen that it can solve many of the problems we have faced using legacy encodings. In whatever situation we are working, we are likely to be affected by several character set or encoding standards—or non-standards. The relationships that these have to one another can create challenges for us.

It is important, therefore, that we understand the character set and encoding standards that affect us. You need to understand the standards that are assumed by the software systems that you use. This includes operating systems and applications, but it may also include less obvious things like email systems and remote Web servers. You also need to understand the standards that are commonly used in the region in which you work, or that are used by people with whom you exchange data. In particular, you need to understand how those standards relate to or interact with the character set encodings you rely on, especially if you depend upon custom-defined codepages.

For most of us, this means at least understanding the Windows codepages. If you work outside of the Americas or Western Europe, make sure you understand the codepage that applies to your region (if any) in addition to codepage 1252. A good source of reference information on Windows codepages is available at http://www.microsoft.com/globaldev/. If you work with a Mac or exchange data with people who do, you need to understand the relevant Mac codepages in addition to the relevant Windows codepages.

If you work with higher-level data protocols—email protocols, HTML or XML, RTF or the like—then you need to be familiar with the way character set encoding standards are defined or identified in those contexts. On the Internet, many systems, including email systems, define protocols that relate to character set encodings in RFC documents.15 You may need to read the relevant RFCs to understand what encodings are used in a given context.

In HTML 4.0 and later and in XML, the Unicode character set is always used,16 but different encoding forms can be used. In HTML and HTTP, however, the encoding forms are referred to as charsets. The identifiers are basically the same.

The list of valid codepages (charsets) and their identifiers that can be used in Internet and World Wide Web protocols is maintained by the Internet Assigned Numbers Authority (IANA), and can be found at http://www.iana.org/assignments/character-sets.

Microsoft Internet Explorer understands a fairly large number of different codepages used in HTML and HTTP. Some useful information is provided at http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp. Apple provides some similar information at http://developer.apple.com/techpubs/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/index.html.

In the Win32 programming interfaces and in RTF as used with Microsoft Word, codepages can be referred to in terms of font variants, in which case they are expressed as charsets. In this case, however, charsets are given numerical identifiers. For example, codepage 1252 corresponds to charset 0, and the “symbol” encoding corresponds to charset 2. Charset values are documented in the RTF specification, though incompletely (it does not tell you how these relate to codepages, though you can probably work that out). See http://msdn.microsoft.com/library/specs/rtfspec_6.htm#rtfspec_10 (search for the second instance of “fcharset”). For RTF, you should also be familiar with the “ansi” and related keywords. See http://msdn.microsoft.com/library/specs/rtfspec_6.htm#rtfspec_8.

Information technologies are quickly evolving to embrace Unicode. This is good news, since the details related to legacy character set encodings are difficult to understand, let alone to work with. No matter where you work or what kind of systems you work with, Unicode will begin to have an impact on you soon; most likely, it already has. Anyone implementing or supporting text processing systems definitely needs to become familiar with Unicode. To that end, I have written a detailed introduction to Unicode in Constable (2001) to help people start to learn the things they need to know.

7 Postscript on terminology

There are two terms that have been used throughout this section that are often used in different senses: codepoint and codepage. In Section 5.2, these were defined as relating strictly to the coded character set level in the character encoding model.

In common, informal usage, however, these terms are used to refer to the character encoding form, or to a combination of the CCS and the CEF levels. As a result, these terms are ambiguous. Strictly speaking, the term “codepage” is best used to mean just a particular CCS, but people often understand it to mean the combination of a CCS and a CEF. For instance, this is how it is used in relation to Microsoft Windows. As a case in point, Microsoft “codepage 932” corresponds to the JIS X 0208:1997 character set with the Shift-JIS encoding. This is also how I was using the term earlier in this section.

Likewise, a codepoint is strictly the numeric designator for a character used in the coded character set. However, it is often used informally to mean an abstract character together with the code unit sequence used to represent it, or to mean just a code unit sequence.

The informal meanings probably stem from two causes. First of all, the people using these terms in this way are involved in user support or are implementing support for particular writing systems, and they need to relate to things that are practical. Thus, they are inclined to think in terms of things that are tangible rather than abstract. For them, the two things that are most tangible about a character are the shape that appears on the screen and the sequence of bytes that get stored in a file to represent it. These correspond closely to the identity of the character, which relates to the CCS, and to the representation of the character, which relates to the CEF. Secondly, most users are familiar with non-East Asian legacy text systems in which characters are encoded in terms of code units that are indistinguishable from codepoints. In these situations, there is no practical distinction between the CCS and the CEF, and so people are not confronted with the need for a distinction.

In working with East Asian standards and also with Unicode, however, the distinction between the CCS and the CEF becomes very important. For these standards, therefore, it makes sense to have terms that are unambiguously associated with one level or the other. The appropriate terms for use in relation to a coded character set are codepage and codepoint. When speaking about a character encoding form, the preferred term is code unit. If you are expecting to start working with or learning about The Unicode Standard, you will probably find it convenient to learn the preferred meanings for these terms and to begin using them consistently.

8 References

Dürst, Martin; François Yergeau; Misha Wolf; Asmus Freytag; and Tex Texin, eds. 2001. Character model for the World Wide Web 1.0. W3C working draft 26 January 2001. Cambridge, MA: The World Wide Web Consortium. Published online at http://www.w3.org/TR/charmod/.

Lunde, Ken. 1999. CJKV information processing. Sebastopol, CA: O’Reilly & Associates.

Whistler, Ken, and Mark Davis. 2000. Unicode technical report #17: Character encoding model. Revision 3.2. Cupertino, CA: The Unicode Consortium. Published online at http://www.unicode.org/unicode/reports/tr17/.



Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.

"Xiaowei Xu", Tue, Jan 20, 2009 07:37 (EST)
Can I translate it and post it on my personal site?

Hi there,

My name is Xiaowei Xu, and I am a software engineer from China. Throughout my career I have found character encoding issues a challenge to understand, especially on projects that require international or multilingual language support.

Thanks for this article, which I have just finished reading: https://scripts.sil.org/IWS-Chapter03. I have to say that it explains everything very clearly, and it has turned out to be very useful to me in learning this topic.

I wonder if I can get permission from you to translate it into my native language, Chinese, and post it on my personal website, so I can share it with other people who, like me, are still struggling with these concepts, which are so often explained ambiguously.

Thank you very much.

martinpk, Tue, Jan 20, 2009 10:26 (EST)
Re: Can I translate it and post it on my personal site?

Glad to hear that you found the article helpful!

Could you e-mail us with this request, so that the person who deals with such enquiries can reply directly to your e-mail address?

Thanks!

Peter

"Samoria L.", Wed, Oct 12, 2011 19:57 (EDT)
Character Encoding issue

Yes you can post it to help others.

"thulunada balipe", Sun, Jan 11, 2015 03:09 (EST)
Regarding script editor

I want to develop editor software for my native language. There is a code request proposal for the script, but it has not yet been confirmed by the Unicode standards organisation.

My questions are:

1) Can I develop an editor using my own encoding?

2) How do I develop an editor using C# WinForms?

3) Are there any useful links to help me with the design?

Thanks.

thulunada balipe

"Lorna", Tue, Jan 27, 2015 09:08 (EST) [modified by martinpk on Tue, Jan 27, 2015 21:40 (EST)]
Re: Regarding script editor

Unicode does allow you to use the Private Use Area (PUA) for encoding until the script is officially in Unicode. We talk about the PUA here: SIL’s Private Use Area (PUA). The main problem is that it is “private”, and you cannot assume that anyone else will use the same encoding as you do. So, you can develop your own editor, font and keyboard that use the PUA, but you should try to convert everything to the official Unicode encoding once the script is in Unicode.

Depending on what you need, SIL has already developed a rendering engine called Graphite that will render PUA characters. LibreOffice supports Graphite fonts, so if you develop your font using Graphite you'll be able to use it in LibreOffice without having to develop a specialized editor.



1 Of course, as users, we aren’t even aware of the numbers; we see the numbers interpreted by the computer as glyphs. But users generally aren’t even aware of the individual glyphs: they think in terms of the inferred writing system.
2 The name “Latin 1” should properly only be applied to the ISO 8859-1 character set standard. The relationship between ISO 8859-1 and Microsoft codepage 1252 is explained below in the text.
3 Information on standards for Far East languages was drawn from Lunde (1999). This is the best single source of information currently available on standards for Chinese, Japanese, Korean and Vietnamese.
4 Exact upper and lower limits are difficult to ascertain because definitions for things like byte, and their relationship to code units for representing text, start to break down. For example, teletype machines used 5-bit Baudot code, but it is unclear to me whether the transmission protocol and other aspects of the system architecture manipulated data in units that were 5 bits in length. The limits are also difficult to ascertain because there are unclear boundaries between things that are relevant in a discussion of text and things that are not. For example, the IBM 360-50 reportedly uses 90-bit words in microcode, but that really is not a valid comparison when talking about text.
5 I can explain at this point that by legacy I mean every standard that preceded Unicode.
6 This table represents a working codepage in use in Côte d’Ivoire.
7 This situation is not to be confused with that of a publisher sending an edited and typeset publication to a service bureau. Acrobat was primarily designed for that specific purpose.
8 This model is described in Whistler and Davis (2000) and also in Dürst et al (2001). The model described in UTR#17 includes a fifth level, known as the transfer encoding syntax. This accounts for the use of special encoding formats used in transmission of text data, such as compression schemes. Transfer encoding syntax is not particularly relevant for our purposes here, and so is not discussed further.
9 It should be noted that the term codepage also gets used in some contexts to refer to the next level in the model, the encoding form. This is discussed further in Section 7.
10 EUC-TW refers to the EUC encoding used with the CNS 11643-1992 character set. EUC stands for “Extended Unix Code”. In spite of the name, this encoding is used on other platforms as well.
11 The HZ encoding is used for the GB2312-80 character set and is commonly used for Chinese email and in certain other contexts. It is defined in the Internet recommendation RFC 1843.
12 It is for this reason that the concept of distinguishing codepoints in a coded character set from their encoded representation in an encoding form is new to many people.
13 Even Unicode can be encoded in terms of ISO 2022, though I don’t know why anyone would particularly want to do that.
14 Byte order is not an issue for internal memory representation of data since this is always determined by the specific machine architecture. For example, Intel processors use little-endian representation, while Motorola processors use big-endian representation. Of course, if you are writing software for an Intel processor and need to be able to read data files that are encoded using a big-endian encoding scheme, you will need to add a few lines of code to reorder the bytes—unless you are using a library that handles this for you—before you begin processing the character data. This doesn’t happen automatically.
15 In certain computer industry contexts, particularly in relation to the Internet, a technical proposal is often discussed by publishing a document known as an RFC, meaning “request for comments”. These are generally maintained by an oversight body, such as the Internet Engineering Task Force, and are given numerical identifiers. For example, RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt) is the current specification for the Internet email message protocol.
16 Development of the Unicode standard began by combining all of the industry standard character sets that existed prior to 1991. As a result, all of the legacy industry standard character sets we have been looking at are by definition subsets of the Unicode character set. It is therefore possible to think of the Mac Roman codepage, for example, as an encoding form used to represent a subset of the Unicode character set.
