NRSI: Computers & Writing Systems
Implementing Writing Systems: An introduction
Understanding Unicode™ - II
A general introduction to the Unicode Standard (Sections 6-15)
This article begins at: Understanding Unicode™: A general introduction to the Unicode Standard (Sections 1-5).
6 Backwards compatibility: the soft underbelly of Unicode
Thus far, we have looked at idealistic principles. It should not be surprising that reality involves other practical considerations that might compromise those ideals. In fact, most or all of the original design ideals for Unicode have been compromised to some extent.
The main practical consideration had to do with backward compatibility. In order for Unicode to succeed, it had to be possible for developers to begin implementing the Standard within the context of a large amount of existing software and data that was based on legacy encoding standards. Thus, one of the essential requirements for Unicode was the ability to convert data reliably from legacy encodings to Unicode and vice versa.
For this purpose, the Unicode character set was initially based on source legacy character sets that included industry standard character sets in wide usage as of or prior to May 1993. Support for these standards was defined in terms of the round-trip rule: it must be possible to convert plain text data from a source encoding into Unicode and then back again and end up with exactly the same data you started with.
In practice, what this has meant is that Unicode must retain any character distinction that was made in a source standard. For example, if a source standard contained a character for the Latin letter “I” and also contained a distinct character with the same shape but used as the Roman numeral for one, then Unicode also needed to contain two distinct characters for the two different uses.
The net effect of maintaining round-trip compatibility is duplication in the Unicode character set. Not all instances of duplication are comparable to one another, however, and some cases are less obviously instances of duplication than others. These issues lie behind the aspects of Unicode that are most difficult to understand, which we have yet to discuss. For someone implementing Unicode, either as a software developer or as someone determining how data should be encoded, these issues are important to understand.
Gaining a complete understanding of all of the characters in Unicode that were added for reasons of backward compatibility is more than can be covered in an introduction. We can get a start, though, by looking at the different ways in which the principles described in Section 5 have been compromised.1
6.1 Compromises in the principle of unification
The first type of duplication we will consider involves a simple violation of unification: Unicode has pairs of characters that are effectively one-for-one duplicates. For instance, U+212A KELVIN SIGN “K” was encoded as a distinct character from U+004B LATIN CAPITAL LETTER K because these were distinguished in some source standard; otherwise, there is no need to distinguish these in shape or in terms of how they are processed. The same situation applies to other pairs, such as U+212B ANGSTROM SIGN versus U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, and U+037E GREEK QUESTION MARK versus U+003B SEMICOLON.
In the source legacy standards, the two characters in a pair would have been assumed to have distinct functions. In these cases, however, the distinctions can be considered artificial. For example, in the case of the Kelvin sign and the letter K, there is no distinction in shape or in how they are to be handled by text processes. This is purely a matter of interpretation at a higher level. Distinguishing these would be like distinguishing the character “I” when used as an English pronoun from “I” used as an Italian definite article. This is not the type of distinction that needs to be reflected in a character encoding standard.
In TUS 3.1, there are over 800 exact duplicates for Han ideographs. These are located in the CJK Compatibility Ideographs (U+F900..U+FAFF) and the CJK Compatibility Ideographs Supplement (U+2F800..U+2FA1F) blocks. Note, however, that not every character in the former block is a duplicate. As the characters were being collected from various sources, these were prepared as a group and kept together. There are, in fact, twelve characters in this block that are unique and are not duplicates of any other character, such as U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E. Thus, you need to pay close attention to details in order to know what the status of any character is. In particular, you cannot assume that a character is a duplicate just because it is in the compatibility area of the BMP (U+F900..U+FFFF) or is from a block that has “compatibility” in the name.
Beyond the duplicate Han characters, there are another 33 of these singleton (one-to-one) duplicates in the lower end of the BMP, most of them in the Greek (U+0370..U+03FF) or Greek Extended (U+1F00..U+1FFF) blocks.
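As a rough illustration, the singleton duplicates mentioned above can be inspected with Python's standard unicodedata module, which exposes the decomposition mappings of whatever Unicode version ships with the interpreter (not necessarily Version 3.1). The helper name singleton_target is my own invention for this sketch and is not part of any library.

```python
import unicodedata

def singleton_target(ch):
    """Return the single character that `ch` canonically maps to, or None.

    Hypothetical helper for illustration: it reads the raw decomposition
    mapping and accepts only untagged (canonical) one-character mappings.
    """
    mapping = unicodedata.decomposition(ch)
    # An empty string means no mapping; a leading "<tag>" marks a
    # compatibility mapping, which is not an exact duplicate.
    if not mapping or mapping.startswith("<"):
        return None
    parts = mapping.split()
    if len(parts) != 1:
        return None
    return chr(int(parts[0], 16))

kelvin = singleton_target("\u212A")      # KELVIN SIGN
angstrom = singleton_target("\u212B")    # ANGSTROM SIGN
greek_q = singleton_target("\u037E")     # GREEK QUESTION MARK
cjk_dup = singleton_target("\uF900")     # a duplicate compatibility ideograph
cjk_unique = singleton_target("\uFA0E")  # one of the unique ones in that block
```

The last two lines show why the block name alone is not a reliable guide: one character in the CJK Compatibility Ideographs block has a singleton mapping and the other has none.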
6.2 Compromises in the principle of dynamic composition
The second type of duplication we will look at involves cases in which a character duplicates a sequence of characters. As described in Section 5.6, a text element may be represented as a combining character sequence. For instance, Latin a-circumflex with dot-below “ậ” can be represented as a sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW, U+0302 COMBINING CIRCUMFLEX ACCENT>. In such cases, it is assumed that an appropriate rendering technology will be used that can do the glyph processing needed to correctly position the combining marks. In many cases, however, legacy standards encoded precomposed base-plus-diacritic combinations as characters rather than composing these combinations dynamically. As a result, it was necessary to include precomposed characters such as U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW in Unicode.
Unicode has a very large number of precomposed characters, including 730 for Greek and Latin, and over 11,000 for Korean! Most of the Latin Extended Additional (U+1E00..U+1EFF) and the Greek Extended (U+1F00..U+1FFF) blocks and the entire Hangul Syllables block (U+AC00..U+D7AF) are filled with precomposed characters. There are also over 200 precomposed characters for Japanese, Cyrillic, Hebrew, Arabic and various Indic scripts.
Not all of the precomposed characters are from scripts used in writing human languages. There are a few dozen precomposed mathematical operators; for example, U+2260 NOT EQUAL TO, which can also be represented as the sequence <U+003D EQUALS SIGN, U+0338 COMBINING LONG SOLIDUS OVERLAY>. There are also 13 precomposed characters for Western musical symbols in Plane 1; for example U+1D15F MUSICAL SYMBOL QUARTER NOTE, which can also be represented as the sequence <U+1D158 MUSICAL SYMBOL NOTEHEAD BLACK, U+1D165 MUSICAL SYMBOL COMBINING STEM>.
Precomposed characters go against the principle of dynamic composition, but also against the principle that Unicode encodes abstract characters rather than glyphs. In principle, it should be possible for a combining character sequence to be rendered so that the glyphs for the combining marks are correctly positioned in relation to the glyph for the base, or even so that the character sequence is translated into a single precomposed glyph. In these cases, though, that glyph is directly encoded as a distinct character.
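The correspondence between precomposed characters and combining sequences can be checked with a short sketch. It assumes Python's unicodedata module and uses the normalization forms NFC and NFD, which are only introduced later in this article (Section 10); here they serve simply as a convenient way to compose or decompose a string.

```python
import unicodedata

# a + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT
decomposed = "\u0061\u0323\u0302"
# LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
precomposed = "\u1EAD"

# Composing the sequence yields the precomposed character, and
# decomposing the precomposed character yields the sequence.
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precomposed)

# The same relationship holds outside ordinary scripts.
not_equal_parts = unicodedata.normalize("NFD", "\u2260")       # NOT EQUAL TO
quarter_note_parts = unicodedata.normalize("NFD", "\U0001D15F")  # QUARTER NOTE
```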
There are other cases in which the distinction between characters and glyphs is compromised. Those cases have some significant differences from the ones we have considered thus far. Before continuing, though, there are some important additional points to be covered in relation to the characters described in Sections 6.1 and 6.2. We will return to look at the remaining areas of compromise in Section 6.4.
6.3 Canonical equivalence and the principle of equivalent sequences
In Unicode 3.1, there are over 13,000 instances of the situations described in Sections 6.1 and 6.2. In each of these cases, Unicode provides alternate representations for a given text element. For singleton duplicates, what this means is that there are two codepoints that are effectively equivalent and mean the same thing:
Figure 14. Characters with exact duplicates
Likewise in the case of each precomposed character, there is a dynamically composed sequence that is equivalent and means the same thing as the precomposed character:
Figure 15. Precomposed characters and equivalent dynamically composed sequences
This type of ambiguity is far from ideal, but was a necessary price of maintaining backward compatibility with source standards and, specifically, the round-trip rule. In view of this, one of the original design principles of Unicode was to allow for a text element to be represented by two or more different but equivalent character sequences.
In these situations, Unicode formally defines a relationship of canonical equivalence between the two representations. Essentially, this means that the two representations should generally be treated as if they were identical, though this is slightly overstated. Let me explain in more detail.
In precise terms, the Unicode Standard defines a conformance requirement in relation to canonically equivalent pairs that must be observed by software that claims to conform to the Standard:2
Conformance requirement 9. A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.
In other words, software can treat the two representations as though they were identical, but it is also allowed to distinguish between them; it just cannot always treat them as distinct.
Since the different sequences are supposed to be equal representations of exactly the same thing, it might seem that this requirement is stated somewhat weakly, and that it ought to be appropriate to make a much stronger requirement: software must always treat canonical-equivalent sequences identically. Most of the time, it does make sense for software to do that, but there may be certain situations in which it is valid to distinguish them. For example, you may want to inspect a set of data to determine if any precomposed characters are used. You could not do that if precomposed characters are never distinguished from the corresponding dynamically composed sequence.
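Both sides of this requirement can be sketched in a few lines. The function names below are hypothetical, and comparing NFD forms is only one reasonable way to test canonical equivalence; the Standard does not mandate this particular implementation.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Treat two strings as the same if their fully decomposed forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def find_precomposed(text):
    """List characters that NFD would change (they have a canonical
    decomposition).

    This deliberately distinguishes precomposed characters from the
    equivalent sequences, which conformance requirement 9 permits.
    """
    return [ch for ch in text if unicodedata.normalize("NFD", ch) != ch]

with_precomposed = "Vi\u1EC7t"          # uses U+1EC7 (e with circumflex and dot below)
with_sequence = "Vie\u0323\u0302t"      # same text, dynamically composed

same_text = canonically_equivalent(with_precomposed, with_sequence)
same_codepoints = with_precomposed == with_sequence
found = find_precomposed(with_precomposed)
found_none = find_precomposed(with_sequence)
```

A search or comparison routine would normally use something like canonically_equivalent, while a data-auditing tool would use something like find_precomposed.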
At this point, I can introduce some convenient terminology that is conventionally used. Whereas a character like U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW is referred to as a precomposed character, the corresponding combining character sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW, U+0302 COMBINING CIRCUMFLEX ACCENT> is often referred to as a decomposed representation or a decomposed character sequence. The relationship between a precomposed character and the equivalent decomposed sequence is formally known as canonical decomposition. These mappings are specified as part of the semantic character properties contained in the online data files that were mentioned in Section 2.2.3 The term also applies to singleton duplicates: thus, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE is said to be the canonical decomposition of U+212B ANGSTROM SIGN.
In some cases a text element can be represented as either fully composed or fully decomposed sequences. In many cases, however, there will be more than just these two representations. In particular, this can occur if there is a precomposed character that corresponds to a partial representation of another text element. For example, the text element a-ring-acute “ǻ” can be represented in Unicode using a fully composed or a fully decomposed representation, but also using a partially decomposed representation:
Figure 16. Equivalent precomposed, decomposed and partially composed representations
Thus, in this example, there are four possible representations that are all canonically equivalent to one another. We will look at the possibility of having multiple equivalent representations further in Section 9.
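Since the figure is not reproduced here, the following sketch makes an assumption: it uses the capital-letter counterpart, A-ring-acute, for which four representations are easy to enumerate because the Angstrom sign provides an extra partially composed variant. It checks that all four collapse to a single form under either NFD or NFC.

```python
import unicodedata

# Four different codepoint sequences for capital A-ring-acute
# (an assumed reconstruction, chosen for illustration).
representations = [
    "\u01FA",              # fully precomposed
    "\u00C5\u0301",        # A WITH RING ABOVE + COMBINING ACUTE ACCENT
    "\u212B\u0301",        # ANGSTROM SIGN + COMBINING ACUTE ACCENT
    "\u0041\u030A\u0301",  # fully decomposed
]

# Sets collapse duplicates, so one element means "all the same".
nfd_forms = {unicodedata.normalize("NFD", s) for s in representations}
nfc_forms = {unicodedata.normalize("NFC", s) for s in representations}
```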
There is more to be explained regarding the relationship between canonical-equivalent sequences, and we will be looking at further details in Sections 7, 9, 10 and 11. Before returning to discuss ways in which Unicode design principles have been compromised, there are some specific points worth mentioning regarding certain characters for vowels in Indic scripts.
Indic scripts have combining vowel marks that can be written above, below, to the left or to the right of the syllable-initial consonant. In many Indic scripts, certain vowel sounds are written using a combination of these marks, as illustrated in Figure 17:
Figure 17. Tamil vowel written before and after consonant
Such combinations are especially typical in representing the sounds “o” and “au” using combinations of vowel marks corresponding to “e”, “i” and “a”.
What is worth noting is that these vowels are not handled in the same way in Unicode for all Indic scripts. In a number of cases, Unicode includes characters for the precomposed vowel combination in addition to the individual vowel signs. This happens for Bengali, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan and Myanmar scripts. Thus, for example, in Tamil one can use U+0BCA TAMIL VOWEL SIGN O or the sequence <U+0BC6 TAMIL VOWEL SIGN E, U+0BBE TAMIL VOWEL SIGN AA>. On the other hand, in Thai and Lao, the corresponding vowel sounds can only be represented using the individual component vowel signs. Thus in Thai, the syllable “kau” can only be represented as <U+0E40 THAI CHARACTER SARA E, U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA>. Then again, in Khmer, only precomposed characters can be used.4 Thus in Khmer, the corresponding vowel is represented using U+17C5 KHMER VOWEL SIGN AU; Unicode does not include characters for both of the component marks, and so there is no alternate representation. If you need to encode data, therefore, for a language that uses an Indic script, pay close attention to how that particular script is supported in Unicode.
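The differences between scripts show up directly in the decomposition data. The sketch below, again assuming Python's unicodedata module and whatever Unicode version it bundles, compares the Tamil, Khmer and Thai cases mentioned above.

```python
import unicodedata

tamil_o = "\u0BCA"                  # TAMIL VOWEL SIGN O
tamil_parts = "\u0BC6\u0BBE"        # VOWEL SIGN E + VOWEL SIGN AA
khmer_au = "\u17C5"                 # KHMER VOWEL SIGN AU
thai_kau = "\u0E40\u0E01\u0E32"     # SARA E + KO KAI + SARA AA

# Tamil: the two-part vowel has a canonical decomposition mapping,
# so both spellings are canonically equivalent.
tamil_mapping = unicodedata.decomposition(tamil_o)
tamil_composed = unicodedata.normalize("NFC", tamil_parts)

# Khmer: no mapping at all, so there is only one representation.
khmer_mapping = unicodedata.decomposition(khmer_au)

# Thai: only the component signs exist; composing changes nothing.
thai_composed = unicodedata.normalize("NFC", thai_kau)
```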
6.4 Compromises in the distinction between characters and glyphs
Returning to our discussion of compromises in the design principles, the next cases we will look at blur the distinction between characters and glyphs. As explained earlier, Unicode assumes that software will support rendering technologies that are capable of the glyph processing needed to handle selection of contextual forms and ligatures. Such capability was not always available in the past, however. The result is that Unicode includes a number of cases of presentation forms—glyphs that are directly encoded as distinct characters but that are really only rendering variants of other characters or combinations of characters. For example, U+FB01 LATIN SMALL LIGATURE FI encodes a glyph that could otherwise be represented as a sequence of the corresponding Latin characters.
There are five blocks in the compatibility area that primarily encode presentation forms. Two of these are the Arabic Presentation Forms-A block (U+FB50..U+FDFF) and the Arabic Presentation Forms-B block (U+FE70..U+FEFF). These blocks mostly contain characters that correspond to glyphs for Arabic ligatures and contextual shapes that would be found in different connective relationships with other glyphs (initial, medial, final and isolate forms). So, for example, U+FEEC ARABIC LETTER HEH MEDIAL FORM encodes the glyph that would be used to present the Arabic letter heh (nominally, the character U+0647 ARABIC LETTER HEH) when it is connected on either side. Unicode includes 730 such characters for Arabic presentation forms.
Characters like this constitute duplicates of other characters. There is a difference between this situation and that described in Sections 6.1 and 6.2, however. In the case of the singleton duplicates, the two characters were, for all intents and purposes, identical: in principle, they could be exchanged in any situation without any impact on text processing or on users. Likewise for the precomposed characters and the corresponding (fully or partially) decomposed sequences. In the case of the Arabic presentation forms, though, they are equivalent only in certain rendering contexts. For example, U+FEEC could not be considered equivalent to a word-initial occurrence of U+0647. For situations like this, Unicode defines a lesser type of equivalence known as compatibility equivalence: two characters are in some sense duplicates but with some limitations and not necessarily in all contexts. Formally, the compatibility equivalence relationship between two characters is shown in a compatibility decomposition mapping that is part of the Unicode character properties.5 I will discuss the relationship between compatibility equivalence and canonical equivalence further in Section 11.
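In the data files, a compatibility decomposition mapping is marked with a tag in angle brackets, which is how it can be told apart from a canonical mapping. A minimal sketch, assuming Python's unicodedata module (the normalization forms used here are discussed later in the article):

```python
import unicodedata

heh_medial = "\uFEEC"   # ARABIC LETTER HEH MEDIAL FORM
fi_ligature = "\uFB01"  # LATIN SMALL LIGATURE FI

# The leading tag records in what sense the character is a variant.
heh_mapping = unicodedata.decomposition(heh_medial)
fi_mapping = unicodedata.decomposition(fi_ligature)

# Canonical normalization leaves presentation forms alone...
heh_nfc = unicodedata.normalize("NFC", heh_medial)
# ...while compatibility normalization replaces them with the
# nominal characters, discarding the positional information.
heh_nfkc = unicodedata.normalize("NFKC", heh_medial)
fi_nfkd = unicodedata.normalize("NFKD", fi_ligature)
```

Because the replacement loses information (here, that the heh was in medial form), compatibility normalization is not something to apply blindly to arbitrary data.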
In general, use of the characters in the Arabic Presentation Forms blocks is best avoided whenever possible. Note, though, that these blocks contain three characters that are unique characters and are not duplicates of any others: U+FD3E ORNATE LEFT PARENTHESIS, U+FD3F ORNATE RIGHT PARENTHESIS and U+FEFF ZERO WIDTH NO-BREAK SPACE. As we saw earlier, you need to check the details on characters in the Standard before you make assumptions about them.
The other three blocks that contain primarily presentation forms are the Alphabetic Presentation Forms block (U+FB00..U+FB4F), the CJK Compatibility Forms block (U+FE30..U+FE4F), and the Combining Half Marks block (U+FE20..U+FE2F). The first of these contains various Latin, Armenian and Hebrew ligatures, such as the “fi” ligature mentioned above. This block also has wide variants of certain Hebrew letters, which were used in some legacy systems for justification of text, and two Hebrew characters that are no more than font variants of existing characters.6 The CJK Compatibility Forms block contains rotated variants of various punctuation characters for use in vertical text; for example, U+FE35 PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS. Finally, the Combining Half Marks block contains presentation forms representing the left and right halves of glyphs for diacritics that span multiple base characters; for example, U+FE22 COMBINING DOUBLE TILDE LEFT HALF and U+FE23 COMBINING DOUBLE TILDE RIGHT HALF. These are not variants of other characters or sequences of characters. Rather, pairs of these are variants of other characters. Unicode does not formally define any type of equivalence that covers such situations.
Outside the compatibility area at the top of the Basic Multilingual Plane, there are no characters for contextual presentation forms (like the Arabic positional variants), and there are but a handful of ligatures, such as U+0587 ARMENIAN SMALL LIGATURE ECH YIWN, and U+0EDC LAO HO NO.
6.5 Compromises in the distinction between characters and graphemes
The next category of compromise has to do with the distinction between characters and graphemes. In particular, Unicode includes 18 characters for Latin digraphs; for example, U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z and U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING. Most of these are to be found in the Latin Extended-B block (U+0180..U+024F). Others are in the Latin Extended-A block (U+0100..U+017F), and there is one (U+1E9A) in the Latin Extended Additional block (U+1E00..U+1EFF).7
There are also what appear to be four Arabic digraphs at U+0675..U+0678 that make use of a spacing hamza and are reportedly used for Kazakh.
All of these digraphs are duplicates for sequences involving the corresponding pairs of characters. In each case, Unicode designates the digraph character to be a compatibility equivalent to the corresponding sequence.
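The digraph examples named above can be expanded into their component characters with compatibility decomposition. This is a small sketch assuming Python's unicodedata module; the dictionary name is my own.

```python
import unicodedata

digraphs = {
    "\u01F2": "LATIN CAPITAL LETTER D WITH SMALL LETTER Z",
    "\u1E9A": "LATIN SMALL LETTER A WITH RIGHT HALF RING",
    "\u0675": "ARABIC LETTER HIGH HAMZA ALEF",
}

# Map each digraph character to the sequence it duplicates.
expanded = {ch: unicodedata.normalize("NFKD", ch) for ch in digraphs}

# Canonical decomposition, by contrast, leaves every one untouched,
# since the equivalence is of the compatibility kind only.
untouched = all(unicodedata.normalize("NFD", ch) == ch for ch in digraphs)
```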
There are over 16,000 characters defined in Unicode that in one way or another go against the basic design principles of the Standard. The categories described in the preceding discussion cover the bulk of them.
There are a number of other characters with compatibility decompositions that are not as easy to characterise. They do not compromise the design principles in any way not yet covered; they just tend to do so in more complex ways, such as violating many principles at once! It is not essential that a beginner know much about them, but as you advance, you may begin to encounter some of them and might find it helpful to be more familiar with them as a group. For that reason, I have provided an overview of that collection of miscellany in “A review of characters with compatibility decompositions”.8
Of the characters we considered in the preceding discussion, the vast majority—over 13,000—were singleton identical duplicates of other characters, or were precomposed base-plus-diacritic or Hangul combinations. The relationship between these characters and the corresponding character sequences that they duplicate was easy to characterise: they are effectively the same. Unicode refers to the relationship between these duplicates as canonical equivalence.
The other 3,000 characters were somewhat harder to characterise. They were equivalent to the characters that they duplicate, but in a more limited way. Sometimes, this was because they could only be considered equivalent in certain contexts, as in the case of the Arabic contextual forms. Sometimes, it was the opposite: in certain contexts, they could be considered distinct from their nominal counterparts because they might be understood to carry extra information, as in the case of the superscripts. Sometimes, this was because they carried some additional non-textual information, such as layout or formatting, as in the case of the enclosed alphanumeric symbols. In these cases, Unicode refers to the relationship between these duplicates as compatibility equivalence.
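The rough proportions given above can be checked against a current Unicode database. The sketch below assumes Python's unicodedata module, so the totals reflect whatever Unicode version the interpreter bundles and will be somewhat larger than the Version 3.1 figures quoted in the text. The classify function is a hypothetical helper.

```python
import unicodedata

def classify(cp):
    """Label a code point by the kind of decomposition it has, if any."""
    ch = chr(cp)
    if unicodedata.normalize("NFD", ch) != ch:
        return "canonical"       # singleton or precomposed (incl. Hangul)
    if unicodedata.normalize("NFKD", ch) != ch:
        return "compatibility"   # changed only by compatibility mappings
    return None

counts = {"canonical": 0, "compatibility": 0}
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogate code points are not characters
    kind = classify(cp)
    if kind is not None:
        counts[kind] += 1
```

Normalizing each code point, rather than reading the raw mapping, means that Hangul syllables are counted as well; their decompositions are derived algorithmically rather than listed individually.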
The difference between canonical and compatibility equivalence is a significant one that plays out in important ways in the implementation of the Standard. The key to this is in the mechanisms that software uses to equate equivalent characters in practice, a category of processes known as normalization. We will discuss this in Section 10. Before we can do that, however, we need to examine character semantics, and also take a closer look at combining marks. These will be covered in Sections 7 and 9.
7 Character semantics and behaviours
As explained in “Character set encoding basics”, software creates the impression of understanding the behaviours of writing systems by attributing semantic character properties to encoded characters. These properties represent parameters that determine how various text processes treat characters. For example, the SPACE character needs to be handled differently by a line-breaking process than, say, the U+002C COMMA character. Thus, U+0020 SPACE and U+002C COMMA have different properties with respect to line-breaking.
One of the distinctive strengths of Unicode is that the Standard not only defines a set of characters, but also defines a number of semantic properties for those characters. Unicode is different from most other character set encoding standards in this regard. In particular, this is one of the key points of difference between Unicode and ISO/IEC 10646.
In addition to the semantic properties, Unicode also provides reference algorithms for certain complex processes for which the correct implementation may not be self-evident. In this way, the Standard is not only defining semantic properties for characters, but is also guiding how semantics should be interpreted. This has an important benefit for users in that it leads to more consistent behaviour between software implementations. There is also a benefit for software developers who are suddenly faced with supporting a wide variety of languages and writing systems: they are provided with important information regarding how characters in unfamiliar scripts behave.
7.1 Where the character properties are listed and described
The character properties and behaviours are listed and explained in various places, including the printed book, some of the technical reports and annexes, and in online data files. An obvious starting point for information on character properties is Chapter 4 of TUS 3.0. That chapter describes or lists some of the properties directly, and otherwise indicates the place in which many other character properties are covered. Note that that chapter is valid in relation to Version 3.0, and that some changes with regard to character properties were made in TUS 3.1. Those changes are described in Article III of UAX #27. Even together, though, those two sources do not provide a reference that covers all character properties.
While some properties are listed in the book, the complete listing of character properties is given in the data files that comprise the Unicode Character Database (UCD). These are included on a CD-ROM with the printed book, but the current versions are to be found online at http://www.unicode.org/Public/UNIDATA/.9 A complete reference regarding the files that make up the UCD and the place in which each is described is provided in the document at http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html.10
The original file that contained the character properties is UnicodeData.txt. This is considered the main file in the UCD and is one that is perhaps most commonly referred to. It is just one of several, however. This file was first included as part of the Standard in Version 1.1.5. Like all of the data files, it is a machine readable text database. As new properties were defined, additional files were created. This was done rather than adding new fields to the original file in order to remain compatible with software implementations designed to read that file. There are now 27 data files that comprise the UCD for Version 3.1.
7.2 Normative versus informative properties and behaviours
Unicode distinguishes between normative properties and informative properties. Software that conforms to Unicode is required to observe all normative properties. The informative properties are provided as a guide, and it is recommended that software developers generally follow them, but implementations may override those properties while still conforming to the Standard. All of the properties are defined as part of the Standard, but only the normative properties are required to be followed.
One reason for this distinction is that some properties are provided for documentation purposes and have no particular relevance for implementations. For example, the Unicode 1.0 Name property is provided for the benefit of implementations that may have been based on that version.11 Another similar property is the 10646 comment property, which records notes from that standard consisting primarily of alternate character names and the names of languages that use that character. Another example is the informative notes and cross-references in NamesList.txt (discussed in Section 3.3), which provide supplementary information about characters that is helpful for readers in identifying and distinguishing between characters.
Another reason for the distinction is that it may not be considered necessary to require certain properties to be applied consistently across all implementations. This would apply, for instance, in the case of a property that is not considered important enough to invest the effort in giving careful thought to the definition and assignment of the property for all characters. For example, the kMainlandTelegraph property12 identifies the telegraph code used in the People’s Republic of China that corresponds to a given unified Han character. This property may be valuable in some contexts, but is not likely to be something that is felt to need normative treatment within Unicode.
For some properties, it may also be considered inappropriate to impose particular requirements on conformant implementations. This might be the case if it is felt that a given category of processes is not yet well enough understood to specify what the normative behaviour for those processes should be. For example, Line Breaking properties13 are defined in Unicode for each character, and they reflect what are considered to be best practices to the extent that line breaking is understood. The treatment of line breaking properties is thought to largely reflect best practices that are valid for many situations, but is not considered to have completely covered all aspects of line breaking behaviour. Current knowledge regarding line breaking is not complete enough to make a complete and normative specification of line breaking behaviour that becomes a conformance requirement on software. As a result, some line breaking properties are normative—for instance, a break is always allowed after a ZERO-WIDTH SPACE and is always mandatory after a CARRIAGE RETURN—but most line-breaking properties are informative and can be modified in a given implementation.
It is also inappropriate to impose normative requirements in the case of properties for which the status of some characters is controversial or simply unclear. For example, a set of case-related properties exist that specify the effective case of letter-like symbols. For instance, U+24B6 CIRCLED LATIN CAPITAL LETTER A is assigned the Other_Uppercase property. For some symbols, however, it is unclear what effective case they should be considered to have. The character U+2121 TELEPHONE SIGN, for instance, has been realised using a variety of glyphs. Some glyphs use small caps, while others use all caps. It is difficult to have any confidence in making a property such as Other_Uppercase normative when the status of some characters in relation to that property is unclear.
The normative and informative distinction applies to the specification of behaviours as well as to character properties. For example, UAX #9 (Davis 2001a) describes behaviour with regard to the layout of bi-directional text (the “bi-directional algorithm”) and that behaviour is a normative part of the Standard. Likewise, the properties in ArabicShaping.txt (described in Section 8.2 of TUS 3.0) that describe the cursive shaping behaviour of Arabic and Syriac characters are also normative. On the other hand, UAX #14 (Freytag 2000) describes an algorithm for processing the line breaking properties, but that algorithm is not normative.14 Similarly, Section 5.13 of TUS 3.0 discusses the handling of non-spacing combining marks in relation to processes such as keyboard input, but the guidelines it presents are informative only.
There is one other point that is important to note in relation to the distinction between normative and informative properties: the fact that a property is normative does not imply that it can never change. Likewise, the fact that a property is informative does not imply that it is open to change. For example, the Unicode 1.0 Names property is informative but is not subject to change. On the other hand, several changes were made in TUS 3.0 to the Bi-directional Category, Canonical Combining Class and other normative properties in order to correct errors and refine the specification of behaviours. As a rule, though, it is the case that changes to normative properties are avoided, and that some informative properties can be more readily changed.
It is also true that some normative properties are not subject to change. In particular, the Character name property is not permitted to change, even in cases in which, after the version of the Standard in which a character is introduced is published, the name is found to be inappropriate. The reason for this is that the sole purpose of the character name is to serve as a unique and fixed identifier.
7.3 A summary of significant properties
The properties that are probably most significant are those found in the main character database file, UnicodeData.txt. I will briefly describe the properties in this file here.
The format for the UnicodeData.txt file is described in the document at http://www.unicode.org/Public/UNIDATA/UnicodeData.html. UnicodeData.txt is a semicolon-delimited text database with one record (i.e. one line) per character. There are 15 fields in each record, each field corresponding to a different property. All of the properties in this file apart from the Unicode 1.0 Name and 10646 comment (described above) are normative.
The first field corresponds to the codepoint for a given character, the significance of which is obvious. The next field contains the character name, which provides a unique identifier in addition to the codepoint. The character name has one advantage over the codepoint as an identifier: while the codepoint is applicable only to Unicode, the character name may be constant across a number of different character set standards, facilitating comparisons between standards.
The next field contains the General Category properties. This categorises all of the characters into a number of useful character types: letter, combining mark, number, punctuation, symbol, control, separator and other. Each of these is further divided into subcategories. For example, the subcategories of letters include uppercase, lowercase and titlecase. Each of these subcategories is indicated in the data file using a two-letter abbreviation. The complete list of general category properties and their abbreviations is given in Table 5:
Table 5. General category properties and their abbreviations
This set of properties forms a partition of the Unicode character set; that is, every character is assigned exactly one of these general category properties.
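As a rough illustration of this partition, Python's standard unicodedata module reports the two-letter general category of any character, and the first letter of the abbreviation identifies the major class. The sketch below relies on the character data shipped with the Python interpreter, which follows a later version of the Standard than TUS 3.0, so individual assignments may differ from those in Table 5. The MAJOR_CLASS dictionary and the major_class helper are names introduced here for illustration only.

```python
import unicodedata

# Major classes keyed by the first letter of the two-letter abbreviation.
# This grouping is an illustrative assumption, not a copy of Table 5.
MAJOR_CLASS = {
    "L": "letter",
    "M": "combining mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other (control, format, ...)",
}

def major_class(ch):
    """Return (abbreviation, major class) for a single character."""
    abbreviation = unicodedata.category(ch)
    return abbreviation, MAJOR_CLASS[abbreviation[0]]

# Every character receives exactly one abbreviation, so this lookup
# never needs to choose between competing categories.
for ch in ["A", "a", "\u0301", "1", "(", "+", " ", "\n"]:
    print(f"U+{ord(ch):04X}", *major_class(ch))
```

Because every character has exactly one general category, a lookup of this kind can be used as the first step in any process that needs to sort characters into letters, marks, numbers and so on.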
Space does not permit a detailed description of all of these properties. General information can be found in Section 4.5 of TUS 3.0. Some of these properties are not discussed in detail in the Standard using these explicit names, so information may be difficult to find. For some of the properties, it may be more likely to find information about individual characters than about the groups of characters as a whole. Many of these categories are significant in relation to certain behaviours, though. Several are discussed in Chapter 5 of TUS 3.0. Many of them are particularly relevant in relation to line breaking behaviour, described in UAX #14 (Freytag 2000).
The control, format and other special characters are discussed in Chapter 13 of TUS 3.0. Numbers are described in Chapter 12 and in most of the various sections covering different scripts in Chapters 7–11. Punctuation and spaces are discussed in Chapter 6 of TUS 3.0. Symbols are the topic of Chapter 12 of TUS 3.0. Line and paragraph separators are covered in UAX #13 (Davis 2001b).
It will be worth describing letters and case in a little more detail, and I will do so after finishing this general survey of character properties. I will also discuss combining marks in some detail in Section 9.
Returning to our discussion of the fields in the UnicodeData.txt database file, the fourth, fifth and sixth fields contain particularly important properties: the Canonical Combining Classes, Bi-directional Category and Decomposition Mapping properties. Together with the general category properties, these three properties are the most important character properties defined in Unicode. Accordingly, each of these will be given additional discussion. The canonical combining classes are relevant only for combining marks (characters with general category properties of Mn, Mc and Me), and will be described in more detail in Section 9. The bi-directional categories are used in relation to the bi-directional algorithm, which is specified in UAX #9 (Davis 2001a). I will provide a brief outline of this in Section 8.1. Finally, the character decomposition mappings specify canonical and compatibility equivalence relationships. I will discuss this further in Section 7.5.
Most of the next six fields contain properties that are of more limited significance. Fields seven to nine relate to the numeric value of numbers (characters with general category properties Nd, Nl and No). These are covered in Section 4.6 of TUS 3.0. The tenth field contains the Mirrored property, which is important for right-to-left scripts, and is described in Section 4.7 of TUS 3.0 and also in UAX #9 (the bi-directional algorithm). I will say more about it in Section 8.1. Fields eleven and twelve contain the Unicode 1.0 Name and 10646 comment properties.
The last three fields contain case mapping properties: Uppercase Mapping, Lowercase Mapping and Titlecase mapping. These are considered further in the next section.
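Several of the properties surveyed above can be inspected without reading the data file directly. The following sketch uses Python's standard unicodedata module; the describe helper and its dictionary keys are hypothetical names chosen for this example, and the values come from whatever version of the character database ships with the interpreter rather than from TUS 3.0 itself.

```python
import unicodedata

def describe(ch):
    """Collect several UnicodeData.txt-style properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_category": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        # The second argument is returned when the character has no
        # value for the property, instead of raising ValueError.
        "decimal": unicodedata.decimal(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

for ch in ["(", "1", "a"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```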
As mentioned earlier, UnicodeData.txt is a semicolon-delimited text database. Now that I have described each of the fields, let me provide some examples:
0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0407;CYRILLIC CAPITAL LETTER YI;Lu;0;L;0406 0308;;;;N;;Ukrainian;;0457;
Figure 18. Four entries from UnicodeData.txt
Note that not every field necessarily contains a value. For example, there is no uppercase mapping property for U+0028. Every entry in this file contains values, however, for the following fields: codepoint, character name, general category, canonical combining class, bi-directional category, and mirrored.
Looking at the first of these entries, we see that U+0028 has a general category of “Ps” (opening punctuation—see Table 5), a canonical combining class of “0”, a bi-directional category of “ON”, and a mirrored property of “Y”. It also has a Unicode 1.0 name of “OPENING PARENTHESIS”.
The entry for U+0031 shows a general category of “Nd” (decimal digit number), a combining class of “0”, a bi-directional category of “EN”, and a mirrored property of “N”. Since this character is a number, fields seven to nine (having to do with numeric values) contain values, each of them “1”.
The character U+0061 has a general category of “Ll” (lowercase letter), a combining class of “0”, a bi-directional category of “L”, and a mirrored property of “N”. It also has uppercase and titlecase mappings to U+0041.
Finally, looking at the last entry, we see that U+0407 has a general category of “Lu” (uppercase letter), a combining class of 0, a bi-directional category of “L”, and a mirrored property of “N”. It also has a canonical decomposition mapping to < U+0406, U+0308 >, a 10646 comment of “Ukrainian”, and a lowercase mapping to U+0457.
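The walk-through above can be mirrored in a few lines of code. This is a minimal sketch of reading one record, assuming the 15-field, semicolon-delimited layout described in this section; the field names in FIELD_NAMES and the parse_record function are labels invented for the example and are not defined by the Standard.

```python
# Illustrative names for the 15 fields, in file order.
FIELD_NAMES = [
    "codepoint", "name", "general_category", "combining_class",
    "bidi_category", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "mirrored", "unicode_1_name", "comment_10646",
    "uppercase", "lowercase", "titlecase",
]

def parse_record(line):
    """Split one UnicodeData.txt line into a dict keyed by field name."""
    fields = [field.strip() for field in line.rstrip("\n").split(";")]
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))

sample = "0407;CYRILLIC CAPITAL LETTER YI;Lu;0;L;0406 0308;;;;N;;Ukrainian;;0457;"
record = parse_record(sample)

# Empty strings mark fields for which the character has no value.
for key, value in record.items():
    print(f"{key:18} {value!r}")
```

Keeping empty fields as empty strings, rather than dropping them, preserves the position of every later field, which is what allows the record to be read by position.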
I have described the main file in the Unicode character database, UnicodeData.txt, in some detail. There are a number of other files listing character properties in the Unicode character database. Some of the more significant files have been mentioned in earlier sections. Of the rest, it would be beyond the scope of an introduction to explain every one, and all of them are described in Davis and Whistler (2001b). I will be giving more details on those that are most significant in the sections that follow. In particular, additional properties related to case are discussed in Section 7.4 together with a fuller discussion of the case-related properties mentioned in this section; and the properties listed in the ArabicShaping.txt and BidiMirroring.txt files will be described in Section 8.1, together with further details on the mirrored property mentioned here.
7.4 Uppercase, lowercase, titlecase and case mappings
Case is an important property for characters in Latin, Greek, Cyrillic, Armenian and Georgian scripts.15 For these scripts, both upper- and lowercase characters are encoded. Because some Latin and Greek digraphs were included in Unicode, it was necessary to add additional case forms to deal with the situation in which a string has an initial uppercase character. Thus, for these digraphs there are upper-, lower- and titlecase characters; for example, U+01CA LATIN CAPITAL LETTER NJ, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, and U+01CC LATIN SMALL LETTER NJ. Likewise, there are properties giving uppercase, lowercase and titlecase mappings for characters. Thus, U+01CA has a lowercase mapping of U+01CC and a titlecase mapping of U+01CB.
Case has been indicated in Unicode by means of the general category properties “Ll”, “Lu” and “Lt”. These have always been normative character properties. Prior to TUS 3.1, however, case mappings were always informative properties. The reason was that, for some characters, case mappings are not constant across all languages. For example, U+0069 LATIN SMALL LETTER I is always lowercase, no matter what writing system it is used for, but not all writing systems consider the corresponding uppercase character to be U+0049 LATIN CAPITAL LETTER I. In Turkish and Azeri, for instance, the uppercase equivalent to “i” is U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE. The exceptional cases such as Turkish and Azeri were handled by special casing properties that were listed in a file created for that purpose: SpecialCasing.txt.
The SpecialCasing.txt file was also used to handle other special situations, in particular situations in which a case mapping for a character is not one-to-one. These typically involve encoded ligatures or precomposed character combinations for which the corresponding uppercase equivalent is not encoded. For example, U+01F0 LATIN SMALL LETTER J WITH CARON is a lowercase character, and its uppercase pair is not encoded in Unicode as a precomposed character. Thus, the uppercase mapping for U+01F0 must map to a combining character sequence, <U+004A LATIN CAPITAL LETTER J, U+030C COMBINING CARON>. This mapping is given in SpecialCasing.txt.
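The mappings discussed so far can be observed through Python's built-in string methods, which apply language-independent case mappings, including those that are not one-to-one. This is only a sketch: case_forms is a hypothetical helper, the results reflect the Unicode data bundled with the interpreter rather than TUS 3.1, and language-specific behaviour such as the Turkish and Azeri dotted I is not applied by these methods.

```python
def case_forms(ch):
    """Return the (uppercase, lowercase, titlecase) forms of one character."""
    # For a single-character string, str.title() applies the titlecase mapping.
    return ch.upper(), ch.lower(), ch.title()

def show(text):
    """Format a string as a sequence of U+XXXX codepoints."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

# The NJ digraph has three distinct case forms.
upper, lower, title = case_forms("\u01CC")  # LATIN SMALL LETTER NJ
print(show(upper), show(lower), show(title))

# U+01F0 has no precomposed uppercase, so the result is two characters.
print(show("\u01F0".upper()))

# No language tailoring: "i" maps to U+0049, never to U+0130.
print(show("i".upper()))
```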
Note that not all characters with case are given case mappings. For example, U+207F SUPERSCRIPT LATIN SMALL LETTER N is a lowercase character (indicated by the general category “Ll”), but it is not given uppercase or titlecase mappings in either UnicodeData.txt or in SpecialCasing.txt. This is true for a number of other characters as well. Several of them are characters with compatibility decompositions, like U+207F, but many are not. In particular the characters in the IPA Extensions block are all considered lowercase characters, but many do not have uppercase counterparts.
In TUS 3.1, case mappings have been changed from being informative to normative. The reason for the change was that it was recognised that case mapping was a significant issue for a number of processes and that it was not really satisfactory to have all case mappings be informative. Thus, the mappings given in UnicodeData.txt are now normative properties. Special casing situations are still specified in SpecialCasing.txt, but this file is now considered normative as well.
Note that UnicodeData.txt indicates the case of characters by means of the general categories “Ll”, “Lu” and “Lt”, and not by case properties that are independent of the “letter” category. There are instances, however, in which a character that does not have one of these properties should be treated as having case or as having case mappings. This applies, for example, to U+24D0 CIRCLED LATIN SMALL LETTER A, which has an uppercase mapping of U+24B6 CIRCLED LATIN CAPITAL LETTER A but does not have a case property since it has a general category of “So” (other symbol). As an accident of history in the way that the general category was developed, case was applied to characters that are categorised as letters, but not to characters categorised as symbols. This made case incompatible with being categorised as a symbol, even though case properties should be logically independent of the letter/symbol categories.
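The mismatch between general category and case mapping is easy to observe. The following sketch checks U+24D0 with Python's standard library; has_case_mapping is a hypothetical helper written for this example, and the data consulted is that of the interpreter's bundled Unicode version.

```python
import unicodedata

def has_case_mapping(ch):
    """True if upper- or lowercasing the character changes it."""
    return ch.upper() != ch or ch.lower() != ch

circled_small_a = "\u24D0"  # CIRCLED LATIN SMALL LETTER A

# The general category says "symbol", not "lowercase letter" ...
print(unicodedata.category(circled_small_a))

# ... yet the character still has an uppercase mapping.
print(has_case_mapping(circled_small_a))
print(f"U+{ord(circled_small_a.upper()):04X}")
```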
To deal with such situations, extended properties Other_Lowercase and Other_Uppercase have been defined. These are two of various extended properties that are listed in the file PropList.txt (described in PropList.htm). Furthermore, derived properties Lowercase and Uppercase have been defined (described in DerivedProperties.htm) that combine characters with the general categories “Ll” and “Lu” together with characters that have the Other_Lowercase and Other_Uppercase properties. Thus, the lowercase property (as listed in DerivedCoreProperties.txt) can be thought of as identifying all characters that can be deemed to be lowercase, regardless of their general category. This may be useful for certain types of processing. Note, however, that these extended and derived case-related properties are as yet only informative, not normative.
For some types of processing, there may be additional issues related to case that need to be considered. See UTR #21 (Davis 2001d) for further discussion of case-related issues.
7.5 Character decomposition mappings
The notions of canonical and compatibility equivalence were introduced in Section 6. There, we saw cases in which a Unicode character is identical in meaning to a decomposed sequence of one or more characters, and that these two representations are said to be canonically equivalent. We also considered other cases in which a Unicode character duplicates a sequence of one or more characters in some more limited sense: the two representations are equivalent in certain contexts only, or the one character is equivalent to the sequence when supplemented with certain non-textual information. In these cases, the two representations are said to be compatibility equivalent.
For both types of equivalence, the relationship between two encoded representations is formally expressed by means of a decomposition mapping. These mappings are given in the sixth field of UnicodeData.txt. So, for example, the character U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW is canonically equivalent to the combining sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW>. This relationship is indicated by the decomposition mapping given in the entry in UnicodeData.txt for U+1EA1:
1EA1;LATIN SMALL LETTER A WITH DOT BELOW;Ll;0;L;0061 0323;;;;N;;;1EA0;;1EA0
Figure 19. Decomposition mapping in UnicodeData.txt entry for U+1EA1
The same field in UnicodeData.txt is used to list both canonical and compatibility decomposition mappings. The two types of equivalence do need to be distinguished, however. As noted, characters with compatibility decompositions typically have some additional non-character element of meaning or some specific contextual associations. Accordingly, when giving a decomposition mapping for such a character, it makes sense also to indicate what this additional element of meaning is. This is precisely what is done: in compatibility decomposition mappings, the mapping includes the decomposed character sequence as well as a tag that indicates the additional non-character information contained in the compatibility character. This is illustrated in Figure 20:
02B0;MODIFIER LETTER SMALL H;Lm;0;L;<super> 0068;;;;N;;;;;
2110;SCRIPT CAPITAL I;Lu;0;L;<font> 0049;;;;N;SCRIPT I;;;;
FB54;ARABIC LETTER BEEH INITIAL FORM;Lo;0;AL;<initial> 067B;;;;N;;;;;
FF42;FULLWIDTH LATIN SMALL LETTER B;Ll;0;L;<wide> 0062;;;;N;;;FF22;;FF22
Figure 20. Sample entries from UnicodeData.txt with compatibility decomposition mappings
This additional tag in the decomposition mappings is what distinguishes between canonical equivalence relationships and compatibility equivalence relationships.16
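A process reading the sixth field therefore needs to check for a leading tag before interpreting the codepoints. The sketch below shows one way this might be done; parse_decomposition is a name made up for this example. It is exercised both on literal field values from the figures above and on values returned by Python's unicodedata.decomposition, which uses the same textual format but draws on a later version of the character database.

```python
import unicodedata

def parse_decomposition(field):
    """Split a decomposition field into (tag, codepoints).

    tag is None for a canonical mapping and a string such as "super"
    for a compatibility mapping. An empty field gives (None, []).
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [int(part, 16) for part in parts]

# Literal field values taken from Figures 19 and 20.
print(parse_decomposition("0061 0323"))     # canonical: no tag
print(parse_decomposition("<super> 0068"))  # compatibility: tagged

# The same format as reported by the standard library.
for ch in ["\u1EA1", "\u2110", "\uFB54", "\uFF42", "a"]:
    tag, codepoints = parse_decomposition(unicodedata.decomposition(ch))
    kind = "none" if not codepoints else ("canonical" if tag is None else tag)
    print(f"U+{ord(ch):04X}", kind, [f"{cp:04X}" for cp in codepoints])
```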
There are a total of 16 tags used to indicate compatibility decompositions. A brief description of each is given in Table 6:
Table 6. Tags used in compatibility decomposition mappings
The following table gives examples showing the use of each of these tags:
Table 7. Examples of different types of compatibility decomposition mappings
Note that the <compat> tag is used for a variety of characters that stand in one of several types of relationship to their corresponding decomposed counterparts. For example, the ligature presentation forms described in Section 6.4 and the digraphs described in Section 6.5 use this tag. It is also used for several of the types of compatibility decomposition described in “A review of characters with compatibility decompositions”.
It was noted in Section 6 that some cases of equivalence involve one-to-one relationships, for example in the case of the exact character duplicates discussed in Section 6.1. Hence, not all of the decomposition mappings contain sequences of two or more characters, as illustrated in Figure 21:
212A;KELVIN SIGN;Lu;0;L;004B;;;;N;DEGREES KELVIN;;;006B;
Figure 21. Single-character decomposition mapping in the entry for U+212A
Compatibility decomposition mappings always give the completely decomposed representation. This is illustrated by the entry for U+3315:
3315;SQUARE KIROGURAMU;So;0;L;<square> 30AD 30ED 30B0 30E9 30E0;;;;N;SQUARED KIROGURAMU;;;;
Figure 22. Multiple-character compatibility decomposition mapping
As a consequence of this, it should be noted that an instance of compatibility equivalence always involves exactly two representations: the compatibility character and the corresponding decomposed representation given in the mapping.
For canonical decompositions, the decomposition mapping lists a sequence of one or two characters, but never more than two. In the case of precomposed characters that involve multiple diacritic characters, this generally means that the precomposed character decomposes into a partially decomposed sequence. If a fully decomposed sequence is needed, further decomposition mappings can be applied to the partially decomposed sequences.17 So, for example, U+1FA7 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI has a decomposition involving two characters, one of which is U+1F67 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI, which in turn has a decomposition of two characters, and so on.
1FA7;GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI;Ll;0;L;1F67 0345;;;;N;;;1FAF;;1FAF
1F67;GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI;Ll;0;L;1F61 0342;;;;N;;;1F6F;;1F6F
1F61;GREEK SMALL LETTER OMEGA WITH DASIA;Ll;0;L;03C9 0314;;;;N;;;1F69;;1F69
Figure 23. Decomposition of precomposed characters with multiple diacritics
In principle, it would be possible for Unicode to include a precomposed character involving multiple diacritics but not to include any precomposed character involving some subset of those diacritics. In such a situation, the character would decompose directly into the fully decomposed combining sequence, meaning that the decomposition mapping includes more than two characters. In practice, though, this situation does not occur in the Unicode character set. Furthermore, the Unicode Consortium has given an explicit guarantee that canonical decompositions will never involve mapping to more than two characters.18
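Repeated application of the canonical mappings can be sketched as a short recursive function. The code below is an illustration only: full_canonical_decomposition is a hypothetical name, it reads mappings through Python's unicodedata module, and it omits the reordering of combining marks by combining class that a complete implementation would also need (which happens to make no difference for this particular character).

```python
import unicodedata

def full_canonical_decomposition(ch):
    """Apply canonical decomposition mappings until none remain."""
    mapping = unicodedata.decomposition(ch)
    # No mapping, or a tagged (compatibility) mapping: leave unchanged.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(
        full_canonical_decomposition(chr(int(codepoint, 16)))
        for codepoint in mapping.split()
    )

omega = "\u1FA7"
result = full_canonical_decomposition(omega)

# Each individual mapping has at most two characters ...
print(unicodedata.decomposition(omega))
# ... but three rounds of decomposition yield four characters.
print(" ".join(f"U+{ord(c):04X}" for c in result))
# For this character the result agrees with the library's own decomposition.
print(result == unicodedata.normalize("NFD", omega))
```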
As explained in Section 6.3, the fact that a precomposed character with multiple diacritics has a decomposition involving partially composed forms means that there may be several canonically equivalent representations for a given text element (as illustrated in Figure 16). Canonical equivalence is a transitive relationship, and so we intuitively know that all of the representations are canonically equivalent. Software processes cannot rely on such intuition, however. Software requires explicit and efficient algorithms that let it know that all of these representations are canonically equivalent. This is accomplished by means of normalization, which will be discussed in Section 10.
8 The bi-directional algorithm and other rendering behaviours
As has been mentioned, the Unicode character set assumes a design principle of encoding characters but not glyphs. This in turn implies an assumption that applications that use Unicode will incorporate rendering processes that can do the glyph processing required to make text appear the way it should, in accordance with the rules of each given script and writing system. Unicode does not specify in complete detail how the rendering process should be done. There are a number of approaches that an application might utilise to deal with the many details involved, and it is outside the scope of the Unicode Standard to stipulate how such processing should be done.
This does not mean that Unicode has nothing to say regarding the rendering process, however. There are some key issues that pertain to the more complex aspects of rendering that require common implementation in order to ensure consistency across all implementations. For example, under certain conditions consonant sequences in Devanagari script can be represented as ligatures,19 though in certain circumstances the sequence can also be rendered with the first consonant in a reduced “half” form instead. It is considered necessary that this distinction be reflected in plain text, and thus in the encoding of data. It is important that different implementations employ the same mechanisms for controlling this, and so the encoding mechanism is defined as a normative part of the Unicode Standard. One implication of this is that Unicode specifies how that aspect of the rendering of Devanagari script is to be done.
There are three major areas in which Unicode specifies the rendering behaviour of characters. The control of Indic conjuncts and half-forms is one of these. A second has to do with the connecting behaviour of cursive scripts, specifically Arabic and Syriac. The third has to do with bi-directional text, meaning horizontal text with mixed left-to-right and right-to-left line directions. I will give a brief overview of each of these here, beginning with bi-directional text and the bi-directional algorithm.
Those three areas are of particular significance because they affect important scripts in rather pervasive ways; one simply cannot render Arabic script, for example, without addressing the issues of bi-directionality and cursive connection. There are some other less significant rendering-related issues that Unicode addresses. These will also be described briefly.
8.1 Bi-directional text and the bi-directional algorithm
There are two issues that need to be dealt with in relation to bi-directional text. The bigger issue is the mixing of text with different directionality. The other has to do with certain characters that require glyphs for right-to-left contexts that are mirror images of those required for left-to-right contexts. Both of these issues are dealt with in Unicode by the Unicode bi-directional algorithm.
[Section not yet completed]
8.2 Cursive connections
Certain scripts, such as Arabic and Syriac, have a distinctive characteristic of being strictly cursive: they are never written in non-cursive styles. The implication is that characters typically have at least four shapes corresponding to initial, medial, final and isolated (non-connected) contexts.20 Not all characters connect in all contexts, however. Thus, some characters connect to characters on both sides, others connect only on one side, and others never connect. Ultimately, the joining behaviour of a character depends upon the properties of that character as well as the properties of the characters adjacent to it.
Another behaviour that is closely associated with cursive writing is ligature formation. Both Arabic and Syriac scripts make use of a number of ligatures, many of which are optional, though some are obligatory in all uses of those scripts.21
Unicode assigns normative properties to characters in the Arabic and Syriac blocks that specify their joining behaviour.22 These are described and listed in Chapter 8 of TUS 3.0, and are also listed in machine-readable form in the ArabicShaping.txt data file. Chapter 8 also specifies explicit rules for interpreting the joining properties, as well as rules specifying the formation of obligatory ligatures in Arabic. For most of the rules, the action of the rule is essentially predictable from the meaning of the character properties involved. For example,
R2 A right-joining character X that has a right join-causing character on the right will adopt [a right-joining glyph form].
Thus, for those concerned with Arabic or Syriac rendering, it is the joining properties that are most significant.
Most Arabic and Syriac characters are in one of two joining classes: dual-joining, that is, characters that join to characters on both sides; and right-joining, that is characters that join to characters on the right side only. The other significant classes of characters are non-joining, transparent and join-causing. The join-causing characters are U+200D ZERO WIDTH JOINER and U+0640 ARABIC TATWEEL. The non-joining characters include U+0621 ARABIC LETTER HAMZA, U+0674 ARABIC LETTER HIGH HAMZA, U+06D5 ARABIC LETTER AE, U+200C ZERO WIDTH NON-JOINER as well as spaces, digits, punctuation and, of course, characters from other scripts. The transparent characters include all combining marks, such as U+0654 ARABIC HAMZA ABOVE, and all other format control characters.
There are some additional rules and classes for Syriac that pertain to the shaping of U+0710 SYRIAC LETTER ALAPH. Details are presented in Section 8.3 of TUS 3.0.
It should be noted that certain characters happen to occur only word-finally.23 These are classed as right-joining, even though these are derived from other characters that are classed as dual-joining. For example, U+0629 ARABIC LETTER TEH MARBUTA is classified as right-joining, although that Arabic letter is derived from the letter heh (U+0647 ARABIC LETTER HEH), which is a dual-joining character.
The characters U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER are control characters that can be used to control the shape of cursively-connecting characters. ZWNJ can be used to prevent a cursive connection that otherwise would occur. Likewise, ZWJ can be used to cause a connecting shape for a character to be used that otherwise would not occur. This is illustrated in Figure 24:
Figure 24. Effect of ZWNJ and ZWJ on cursively-connecting characters
Note, however, that adding U+200D to the left of (following) a right-joining character does not force it to become dual-joining or to join on the left.
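The way joining classes interact can be sketched in code. What follows is a deliberately simplified model, not the rule set from Chapter 8 of TUS 3.0: it covers only the handful of characters named in this section, the JOINING_CLASS table is typed in by hand rather than read from ArabicShaping.txt, and the single-letter class codes, the helper names and the form labels are assumptions made for the example. Ligature formation and the Syriac-specific rules are left out.

```python
# D = dual-joining, R = right-joining, U = non-joining,
# C = join-causing, T = transparent (hand-built, illustrative table).
JOINING_CLASS = {
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u0629": "R",  # ARABIC LETTER TEH MARBUTA
    "\u0621": "U",  # ARABIC LETTER HAMZA
    "\u0640": "C",  # ARABIC TATWEEL
    "\u200D": "C",  # ZERO WIDTH JOINER
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
    "\u0654": "T",  # ARABIC HAMZA ABOVE
}

def _neighbour_class(text, index, step):
    """Class of the nearest non-transparent neighbour, or 'U' at an edge."""
    index += step
    while 0 <= index < len(text):
        cls = JOINING_CLASS.get(text[index], "U")
        if cls != "T":
            return cls
        index += step
    return "U"

def contextual_forms(text):
    """Return a form label per character (None for non-shaping characters).

    text is in logical order, so the preceding character is the one
    that appears to the right in right-to-left display.
    """
    forms = []
    for i, ch in enumerate(text):
        cls = JOINING_CLASS.get(ch, "U")
        if cls not in ("D", "R"):
            forms.append(None)
            continue
        # Joins on its right if the preceding character can join leftwards.
        joins_right = _neighbour_class(text, i, -1) in ("D", "C")
        # Only dual-joining characters can join on their left.
        joins_left = cls == "D" and _neighbour_class(text, i, +1) in ("D", "R", "C")
        if joins_right and joins_left:
            forms.append("medial")
        elif joins_right:
            forms.append("final")
        elif joins_left:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(contextual_forms("\u0647\u0647\u0647"))  # three HEH in a row
print(contextual_forms("\u0647\u200C\u0647"))  # ZWNJ between two HEH
print(contextual_forms("\u200D\u0629"))        # ZWJ before TEH MARBUTA
print(contextual_forms("\u0629\u200D"))        # ZWJ after TEH MARBUTA
```

In this model the last two calls differ: a joiner preceding the right-joining character gives it a connected form, while a joiner following it changes nothing, which corresponds to the note above about U+200D to the left of a right-joining character.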
The fact that the joining properties of characters are normative can have an important bearing on how to apply characters in the Syriac or Arabic blocks to the writing of lesser-known languages. In particular, if a language community adopts Arabic script for writing their language but uses a character with different joining behaviour than is common, it may not be clear how to encode that character. The two scenarios that would be problematic would be if a character that is classified in Unicode as right-joining only is used in a given writing system with dual-joining behaviour, or if a character that is classified as non-joining is used in a given writing system with joining behaviour. There are no clear guidelines as to what the best course of action might be in such situations. Because the right-joining property of the encoded character is normative, a conformant application is expected to treat it as such. In this event, there are three options: to request that the property of the existing Unicode character be revised, to propose that a new character be added to Unicode, or to convince that language community to revise their orthography. The latter option may be feasible if literacy in that language is just being introduced and orthography decisions are still open. It would not likely be an option, however, if the orthography is in established usage. It should be noted that, although the joining properties are normative, it is possible for them to be amended, though such changes are not generally made lightly as they may have an impact on existing data and implementations.
8.3 Indic shaping behaviour
[Yet to be written.]
8.4 Stacking of non-spacing combining marks
When multiple non-spacing combining marks24 are applied to a single base character, the different marks might each have the same default position in relation to the base character. For example, there may be a pair of marks that would be centred directly over the base character if they occurred on their own. When they occur together, however, they would not both occur in exactly the same position. That would lead to an illegible result.25 In these situations, the usual way in which this is handled is to have the combining marks stack vertically outward from the base character, with the marks that are nearest after the base character in the data corresponding to the glyphs that are positioned most closely to the glyph of the base. This can be seen in the following example (which repeats Figure 13):
Figure 25. Vertical stacking of combining marks
Note that this is not followed in all writing systems, however. In some situations, certain combinations of combining marks that have the same positioning behaviour26 will be positioned side-by-side rather than vertically. This happens, for example, in Vietnamese. The two behaviours are illustrated in the following example:
Figure 26. Typical stacking of diacritics, and alternate side-by-side positioning
8.5 Variation selectors
Unicode includes some special control characters known as variation selectors. The purpose of these characters is to control the actual shape of a graphic character where the choice is not required by the context or otherwise determined by the behaviours for that script as a whole. The kind of variation this might be designed to deal with might be comparable to a change between similar font designs, such as choosing between alternate forms of “g”, yet for certain reasons it is decided in the given instances that this should be controlled by a character-based mechanism. Note, however, that these variation selector characters are not open to free use by users or font vendors as they might determine. When Unicode adds a variation selector to the standard, it is done on the basis that it has only a local effect, operating on a single character and not a run of text, and that it can be used only with specific characters and only with specific effects, both of which are specified in the Standard.
Three of these variation selectors were introduced for Mongolian in TUS 3.0, for example U+180B MONGOLIAN FREE VARIATION SELECTOR ONE, and additional variation selectors are under consideration for future versions to be applied to other scripts. No details are available as yet, however, regarding exactly which Mongolian characters these can be applied to, or what the specified effect on those characters is.
8.6 Control of ligature formation
It was mentioned above that a script may have obligatory ligatures that are always required, such as the lam-alef ligature in Arabic. Many scripts have discretionary ligatures, which can be added as the user considers appropriate by applying font formatting properties in rich-text data.27 In certain circumstances, however, it may be necessary for the choice of ligatures to be preserved in plain text, implying the need for a character-based control mechanism.
The semantics of U+200C ZERO WIDTH NON-JOINER and of U+200D ZERO WIDTH JOINER were extended in TUS 3.1 so that, in addition to their original function of controlling the shaping behaviour of cursively-connected characters, they can also control ligature formation. Specifically, ZWJ can be used to request a ligature form, and ZWNJ can be used to prevent a ligature form. For example, a ligature for “ct” would not typically be expected in text, but it could be requested by encoding the sequence <U+0063 LATIN SMALL LETTER C, U+200D ZERO WIDTH JOINER, U+0074 LATIN SMALL LETTER T>. Similarly, the “fi” ligature is relatively common, but one could ensure that the ligature is avoided by encoding the sequence <U+0066 LATIN SMALL LETTER F, U+200C ZERO WIDTH NON-JOINER, U+0069 LATIN SMALL LETTER I>.
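In terms of encoded data, these requests are simply extra characters in the string. The short sketch below builds the two sequences just described; request_ligature, block_ligature and strip_joiners are hypothetical helper names, and whether a ligature actually appears depends entirely on the font and rendering system, which this code does not touch.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

def request_ligature(first, second):
    """Encode a request that the two characters be ligated if possible."""
    return first + ZWJ + second

def block_ligature(first, second):
    """Encode a request that the two characters not be ligated."""
    return first + ZWNJ + second

def strip_joiners(text):
    """Remove both joiner controls, e.g. before a plain comparison."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

def show(text):
    """Format a string as a sequence of U+XXXX codepoints."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

ct = request_ligature("c", "t")
fi = block_ligature("f", "i")
print(show(ct))
print(show(fi))
# The underlying letters are unchanged once the controls are removed.
print(strip_joiners(ct), strip_joiners(fi))
```

Because the controls are ordinary characters in the data, processes such as searching or comparison may need to ignore them, which is the reason for including a helper like strip_joiners in the sketch.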
It is important to note that this use of ZWJ does not require conformant implementations to produce ligatures; it merely requests that a ligature be used if possible. Thus, older fonts that do not support this behaviour are not considered non-conformant. At the same time, font developers should take note of this new mechanism in order to provide support for it in fonts. Because of the nature of ligatures and the mechanism for using ZWNJ to block ligatures, however, it should not require any special steps in order to make it work, as long as an instance of ZWNJ in data does not result in the display of a visible glyph.
It should also be noted that this general principle does not apply to Indic scripts, for which ZWNJ and ZWJ have specific behaviours, as described above.
For more details on the use of ZWJ and ZWNJ to control ligature formation, see Article V of UAX #27.
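The two sequences described in this section can be built directly in code. The following is a minimal sketch in Python, using only the standard unicodedata module; the helper names request_ligature, block_ligature and describe are hypothetical and chosen purely for illustration. Whether a ligature is actually displayed still depends on the font and the rendering system.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER: requests a ligature if one is available
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: asks that a ligature be avoided

def request_ligature(first, second):
    """Hypothetical helper: encode <first, ZWJ, second>."""
    return first + ZWJ + second

def block_ligature(first, second):
    """Hypothetical helper: encode <first, ZWNJ, second>."""
    return first + ZWNJ + second

def describe(text):
    """List the code point and character name of each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

ct = request_ligature("c", "t")  # the "ct" example from the text
fi = block_ligature("f", "i")    # the "fi" example from the text

for line in describe(ct) + describe(fi):
    print(line)
```

Note that the joiner characters are part of the encoded data: each of these strings contains three characters, so a plain substring search for "fi" will not match the second one unless the search process ignores ZWNJ.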
8.7 Other rendering behaviours
I have focused on particular rendering behaviours in the preceding sections because they are likely to be highly important for many implementations, because they involve important issues that apply in a number of situations, or in the last two cases, because they are specialised behaviours recently introduced into the Standard that may be unfamiliar. There are a number of other rendering behaviours that are relevant to other situations, but which are not being described in any detail here. Some of these do not involve normative behaviours defined by the Standard. For example, there are different preferences for the shapes of Han ideographs in different locales. This is an important issue for certain implementations, but it is not a normative behaviour that is specified by Unicode. Also, some of these other rendering behaviours would be of less importance to many implementations. For example, the details for rendering Khmer script that differ from the common patterns described for Indic scripts in general are fairly particular to that script.
Even for the groups of scripts that pertain to the rendering behaviours described above, only a summary has been given. Thus, for any script, the reader is advised to read the appropriate section within Chapters 6–13 of TUS 3.0 to learn about the complete details that relate to that particular script.28
9 Combining marks and canonical ordering
[Yet to be written. This section will discuss the distinction between combining marks that interact typographically and those that do not, and what that distinction implies for the semantic significance of combining character sequences. It will be shown that combining character sequences that differ only in the ordering of combining marks can be equivalent sequences. The role of combining classes and the canonical ordering algorithm in neutralising the distinctions between equivalent sequences will be explained. A synopsis follows:
When multiple combining marks occur with a single base, those combining marks may or may not interact typographically. If they don’t, then the order in which they occur in the file doesn’t correspond to any change in appearance, and therefore doesn’t correspond to any meaningful difference. This results in another way in which a given text element can have different encoded representations in Unicode.
If the combining marks do interact typographically, however, then different orders do correspond to different appearance—typically the diacritics are stacked out from the base. This will represent a difference in meaning to the user. Thus, there are some cases in which differently ordered sequences are semantically different, and other cases in which they should be considered equivalent.
A canonical combining class is assigned to every character in order to indicate when combining marks do or do not interact typographically—those with the same combining class do interact. This can be used by software to determine when a difference in ordering is significant or not: different orderings only matter when the combining classes are the same.
Canonical ordering is a process that puts combining marks into a specific order. Its sole purpose is to determine whether two combining sequences are canonically equivalent: put both sequences into canonical order; if the results are identical, the sequences are canonically equivalent. It is not necessary for data to be stored with combining marks in canonical order; software should be able to accommodate combining marks in any order. In many situations, though, it may not be a bad idea to generate data in canonical order. Note, however, that there are some cases in which the canonical order is not what one might normally encounter in actual data. This is not a problem, since strictly speaking the only purpose of canonical ordering is to compare strings to tell whether they are canonically equivalent.
The possibility of equivalent sequences of combining marks occurring in different orders can have implications for other processes. For example, a sorting specification or a rendering system might be configured assuming a particular order. As a result, it is important that such processes allow for alternate equivalent orderings, or that canonical ordering be included in a pre-processing stage. Processes should be designed anticipating different but equally valid possibilities in the data.
Canonical ordering is an important part of normalization, which is the topic of the next section.]
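The synopsis above can be illustrated with a short sketch in Python. It assumes the combining classes reported by the standard unicodedata module, and the function canonical_order is a hypothetical, simplified helper: it only reorders combining marks and does not decompose precomposed characters, so it is not a full normalization routine.

```python
import unicodedata

def canonical_order(text):
    """Reorder each run of combining marks by canonical combining class.

    sorted() is stable, so marks with the same class keep their
    relative order; only marks with different classes can be swapped.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A character with class 0 ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, class 230 (above)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, class 230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, class 220 (below)

# Different classes: the marks do not interact, so order is not significant.
same = canonical_order("a" + ACUTE + DOT_BELOW) == canonical_order("a" + DOT_BELOW + ACUTE)

# Same class: the marks stack, so the two orders remain distinct.
different = canonical_order("a" + ACUTE + DIAERESIS) != canonical_order("a" + DIAERESIS + ACUTE)

print(same, different)
```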
10 Normalization
[Yet to be written. This section will briefly explain the various Unicode normalization forms that can be used to solve the problem of needing to recognise equivalent sequences as being equivalent. A synopsis follows:
In Sections 6.3 and 9, we saw various reasons why a given text element may be represented in Unicode in multiple ways: precomposed characters are canonically equivalent to their full or partial decompositions. Also, different orderings of combining marks may not be semantically significant, again resulting in different sequences that are canonically equivalent.
Software processes need a way to determine whether any two arbitrary strings are equivalent. This is done by folding the strings into a normalized form; that is, by removing the differences that are not significant. For example, by putting strings into canonical order, the distinctions that result from different orderings of combining marks are removed.
Unicode defines different normalization forms for software processes to use in string comparisons. One of these, NFD, utilises the maximally decomposed sequence as the reference representation for comparison purposes. There is a second, NFC, that uses a maximally composed representation (following special rules for how the composition is done). In both cases, canonical ordering is applied.
There are two other normalization forms that have been defined: NFKD and NFKC. These use maximally decomposed and maximally composed representations as above. The difference is that these normalization forms also fold the compatibility character distinctions: e.g. the distinction between n and superscript n is removed. Because the compatibility characters carry additional information that may be significant in some situations, these normalization forms must be used with care.
There is no general requirement that data must be represented in any one normal form. On the other hand, one may be the preference in certain contexts, or one may be imposed in some software implementations.]
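As a rough illustration of the four normalization forms, the following Python sketch uses unicodedata.normalize from the standard library. The helper names forms, canonically_equivalent and compatibility_equivalent are hypothetical and introduced only for this example.

```python
import unicodedata

def forms(text):
    """Return the four normalization forms of text, keyed by form name."""
    return {f: unicodedata.normalize(f, text) for f in ("NFD", "NFC", "NFKD", "NFKC")}

def canonically_equivalent(a, b):
    """Compare two strings after folding both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """Compare two strings after also folding compatibility distinctions."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
superscript_n = "\u207F" # SUPERSCRIPT LATIN SMALL LETTER N

print(canonically_equivalent(precomposed, decomposed))   # True
print(canonically_equivalent("n", superscript_n))        # False
print(compatibility_equivalent("n", superscript_n))      # True
```

The last two lines show why NFKD and NFKC must be used with care: the superscript distinction is kept by the canonical forms but lost once the compatibility forms are applied.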
11 Deciding how to encode data
[Yet to be written. This section will discuss various issues related to choices in how data can be encoded. This includes: deciding when to follow decomposed or composed normalisation forms; deciding whether or not to use compatibility characters / characters with compatibility decompositions; distinguishing between characters that appear to be similar, or equating glyph variants; the importance of semantics in deciding when to distinguish or unify characters. There will also be some discussion as to what can be done when a character simply is not yet supported in Unicode.]
12 The terms of the Unicode Standard: conformance requirements and stability guarantees
[Yet to be written. This section will give an overview of the conformance requirements of the Standard. These are requirements that software must adhere to in order to conform to Unicode. I will also discuss the stability guarantees that the Unicode Consortium makes in relation to the Standard.]
12.1 Conformance requirements
12.2 Stability guarantees
13 Unicode in a larger context
Unicode is an important technology for software internationalisation and for working with multilingual text, but it is not a panacea. It is but one part of an overall system and must interact with other parts. In this section we will consider Unicode within a larger context, looking at it from two perspectives: various levels of text data representation, and various components within an overall system for processing textual data.
13.1 Where Unicode fits in a larger data context
Unicode provides a means for encoding plain text character data. In a sense, then, it represents the lowest level in a three-level hierarchy of text data representation, as shown in Table 8:
Table 8. Three levels of data representation
In a system for processing text data, there must be components that deal with information on each level, and appropriate standards are necessary to define mechanisms for representation and processing on each level. Unicode does this only for the level of character encoding.
In practice, there is not always a clean division between the document and character encoding levels. This is true in particular for mark-up languages, such as RTF, XML and HTML, which utilise character sequences as mechanisms for controlling formatting or for delimiting higher-level structures as part of the document encoding. Because of these mechanisms, some characters require special representation when used within the textual content of a document that is encoded using such a mark-up language. For instance, in XML, the character “<” has special meaning as part of the mark-up formalism, and if you need to add that character as part of the content of a document, then that portion of the content must be represented in some indirect way. For XML, these issues are documented in the XML specification (Bray et al 2000).
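As a small illustration of such indirect representation, the sketch below uses Python's standard library to escape content before placing it in an XML element, and then parses the result to confirm that the original characters are recovered. The sample string and the element name p are arbitrary choices for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

content = "if a < b & b > c"

# escape() replaces "&", "<" and ">" with the entity references
# "&amp;", "&lt;" and "&gt;" so they are not read as mark-up.
escaped = escape(content)
document = f"<p>{escaped}</p>"

# Parsing the document reverses the escaping and yields the original text.
parsed = ET.fromstring(document)

print(escaped)
print(parsed.text == content)
```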
In certain situations, some aspects of the semantic content of a document could potentially be encoded either at the character level, in terms of particular character sequences, or at the document encoding level, using some mark-up convention. For example, mathematical formulas use textual characters, but can also involve operators that require certain visual control for presentation purposes. It may not always be clear whether such semantics are best represented in terms of character sequences or in terms of mark-up.
To consider an example more closely related to languages and linguistics, we have seen that some writing systems involve bi-directional text. In certain situations, the directional properties of characters are not sufficient to provide the exact level of control of the text direction that a user may require. As a result, it is necessary to have some encoding mechanism to control the directional behaviour. In principle, this could be done using either character sequences or mark-up. Indeed, both Unicode and HTML provide mechanisms for this very purpose. For example, the Unicode character U+202E RIGHT-TO-LEFT OVERRIDE, forces subsequent text to be treated as though the characters have strict right-to-left directionality. At the same time, the HTML specification (Raggett et al 1999) defines the BDO element to achieve exactly the same result. Given that mechanisms are available at both levels, it would be possible in principle to mix both in a single document. This must be avoided, however, since it can result in ambiguous encoding states.
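One way to avoid mixing the two mechanisms is to convert character-level controls to mark-up when text is placed in an HTML document. The following Python sketch is one possible approach under simplifying assumptions, not a complete solution: the function rlo_to_bdo is hypothetical, it handles only non-nested runs that begin with U+202E and end with U+202C POP DIRECTIONAL FORMATTING (the character that terminates an override), and it ignores the other directional formatting characters.

```python
import html
import re
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING, which ends the override

# A non-nested run: RLO, then any characters other than RLO or PDF, then PDF.
_OVERRIDE_RUN = re.compile(f"{RLO}([^{RLO}{PDF}]*){PDF}")

def rlo_to_bdo(text):
    """Hypothetical helper: replace RLO...PDF runs with an HTML BDO element."""
    escaped = html.escape(text, quote=False)  # protect "<", ">" and "&" first
    return _OVERRIDE_RUN.sub(r'<bdo dir="rtl">\1</bdo>', escaped)

sample = "abc " + RLO + "def" + PDF + " ghi"
print(rlo_to_bdo(sample))

# The bidirectional category of each control is available from the
# Unicode character database.
print(unicodedata.bidirectional(RLO), unicodedata.bidirectional(PDF))
```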
Another example worth mentioning involves superscripts and subscripts. Various superscript and subscript characters are encoded in Unicode, such as U+02B0 MODIFIER LETTER SMALL H.30 These can be considered formatting variants of their nominal counterparts, and such formatting may well be supported in a markup language. Thus, HTML defines the elements SUP and SUB for this purpose. Once again, mechanisms are available at both the character and document encoding levels. The two mechanisms may interfere with one another, for instance when subscript formatting is applied to a superscript Unicode character, leading to unexpected results. This is shown in Figure 27 with a normal, non-superscripted character for comparison:
Figure 27. Interference between characters and markup
This is clearly not a desired result and can be confusing to users. Users might also experience frustration when processes do not succeed as expected; for example, when a search does not return results because a superscript “ʰ” is encoded as U+02B0 but the user is searching for U+0068 contained within a SUP element. In most contexts, it would be preferable to encode superscripts and subscripts using markup such as the HTML SUP or SUB element. In certain situations in which superscripting or subscripting are used with particular semantics attached, however, it may be easier to work with those meanings by having them encoded in the characters rather than in markup. This would be the case with phonetic transcription, for example.
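The search problem can be reproduced in a few lines of Python. This is only a sketch: it shows that a plain substring search for U+0068 fails against text containing U+02B0, and that folding the text with a compatibility normalization form (NFKC here, via the standard unicodedata module) is one way a search process could make the two match, at the cost of discarding the superscript distinction.

```python
import unicodedata

text = "p\u02B0"  # "p" followed by MODIFIER LETTER SMALL H
query = "h"       # U+0068 LATIN SMALL LETTER H

# A plain search compares code points, so the superscript form is not found.
found_plain = query in text

# U+02B0 has a compatibility decomposition to U+0068, so folding the
# text with NFKC lets the same search succeed.
folded = unicodedata.normalize("NFKC", text)
found_folded = query in folded

print(found_plain, found_folded)
```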
Some other important cases in which Unicode control characters and mark-up control mechanisms may conflict are discussed in UTR #20 (Dürst and Freytag 2000).
13.2 Where Unicode fits in a larger implementation context
In Section 13.1, we saw that Unicode as a character encoding standard interacts with other standards that relate to higher levels of text data representation. If we focus only on the level of character data processing and plain text, Unicode is again but one of a collection of technologies that interact.
As described in Understanding characters, keystrokes, codepoints and glyphs and also in Becker (1984), systems for working with text involve multiple interacting components. It is useful in this regard to consider a five-part text-processing model:
In this model, encoding corresponds to the memory representation and storage of text. Input and rendering correspond to the most fundamental of text-based processes: generating the text (typically using a keyboard), and viewing/displaying the text. Analysis represents a collection of many secondary processes for working with text: sorting, case mapping, hyphenation, morphological parsing, etc. Conversion represents a supporting process of transforming text data between one character encoding and another.31
The encoding component in this model has central importance since the processes in each of the other components are implemented in terms of the character encoding. If the character encoding is to be Unicode, then the various text processes must be implemented in terms of Unicode.
This presents significant implications for any implementations that use Unicode. Firstly, it is necessary to have keyboards or other input methods that generate Unicode-encoded data. Likewise, it is necessary to have fonts that are based on Unicode encoding. The use of Unicode also introduces rendering requirements that relate to complex script support. (This is discussed further in Guidelines for Writing System Support: Technical Details: Smart Rendering: Part 3 and Rendering technologies overview.) Similarly, applications that process text and any supporting components for analytical text processes must be implemented using Unicode. In addition, for as long as users need to work with text encoded in legacy encodings, they require tools for encoding conversion that can map between Unicode and other encodings. Operating system software generally plays an important role in many or all of these areas as it provides services to the application; support for Unicode in the operating system is therefore likewise important.
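The conversion component can be sketched with the codecs built into Python. The example assumes ISO 8859-1 as the legacy encoding and a short sample string; it simply shows data being mapped into Unicode and back again without loss.

```python
# Bytes as they might be stored in a legacy ISO 8859-1 file.
legacy_bytes = b"caf\xe9"

# Convert from the legacy encoding into Unicode characters.
text = legacy_bytes.decode("iso-8859-1")

# The same text can then be stored in a Unicode encoding form such as UTF-8.
utf8_bytes = text.encode("utf-8")

# Converting back to the legacy encoding reproduces the original bytes.
round_trip = text.encode("iso-8859-1")

print(round_trip == legacy_bytes)
```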
It is important to see Unicode in the broader context of a complete text-processing model. Practical implementation using Unicode requires that these other pieces of the overall puzzle—software, fonts and complex-script rendering support, and keyboards—are all in place and interacting together. If any one of the pieces does not support Unicode, the system is incomplete.32
At the time of writing, the level of support for Unicode in software is advancing, but still has far to go. Many applications have begun to support Unicode encoding of data, though in most cases this still extends only to the Basic Multilingual Plane. Furthermore, an application’s awareness of character semantics and the support for input methods or rendering is typically limited to only certain blocks.
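To make the Basic Multilingual Plane limitation concrete, the following Python sketch checks whether a character lies beyond U+FFFF and counts the 16-bit code units it needs in UTF-16. The helper names is_bmp and utf16_code_units are hypothetical; U+10330 GOTHIC LETTER AHSA is used as a sample supplementary-plane character.

```python
def is_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

def utf16_code_units(text):
    """Number of 16-bit code units needed to represent text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2

gothic = "\U00010330"  # GOTHIC LETTER AHSA, outside the BMP

print(is_bmp("a"), utf16_code_units("a"))
print(is_bmp(gothic), utf16_code_units(gothic))
```

Software that assumes one 16-bit code unit per character handles the first case but not the second, which requires a surrogate pair.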
In general, the key areas in which limitations remain are not with the encoding of data, but with the text processes that interact with the encoded data: input, rendering, etc. In particular, the limited availability of fonts that support Unicode and complex-script rendering technologies, and especially of applications that support those rendering technologies, is probably the biggest obstacle to working productively with multilingual text using Unicode.
14 Additional resources
[Yet to be written. This section will point the reader to sources of additional information regarding Unicode, and regarding where to find implementation resources such as fonts, input methods or software.]
Becker, Joe. 1984. “Multilingual word processing.” In Scientific American, July 1984: 96–107.
Becker, Joe, and Andy Daniels. 1992. Unicode technical report #2: Sinhala, Tibetan, Mongolian. (Cited on the Unicode Web site as “Proposals for Sinhala, Mongolian, and Tibetan.”) Cupertino, CA: The Unicode Consortium. Available online at http://www.unicode.org/unicode/reports/tr2.html.
Becker, Joe, and Rick McGowan. 1992–1993. Exploratory proposals. Revision 2. (Cited on the Unicode Web site as “Proposals for less common scripts.”) Cupertino, CA: The Unicode Consortium. Available online at http://www.unicode.org/unicode/reports/tr3/.
Bray, Tim; Jean Paoli; C. M. Sperberg-McQueen; and Eve Maler (eds.) 2000. Extensible Markup Language (XML) 1.0 (second edition). W3C recommendation 6 October 2000. Cambridge, MA: The World Wide Web Consortium. The current version is available online at http://www.w3.org/TR/REC-xml.
Constable, Peter. 2000. Understanding Multilingual software on MS Windows: The answer to the ultimate question of fonts, keyboards and everything. ms. Available in CTC Resource Collection 2000 CD-ROM, by SIL International. Dallas: SIL International.
Daniels, Andy; Lloyd Anderson; Glenn Adams; Lee Collins; and Joe Becker. 1992. Unicode technical report #1: Burmese, Khmer, Ethiopian. (Cited on the Unicode Web site as “Proposals for Burmese (Myanmar), Khmer, and Ethiopian.”) Cupertino, CA: The Unicode Consortium. Available online at http://www.unicode.org/unicode/reports/tr1.html.
Davis, Mark. 1991. Unicode technical report #5: Handling non-spacing marks. Cupertino, CA: The Unicode Consortium.
——. 1993. Unicode technical report #4: The Unicode Standard, Version 1.1. Cupertino, CA: The Unicode Consortium.
——. 2000a. Unicode technical report #18: Unicode regular expression guidelines. Version 5.1. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr18/.
——. 2000b. Unicode technical report #22: Character mapping markup language. Version 2.2. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr22/.
——. 2001a. Unicode standard annex #9: The bidirectional algorithm. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr9/.
——. 2001b. Unicode standard annex #13: Unicode newline guidelines. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr13/.
——. 2001c. Unicode standard annex #19: UTF-32. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr19/.
——. 2001d. Unicode technical report #21: Case mappings. Version 4.3. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr21/.
——. 2001e. Unicode technical report #24: Script names. Version 2. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr24/.
Davis, Mark, and Martin Dürst. 2001. Unicode standard annex #15: Unicode normalization forms. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr15/.
Davis, Mark; Michael Everson; Asmus Freytag; John H. Jenkins; and others (eds.) 2001. Unicode standard annex #27: Unicode 3.1. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr27/.
Davis, Mark, and Ken Whistler. 2001a. Unicode technical standard #10: Unicode collation algorithm. Version 8.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr10/.
——. 2001b. Unicode character database. Revision 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html.
Dürst, Martin, and Asmus Freytag. 2000. Unicode in XML and other markup languages. Unicode technical report #20, revision 5. W3C note 15 December 2000. Cupertino, CA: The Unicode Consortium; and Cambridge, MA: The World Wide Web Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr20/ or http://www.w3.org/TR/unicode-xml/.
Freytag, Asmus. 2000. Unicode standard annex #14: Line breaking properties. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr14/.
——. 2001. Unicode standard annex #11: East Asian width. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr11/.
International Organization for Standardization, and International Electrotechnical Commission. 1993. Information technology—universal multiple-octet coded character set (UCS)—part 1: Architecture and basic multilingual plane (ISO/IEC 10646-1:1993). Geneva: International Organization for Standardization.
——. 1998. Information technology—an operational model for characters and glyphs (ISO/IEC TR 15285). Geneva: International Organization for Standardization. Available online at http://isotc.iso.ch/livelink/livelink/fetch/2000/2489/Ittf_Home/PubliclyAvailableStandards/C027163e.pdf.
——. 2000a. Information technology—universal multiple-octet coded character set (UCS)—part 1: Architecture and basic multilingual plane. 2nd edition. (ISO/IEC 10646-1:2000). Geneva: International Organization for Standardization.
——. 2000b. Information technology—Code for the representation of the names of scripts. DIS for ISO 15924:2000. This draft ISO standard has been unofficially published online at http://www.egt.ie/standards/iso15924/document/dis15924.pdf.
Moore, Lisa. 1999. Unicode technical report #8: The Unicode standard®, version 2.1. Revision 3.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr8/.
Raggett, Dave; Arnaud Le Hors; Ian Jacobs (eds.) 1999. HTML 4.01 specification. W3C recommendation 24 December 1999. Cambridge, MA: The World Wide Web Consortium. The current version of this W3C recommendation is available online at http://www.w3.org/TR/html4/.
Umamaheswaran, V.S. 2001. Unicode technical report #16: UTF-EBCDIC. Version 7.2. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr16/.
The Unicode Consortium. 1991, 1992. The Unicode Standard, Version 1.0. Reading, MA: Addison-Wesley Developers Press.
——. 1996. The Unicode Standard, Version 2.0. Reading, MA: Addison-Wesley Developers Press.
——. 2000. The Unicode Standard, Version 3.0. Reading, MA: Addison-Wesley Developers Press.
Whistler, Ken, and Glenn Adams. 2001. Unicode technical report #7: Plane 14 characters for language tags. Revision 4.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr7/.
Whistler, Ken, and Mark Davis. 2000. Unicode technical report #17: Character encoding model. Revision 3.2. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr17/.
Wolf, Misha; Ken Whistler; Charles Wicksteed; Mark Davis; and Asmus Freytag. 2000. Unicode technical standard #6: A standard compression scheme for Unicode. Version 3.2. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr6/.
© 2003-2014 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.