Computers & Writing Systems
Implementing Writing Systems: An introduction
Understanding Unicode™ - II
This article begins at: Understanding Unicode™: A general introduction to the Unicode Standard (Sections 1-5).
6 Backwards compatibility: the soft underbelly of Unicode
Thus far, we have looked at idealistic principles. It should not be surprising that reality involves other practical considerations that might compromise those ideals. In fact, most or all of the original design ideals for Unicode have been compromised to some extent.
The main practical consideration had to do with backward compatibility. In order for Unicode to succeed, it had to be possible for developers to begin implementing the Standard within the context of a large amount of existing software and data that was based on legacy encoding standards. Thus, one of the essential requirements for Unicode was the ability to convert data reliably from legacy encodings to Unicode and vice versa.
For this purpose, the Unicode character set was initially based on source legacy character sets that included industry standard character sets in wide usage as of or prior to May 1993. Support for these standards was defined in terms of the round-trip rule: it must be possible to convert plain text data from a source encoding into Unicode and then back again and end up with exactly the same data you started with.
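The round-trip rule can be illustrated with a minimal sketch in Python. The helper name round_trips is hypothetical, and ISO 8859-1 (Latin-1) is chosen here only as a convenient example of a legacy encoding; the same check could be applied to any codec the interpreter supports.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Return True if data survives legacy -> Unicode -> legacy unchanged."""
    text = data.decode(encoding)          # legacy bytes to a Unicode string
    return text.encode(encoding) == data  # and back to legacy bytes

# Every byte value in Latin-1 corresponds to exactly one Unicode character,
# so the complete byte range should convert and return unchanged.
all_latin1_bytes = bytes(range(256))
latin1_ok = round_trips(all_latin1_bytes, "latin-1")
```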
In practice, what this has meant is that Unicode must retain any character distinction that was made in a source standard. For example, if a source standard contained a character for the Latin letter “I” and also contained a distinct character with the same shape but used as the Roman numeral for one, then Unicode also needed to contain two distinct characters for the two different uses.
The net effect of maintaining round-trip compatibility is duplication in the Unicode character set. Not all instances of duplication are comparable to one another, however, and some cases are less obviously instances of duplication than others. These issues lie behind the aspects of Unicode that are most difficult to understand, which we have yet to discuss. For someone implementing Unicode, either as a software developer or as someone determining how data should be encoded, these issues are important to understand.
Gaining a complete understanding of all of the characters in Unicode that were added for reasons of backward compatibility is more than can be covered in an introduction. We can get a start, though, by looking at the different ways in which the principles described in Section 5 have been compromised.1
6.1 Compromises in the principle of unification
The first type of duplication we will consider involves a simple violation of unification: Unicode has pairs of characters that are effectively one-for-one duplicates. For instance, U+212A KELVIN SIGN “K” was encoded as a distinct character from U+004B LATIN CAPITAL LETTER K because these were distinguished in some source standard; otherwise, there is no need to distinguish these in shape or in terms of how they are processed. The same situation applies to other pairs, such as U+212B ANGSTROM SIGN versus U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, and U+037E GREEK QUESTION MARK versus U+003B SEMICOLON.
In the source legacy standards, the two characters in a pair would have been assumed to have distinct functions. In these cases, however, the distinctions can be considered artificial. For example, in the case of the Kelvin sign and the letter K, there is no distinction in shape or in how they are to be handled by text processes. This is purely a matter of interpretation at a higher level. Distinguishing these would be like distinguishing the character “I” when used as an English pronoun from “I” used as an Italian definite article. This is not the type of distinction that needs to be reflected in a character encoding standard.
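The three pairs mentioned above can be inspected with Python's standard unicodedata module. This is only a sketch: the helper singleton_target is a hypothetical name, and the module reports data for whichever Unicode version ships with the interpreter rather than for Version 3.1 specifically.

```python
import unicodedata

SINGLETON_PAIRS = {
    "\u212A": "\u004B",  # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212B": "\u00C5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u037E": "\u003B",  # GREEK QUESTION MARK -> SEMICOLON
}

def singleton_target(ch: str) -> str:
    """Return the single character that ch is mapped to in the character database."""
    mapping = unicodedata.decomposition(ch)  # a hex string such as "004B"
    return chr(int(mapping, 16))

results = {ch: singleton_target(ch) for ch in SINGLETON_PAIRS}
```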
In TUS 3.1, there are over 800 exact duplicates for Han ideographs. These are located in the CJK Compatibility Ideographs (U+F900..U+FAFF) and the CJK Compatibility Ideographs Supplement (U+2F800..U+2FA1F) blocks. Note, however, that not every character in the former block is a duplicate. As the characters were being collected from various sources, these were prepared as a group and kept together. There are, in fact, twelve characters in this block that are unique and are not duplicates of any other character, such as U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E. Thus, you need to pay close attention to details in order to know what the status of any character is. In particular, you cannot assume that a character is a duplicate just because it is in the compatibility area of the BMP (U+F900..U+FFFF) or is from a block that has “compatibility” in the name.
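One way to check the status of each character in a block, rather than relying on the block name, is sketched below. The function name unique_in_block is hypothetical, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def unique_in_block(first: int, last: int) -> list:
    """List assigned characters in a range that have no decomposition mapping."""
    unique = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":
            continue  # unassigned in this interpreter's Unicode data
        if unicodedata.decomposition(ch) == "":
            unique.append(f"U+{cp:04X}")
    return unique

# CJK Compatibility Ideographs block
unique_cjk = unique_in_block(0xF900, 0xFAFF)
```

A character such as U+FA0E appears in the resulting list, while a true duplicate such as U+F900 does not.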
Beyond the duplicate Han characters, there are another 33 of these singleton (one-to-one) duplicates in the lower end of the BMP, most of them in the Greek (U+0370..U+03FF) or Greek Extended (U+1F00..U+1FFF) blocks.
6.2 Compromises in the principle of dynamic composition
The second type of duplication we will look at involves cases in which a character duplicates a sequence of characters. As described in Section 5.6, a text element may be represented as a combining character sequence. For instance, Latin a-circumflex with dot-below “ậ” can be represented as a sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW, U+0302 COMBINING CIRCUMFLEX ACCENT>. In such cases, it is assumed that an appropriate rendering technology will be used that can do the glyph processing needed to correctly position the combining marks. In many cases, however, legacy standards encoded precomposed base-plus-diacritic combinations as characters rather than composing these combinations dynamically. As a result, it was necessary to include precomposed characters such as U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW in Unicode.
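The correspondence between the precomposed character and the combining sequence can be observed with a short Python sketch. It relies on the normalization forms NFD and NFC from the standard unicodedata module, which is one convenient way to convert between the two representations and is used here purely for illustration.

```python
import unicodedata

precomposed = "\u1EAD"             # LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
decomposed = "\u0061\u0323\u0302"  # a + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT

# NFD expands the precomposed character; NFC recombines the sequence.
to_decomposed = unicodedata.normalize("NFD", precomposed)
to_precomposed = unicodedata.normalize("NFC", decomposed)
```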
Unicode has a very large number of precomposed characters, including 730 for Greek and Latin, and over 11,000 for Korean! Most of the Latin Extended Additional (U+1E00..U+1EFF) and the Greek Extended (U+1F00..U+1FFF) blocks and the entire Hangul Syllables block (U+AC00..U+D7AF) are filled with precomposed characters. There are also over 200 precomposed characters for Japanese, Cyrillic, Hebrew, Arabic and various Indic scripts.
Not all of the precomposed characters are from scripts used in writing human languages. There are a few dozen precomposed mathematical operators; for example, U+2260 NOT EQUAL TO, which can also be represented as the sequence <U+003D EQUALS SIGN, U+0338 COMBINING LONG SOLIDUS OVERLAY>. There are also 13 precomposed characters for Western musical symbols in Plane 1; for example, U+1D15F MUSICAL SYMBOL QUARTER NOTE, which can also be represented as the sequence <U+1D158 MUSICAL SYMBOL NOTEHEAD BLACK, U+1D165 MUSICAL SYMBOL COMBINING STEM>.
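The same kind of check works for precomposed characters outside Latin and Greek. In this sketch the helper nfd_codepoints is a hypothetical name, and the Hangul syllable U+D55C is an example of my own choosing rather than one taken from the text.

```python
import unicodedata

def nfd_codepoints(ch: str) -> list:
    """Return the fully decomposed form of ch as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

not_equal = nfd_codepoints("\u2260")         # NOT EQUAL TO
hangul = nfd_codepoints("\uD55C")            # a precomposed Hangul syllable
quarter_note = nfd_codepoints("\U0001D15F")  # MUSICAL SYMBOL QUARTER NOTE
```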
Precomposed characters go against the principle of dynamic composition, but also against the principle that Unicode encodes abstract characters rather than glyphs. In principle, it should be possible for a combining character sequence to be rendered so that the glyphs for the combining marks are correctly positioned in relation to the glyph for the base, or even so that the character sequence is translated into a single precomposed glyph. In these cases, though, that glyph is directly encoded as a distinct character.
There are other cases in which the distinction between characters and glyphs is compromised. Those cases have some significant differences from the ones we have considered thus far. Before continuing, though, there are some important additional points to be covered in relation to the characters described in Sections 6.1 and 6.2. We will return to look at the remaining areas of compromise in Section 6.4.
6.3 Canonical equivalence and the principle of equivalent sequences
In Unicode 3.1, there are over 13,000 instances of the situations described in Sections 6.1 and 6.2. In each of these cases, Unicode provides alternate representations for a given text element. For singleton duplicates, what this means is that there are two codepoints that are effectively equivalent and mean the same thing:
Figure 14. Characters with exact duplicates
Likewise in the case of each precomposed character, there is a dynamically composed sequence that is equivalent and means the same thing as the precomposed character:
Figure 15. Precomposed characters and equivalent dynamically composed sequences
This type of ambiguity is far from ideal, but was a necessary price of maintaining backward compatibility with source standards and, specifically, the round-trip rule. In view of this, one of the original design principles of Unicode was to allow for a text element to be represented by two or more different but equivalent character sequences.
In these situations, Unicode formally defines a relationship of canonical equivalence between the two representations. Essentially, this means that the two representations should generally be treated as if they were identical, though this is slightly overstated. Let me explain in more detail.
In precise terms, the Unicode Standard defines a conformance requirement in relation to canonically equivalent pairs that must be observed by software that claims to conform to the Standard:2
Conformance requirement 9. A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.
In other words, software can treat the two representations as though they were identical, but it is also allowed to distinguish between them; it just cannot assume that they are always distinct.
Since the different sequences are supposed to be equal representations of exactly the same thing, it might seem that this requirement is stated somewhat weakly, and that it ought to be appropriate to make a much stronger requirement: software must always treat canonical-equivalent sequences identically. Most of the time, it does make sense for software to do that, but there may be certain situations in which it is valid to distinguish them. For example, you may want to inspect a set of data to determine if any precomposed characters are used. You could not do that if precomposed characters are never distinguished from the corresponding dynamically composed sequence.
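Both situations can be sketched in Python: a comparison that treats canonical-equivalent sequences as identical, and an inspection that deliberately distinguishes them. The function names are hypothetical, and comparing NFD forms is just one possible way of implementing the comparison.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring canonical-equivalent differences."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def find_precomposed(text: str) -> list:
    """List characters in text that carry a canonical decomposition mapping.

    Caveat: this relies on the per-character mapping in the database, so
    characters that decompose algorithmically (Hangul syllables) may not
    be reported by this lookup.
    """
    found = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith("<"):  # "<...>" marks compatibility mappings
            found.append(ch)
    return found
```

With these helpers, canonically_equal("\u1EAD", "a\u0323\u0302") holds, while find_precomposed reports the precomposed character only for the first of the two strings.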
At this point, I can introduce some convenient terminology that is conventionally used. Whereas a character like U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW is referred to as a precomposed character, the corresponding combining character sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW, U+0302 COMBINING CIRCUMFLEX ACCENT> is often referred to as a decomposed representation or a decomposed character sequence. The relationship between a precomposed character and the equivalent decomposed sequence is formally known as canonical decomposition. These mappings are specified as part of the semantic character properties contained in the online data files that were mentioned in Section 2.2.3 The same terminology applies to singleton duplicates: for example, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE is said to be the canonical decomposition of U+212B ANGSTROM SIGN.
In some cases a text element can be represented as either a fully composed or a fully decomposed sequence. In many cases, however, there will be more than just these two representations. In particular, this can occur if there is a precomposed character that corresponds to a partial representation of another text element. For example, the text element a-ring-acute can be represented in Unicode using a fully composed or a fully decomposed representation, but also using a partially decomposed representation:
Figure 16. Equivalent precomposed, decomposed and partially composed representations
Thus, in this example, there are four possible representations that are all canonically equivalent to one another. We will look at the possibility of having multiple equivalent representations further in Section 9.
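As a sketch of how four such representations can arise, the following assumes that the text element is the capital letter A with ring above and acute; that is my reading, since the capital letter is the case in which ANGSTROM SIGN contributes an additional partially composed form. All four sequences reduce to the same fully decomposed and fully composed forms.

```python
import unicodedata

representations = [
    "\u01FA",              # LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
    "\u00C5\u0301",        # A WITH RING ABOVE + COMBINING ACUTE ACCENT
    "\u212B\u0301",        # ANGSTROM SIGN + COMBINING ACUTE ACCENT
    "\u0041\u030A\u0301",  # A + COMBINING RING ABOVE + COMBINING ACUTE ACCENT
]

# Collect the distinct results; a single member means all inputs agree.
fully_decomposed = {unicodedata.normalize("NFD", r) for r in representations}
fully_composed = {unicodedata.normalize("NFC", r) for r in representations}
```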
There is more to be explained regarding the relationship between canonical-equivalent sequences, and we will be looking at further details in Sections 7, 9, 10 and 11. Before returning to discuss ways in which Unicode design principles have been compromised, there are some specific points worth mentioning regarding certain characters for vowels in Indic scripts.
Indic scripts have combining vowel marks that can be written above, below, to the left or to the right of the syllable-initial consonant. In many Indic scripts, certain vowel sounds are written using a combination of these marks, as illustrated in Figure 17:
Figure 17. Tamil vowel written before and after consonant
Such combinations are especially typical in representing the sounds “o” and “au” using combinations of vowel marks corresponding to “e”, “i” and “a”.
What is worth noting is that these vowels are not handled in the same way in Unicode for all Indic scripts. In a number of cases, Unicode includes characters for the precomposed vowel combination in addition to the individual vowel signs. This happens for Bengali, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan and Myanmar scripts. Thus, for example, in Tamil one can use U+0BCA TAMIL VOWEL SIGN O or the sequence <U+0BC6 TAMIL VOWEL SIGN E, U+0BBE TAMIL VOWEL SIGN AA>. On the other hand, in Thai and Lao, the corresponding vowel sounds can only be represented using the individual component vowel signs. Thus in Thai, the syllable “kau” can only be represented as <U+0E40 THAI CHARACTER SARA E, U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA>. Then again, in Khmer, only precomposed characters can be used.4 Thus in Khmer, the corresponding vowel is represented using U+17C5 KHMER VOWEL SIGN AU; Unicode does not include characters for both of the component marks, and so there is no alternate representation. If you need to encode data, therefore, for a language that uses an Indic script, pay close attention to how that particular script is supported in Unicode.
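The differences between scripts can be checked directly against the character database. This is a small sketch using Python's unicodedata module with the Tamil, Khmer and Thai examples given above; the variable names are my own.

```python
import unicodedata

# Tamil: the precomposed vowel sign has a canonical decomposition.
tamil_o = unicodedata.decomposition("\u0BCA")   # TAMIL VOWEL SIGN O

# Khmer: the vowel sign has no decomposition, so no alternate representation.
khmer_au = unicodedata.decomposition("\u17C5")  # KHMER VOWEL SIGN AU

# Thai: the component signs stay as they are; nothing composes them.
thai_kau = "\u0E40\u0E01\u0E32"                 # SARA E + KO KAI + SARA AA
thai_unchanged = unicodedata.normalize("NFC", thai_kau) == thai_kau
```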
6.4 Compromises in the distinction between characters and glyphs
Returning to our discussion of compromises in the design principles, the next cases we will look at blur the distinction between characters and glyphs. As explained earlier, Unicode assumes that software will support rendering technologies that are capable of the glyph processing needed to handle selection of contextual forms and ligatures. Such capability was not always available in the past, however. The result is that Unicode includes a number of cases of presentation forms—glyphs that are directly encoded as distinct characters but that are really only rendering variants of other characters or combinations of characters. For example, U+FB01 LATIN SMALL LIGATURE FI encodes a glyph that could otherwise be represented as a sequence of the corresponding Latin characters.
There are five blocks in the compatibility area that primarily encode presentation forms. Two of these are the Arabic Presentation Forms-A block (U+FB50..U+FDFF) and the Arabic Presentation Forms-B block (U+FE70..U+FEFF). These blocks mostly contain characters that correspond to glyphs for Arabic ligatures and contextual shapes that would be found in different connective relationships with other glyphs (initial, medial, final and isolate forms). So, for example, U+FEEC ARABIC LETTER HEH MEDIAL FORM encodes the glyph that would be used to present the Arabic letter heh (nominally, the character U+0647 ARABIC LETTER HEH) when it is connected on either side. Unicode includes 730 such characters for Arabic presentation forms.
Characters like this constitute duplicates of other characters. There is a difference between this situation and that described in Sections 6.1 and 6.2, however. In the case of the singleton duplicates, the two characters were, for all intents and purposes, identical: in principle, they could be exchanged in any situation without any impact on text processing or on users. Likewise for the precomposed characters and the corresponding (fully or partially) decomposed sequences. In the case of the Arabic presentation forms, though, they are equivalent only in certain rendering contexts. For example, U+FEEC could not be considered equivalent to a word-initial occurrence of U+0647. For situations like this, Unicode defines a lesser type of equivalence known as compatibility equivalence: two characters are in some sense duplicates but with some limitations and not necessarily in all contexts. Formally, the compatibility equivalence relationship between two characters is shown in a compatibility decomposition mapping that is part of the Unicode character properties.5 I will discuss the relationship between compatibility equivalence and canonical equivalence further in Section 11.
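Compatibility decomposition mappings can be recognised in the character database by a tag in angle brackets that precedes the mapped codepoints. The sketch below uses Python's unicodedata module; the normalization forms NFC and NFKC are used here only as a convenient way to show that canonical processing leaves a presentation form alone while compatibility processing replaces it.

```python
import unicodedata

heh_medial = unicodedata.decomposition("\uFEEC")   # ARABIC LETTER HEH MEDIAL FORM
fi_ligature = unicodedata.decomposition("\uFB01")  # LATIN SMALL LIGATURE FI

# Canonical normalization keeps the ligature; compatibility normalization
# replaces it with the two ordinary letters.
nfc_result = unicodedata.normalize("NFC", "\uFB01")
nfkc_result = unicodedata.normalize("NFKC", "\uFB01")
```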
In general, use of the characters in the Arabic Presentation Forms blocks is best avoided whenever possible. Note, though, that these blocks contain three characters that are unique characters and are not duplicates of any others: U+FD3E ORNATE LEFT PARENTHESIS, U+FD3F ORNATE RIGHT PARENTHESIS and U+FEFF ZERO WIDTH NO-BREAK SPACE. As we saw earlier, you need to check the details on characters in the Standard before you make assumptions about them.
The other three blocks that contain primarily presentation forms are the Alphabetic Presentation Forms block (U+FB00..U+FB4F), the CJK Compatibility Forms block (U+FE30..U+FE4F), and the Combining Half Marks block (U+FE20..U+FE2F). The first of these contains various Latin, Armenian and Hebrew ligatures, such as the “fi” ligature mentioned above. This block also has wide variants of certain Hebrew letters, which were used in some legacy systems for justification of text, and two Hebrew characters that are no more than font variants of existing characters.6 The CJK Compatibility Forms block contains rotated variants of various punctuation characters for use in vertical text; for example, U+FE35 PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS. Finally, the Combining Half Marks block contains presentation forms representing the left and right halves of glyphs for diacritics that span multiple base characters; for example, U+FE22 COMBINING DOUBLE TILDE LEFT HALF and U+FE23 COMBINING DOUBLE TILDE RIGHT HALF. These are not variants of other characters or sequences of characters. Rather, pairs of these are variants of other characters. Unicode does not formally define any type of equivalence that covers such situations.
Outside the compatibility area at the top of the Basic Multilingual Plane, there are no characters for contextual presentation forms (like the Arabic positional variants), and there are but a handful of ligatures, such as U+0587 ARMENIAN SMALL LIGATURE ECH YIWN, and U+0EDC LAO HO NO.
6.5 Compromises in the distinction between characters and graphemes
The next category of compromise has to do with the distinction between characters and graphemes. In particular, Unicode includes 18 characters for Latin digraphs; for example, U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z and U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING. Most of these are to be found in the Latin Extended-B block (U+0180..U+024F). Others are in the Latin Extended-A block (U+0100..U+017F), and there is one (U+1E9A) in the Latin Extended Additional block (U+1E00..U+1EFF).7
There are also what appear to be four Arabic digraphs at U+0675..U+0678 that make use of a spacing hamza and are reportedly used for Kazakh.
All of these digraphs are duplicates for sequences involving the corresponding pairs of characters. In each case, Unicode designates the digraph character to be a compatibility equivalent to the corresponding sequence.
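The compatibility mappings for the digraphs named above can be read from the character database as follows. This is a sketch only; U+0675 is taken as a representative of the Arabic range mentioned above, and the variable names are hypothetical.

```python
import unicodedata

digraph_mappings = {
    "U+01F2": unicodedata.decomposition("\u01F2"),  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    "U+1E9A": unicodedata.decomposition("\u1E9A"),  # LATIN SMALL LETTER A WITH RIGHT HALF RING
    "U+0675": unicodedata.decomposition("\u0675"),  # first of the Arabic digraphs at U+0675..U+0678
}

# Compatibility normalization replaces the digraph with its component letters.
dz_expanded = unicodedata.normalize("NFKD", "\u01F2")
```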
There are over 16,000 characters defined in Unicode that in one way or another go against the basic design principles of the Standard. The categories described in the preceding discussion cover the bulk of them.
There are a number of other characters with compatibility decompositions that are not as easy to characterise. They do not compromise the design principles in any way not yet covered; they just tend to do so in more complex ways, such as violating many principles at once! It is not essential that a beginner know much about them, but as you advance, you may begin to encounter some of them and might find it helpful to be more familiar with them as a group. For that reason, I have provided an overview of that collection of miscellany in “A review of characters with compatibility decompositions”.8
Of the characters we considered in the preceding discussion, the vast majority—over 13,000—were singleton identical duplicates of other characters, or were precomposed base-plus-diacritic or Hangul combinations. The relationship between these characters and the corresponding character sequences that they duplicate was easy to characterise: they are effectively the same. Unicode refers to the relationship between these duplicates as canonical equivalence.
The other 3,000 characters were somewhat harder to characterise. They were equivalent to the characters that they duplicate, but in a more limited way. Sometimes, this was because they could only be considered equivalent in certain contexts, as in the case of the Arabic contextual forms. Sometimes, it was the opposite: in certain contexts, they could be considered distinct from their nominal counterparts because they might be understood to carry extra information, as in the case of the superscripts. Sometimes, this was because they carried some additional non-textual information, such as layout or formatting, as in the case of the enclosed alphanumeric symbols. In these cases, Unicode refers to the relationship between these duplicates as compatibility equivalence.
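The two kinds of equivalence can be told apart programmatically from the form of the decomposition mapping. The following is a minimal sketch; the function name equivalence_kind and its return labels are hypothetical, and the sample characters for superscripts and enclosed alphanumerics are my own choices.

```python
import unicodedata

def equivalence_kind(ch: str) -> str:
    """Classify a character by the kind of decomposition mapping it has."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):   # a tag such as <super> or <circle>
        return "compatibility"
    return "canonical"

kinds = {
    "KELVIN SIGN": equivalence_kind("\u212A"),
    "SUPERSCRIPT TWO": equivalence_kind("\u00B2"),
    "CIRCLED LATIN CAPITAL LETTER A": equivalence_kind("\u24B6"),
    "LATIN CAPITAL LETTER K": equivalence_kind("K"),
}
```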
The difference between canonical and compatibility equivalence is a significant one that plays out in important ways in the implementation of the Standard. The key to this is in the mechanisms that software uses to equate equivalent characters in practice, a category of processes known as normalization. We will discuss this in Section 10. Before we can do that, however, we need to examine character semantics, and also take a closer look at combining marks. These will be covered in Sections 7 and 9.
7 Character semantics and behaviours
As explained in “Character set encoding basics”, software creates the impression of understanding the behaviours of writing systems by attributing semantic character properties to encoded characters. These properties represent parameters that determine how various text processes treat characters. For example, the SPACE character needs to be handled differently by a line-breaking process than, say, the U+002C COMMA character. Thus, U+0020 SPACE and U+002C COMMA have different properties with respect to line-breaking.
One of the distinctive strengths of Unicode is that the Standard not only defines a set of characters, but also defines a number of semantic properties for those characters. Unicode is different from most other character set encoding standards in this regard. In particular, this is one of the key points of difference between Unicode and ISO/IEC 10646.
In addition to the semantic properties, Unicode also provides reference algorithms for certain complex processes for which the correct implementation may not be self-evident. In this way, the Standard is not only defining semantic properties for characters, but is also guiding how semantics should be interpreted. This has an important benefit for users in that it leads to more consistent behaviour between software implementations. There is also a benefit for software developers who are suddenly faced with supporting a wide variety of languages and writing systems: they are provided with important information regarding how characters in unfamiliar scripts behave.
7.1 Where the character properties are listed and described
The character properties and behaviours are listed and explained in various places, including the printed book, some of the technical reports and annexes, and in online data files. An obvious starting point for information on character properties is Chapter 4 of TUS 3.0. That chapter describes or lists some of the properties directly, and otherwise indicates the place in which many other character properties are covered. Note that that chapter is valid in relation to Version 3.0, and that some changes with regard to character properties were made in TUS 3.1. Those changes are described in Article III of UAX #27. Even together, though, those two sources do not provide a reference that covers all character properties.
While some properties are listed in the book, the complete listing of character properties is given in the data files that comprise the Unicode Character Database (UCD). These are included on a CD-ROM with the printed book, but the current versions are to be found online at http://www.unicode.org/Public/UNIDATA/.9 A complete reference regarding the files that make up the UCD and the place in which each is described is provided in the document at http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html.10
The original file that contained the character properties is UnicodeData.txt. This is considered the main file in the UCD and is one that is perhaps most commonly referred to. It is just one of several, however. This file was first included as part of the Standard in Version 1.1.5. Like all of the data files, it is a machine readable text database. As new properties were defined, additional files were created. This was done rather than adding new fields to the original file in order to remain compatible with software implementations designed to read that file. There are now 27 data files that comprise the UCD for Version 3.1.
7.2 Normative versus informative properties and behaviours
Unicode distinguishes between normative properties and informative properties. Software that conforms to Unicode is required to observe all normative properties. The informative properties are provided as a guide, and it is recommended that software developers generally follow them, but implementations may override those properties while still conforming to the Standard. All of the properties are defined as part of the Standard, but only the normative properties are required to be followed.
One reason for this distinction is that some properties are provided for documentation purposes and have no particular relevance for implementations. For example, the Unicode 1.0 Name property is provided for the benefit of implementations that may have been based on that version.11 Another similar property is the 10646 comment property, which records notes from that standard consisting primarily of alternate character names and the names of languages that use that character. Another example is the informative notes and cross-references in NamesList.txt (discussed in Section 3.3), which provide supplementary information about characters that is helpful for readers in identifying and distinguishing between characters.
Another reason for the distinction is that it may not be considered necessary to require certain properties to be applied consistently across all implementations. This would apply, for instance, in the case of a property that is not considered important enough to invest the effort in giving careful thought to the definition and assignment of the property for all characters. For example, the kMainlandTelegraph property12 identifies the telegraph code used in the People’s Republic of China that corresponds to a given unified Han character. This property may be valuable in some contexts, but is not likely to be something that is felt to need normative treatment within Unicode.
For some properties, it may also be considered inappropriate to impose particular requirements on conformant implementations. This might be the case if it is felt that a given category of processes is not yet well enough understood to specify what the normative behaviour for those processes should be. For example, Line Breaking properties13 are defined in Unicode for each character, and they reflect what are considered to be best practices to the extent that line breaking is understood. The treatment of line breaking properties is thought to be valid for many situations, but is not considered to cover all aspects of line breaking behaviour. Current knowledge regarding line breaking is not sufficient to support a full normative specification that becomes a conformance requirement on software. As a result, some line breaking properties are normative (for instance, a break is always allowed after a ZERO WIDTH SPACE and is always mandatory after a CARRIAGE RETURN), but most line breaking properties are informative and can be modified in a given implementation.
It is also inappropriate to impose normative requirements in the case of properties for which the status of some characters is controversial or simply unclear. For example, a set of case-related properties exists that specifies the effective case of letter-like symbols. For instance, U+24B6 CIRCLED LATIN CAPITAL LETTER A is assigned the Other_Uppercase property. For some symbols, however, it is unclear what effective case they should be considered to have. The character U+2121 TELEPHONE SIGN, for instance, has been realised using a variety of glyphs. Some glyphs use small caps, while others use all caps. It is difficult to have any confidence in making a property such as Other_Uppercase normative when the status of some characters in relation to that property is unclear.
The normative and informative distinction applies to the specification of behaviours as well as to character properties. For example, UAX #9 (Davis 2001a) describes behaviour with regard to the layout of bi-directional text (the “bi-directional algorithm”) and that behaviour is a normative part of the Standard. Likewise, the properties in ArabicShaping.txt (described in Section 8.2 of TUS 3.0) that describe the cursive shaping behaviour of Arabic and Syriac characters are also normative. On the other hand, UAX #14 (Freytag 2000) describes an algorithm for processing the line breaking properties, but that algorithm is not normative.14 Similarly, Section 5.13 of TUS 3.0 discusses the handling of non-spacing combining marks in relation to processes such as keyboard input, but the guidelines it presents are informative only.
There is one other point that is important to note in relation to the distinction between normative and informative properties: the fact that a property is normative does not imply that it can never change. Likewise, the fact that a property is informative does not imply that it is open to change. For example, the Unicode 1.0 Name property is informative but is not subject to change. On the other hand, several changes were made in TUS 3.0 to the Bi-directional Category, Canonical Combining Class and other normative properties in order to correct errors and refine the specification of behaviours. As a rule, though, changes to normative properties are avoided, and some informative properties can be more readily changed.
It is also true that some normative properties are not subject to change. In particular, the Character name property is not permitted to change, even in cases in which, after the version of the Standard in which a character is introduced is published, the name is found to be inappropriate. The reason for this is that the sole purpose of the character name is to serve as a unique and fixed identifier.
7.3 A summary of significant properties
The properties that are probably most significant are those found in the main character database file, UnicodeData.txt. I will briefly describe the properties in this file here.
The format for the UnicodeData.txt file is described in the document at http://www.unicode.org/Public/UNIDATA/UnicodeData.html. UnicodeData.txt is a semicolon-delimited text database with one record (i.e. one line) per character. There are 15 fields in each record, each field corresponding to a different property. All of the properties in this file apart from the Unicode 1.0 Name and 10646 comment (described above) are normative.
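A record in this format can be split into its 15 fields with a few lines of Python. This is a sketch under stated assumptions: the class name UnicodeDataRecord and the field labels are my own descriptive names rather than identifiers defined by the Standard, and the sample line is the record for U+1EAD as I understand it to appear in the file.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    # Hypothetical labels for the 15 semicolon-delimited fields, in file order.
    codepoint: str
    name: str
    general_category: str
    canonical_combining_class: str
    bidi_category: str
    decomposition_mapping: str
    decimal_digit_value: str
    digit_value: str
    numeric_value: str
    mirrored: str
    unicode_1_name: str
    iso10646_comment: str
    uppercase_mapping: str
    lowercase_mapping: str
    titlecase_mapping: str

def parse_record(line: str) -> UnicodeDataRecord:
    """Split one line of UnicodeData.txt into its fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, found {len(fields)}")
    return UnicodeDataRecord(*fields)

sample = ("1EAD;LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW;"
          "Ll;0;L;1EA1 0302;;;;N;;;1EAC;;1EAC")
record = parse_record(sample)
```

Note that the decomposition field of this record maps to a partially composed sequence (U+1EA1 followed by U+0302); obtaining the fully decomposed sequence requires applying the mappings recursively.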
The first field contains the codepoint for a given character, the significance of which is obvious. The next field contains the character name, which provides a unique identifier in addition to the codepoint. The character name has one advantage over the codepoint as an identifier: while the codepoint applies only to Unicode, the character name may be constant across a number of different character set standards, which facilitates comparisons between standards.
The next field contains the General Category properties. These categorise all of the characters into a number of useful character types: letter, combining mark, number, punctuation, symbol, control, separator and other. Each of these is further divided into subcategories. Thus, for example, letters are designated as uppercase, lowercase or titlecase. Each of these subcategories is indicated in the data file using a two-letter abbreviation. The complete list of general category properties and their abbreviations is given in Table 5:
Table 5. General category properties and their abbreviations
This set of properties forms a partition of the Unicode character set; that is, every character is assigned exactly one of these general category properties.
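As a rough illustration of the two-letter abbreviations, the sketch below uses Python's standard `unicodedata` module to look up the general category of three characters. Note that this module reflects whichever version of the Unicode character database the interpreter was built with, not TUS 3.0, so the values for some characters may differ from the data described in this article; the three characters chosen here are assumed to be stable.

```python
import unicodedata

# Look up the two-letter General Category abbreviation for a few characters.
samples = ["(", "1", "a"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    # Print the codepoint, the character name and the category abbreviation.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Because the categories form a partition, `unicodedata.category` returns exactly one abbreviation for any character passed to it.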
Space does not permit a detailed description of all of these properties. General information can be found in Section 4.5 of TUS 3.0. Some of these properties are not discussed in detail in the Standard using these explicit names, so information may be difficult to find. For some of the properties, it may be more likely to find information about individual characters than about the groups of characters as a whole. Many of these categories are significant in relation to certain behaviours, though. Several are discussed in Chapter 5 of TUS 3.0. Many of them are particularly relevant in relation to line breaking behaviour, described in UAX #14 (Freytag 2000).
The control, format and other special characters are discussed in Chapter 13 of TUS 3.0. Numbers are described in Chapter 12 and in most of the various sections covering different scripts in Chapters 7–11. Punctuation and spaces are discussed in Chapter 6 of TUS 3.0. Symbols are the topic of Chapter 12 of TUS 3.0. Line and paragraph separators are covered in UAX #13 (Davis 2001b).
It will be worth describing letters and case in a little more detail, and I will do so after finishing this general survey of character properties. I will also discuss combining marks in some detail in Section 9.
Returning to our discussion of the fields in the UnicodeData.txt database file, the fourth, fifth and sixth fields contain particularly important properties: the Canonical Combining Classes, Bi-directional Category and Decomposition Mapping properties. Together with the general category properties, these three properties are the most important character properties defined in Unicode. Accordingly, each of these will be given additional discussion. The canonical combining classes are relevant only for combining marks (characters with general category properties of Mn, Mc and Me), and will be described in more detail in Section 9. The bi-directional categories are used in relation to the bi-directional algorithm, which is specified in UAX #9 (Davis 2001a). I will provide a brief outline of this in Section 8.1. Finally, the character decomposition mappings specify canonical and compatibility equivalence relationships. I will discuss this further in Section 7.5.
Most of the next six fields contain properties that are of more limited significance. Fields seven to nine relate to the numeric value of numbers (characters with general category properties Nd, Nl and No). These are covered in Section 4.6 of TUS 3.0. The tenth field contains the Mirrored property, which is important for right-to-left scripts, and is described in Section 4.7 of TUS 3.0 and also in UAX #9 (the bi-directional algorithm). I will say more about it in Section 8.1. Fields eleven and twelve contain the Unicode 1.0 Name and 10646 comment properties.
The last three fields contain case mapping properties: Uppercase Mapping, Lowercase Mapping and Titlecase Mapping. These are considered further in the next section.
As mentioned earlier, UnicodeData.txt is a semicolon-delimited text database. Now that I have described each of the fields, let me provide some examples:
0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0407;CYRILLIC CAPITAL LETTER YI;Lu;0;L;0406 0308;;;;N;;Ukrainian;;0457;
Figure 18. Four entries from UnicodeData.txt
Note that not every field necessarily contains a value. For example, there is no uppercase mapping property for U+0028. Every entry in this file contains values, however, for the following fields: codepoint, character name, general category, canonical combining class, bi-directional category, and mirrored.
Looking at the first of these entries, we see that U+0028 has a general category of “Ps” (opening punctuation—see Table 5), a canonical combining class of “0”, a bi-directional category of “ON”, and a mirrored property of “Y”. It also has a Unicode 1.0 name of “OPENING PARENTHESIS”.
The entry for U+0031 shows a general category of “Nd” (decimal digit number), a combining class of “0”, a bi-directional category of “EN”, and a mirrored property of “N”. Since this character is a number, fields seven to nine (having to do with numeric values) contain values, each of them “1”.
The character U+0061 has a general category of “Ll” (lowercase letter), a combining class of “0”, a bi-directional category of “L”, and a mirrored property of “N”. It also has uppercase and titlecase mappings to U+0041.
Finally, looking at the last entry, we see that U+0407 has a general category of “Lu” (uppercase letter), a combining class of 0, a bi-directional category of “L”, and a mirrored property of “N”. It also has a canonical decomposition mapping to < U+0406, U+0308 >, a 10646 comment of “Ukrainian”, and a lowercase mapping to U+0457.
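The field layout just described can be sketched in code. The following is a minimal Python example, not a reference implementation, that splits one record into its 15 fields. The class name `UnicodeDataRecord`, its field names and the function `parse_record` are hypothetical names chosen for illustration; only the field order comes from the description above.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    """One line of UnicodeData.txt; field names here are illustrative."""
    codepoint: int
    name: str
    general_category: str
    combining_class: int
    bidi_category: str
    decomposition: str
    decimal_digit: str
    digit: str
    numeric: str
    mirrored: bool
    unicode_1_name: str
    iso10646_comment: str
    uppercase: str
    lowercase: str
    titlecase: str


def parse_record(line: str) -> UnicodeDataRecord:
    # Fields are separated by semicolons; empty fields are kept as "".
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        int(fields[0], 16), fields[1], fields[2], int(fields[3]), fields[4],
        fields[5], fields[6], fields[7], fields[8], fields[9] == "Y",
        fields[10], fields[11], fields[12], fields[13], fields[14],
    )


paren = parse_record("0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;")
yi = parse_record(
    "0407;CYRILLIC CAPITAL LETTER YI;Lu;0;L;0406 0308;;;;N;;Ukrainian;;0457;"
)
print(paren.general_category, paren.mirrored)  # category and mirrored flag
print(yi.decomposition, yi.lowercase)          # decomposition and lowercase mapping
```

Mapping fields are kept as strings in this sketch because they may be empty or, in the case of the decomposition field, may hold several codepoints.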
I have described the main file in the Unicode character database, UnicodeData.txt, in some detail. There are a number of other files listing character properties in the Unicode character database. Some of the more significant files have been mentioned in earlier sections. Of the rest, it would be beyond the scope of an introduction to explain every one, and all of them are described in Davis and Whistler (2001b). I will be giving more details on those that are most significant in the sections that follow. In particular, additional properties related to case are discussed in Section 7.4 together with a fuller discussion of the case-related properties mentioned in this section; and the properties listed in the ArabicShaping.txt and BidiMirroring.txt files will be described in Section 8.1, together with further details on the mirrored property mentioned here.
7.4 Uppercase, lowercase, titlecase and case mappings
Case is an important property for characters in Latin, Greek, Cyrillic, Armenian and Georgian scripts.15 For these scripts, both upper- and lowercase characters are encoded. Because some Latin and Greek digraphs were included in Unicode, it was necessary to add additional case forms to deal with the situation in which a string has an initial uppercase character. Thus, for these digraphs there are upper-, lower- and titlecase characters; for example, U+01CA LATIN CAPITAL LETTER NJ, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, and U+01CC LATIN SMALL LETTER NJ. Likewise, there are properties giving uppercase, lowercase and titlecase mappings for characters. Thus, U+01CA has a lowercase mapping of U+01CC and a titlecase mapping of U+01CB.
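The three-way mapping for the NJ digraph can be observed with Python's built-in string methods, which are assumed here to follow the case mappings in the Unicode character database bundled with the interpreter. This is only a sketch; the constant names are illustrative.

```python
import unicodedata

UPPER_NJ = "\u01CA"  # LATIN CAPITAL LETTER NJ
TITLE_NJ = "\u01CB"  # LATIN CAPITAL LETTER N WITH SMALL LETTER J
LOWER_NJ = "\u01CC"  # LATIN SMALL LETTER NJ

lowered = UPPER_NJ.lower()   # lowercase mapping of U+01CA
titled = UPPER_NJ.title()    # titlecase mapping of U+01CA
uppered = LOWER_NJ.upper()   # uppercase mapping of U+01CC

# The titlecase form carries its own general category, "Lt".
title_category = unicodedata.category(TITLE_NJ)

for label, ch in [("lower", lowered), ("title", titled), ("upper", uppered)]:
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```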
Case has been indicated in Unicode by means of the general category properties “Ll”, “Lu” and “Lt”. These have always been normative character properties. Prior to TUS 3.1, however, case mappings were always informative properties. The reason was that, for some characters, case mappings are not constant across all languages. For example, U+0069 LATIN SMALL LETTER I is always lowercase, no matter what writing system it is used for, but not all writing systems consider the corresponding uppercase character to be U+0049 LATIN CAPITAL LETTER I. In Turkish and Azeri, for instance, the uppercase equivalent to “i” is U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE. The exceptional cases such as Turkish and Azeri were handled by special casing properties that were listed in a file created for that purpose: SpecialCasing.txt.
The SpecialCasing.txt file was also used to handle other special situations, in particular situations in which a case mapping for a character is not one-to-one. These typically involve encoded ligatures or precomposed character combinations for which the corresponding uppercase equivalent is not encoded. For example, U+01F0 LATIN SMALL LETTER J WITH CARON is a lowercase character, and its uppercase pair is not encoded in Unicode as a precomposed character. Thus, the uppercase mapping for U+01F0 must map to a combining character sequence, <U+004A LATIN CAPITAL LETTER J, U+030C COMBINING CARON>. This mapping is given in SpecialCasing.txt.
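A one-to-many mapping of this kind can be seen in Python 3, whose `str.upper()` is assumed here to apply the unconditional full case mappings. The sketch also shows that this method takes no language argument, so the Turkish and Azeri behaviour described above is not what it produces.

```python
small_j_caron = "\u01F0"  # LATIN SMALL LETTER J WITH CARON

# Uppercasing yields a two-character combining sequence, not one character.
upper = small_j_caron.upper()
upper_codepoints = [f"U+{ord(c):04X}" for c in upper]
print(upper_codepoints)

# str.upper() is not language-sensitive: "i" maps to U+0049, never U+0130.
plain_upper_i = "i".upper()
print(f"U+{ord(plain_upper_i):04X}")
```

An application that needs Turkish or Azeri casing would have to supply the language-specific rules itself or rely on a library that does so.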
Note that not all characters with case are given case mappings. For example, U+207F SUPERSCRIPT LATIN SMALL LETTER N is a lowercase character (indicated by the general category “Ll”), but it is not given uppercase or titlecase mappings in either UnicodeData.txt or in SpecialCasing.txt. This is true for a number of other characters as well. Several of them are characters with compatibility decompositions, like U+207F, but many are not. In particular the characters in the IPA Extensions block are all considered lowercase characters, but many do not have uppercase counterparts.
In TUS 3.1, case mappings have been changed from being informative to normative. The reason for the change was that it was recognised that case mapping was a significant issue for a number of processes and that it was not really satisfactory to have all case mappings be informative. Thus, the mappings given in UnicodeData.txt are now normative properties. Special casing situations are still specified in SpecialCasing.txt, but this file is now considered normative as well.
Note that UnicodeData.txt indicates the case of characters by means of the general categories “Ll”, “Lu” and “Lt”, and not by case properties that are independent of the “letter” category. There are instances, however, in which a character that does not have one of these properties should be treated as having case or as having case mappings. This applies, for example, to U+24D0 CIRCLED LATIN SMALL LETTER A, which has an uppercase mapping of U+24B6 CIRCLED LATIN CAPITAL LETTER A but does not have a case property since it has a general category of “So” (other symbol). As an accident of history in the way that the general category was developed, case was applied to characters that are categorised as letters, but not to characters categorised as symbols. This made case incompatible with being categorised as a symbol, even though case properties should be logically independent of the letter/symbol categories.
To deal with such situations, extended properties Other_Lowercase and Other_Uppercase have been defined. These are two of various extended properties that are listed in the file PropList.txt (described in PropList.htm). Furthermore, derived properties Lowercase and Uppercase have been defined (described in DerivedProperties.htm) that combine characters with the general categories “Ll” and “Lu” together with characters that have the Other_Lowercase and Other_Uppercase properties. Thus, the lowercase property (as listed in DerivedCoreProperties.txt) can be thought of as identifying all characters that can be deemed to be lowercase, regardless of their general category. This may be useful for certain types of processing. Note, however, that these extended and derived case-related properties are as yet only informative, not normative.
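The mismatch between general category and case can be checked for U+24D0 with Python's standard library. This is a sketch against whichever Unicode version the interpreter bundles. Whether `str.islower()` follows the derived Lowercase property or only the general category is an implementation detail, so its result is printed here rather than assumed.

```python
import unicodedata

circled_a = "\u24D0"  # CIRCLED LATIN SMALL LETTER A

# The general category is a symbol category, not a letter category.
category = unicodedata.category(circled_a)

# The character nevertheless has an uppercase mapping.
upper = circled_a.upper()

print(category, f"U+{ord(upper):04X}", unicodedata.name(upper))
print("islower():", circled_a.islower())  # implementation-dependent result
```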
For some types of processing, there may be additional issues related to case that need to be considered. See UTR #21 (Davis 2001d) for further discussion of case-related issues.
7.5 Character decomposition mappings
The notions of canonical and compatibility equivalence were introduced in Section 6. There, we saw cases in which a Unicode character is identical in meaning to a decomposed sequence of one or more characters, and that these two representations are said to be canonically equivalent. We also considered other cases in which a Unicode character duplicates a sequence of one or more characters in some more limited sense: the two representations are equivalent in certain contexts only, or the one character is equivalent to the sequence when supplemented with certain non-textual information. In these cases, the two representations are said to be compatibility equivalent.
For both types of equivalence, the relationship between two encoded representations is formally expressed by means of a decomposition mapping. These mappings are given in the sixth field of UnicodeData.txt. So, for example, the character U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW is canonically equivalent to the combining sequence <U+0061 LATIN SMALL LETTER A, U+0323 COMBINING DOT BELOW>. This relationship is indicated by the decomposition mapping given in the entry in UnicodeData.txt for U+1EA1:
1EA1;LATIN SMALL LETTER A WITH DOT BELOW;Ll;0;L;0061 0323;;;;N;;;1EA0;;1EA0
Figure 19. Decomposition mapping in UnicodeData.txt entry for U+1EA1
The same field in UnicodeData.txt is used to list both canonical and compatibility decomposition mappings. The two types of equivalence do need to be distinguished, however. As noted, characters with compatibility decompositions typically have some additional non-character element of meaning or some specific contextual associations. Accordingly, when giving a decomposition mapping for such a character, it makes sense also to indicate what this additional element of meaning is. This is precisely what is done: in compatibility decomposition mappings, the mapping includes the decomposed character sequence as well as a tag that indicates the additional non-character information contained in the compatibility character. This is illustrated in Figure 20:
02B0;MODIFIER LETTER SMALL H;Lm;0;L;<super> 0068;;;;N;;;;;
2110;SCRIPT CAPITAL I;Lu;0;L;<font> 0049;;;;N;SCRIPT I;;;;
FB54;ARABIC LETTER BEEH INITIAL FORM;Lo;0;AL;<initial> 067B;;;;N;;;;;
FF42;FULLWIDTH LATIN SMALL LETTER B;Ll;0;L;<wide> 0062;;;;N;;;FF22;;FF22
Figure 20. Sample entries from UnicodeData.txt with compatibility decomposition mappings
This additional tag in the decomposition mappings is what distinguishes between canonical equivalence relationships and compatibility equivalence relationships.16
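The distinction can be sketched as a small parser for the sixth field. The function names `parse_decomposition` and `equivalence_type` below are hypothetical and chosen for illustration; the only rule taken from the text is that a leading tag in angle brackets marks a compatibility mapping.

```python
from typing import List, Optional, Tuple


def parse_decomposition(field: str) -> Tuple[Optional[str], List[int]]:
    """Split a decomposition field into (tag or None, list of codepoints)."""
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]  # strip the angle brackets
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


def equivalence_type(field: str) -> str:
    """Classify a decomposition field by the presence of a tag."""
    if not field.strip():
        return "none"
    tag, _ = parse_decomposition(field)
    return "canonical" if tag is None else "compatibility"


print(parse_decomposition("0061 0323"))      # mapping for U+1EA1
print(parse_decomposition("<super> 0068"))   # mapping for U+02B0
print(equivalence_type("<initial> 067B"))    # mapping for U+FB54
```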
There are a total of 16 tags used to indicate compatibility decompositions. A brief description of each is given in Table 6:
Table 6. Tags used in compatibility decomposition mappings
The following table gives examples showing the use of each of these tags:
Table 7. Examples of different types of compatibility decomposition mappings
Note that the <compat> tag is used for a variety of characters that stand in one of several types of relationship to their corresponding decomposed counterparts. For example, the ligature presentation forms described in Section 6.4 and the digraphs described in Section 6.5 use this tag. It is also used for several of the types of compatibility decomposition described in “A review of characters with compatibility decompositions”.
It was noted in Section 6 that some cases of equivalence involve one-to-one relationships, for example in the case of the exact character duplicates discussed in Section 6.1. Hence, not all of the decomposition mappings contain sequences of two or more characters, as illustrated in Figure 21:
212A;KELVIN SIGN;Lu;0;L;004B;;;;N;DEGREES KELVIN;;;006B;
Figure 21. Single-character decomposition mapping in the entry for U+212A
Compatibility decomposition mappings always give the completely decomposed representation. This is illustrated by the entry for U+3315:
3315;SQUARE KIROGURAMU;So;0;L;<square> 30AD 30ED 30B0 30E9 30E0;;;;N;SQUARED KIROGURAMU;;;;
Figure 22. Multiple-character compatibility decomposition mapping
As a consequence of this, it should be noted that an instance of compatibility equivalence always involves exactly two representations: the compatibility character and the corresponding decomposed representation given in the mapping.
For canonical decompositions, the decomposition mapping lists a sequence of one or two characters, but never more than two. In the case of precomposed characters that involve multiple diacritic characters, this generally means that the precomposed character decomposes into a partially decomposed sequence. If a fully decomposed sequence is needed, further decomposition mappings can be applied to the partially decomposed sequences.17 So, for example, U+1FA7 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI has a decomposition involving two characters, one of which is U+1F67 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI, which in turn has a decomposition of two characters, and so on.
1FA7;GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI;Ll;0;L;1F67 0345;;;;N;;;1FAF;;1FAF
1F67;GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI;Ll;0;L;1F61 0342;;;;N;;;1F6F;;1F6F
1F61;GREEK SMALL LETTER OMEGA WITH DASIA;Ll;0;L;03C9 0314;;;;N;;;1F69;;1F69
Figure 23. Decomposition of precomposed characters with multiple diacritics
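The repeated application of two-character mappings can be sketched with Python's standard `unicodedata` module, whose `decomposition` function returns the raw mapping field as a string. The helper name `full_canonical_decomposition` is hypothetical. The sketch deliberately ignores the reordering of combining marks, so it should not be taken as a normalization implementation; it merely happens to agree with the library's NFD result for this character.

```python
import unicodedata


def full_canonical_decomposition(ch: str) -> str:
    """Recursively apply canonical mappings; compatibility mappings are skipped."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No mapping, or a tagged (compatibility) mapping: leave unchanged.
        return ch
    return "".join(
        full_canonical_decomposition(chr(int(cp, 16))) for cp in mapping.split()
    )


omega = "\u1FA7"
decomposed = full_canonical_decomposition(omega)
print([f"U+{ord(c):04X}" for c in decomposed])
```

Each recursive step peels off one diacritic, mirroring the chain of three entries in the figure above.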
In principle, it would be possible for Unicode to include a precomposed character involving multiple diacritics but not to include any precomposed character involving some subset of those diacritics. In such a situation, the character would decompose directly into the fully decomposed combining sequence, meaning that the decomposition mapping includes more than two characters. In practice, though, this situation does not occur in the Unicode character set. Furthermore, the Unicode Consortium has given an explicit guarantee that canonical decompositions will never involve mapping to more than two characters.18
As explained in Section 6.3, the fact that a precomposed character with multiple diacritics has a decomposition involving partially composed forms means that there may be several canonically equivalent representations for a given text element (as illustrated in Figure 16). Canonical equivalence is a transitive relationship, and so we intuitively know that all of the representations are canonically equivalent. Software processes cannot rely on such intuition, however. Software requires explicit and efficient algorithms that let it know that all of these representations are canonically equivalent. This is accomplished by means of normalization, which will be discussed in Section 10.
8 The bi-directional algorithm and other rendering behaviours
As has been mentioned, the Unicode character set assumes a design principle of encoding characters but not glyphs. This in turn implies an assumption that applications that use Unicode will incorporate rendering processes that can do the glyph processing required to make text appear the way it should, in accordance with the rules of each given script and writing system. Unicode does not specify in complete detail how the rendering process should be done. There are a number of approaches that an application might utilise to deal with the many details involved, and it is outside the scope of the Unicode Standard to stipulate how such processing should be done.
This does not mean that Unicode has nothing to say regarding the rendering process, however. There are some key issues that pertain to the more complex aspects of rendering that require common implementation in order to ensure consistency across all implementations. For example, under certain conditions consonant sequences in Devanagari script can be represented as ligatures,19 though in certain circumstances the sequence can also be rendered with the first consonant in a reduced “half” form instead. It is considered necessary that this distinction be reflected in plain text, and thus in the encoding of data. It is important that different implementations employ the same mechanisms for controlling this, and so the encoding mechanism is defined as a normative part of the Unicode Standard. One implication of this is that Unicode specifies how that aspect of the rendering of Devanagari script is to be done.
There are three major areas in which Unicode specifies the rendering behaviour of characters. The control of Indic conjuncts and half-forms is one of these. A second has to do with the connecting behaviour of cursive scripts, specifically Arabic and Syriac. The third has to do with bi-directional text, meaning horizontal text with mixed left-to-right and right-to-left line directions. I will give a brief overview of each of these here, beginning with bi-directional text and the bi-directional algorithm.
Those three areas are of particular significance because they affect important scripts in rather pervasive ways; one simply cannot render Arabic script, for example, without addressing the issues of bi-directionality and cursive connection. There are some other less significant rendering-related issues that Unicode addresses. These will also be described briefly.
8.1 Bi-directional text and the bi-directional algorithm
There are two issues that need to be dealt with in relation to bi-directional text. The bigger issue is the mixing of text with different directionality. The other has to do with certain characters that require glyphs for right-to-left contexts that are mirror images of those required for left-to-right contexts. Both of these issues are dealt with in Unicode by the Unicode bi-directional algorithm.
[Section not yet completed]
8.2 Cursive connections
Certain scripts, such as Arabic and Syriac, have a distinctive characteristic of being strictly cursive: they are never written in non-cursive styles. The implication is that characters typically have at least four shapes corresponding to initial, medial, final and isolated (non-connected) contexts.20 Not all characters connect in all contexts, however. Thus, some characters connect to characters on both sides, others connect only on one side, and others never connect. Ultimately, the joining behaviour of a character depends upon the properties of that character as well as the properties of the characters adjacent to it.
Another behaviour that is closely associated with cursive writing is ligature formation. Both Arabic and Syriac scripts make use of a number of ligatures, many of which are optional, though some are obligatory in all uses of those scripts.21