|
|
NRSI: Computers & Writing Systems You are here: Encoding > Unicode A review of characters with compatibility decompositions
Note: This is an Appendix to “Understanding Unicode™”. In Section 6 of “Understanding Unicode™”, I provided an overview of some main groups of characters that have canonical or compatibility decompositions. That discussion covered all of the characters in Unicode that have canonical decompositions. Among those with compatibility decompositions, we looked at presentation forms that would otherwise be handled as glyphs by complex rendering technology (as described in Section 5.4 of “Understanding Unicode™”), and combinations that correspond to digraphs in certain writing systems. Those two groupings only covered a portion of the characters with compatibility decompositions, however. There are over 2,300 others that were not covered. Many of these might possibly be construed as ligatures or as multi-graphs, except for the fact that they generally appear to be not as simple to explain as the cases described in Section 6 of “Understanding Unicode™”. Whereas the characters described in Section 6 are compromising primarily just one or another of the Unicode design principles, many of these remaining characters compromise several principles at once. The only way to classify them is into ad hoc groups. For many users, most of these characters are likely to be unimportant. Some of them can be rather puzzling for someone who is still early on in learning about Unicode, however. It is also easier to find information on well-behaved characters than on all the oddities. In a number of cases, the notes printed in the name lists that accompany the code charts give you what you need to understand how or whether to use a given compatibility-decomposable character. More often, though, the explanation is buried in the text of the Standard within the descriptions of the script blocks (Chapters 6–13). For a number of characters, no background information is provided at all. This discussion is by no means a thorough explanation of all of these characters or a substitute for the content in the Standard. Everything I say here is mentioned somewhere in the Standard. The Standard is large, though, and many readers may not notice the details because they are overwhelmed by the size. This appendix is intended, therefore, to provide an introduction to this set of characters, which constitute perhaps the least principled elements of the Standard. Note that this Appendix is not intended for readers who are still at the beginner stage. There are a number of detailed references to characters and blocks within the character set that a beginner might find confusing. If you have just read “Understanding Unicode™” for the first time, you might want to just skim this Appendix. If you do want to follow it closely, I recommend that you do so with the printed or online code charts at your side. A large portion of these miscellaneous compatibility-decomposable characters are found in a limited number of blocks within the Standard. The following table shows their locations broken down by major areas within the character set. In order to reflect that most of these characters are concentrated within a few small ranges, the table also shows the number of blocks within each region that contain these characters, the contiguous range of those blocks, and the ratio of the number of these characters in that range to the total number of characters in that range.
Table 1. Characters with miscellaneous compatibility decompositions by area of the Unicode character set Note that these numbers do not include the presentation forms and digraphs discussed in Sections 6.4 and 6.5 of “Understanding Unicode™”. The only significant impact on Table 1 of including those additional characters would have been to bring the ratio quoted for the Compatibility area to nearly 80%. A significant number of these characters have one thing in common: they come from legacy character set standards defined for East Asia. Typically, those standards were designed for use with encoding systems that were capable of supporting very large character sets, rather larger than the number of characters commonly used for the writing systems of that region. This provided a luxury of being able to encode a wide variety of forms as individual characters, including things that otherwise would require non-trivial formatting and layout controls. Presumably, this was done because it was much easier to encode some complex combination as a single character and use it with software that had few capabilities with regard to formatting and layout control (especially before the days of operating systems that provide sophisticated graphic device support) rather than to develop more sophisticated software systems. For example, there are 88 characters in the CJK Compatibility block (U+3300..U+33FF) that represent words spelled in Katakana and arranged in a square so as to fill a single em-square display cell. For instance, the character U+3315 The Enclosed Alphanumerics (U+2460..U+24FF) and Enclosed CJK Letters and Months (U+3200..U+32FF) blocks contain another 340 characters that are similar in nature.1 These include combinations for months of the year (like the series for hours and days mentioned above), and over 300 characters for things like numbers and letters enclosed within circles, within parentheses, or punctuated with a full stop. For example, U+2499, The 36 Roman numerals in the Number Forms block (U+2150..U+218F) also came from East Asian standards. Four of them, such as U+2181 There are another handful of characters representing special combinations taken from East Asian standards2 in the Letterlike Symbols block (U+2100..U+214F), such as U+2100 All of these 630 or so East Asian characters are compatibility equivalents to other character sequences, most with additional layout or formatting applied. These all go against several of Unicode’s design principles at once. They blur the distinctions between characters and glyphs, between characters and graphemes or higher-level linguistic units, and between plain text and rich text or higher-level layout-control protocols. They violate the principle of unification, and many of them very much fly in the face of the basic notion that text is encoded in terms of dynamically composed sequences of characters rather than encoding entire strings as a single unit. Nevertheless, they were considered expedient for certain systems in the past and so made their way into national standards, and as a result have found their way into Unicode today. Unless you are working with East Asian legacy data or systems, however, you would not be likely to have any need for them. Certainly someone writing technical documents in English should not consider using a character like U+3392 The characters described above were compatibility equivalents to sequences of multiple characters. There are a number of other characters from East Asian standards that are singleton variants of other characters, but still have special additional characteristics, such as formatting. One large group are the characters in the Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF).3 Characters from Chinese, Japanese and Korean are generally wider than Latin characters, and they also tend to be largely consistent in width. East Asian systems typically implement Latin characters using glyphs that are half the width of East Asian characters. In addition, they generally also include a set of Latin characters that are the full width of the Asian characters, and also a set of Katakana and Hangul characters that are half their normal width to supplement the narrow Latin letters. Early systems used a mixture of double-byte and single-byte character set encodings, and associated the full-width glyphs with the double-byte characters. Similarly, they associated the half-width glyphs with the single-byte characters. Later encoding standards made it possible to merge the character sets, and this was done with the width variations retained. This resulted in distinct full- and half-width variants of a number of characters in those standards, and these have been carried into Unicode. The Halfwidth and Fullwidth Forms block contains wide variants of Latin characters, such as U+FF29 The Small Form Variants block (U+FE50..U+FE6F) is related the half-width and full-width characters. This block contain characters that represent narrow variants of punctuation set within full-width cells, such as U+FE57 The Kanbun block (U+3190..U+319F) is another small set of East Asian singleton variant characters. These are superscripted marks used to indicate Japanese reading order of Classical Chinese texts. Two of these characters are unique, but the other 14 are variants of other existing characters. These were not added for reasons of backward compatibility, but they are considered compatibility equivalents of the corresponding Han characters. There are three other major blocks of Asian characters to be noted. Two of these, the CJK Radicals Supplement block (U+2E80..U+2EFF) and the KangXi Radicals block (U+2F00..U+2FDF), contain Han radicals. Radicals behave and are used differently than the ideographs. In particular, the radicals are used as symbols rather than as words or parts of words. The KangXi radicals all correspond to other existing characters in Unicode, and so are compatibility equivalents of those characters. On the other hand, all but two of the characters in the CJK Radicals Supplement block are considered unique and do not have decompositions. Yet the Standard is clear regarding the usage of both blocks of characters: these are not ideographs and should not be used as such.4 The third block is the Hangul Compatibility Jamo block (U+3130..U+318F). In Hangul script, the jamo are letters that basically represent phonemes (comparable to a typical alphabet), but they combine together graphically by forming an arrangement corresponding to a syllable occupying a square cell. The Hangul Jamo block (U+1100..U+11FF) encodes jamo that combine graphically with one another in this way. The Hangul Compatibility Jamo block, on the other hand, contains duplicates of those characters that differ in that they do not graphically combine with one another or with other combining jamo characters. The characters in this block are compatibility equivalents of the corresponding characters in the Hangul Jamo block. They should not normally be used, except in case you specifically do not want the jamo to combine with one another. This might be appropriate, for example, in a document describing the workings of Hangul script. That covers most of the characters related to East Asian standards. There are another dozen or so in various blocks that I will not elaborate on. Turning at last to non-East Asian characters, the largest group is a set of 1,018 characters that are singleton duplicates of other characters but with special font formatting properties. The special formatting is typically understood in certain contexts to have special meaning. There are two subgroups among these that need to be distinguished. Firstly, there are 27 characters in the Letterlike Symbols block (U+2100..U+214F), such as U+2111 The characters in the Letterlike Symbols block were added because they existed in source standards. The characters in the Mathematical Alphanumeric Symbols block, however, were created de novo. The decision to create new characters that seemingly violate the distinction between plain text and fancy text was not made without considerable debate, and was done only to meet the specific needs of mathematicians on the basis that these characters are used as symbols rather than as letters, and that they are needed to reflect meaningful distinctions in mathematical formulas. Note that these characters are not to be used to achieve the formatting effects of rich text within plain text; it would not a good idea to use them for writing e-mail, for example! Similar to these, there about a dozen characters that are variants of Latin, Greek and Hebrew letters. U+2107 Like the East Asian legacy standards, the non-East Asian standards included characters corresponding to multi-character sequences, though not nearly as many. The largest group of these are characters for vulgar fractions. Fractions can be represented in Unicode as character sequences using U+2044 FRACTION SLASH; for example, There are a handful of additional characters that correspond to combinations within the symbols and punctuations blocks, such as U+2034 Another significant group are the superscripts and subscripts. These are distributed among the Latin-1 Supplement block (U+00A0..U+00FF), the Spacing Modifier Letters block (U+02B0..U+02FF), and the Superscripts and Subscripts block (U+2070..U+209F). In certain contexts, some of these might be understood to have special meanings. For example, U+00B2 One of the smaller groups of characters with compatibility decompositions are spacing variants of combining marks. There are 20 of these, such as U+00A8 An even smaller group are the variants of U+0020 SPACE: there are twelve, most of them having specific widths for typographic use. There are three that prohibit line breaking, though: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, and U+202F NARROW NO-BREAK SPACE. Similar to the latter three, U+2011 NON-BREAKING HYPHEN is, obviously, a variant of U+2010 HYPHEN that prohibits line breaks. The categories discussed above have covered essentially all of the characters with compatibility decompositions defined in TUS 3.1, leaving only a dozen or less. I cannot give an exact count since it is so unclear (at least to me) how the remaining characters are best characterised that another person might have already included them in one of the above sets and perhaps have a different residue. I will just mention two that I have found hard to categorise: U+0E33 Of course, assigning every last compatibility-decomposable character to some category really is not important. Those are mostly ad hoc categories anyway; they were only introduced for instructional purposes. All that truly matters for a user with regard to these or any other character in Unicode is that the user knows when and how a given character should be used. The keys to that are to learn what you can about the characters you think you may need, understanding the semantic properties and the implication of issues such as normalisation or interaction with markup languages, and that you know what is already common practice among users. It is easy to be idealistic and say that the right way to represent a text element is by using a decomposed sequence, but if in actual practice the overwhelming majority of users are using a compatibility character, then it may not be a good idea to do otherwise without carefully considered reasons.
Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website. © 2003-2009 SIL International, all rights reserved, unless otherwise noted elsewhere on this page. |