NRSI: Computers & Writing Systems
A review of characters with compatibility decompositions
This is an Appendix to “Understanding Unicode™”.
In Section 6 of “Understanding Unicode™”, I provided an overview of some main groups of characters that have canonical or compatibility decompositions. That discussion covered all of the characters in Unicode that have canonical decompositions. Among those with compatibility decompositions, we looked at presentation forms that would otherwise be handled as glyphs by complex rendering technology (as described in Section 5.4 of “Understanding Unicode™”), and combinations that correspond to digraphs in certain writing systems. Those two groupings only covered a portion of the characters with compatibility decompositions, however. There are over 2,300 others that were not covered. Many of these might possibly be construed as ligatures or as multi-graphs, except for the fact that they generally appear to be not as simple to explain as the cases described in Section 6 of “Understanding Unicode™”. Whereas the characters described in Section 6 are compromising primarily just one or another of the Unicode design principles, many of these remaining characters compromise several principles at once. The only way to classify them is into ad hoc groups.
For many users, most of these characters are likely to be unimportant. Some of them can be rather puzzling for someone who is still early on in learning about Unicode, however. It is also easier to find information on well-behaved characters than on all the oddities. In a number of cases, the notes printed in the name lists that accompany the code charts give you what you need to understand how or whether to use a given compatibility-decomposable character. More often, though, the explanation is buried in the text of the Standard within the descriptions of the script blocks (Chapters 6–13). For a number of characters, no background information is provided at all.
This discussion is by no means a thorough explanation of all of these characters or a substitute for the content in the Standard. Everything I say here is mentioned somewhere in the Standard. The Standard is large, though, and many readers may not notice the details because they are overwhelmed by the size. This appendix is intended, therefore, to provide an introduction to this set of characters, which constitute perhaps the least principled elements of the Standard.
Note that this Appendix is not intended for readers who are still at the beginner stage. There are a number of detailed references to characters and blocks within the character set that a beginner might find confusing. If you have just read “Understanding Unicode™” for the first time, you might want to just skim this Appendix. If you do want to follow it closely, I recommend that you do so with the printed or online code charts at your side.
A large portion of these miscellaneous compatibility-decomposable characters are found in a limited number of blocks within the Standard. The following table shows their locations broken down by major areas within the character set. In order to reflect that most of these characters are concentrated within a few small ranges, the table also shows the number of blocks within each region that contain these characters, the contiguous range of those blocks, and the ratio of the number of these characters in that range to the total number of characters in that range.
Table 1. Characters with miscellaneous compatibility decompositions by area of the Unicode character set
Note that these numbers do not include the presentation forms and digraphs discussed in Sections 6.4 and 6.5 of “Understanding Unicode™”. The only significant impact on Table 1 of including those additional characters would have been to bring the ratio quoted for the Compatibility area to nearly 80%. These numbers point to the fact that, while there are a large number of these special-case compatibility characters, most of them are in blocks that many users will not be dealing with. There is still a relatively small portion of them that many users will likely have to deal with, though, such as U+00B9 SUPERSCRIPT ONE or U+0E33 THAI CHARACTER SARA AM.
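Every compatibility decomposition in the Unicode Character Database carries a formatting tag (such as <super> or <compat>), which gives one rough way to survey these characters. The following is a minimal sketch, assuming Python's standard unicodedata module; the function names are hypothetical ones introduced for illustration. The counts it prints reflect whichever Unicode version ships with your Python interpreter rather than TUS 3.1, and they include the presentation forms and digraphs, so they will not match the figures quoted here.

```python
import sys
import unicodedata
from collections import Counter


def compatibility_tag(ch):
    """Return the tag (e.g. '<wide>') of a compatibility decomposition, or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split()[0]
    return None  # no decomposition at all, or a canonical one


def tally_compatibility_tags():
    """Count compatibility-decomposable characters by decomposition tag."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        tag = compatibility_tag(chr(cp))
        if tag is not None:
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    for tag, n in tally_compatibility_tags().most_common():
        print(f"{tag:12} {n}")
```

The tags are assigned by the Standard itself, so they do not line up exactly with the ad hoc groups used in this appendix.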
A significant number of these characters have one thing in common: they come from legacy character set standards defined for East Asia. Typically, those standards were designed for use with encoding systems that were capable of supporting very large character sets, rather larger than the number of characters commonly used for the writing systems of that region. This provided a luxury of being able to encode a wide variety of forms as individual characters, including things that otherwise would require non-trivial formatting and layout controls. Presumably, this was done because it was much easier to encode some complex combination as a single character and use it with software that had few capabilities with regard to formatting and layout control (especially before the days of operating systems that provide sophisticated graphic device support) rather than to develop more sophisticated software systems.
For example, there are 88 characters in the CJK Compatibility block (U+3300..U+33FF) that represent words spelled in Katakana and arranged in a square so as to fill a single em-square display cell. For instance, the character U+3315 SQUARE KIROGURAMU corresponds to a Japanese transliteration of “kilogram”, but presented in a square cell. The same block contains 100 other squared characters that correspond to Latin abbreviations for units of measure, such as U+33AF SQUARE RAD OVER S SQUARED, and some for certain Japanese terms involving multiple Kanji, such as U+337F SQUARE CORPORATION (meaning ‘incorporated’). The remainder of this block constitutes characters for combinations of numbers and ideographs designating hours of the day and days of the month, such as U+3367 IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIFTEEN.
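To see what these squared characters stand for, you can look at their decomposition mappings. This is a minimal sketch, assuming Python's standard unicodedata module (whose data may be from a later version of Unicode than TUS 3.1); the helper names spell_out and code_points are hypothetical.

```python
import unicodedata

EXAMPLES = (
    "\u3315",  # SQUARE KIROGURAMU
    "\u337f",  # SQUARE CORPORATION
    "\u3367",  # IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIFTEEN
)


def spell_out(text):
    """Replace compatibility characters with their plain equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


for ch in EXAMPLES:
    # The raw mapping begins with a tag such as <square> or <compat>.
    print(unicodedata.name(ch))
    print("  mapping:   ", unicodedata.decomposition(ch))
    print("  spelled out:", code_points(spell_out(ch)))
```

Spelling the characters out in this way discards the square layout, which is exactly the formatting information that a compatibility decomposition does not preserve.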
The Enclosed Alphanumerics (U+2460..U+24FF) and Enclosed CJK Letters and Months (U+3200..U+32FF) blocks contain another 340 characters that are similar in nature. These include combinations for months of the year (like the series for hours and days mentioned above), and over 300 characters for things like numbers and letters enclosed within circles, within parentheses, or punctuated with a full stop. Examples include U+2499 NUMBER EIGHTEEN FULL STOP and U+32D5 CIRCLED KATAKANA KA. There is one other enclosed form in another block: U+3036 CIRCLED POSTAL MARK.
The 36 Roman numerals in the Number Forms block (U+2150..U+218F) also came from East Asian standards. Four of them, such as U+2181 ROMAN NUMERAL FIVE THOUSAND, are unique and do not duplicate any other characters in Unicode. The rest, though, such as U+2166 ROMAN NUMERAL SEVEN, do have compatibility decompositions.
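The difference between the two kinds of Roman numeral character shows up directly in the character data. A minimal sketch, assuming Python's standard unicodedata module; the describe helper is a hypothetical name used only for this illustration.

```python
import unicodedata


def describe(ch):
    """Summarise a character's code point, name and decomposition mapping."""
    decomp = unicodedata.decomposition(ch)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomp or 'no decomposition'}"


print(describe("\u2166"))  # ROMAN NUMERAL SEVEN: maps to the letters V, I, I
print(describe("\u2181"))  # ROMAN NUMERAL FIVE THOUSAND: unique, no mapping

# Compatibility normalisation replaces the first with ordinary Latin letters.
seven_as_letters = unicodedata.normalize("NFKC", "\u2166")
```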
There is another handful of characters representing special combinations taken from East Asian standards in the Letterlike Symbols block (U+2100..U+214F), such as U+2100 ACCOUNT OF.
All of these 630 or so East Asian characters are compatibility equivalents to other character sequences, most with additional layout or formatting applied. These all go against several of Unicode’s design principles at once. They blur the distinctions between characters and glyphs, between characters and graphemes or higher-level linguistic units, and between plain text and rich text or higher-level layout-control protocols. They violate the principle of unification, and many of them very much fly in the face of the basic notion that text is encoded in terms of dynamically composed sequences of characters rather than encoding entire strings as a single unit. Nevertheless, they were considered expedient for certain systems in the past and so made their way into national standards, and as a result have found their way into Unicode today. Unless you are working with East Asian legacy data or systems, however, you would not be likely to have any need for them. Certainly someone writing technical documents in English should not consider using a character like U+3392 SQUARE MHZ as a convenient way to represent the abbreviation for “megahertz”!
The characters described above are compatibility equivalents of sequences of multiple characters. There are a number of other characters from East Asian standards that are singleton variants of other characters, but still have special additional characteristics, such as formatting. One large group are the characters in the Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF). Characters from Chinese, Japanese and Korean are generally wider than Latin characters, and they also tend to be largely consistent in width. East Asian systems typically implement Latin characters using glyphs that are half the width of East Asian characters. In addition, they generally also include a set of Latin characters that are the full width of the Asian characters, and also a set of Katakana and Hangul characters that are half their normal width to supplement the narrow Latin letters. Early systems used a mixture of double-byte and single-byte character set encodings, and associated the full-width glyphs with the double-byte characters. Similarly, they associated the half-width glyphs with the single-byte characters. Later encoding standards made it possible to merge the character sets, and this was done with the width variations retained. This resulted in distinct full- and half-width variants of a number of characters in those standards, and these have been carried into Unicode. The Halfwidth and Fullwidth Forms block contains wide variants of Latin characters, such as U+FF29 FULLWIDTH LATIN CAPITAL LETTER I, and narrow variants of Katakana and Hangul characters, such as U+FFCE HALFWIDTH HANGUL LETTER WAE.
The Small Form Variants block (U+FE50..U+FE6F) is related to the half-width and full-width characters. This block contains characters that represent narrow variants of punctuation set within full-width cells, such as U+FE57 SMALL EXCLAMATION MARK.
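The width and small-form variants are marked in the character data with the tags <wide>, <narrow> and <small>. If you need to fold only these variants, without touching other compatibility characters such as superscripts, something like the following could be used. It is a minimal sketch, assuming Python's standard unicodedata module; width_tag and fold_width are hypothetical names, not functions of any library.

```python
import unicodedata

WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")


def width_tag(ch):
    """Return the width-related decomposition tag of ch, or None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in WIDTH_TAGS:
        return parts[0]
    return None


def fold_width(text):
    """Replace width and small-form variants, leaving everything else alone."""
    out = []
    for ch in text:
        if width_tag(ch) is None:
            out.append(ch)
        else:
            # Apply one step of the mapping: the code points after the tag.
            targets = unicodedata.decomposition(ch).split()[1:]
            out.append("".join(chr(int(t, 16)) for t in targets))
    return "".join(out)


folded = fold_width("\uff29\uff22\uff2d\ufe57")  # fullwidth I, B, M + small "!"
```

Note that this applies only a single step of the mapping. A halfwidth Hangul letter, for instance, maps to a Hangul Compatibility Jamo character, which itself has a further compatibility decomposition.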
The Kanbun block (U+3190..U+319F) is another small set of East Asian singleton variant characters. These are superscripted marks used to indicate Japanese reading order of Classical Chinese texts. Two of these characters are unique, but the other 14 are variants of other existing characters. These were not added for reasons of backward compatibility, but they are considered compatibility equivalents of the corresponding Han characters.
There are three other major blocks of Asian characters to be noted. Two of these, the CJK Radicals Supplement block (U+2E80..U+2EFF) and the KangXi Radicals block (U+2F00..U+2FDF), contain Han radicals. Radicals behave and are used differently than the ideographs. In particular, the radicals are used as symbols rather than as words or parts of words. The KangXi radicals all correspond to other existing characters in Unicode, and so are compatibility equivalents of those characters. On the other hand, all but two of the characters in the CJK Radicals Supplement block are considered unique and do not have decompositions. Yet the Standard is clear regarding the usage of both blocks of characters: these are not ideographs and should not be used as such.
The third block is the Hangul Compatibility Jamo block (U+3130..U+318F). In Hangul script, the jamo are letters that basically represent phonemes (comparable to a typical alphabet), but they combine together graphically by forming an arrangement corresponding to a syllable occupying a square cell. The Hangul Jamo block (U+1100..U+11FF) encodes jamo that combine graphically with one another in this way. The Hangul Compatibility Jamo block, on the other hand, contains duplicates of those characters that differ in that they do not graphically combine with one another or with other combining jamo characters. The characters in this block are compatibility equivalents of the corresponding characters in the Hangul Jamo block. They should not normally be used, except in case you specifically do not want the jamo to combine with one another. This might be appropriate, for example, in a document describing the workings of Hangul script.
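The difference in behaviour between the two sets of jamo can be observed through normalisation. The following is a minimal sketch, assuming Python's standard unicodedata module; the variable and helper names are hypothetical.

```python
import unicodedata


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


# HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A (conjoining jamo)
conjoining = "\u1100\u1161"
# HANGUL LETTER KIYEOK + HANGUL LETTER A (compatibility jamo)
compatibility = "\u3131\u314f"

# Canonical composition joins conjoining jamo into a single syllable...
nfc_conjoining = unicodedata.normalize("NFC", conjoining)
# ...but leaves the compatibility jamo as two separate letters.
nfc_compatibility = unicodedata.normalize("NFC", compatibility)
# Compatibility normalisation maps them to conjoining jamo, which then compose.
nfkc_compatibility = unicodedata.normalize("NFKC", compatibility)

for label, value in (
    ("NFC, conjoining", nfc_conjoining),
    ("NFC, compatibility", nfc_compatibility),
    ("NFKC, compatibility", nfkc_compatibility),
):
    print(f"{label:22} {code_points(value)}")
```

The last case is worth keeping in mind: if you have used compatibility jamo precisely because you do not want them to combine, applying a compatibility normalisation form to the text will undo that choice.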
That covers most of the characters related to East Asian standards. There are another dozen or so in various blocks that I will not elaborate on.
Turning at last to non-East Asian characters, the largest group is a set of 1,018 characters that are singleton duplicates of other characters but with special font formatting properties. The special formatting is typically understood in certain contexts to have special meaning. There are two subgroups among these that need to be distinguished. Firstly, there are 27 characters in the Letterlike Symbols block (U+2100..U+214F), such as U+2111 BLACK-LETTER CAPITAL I (which is used in mathematics to denote the imaginary part of a complex number). The second subgroup is the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF), which was added to Plane 1 in TUS 3.1. This block contains over 900 specially-formatted symbol variants of Latin and Greek letters and numbers, such as U+1D5FC MATHEMATICAL SANS-SERIF BOLD SMALL O.
The characters in the Letterlike Symbols block were added because they existed in source standards. The characters in the Mathematical Alphanumeric Symbols block, however, were created de novo. The decision to create new characters that seemingly violate the distinction between plain text and fancy text was not made without considerable debate, and was done only to meet the specific needs of mathematicians on the basis that these characters are used as symbols rather than as letters, and that they are needed to reflect meaningful distinctions in mathematical formulas. Note that these characters are not to be used to achieve the formatting effects of rich text within plain text; it would not be a good idea to use them for writing email, for example!
Similar to these, there are about a dozen characters that are variants of Latin, Greek and Hebrew letters. U+2107 EULER CONSTANT is a variant of a Latin letter used as a symbol. The Hebrew characters, like U+2135 ALEF SYMBOL, are distinguished from their nominal counterparts because they likewise are used as symbols and, significantly, have left-to-right directionality. At least some of the Greek characters, such as U+00B5 MICRO SIGN, are clearly to be used as symbols. The others are most likely just variant glyph forms that perhaps should be counted among the presentation forms discussed in Section 6.4 of “Understanding Unicode™”, though it is not currently clear to me just which are in that category. For people working with Greek text, though, all of these characters can probably be safely ignored. If variant glyph forms are needed, it is better to rely upon font feature mechanisms provided by “smart font” rendering technologies, or simply to find a typeface that has the desired glyph designs using the nominal characters.
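These font variants are tagged <font> in the character data, and each maps to a single nominal letter or digit. A minimal sketch, assuming Python's standard unicodedata module; font_variant_base is a hypothetical helper name.

```python
import unicodedata


def font_variant_base(ch):
    """Return the nominal character behind a <font> variant, or None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] == "<font>":
        return "".join(chr(int(p, 16)) for p in parts[1:])
    return None


# U+2111 BLACK-LETTER CAPITAL I, from the Letterlike Symbols block
print(font_variant_base("\u2111"))
# U+1D5FC MATHEMATICAL SANS-SERIF BOLD SMALL O, from Plane 1
print(font_variant_base("\U0001d5fc"))

# Compatibility normalisation performs the same replacement, so the
# distinction that mattered in the formula is lost.
normalised = unicodedata.normalize("NFKC", "\U0001d5fc")
```

Since the whole point of these characters is that the font distinction is meaningful, it follows that compatibility normalisation should not be applied blindly to mathematical text.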
Like the East Asian legacy standards, the non-East Asian standards included characters corresponding to multi-character sequences, though not nearly as many. The largest group of these are characters for vulgar fractions. Fractions can be represented in Unicode as character sequences using U+2044 FRACTION SLASH; for example, one half can be represented as <U+0031 DIGIT ONE, U+2044 FRACTION SLASH, U+0032 DIGIT TWO>. Various precomposed fractions came in from source standards, though, such as U+00BD VULGAR FRACTION ONE HALF. Most are in the Number Forms block (U+2150..U+218F), though three are in the Latin-1 Supplement block (U+00A0..U+00FF). As applications supporting advanced rendering technologies emerge, it should eventually become possible to present any fraction in text without needing to encode a precomposed form.
There are a handful of additional characters that correspond to combinations within the symbols and punctuation blocks, such as U+2034 TRIPLE PRIME.
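The relationship between a precomposed fraction and the corresponding fraction-slash sequence can be checked as follows. This is a minimal sketch, assuming Python's standard unicodedata module; the fraction helper is a hypothetical name introduced for illustration.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def fraction(numerator, denominator):
    """Build a fraction as digits joined by U+2044 FRACTION SLASH."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


one_half = "\u00bd"  # VULGAR FRACTION ONE HALF

# The mapping is tagged <fraction> and lists the three code points.
print(unicodedata.decomposition(one_half))

# Compatibility decomposition yields exactly the fraction-slash sequence.
same = unicodedata.normalize("NFKD", one_half) == fraction(1, 2)
print(same)
```

Whether the sequence is actually displayed as a well-formed fraction depends on the rendering technology and font in use, not on the encoding.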
Another significant group are the superscripts and subscripts. These are distributed among the Latin-1 Supplement block (U+00A0..U+00FF), the Spacing Modifier Letters block (U+02B0..U+02FF), and the Superscripts and Subscripts block (U+2070..U+209F). In certain contexts, some of these might be understood to have special meanings. For example, U+00B2 SUPERSCRIPT TWO might denote a footnote reference in text, a pitch in a phonetic transcription, or an exponent in a mathematical expression. In each case, there is an extra element of meaning beyond what is inherent in the counterpart to this character, U+0032 DIGIT TWO.
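Because the extra element of meaning is carried only by the superscript or subscript character itself, it disappears under compatibility normalisation. A minimal sketch, assuming Python's standard unicodedata module; script_tag is a hypothetical helper name.

```python
import unicodedata


def script_tag(ch):
    """Return '<super>' or '<sub>' if ch is such a variant, else None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in ("<super>", "<sub>"):
        return parts[0]
    return None


print(script_tag("\u00b2"))  # SUPERSCRIPT TWO
print(script_tag("\u2082"))  # SUBSCRIPT TWO
print(script_tag("2"))       # DIGIT TWO has no such tag

# "m" followed by SUPERSCRIPT TWO: the exponent becomes an ordinary digit.
flattened = unicodedata.normalize("NFKC", "m\u00b2")
print(flattened)
```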
One of the smaller groups of characters with compatibility decompositions are spacing variants of combining marks. There are 20 of these, such as U+00A8 DIAERESIS, located in a number of different blocks. All of these have decompositions consisting of U+0020 SPACE followed by a corresponding combining mark. These might be useful in talking about the corresponding combining marks, but otherwise are probably not that useful (though presumably they were included in legacy standards for some purpose).
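One way to list these spacing variants is to search for characters whose compatibility decomposition is U+0020 SPACE followed only by combining marks. The following is a minimal sketch, assuming Python's standard unicodedata module; spacing_mark_variants is a hypothetical name, and the search criterion (the <compat> tag, plus a general category beginning with M for everything after the space) is my own approximation of this group. The number found depends on the Unicode version bundled with Python and need not be 20.

```python
import sys
import unicodedata


def spacing_mark_variants():
    """Find characters that decompose to SPACE plus combining mark(s)."""
    found = []
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) < 3 or parts[0] != "<compat>" or parts[1] != "0020":
            continue
        marks = [chr(int(p, 16)) for p in parts[2:]]
        if all(unicodedata.category(m).startswith("M") for m in marks):
            found.append(chr(cp))
    return found


variants = spacing_mark_variants()
for ch in variants:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```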
An even smaller group are the variants of U+0020 SPACE: there are twelve, most of them having specific widths for typographic use. There are three that prohibit line breaking, though: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, and U+202F NARROW NO-BREAK SPACE. Similar to the latter three, U+2011 NON-BREAKING HYPHEN is, obviously, a variant of U+2010 HYPHEN that prohibits line breaks.
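The characters that prohibit line breaking are tagged <noBreak> in the character data, which distinguishes them from the fixed-width spaces. A minimal sketch, assuming Python's standard unicodedata module; no_break_base is a hypothetical helper name.

```python
import unicodedata


def no_break_base(ch):
    """Return the breakable counterpart of a <noBreak> character, or None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] == "<noBreak>":
        return "".join(chr(int(p, 16)) for p in parts[1:])
    return None


for ch in ("\u00a0", "\u2007", "\u202f", "\u2011", "\u2003"):
    base = no_break_base(ch)
    target = "not a no-break variant" if base is None else f"U+{ord(base):04X}"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {target}")
```

The last character in the loop, U+2003 EM SPACE, is included for contrast: it is a fixed-width variant of the space rather than a no-break one.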
The categories discussed above cover essentially all of the characters with compatibility decompositions defined in TUS 3.1, leaving only a dozen or fewer. I cannot give an exact count: it is unclear (at least to me) how the remaining characters are best characterised, so another person might already have included some of them in one of the above sets and be left with a different residue. I will just mention two that I have found hard to categorise: U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM. Because these both decompose into a combining mark followed by a base character, with the former combining with some preceding character, I have not known which of the preceding categories, if any, to put them into.
Of course, assigning every last compatibility-decomposable character to some category really is not important. Those are mostly ad hoc categories anyway; they were only introduced for instructional purposes. All that truly matters for a user with regard to these or any other characters in Unicode is knowing when and how a given character should be used. The keys to that are to learn what you can about the characters you think you may need, to understand their semantic properties and the implications of issues such as normalisation or interaction with markup languages, and to know what is already common practice among users. It is easy to be idealistic and say that the right way to represent a text element is by using a decomposed sequence, but if in actual practice the overwhelming majority of users are using a compatibility character, then it may not be a good idea to do otherwise without carefully considered reasons.
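The unusual shape of these two decompositions can be seen by listing the general category of each member. A minimal sketch, assuming Python's standard unicodedata module; decomposition_members is a hypothetical helper name.

```python
import unicodedata


def decomposition_members(ch):
    """List (code point, name, general category) for each mapped character."""
    parts = unicodedata.decomposition(ch).split()
    members = []
    for p in parts:
        if p.startswith("<"):
            continue  # skip the tag, e.g. <compat>
        member = chr(int(p, 16))
        members.append(
            (f"U+{p}", unicodedata.name(member), unicodedata.category(member))
        )
    return members


for ch in ("\u0e33", "\u0eb3"):  # THAI CHARACTER SARA AM, LAO VOWEL SIGN AM
    print(unicodedata.name(ch))
    for code, name, category in decomposition_members(ch):
        print(f"  {code} {name} ({category})")
```

In both cases the first member has general category Mn (a non-spacing mark) and the second has Lo (a letter), so the mark attaches to whatever precedes the original character rather than to the letter that follows it in the decomposition.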