Computers & Writing Systems
Mapping codepoints to Unicode encoding forms
Note: This is an Appendix to “Understanding Unicode™”. See also A review of characters with compatibility decompositions.

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit sequences used in each encoding form. In this description, each mapping is expressed in alternate forms, one of which is a mapping of bits between the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a coded character set encodes characters in terms of numerical values that have no specific computer representation or data type associated with them, for purposes of describing these mappings we consider codepoints in the Unicode codespace to have a width of 21 bits. This is the number of bits required for a binary representation of the entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32

The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32 are specified in TUS 3.1 and in UAX #19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus, if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit, then:

U = C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the Unicode scalar values and the 32-bit code units, as shown in Table 1.
Table 1 UTF-32 USV to code unit mapping

2 UTF-16

The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified in TUS 3.0. For codepoints in the range U+0000..U+FFFF, the mapping is the identity: the scalar value is encoded directly as a single 16-bit code unit. Supplementary-plane characters are encoded using a surrogate pair. If CH and CL represent the high and low surrogate code units, then the Unicode scalar value U can be calculated as follows:

U = (CH – D800₁₆) * 400₁₆ + (CL – DC00₁₆) + 10000₁₆

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

CH = (U – 10000₁₆) div 400₁₆ + D800₁₆
CL = (U – 10000₁₆) mod 400₁₆ + DC00₁₆

where “div” represents integer division (returns only the integer portion, rounded down) and “mod” represents the modulo operator.

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and code units, the UTF-16 mapping is as shown in Table 2:
Table 2 UTF-16 USV to code unit mapping

3 UTF-8

The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-8 are specified in TUS 3.1. As with the other encoding forms, calculating a Unicode scalar value from the 8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value must be expressed differently for different ranges of scalar values. Let us consider first the relationship between bits in the binary representation of codepoints and code units. This is shown for UTF-8 in Table 3:
Table 3 UTF-8 USV to code unit mapping

Note: There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8: Unicode limits it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard it can access the entire 31-bit codespace. For all practical purposes this difference is irrelevant, since the ISO/IEC codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting properties, which can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed to sequence-initial bytes: by looking at the first two bits, you can immediately determine whether a code unit is an initial byte in a sequence or a following byte. Secondly, by looking at the number of non-zero high-order bits in the first byte of the sequence, you can immediately tell how long the sequence is: if no high-order bits are set to one, the sequence contains exactly one byte; otherwise, the number of non-zero high-order bits is equal to the total number of bytes in the sequence.

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of “Understanding Unicode™”: characters in the range U+0000..U+007F are represented using a single byte, and the characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is automatically also encoded in UTF-8.

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice versa.
If U represents a Unicode scalar value and C1, C2, C3 and C4 represent the bytes in a UTF-8 byte sequence (in order), then the value of U can be calculated as follows:

If a sequence has one byte, then
    U = C1
Else if a sequence has two bytes, then
    U = (C1 – 192) * 64 + C2 – 128
Else if a sequence has three bytes, then
    U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128
Else
    U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128
End if

Going the other way, given a Unicode scalar value U, the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then
    C1 = U
Else if U+0080 <= U <= U+07FF, then
    C1 = U div 64 + 192
    C2 = U mod 64 + 128
Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then
    C1 = U div 4,096 + 224
    C2 = (U mod 4,096) div 64 + 128
    C3 = U mod 64 + 128
Else
    C1 = U div 262,144 + 240
    C2 = (U mod 262,144) div 4,096 + 128
    C3 = (U mod 4,096) div 64 + 128
    C4 = U mod 64 + 128
End if

where “div” represents integer division (returns only the integer portion, rounded down) and “mod” represents the modulo operator.
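The arithmetic above translates directly into code. The following is a minimal Python sketch (the function names utf8_encode and utf8_decode are illustrative, not part of the text); Python’s // and % operators play the roles of “div” and “mod”:

```python
def utf8_encode(u):
    """Map a Unicode scalar value to its UTF-8 byte sequence."""
    if u <= 0x7F:
        return [u]
    elif u <= 0x7FF:
        return [u // 64 + 192, u % 64 + 128]
    elif u <= 0xFFFF:
        # Scalar values exclude the surrogate range U+D800..U+DFFF.
        return [u // 4096 + 224, (u % 4096) // 64 + 128, u % 64 + 128]
    else:
        return [u // 262144 + 240, (u % 262144) // 4096 + 128,
                (u % 4096) // 64 + 128, u % 64 + 128]

def utf8_decode(seq):
    """Map a UTF-8 byte sequence back to a Unicode scalar value."""
    if len(seq) == 1:
        return seq[0]
    elif len(seq) == 2:
        return (seq[0] - 192) * 64 + seq[1] - 128
    elif len(seq) == 3:
        return (seq[0] - 224) * 4096 + (seq[1] - 128) * 64 + seq[2] - 128
    else:
        return ((seq[0] - 240) * 262144 + (seq[1] - 128) * 4096
                + (seq[2] - 128) * 64 + seq[3] - 128)
```

For example, utf8_encode(0x20AC) yields [0xE2, 0x82, 0xAC], the familiar three-byte sequence for the euro sign, and utf8_decode reverses it.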
Note that nothing in the arithmetic itself prevents a codepoint from being represented by a longer byte sequence than necessary; for example, U+0041 could in principle be represented in two, three or four bytes as well as the normal single byte, as shown in Table 4.

Table 4 “UTF-8” non-shortest sequences for U+0041

Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly, the UTF-8 specification stipulates that the shortest possible representation must be used; in TUS 3.1, this was made explicitly clear by specifying exactly which UTF-8 byte sequences are and are not legal. Thus, in the example above, each of the sequences other than the first is an illegal code unit sequence.

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair and then apply the UTF-8 mapping to each of the surrogate code units, producing a pair of three-byte sequences. To illustrate, consider the following:
Table 5 UTF-8-via-surrogates representation of a supplementary-plane character

Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-shortest representations of supplementary-plane characters are referred to as irregular code unit sequences rather than illegal code unit sequences. The distinction is subtle: software that conforms to the Unicode Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters, but is not allowed to generate them. In certain situations, though, software will want to reject such irregular UTF-8 sequences (for instance, where they might otherwise be used to circumvent security systems), and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them from a data stream.

The main motivation for making the distinction and for considering these six-byte sequences irregular rather than illegal is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this segment of the data will later be re-united with the remainder of the data, which has also been re-encoded in UTF-8. So we are assuming that there are two segments of data: one ending with an unpaired high surrogate, and one beginning with an unpaired low surrogate. As each segment of the data is trans-coded from UTF-16 to UTF-8, the question arises as to what should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is reassembled will be that a character has been lost.
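The pair of three-byte sequences shown in Table 5 can be reproduced by combining the surrogate arithmetic of Section 2 with the three-byte UTF-8 arithmetic of Section 3. A minimal Python sketch (the function names are illustrative assumptions, not from the text):

```python
def to_surrogates(u):
    """Split a supplementary-plane scalar value into a UTF-16 surrogate pair."""
    ch = (u - 0x10000) // 0x400 + 0xD800   # high surrogate
    cl = (u - 0x10000) % 0x400 + 0xDC00    # low surrogate
    return ch, cl

def utf8_three_byte(c):
    """Apply the three-byte UTF-8 arithmetic to a 16-bit code unit."""
    return [c // 4096 + 224, (c % 4096) // 64 + 128, c % 64 + 128]

# U+10330 encoded "via surrogates": each surrogate code unit is pushed
# through the three-byte mapping, giving an irregular six-byte sequence
# instead of the correct four-byte sequence F0 90 8C B0.
ch, cl = to_surrogates(0x10330)
irregular = utf8_three_byte(ch) + utf8_three_byte(cl)
```

Here `irregular` comes out as ED A0 80 ED BC B0, which is exactly the kind of six-byte pseudo-UTF-8 sequence the Standard classes as irregular.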
A more graceful way to deal with the data would be for the trans-coding process to translate the unpaired surrogate into the corresponding three-byte UTF-8 sequence, and then leave it to a later receiving process to decide what to do with it. Then, if the receiving process gets the data segments assembled again, that character will still be part of the information content of the data. The only problem is that it is now in a six-byte pseudo-UTF-8 sequence. Defining these sequences as irregular rather than illegal is intended to allow that character to be retained over the course of this overall process, in a form that conformant software is allowed to interpret even if it would not be allowed to generate it that way.

4 References

Davis, Mark. 2001. Unicode Standard Annex #19: UTF-32. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr19/.

Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.
"Rajesh", Thu, Feb 18, 2010 00:36 (EST) For long I was not able to find any info on the exact mechanism of unicode code points to utf-8 (and reverse) mapping. Thanks a lot!
"Dave P", Thu, Apr 28, 2011 23:13 (EDT) I came across this by accident and I'm sure glad that I did. It provides valuable insight, especially with regard to UTF-8, that I have not seen elsewhere. Thanks, Dave
"James Papadakis", Wed, Jul 13, 2011 16:44 (EDT)

Can someone give a specific example??? PLEASE! going from simple to difficult? I know I am retarded compared to the gurus, but Im just an amateur, so no bashing please. Your help is appreciated. I had no teacher to show me anything so I don't understand quite a few things. I am trying to recreate keysymdef.h in GNU/Linux Ubuntu 10.04 so as to reassign some unicode characters, and in turn change the keyboard mapping bindings. I am not a computer expert or programmer. I know that I need to add i.e. U+1FFD to a hexadecimal value in order to find its value? and what does this "Byte 3 = 0" or "Byte 3 = 1" or "Byte 3 = 7" mean in this file?? i.e.

/*
 * Latin 1
 * (ISO/IEC 8859-1 = Unicode U+0020..U+00FF)
 * Byte 3 = 0
 */
#define XK_0 0x0030  /* U+0030 DIGIT ZERO */

Here U+0030 is the same as the hex value 0x0030, and it also corresponds to the system character map file where all the glyphs are shown (Digit zero is U+0030 and its hex value is UTF-16 0x0030) but: i.e.

/*
 * Latin 2
 * Byte 3 = 1
 */
#ifdef XK_LATIN2
#define XK_Aogonek 0x01a1  /* U+0104 LATIN CAPITAL LETTER A WITH OGONEK */

Here U+0104 corresponds to the hex value 0x01a1. How did we get to this? How do I use the Byte 3 = 1 information for all this stuff? I know I am a bozo, but I'm completely self taught, forgive me for asking silly questions to computer gurus, but to me they are rather complicated. For instance, how do I find out what the hexadecimal value would be for U+1FFD? (so that it corresponds to the Unicode character when it is called by the machine? Obviously this would not be 0x1FFD would it? How can I find out?) I would assume it would be using this info:

/*
 * Greek
 * (not quite identical to, ISO/IEC 8859-7)
 * Byte 3 = 7
 */

for example this one was already there:

#define XK_Greek_OMICRONaccent 0x07a7  /* U+038C GREEK CAPITAL LETTER OMICRON WITH TONOS */

(so U+038C corresponds to 0x07a7) Any help would be much appreciated
"P. Wormer", Thu, Jan 2, 2014 11:07 (EST)

Third line of Table 4:

000000000000001000001 00000zzzzyyyyyyxxxxxx 1110zzzz 10000001 10000001

Shouldn't it be

000000000000001000001 00000zzzzyyyyyyxxxxxx 11100000 10000001 10000001

?
stephanie_smith, Tue, Jan 14, 2014 15:56 (EST) Many thanks for spotting this and letting us know! The error has now been fixed. Best wishes, Stephanie
© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.