NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/IWS-AppendixA

Mapping codepoints to Unicode encoding forms

Peter Constable, 2001-06-13

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit sequences used in each encoding form.

In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a coded character set encodes characters in terms of numerical values that have no specific computer representation or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32

The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32 are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit then:

U = C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the Unicode scalar values and the 32-bit code units, as shown in Table 1.

Codepoint range                  | Unicode scalar value (binary) | Code units (binary)
U+0000..U+D7FF, U+E000..U+10FFFF | xxxxxxxxxxxxxxxxxxxxx         | 00000000000xxxxxxxxxxxxxxxxxxxxx

Table 1 UTF-32 USV to code unit mapping
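As a minimal sketch of the identity mapping (Python is used here purely for illustration, and the function name `utf32_code_unit` is hypothetical rather than part of any standard API), the code unit can be computed and compared with Python's built-in `utf-32-be` codec:

```python
def utf32_code_unit(usv: int) -> int:
    """Return the single 32-bit code unit for a Unicode scalar value."""
    # Scalar values exclude the surrogate range U+D800..U+DFFF.
    if not 0 <= usv <= 0x10FFFF or 0xD800 <= usv <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return usv  # identity mapping: U = C


unit = utf32_code_unit(0x10011)
# 32-bit binary form: eleven leading zero bits, then the 21-bit scalar value.
print(f"{unit:032b}")
# Cross-check against the built-in big-endian UTF-32 codec.
assert unit.to_bytes(4, "big") == chr(0x10011).encode("utf-32-be")
```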

2 UTF-16

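The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified in TUS 3.0 (see note 1). Converting a well-formed surrogate pair to the Unicode scalar value of the character it represents is reasonably simple: if CH and CL represent the values of the high and low surrogate code units, then the corresponding Unicode scalar value U is calculated as follows: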

U = (CH – 0xD800) * 0x400 + (CL – 0xDC00) + 0x10000

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

CH = (U – 0x10000) div 0x400 + 0xD800

CL = (U – 0x10000) mod 0x400 + 0xDC00

where “div” represents integer division (returns only the integer portion, rounded down), and “mod” represents the modulo operator.
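The surrogate arithmetic above can be sketched in Python, where integer division is written `//` and modulo is written `%`. The function names are hypothetical, chosen for this illustration only:

```python
def surrogates_to_scalar(ch: int, cl: int) -> int:
    """Combine a high surrogate (ch) and low surrogate (cl) into a scalar value."""
    if not (0xD800 <= ch <= 0xDBFF and 0xDC00 <= cl <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return (ch - 0xD800) * 0x400 + (cl - 0xDC00) + 0x10000


def scalar_to_surrogates(u: int) -> tuple:
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= u <= 0x10FFFF:
        raise ValueError("scalar value is outside U+10000..U+10FFFF")
    ch = (u - 0x10000) // 0x400 + 0xD800  # integer division
    cl = (u - 0x10000) % 0x400 + 0xDC00   # modulo
    return ch, cl


high, low = scalar_to_surrogates(0x10011)
print(hex(high), hex(low))
print(hex(surrogates_to_scalar(high, low)))
```

The two functions are inverses of each other over the range U+10000..U+10FFFF, which makes a round trip a convenient sanity check.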

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and code units, the UTF-16 mapping is as shown in Table 2:

Codepoint range                | Unicode scalar value (binary) | Code units (binary)
U+0000..U+D7FF, U+E000..U+FFFF | 00000xxxxxxxxxxxxxxxx         | xxxxxxxxxxxxxxxx
U+10000..U+10FFFF              | uuuuuxxxxxxyyyyyyyyyy         | 110110wwwwxxxxxx 110111yyyyyyyyyy

(where uuuuu = wwww + 1)

Table 2 UTF-16 USV to code unit mapping
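The bit layout in Table 2 can also be followed literally with shifts and masks. The sketch below is illustrative only: the function name `utf16_units_bitwise` is an assumption of this example, and the variable names simply mirror the letters used in the table.

```python
def utf16_units_bitwise(u: int) -> list:
    """Return the UTF-16 code units for scalar value u, following Table 2."""
    if not 0 <= u <= 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if u < 0x10000:
        return [u]                      # BMP: one 16-bit unit, bits copied as-is
    uuuuu = u >> 16                     # top 5 bits of the 21-bit value
    wwww = uuuuu - 1                    # fits in 4 bits because uuuuu is 1..16
    xxxxxx = (u >> 10) & 0x3F           # next 6 bits
    yyyyyyyyyy = u & 0x3FF              # low 10 bits
    high = (0b110110 << 10) | (wwww << 6) | xxxxxx
    low = (0b110111 << 10) | yyyyyyyyyy
    return [high, low]


units = utf16_units_bitwise(0x10011)
print([f"{unit:016b}" for unit in units])
```

Subtracting 1 from uuuuu has the same effect as subtracting 0x10000 from the scalar value in the arithmetic form of the mapping, so both approaches yield the same surrogate pair.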

3 UTF-8

The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-8 are specified in TUS 3.1 (see note 2). As with the other encoding forms, calculating a Unicode scalar value from the 8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value must be expressed differently for different ranges of scalar values.

Let us consider first the relationship between bits in the binary representation of codepoints and code units. This is shown for UTF-8 in Table 3:

Codepoint range                | Scalar value (binary)  | Byte 1   | Byte 2   | Byte 3   | Byte 4
U+0000..U+007F                 | 00000000000000xxxxxxx  | 0xxxxxxx |          |          |
U+0080..U+07FF                 | 0000000000yyyyyxxxxxx  | 110yyyyy | 10xxxxxx |          |
U+0800..U+D7FF, U+E000..U+FFFF | 00000zzzzyyyyyyxxxxxx  | 1110zzzz | 10yyyyyy | 10xxxxxx |
U+10000..U+10FFFF              | uuuzzzzzzyyyyyyxxxxxx  | 11110uuu | 10zzzzzz | 10yyyyyy | 10xxxxxx

Table 3 UTF-8 USV to code unit mapping

Note

There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8 since Unicode limits it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can access the entire 31-bit codespace. For all practical purposes, this difference is irrelevant since the ISO/IEC codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is an initial byte in a sequence or a following byte: only following bytes begin with the bits 10. Secondly, by counting the consecutive high-order bits set to one in the first byte of the sequence (that is, the one bits that precede the first zero bit), you can immediately tell how long the sequence is: if the high-order bit is zero, then the sequence contains exactly one byte. Otherwise, the number of leading one bits is equal to the total number of bytes in the sequence.
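A short Python sketch of these two properties follows. The function name `utf8_sequence_length` and the convention of returning 0 for a following byte are assumptions made for this example only:

```python
def utf8_sequence_length(byte: int) -> int:
    """Classify one UTF-8 code unit using the bit patterns of Table 3.

    Returns 0 for a following (non-initial) byte, otherwise the total
    number of bytes in the sequence that this initial byte starts.
    """
    if (byte & 0b10000000) == 0b00000000:
        return 1          # 0xxxxxxx: single-byte sequence
    if (byte & 0b11000000) == 0b10000000:
        return 0          # 10xxxxxx: following byte
    if (byte & 0b11100000) == 0b11000000:
        return 2          # 110yyyyy
    if (byte & 0b11110000) == 0b11100000:
        return 3          # 1110zzzz
    if (byte & 0b11111000) == 0b11110000:
        return 4          # 11110uuu
    raise ValueError("byte cannot occur in UTF-8")


# Initial byte of a four-byte sequence followed by three following bytes.
print([utf8_sequence_length(b) for b in bytes([0xF0, 0x90, 0x80, 0x91])])
```

This check looks only at the high-order bits; it does not by itself establish that a whole sequence is legal.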

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of “Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is automatically also encoded in UTF-8.
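This ASCII compatibility is easy to confirm with Python's built-in codecs. The sample string below is arbitrary and serves only as an illustration:

```python
text = "Plain ASCII text, digits 0-9."

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Every byte is below 0x80, and the two encodings are byte-for-byte identical.
assert all(b <= 0x7F for b in utf8_bytes)
assert ascii_bytes == utf8_bytes

# Data written as ASCII can therefore be read back as UTF-8 unchanged.
round_trip = ascii_bytes.decode("utf-8")
print(round_trip == text)
```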

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice versa. If U represents a Unicode scalar value and C1, C2, C3 and C4 represent the bytes of a UTF-8 byte sequence (in order), then U can be calculated as follows:

If a sequence has one byte, then

U = C1

Else if a sequence has two bytes, then

U = (C1 – 192) * 64 + C2 – 128

Else if a sequence has three bytes, then

U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128

Else

U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128

End if
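The decoding arithmetic above translates almost line for line into Python. This is a sketch under the assumption that the input is already a single well-formed sequence of one to four bytes. The function name `utf8_to_scalar` is hypothetical, and no check for legal sequences is attempted here:

```python
def utf8_to_scalar(seq: bytes) -> int:
    """Compute the scalar value of one well-formed UTF-8 byte sequence."""
    n = len(seq)
    if n == 1:
        return seq[0]
    if n == 2:
        c1, c2 = seq
        return (c1 - 192) * 64 + c2 - 128
    if n == 3:
        c1, c2, c3 = seq
        return (c1 - 224) * 4096 + (c2 - 128) * 64 + c3 - 128
    if n == 4:
        c1, c2, c3, c4 = seq
        return ((c1 - 240) * 262144 + (c2 - 128) * 4096
                + (c3 - 128) * 64 + c4 - 128)
    raise ValueError("a UTF-8 sequence has between one and four bytes")


for sample in (b"\x41", b"\xc3\xa9", b"\xe2\x82\xac", b"\xf0\x90\x80\x91"):
    print(f"U+{utf8_to_scalar(sample):04X}")
```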

Going the other way, given a Unicode scalar value U, then the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then

C1 = U

Else if U+0080 <= U <= U+07FF, then

C1 = U div 64 + 192

C2 = U mod 64 + 128

Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then

C1 = U div 4,096 + 224

C2 = (U mod 4,096) div 64 + 128

C3 = U mod 64 + 128

Else

C1 = U div 262,144 + 240

C2 = (U mod 262,144) div 4,096 + 128

C3 = (U mod 4,096) div 64 + 128

C4 = U mod 64 + 128

End if

where “div” represents integer division (returns only the integer portion, rounded down), and “mod” represents the modulo operator.
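The encoding direction can be sketched the same way, writing integer division as `//` and modulo as `%` in Python. The function name `scalar_to_utf8` is an assumption of this example, and the result is compared with Python's built-in codec as a cross-check:

```python
def scalar_to_utf8(u: int) -> bytes:
    """Compute the UTF-8 byte sequence for a Unicode scalar value."""
    if not 0 <= u <= 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if u <= 0x7F:
        return bytes([u])
    if u <= 0x7FF:
        return bytes([u // 64 + 192,
                      u % 64 + 128])
    if u <= 0xFFFF:
        return bytes([u // 4096 + 224,
                      (u % 4096) // 64 + 128,
                      u % 64 + 128])
    return bytes([u // 262144 + 240,
                  (u % 262144) // 4096 + 128,
                  (u % 4096) // 64 + 128,
                  u % 64 + 128])


for u in (0x41, 0xE9, 0x20AC, 0x10011):
    encoded = scalar_to_utf8(u)
    assert encoded == chr(u).encode("utf-8")  # agrees with the built-in codec
    print(f"U+{u:04X} ->", " ".join(f"{b:02X}" for b in encoded))
```

Because each range is tested in ascending order, the shortest possible sequence is always the one produced.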
If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-hand column, certain codepoints can potentially be represented in more than one way. For example, substituting U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:

Codepoint             | Pattern               | Byte 1   | Byte 2   | Byte 3   | Byte 4
000000000000001000001 | 00000000000000xxxxxxx | 01000001 |          |          |
000000000000001000001 | 0000000000yyyyyxxxxxx | 11000001 | 10000001 |          |
000000000000001000001 | 00000zzzzyyyyyyxxxxxx | 11100000 | 10000001 | 10000001 |
000000000000001000001 | uuuzzzzzzyyyyyyxxxxxx | 11110000 | 10000000 | 10000001 | 10000001

Table 4 “UTF-8” non-shortest sequences for U+0041

Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly, the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was made explicit by specifying exactly which UTF-8 byte sequences are and are not legal. Thus, in the example above, each of the sequences other than the first is an illegal code unit sequence.
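As an illustration, Python's built-in strict UTF-8 decoder accepts only the first of the four candidate sequences from Table 4. The helper name `is_legal_utf8` is hypothetical, and the behaviour shown is that of Python's codec rather than a quotation from the Standard:

```python
def is_legal_utf8(seq: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the sequence."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The four candidate encodings of U+0041 from Table 4.
candidates = [
    bytes([0b01000001]),
    bytes([0b11000001, 0b10000001]),
    bytes([0b11100000, 0b10000001, 0b10000001]),
    bytes([0b11110000, 0b10000000, 0b10000001, 0b10000001]),
]

results = [is_legal_utf8(seq) for seq in candidates]
print(results)
```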

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To illustrate, consider the following:

Supplementary-plane codepoint U+10011
Normal UTF-8 byte sequence 0xF0 0x90 0x80 0x91
UTF-16 surrogate pair 0xD800 0xDC11
“UTF-8” mapping of surrogates 0xED 0xA0 0x80 0xED 0xB0 0x91

Table 5 UTF-8-via-surrogates representation of supplementary-plane character
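The byte values in Table 5 can be reproduced in Python. The sketch relies on the `surrogatepass` error handler of Python's UTF-8 codec, which is a feature of Python and not something defined by the Unicode Standard. It is used here only to force the surrogate code units through the UTF-8 bit mapping:

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hexadecimal values."""
    return " ".join(f"{b:02X}" for b in data)


# Normal four-byte UTF-8 sequence for U+10011.
regular = "\U00010011".encode("utf-8")

# The surrogate pair D800 DC11, each code unit pushed through the
# three-byte UTF-8 pattern separately.
irregular = "\ud800\udc11".encode("utf-8", "surrogatepass")

print(hex_bytes(regular))
print(hex_bytes(irregular))

# Python's strict decoder refuses the six-byte form.
try:
    irregular.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)
```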

Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-shortest representations of supplementary-plane characters are referred to as irregular code unit sequences rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters, but is not allowed to generate these irregular sequences. In certain situations, though, software will want to reject such irregular UTF-8 sequences (for instance, where these might otherwise be used to evade security checks), and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them from a data stream.

The main motivation for making the distinction and for considering these 6-byte sequences to be irregular rather than illegal is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this segment of the data will later be reunited with the remainder of the data, which has also been re-encoded in UTF-8. So, we are assuming that there are two segments of data out there: one ending with an unpaired high surrogate, and one beginning with an unpaired low surrogate.

Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data segments assembled again, that character will still be part of the information content of the data. The only problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is intended to allow that character to be retained over the course of this overall process in a form that conformant software is allowed to interpret, even if it would not be allowed to generate it that way.
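The scenario can be simulated in a few lines of Python. This is a sketch under several assumptions: the function names `transcode_segment` and `reassemble` are hypothetical, the segments are modelled as lists of 16-bit code unit values, and Python's `surrogatepass` error handler stands in for a process that is willing to pass unpaired surrogates through.

```python
def transcode_segment(units: list) -> bytes:
    """Transcode UTF-16 code units to UTF-8, keeping unpaired surrogates."""
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        paired = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                  and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if paired:
            # Well-formed pair inside the segment: combine as usual.
            u = (u - 0xD800) * 0x400 + (units[i + 1] - 0xDC00) + 0x10000
            i += 1
        chars.append(chr(u))
        i += 1
    # An unpaired surrogate becomes a three-byte pseudo-UTF-8 sequence.
    return "".join(chars).encode("utf-8", "surrogatepass")


def reassemble(first: bytes, second: bytes) -> str:
    """Join two transcoded segments and re-pair any split surrogates."""
    joined = (first + second).decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges adjacent high and low surrogates.
    return joined.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


segment_a = [0x0041, 0xD800]   # "A" followed by an unpaired high surrogate
segment_b = [0xDC11, 0x0042]   # unpaired low surrogate followed by "B"

part_a = transcode_segment(segment_a)
part_b = transcode_segment(segment_b)
text = reassemble(part_a, part_b)
print([f"U+{ord(c):04X}" for c in text])
```

In this simulation the character U+10011 survives the split because the receiving side interprets the 6-byte sequence. A process that dropped the unpaired surrogates would deliver only "A" and "B".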

4 References

Davis, Mark. 2001. Unicode standard annex #19: UTF-32. Version 3.1.0. Cupertino, CA: The Unicode Consortium. The current version is available online at http://www.unicode.org/unicode/reports/tr19/.




Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.

"Rajesh", Thu, Feb 18, 2010 00:36 (CST)

Very useful article and gives valuable info rare to find on net

For long I was not able to find any info on the exact mechanism of unicode code points to utf-8 (and reverse) mapping. Thanks a lot!

"Dave P", Thu, Apr 28, 2011 23:13 (CDT)

Excellent information

I came across this by accident and I'm sure glad that I did. It provides valuable insight, especially with regard to UTF-8, that I have not seen elsewhere.

Thanks,

Dave

"James Papadakis", Wed, Jul 13, 2011 16:44 (CDT)

How to calculate hex value using unicode?

Can someone give a specific example??? PLEASE! going from simple to difficult?

I know I am retarded compared to the gurus, but Im just an amateur, so no bashing please. Your help is appreciated. I had no teacher to show me anything so I don't understand quite a few things.

I am trying to recreate keysymdef.h in GNU/Linux Ubuntu10.04 so as to reassign some unicode characters, and in turn change the keyboard mapping bindings. I am not a computer expert or programmer. I know that I need to add i.e. U+1FFD to a hexadecimal value in order to find its value?

and what does this "Byte 3=0" or Byte 3=1" or Byte 3=7" mean in this file??

i.e.

/*
 * Latin 1
 * (ISO/IEC 8859-1 = Unicode U+0020..U+00FF)
 * Byte 3 = 0
 */
#define XK_0 0x0030 /* U+0030 DIGIT ZERO */


Here U+0030 is the same as the hex value 0x0030, and it also corresponds to the system character map file where all the glyphs are shown (Digit zero is U+0030 and its hex value is UTF-16 0x0030)

but:

i.e.

/*
 * Latin 2
 * Byte 3 = 1
 */
#ifdef XK_LATIN2
#define XK_Aogonek 0x01a1 /* U+0104 LATIN CAPITAL LETTER A WITH OGONEK */


Here U+0104 corresponds to the hex value 0x01a1. How did we get to this? How do I use the Byte 3=1 information for all this stuff? I know I am a bozo, but I'm completely self taught, forgive me for asking silly questions to computer gurus, but to me they are rather complicated.

For instance, how do I find out what the hexadecimal value would be for U+1FFD, so that it corresponds to the Unicode character when it is called by the machine? Obviously this would not be 0x1FFD, would it? How can I find out?

I would assume it would be using this info:

/*
 * Greek
 * (not quite identical to, ISO/IEC 8859-7)
 * Byte 3 = 7
 */


for example this one was already there:

#define XK_Greek_OMICRONaccent 0x07a7 /* U+038C GREEK CAPITAL LETTER OMICRON WITH TONOS */

(so U+038C corresponds to 0x07a7)


Any help would be much appreciated

"P. Wormer", Thu, Jan 2, 2014 11:07 (CST)

Typo?

Third line of Table 4:

000000000000001000001 00000zzzzyyyyyyxxxxxx 1110zzzz 10000001 10000001

Shouldn't be

000000000000001000001 00000zzzzyyyyyyxxxxxx 11100000 10000001 10000001  ?

stephanie_smith, Tue, Jan 14, 2014 15:56 (CST)

Re: Typo?

Many thanks for spotting this and letting us know! The error has now been fixed.

Best wishes,

Stephanie


1 No revisions to UTF-16 were made in TUS 3.1. The calculation for converting from the code values for a surrogate pair code unit sequence to the Unicode scalar value of the character being represented is reasonably simple: if CH and CL represent the values of the high and low surrogate code units in a well-formed surrogate pair, then the corresponding Unicode scalar value U is calculated as follows:

2 TUS 3.1 made certain revisions to the definitions for UTF-8 in order to clarify constraints related to non-shortest form representations. These are discussed below.

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.