NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/VendorUseOfPUA

Use of the Unicode Private Use Areas by Others

Peter Constable, 2012-01-04

It is not always easy to find out how software vendors have used the Unicode Private Use Areas (PUA). A vendor's use of PUA code points need not concern end users who have their own PUA character needs, provided that the vendor's products do not assume semantics for PUA code points in ways that affect text processing to the detriment of those users. Some products, however, are known to do exactly that.

Therefore, we want to document what we do know about use of the PUA by major software vendors and others.

Adobe

“Adobe's PUA usage (fortunately deprecated now) involves U+F634 .. U+F8FE.” [Eric Muller, Unicode List, 2003-4-24]

According to Adobe's specification for glyph naming conventions [1], certain PostScript glyph names are associated with PUA code points. Software that processes textual data in terms of PostScript glyph names may infer these PUA code points. For instance, suppose a PDF document contains a glyph reference to tsuperior, the text is displayed in a PDF reader such as Adobe Acrobat, and the user selects that text and copies it to the clipboard. If the software uses the associations specified in the Adobe Glyph List, then the code point copied to the clipboard for the glyph tsuperior will be U+F6F3.

I've documented PUA code points associated with PostScript glyph names on another page: PUA use in the Adobe Glyph List.
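The copy-to-clipboard scenario above can be sketched in a few lines of Python. This is only an illustration: the table `AGL_PUA` is a hypothetical one-entry excerpt holding the single mapping cited in the text, and the function names are invented here, not taken from any Adobe software.

```python
# Hypothetical excerpt of a glyph-name table: only the one mapping
# cited in the text is included.
AGL_PUA = {"tsuperior": 0xF6F3}

# PUA range quoted from Eric Muller's message above.
ADOBE_PUA_FIRST, ADOBE_PUA_LAST = 0xF634, 0xF8FE


def glyph_names_to_text(names):
    """Return the string a glyph-name-based tool might place on the clipboard."""
    return "".join(chr(AGL_PUA[name]) for name in names)


def in_adobe_pua(ch):
    """True if the character lies in the range Adobe is reported to have used."""
    return ADOBE_PUA_FIRST <= ord(ch) <= ADOBE_PUA_LAST


copied = glyph_names_to_text(["tsuperior"])
print(f"U+{ord(copied):04X}", in_adobe_pua(copied))
```

A user whose own private-use characters fall in the same range would have no way to tell the two uses apart from the code point alone.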

Apple

Apple maintains online documentation of their PUA usage at Apple's PUA usage. As of 2002-12-19, they had made use of the following ranges:
  • U+F700..U+F747
  • U+F803..U+F86B
  • U+F870..U+F8FF
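To see whether existing text collides with these ranges, a simple scan is enough. The following is a minimal sketch that hard-codes the three ranges listed above; the function names are made up for this example, and the ranges reflect only the 2002-12-19 snapshot.

```python
# The three ranges listed above (as of 2002-12-19).
APPLE_PUA_RANGES = [(0xF700, 0xF747), (0xF803, 0xF86B), (0xF870, 0xF8FF)]


def in_apple_pua(code_point):
    """True if the code point falls in one of the ranges listed above."""
    return any(lo <= code_point <= hi for lo, hi in APPLE_PUA_RANGES)


def find_apple_pua(text):
    """Return (index, 'U+XXXX') for each character in one of the ranges."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if in_apple_pua(ord(ch))
    ]


print(find_apple_pua("a\uf700b\ue000"))
```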

ConScript Unicode Registry

 ConScript Unicode Registry

IBM

“IBM codepage (not CCSID) 1449 assigns IBM-specific characters to U+F83D..U+F8FF (the last 195 BMP-PUA code points)” [Markus Scherer, Unicode List, 2003-4-24]
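The figure of 195 code points can be checked with a line of arithmetic, given that the BMP private use area ends at U+F8FF. A small sketch (the variable names are chosen here for illustration):

```python
# The BMP private use area runs from U+E000 to U+F8FF.
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF

# Start of the range quoted for IBM code page 1449.
IBM_1449_FIRST = 0xF83D

# Inclusive count of code points from U+F83D to the end of the BMP PUA.
ibm_count = BMP_PUA_LAST - IBM_1449_FIRST + 1
print(ibm_count)
```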

Microsoft

Microsoft has made use of PUA code points in three ways: for symbol fonts, for presentation-form glyphs in localised implementations involving complex scripts such as Thai and Arabic, and for supporting “end-user-defined character” (EUDC) ranges in Far East character sets.

Symbol fonts [2]

Symbol fonts are handled in special ways by Windows GDI such that, in theory, a symbol font could have glyphs encoded anywhere in the Basic Multilingual Plane of Unicode. In practice, though, symbol fonts generally use the PUA range U+F020..U+F0FF [3].

While we are aware of software that applies special-case handling to text formatted with a symbol font, we are not aware of any software that applies special-case handling to all text encoded in the range U+F020..U+F0FF.
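The range U+F020..U+F0FF corresponds to the 8-bit character codes 0x20..0xFF shifted up by 0xF000. The sketch below illustrates that correspondence; the fixed offset is an assumption made for this example and is not stated in the text above, and the function names are hypothetical.

```python
SYMBOL_PUA_OFFSET = 0xF000  # assumed offset: byte 0x20 -> U+F020, 0xFF -> U+F0FF


def symbol_byte_to_pua(byte_value):
    """Map an 8-bit symbol-font character code to its PUA code point."""
    if not 0x20 <= byte_value <= 0xFF:
        raise ValueError("symbol-font codes are expected in 0x20..0xFF")
    return SYMBOL_PUA_OFFSET + byte_value


def pua_to_symbol_byte(code_point):
    """Inverse mapping; rejects code points outside U+F020..U+F0FF."""
    if not 0xF020 <= code_point <= 0xF0FF:
        raise ValueError("not in the symbol-font PUA range")
    return code_point - SYMBOL_PUA_OFFSET


print(f"U+{symbol_byte_to_pua(0x41):04X}")
```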

Presentation forms for complex scripts

In localised versions of Windows 95 for Arabic, Hebrew and Thai, PUA codepoints were used for presentation-form glyphs (positional variants of diacritics, contextual variants, and composite base-diacritic forms). Windows GDI would receive calls from TextOut and similar programming interfaces, and transform the string to use PUA code points for whatever presentation forms were needed for the given script. This would be invisible to the application, and so would not imply these PUA code points would be handled in any special way by applications [4].

These implementations were carried forward into later versions of Windows and, I believe, were eventually incorporated into the initial versions of the shaping engines for these scripts in the Uniscribe processor (usp10.dll). Microsoft has since re-written these shaping engines, and so newer versions of Uniscribe may not use these PUA ranges in the same way (unless the early implementations have been kept for backward compatibility to be used with certain existing fonts).

At least some of the Arabic fonts that ship with Windows XP (e.g., Arabic Transparent) use these PUA ranges: U+E816, U+E818, U+E820..U+E82D, U+E832..U+E836.

At least some of the Hebrew fonts that ship with Windows XP (e.g., Miriam) use the PUA range U+E802..U+E805 [5].

The “UPC” Thai fonts that ship with Windows XP use the PUA range U+F700..U+F71D.

The core fonts that ship with Windows XP use some of these same PUA code points in these same ways: U+E801..U+E805 are used for Hebrew presentation forms, and U+E818 is used for an Arabic presentation form. Some of the core fonts (e.g., Tahoma, but not Arial or Times New Roman) use U+F700..U+F71D for Thai presentation forms.

At least some of the core fonts also use additional PUA code points for other presentation forms. In Tahoma, for instance, U+F001 and U+F002 are used for fi and fl ligatures, and U+F004..U+F031 appear to be used for positional variants of various Latin combining marks. Not all of these code points are used in Arial and Times New Roman, however. In fact, Arial uses U+F00A..U+F00E for glyphs that appear to be intended for drawing carets (text insertion-point icons) in various text-direction contexts.

It should be noted that some recent Microsoft products will handle some of the code points in these PUA ranges in special ways in certain situations. For a detailed examination, see Handling of PUA Characters in Microsoft Software.

EUDC

EUDC ranges in Far East character sets have been supported in various Microsoft products using PUA ranges as follows:

Far East character set        PUA range
Big-5 (code page 950)         U+E000..U+F848
GB2312-80 (code page 936)     U+E000..U+EDE7
Shift JIS (code page 932)     U+E000..U+E757
Korean (code page 949)        U+E000..U+E0BB

Microsoft use of PUA for EUDC support

This is documented in MSDN Library: Chinese EUDC, Japanese EUDC, and Korean EUDC [6].
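The table of EUDC ranges lends itself to a small lookup. The following sketch, with invented names, records the four ranges from the table, computes how many code points each one spans, and checks whether a given code point would be treated as EUDC under a given code page.

```python
# EUDC PUA ranges from the table above, keyed by Windows code page number.
EUDC_PUA_RANGES = {
    950: (0xE000, 0xF848),  # Big-5
    936: (0xE000, 0xEDE7),  # GB2312-80
    932: (0xE000, 0xE757),  # Shift JIS
    949: (0xE000, 0xE0BB),  # Korean
}


def eudc_size(code_page):
    """Number of code points in the EUDC range for a code page."""
    lo, hi = EUDC_PUA_RANGES[code_page]
    return hi - lo + 1


def is_eudc(code_page, code_point):
    """True if the code point lies in the EUDC range for the code page."""
    lo, hi = EUDC_PUA_RANGES[code_page]
    return lo <= code_point <= hi


for cp in sorted(EUDC_PUA_RANGES):
    print(cp, eudc_size(cp))
```

Because every range starts at U+E000, a user-assigned character at, say, U+E100 would count as EUDC under code pages 932, 936 and 950 but not under 949.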

MirBSD operating system and other MirOS Project software

The MirBSD operating system and other MirOS Project software use Unicode code points U+EF80..U+EFFF to store the wide-character representation of raw 8-bit data (octets) that is not UTF-8 encoded, when such a temporary conversion is required.
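The range U+EF80..U+EFFF holds exactly 128 code points, which matches the 128 octet values 0x80..0xFF that can occur as invalid bytes in a UTF-8 stream. The sketch below shows one way such a scheme could work in Python. It assumes that octet N maps to U+EF00 + N, which is not stated in the text above, and it is not MirOS code; the handler name `octets-to-pua` and the function names are invented for this example.

```python
import codecs

PUA_BASE = 0xEF00  # assumed mapping: octet 0x80 -> U+EF80, octet 0xFF -> U+EFFF


def _octets_to_pua(err):
    """Decode error handler: replace each undecodable octet with a PUA code point."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("octets-to-pua", _octets_to_pua)


def decode_with_pua(data):
    """Decode UTF-8, keeping invalid octets as U+EF80..U+EFFF."""
    return data.decode("utf-8", errors="octets-to-pua")


def encode_with_pua(text):
    """Reverse the mapping: PUA code points in the range become raw octets again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9 \xc3\xa9"  # a Latin-1 e-acute followed by a UTF-8 e-acute
text = decode_with_pua(raw)
print([f"U+{ord(c):04X}" for c in text])
```

A scheme like this conflicts with any user data that already contains characters in U+EF80..U+EFFF, since those would be turned into raw octets on the way back out.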

Other information related to private-use characters for Far-East implementations

The following links have been suggested to me for inclusion. They discuss Far-East character set encodings implemented in various systems, but I have not looked into implications for assumptions made in software products regarding semantics of Unicode PUA codepoints.

Code Set Conversion between SJIS and eucJP: This site discusses variations in different implementations of Shift-JIS and Japanese EUC, with some mention of vendor-defined and user-defined characters. The same authors have provided another page, Problems and Solutions for Unicode and User/Vendor Defined Characters, that discusses the PUA in relation to Japanese legacy character set encodings in greater detail.


1 See Unicode and Glyph Names.
2 By “symbol font,” I mean TrueType fonts using a platform ID of 3 and an encoding ID of 0, which technically means “undefined character set or indexing scheme.”
3 I know of only a small number of shipping symbol fonts that do not use this range: Akhbar MT is one of a selection of Arabic fonts that are included in the multilingual proofing-tools add-on for Office XP and that use the PUA range U+F200..U+F2FF.
4 Of course, it is certainly possible that some applications would interpret the same PUA code points as those same presentation forms. We have not tested any applications, such as localised versions of Word 97, in this regard.
5 The Guttman Hebrew-script fonts that are included with the multilingual proofing-tools add-on for Office XP do not use any PUA code points, however.
6 Note, by the way, that the EUDC ranges assumed by Windows for Far East code pages are recorded in the Windows Registry at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRange. See EUDC Code Ranges in MSDN Library for more information.

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.