NRSI: Computers & Writing Systems
Use of the Unicode Private Use Areas by Others
It is not always easy to find out how software vendors have used the Unicode Private Use Areas (PUA). Vendor use of PUA code points is not necessarily a concern for end users who have their own PUA character needs, provided that software products do not assume semantics for PUA code points that affect text processing in ways that conflict with those needs. Some products are known to have done exactly this, however.
Therefore, we want to document what we do know about use of the PUA by major software vendors and others.
“Adobe's PUA usage (fortunately deprecated now) involves U+F634 .. U+F8FE.” [Eric Muller, Unicode List, 2003-4-24]
According to Adobe's specification for glyph naming conventions,1 certain PostScript glyph names are associated with PUA code points. Software that processes textual data in terms of PostScript glyph names may infer these PUA code points. For instance, suppose a PDF document contains a glyph reference to tsuperior, the text is displayed in a PDF reader such as Adobe Acrobat, and the user selects that text and copies it to the clipboard. If the software uses the associations specified in the Adobe Glyph List, then the code point copied to the clipboard for the glyph tsuperior will be U+F6F3.
I've documented the PUA code points associated with PostScript glyph names on another page: PUA use in the Adobe Glyph List.
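The clipboard scenario above can be sketched in a few lines. This is a minimal illustration, not Adobe's implementation: the table `GLYPH_TO_PUA` and the helpers `glyph_names_to_text` and `in_adobe_pua_range` are hypothetical names, and the table holds only the single tsuperior entry cited above rather than the full Adobe Glyph List.

```python
# Hypothetical excerpt of a glyph-name table; only the entry cited in the
# text is included. The real Adobe Glyph List is far larger.
GLYPH_TO_PUA = {
    "tsuperior": 0xF6F3,
}

# Adobe's (deprecated) PUA usage as quoted above: U+F634..U+F8FE.
ADOBE_PUA_FIRST = 0xF634
ADOBE_PUA_LAST = 0xF8FE


def glyph_names_to_text(glyph_names):
    """Turn a sequence of glyph names into a string, as a PDF reader that
    relies on glyph-name associations might do when text is copied."""
    chars = []
    for name in glyph_names:
        code_point = GLYPH_TO_PUA.get(name)
        if code_point is None:
            # Unknown name: emit U+FFFD rather than guess a mapping.
            code_point = 0xFFFD
        chars.append(chr(code_point))
    return "".join(chars)


def in_adobe_pua_range(ch):
    """True if the character lies in the Adobe range quoted above."""
    return ADOBE_PUA_FIRST <= ord(ch) <= ADOBE_PUA_LAST


copied = glyph_names_to_text(["tsuperior"])
print(f"U+{ord(copied):04X}", in_adobe_pua_range(copied))
```

The point of the sketch is that the receiving application sees only the PUA character; nothing in the copied text records that it originated from a glyph name.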
Apple
Apple maintains online documentation of their PUA usage at Apple's PUA usage. As of 2002-12-19, they had made use of the following ranges:
ConScript Unicode Registry
“IBM codepage (not CCSID) 1449 assigns IBM-specific characters to U+F83D..U+F8FF (the last 195 BMP-PUA code points)” [Markus Scherer, Unicode List, 2003-4-24]
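The count given in the quotation can be checked with simple arithmetic. The constant names below are illustrative only; the BMP Private Use Area bounds (U+E000..U+F8FF) are those defined by Unicode.

```python
# Bounds of the Private Use Area in the Basic Multilingual Plane.
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF

# Range quoted above for IBM codepage 1449.
IBM_FIRST, IBM_LAST = 0xF83D, 0xF8FF

# Inclusive range, hence the +1.
ibm_count = IBM_LAST - IBM_FIRST + 1

# The range is "the last N" BMP-PUA code points only if it ends where
# the BMP PUA ends.
ends_at_pua_end = IBM_LAST == BMP_PUA_LAST

print(ibm_count, ends_at_pua_end)
```

This confirms that the quoted range covers 195 code points and runs to the end of the BMP PUA.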
Microsoft has made use of PUA code points in three ways: for symbol fonts, for presentation-form glyphs in localised implementations involving complex scripts such as Thai and Arabic, and for supporting “end-user-defined character” (EUDC) ranges in Far East character sets.
Symbol fonts are handled in special ways by Windows GDI such that, in theory, a symbol font could have glyphs encoded anywhere in the Basic Multilingual Plane of Unicode. In practice, though, symbol fonts generally use the PUA range U+F020..U+F0FF.3
While we are aware of software that applies special-case handling to text formatted with a symbol font, we are not aware of any software that applies special-case handling to all text encoded in the range U+F020..U+F0FF.
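A scan for characters in this range might look like the following sketch. The helper names are hypothetical. The subtraction of 0xF000 to recover a single-byte character code reflects a common convention for Windows symbol fonts and should be read as an assumption, since it is not stated in the text above.

```python
SYMBOL_PUA_FIRST = 0xF020
SYMBOL_PUA_LAST = 0xF0FF


def symbol_range_positions(text):
    """Return (index, code point) pairs for characters in U+F020..U+F0FF."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if SYMBOL_PUA_FIRST <= ord(ch) <= SYMBOL_PUA_LAST
    ]


def to_symbol_byte(code_point):
    """Assumed convention: U+F0xx corresponds to the byte value 0xxx."""
    if not SYMBOL_PUA_FIRST <= code_point <= SYMBOL_PUA_LAST:
        raise ValueError("code point is outside U+F020..U+F0FF")
    return code_point - 0xF000


sample = "A\uf041B"
hits = symbol_range_positions(sample)
print(hits, [hex(to_symbol_byte(cp)) for _, cp in hits])
```

Such a scan only locates the characters. Whether they were intended as symbol-font text cannot be determined from the code points alone, which is consistent with the observation that software special-cases the font rather than the range.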
Presentation forms for complex scripts
In localised versions of Windows 95 for Arabic, Hebrew and Thai, PUA code points were used for presentation-form glyphs (positional variants of diacritics, contextual variants, and composite base-diacritic forms). Windows GDI would receive calls from TextOut and similar programming interfaces, and transform the string to use PUA code points for whatever presentation forms were needed for the given script. This was invisible to the application, and so did not imply that applications would handle these PUA code points in any special way.4
These implementations were carried forward into later versions of Windows and, I believe, were eventually incorporated into the initial versions of the shaping engines for these scripts in the Uniscribe processor (usp10.dll). Microsoft has since rewritten these shaping engines, and so newer versions of Uniscribe may not use these PUA ranges in the same way (unless the early implementations have been kept for backward compatibility with certain existing fonts).
At least some of the Arabic fonts that ship with Windows XP (e.g., Arabic Transparent) use these PUA ranges: U+E816, U+E818, U+E820..U+E82D, U+E832..U+E836.
At least some of the Hebrew fonts that ship with Windows XP (e.g., Miriam) use the PUA range U+E802..U+E805.5
The “UPC” Thai fonts that ship with Windows XP use the PUA range U+F700..U+F71D.
The core fonts that ship with Windows XP use some of these same PUA code points in these same ways: U+E801..U+E805 are used for Hebrew presentation forms, and U+E818 is used for an Arabic presentation form. Some of the core fonts (e.g., Tahoma, but not Arial or Times New Roman) use U+F700..U+F71D for Thai presentation forms.
At least some of the core fonts also use additional PUA code points for other presentation forms. In Tahoma, for instance, U+F001 and U+F002 are used for fi and fl ligatures, and U+F004..U+F031 appear to be used for positional variants of various Latin combining marks. Not all of these code points are used in Arial and Times New Roman, however. In fact, Arial uses U+F00A..U+F00E for glyphs that appear to be intended for drawing carets (text insertion-point icons) in various text-direction contexts.
It should be noted that some recent Microsoft products will handle some of the code points in these PUA ranges in special ways in certain situations. For a detailed examination, see Handling of PUA Characters in Microsoft Software.
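Because several of the ranges described on this page overlap, a simple lookup makes the collisions easy to see. The following is a minimal sketch: the table `DOCUMENTED_PUA_USES` and the function `documented_uses` are hypothetical names, and the table is a partial transcription of ranges mentioned above, not a complete or authoritative registry.

```python
# Partial, illustrative table of (first, last, description) entries
# transcribed from the ranges discussed on this page.
DOCUMENTED_PUA_USES = [
    (0xF634, 0xF8FE, "Adobe glyph names (deprecated)"),
    (0xF83D, 0xF8FF, "IBM codepage 1449"),
    (0xF020, 0xF0FF, "Microsoft symbol fonts"),
    (0xE801, 0xE805, "Windows XP core fonts: Hebrew presentation forms"),
    (0xF700, 0xF71D, "Windows XP fonts: Thai presentation forms"),
    (0xF004, 0xF031, "Tahoma: Latin combining-mark variants"),
]


def documented_uses(code_point):
    """Return the descriptions of every listed range containing code_point."""
    return [
        label
        for first, last, label in DOCUMENTED_PUA_USES
        if first <= code_point <= last
    ]


for cp in (0xF700, 0xF8FF, 0xE000):
    print(f"U+{cp:04X}", documented_uses(cp))
```

For example, U+F700 falls inside both the Adobe range and the Thai presentation-form range, which illustrates why software should not assume fixed semantics for any PUA code point.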
EUDC ranges in Far East character sets have been supported in various Microsoft products using PUA ranges as follows:
Microsoft use of PUA for EUDC support
MirBSD operating system and other MirOS Project software
The MirBSD operating system and other MirOS Project software
Other information related to private-use characters for Far-East implementations
The following links have been suggested to me for inclusion. They discuss Far East character set encodings as implemented in various systems, but I have not looked into what they imply about assumptions that software products make regarding the semantics of Unicode PUA code points.
Code Set Conversion between SJIS and eucJP: This site discusses variations in different implementations of Shift-JIS and Japanese EUC, with some mention of vendor-defined and user-defined characters. The same authors have provided another page, Problems and Solutions for Unicode and User/Vendor Defined Characters, that discusses the PUA in relation to Japanese legacy character set encodings in greater detail.