Computers & Writing Systems
SIL Corporate PUA Strategy (for Use of the Unicode Private Use Area and Surrogates)
For as long as we have used 8-bit encodings for minority-language text data, it has been necessary to devise non-standard encodings to represent that data. This has always presented challenges when using hardware and software that were designed to work with a limited number of standard encodings. Unicode represents an important advance for us: as a 16-bit encoding, it can accommodate a very large character set that is potentially adequate for encoding all minority-language data.
Unicode has certain limitations in this respect, however. First, the Unicode Standard is not yet complete, and development of the Standard is a slow process. A number of minority-language scripts have not yet been given standard allocations within Unicode, and it will take many years before some of them are incorporated. In the meantime, people will need to work with data in those scripts. Second, for various reasons there are some characters that our users may need to encode but that are unlikely ever to be included in Unicode. Hence, we anticipate the need to encode characters that are not included in Unicode.
Fortunately, the Unicode Standard provides a resource for meeting such needs. A range of codepoints within Unicode has been set aside for the particular needs of end users and corporations. This range is known as the Private Use Area (PUA). The Standard specifies that codepoints within the PUA will never be given any standardised definitions.
The PUA is, therefore, a limited Corporate resource. In order to manage that limited resource, the strategy described here will be adopted for use within SIL.
In order to explain this strategy fully, it is necessary to explain some technical background briefly. The Unicode Standard is a 16-bit standard developed and maintained by the Unicode Consortium. It bears a well-defined relationship to a distinct ISO standard, known as ISO 10646. The ISO standard is a 32-bit standard and so can accommodate a much larger character set than can be defined by using only 16 bits. The total range of codepoints within ISO 10646 has been sub-divided into planes of 64K codepoints each; thus each plane can be represented in terms of 16 bits.
Plane 0, the first 64K codepoints in ISO 10646, is known as the Basic Multilingual Plane (BMP) and is identical to the Unicode Standard. The Unicode Standard also includes a range of codepoints that can be used to extend Unicode beyond the BMP to include planes 1–16 of ISO 10646, an additional 1,048,576 codepoints. This special range of codepoints in Unicode is known as the Surrogates Area. Pairs of codepoints from this range, known as surrogate pairs, are used to encode characters from planes 1–16 in terms of the 16-bit values of Unicode.
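The surrogate mechanism can be illustrated with a short sketch. It assumes the standard UTF-16 arithmetic (high surrogates U+D800 to U+DBFF, low surrogates U+DC00 to U+DFFF); the function names are hypothetical and chosen only for this illustration.

```python
from __future__ import annotations

HIGH_BASE = 0xD800   # first high (leading) surrogate
LOW_BASE = 0xDC00    # first low (trailing) surrogate
FIRST_SUPPLEMENTARY = 0x10000  # first codepoint of plane 1
LAST_SUPPLEMENTARY = 0x10FFFF  # last codepoint of plane 16


def to_surrogate_pair(scalar: int) -> tuple[int, int]:
    """Split a plane 1-16 scalar value into a (high, low) surrogate pair."""
    if not FIRST_SUPPLEMENTARY <= scalar <= LAST_SUPPLEMENTARY:
        raise ValueError(f"U+{scalar:04X} is not in planes 1-16")
    offset = scalar - FIRST_SUPPLEMENTARY      # 20-bit value
    high = HIGH_BASE + (offset >> 10)          # top 10 bits
    low = LOW_BASE + (offset & 0x3FF)          # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a scalar value."""
    return FIRST_SUPPLEMENTARY + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


# 16 planes of 64K codepoints each are reachable through surrogate pairs.
SUPPLEMENTARY_COUNT = 16 * 0x10000
```

For example, `to_surrogate_pair(0xF0000)` gives the pair for the first codepoint of plane 15, and `from_surrogate_pair` reverses the mapping.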
Planes 15 and 16 of ISO 10646 are reserved for private use, like the PUA of Unicode. These two planes contain a total of 131,072 codepoints. There is no question that this will be more than adequate for all of our private-use character assignments. We have not required or suggested that our software must be able to work with 32-bit encodings such as ISO 10646, but the surrogate pairs allowed for by Unicode will enable us to work with characters in planes 15 and 16 of ISO 10646 should we need to.
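As a rough check of the figures above, the sketch below derives the plane number of a codepoint and counts the codepoints in planes 15 and 16. The helper names are hypothetical. Note that the final two codepoints of every plane are noncharacters in Unicode, so slightly fewer than 131,072 are usable for character assignments.

```python
PLANE_SIZE = 0x10000          # 64K codepoints per plane
PRIVATE_USE_PLANES = (15, 16)


def plane_of(scalar: int) -> int:
    """Return the ISO 10646 / Unicode plane number of a scalar value."""
    return scalar >> 16


def is_private_use_plane(scalar: int) -> bool:
    """True if the scalar value lies in plane 15 or plane 16."""
    return plane_of(scalar) in PRIVATE_USE_PLANES


# Plane 15 spans U+F0000..U+FFFFF; plane 16 spans U+100000..U+10FFFF.
PRIVATE_PLANE_TOTAL = len(PRIVATE_USE_PLANES) * PLANE_SIZE
```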
With that background, it is now possible to succinctly describe the strategy for use of the PUA:
By dividing the PUA into Corporate-wide and entity sub-areas, a balance can be maintained between characters that are useful throughout the Corporation and those that would likely be needed only by a particular entity. Where a particular language project requires a large number of PUA characters, or where an entity works in a region in which many minority-language scripts are in use, it is the Entity Sub-Area (ESA) that will face the greater demand. Accordingly, the ESA is allocated 4,096 codepoints while the Corporate Sub-Area (CSA) contains 2,048.
It will be noted that the range U+F000 to U+F0FF has been excluded from both sub-areas. This is because Microsoft has chosen to make particular use of codepoints in this range within their software. Other major software developers may likewise utilise portions of the PUA; for example, Apple has documented their use of codepoints in the range U+F800 to U+F8FF, and Adobe has documented their use of codepoints in the range U+F600 to U+F7FF. Such use of the PUA by major software developers has the potential to create problems should we have character assignments that conflict with theirs. Given the importance of Microsoft software to the Corporation at present, we have chosen to avoid any risk of incompatibility between our data and their software by avoiding the particular range that Microsoft has chosen to use.
The use of the PUA by major software developers can and likely will change over time. The NRSI will continue to monitor the use of the PUA by major software developers in order to identify situations that may present concerns to us with regard to incompatibility. The Unicode Standard recommends that major software developers begin making any private assignments at the top end of the PUA (U+F8FF) and working down, and that end users do the opposite, so as to limit potential for incompatibility. This reduces the likelihood of a major developer making assignments in the portion of the PUA we have designated as the ESA.
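The division of the PUA can be sketched as a small lookup. The exact boundaries below are an assumption: they are inferred from the sizes given above (4,096 for the ESA, 2,048 for the CSA), from the excluded range U+F000 to U+F0FF, and from the expectation that the ESA sits at the low end of the PUA, away from the top-down assignments of major developers. The names `SUB_AREAS` and `classify` are hypothetical.

```python
from __future__ import annotations

# Assumed layout of the BMP Private Use Area (U+E000..U+F8FF).
# These ranges are inferred from the stated sizes, not quoted from the policy.
SUB_AREAS = [
    ("ESA", 0xE000, 0xEFFF),                   # Entity Sub-Area, 4,096 codepoints
    ("excluded (Microsoft)", 0xF000, 0xF0FF),  # avoided by both sub-areas
    ("CSA", 0xF100, 0xF8FF),                   # Corporate Sub-Area, 2,048 codepoints
]


def classify(codepoint: int) -> str | None:
    """Return the name of the sub-area containing codepoint, or None."""
    for name, first, last in SUB_AREAS:
        if first <= codepoint <= last:
            return name
    return None


def size(name: str) -> int:
    """Return the number of codepoints in the named sub-area."""
    for area, first, last in SUB_AREAS:
        if area == name:
            return last - first + 1
    raise KeyError(name)
```

Under this assumed layout the three ranges together account for all 6,400 codepoints of the PUA, which is consistent with the sizes stated in the text.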
It is assumed that entities will always report any assignments they make to the NRSI. This permits Corporate documentation of encodings, which is useful for archival purposes. It also allows the NRSI to assign a unique plane 15 or plane 16 codepoint to each of an entity’s private characters. This is useful for the following reason:
Since each entity can make independent use of the ESA, it is possible, and likely, that different entities will assign different characters to given codepoints. This does not present problems as long as some particular data is used only within the entity of origin. As soon as it is used outside of that entity, however, the interpretation of PUA codepoints becomes ambiguous. Having a unique and common encoding for all language data in terms of ISO private-use characters provides a means for taking data out of the context of the original entity without needing to ensure that it is always accompanied by a font and/or other encoding documentation in order to ensure that the data can be read. This will be useful for exchanging data with colleagues in other entities, for Corporate publications and for archiving.
To facilitate this use of ISO plane 15 or 16 private-use characters as a Corporate standard for all private-use characters within SIL, whenever an entity reports an ESA character assignment, NRSI will assign the character to a plane 15 or plane 16 codepoint and will report that codepoint to the entity in terms of both a 32-bit scalar value and a 16-bit surrogate pair. NRSI will also maintain conversion tables for that entity to facilitate conversion between the entity’s ESA-based encoding and a Corporate-wide encoding using surrogates.
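A conversion table of this kind might look like the following minimal sketch. The two assignments in `ESA_TO_CORPORATE` are invented purely for illustration and are not actual NRSI assignments; the function names are likewise hypothetical. The surrogate pair is obtained here by encoding the scalar value as big-endian UTF-16.

```python
import struct

# Hypothetical table for one entity: ESA codepoint -> plane 15 scalar value.
ESA_TO_CORPORATE = {
    0xE000: 0xF0000,
    0xE001: 0xF0001,
}
CORPORATE_TO_ESA = {scalar: esa for esa, scalar in ESA_TO_CORPORATE.items()}


def to_corporate(text: str) -> str:
    """Convert entity (ESA-based) text to the Corporate-wide encoding."""
    return text.translate(ESA_TO_CORPORATE)


def to_entity(text: str) -> str:
    """Convert Corporate-wide text back to the entity's ESA-based encoding."""
    return text.translate(CORPORATE_TO_ESA)


def report(esa_codepoint: int) -> str:
    """Describe an assignment as both a scalar value and a surrogate pair."""
    scalar = ESA_TO_CORPORATE[esa_codepoint]
    # UTF-16 encodes a supplementary-plane character as two 16-bit units.
    high, low = struct.unpack(">HH", chr(scalar).encode("utf-16-be"))
    return f"U+{esa_codepoint:04X} -> U+{scalar:05X} (surrogates {high:04X} {low:04X})"
```

Characters that are not in the table pass through `to_corporate` and `to_entity` unchanged, so ordinary Unicode text mixed with private-use characters survives a round trip.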
For further background regarding this strategy, see Hosken 1998, which represents a draft proposal for this strategy.