NRSI: Computers & Writing Systems
A strategy for deprecating SIL PUA characters
SIL PUA Pages
Some of the character assignments made by the SIL PUA committee are temporary — intended to be used only until such time as the character is accepted into the Unicode standard. As part of such acceptance process, however, official (non-PUA) codepoints are assigned. At this point the original PUA character assignments should be deprecated and a strategy effected to transition to using the new codepoints.
It is our perspective that we should never render someone's data as useless. Totally removing a character from the PUA, or even putting a box around it, would make the data impossible to publish or print without going back to find a copy of the font that has the character as it was originally intended.
This paper proposes elements of the strategy to transition to using the new codepoints without losing data that used the PUA codepoints.
This document has been reviewed by the SIL PUA committee and the Non-Roman Field Advisory Board.
Background and rationale
At present there are over 250 characters assigned to the SIL corporate PUA area of Unicode. The Doulos SIL font, along with various keyboards and mapping tables, implement support for these PUA characters. However, included in our PUA assignments are about 195 characters that are now in Unicode and about 13 characters that are in the pipeline for Unicode and are expected to be part of the Unicode 5.1 release. Since Unicode 4.1 we have had a mix of encodings: existing data (and probably some new data generated by existing projects) will use the PUA codepoints, while other projects will begin to generate data with the new codepoints.
We desire tools and procedures that facilitate and encourage the transition to using the official codepoints for such characters without unduly burdening existing projects for which it might be inappropriate to make this transition.
The following components are affected by PUA character deprecation:
PUA Codepoint reuse?
An assumption made in creating this document is that PUA codepoints will never be reused. As in the Unicode standard itself, once a character is assigned as SIL PUA codepoint, that assignment is never changed or altered. We may deprecate the codepoint, but we won't ever assume that all data that may use the codepoint has now disappeared and that, therefore, it is safe to reuse the codepoint for some other character. One of the benefits of Unicode is that archived data can be understood independently of external metadata such as specific fonts or file dates.
SIL PUA Version Identification
Before digging into the details of what has to change when deprecating PUA characters, we should address the issue of version identification. We can expect that with each new version of Unicode, we may have another batch of PUA characters that should, from thence forward, be deprecated. With Unicode 4.1 140+ characters were deprecated. Another 45 characters were deprecated with the release of Unicode 5.0. The next release of Unicode may cause, say, another 13 to be deprecated, and the following release maybe another bunch. When looking at any given release of a component (font, keyboard, etc.), how is the user to know which version of our PUA the component is designed for?
We need a standardized way to identify the PUA version for any particular component. The most obvious moniker to consider is probably the Unicode version. So the earliest PUA assignment collection could be identified as 4.0 (or possibly 4.0.1), and with the subsequent release of Unicode 4.1 we could call that PUA 4.1. A user would then be able to know the difference between a keyboard "Designed for SIL PUA 4.0" and one "Designed for SIL PUA 4.1".
In between Unicode releases, additional assignments can be made into the PUA area, but no assignments should ever be eliminated (in theory at least). Though it is not expected to be as important to keep track of the these additions as it is to keep track of deprecations, we nonetheless recommend use of a suffix letter to identify these additions. Thus with the additions accepted after 4.1, the version would then be identified as "4.1a", with subsequent additions identified as "4.1b", etc.
We have recently added a worksheet (DeprecatedAssignments) to SIL's PUA UCD (Unicode Character Database) spreadsheet (UCD-SIL_PUA_<date>.xls. This worksheet shows the history of every SIL PUA character: what PUA version it was added and what version, if any, it was deprecated. This will help implementers determine which PUA versions a specific keyboard, font, or mapping table supports. The spreadsheet can be downloaded here: SIL Corporate PUA Character Assignments
When updating a font to account for PUA character deprecation, the obvious question is: Should the updated font include both the new and the old codepoints or just the new? If both, should both codepoints map to the same glyph or to different (though perhaps identical looking) glyphs?
If the updated font is going to have the same font name (e.g., "Doulos SIL"), then it won't be possible for users to have both the old font and the new one installed simultaneously. To maximize compatibility, therefore (i.e., in order to support both new and old encodings on the same system), the new font must include both the new and old codepoints.
This leaves us with the question of glyph names: Should the two codepoints point to the same glyph (this is called double-encoding the glyph) or to different glyphs? PUA deprecation is not the only reason one might be tempted to double encode a glyph — U+0041 LATIN CAPITAL LETTER A and U+0410 CYRILLIC CAPITAL LETTER A look identical and we might consider using the same glyph for these characters. As a general rule, we avoid double-encoding glyphs primarily for the reason that there exist some rare situations where it becomes helpful to be able to reconstruct the original character sequence from the rendered glyph sequence. As this is accomplished by inspecting the names of the glyphs, it is normally important that distinct codepoints refer to distinct glyphs.
In the case of deprecated PUA codepoints, however, there is a benefit to double encoding: it would provide a type of normalization in that it wouldn't matter which codepoint was originally used; when the character sequence is deduced from the glyph sequence, then only a single (preferably official) codepoint is generated.
Font implementations based on smart-font technologies such as Graphite may implement a font feature that draws attention (e.g., by an enclosing box) to PUA characters that have been deprecated. If this is done, the feature should be off by default.
Recommendation: We therefore recommend that the following changes occur to the font after PUA codepoint deprecation1:
Any keyboards that allow typing of the deprecated PUA codepoints will need to be upgraded. Because systems can support multiple keyboards simultaneously there isn't the same kind of need as there is with fonts; i.e., it isn't necessary that one keyboard support both encodings.
However, users will need to be aware of the encoding being used in any given document. It will be quite easy to end up with mixed encodings — some use of a now-deprecated PUA character along with use of the now-official codepoint. This, in turn, could cause confusing results search results or word lists, etc.
Recommendation: Keyboards that generate SIL PUA codepoints should be named in ways that identify the version of PUA being generated. Authors should make sure it is feasible to have old and new keyboards installed simultaneously.
Any existing data conversion mappings (e.g., the TECkit mappings for SIL legacy encodings like IPA93) that reference newly deprecated PUA characters will need to be updated. As in the case of keyboards, multiple encoding mappings can be installed simultaneously on the same system, so it will be important for file and mapping names to reflect the PUA version they implement.
However there are two further considerations for mapping tables: PUA version folding and transliteration.
PUA version folding
Consider a mapping rule such as the following from the current IPA93 mapping:
0xCB <> U+F181 ; superscript nya
In Unicode 4.1, this PUA character was allocated to U+1DAE. That means an obvious change to the mapping:
0xCB <> U+1DAE ; superscript nya
But there is a further step we can take. When this mapping is used to convert Unicode to legacy, it should, for maximum benefit, handle either of the possible Unicode codepoints. The above should therefore be changed to:
0xCB <> U+1DAE ; superscript nya 0xCB < U+F181 ; former PUA assignment
(The general name for this kind of mapping construction is "folding" — it folds two Unicode codepoints to a single legacy codepoint, thus eliminating the distinction in the Unicode space. Because this folding is due to PUA versioning, we have called it PUA version folding.)
PUA version transliteration
What should users do with existing documents that use the PUA codes? To facilitate users transitioning to the official encodings, we should supply transliterator mappings that convert between the various versions of our PUA. For example, the change to Unicode 4.1 might be implemented in a TECkit transliterator such as:
LHSName 'SIL-PUA-4.0' RHSName 'SIL-PUA-4.1' pass(Unicode) ... U+F181 <> U+1DAE ; superscript nya
Recommendation: When PUA codes are deprecated, mappings that refer to those codes should be updated not only to reference the official codepoints, but to fold deprecated PUA codes when doing the conversion back to legacy. Updated mappings should be designed to coexist with the older mappings. Further, a transliterator should be created to facilitate conversion of existing data between the PUA and official encoding. For this transliterator, the LHS should be the old PUA version, while the RHS be the new version (so that the default "forward" use of the transliterator updates data to the new version). Finally, in the forward direction, the transliterator should transform all characters deprecated in any previous PUA version (e.g., the 4.1 <> 4.2 transliterator should incorporate rules that transform 4.0 > 4.2).
Software implementers are encouraged to facilitate consistent use of PUA characters within a project by issuing warnings when inconsistent data is detected. This may require software to keep track of the PUA version in use for a given project.
Migrating a project to a new PUA version involves updating several components (keyboards, mapping tables, etc.) simultaneously as well as converting existing project documents. The coordination required is even greater when multiple computers, possibly in several locations, are involved. Because of the complexities involved, software should not automatically upgrade project data to new versions of the PUA.
Training and User Support
Imagine a user who switched to Unicode six months or a year ago. He has been entering data which includes a handful of PUA characters, but he doesn’t know that or doesn’t really understand what it means. He now buys a new laptop, and his computer department puts the latest fonts and keyboard on it. He uses the laptop when on the road, and his year-old desktop when at home. (Or he uses the new machine, and a partner uses the old one.) Soon his data has two mixed encodings, but he is not aware of it. But when he tries to create a word list, or a character frequency count, a spell check, etc., the dual encoding is going to skew the results.
Defensive actions that can be taken include:
2007-09-13 LP & BH: updated with NR FAB changes
2006-10-05 LP & BH: Page created