NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/PUA_deprecation

A strategy for deprecating SIL PUA characters

SIL PUA Committee procedures

Bob Hallissy and Lorna Priest, 2007-11-07

Abstract

Some of the character assignments made by the SIL PUA committee are temporary: they are intended to be used only until the character is accepted into the Unicode standard. When a character is accepted, it is assigned an official (non-PUA) codepoint. At that point the original PUA assignment should be deprecated and a strategy put in place for moving to the new codepoint.

Our position is that we should never render someone's data useless. Removing a character from the PUA entirely, or even putting a box around it, would make the data impossible to publish or print without going back to find a copy of the font that has the character as it was originally intended.

This paper proposes elements of the strategy to transition to using the new codepoints without losing data that used the PUA codepoints.

Document status

This document has been reviewed by the SIL PUA committee and the Non-Roman Field Advisory Board.

Background and rationale

At present there are over 250 characters assigned to the SIL corporate PUA area of Unicode. The Doulos SIL font, along with various keyboards and mapping tables, implements support for these PUA characters. However, our PUA assignments include about 195 characters that are now in Unicode and about 13 characters that are in the pipeline for Unicode and are expected to be part of the Unicode 5.1 release. Since Unicode 4.1 we have had a mix of encodings: existing data (and probably some new data generated by existing projects) will use the PUA codepoints, while other projects will begin to generate data with the new codepoints.

We desire tools and procedures that facilitate and encourage the transition to using the official codepoints for such characters without unduly burdening existing projects for which it might be inappropriate to make this transition.

The following components are affected by PUA character deprecation:

  • fonts — obviously fonts have to be altered to accommodate the new codepoints. But should the old codepoints be removed? What other design implications are there?
  • keyboards — keyboards need to be modified to generate the new data.
  • conversion tools — existing conversion mappings, such as from legacy to Unicode, that happen to map to deprecated PUA codepoints will need to change. Should old mappings be maintained? How do people migrate existing data from PUA to official encoding?

PUA Codepoint reuse?

An assumption made in creating this document is that PUA codepoints will never be reused. As in the Unicode standard itself, once a character is assigned an SIL PUA codepoint, that assignment is never changed. We may deprecate the codepoint, but we will never assume that all data using the codepoint has disappeared and that it is therefore safe to reuse the codepoint for some other character. One of the benefits of Unicode is that archived data can be understood independently of external metadata such as specific fonts or file dates.

SIL PUA Version Identification

Before digging into the details of what has to change when deprecating PUA characters, we should address the issue of version identification. We can expect that each new version of Unicode may bring another batch of PUA characters that should be deprecated from then on. With Unicode 4.1, more than 140 characters were deprecated. Another 45 characters were deprecated with the release of Unicode 5.0. The next release of Unicode may cause, say, another 13 to be deprecated, and the release after that perhaps more. When looking at any given release of a component (font, keyboard, etc.), how is the user to know which version of our PUA the component is designed for?

We need a standardized way to identify the PUA version for any particular component. The most obvious moniker to consider is probably the Unicode version. So the earliest PUA assignment collection could be identified as 4.0 (or possibly 4.0.1), and with the subsequent release of Unicode 4.1 we could call that PUA 4.1. A user would then be able to know the difference between a keyboard "Designed for SIL PUA 4.0" and one "Designed for SIL PUA 4.1".

In between Unicode releases, additional assignments can be made into the PUA area, but no assignments should ever be eliminated (in theory at least). Though it is not expected to be as important to keep track of these additions as it is to keep track of deprecations, we nonetheless recommend use of a suffix letter to identify them. Thus with the additions accepted after 4.1, the version would be identified as "4.1a", with subsequent additions identified as "4.1b", etc.
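The labelling scheme described above can be sketched in code. The following Python is only an illustration, not part of any SIL tool: the names `PuaVersion`, `parse_pua_version` and `same_deprecation_set` are hypothetical, and the parser assumes labels of the form "major.minor" plus an optional lowercase suffix letter (so a three-part label such as "4.0.1" is deliberately not handled).

```python
import re
from typing import NamedTuple


class PuaVersion(NamedTuple):
    """Hypothetical representation of an SIL PUA version label such as '4.1a'."""
    major: int
    minor: int
    suffix: str  # "" for the base release, then "a", "b", ... for later additions


# Assumed label shape: digits, a dot, digits, then at most one lowercase letter.
_LABEL = re.compile(r"^(\d+)\.(\d+)([a-z]?)$")


def parse_pua_version(label: str) -> PuaVersion:
    """Split a label like '4.1a' into its Unicode base version and addition suffix."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a PUA version label: {label!r}")
    major, minor, suffix = match.groups()
    return PuaVersion(int(major), int(minor), suffix)


def same_deprecation_set(a: str, b: str) -> bool:
    """True when two labels share a Unicode base version, i.e. differ only by additions."""
    return parse_pua_version(a)[:2] == parse_pua_version(b)[:2]
```

Because the parsed labels are tuples, they sort in release order: "4.0" comes before "4.1", which comes before "4.1a" and "4.1b".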

Note

We have recently added a worksheet (DeprecatedAssignments) to SIL's PUA UCD (Unicode Character Database) spreadsheet (UCD-SIL_PUA_<date>.xls). This worksheet shows the history of every SIL PUA character: the PUA version in which it was added and the version, if any, in which it was deprecated. This will help implementers determine which PUA versions a specific keyboard, font, or mapping table supports. The spreadsheet can be downloaded here: SIL Corporate PUA Character Assignments

Fonts

When updating a font to account for PUA character deprecation, the obvious question is: Should the updated font include both the new and the old codepoints or just the new? If both, should both codepoints map to the same glyph or to different (though perhaps identical looking) glyphs?

If the updated font is going to have the same font name (e.g., "Doulos SIL"), then it won't be possible for users to have both the old font and the new one installed simultaneously. To maximize compatibility, therefore (i.e., in order to support both new and old encodings on the same system), the new font must include both the new and old codepoints.

This leaves us with the question of glyph names: Should the two codepoints point to the same glyph (this is called double-encoding the glyph) or to different glyphs? PUA deprecation is not the only reason one might be tempted to double-encode a glyph. For example, U+0041 LATIN CAPITAL LETTER A and U+0410 CYRILLIC CAPITAL LETTER A look identical, and we might consider using the same glyph for both. As a general rule we avoid double-encoding glyphs, primarily because in some rare situations it is helpful to be able to reconstruct the original character sequence from the rendered glyph sequence. As this is accomplished by inspecting the names of the glyphs, it is normally important that distinct codepoints refer to distinct glyphs.

In the case of deprecated PUA codepoints, however, there is a benefit to double encoding: it would provide a type of normalization in that it wouldn't matter which codepoint was originally used; when the character sequence is deduced from the glyph sequence, then only a single (preferably official) codepoint is generated.

Font implementations based on smart-font technologies such as Graphite may implement a font feature that draws attention (e.g., by an enclosing box) to PUA characters that have been deprecated. If this is done, the feature should be off by default.

Recommendation: We therefore recommend that the following changes occur to the font after PUA codepoint deprecation [1]:

  • The names of the glyphs for the deprecated characters should be changed to reflect the now-official Unicode codepoint.
  • The glyph should be double-encoded, at both the original PUA codepoint and the now-official codepoint [2].
  • If implemented, font features that draw attention to deprecated PUA characters should be off by default.
  • The font name should stay the same (in order to support both new and old encodings on the same system), though its internal version should be changed.
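The effect of renaming and double-encoding a glyph can be illustrated with a small Python sketch. It models a font's character map as a plain dictionary rather than using a real font library, and it assumes glyph names of the form `uniXXXX`; the functions `double_encode` and `chars_from_glyphs` are hypothetical names introduced only for this illustration. The codepoints U+F181 and U+1DAE are the superscript nya pair discussed in the Conversion section.

```python
# Hypothetical cmap before deprecation: codepoint -> glyph name.
# The "uniXXXX" naming pattern is an assumption made for this sketch.
cmap_before = {0xF181: "uniF181"}


def double_encode(cmap, pua, official):
    """Return a new cmap where both codepoints share one glyph named for the official codepoint."""
    glyph = f"uni{official:04X}"
    updated = dict(cmap)
    updated[pua] = glyph       # old PUA data still renders
    updated[official] = glyph  # new data uses the same glyph
    return updated


def chars_from_glyphs(glyph_names):
    """Deduce a character sequence from glyph names of the form uniXXXX."""
    return "".join(chr(int(name[3:], 16)) for name in glyph_names)


cmap_after = double_encode(cmap_before, pua=0xF181, official=0x1DAE)

# Text that mixes the deprecated and the official codepoint...
glyphs = [cmap_after[ord(ch)] for ch in "\uF181\u1DAE"]
# ...comes back as the official codepoint only once read from glyph names.
recovered = chars_from_glyphs(glyphs)
```

Here `recovered` contains U+1DAE twice, which is the normalization benefit described above: whichever codepoint was typed, the deduced character is the official one.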

Keyboards

Any keyboards that allow typing of the deprecated PUA codepoints will need to be upgraded. Because systems can support multiple keyboards simultaneously there isn't the same kind of need as there is with fonts; i.e., it isn't necessary that one keyboard support both encodings.

However, users will need to be aware of the encoding being used in any given document. It will be quite easy to end up with mixed encodings: some use of a now-deprecated PUA character along with use of the now-official codepoint. This, in turn, could cause confusing search results, word lists, etc.

Recommendation: Keyboards that generate SIL PUA codepoints should be named in ways that identify the version of PUA being generated. Authors should make sure it is feasible to have old and new keyboards installed simultaneously.

Conversion

Any existing data conversion mappings (e.g., the TECkit mappings for SIL legacy encodings like IPA93) that reference newly deprecated PUA characters will need to be updated. As in the case of keyboards, multiple encoding mappings can be installed simultaneously on the same system, so it will be important for file and mapping names to reflect the PUA version they implement.

However there are two further considerations for mapping tables: PUA version folding and transliteration.

PUA version folding

Consider a mapping rule such as the following from the current IPA93 mapping:

0xCB   <>   U+F181   ; superscript nya

In Unicode 4.1, this PUA character was allocated to U+1DAE. That means an obvious change to the mapping:

0xCB   <>   U+1DAE   ; superscript nya

But there is a further step we can take. When this mapping is used to convert Unicode to legacy, it should, for maximum benefit, handle either of the possible Unicode codepoints. The above should therefore be changed to:

0xCB   <>   U+1DAE   ; superscript nya
0xCB   <    U+F181   ; former PUA assignment

(The general name for this kind of mapping construction is "folding" — it folds two Unicode codepoints to a single legacy codepoint, thus eliminating the distinction in the Unicode space. Because this folding is due to PUA versioning, we have called it PUA version folding.)
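As a rough illustration of how such a folded mapping behaves in each direction, here is a minimal Python sketch. It is not TECkit and does not reproduce the IPA93 mapping; it contains only the single rule shown above, and the table and function names are assumptions made for the example.

```python
# Two-way rule (the "<>" line): legacy byte <-> official Unicode character.
BIDIRECTIONAL = {0xCB: "\u1DAE"}  # superscript nya

# One-way rule (the "<" line): deprecated PUA character -> legacy byte only.
FOLD_TO_LEGACY = {"\uF181": 0xCB}  # former PUA assignment

# Legacy -> Unicode uses only the two-way rules, so it never emits PUA codepoints.
TO_UNICODE = dict(BIDIRECTIONAL)

# Unicode -> legacy accepts both the official and the deprecated codepoint.
TO_LEGACY = {char: byte for byte, char in BIDIRECTIONAL.items()}
TO_LEGACY.update(FOLD_TO_LEGACY)


def legacy_to_unicode(data: bytes) -> str:
    """Convert legacy bytes to Unicode; raises KeyError for bytes this sketch does not map."""
    return "".join(TO_UNICODE[byte] for byte in data)


def unicode_to_legacy(text: str) -> bytes:
    """Convert Unicode to legacy bytes, folding deprecated PUA codepoints."""
    return bytes(TO_LEGACY[char] for char in text)
```

In this sketch `unicode_to_legacy("\uF181\u1DAE")` gives the byte 0xCB twice, while `legacy_to_unicode` of that result gives U+1DAE twice, so a round trip through the legacy encoding also removes the deprecated codepoint.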

PUA version transliteration

What should users do with existing documents that use the PUA codes? To facilitate users transitioning to the official encodings, we should supply transliterator mappings that convert between the various versions of our PUA. For example, the change to Unicode 4.1 might be implemented in a TECkit transliterator such as:

LHSName    'SIL-PUA-4.0'
RHSName    'SIL-PUA-4.1'

pass(Unicode)
...
U+F181   <>   U+1DAE   ; superscript nya

Recommendation: When PUA codes are deprecated, mappings that refer to those codes should be updated not only to reference the official codepoints, but also to fold deprecated PUA codes when converting back to legacy. Updated mappings should be designed to coexist with the older mappings. Further, a transliterator should be created to facilitate conversion of existing data between the PUA and official encoding. For this transliterator, the LHS should be the old PUA version and the RHS should be the new version (so that the default "forward" use of the transliterator updates data to the new version). Finally, in the forward direction, the transliterator should transform all characters deprecated in any previous PUA version (e.g., the 4.1 <> 4.2 transliterator should incorporate rules that transform 4.0 > 4.2).
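The cumulative forward behaviour recommended here can be sketched in Python using `str.translate`. This is an illustration under stated assumptions, not a replacement for a TECkit transliterator: the names `DEPRECATED_IN`, `RELEASE_ORDER`, `forward_table` and `upgrade` are hypothetical, and the tables hold only the one rule given in this document. Real tables would have to be filled in from the DeprecatedAssignments worksheet.

```python
# Deprecations introduced by each PUA version: old PUA codepoint -> official codepoint.
# Only the rule quoted in this document is listed; the rest is intentionally left empty.
DEPRECATED_IN = {
    "4.1": {0xF181: 0x1DAE},  # superscript nya
    "5.0": {},                # placeholder: no rules are reproduced in this sketch
}

# Versions in release order, oldest first.
RELEASE_ORDER = ["4.1", "5.0"]


def forward_table(target):
    """Collect every deprecation up to and including the target version."""
    table = {}
    for version in RELEASE_ORDER:
        table.update(DEPRECATED_IN[version])
        if version == target:
            return table
    raise ValueError(f"unknown PUA version: {target!r}")


def upgrade(text, target):
    """Forward direction: rewrite deprecated PUA codepoints to official ones."""
    return text.translate(forward_table(target))
```

Because `forward_table("5.0")` also includes the 4.1 rules, data that was never upgraded to 4.1 is still converted in a single step, which is the behaviour the recommendation asks for.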

Software

Software implementers are encouraged to facilitate consistent use of PUA characters within a project by issuing warnings when inconsistent data is detected. This may require software to keep track of the PUA version in use for a given project.
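A check of this kind might look like the following Python sketch. The function names are hypothetical and the table contains only the superscript nya pair mentioned in the Conversion section; an actual implementation would need the full list of deprecated assignments for the PUA version in use.

```python
# Deprecated PUA character -> its official replacement (single known pair only).
DEPRECATED_TO_OFFICIAL = {"\uF181": "\u1DAE"}


def deprecated_in_use(text):
    """Return the deprecated PUA characters that occur in the text, sorted."""
    return sorted(set(text) & set(DEPRECATED_TO_OFFICIAL))


def find_mixed_encodings(text):
    """Return (deprecated, official) pairs where both forms occur in the same text."""
    present = set(text)
    return [
        (old, new)
        for old, new in DEPRECATED_TO_OFFICIAL.items()
        if old in present and new in present
    ]


def warnings_for(text):
    """Build human-readable warnings; the data itself is left unchanged."""
    return [
        f"U+{ord(old):04X} and U+{ord(new):04X} are both used for the same character"
        for old, new in find_mixed_encodings(text)
    ]
```

The sketch only reports the inconsistency and leaves the text untouched.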

Migrating a project to a new PUA version involves updating several components (keyboards, mapping tables, etc.) simultaneously as well as converting existing project documents. The coordination required is even greater when multiple computers, possibly in several locations, are involved. Because of the complexities involved, software should not automatically upgrade project data to new versions of the PUA.

Training and User Support

Imagine a user who switched to Unicode six months or a year ago. He has been entering data which includes a handful of PUA characters, but he doesn't know that or doesn't really understand what it means. He now buys a new laptop, and his computer department puts the latest fonts and keyboards on it. He uses the laptop when on the road, and his year-old desktop when at home. (Or he uses the new machine, and a partner uses the old one.) Soon his data contains a mix of two encodings, but he is not aware of it. When he tries to create a word list, a character frequency count, a spell check, etc., the dual encoding is going to skew the results.

Defensive actions that can be taken include:

  • Users need to be informed of what PUA codepoints they are using, and what expectation exists for replacing them with standard codepoints.
  • Computer support people have to understand the problem. They need to keep track of what PUA characters are being used by people in their entity. When a new version of Unicode comes out, they need to be told what fonts, keyboards, and mapping tables have been upgraded at the corporate level. They need to know what they need to upgrade. Most important, they need to know how to work with the end user to make the necessary changes.
  • Training for computer support personnel needs to emphasize the importance of consistent encoding throughout an entire project, rather than routinely installing the newest versions on new machines.
  • To the extent that SIL software can alert the user and/or update the data, it will be very helpful. But the user needs to understand enough to know what the software is talking about.
  • Entities should be encouraged to include a long-term PUA strategy in their Unicode transition strategy. A long-term PUA strategy should include plans to become independent of the PUA whenever possible.

Page History

2007-09-13 LP & BH: updated with NR FAB changes

2006-10-05 LP & BH: Page created


[1] If a character was PUA-encoded during a beta phase of our Roman font development (that would include the "SIL IPA Unicode" and Doulos SIL fonts) but put into Unicode by the time of official release, we decided to not double-encode it. However, we were not entirely consistent in this policy. Those not double-encoded are: U+F177, U+F183..U+F18A, U+F18C..U+F194, U+F200..U+F207
[2] Graphite font implementers should note that, by default, Graphite's "auto-pseudo" mechanism causes it to treat double-mapped glyphs as distinct glyphs for the purposes of its processing. This mechanism can be turned off in the GDL (or coded around).

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.