NRSI: Computers & Writing Systems
Short URL: http://scripts.sil.org/UTConvertQ2
When to Convert to Unicode
Is Unicode ready for you?
Unicode contains a huge inventory of characters, currently over 100,000. But that doesn’t guarantee that it will have every character that you need. There has been a major effort over almost two decades to identify all the characters that need to be in Unicode, and most people will find that their data can be represented in Unicode with no problem. But if the language you work in happens to have one of those rare characters that hasn’t yet been added to Unicode, or worse, uses a whole script that isn’t yet included, then the first order of business is to request that it be added.
What do you do if there is no established orthography for the language? That, of course, gives you much greater flexibility. As long as you choose from among the characters already available in Unicode and follow standard conventions in how those characters are used, there should be no problem. For more details on this subject, see Orthography development in relation to Unicode.
But most people reading this article need to work with an established orthography, so you need to check whether Unicode will support it. Most major languages are already fully supported, but minority languages may not be. So, make a list of all the characters that you need to use. You will probably need help from a Unicode expert to know whether the characters are in Unicode, but you can get started on your own by listing what you need and then taking your list to the expert.
Here are some of the issues you should consider:
- Upper and lower case
- Borrowed words
- Punctuation and other symbols
- Phonetic transcription
- Other languages and scripts you may use
Inventorying the Character Set and Comparing it to Unicode:
- Include upper-case as well as lower-case. Some people think that if a character never occurs first in a word, they don’t need an upper-case version of it. Then, they run into a situation when they need to use all-caps. If your writing system regularly makes a distinction between upper and lower case (or any analogous difference in a character’s appearance that is not predictable from context), list an upper-case version of it.
- On the other hand, if the shape of a character changes predictably based on its immediate context, you do not need to list every variant shape as a separate character. For example, in Arabic-based scripts, letters change shape depending on whether they occur first, middle, or last in a word or stand alone. Unicode handles this by treating all the variant shapes as the same character and relying on smart fonts to give the right shape in each context. (See here for more information.)
- Include diacritics in your list. List all possible combinations of diacritics with base characters (again, remember to include upper-case) as well as combinations of two or more diacritics on the same letter. Unicode does provide ways to form arbitrary combinations of base characters and diacritics, but the most common combinations are also available pre-assembled as single characters. So, you’ll want to check whether these “pre-composed” characters are available for the combinations that you use. If they aren’t, then you need to make sure that the diacritic is included as a separate character that can be combined with other characters.
- Include any characters that only occur in borrowed words.
- Include punctuation characters and any other special symbols other than ordinary word-building characters. Almost certainly they will be included, but if you use anything unusual that isn’t in a major language, you should check this out carefully.
- Consider all languages that you work with, including those that you may only use occasionally. Major languages are already included, so that’s not a problem. But if you want to exchange data with people working in related languages, you may need characters for those other languages.
- Consider symbols you need for phonetic transcription. All symbols currently approved by the International Phonetic Association (IPA) are already included. However, if you use a different transcription system, such as Americanist phonetic characters or phonetic symbols that are only used in a particular part of the world or a certain language family, you need to check.
- If any language that you work with has more than one script, consider each script separately.
- Don’t plan to depend on formatting such as underlining, superscripting, or italics in order to represent your characters. For example, if you need to represent a superscript ʰ, don’t plan to use an ordinary h and just apply superscripting to it. Besides being clumsy to type, this method is unreliable. If all the formatting ever gets stripped off your text, then the distinction between h and ʰ will disappear. You will need to represent these two as separate characters in Unicode. (And, yes, Unicode does have a separate character for ʰ: U+02B0 MODIFIER LETTER SMALL H.)
- You don’t need to worry at this point about fine details of appearance, as long as the character in Unicode is recognizably the same character as the one you are using. For example, for the upper-case eng (Ŋ, U+014A), some languages prefer a shape that is just a larger version of the lower-case eng ŋ; others prefer one that looks like a regular upper-case N with a tail. These two shapes are considered “glyph variants” of the same character in Unicode and are represented the same way. You control which version you use through the fonts that you use.
- You do, however, have to pay attention to how a character is used. Unicode sometimes includes more than one character with the same appearance. For example, there is an apostrophe that is used for punctuation and a separate apostrophe that is used to represent glottalization. The two characters look alike, but one is a punctuation mark and the other is a word-building character. You might have thought of them as the “same” character until now, but they function differently in a writing system, and this makes a difference for functions like selecting whole words, breaking lines, searching and sorting. So, you can’t just pick a character out of a Unicode chart because it looks right; you have to read the description of each character to make sure you’ve got the right one. If you’ve been representing two characters the same way up until now, then in the process of conversion to Unicode, you will need to figure out some way to distinguish them. Besides the difference between punctuation marks and word-building characters, you also need to distinguish characters that are used as diacritics from ones with the same appearance that are full characters on their own.
- In general, Unicode itself is only concerned with the computer being able to recognize a character reliably. Matters of appearance, such as variant shapes of letters in different contexts, preferences about letter shapes such as “a” vs. “ɑ”, and fine positioning of diacritics are not distinguished in Unicode itself, because they are either predictable from context (and thus should be handled by smart fonts), or because they are a matter of personal preference (and thus should be handled as formatting, i.e., by choosing what font you want to use to display the data or choosing options within the font).
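Some of the checks in the list above can be tried out before you meet with an expert. The following is a minimal sketch using Python's standard `unicodedata` module; the function names `precomposed_form` and `describe` are illustrative, not part of any library. It assumes that NFC normalization is a good enough proxy for "a pre-composed character exists", which holds for most Latin combinations but not for the small set of characters that Unicode excludes from composition. Results also depend on the Unicode version bundled with your Python release.

```python
import unicodedata


def precomposed_form(base, mark):
    """Return the single pre-composed character for base + mark, or None.

    Assumption: NFC composition is used as a proxy for "pre-composed exists".
    """
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


def describe(ch):
    """Return the code point, general category, and name of one character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"


# e + combining acute has a pre-composed form; q + combining acute does not,
# so the diacritic must be stored as a separate combining character.
for base in ("e", "E", "q"):
    print(base, "+ U+0301 ->", precomposed_form(base, "\u0301"))

# Look-alike characters with different functions:
#   U+2019 is punctuation (category P*), U+02BC is a modifier letter (Lm),
#   U+0301 is a combining mark (Mn), U+00B4 is a standalone symbol (Sk).
for ch in ("\u2019", "\u02bc", "\u02b0", "\u0301", "\u00b4"):
    print(describe(ch))
```

The general category in the second column is what separates a punctuation mark from a word-building character, and a combining diacritic from a free-standing one, even when the shapes look identical.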
In listing out the characters you need, you may find it helpful to look at the inventory of characters in the fonts that you are currently using. In general, you should verify that all of them are in Unicode. (If you are using a widely used standard character set, such as the one in the standard Windows Latin fonts, or Big 5 for Chinese, then all those characters are in Unicode.)
Now, it could be there are characters in an old custom font that you never use. As long as you can guarantee that they don’t ever occur in your existing data, you don’t need to worry about them being in Unicode. But, if there is any doubt (for example, if they might have been typed by mistake), then it is best to plan to convert them to Unicode along with everything else.
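One way to find out which characters actually occur in a body of text is to count them. The sketch below assumes the text is already available as a Unicode string (for example, a sample retyped or trial-converted to Unicode); for data still in a legacy custom-font encoding, the code points reported would reflect that encoding rather than the intended characters. The names `inventory` and `report` are hypothetical helpers, not an existing tool.

```python
import unicodedata
from collections import Counter


def inventory(text):
    """Count each distinct non-whitespace character, sorted by code point."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.items(), key=lambda item: ord(item[0]))


def report(text):
    """Return one tab-separated line per character: code point, count, name."""
    lines = []
    for ch, count in inventory(text):
        name = unicodedata.name(ch, "<no name in this Unicode database>")
        lines.append(f"U+{ord(ch):04X}\t{count}\t{name}")
    return lines


# Illustrative sample only: capital eng, "a", and lower-case eng.
sample = "\u014aa\u014b \u014ba"
for line in report(sample):
    print(line)
```

Characters are counted as individual code points, so a base letter followed by a combining diacritic shows up as two separate entries. A character that appears only once or twice in a large corpus is a good candidate for a typing mistake worth checking by hand.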
After doing this inventory and consulting with a Unicode expert to make sure all the characters you need are in Unicode, what happens if some characters are missing? If so, you have three options:
- Decide to discontinue using the characters that are missing and use something else that is in Unicode. You can’t always do this, of course, but sometimes it is the best option.
- There may be a font available that includes the character you need in its “private use area”, a section of Unicode that is intended for local customization. Or, you may be able to arrange to have the characters you need added to the private use area of a font.
- Get the missing character(s) or scripts into Unicode.
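If you do end up using the private use area, it helps to know exactly which private-use code points your data contains, because their meaning depends entirely on the font and is not defined by Unicode. As a small sketch, Python's standard `unicodedata` module reports private-use characters with the general category `Co`; the helper name `private_use_chars` is an assumption made for this example.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct private-use characters in text, sorted by code point."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


# Illustrative sample: ordinary letters mixed with two private-use code points.
sample = "a\ue000b\uf8ffa\ue000"
for ch in private_use_chars(sample):
    print(f"U+{ord(ch):04X} is a private-use code point")
```

A list like this can be kept alongside the data so that anyone receiving the files knows which font-specific characters need attention if the characters are later added to Unicode.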
The last two options require consulting with a font designer and/or someone who is in regular contact with the Unicode Consortium. If you are in SIL, you may contact the NRSI for help in making Unicode proposals. Those outside SIL may find help through the Script Encoding Initiative.
© 2003-2018 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.