NRSI: Computers & Writing Systems
Transitioning an organization to Unicode
This article is intended to help executives and administrators understand what Unicode is and plan for transitioning their organization to it. Adapted from an article for SIL International Executives, it is relevant to organizations that work in linguistics or minority languages, or that otherwise have a legacy of using custom fonts for their work.
Unicode offers a solution to a basic problem that mother-tongue speakers, linguists, and others working in many minority languages have had to wrestle with for many years. Because these languages use characters that fell outside any national or international standard, there was no agreement on how to represent data from them in a computer.
The basic goal of Unicode is to solve this problem by establishing a world-wide character encoding standard which assigns a unique code number to every character of every writing system in the world.
Computer characters need unique codes
A computer displays text on the screen as letters, numbers, and other
Before Unicode there weren’t enough character codes available
Prior to the introduction of the Unicode standard, most character sets were limited to 256 characters. Because such a small number of characters could not meet the needs of every language, different character sets were defined for different languages. Some character sets were defined by computer companies, others by national or international standards bodies.
When computers were given the ability to switch between character sets, the character sets became known as code pages. Each code page was assigned a number. For example,
Although computers could switch between code pages, few users knew how to do so, and in any case, they could not use two different code pages at the same time.
The shortage of codes created problems
This inconsistent encoding meant that systems conflicted with one another. That is, two code pages could use the same code for two different characters, or different codes for the same character.
Any given computer (especially servers) needed to support many different encodings; yet whenever data created with one code page was read on a computer using a different code page, that data ran the risk of being corrupted.
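The corruption risk described above can be shown in a few lines. The following is a minimal sketch in Python, not part of the original article; the three code pages were picked only for illustration. It reads one stored byte under each code page and reports which character the computer would take it to be.

```python
import unicodedata

# One stored byte, read under three different legacy code pages.
raw = bytes([0xE9])

as_western = raw.decode("cp1252")   # Windows Western European
as_cyrillic = raw.decode("cp1251")  # Windows Cyrillic
as_dos = raw.decode("cp437")        # original IBM PC code page

for label, ch in [("cp1252", as_western), ("cp1251", as_cyrillic), ("cp437", as_dos)]:
    # Print the Unicode code point and name rather than the glyph itself,
    # so the output does not depend on the terminal's own encoding.
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte never changes, yet it comes out as a Latin, a Cyrillic, or a Greek letter depending on which code page the reading computer assumes.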
Additionally, in most parts of the world, minority languages were overlooked by software developers and standards bodies. Many of the characters those languages needed were left out of the standards. (In this regard, linguists recording data with a phonetic alphabet were a minority language community!) This posed a problem for minority-language speakers and language-development teams who needed to write these languages on their computers.
Custom fonts created even more problems
At that time, the only solution for minority languages was to create custom fonts, often called hacked fonts. The resulting fonts contained the special characters needed to represent the language, but they no longer conformed to any standard, creating a new series of problems. For example, a given character code might represent a punctuation character in the standard character set, but an alphabetic character in the custom font. The result was that the computer continued to treat it as a punctuation character, even though the font displayed it as a letter.
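A small sketch of the punctuation problem, assuming a hypothetical hacked font that draws the letter eng in the slot normally occupied by "#". The font, the sample word, and the helper name `split_words` are all invented for illustration. Software that splits text into words sees only the stored code, not the glyph the font draws.

```python
import re

# Hypothetical hacked font: the code for "#" is displayed as the letter eng.
legacy_word = "si#a"          # what is actually stored when the custom font is used
unicode_word = "si\u014ba"    # the same word using U+014B LATIN SMALL LETTER ENG

def split_words(text):
    """Return the runs of word characters, as a simple word-breaker would."""
    return re.findall(r"\w+", text)

legacy_pieces = split_words(legacy_word)    # the word is cut in two at "#"
unicode_pieces = split_words(unicode_word)  # the word stays whole

print(len(legacy_pieces), len(unicode_pieces))
print("#".isalpha(), "\u014b".isalpha())
```

With the custom font the word is broken in two because the stored code is still punctuation. With the Unicode character it is treated as a single word, because the code itself is known to be a letter.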
Another consequence of using custom fonts is that any documents based on them are meaningless without access to those fonts.
Moreover, there is a continual danger that changes in computer operating systems and computer programs will break or corrupt a custom font. This actually happened a few years ago when Microsoft decided to add the Euro symbol to all of their fonts at the character code 128. The result was that the special characters that had been put at 128 in many custom fonts were no longer displayed correctly.
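The clash at code 128 can still be observed today. The sketch below is an illustration only and uses Python's built-in `cp1252` and `latin-1` codecs. It shows that the Windows Western European code page maps the Euro sign to code 128, while ISO 8859-1 treats that code as a control character.

```python
import unicodedata

euro = "\u20ac"  # EURO SIGN

# In Windows code page 1252 the Euro sign occupies code 128 (0x80).
legacy_byte = euro.encode("cp1252")
print(legacy_byte[0])

# ISO 8859-1 (Latin-1), by contrast, treats the same code as a control character.
latin1_char = bytes([128]).decode("latin-1")
print(unicodedata.category(latin1_char))
```

Any custom font that had placed a special character at code 128 was competing with the Euro sign for that one slot.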
Unicode will provide codes for every character
“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” (The Unicode Standard, Version 4.0)
Unicode has space for over one million characters. In theory, every character in every writing system in the world can be assigned a Unicode number. In actual practice, the Unicode standard currently contains specifications for over 100,000 characters. This covers the major languages of the world and many minority languages. However, some minority languages are not yet covered. In some cases, it is because a handful of characters are missing. In others, entire scripts are missing.
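The "over one million" figure follows from how the Unicode code space is laid out: 17 planes of 65,536 code points each. A minimal Python check follows. The constant names are illustrative, and the letter eng is simply a sample character.

```python
import sys
import unicodedata

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

code_space = PLANES * CODE_POINTS_PER_PLANE
print(code_space)

# Python exposes the highest code point it supports; it matches the Unicode limit.
print(hex(sys.maxunicode))

# Every encoded character has one number and one name, whatever the font or platform.
eng = "\u014b"
print(f"U+{ord(eng):04X}", unicodedata.name(eng))
```

The number of characters actually assigned is far smaller than the code space and grows with each version of the standard.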
Unicode is growing to accommodate the needs of the world's language communities
Missing pieces are being added to Unicode on a regular basis. Unicode also has a place for custom characters used by an organization that have not yet been, or may never be, accepted into the official standard: the Private Use Area (PUA). Part of SIL's strategy has been to organize the characters in the PUA across the organization (see: Software SIL PUA Corporate Strategy for further details). SIL's PUA has been a temporary home to many minority language characters during the often lengthy Unicode approval process. SIL's importance to the process can be seen in that, of some 250 characters in SIL's PUA, over 80% have subsequently been officially added to Unicode.
At some point in the future, the work of encoding the world's writing systems will be completed and language projects all over the world will be able to complete the transition to a Unicode standard with Unicode-compliant fonts. This should eliminate all the problems associated with using non-standard fonts. Using consistent and predictable character codes means that individuals and organizations working in and/or doing research in any number of languages can easily share documents and language data. It also means that those documents will be more permanent, since any Unicode-compliant font will correctly represent the data.
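Because the PUA is meant as a temporary home, it can be useful to know which documents still contain PUA characters, so that they can be updated once official codes exist. The following is a minimal sketch, assuming Python's standard `unicodedata` module. The helper names and the sample string are hypothetical, and U+F200 is an arbitrary PUA code, not a real SIL assignment.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """True if the character lies in a Private Use Area (general category 'Co')."""
    return unicodedata.category(ch) == "Co"

def find_private_use(text: str):
    """List (position, code point) for every PUA character in the text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_private_use(ch)]

# Hypothetical sample: an ordinary word containing one PUA character.
sample = "ba\uf200la"
print(find_private_use(sample))
```

A report like this only locates the characters. Deciding what each one should become still depends on the organization's own PUA documentation.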
Adopting the Unicode standard for all documents and data will provide benefits in many areas.
You can properly display special characters, including minority language characters and non-Roman script texts, in numerous software applications. This was not universally possible with computers and fonts that used non-Unicode encodings.
“Unicode is marvelous. It makes it possible for phoneticians throughout the world to use all manner of phonetic symbols in their work and display them on computer screens in the certainty that they will not now be garbled or turned into wingdings (as once used to happen all too often). All alphabetic phonetic symbols officially recognized by the International Phonetic Association are now included in the Unicode Standard.”
—John Wells, President
Highly multilingual documents can be easily represented in Unicode. The old code-page based systems could not support more than one or two scripts at a time without resorting to hacked fonts.
Your documents and data can be shared with a world-wide audience. The information in the documents will be displayed properly on any computer in any part of the world that incorporates the Unicode standard. Web-based publishing in minority languages becomes practical.
Your documents can be read for many years to come, by any computer in any part of the world, as long as that computer incorporates the Unicode standard. That is to say, they will have constancy and permanence.
When you publish your documents (either in print or electronically), the documents can be processed and read by any computer in any part of the world, as long as that computer incorporates the Unicode standard.
Increasingly, legacy fonts, particularly custom-encoded legacy fonts, tend not to display reliably on new operating systems and software applications.
Adopting the Unicode standard for all documents and data will involve some costs.
Costs for hardware and software
You may need to upgrade computer equipment and update to the latest software application versions. This may be a decreasing cost since operating systems and software are often upgraded for other reasons such as security. Note that, although all Windows operating systems since Windows 2000 are Unicode compliant, some languages that have only recently been added to Unicode may only work properly with the latest version of Windows.
Costs in IT specialists' time
IT specialists may need training to acquire new skills such as
They will then need to
This may require a significant initial investment of time as well as ongoing effort by specialists. For example, they will need to work with the Unicode Technical Committee (UTC) to add new characters to Unicode. SIL entities and FOBAI organizations should contact NRSI about adding such characters to SIL's private use area as the first step in adding new characters and scripts to Unicode.
Costs in users' time
There will be an ongoing effort on the part of users as they convert their data and documents to be Unicode-compliant and learn to use new software.
Now or later?
You will eventually need to switch to Unicode for all your work. Now would be the best time to do this for most teams.
Organizations and individuals who use Windows 2000 or a later version of Windows (XP, Vista), Mac OS X, or Linux should consider switching all their work to Unicode. Fonts and applications are already available for most languages and scripts.
For more information on what is available, see: Software requirements for different levels of Unicode Support.
Certain projects might need to delay if they lack key Unicode-based components—for example, if a language project uses a non-Roman script for which a Unicode font is not yet available.
Large publishing projects may elect to publish first and then convert their data to Unicode for archiving. However, this carries some risk. Once the data is converted to Unicode, how will it be checked for accuracy? Will anyone be willing to proofread it again, even though it has already been published? It is dangerous to convert data and then put it on the shelf unchecked. A better strategy would be to convert the data to Unicode for publishing. The publication process will assure the accuracy of the conversion, and it can then be archived with confidence.
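An automated check can catch some conversion errors before a person proofreads the result, although it cannot replace proofreading. The following is a rough sketch in Python of converting custom-font data through a mapping table and flagging any code the table does not cover. The mapping, the function name `convert`, and the sample data are all hypothetical, and this is not the method of any particular conversion tool.

```python
# Hypothetical mapping for one custom font: legacy code -> Unicode character.
LEGACY_TO_UNICODE = {
    0x23: "\u014b",  # "#" slot was drawn as eng
    0x40: "\u0254",  # "@" slot was drawn as open o
}

def convert(legacy: bytes):
    """Convert legacy bytes to Unicode text and list the codes that need review."""
    out = []
    needs_review = []
    for pos, code in enumerate(legacy):
        if code in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[code])   # remapped custom character
        elif code < 0x80:
            out.append(chr(code))                 # plain ASCII passes through
        else:
            out.append("\ufffd")                  # unknown code: mark it visibly
            needs_review.append((pos, code))
    return "".join(out), needs_review

text, problems = convert(b"si#a")
print(len(text), problems)
```

An empty review list shows only that every code was covered by the table. Whether the table itself is right is what the proofreading step verifies.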
What to do
Download the Unicode Conversion Planning document. This document is intended for Unicode conversion planning for SIL and partner organizations. However, it is made available here as a template for establishing Unicode transition goals and objectives for other organizations as well.