Transitioning an organization to Unicode

A guide for administrators

Steve Saylor, Kent Spielmann (Unicode Transition Initiative), 2007-01-11

This article is intended to help executives and administrators understand what Unicode is and plan their organization’s transition to it. Adapted from an article for SIL International executives, it is relevant to organizations that work in linguistics or minority languages, or that otherwise have a legacy of custom fonts.

Unicode offers a solution to a basic problem that mother-tongue speakers, linguists, and others working in many minority languages have wrestled with for years. Because these languages contained characters outside every national and international standard, there was no agreed way to represent text in them on a computer.

The basic goal of Unicode is to solve this problem by establishing a world-wide character encoding standard which assigns a unique code number to every character of every writing system in the world.

The problem

Computer characters need unique codes

A computer displays text on the screen as letters, numbers, and other characters. In the computer’s memory each character is represented by a unique numerical code called a character code or codepoint. An entire set of character codes, along with the characters they represent and the relationship between them, is called a character set. Each font is designed for use with a particular character set.
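To make this concrete, here is a minimal Python sketch of the idea: every character a program handles is stored as its numeric code.

    # ord() returns a character's numeric code; chr() reverses it.
    print(ord("A"))       # 65   -- the code for LATIN CAPITAL LETTER A
    print(chr(65))        # 'A'  -- the character stored at code 65
    print(hex(ord("é")))  # 0xe9 -- an accented letter has its own code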

Before Unicode there weren’t enough character codes available

Prior to the introduction of the Unicode standard, most character sets were limited to 256 characters. Because such a small number of characters could not meet the needs of every language, different character sets were defined for different languages. Some character sets were defined by computer companies, others by national or international standards bodies.

When computers were given the ability to switch between character sets, the character sets became known as code pages. Each code page was assigned a number. For example,

  • English, French, Spanish, and other Western European languages used code page 1252
  • Cyrillic used code page 1251, and
  • Korean used code page 1361.

Although computers could switch between code pages, few users knew how to do so, and in any case, they could not use two different code pages at the same time.

The shortage of codes created problems

This inconsistent encoding meant that systems conflicted with one another. That is, two code pages could use

  • the same number for two different characters, or
  • different numbers for the same character.

Any given computer (servers especially) needed to support many different encodings; yet whenever data created with one code page was read on a computer using a different code page, that data ran the risk of being corrupted.
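A small Python sketch makes the conflict concrete: the same byte value decodes to different characters under different code pages, and data read with the wrong code page is silently garbled.

    # The single byte 0xC0 means different things on different code pages.
    raw = bytes([0xC0])
    print(raw.decode("cp1252"))  # 'À' -- Western European code page
    print(raw.decode("cp1251"))  # 'А' -- Cyrillic code page

    # Reading Cyrillic (cp1251) data as if it were cp1252 corrupts it:
    print("Привет".encode("cp1251").decode("cp1252"))  # 'Ïðèâåò'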

Additionally, in most parts of the world, minority languages were overlooked by software developers and standards bodies. Many of the characters those languages needed were left out of the standards. (In this regard, linguists recording data with a phonetic alphabet were a minority language community!) This posed a problem for minority-language speakers and language-development teams who needed to write these languages on their computers.

Custom fonts created even more problems

At that time, the only solution for minority languages was to create custom fonts—often called hacked fonts. The resulting fonts contained the special characters needed to represent the language, but they no longer conformed to any standard, creating a new series of problems. For example, a given character code might represent a punctuation character in the standard character set, but an alphabetic character in the custom font. The result was that the computer continued to treat it as a punctuation character.
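This can be demonstrated directly: software classifies a character by its code, not by whatever glyph a font draws for it. A brief Python sketch (the repurposed semicolon is a hypothetical example):

    import unicodedata

    # Suppose a hacked font drew a vowel glyph at the code for ';'.
    # The computer still classifies that code as punctuation:
    print(unicodedata.category(";"))  # 'Po' -- Punctuation, other

    # So sorting, spell checking, and line breaking all treat the
    # "vowel" as punctuation, regardless of what appears on screen.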

Another consequence of using custom fonts is that any documents based on them are meaningless without access to those fonts.

  • Text from one language team cannot be shared with another team unless each team has the same custom font.
  • Text posted on the web may be garbled or unreadable.
  • Archived data becomes useless if the font is lost or inaccessible.

Moreover, there is a continual danger that changes in computer operating systems and programs will break or corrupt a custom font. This actually happened a few years ago when Microsoft added the euro symbol to all of its fonts at character code 128: the special characters that many custom fonts had placed at 128 no longer displayed correctly.

The solution

Unicode will provide codes for every character

“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” (The Unicode Standard, Version 4.0)

Unicode has space for over one million characters. In theory, every character in every writing system in the world can be assigned a Unicode number. In practice, the Unicode standard currently specifies over 100,000 characters. This covers the major languages of the world and many minority languages. However, some minority languages are not yet covered: in some cases only a handful of characters are missing; in others, entire scripts.

Unicode is growing to accommodate the needs of the world's language communities

Missing pieces are being added to Unicode on a regular basis. Unicode also has a place for custom characters used by an organization that have not yet been, or may never be, accepted into the official standard: the Private Use Area (PUA). Part of SIL’s strategy has been to organize the characters in the PUA across the organization (see the SIL PUA Corporate Strategy for further details). SIL’s PUA has been a temporary home to many minority-language characters during the often lengthy Unicode approval process. SIL’s importance to the process can be seen in the fact that, of some 250 characters in SIL’s PUA, over 80% have subsequently been officially added to Unicode.
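For reference, the PUA in the Basic Multilingual Plane runs from U+E000 to U+F8FF, with two further private-use planes at U+F0000–U+FFFFD and U+100000–U+10FFFD. A small Python sketch for spotting private-use characters in data:

    def in_pua(ch: str) -> bool:
        """Return True if ch lies in one of Unicode's Private Use Areas."""
        cp = ord(ch)
        return (0xE000 <= cp <= 0xF8FF           # BMP Private Use Area
                or 0xF0000 <= cp <= 0xFFFFD      # Plane 15
                or 0x100000 <= cp <= 0x10FFFD)   # Plane 16

    print(in_pua("\uE000"))  # True  -- a private-use character
    print(in_pua("A"))       # False -- a standard character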

At some point in the future, the work of encoding the world’s writing systems will be complete, and language projects all over the world will be able to finish the transition to the Unicode standard with Unicode-compliant fonts. This should eliminate the problems associated with using non-standard fonts. Consistent and predictable character codes mean that individuals and organizations working or doing research in any number of languages can easily share documents and language data. It also means that those documents will be more permanent, since any Unicode-compliant font will correctly represent the data.¹

Benefits

Adopting the Unicode standard for all documents and data will provide benefits in many areas.

Special characters

You can properly display special characters, including minority-language characters and non-Roman script text, in numerous software applications. This was not reliably possible with non-Unicode character sets and fonts.

“Unicode is marvelous. It makes it possible for phoneticians throughout the world to use all manner of phonetic symbols in their work and display them on computer screens in the certainty that they will not now be garbled or turned into wingdings (as once used to happen all too often). All alphabetic phonetic symbols officially recognized by the International Phonetic Association are now included in the Unicode Standard.”

—John Wells, President
International Phonetic Association
See: http://www.unicode.org/press/quotations.html

Multilingual documents

Highly multilingual documents can be easily represented in Unicode. The old code-page-based systems could not support more than one or two scripts at a time without resorting to hacked fonts.
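For example, a single Unicode string can mix scripts that once required separate code pages; one standard encoding (such as UTF-8) carries all of them. A brief Python sketch with illustrative strings:

    # One string, several scripts -- impossible under one legacy code page.
    s = "Latin, Кириллица, ελληνικά, and IPA: ɔʃŋ"
    data = s.encode("utf-8")          # one byte encoding covers everything
    print(data.decode("utf-8") == s)  # True -- round-trips losslessly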

Information sharing

Your documents and data can be shared with a world-wide audience. The information in the documents will be displayed properly on any computer in any part of the world that incorporates the Unicode standard. Web-based publishing in minority languages becomes practical.

Document permanence

Your documents can be read for many years to come, by any computer in any part of the world, as long as that computer incorporates the Unicode standard. That is to say, they will have constancy and permanence.

Publications

When you publish your documents (either in print or electronically), the documents can be processed and read by any computer in any part of the world, as long as that computer incorporates the Unicode standard.

Compatibility

Legacy fonts, particularly custom-encoded ones, increasingly fail to display reliably on new operating systems and software applications. Conversely, newer versions of software work increasingly well with Unicode fonts, taking advantage of their smart-font capabilities, resulting in better-looking printed and online documents.

Costs

Adopting the Unicode standard for all documents and data will involve some costs.

Costs for hardware and software

You may need to upgrade computer equipment and update to the latest software application versions. This may be a decreasing cost since operating systems and software are often upgraded for other reasons such as security. Note that, although all Windows operating systems since Windows 2000 are Unicode compliant, some languages that have only recently been added to Unicode may only work properly with the latest version of Windows.

Costs in IT specialists' time

IT specialists may need training to acquire new skills such as

  • identifying characters that need to be added to Unicode,
  • developing conversion tables, and
  • modifying keyboard layouts.

They will then need to

  • train users to convert their data and documents (or do it for them),
  • help users switch to new software, and
  • determine whether new characters should be added to Unicode.

This may require a significant initial investment of time as well as ongoing effort by specialists. For example, they will need to work with the Unicode Technical Committee (UTC) to add new characters to Unicode. SIL entities and FOBAI organizations should contact NRSI about adding such characters to SIL’s Private Use Area as the first step in adding new characters and scripts to Unicode.
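As an illustration of what a conversion table involves, here is a minimal Python sketch with a made-up two-entry mapping; real projects typically use a dedicated tool such as SIL's TECkit with a much fuller map:

    # Hypothetical mapping from a custom font's byte codes to Unicode.
    # A real table covers every character the custom font redefined.
    LEGACY_TO_UNICODE = {
        0x80: 0x0254,  # custom font showed open o here -> U+0254 'ɔ'
        0x81: 0x014B,  # custom font showed eng here    -> U+014B 'ŋ'
    }

    def convert(legacy: bytes) -> str:
        """Convert bytes typed with the custom font to a Unicode string."""
        return "".join(
            chr(LEGACY_TO_UNICODE.get(b, b))  # unmapped bytes pass through
            for b in legacy
        )

    print(convert(bytes([0x74, 0x80, 0x81])))  # 'tɔŋ'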

Costs in users' time

There will be an ongoing effort on the part of users as they convert their data and documents to be Unicode-compliant and learn to use new software.

Getting started

Now or later?

You will eventually need to switch to Unicode for all your work. For most teams, now is the best time to do so.

Organizations and individuals using a modern operating system (Windows 2000, XP, or Vista; Mac OS X; Linux) should consider switching all their work to Unicode. Fonts and applications are already available for most languages and scripts.

For more information on what is available, see: Software requirements for different levels of Unicode Support.

Why delay?

Certain projects might need to delay if they lack key Unicode-based components—for example, if a language project uses a non-Roman script for which a Unicode font is not yet available.

Large publishing projects may elect to publish first and then convert their data to Unicode for archiving. However, this carries some risk. Once the data is converted to Unicode, how will it be checked for accuracy? Will anyone be willing to proofread it again, even though it has already been published? It is dangerous to convert data and then put it on the shelf unchecked. A better strategy is to convert the data to Unicode before publishing: the publication process itself verifies the accuracy of the conversion, and the data can then be archived with confidence.

What to do

Download the Unicode Conversion Planning document below. It was written for SIL and partner organizations, but it is made available here as a template for establishing Unicode transition goals and objectives in other organizations as well.

Unicode Conversion Planning
Jim Brase, 2007-02-19
Download "Unicode_Conversion_Planning_2007ver05.doc", MS Word document, 281KB [3663 downloads]

¹ Although not all Unicode characters are in every Unicode font, a Unicode font will never display the wrong character. Also, when using Unicode, word processors and browsers can avoid displaying "empty boxes" by substituting fonts when necessary.
