NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/OrthographyDev

Orthography development in relation to Unicode

Lorna A. Priest, 2004-11-18

It is out of our scope to give complete guidelines for developing an orthography. However, we would like to give you a process to work through from a Unicode perspective.

In designing a writing system, one must decide what symbols will be used and how. Here we list Unicode factors that should be taken into account. At least two sets of standards should be considered when developing an orthography:

  • National/local alphabet (local or national standards may have higher priority than phonetic considerations), and
  • The Unicode Standard.

Our concern here is with the Unicode Standard.

Note

The crucial question now is, “Does the character being considered already exist in the Unicode Standard?”

The Unicode Consortium will not readily accept the addition of randomly created characters. If the character under consideration does not already exist in Unicode, you should seriously consider an alternative that does. Using a character that is not already in the Unicode Standard leads to the following problems:

  • Standard fonts will not contain that character
  • Diacritics will not be placed properly on that character
  • Words will possibly be broken in inappropriate places
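
Whether a candidate character is already encoded can be checked programmatically. The following is a minimal sketch in Python, assuming the standard-library unicodedata module. Its data reflects the Unicode version bundled with the Python build in use, so a recently added character may be reported as unassigned. The helper name is_assigned is illustrative only and is not part of any library.

```python
import unicodedata


def is_assigned(ch: str) -> bool:
    """Return True if ch is an assigned, non-private-use character
    in the Unicode version bundled with this Python build."""
    # Cn = unassigned, Co = private use, Cs = surrogate
    return unicodedata.category(ch) not in {"Cn", "Co", "Cs"}


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # U+0254 is an encoded letter, U+E000 is private use,
    # U+0378 is an unassigned code point.
    for code_point in (0x0254, 0xE000, 0x0378):
        ch = chr(code_point)
        name = unicodedata.name(ch, "<no name>")
        print(f"U+{code_point:04X} {name}: assigned={is_assigned(ch)}")
```

A code point in the Private Use Area (category Co) is treated here as "not in Unicode" for orthography purposes, because standard fonts and software assign it no agreed meaning.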

If the character you want to use exists in the Unicode Standard, then there are some further issues to think about.

In developing an orthography, work through the following questions (see Unicode Character Properties Excel Workbook):

  • For each character, consider the following:
    • What are the character’s properties?
      • If it is a lower case letter, it should have an Ll (Letter, Lowercase) category.
        • Make sure there is already a matching upper case letter.
      • If it is an upper case letter, it should have an Lu (Letter, Uppercase) category.
      • Some letters do not have case (no upper and lower case variants). For these, consider the following:
        • If the character you choose does not have case, it should have an Lm (Letter, Modifier) category.
        • If the script you are working with does not use case at all, characters will have an Lo (Letter, Other), Mc (Mark, Spacing Combining), or Mn (Mark, Nonspacing) category.
        • Some people have been creative in using punctuation marks for marking tone, glottal stops and other features that are properly part of a word. Do not consider doing this! If punctuation marks are used, they will not be considered part of the word. You might find what you need in the Spacing Modifier Letters block. However, not all of these are considered word-building (for instance, some of them have a character property of Sk, “Symbol, Modifier”). Check the character properties (see Unicode Character Properties Excel Workbook) of any symbol you are considering using.
    • What is the form of capital letters? If you are uncertain, look at “Upper Case equivalent” and “Lower Case equivalent” in the Unicode Character Properties Excel Workbook.
  • How do you represent tone?
    • Diacritics: these are found in the Combining Diacritical Marks and Combining Diacritical Marks Supplement sections of the Unicode Standard. Do not use U+0334..U+0338 and U+0340..U+0345.
    • Special letters: follow the guidelines for character properties.
    • Punctuation: tone is generally considered part of a word, but a punctuation mark will not be treated as part of the word. Follow the guidance on punctuation marks given under character properties.
    • Numbers: “normal” superscript numbers can be used (U+00B9, U+00B2, U+00B3, U+2074, U+2075, U+2076, U+2077, U+2078, U+2079, U+2070 represent 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 respectively).
  • How do you represent a word boundary? Unless you are using a non-Roman script, you can probably count on this being simply a space, and it will not be an issue in the orthography.
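
The property checks in the list above can be sketched in Python as follows. This is only an illustration, assuming the standard-library unicodedata module (whose data depends on the Python build) rather than the Excel workbook. The function name audit_character, the constant names, and the warning texts are hypothetical and chosen for this example.

```python
import unicodedata

# General categories the list above names for word-building characters.
LETTER_OR_MARK = {"Lu", "Ll", "Lm", "Lo", "Mc", "Mn"}
# Combining marks the list above says not to use for tone.
AVOID_RANGES = ((0x0334, 0x0338), (0x0340, 0x0345))


def audit_character(ch: str) -> tuple[str, list[str]]:
    """Return (general category, list of warnings) for one character."""
    category = unicodedata.category(ch)
    warnings = []
    if category not in LETTER_OR_MARK:
        warnings.append(f"{category} is not a letter or mark category")
    # A cased letter should have a counterpart in the other case.
    if category == "Ll" and ch.upper() == ch:
        warnings.append("no uppercase equivalent")
    if category == "Lu" and ch.lower() == ch:
        warnings.append("no lowercase equivalent")
    if any(low <= ord(ch) <= high for low, high in AVOID_RANGES):
        warnings.append("in a range the article says to avoid")
    return category, warnings


if __name__ == "__main__":
    # Open o, ASCII apostrophe, modifier letter apostrophe, small tilde,
    # combining acute accent, combining long solidus overlay.
    for ch in ("\u0254", "'", "\u02BC", "\u02DC", "\u0301", "\u0338"):
        category, warnings = audit_character(ch)
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X} {name}: {category} {warnings or 'ok'}")
```

In this sketch the ASCII apostrophe (a punctuation mark) and the small tilde (category Sk) are flagged, while the modifier letter apostrophe (category Lm) passes, which mirrors the advice to prefer word-building characters over punctuation. Superscript digits are numbers rather than letters or marks, so they would also be flagged and need to be judged separately.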

© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.