NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/EncodingFAQ

How do I encode...?

    Glottals
        Q: What character should I use to represent the glottal stop?
    Diacritics
        Q: I want to put a diacritic on a “dotted i” and want to retain the dot on the “i”. Can you add that feature to your fonts?
        Q: I need a “V”, “t”, “n” and “l” with a macron under each. Unicode does not have these characters. Can you add these to your PUA and get them into Unicode for me, or is there another way I can encode this character?
        Q: You have left out one crucial Unicode range of four diacritics which are used within the Latin-script in the library world: U+FE20..U+FE23.
        Q: I need a diacritic on an “i”. Should I use the dotless “i” that I found in Unicode or what should I do? I also need to have a diacritic that will go on the upper case “i” and I can't find different heights for the diacritics.
    Overlays
        Q: I need to use a slash “L” (Ł). I can see that Unicode has a precomposed slash “L”. Would it be better for me to use the precomposed version or make it decomposed?
        Q: I cannot find a barred g. Can you add it to your PUA?
    Tone
        Q: I see that Unicode (and your Doulos SIL font) has individual tone letters (U+02E5..U+02E9), but does not have the tone glides. Can you get those encoded in Unicode? They are very important in linguistic work.
        Q: How do I know which version of the schwa to use? There is U+0259 ə and U+01DD ǝ.
        Q: I've noticed that when I'm looking for phonetic characters, not everything I want is in the IPA extensions.
        Q: I want an open o with the serif at the top. I see that Unicode now has U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED and U+2184 LATIN SMALL LETTER REVERSED C. Can I use those instead of U+0186 LATIN CAPITAL LETTER OPEN O and U+0254 LATIN SMALL LETTER OPEN O?
        Q: I want a handwritten style a. Unicode has U+0251 LATIN SMALL LETTER ALPHA. Can I use that?
    Page History

Glottals

Question: What character should I use to represent the glottal stop?

Answer: There are a lot of different things that people have done in the past.

If you want something that looks like a curly quote, you should use U+02BC MODIFIER LETTER APOSTROPHE. You could use U+2019 RIGHT SINGLE QUOTATION MARK, but there are at least two issues with that. First, it is classed as punctuation, so it has different character properties from a letter of the orthography (for example, software may break a word at it). Second, if your text also uses quotation marks, there is nothing to distinguish the glottal stop from a closing quote. (Our Roman fonts, such as Andika, Charis SIL, Doulos SIL and Gentium Plus, all have an alternate glyph for U+02BC MODIFIER LETTER APOSTROPHE which is a bit larger than normal, to help distinguish it from U+2019 RIGHT SINGLE QUOTATION MARK.)
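The difference in character properties can be seen with a quick check. The following is a minimal Python sketch, not part of any SIL tool; the sample word is made up for illustration. It relies only on the standard `re` module, whose `\w` class matches letters but not punctuation.

```python
import re

# Hypothetical sample word with a glottal stop between two vowels.
with_modifier_letter = "ha\u02bci"   # U+02BC MODIFIER LETTER APOSTROPHE
with_quotation_mark = "ha\u2019i"    # U+2019 RIGHT SINGLE QUOTATION MARK


def words(text):
    """Split text into runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


# ascii() keeps the output readable on any console encoding.
print(ascii(words(with_modifier_letter)))  # the word stays in one piece
print(ascii(words(with_quotation_mark)))   # the word is split at the quotation mark
```

Other tools (spell checkers, word counters, search) may apply similar rules, which is why the letter-like character is the safer choice for an orthography.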

Many orthographies have used something that looks like the straight quote. There were so many problems with using U+0027 APOSTROPHE for this character that we requested the addition of a dedicated character to Unicode. You should use U+A78C LATIN SMALL LETTER SALTILLO (one language even "cases" this, and U+A78B is used for the uppercase). (Our Roman fonts, such as Andika, Charis SIL, Doulos SIL and Gentium Plus, all have alternate glyphs for U+A78C LATIN SMALL LETTER SALTILLO and U+A78B LATIN CAPITAL LETTER SALTILLO which are a bit larger than normal, to help distinguish them from U+0027 APOSTROPHE.)

U+02BE MODIFIER LETTER RIGHT HALF RING is sometimes used for transliterating Arabic hamza (glottal stop). This looks different from both U+A78C LATIN SMALL LETTER SALTILLO and U+02BC MODIFIER LETTER APOSTROPHE and might be a good option for traditions which recognize the transliterated hamza.

Some Saskatchewan orthographies use an upper and lowercase glottal stop. Those are U+0241 LATIN CAPITAL LETTER GLOTTAL STOP and U+0242 LATIN SMALL LETTER GLOTTAL STOP.

Of course, the IPA representation is U+0294 LATIN LETTER GLOTTAL STOP and some languages also use this in their orthographies (where casing is not required).
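To compare the candidates side by side, a short Python sketch such as the following can list each character's general category and name. It uses only the standard `unicodedata` module; the list and the helper name `describe` are illustrative, not part of any SIL utility.

```python
import unicodedata

# Candidate characters for the glottal stop discussed above.
CANDIDATES = [
    "\u0027", "\u2019", "\u02bc", "\u02be",
    "\ua78b", "\ua78c", "\u0241", "\u0242", "\u0294",
]


def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in CANDIDATES:
    print(*describe(ch))

# Categories starting with "P" are punctuation; those starting with "L" are letters.
letters = [ch for ch in CANDIDATES if unicodedata.category(ch).startswith("L")]
print(len(letters), "of", len(CANDIDATES), "candidates are letters")
```

Only the two quotation-style characters (U+0027 and U+2019) come out as punctuation; the rest are letters, and the cased pairs map to each other under `str.upper()` and `str.lower()`.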

Diacritics

Question: I want to put a diacritic on a “dotted i” and want to retain the dot on the “i”. Can you add that feature to your fonts?

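Answer: The Unicode Standard addresses this in chapter 7 (see http://www.unicode.org/book/ch07.pdf). You should encode the dot explicitly. For example, a dotted “i” with an acute is U+0069 LATIN SMALL LETTER I + U+0307 COMBINING DOT ABOVE + U+0301 COMBINING ACUTE ACCENT.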
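As a small illustration, assuming the acute accent as the diacritic in question, the sequence can be built and inspected in Python with the standard `unicodedata` module. The variable names are only for this sketch.

```python
import unicodedata

# i + COMBINING DOT ABOVE + COMBINING ACUTE ACCENT: the dot is encoded explicitly.
dotted_i_acute = "\u0069\u0307\u0301"

for ch in dotted_i_acute:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Normalization to NFC leaves the sequence as three code points,
# so the explicit dot is not lost when text is normalized.
nfc = unicodedata.normalize("NFC", dotted_i_acute)
print(len(nfc))
```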



Question: I need a “V”, “t”, “n” and “l” with a macron under each. Unicode does not have these characters. Can you add these to your PUA and get them into Unicode for me, or is there another way I can encode this character?

Answer: Unicode does have some precomposed characters because they already existed in earlier standards. The Unicode Technical Committee will no longer accept new precomposed forms unless there is a very convincing argument.

However, each of these can be encoded in Unicode. So, for example, “V” with a macron under it should be encoded as two characters: U+0056 LATIN CAPITAL LETTER V + U+0331 COMBINING MACRON BELOW.



The same thing can be done with each of your other characters, and, in fact, any other base + diacritic.
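The base + diacritic approach can be sketched in Python as follows. The helper name `with_macron_below` is hypothetical. The sketch also normalizes each sequence, since for some bases normalization to NFC may produce a single, canonically equivalent code point; either way the text round-trips back to the same base + diacritic sequence.

```python
import unicodedata

MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW


def with_macron_below(base):
    """Return base followed by U+0331 (helper name is illustrative only)."""
    return base + MACRON_BELOW


for base in ["V", "t", "n", "l"]:
    seq = with_macron_below(base)
    composed = unicodedata.normalize("NFC", seq)
    decomposed = unicodedata.normalize("NFD", composed)
    # Show the code points, the NFC length, and whether NFD restores the sequence.
    print(base, [f"U+{ord(c):04X}" for c in seq], len(composed), decomposed == seq)
```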

Question: You have left out one crucial Unicode range of four diacritics which are used within the Latin-script in the library world: U+FE20..U+FE23.

U+FE20 COMBINING LIGATURE LEFT HALF
U+FE21 COMBINING LIGATURE RIGHT HALF
U+FE22 COMBINING DOUBLE TILDE LEFT HALF
U+FE23 COMBINING DOUBLE TILDE RIGHT HALF

Transliterated Cyrillic records, for example, make heavy use of the first two.

Answer: Originally we made a deliberate decision not to include the combining half marks in our fonts. We consider U+0360 COMBINING DOUBLE TILDE and U+0361 COMBINING DOUBLE INVERTED BREVE to be the preferred characters to use. Thus, to put U+0361 COMBINING DOUBLE INVERTED BREVE over “ia”, the preferred encoding is to place U+0361 between the two letters (i + U+0361 + a).



However, we were convinced that the library world does need this range, and so these characters were added to our Unicode Roman fonts (Doulos SIL ver 4.1 and Charis SIL ver 4.1). Their positioning may not be perfect.
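The two encodings of a ligature tie over “ia” can be compared in a short Python sketch. The constant and function names are illustrative; the pairing of one half mark after each base letter is an assumption about how the half marks are typically entered.

```python
import unicodedata

DOUBLE_INVERTED_BREVE = "\u0361"   # one mark spanning two base characters
LIGATURE_LEFT_HALF = "\ufe20"
LIGATURE_RIGHT_HALF = "\ufe21"

# Preferred: one double diacritic placed between the two base letters.
preferred = "i" + DOUBLE_INVERTED_BREVE + "a"

# Assumed library-style alternative: one half mark after each base letter.
half_marks = "i" + LIGATURE_LEFT_HALF + "a" + LIGATURE_RIGHT_HALF


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(preferred))
print(code_points(half_marks))

# Normalization does not unify the two encodings, so data should use one consistently.
same = unicodedata.normalize("NFC", preferred) == unicodedata.normalize("NFC", half_marks)
print(same)
```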



Question: I need a diacritic on an “i”. Should I use the dotless “i” that I found in Unicode or what should I do? I also need to have a diacritic that will go on the upper case “i” and I can't find different heights for the diacritics.

Answer: This is where Unicode is really, really useful. You no longer need to encode two different versions of an “i” and two different versions of a diacritic. In fact, you should not! If you look at the character properties for the character you have suggested (U+0131 LATIN SMALL LETTER DOTLESS I), you will see that this character is only used for Turkish and Azerbaijani.

So, you should just use the base character plus the diacritic. (This makes data analysis much simpler as well.) Unicode, along with smart fonts, will automatically handle the dot removal for the “i” and height adjustment for the upper case “i”. For example, i with acute would be encoded as i + U+0301.
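A brief Python sketch, using the acute accent as the example diacritic, shows that the same combining mark serves both cases and that case conversion keeps the data consistent. The variable names are only for illustration.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

lower = "i" + ACUTE   # the ordinary dotted i, not U+0131
upper = "I" + ACUTE   # the same diacritic, no special "tall" version

# Both sequences are canonically equivalent to a precomposed character.
print(unicodedata.name(unicodedata.normalize("NFC", lower)))
print(unicodedata.name(unicodedata.normalize("NFC", upper)))

# Upper-casing the lower case sequence gives the upper case sequence.
print(lower.upper() == upper)
```

Dot removal and diacritic height are rendering matters left to the font; nothing about them appears in the encoded text.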



In the same way, a smart font shifts the diacritic down when the base character has a descender.



Overlays

Question: I need to use a slash “L” (Ł). I can see that Unicode has a precomposed slash “L”. Would it be better for me to use the precomposed version or make it decomposed?

Answer: Sometimes people get confused about whether to use precomposed or decomposed characters that are in Unicode. A simple rule-of-thumb to go by is that if a character has diacritics (either above or below the character), it can be decomposed. If the character has an “overlay” (superimposed on the character) then the preformed (not precomposed) character should be used.

An easy way to find Unicode characters is to look at: Unicode 5.1 Latin and Cyrillic characters – sorted. This document is sorted alphabetically. However, it does not show character properties and decompositions; for that information you will need to go to the Unicode book or the Unicode website. You can find charts of all the Unicode characters at that site; one that is particularly useful is the chart for Combining Diacritical Marks.

In the example we are using (U+0141 LATIN CAPITAL LETTER L WITH STROKE) you will find that there is no decomposition listed for this character, and so you should not use “L” + “/” (U+004C LATIN CAPITAL LETTER L + U+0338 COMBINING LONG SOLIDUS OVERLAY). This also means that we should not be using the term “precomposed” for this character; rather, it is “preformed”.
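Whether a character has a decomposition can also be checked programmatically. The following Python sketch uses the standard `unicodedata` module; the helper name `decomposition_of` and the comparison character (A with acute) are chosen for illustration.

```python
import unicodedata


def decomposition_of(ch):
    """Return the decomposition listed for ch, or '' if there is none."""
    return unicodedata.decomposition(ch)


print(repr(decomposition_of("\u0141")))  # L WITH STROKE: overlay, no decomposition
print(repr(decomposition_of("\u00c1")))  # A WITH ACUTE: diacritic, decomposable

# Because there is no decomposition, L + overlay never normalizes to U+0141.
overlay_sequence = "L\u0338"
print(unicodedata.normalize("NFC", overlay_sequence) == "\u0141")
```

Text encoded as L + U+0338 would therefore not match U+0141 in searches or sorting, even after normalization.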

Question: I cannot find a barred g. Can you add it to your PUA?

Answer: Although what you are requesting looks different, fundamentally this is the same character as ǥ. You should encode it as U+01E5 LATIN SMALL LETTER G WITH STROKE. The Doulos SIL font allows you to choose the barred bowl form through Font Features (if you have an application which allows for this).

Tone

Question: I see that Unicode (and your Doulos SIL font) has individual tone letters (U+02E5..U+02E9), but does not have the tone glides. Can you get those encoded in Unicode? They are very important in linguistic work.

Answer: Unicode can already handle these. You do need a smart font (like Doulos SIL) to make it work. You should type the tone letters in the correct linguistic order and they should become the correct tone glide.
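As a sketch of what such a sequence looks like in data, the following Python helper builds a string of tone letters from pitch levels. The 1 to 5 numbering (5 = extra-high, 1 = extra-low) and the function name `tone_letters` are assumptions made for this example; the joining of the letters into a contour is done by the font, not by the code.

```python
import unicodedata


def tone_letters(levels):
    """Map pitch levels (5 = extra-high ... 1 = extra-low) to U+02E5..U+02E9."""
    result = []
    for level in levels:
        if not 1 <= level <= 5:
            raise ValueError(f"pitch level out of range: {level}")
        # U+02E5 is extra-high and U+02E9 is extra-low, so count down from U+02EA.
        result.append(chr(0x02EA - level))
    return "".join(result)


falling = tone_letters([5, 1])   # high-to-low glide, typed in linguistic order
rising = tone_letters([1, 5])    # low-to-high glide

for ch in falling:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```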



Question: How do I know which version of the schwa to use? There is U+0259 ə and U+01DD ǝ.

Answer: This one will rise up and bite you if you are not careful! This is where looking at the documentation is important. Look at the entry for U+0259 LATIN SMALL LETTER SCHWA in the Unicode documentation.

There are a number of useful bits of information there. Firstly, it tells you that U+018E LATIN CAPITAL LETTER REVERSED E is associated with U+01DD LATIN SMALL LETTER TURNED E, not with the schwa. The second bit of useful information is in the first cross reference you are given: U+018F LATIN CAPITAL LETTER SCHWA. This tells us that U+018F is the upper case match for U+0259.

Another interesting test is to type both of the schwas into a word processor (like Word). Select them both and click on Format / Change Case... / UPPER CASE. You should see two different forms of the upper case schwa. This shows you how important it is to match the lower case character (which looks exactly the same) with the correct upper case character (which looks significantly different).



So, if you are using U+018F in your orthography, make sure the lower case is U+0259.
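The same test can be run without a word processor. This minimal Python sketch prints the upper case partner of each lower case character, using the case mappings built into Python's string methods; the helper name `case_pair` is illustrative.

```python
import unicodedata


def case_pair(ch):
    """Return a readable description of ch and its upper case partner."""
    upper = ch.upper()
    return (f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
            f"U+{ord(upper):04X} {unicodedata.name(upper)}")


for ch in ["\u0259", "\u01dd"]:
    print(case_pair(ch))
```

The two lower case characters look alike but map to different capitals, so mixing them in data produces inconsistent upper case text.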

Question: I've noticed that when I'm looking for phonetic characters, not everything I want is in the IPA extensions.

For example, the beta which is used for a voiced bilabial fricative is, I believe, supposed to be encoded as U+03B2, but that is in the Greek section, and its documentation does not make explicit that it is supposed to be used for a bilabial fricative nor that it is part of the IPA. So, I am still not absolutely sure I've got the right character.

Answer: You are right about the voiced bilabial fricative being encoded as U+03B2. The bigger question, of course, is how to know all the characters sanctioned as part of the IPA and what their Unicode code points are.

The official IPA site does not currently do this for us. There are several places you can check for this information. Try Symbols in Phonetic Symbol Guide 2nd edn. in relation to Unicode 5.1, http://www.travelphrases.info/gallery/Test_IPA.html, http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm or http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm. Another resource for answering this kind of question (and maybe avoiding having to figure it out yourself!) is the SIL IPA93 conversion map or the IPA Unicode Keyboards. Typing in the beta using this keyboard results in character U+03B2.
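To see how IPA symbols are spread across blocks, a small Python sketch can test whether a character falls inside the IPA Extensions block (U+0250..U+02AF). The sample list is a hand-picked illustration, not a complete IPA inventory, and the helper name is hypothetical.

```python
import unicodedata

IPA_EXTENSIONS = range(0x0250, 0x02B0)  # the IPA Extensions block, U+0250..U+02AF

# A few IPA symbols (illustrative sample only).
SAMPLE = ["\u03b2", "\u03b8", "\u00f0", "\u014b", "\u0259", "\u0294"]


def in_ipa_extensions(ch):
    """True if ch is encoded in the IPA Extensions block."""
    return ord(ch) in IPA_EXTENSIONS


for ch in SAMPLE:
    where = "IPA Extensions" if in_ipa_extensions(ch) else "another block"
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "-", where)
```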

Question: I want an open o with the serif at the top. I see that Unicode now has U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED and U+2184 LATIN SMALL LETTER REVERSED C. Can I use those instead of U+0186 LATIN CAPITAL LETTER OPEN O and U+0254 LATIN SMALL LETTER OPEN O?

Answer: U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED was added to the Roman numeral block. U+2184 LATIN SMALL LETTER REVERSED C was added for use as a Claudian letter. We do not recommend their use for anything other than what they were designed for. Please use U+0186 and U+0254 if you need an open o, and find a font which has the serif where you want it. Our SIL Roman Unicode fonts now provide an alternate form, so if you have an application that can handle it, you can choose whether you want a top or bottom serif.
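When two characters look alike, their names are a quick way to tell which one is actually in your data. A minimal Python sketch, with illustrative variable names:

```python
import unicodedata

OPEN_O = ("\u0186", "\u0254")       # recommended for open o
REVERSED_C = ("\u2183", "\u2184")   # similar shapes, but added for other purposes

for ch in OPEN_O + REVERSED_C:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The look-alikes are distinct characters, so they never compare equal.
print("\u0254" == "\u2184")

# The open o pair is linked by the standard case mappings.
print("\u0254".upper() == "\u0186")
```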

Question: I want a handwritten style a. Unicode has U+0251 LATIN SMALL LETTER ALPHA. Can I use that?

Answer: U+0251 LATIN SMALL LETTER ALPHA is in Unicode as an IPA symbol. Please do not use it instead of an "a". You should find a font which has a handwritten style "a" at U+0061 LATIN SMALL LETTER A. If you use U+0251 LATIN SMALL LETTER ALPHA you will have unexpected results with data analysis as well as when using uppercase/lowercase pairs.
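The kind of unexpected result meant here can be shown with a short Python sketch. The sample strings are made up for illustration.

```python
# Two strings that may look the same in a handwriting-style font.
sample_with_a = "ba"            # uses U+0061 LATIN SMALL LETTER A
sample_with_alpha = "b\u0251"   # uses U+0251 LATIN SMALL LETTER ALPHA

# Comparison and searching treat them as different text.
print(sample_with_a == sample_with_alpha)
print("a" in sample_with_alpha)

# Upper-casing does not produce "BA" for the alpha version.
print(ascii(sample_with_a.upper()))
print(ascii(sample_with_alpha.upper()))
```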

Page History

2009-02-25 LP: added glottal question
2007-06-12 LP: updated answer to question about U+FE20..U+FE23
2004-10-06 LP: page creation


© 2003-2014 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.