NRSI: Computers & Writing Systems
How do I encode...?
Question: What character should I use to represent the glottal stop?
Answer: There are a lot of different things that people have done in the past.
If you want something that looks like a curly quote you should use U+02BC MODIFIER LETTER APOSTROPHE. You could use U+2019 RIGHT SINGLE QUOTATION MARK, but there are at least two issues with that. First, it is classed as punctuation, so its properties differ from those of an orthographic character. Second, if your text also uses quotation marks, there is nothing to distinguish the glottal stop from a closing quote. (Our Roman fonts (such as Andika, Charis SIL, Doulos SIL and Gentium Plus) all have an alternate glyph for U+02BC MODIFIER LETTER APOSTROPHE which is a bit larger than normal to help distinguish the glyph from U+2019 RIGHT SINGLE QUOTATION MARK.)
Many orthographies have used something that looks like the straight quote. There were so many problems with using U+0027 APOSTROPHE for this character that we requested the addition of a dedicated character to Unicode. You should use U+A78C LATIN SMALL LETTER SALTILLO (one language even "cases" this, and U+A78B LATIN CAPITAL LETTER SALTILLO is used for the uppercase). (Our Roman fonts (such as Andika, Charis SIL, Doulos SIL and Gentium Plus) all have alternate glyphs for U+A78C LATIN SMALL LETTER SALTILLO and U+A78B LATIN CAPITAL LETTER SALTILLO which are a bit larger than normal to help distinguish them from U+0027 APOSTROPHE.)
U+02BE MODIFIER LETTER RIGHT HALF RING is sometimes used for transliterating Arabic hamza (glottal stop). This looks different from both U+A78C LATIN SMALL LETTER SALTILLO and U+02BC MODIFIER LETTER APOSTROPHE and might be a good option for traditions which recognize the transliterated hamza.
Some Saskatchewan orthographies use an upper and lowercase glottal stop. Those are U+0241 LATIN CAPITAL LETTER GLOTTAL STOP and U+0242 LATIN SMALL LETTER GLOTTAL STOP.
Of course, the IPA representation is U+0294 LATIN LETTER GLOTTAL STOP and some languages also use this in their orthographies (where casing is not required).
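The difference between punctuation and orthographic characters shows up in the character properties. The following is a minimal sketch using Python's standard `unicodedata` and `re` modules; the helper name `describe` and the sample word are invented for illustration, and the exact property values depend on the Unicode version shipped with your Python.

```python
import re
import unicodedata

# Candidates discussed above, plus the punctuation they are confused with.
CANDIDATES = [
    "\u02BC",  # MODIFIER LETTER APOSTROPHE
    "\u2019",  # RIGHT SINGLE QUOTATION MARK (punctuation)
    "\uA78C",  # LATIN SMALL LETTER SALTILLO
    "\u0027",  # APOSTROPHE (punctuation)
    "\u02BE",  # MODIFIER LETTER RIGHT HALF RING
    "\u0242",  # LATIN SMALL LETTER GLOTTAL STOP
    "\u0294",  # LATIN LETTER GLOTTAL STOP (caseless)
]

def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in CANDIDATES:
    print(*describe(ch))

# A letter stays inside the word; a quotation mark splits it in two.
with_letter = re.findall(r"\w+", "ha\u02BCa")
with_quote = re.findall(r"\w+", "ha\u2019a")
print(with_letter, with_quote)

# Cased glottal characters map to their uppercase partners.
print("\uA78C".upper() == "\uA78B", "\u0242".upper() == "\u0241")
```

Categories beginning with `L` are letters and those beginning with `P` are punctuation, which is why word-splitting, searching and sorting tools treat the two groups differently.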
Question: I want to put a diacritic on a “dotted i” and want to retain the dot on the “i”. Can you add that feature to your fonts?
Answer: The Unicode Standard addresses this in chapter 7 (see http://www.unicode.org/versions/latest/ch07.pdf). You should encode it as U+0069 U+0307 U+0301.
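As a rough illustration of that sequence, the sketch below builds it in Python and lists the code points. It also checks that NFC normalization leaves the sequence alone, which is expected because there is no precomposed lowercase i with dot above; the variable names are only for this example.

```python
import unicodedata

# i + COMBINING DOT ABOVE + COMBINING ACUTE ACCENT
dotted_i_acute = "\u0069\u0307\u0301"

for ch in dotted_i_acute:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Normalizing to NFC should not merge or reorder anything here.
nfc = unicodedata.normalize("NFC", dotted_i_acute)
print(nfc == dotted_i_acute)
```

Whether the dot is actually drawn under the acute still depends on the font and the rendering engine.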
Question: I need a “V”, “t”, “n” and “l” with a macron under each. Unicode does not have these characters. Can you add these to your PUA and get them into Unicode for me, or is there another way I can encode these characters?
Answer: Unicode does have some precomposed characters, because they already existed in earlier standards. The Unicode Technical Committee will no longer accept new precomposed forms unless there is a very convincing argument.
However, each of these can be encoded in Unicode. So, for example “V” with a macron under it should be encoded as two characters (U+0056 LATIN CAPITAL LETTER V + U+0331 COMBINING MACRON BELOW):
The same thing can be done with each of your other characters, and, in fact, any other base + diacritic.
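The base + diacritic approach can be sketched in Python as follows. The helper `with_macron_below` is a made-up name for this example. Note that a few such sequences have canonically equivalent precomposed characters (named "WITH LINE BELOW"), so NFC normalization may shorten them, while others, such as the “V”, stay as two code points.

```python
import unicodedata

MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW

def with_macron_below(base):
    """Return the base letter followed by COMBINING MACRON BELOW."""
    return base + MACRON_BELOW

for base in "Vtnl":
    seq = with_macron_below(base)
    nfc = unicodedata.normalize("NFC", seq)
    nfd = unicodedata.normalize("NFD", nfc)
    # NFD always brings us back to the two-character sequence.
    print(base, [f"U+{ord(c):04X}" for c in nfc], nfd == seq)
```

Because the composed and decomposed forms are canonically equivalent, normalizing all data to one form (NFC or NFD) before comparing strings avoids mismatches.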
Question: You have left out one crucial Unicode range of four diacritics which are used within the Latin-script in the library world: U+FE20..U+FE23.
Transliterated Cyrillic records e.g. make heavy use of the first two.
Answer: Originally we made a deliberate decision not to include the combining half marks in our fonts. We consider U+0360 COMBINING DOUBLE TILDE and U+0361 COMBINING DOUBLE INVERTED BREVE to be the preferred characters to use. Thus, to put the U+0361 COMBINING DOUBLE INVERTED BREVE over an “ia”, the preferred encoding would be to put the U+0361 between “ia” (i + U+0361 + a):
However, we were convinced that the library world does need this range and so they were added to our Unicode Roman fonts (Doulos SIL ver 4.1 and Charis SIL ver 4.1). Positioning of these may not be perfect.
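To make the two encodings concrete, here is a small sketch that spells out both for “ia” and lists the code points. The variable names are illustrative, and the placement of the half marks (one after each letter) is an assumption about how they are typically used. The two sequences are not canonically equivalent, so a search for one will not find the other.

```python
import unicodedata

# Preferred: one double diacritic placed between the two base letters.
preferred = "i\u0361a"
# Half marks: one combining half after each base letter (assumed usage).
half_marks = "i\uFE20a\uFE21"

for label, text in (("preferred", preferred), ("half marks", half_marks)):
    print(label)
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")

# Normalization does not turn one encoding into the other.
same = unicodedata.normalize("NFC", preferred) == unicodedata.normalize("NFC", half_marks)
print(same)
```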
Question: I need a diacritic on an “i”. Should I use the dotless “i” that I found in Unicode or what should I do? I also need to have a diacritic that will go on the upper case “i” and I can't find different heights for the diacritics.
Answer: This is where Unicode is really, really useful. You no longer need to encode two different versions of an “i” and two different versions of a diacritic. In fact, you should not! If you look at the character properties for the character you have suggested (U+0131 LATIN SMALL LETTER DOTLESS I) you will see that this character is only used for Turkish and Azerbaijani.
So, you should just use the base character plus the diacritic. (This makes data analysis much simpler as well.) Unicode, along with smart fonts, will automatically handle the dot removal for the “i” and height adjustment for the upper case “i”. For example, i with acute would be encoded as i + U+0301.
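One way to see why the ordinary “i” is the safer base for data analysis is to look at case conversion. This is a sketch using Python's default, locale-independent case mappings; it uses i + U+0301 COMBINING ACUTE ACCENT as the sample, and the variable names are only illustrative.

```python
import unicodedata

i_acute = "i\u0301"  # ordinary i + COMBINING ACUTE ACCENT

# The sequence is canonically equivalent to the precomposed i with acute.
print(unicodedata.normalize("NFC", i_acute) == "\u00ED")

# Case conversion keeps the diacritic and round-trips cleanly.
print(i_acute.upper() == "I\u0301", i_acute.upper().lower() == i_acute)

# The dotless i uppercases to a plain I and does not come back.
dotless = "\u0131"
round_trip = dotless.upper().lower()
print(dotless.upper() == "I", round_trip == dotless)
```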
In the following example you can see that the diacritic is shifted down if you have characters that have descenders:
Question: I need to use a slash “L” (Ł). I can see that Unicode has a precomposed slash “L”. Would it be better for me to use the precomposed version or make it decomposed?
Answer: Sometimes people get confused about whether to use precomposed or decomposed characters that are in Unicode. A simple rule-of-thumb to go by is that if a character has diacritics (either above or below the character), it can be decomposed. If the character has an “overlay” (superimposed on the character) then the preformed (not precomposed) character should be used.
An easy way to find Unicode characters is to look at: Unicode 8.0 Latin and Cyrillic characters – sorted. This document is sorted alphabetically. However, it does not show character properties and decompositions, so if you need that information you will need to go to the Unicode book or the Unicode website. You can find charts of all the Unicode characters at this site, but another chart that is particularly useful is the chart for Combining Diacritical Marks.
In the example we are using (U+0141 LATIN CAPITAL LETTER L WITH STROKE) you will find that there is no decomposition listed for this character, and so you should not use “L” + “/” (U+004C LATIN CAPITAL LETTER L + U+0338 COMBINING LONG SOLIDUS OVERLAY). This also means that we should not be using the term “precomposed” for this character; rather, it is “preformed”.
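If you have Python available, the decomposition can also be checked programmatically. The sketch below relies on the standard `unicodedata.decomposition` function, which returns an empty string when a character has no decomposition; the helper name `has_decomposition` is invented for this example.

```python
import unicodedata

def has_decomposition(ch):
    """True if the Unicode data lists a decomposition for this character."""
    return unicodedata.decomposition(ch) != ""

# A with acute decomposes to A + COMBINING ACUTE ACCENT ...
print(repr(unicodedata.decomposition("\u00C1")), has_decomposition("\u00C1"))
# ... but L with stroke has no decomposition at all.
print(repr(unicodedata.decomposition("\u0141")), has_decomposition("\u0141"))

# So normalization never turns L + overlay into the preformed character.
overlay_l = unicodedata.normalize("NFC", "L\u0338")
print(overlay_l == "\u0141")
```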
Question: I cannot find a barred g. Can you add it to your PUA?
Answer: Although what you are requesting looks different, fundamentally this is the same character as ǥ. You should encode it as U+01E5 LATIN SMALL LETTER G WITH STROKE. The Doulos SIL font allows you to choose the barred bowl form through a font feature (if you have an application which allows for this).
Question: I see that Unicode (and your Doulos SIL font) has individual tone letters (U+02E5..U+02E9), but does not have the tone glides. Can you get those encoded in Unicode? They are very important in linguistic work.
Answer: Unicode can already handle these. You do need a smart font (like Doulos SIL) to make it work. You should type the tone letters in the correct linguistic order and they should become the correct tone glide. For example:
Question: How do I know which version of the schwa to use? There is U+0259 LATIN SMALL LETTER SCHWA and U+01DD LATIN SMALL LETTER TURNED E.
Answer: This one will rise up and bite you if you are not careful! This is where looking at the documentation is important. If you look at U+0259 you will see:
There are a number of useful bits of information here. Firstly, you see that it tells you U+018E is associated with U+01DD. The second bit of useful information is in the first cross reference you are given: U+018F LATIN CAPITAL LETTER SCHWA. This tells us that U+018F is the upper case match to this character.
Another interesting test is to type both of the schwas into a word processor (like Word). Select them both and convert them to upper case. You should see two different forms of the upper case schwa. This shows you how important it is to match the lower case character (which looks exactly the same) with the correct upper case character (which looks significantly different).
In this example, if you are using U+018F in your orthography, you should make sure the lower case is U+0259.
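The same check can be done without a word processor. This short Python sketch uses the built-in case mappings (which come from the Unicode data) to show which upper case character each of the two look-alike lower case letters is paired with; the constant names are just labels for this example.

```python
import unicodedata

SCHWA = "\u0259"     # LATIN SMALL LETTER SCHWA
TURNED_E = "\u01DD"  # LATIN SMALL LETTER TURNED E

for lower in (SCHWA, TURNED_E):
    upper = lower.upper()
    print(
        f"U+{ord(lower):04X} {unicodedata.name(lower)} -> "
        f"U+{ord(upper):04X} {unicodedata.name(upper)}"
    )
```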
Question: I've noticed that when I'm looking for phonetic characters, not everything I want is in the IPA extensions.
For example, the beta which is used for a voiced bilabial fricative is, I believe, supposed to be encoded as U+03B2 GREEK SMALL LETTER BETA, but that is in the Greek section, and its documentation does not make explicit that it is supposed to be used for a bilabial fricative nor that it is part of the IPA. So, I am still not absolutely sure I've got the right character.
Answer: You are right about the voiced bilabial fricative being encoded as U+03B2. The bigger question, of course, is how to find out which characters are sanctioned as part of the IPA and what their Unicode codepoints are.
The official IPA site does not currently do this for us. There are several places you can check for this information. Try Symbols in Phonetic Symbol Guide 2nd edn. in relation to Unicode 5.1, http://www.travelphrases.info/gallery/Test_IPA.html, http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm or http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm. Another resource for answering this kind of question (and maybe avoiding having to figure it out yourself!) is the SIL IPA93 conversion map or the IPA Unicode Keyboards. Typing in the beta using this keyboard results in character U+03B2.
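When you are unsure which character a keyboard or a copied text has actually given you, listing the code points settles it. The following is a minimal sketch; `code_points` is a hypothetical helper name, and the sharp s is included only as an example of a visually similar character.

```python
import unicodedata

def code_points(text):
    """List (code point, name) pairs so look-alike characters can be told apart."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(code_points("\u03B2"))  # the beta used for the voiced bilabial fricative
print(code_points("\u00DF"))  # a look-alike: the German sharp s
```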
Question: I want an open o with the serif at the top. I see that Unicode now has U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED and U+2184 LATIN SMALL LETTER REVERSED C. Can I use those instead of U+0186 LATIN CAPITAL LETTER OPEN O and U+0254 LATIN SMALL LETTER OPEN O?
Answer: U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED was added to the Roman numeral block. U+2184 LATIN SMALL LETTER REVERSED C was added for use as a Claudian letter. We do not recommend their use for anything other than what they were designed for. Please use U+0186 and U+0254 if you need an open o and find a font which has the serif where you want it. Our SIL Roman Unicode fonts now provide an alternate form, so if you have an application that can handle it, you can choose whether you want a top or bottom serif.
Question: I want a handwritten style a. Unicode has U+0251 LATIN SMALL LETTER ALPHA. Can I use that?
Answer: U+0251 LATIN SMALL LETTER ALPHA is in Unicode as an IPA symbol. Please do not use it instead of an "a". You should find a font which has a handwritten style "a" at U+0061 LATIN SMALL LETTER A. If you use U+0251 LATIN SMALL LETTER ALPHA you will have unexpected results with data analysis as well as when using uppercase/lowercase pairs.
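The kind of unexpected result you can run into is easy to reproduce. In this sketch the sample text is invented; it mixes U+0251 and U+0061 in otherwise identical words. In recent versions of Unicode the upper case partner of U+0251 is U+2C6D LATIN CAPITAL LETTER ALPHA, not "A".

```python
import unicodedata

ALPHA = "\u0251"  # LATIN SMALL LETTER ALPHA
text = f"b{ALPHA}t bat"  # the two words look alike in a handwritten-style font

# A search for the ordinary "a" finds only one of the two words.
a_count = text.count("a")
print(a_count)

# Upper-casing gives two different capitals.
upper_alpha = ALPHA.upper()
print(f"U+{ord(upper_alpha):04X}", unicodedata.name(upper_alpha))
print("a".upper())
```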
2009-02-25 LP: added glottal question