NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/TypeCasterUniCSTs

SIL Encore Fonts version 3.0

Unicode CSTs

Bob Hallissy, 1998-05-05

Introduction

If you put Trans Unicode at the top of your CST, the TypeCaster font compiler (version 3.0 or later) interprets access codes as Unicode values (rather than Windows or Ventura character codes). This makes it possible to build fonts that cover multiple codepages simultaneously. This is the same mechanism that allows the single Times New Roman font to supply Western, Greek, Turkish, Baltic, Central European, and Cyrillic scripts.

This document is neither a Unicode tutorial, nor a codepage tutorial, but assumes you already know about these subjects. Rather, this document covers how to use TypeCaster to generate fonts that cover multiple codepages by expressing the access codes as Unicode.

Using the TypeCaster Editor and Catalog with Unicode CSTs

The TypeCaster CST Editor cannot handle CST files that implement Unicode. You will have to edit such CSTs with Notepad or some other simple text editor. Even though the TypeCaster Editor cannot be used directly, there are several techniques to be aware of that will make creating a CST “by hand” a little easier.

Copy/Paste from the Catalog

You can use the TypeCaster Catalog application to obtain the SILID (and comment!) for insertion into a simple text editor. Select the glyph(s) you want in the catalog view and press Ctrl-C (or select Edit/Copy on the menu). Then switch to the text editor and press Ctrl-V (or select Edit/Paste on the menu) to paste the text in. The pasted text will include the SILID and comment field. For example, if you had both SILID 2403 and SILID 6103 selected in the catalog and used this technique, here is what would be pasted into your text editor:

2403    /* hooktop B */

6103    /* grave accent over upper case */

Now add in the access code you want for each of these, converting to a composite if needed:

X1234/    2403    /* hooktop B */

     /    6103    /* grave accent over upper case */

Use TypeCaster Editor, then open CST with Notepad

Another useful technique is to use the TypeCaster CST Editor to create a CST with the characters you need, including any composites, etc. Save the CST, open it with a text editor like Notepad, and use cut and paste to copy the entries to your Unicode CST. Finally, change the access codes to the Unicode values that you want.

Opening CSTs With Notepad

If you try to open a Unicode CST with the TypeCaster CST Editor, the Editor will display an error about not being able to process the CST file, and offer to open it with Notepad. While this is one way to get to a text editor, it isn’t the most direct. Here are some tricks that you may want to use.

From the main window of TypeCaster Compiler you can select a CST, then right-click on it, and you will get a context menu that includes Open With Notepad. This allows you to bypass the TypeCaster Editor step.

If you use Unicode CSTs a lot, you can configure TypeCaster to launch a program other than the TypeCaster Editor when you double-click a CST in the main TypeCaster window. From TypeCaster, select the Options/Environment menu, and then the CST Editor tab. Now select Other and enter (or Browse for) the pathname of the desired application (e.g., C:\Windows\Notepad.exe).

Finally, if you like to open CSTs from Windows Explorer, the following registry hack will add an “Open with Notepad” entry to the context menu for CST files, allowing you to bypass the TypeCaster Editor by right clicking on a CST. This assumes you know what you are doing with the registry — please be cautious when editing your registry!

REGEDIT4
[HKEY_CLASSES_ROOT\Fedit.Document\shell\edit]
"EditFlags"=hex:01,00,00,00
@="Open with &Notepad"
[HKEY_CLASSES_ROOT\Fedit.Document\shell\edit\command]
@="C:\\Windows\\Notepad.exe %1"

A slightly safer way to get similar functionality is to put a shortcut to Notepad in your Windows\SendTo folder. This allows you to right-click on any file (including CST files), select SendTo/Notepad, and have the file opened by Notepad.

Access Codes and Default Fill

In most cases you will probably want to specify your Unicode access codes in hexadecimal since this is the way the Unicode standard documents the character set. If, for example, you were putting the IPA extensions into your font, you might have lines in your CST such as:

/* U+0250 - U+02AF  IPA Extensions */

x0250   1101    /* turned a */

x0251   1002    /* cursive a */

x0252   1202    /* turned cursive a */

x0253   1403    /* hooktop b */

x0254   1104    /* open o */
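When a CST needs many such lines, they can be generated rather than typed. The following Python sketch is only an illustration: format_cst_entry is a hypothetical helper (not part of TypeCaster), and the column spacing simply imitates the listing above.

```python
def format_cst_entry(code_point, silid, comment):
    """Build one Unicode CST line: hex access code, 4-digit SILID, comment."""
    return f"x{code_point:04X}   {silid:04d}    /* {comment} */"


# (Unicode code point, SILID, comment) triples taken from the listing above.
ipa_entries = [
    (0x0250, 1101, "turned a"),
    (0x0251, 1002, "cursive a"),
    (0x0252, 1202, "turned cursive a"),
    (0x0253, 1403, "hooktop b"),
    (0x0254, 1104, "open o"),
]

for code_point, silid, comment in ipa_entries:
    print(format_cst_entry(code_point, silid, comment))
```

The SILID and comment for each glyph can be copied from the TypeCaster Catalog as described earlier.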

You may still specify access codes as decimal integers or single character ANSI constants for those values that make sense (usually 32-255). However, the Compiler does something special with 128-159.

Unicode reserves the block from character codes 128 (U+0080) to 159 (U+009F) as control codes, and so there isn’t a need to be able to define these in a font. Therefore, as a convenience to CST authors, access codes in this range (and only in this range) are assumed to be Windows character codes (codepage 1252) and are mapped to the equivalent Unicode character.

One of the useful side effects of this is that if you have not included either an Encode Symbol or a Fill command in your Unicode CST, then any codepage 1252 character codes (except one) that you have not specified in your CST will automatically be included in your font with the correct glyph. The only exception is Windows Character 183 which is an anomaly in the Windows system. In Windows 3.1, character 183 maps to Unicode U+2219 (BULLET OPERATOR) in OpenType fonts. For compatibility this is retained in present Windows systems but if an application asks the OS to convert text from ANSI to Unicode, character 183 will be mapped to U+00B7 (MIDDLE DOT) instead. TypeCaster chooses this latter mapping.
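The remapping of access codes 128-159 can be sketched in Python. This is an illustration under stated assumptions, not TypeCaster's own code: access_code_to_unicode is a hypothetical name, and the sketch relies on Python's built-in cp1252 codec, which reflects a current code page 1252 table and may differ in a few positions from the table TypeCaster 3.0 used in 1998.

```python
def access_code_to_unicode(code):
    """Map a decimal access code in a Unicode CST to a Unicode code point.

    Codes 128-159 are treated as code page 1252 characters; every other
    code is already a Unicode value. Returns None for the few positions
    that Python's cp1252 table leaves undefined.
    """
    if 128 <= code <= 159:
        try:
            return ord(bytes([code]).decode("cp1252"))
        except UnicodeDecodeError:
            return None
    return code


for code in (0x85, 0x93, 183):
    mapped = access_code_to_unicode(code)
    print(f"{code} -> U+{mapped:04X}")
```

Code 183 lies outside the 128-159 range, so it passes through unchanged as U+00B7, which matches the mapping TypeCaster chooses.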

The bottom line is this: If you want to build a font that has all the standard Windows characters (i.e., codepage 1252) plus some extra Unicode characters (e.g., additional codepage coverage), start with the following CST and add the Unicode access codes that you need:

Trans Unicode

Encode Normal

CodepageRange    /* See below for help in completing the   */

UnicodeRange    /* CodepageRange and UnicodeRange entries */

/* This is included to complete the Windows character set: */

x2219 0002        /* 1252: B7 ;bullet operator */

Codepage and Unicode coverage

Now that you can build a font containing several hundred glyphs and covering several codepages, you must tell Windows what codepages are actually covered by this font. You must also identify what Unicode blocks are covered. By the way, “coverage” is not a well-defined term. Microsoft does not specify what characters, or even what percentage of characters, you have to provide within a given codepage or Unicode block to qualify as covering that range or block. Use your own judgment: if you think you have enough to be useful, then tell Windows that you do cover that range or block.

Codepage and Unicode coverage are specified using two fields in the OpenType font. Each field is a bit vector in which each bit represents a codepage or Unicode block; a value of 1 indicates the font covers that codepage or block, and a value of 0 indicates it does not. The codepage range vector is 64 bits long, and the Unicode range vector is 128 bits long. The definitions for these bits, that is, what bit indicates what codepage or Unicode block, are given in the Appendix.

In a CST you specify the codepage and Unicode range vectors by two special commands at the top of the CST. Each command requires a comma-separated list of 32-bit numbers, each specified as 8 hexadecimal digits (leading zeros may be omitted). The CodePageRange command can accept up to two numbers (totaling 64 bits), and the UnicodeRange command can accept up to four numbers (totaling 128 bits). As an example, you might have the following commands at the top of your CST:

trans unicode

encode normal

codepage 97

unicode 8000027F,10006079

(All CST commands may be abbreviated to as few as two letters, so co is the same as codepage is the same as CodePageRange.)

The interpretation of the hexadecimal arguments is as follows. Each number represents 32 bits of the coverage vector. The first number represents bits 0-31, the next bits 32-63, etc. Within each number, the least significant bit is the lowest numbered, while the most significant bit is the highest numbered. Each hex digit, of course, represents 4 bits. The Unicode range command in the above example would be interpreted as follows:

8000027F    (bits 0-31)     bits 0-6, 9, and 31 are set

10006079    (bits 32-63)    bits 32, 35-38, 45, 46, and 60 are set

In this case, since only two of the 4 possible UnicodeRange parameters have been specified, the remaining two default to zero, so none of bits 64-127 are set.
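The bit numbering described above can be checked mechanically. The following Python sketch (decode_range is a hypothetical helper name, not a TypeCaster feature) expands a comma-separated argument list into the bit numbers it sets, which can then be looked up in the Appendix tables.

```python
def decode_range(arguments):
    """Return the bit numbers set by a CodePageRange/UnicodeRange argument list."""
    bits = []
    for word_index, word in enumerate(arguments.split(",")):
        value = int(word.strip(), 16)  # each argument is a 32-bit hex number
        for bit in range(32):
            if value & (1 << bit):
                # the first argument holds bits 0-31, the second 32-63, etc.
                bits.append(32 * word_index + bit)
    return bits


print(decode_range("8000027F,10006079"))
# [0, 1, 2, 3, 4, 5, 6, 9, 31, 32, 35, 36, 37, 38, 45, 46, 60]
print(decode_range("97"))
# [0, 1, 2, 4, 7]
```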

Applying the same interpretation and referring to the Codepage range bits table in the Appendix, we see that the line

codepage 97

indicates that this font covers the codepages represented by bits 0, 1, 2, 4, and 7, which would be codepages 1252 (Latin 1), 1250 (Latin 2: Eastern Europe), 1251 (Cyrillic), 1254 (Turkish), and 1257 (Windows Baltic).

For an Excel workbook that can calculate the hexadecimal values for Unicode and OS/2 range values, see OS/2 table Range bit calculation workbook.
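The same calculation is short enough to sketch in Python. This is an illustrative helper only (encode_range is a hypothetical name): it takes the bit numbers you want set, chosen from the Appendix tables, and produces the argument string for the command. Trailing zero words are dropped, since omitted parameters default to zero.

```python
def encode_range(bits, max_words):
    """Turn a list of bit numbers into a comma-separated list of hex words.

    max_words is 2 for CodePageRange (64 bits) and 4 for UnicodeRange (128 bits).
    """
    words = [0] * max_words
    for bit in bits:
        if not 0 <= bit < 32 * max_words:
            raise ValueError(f"bit {bit} is outside the {32 * max_words}-bit vector")
        words[bit // 32] |= 1 << (bit % 32)
    # Omitted trailing arguments default to zero, so drop them (keep at least one).
    while len(words) > 1 and words[-1] == 0:
        words.pop()
    return ",".join(f"{word:X}" for word in words)


print("codepage", encode_range([0, 1, 2, 4, 7], max_words=2))
# codepage 97
print("unicode", encode_range([0, 1, 2, 3, 4, 5, 6, 9, 31, 32, 35, 36, 37, 38, 45, 46, 60], max_words=4))
# unicode 8000027F,10006079
```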

Testing your font

If you have built and installed what you think is a Unicode font that covers multiple codepages, there are a number of ways you can test to see if you got the codepage and Unicode ranges right. First off, open WordPad and select your font from the pull-down menu. Then drop down the Font Script list and see what character sets are listed — it should match your CodepageRange data. Compare with the available Times New Roman fonts.

A useful tool available free from Microsoft’s web site is called the Font Properties extension. It adds property pages (“tabs”) to the dialog you get when you right-click on an OpenType font file and select Properties from the menu. In particular, one of the added pages enumerates the codepages and Unicode blocks supported by the font. This is the most direct way to see if you got the UnicodeRange and CodePageRange right.
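The two range fields can also be read straight from the compiled font file. The following Python sketch is an assumption-laden illustration rather than a supported tool: it handles only a single-font TrueType/OpenType file (not a .ttc collection), read_os2_ranges is a hypothetical helper name, and it relies on the OS/2 table layout in which the four Unicode range words start at byte 42 and the two codepage range words start at byte 78 (present from table version 1 onward).

```python
import struct


def read_os2_ranges(font_bytes):
    """Return (unicode_words, codepage_words) from the OS/2 table of a font."""
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    # The table directory follows the 12-byte header; each record is 16 bytes.
    for index in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * index
        )
        if tag == b"OS/2":
            break
    else:
        raise ValueError("font has no OS/2 table")

    version = struct.unpack_from(">H", font_bytes, offset)[0]
    unicode_words = list(struct.unpack_from(">4I", font_bytes, offset + 42))
    codepage_words = [0, 0]  # version 0 tables carry no codepage range
    if version >= 1 and length >= 86:
        codepage_words = list(struct.unpack_from(">2I", font_bytes, offset + 78))
    return unicode_words, codepage_words


# Usage (the file name is a placeholder for your own compiled font):
# with open("MyFont.ttf", "rb") as handle:
#     unicode_words, codepage_words = read_os2_ranges(handle.read())
# print(",".join(f"{word:X}" for word in unicode_words))
```

The words come back in the same order as the CST arguments, so they can be compared directly with the UnicodeRange and CodePageRange lines in your CST.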

Appendix

Unicode Range bits

The following table is excerpted from the TrueType font specification, v1.1

Bit Description
0 Basic Latin
1 Latin-1 Supplement
2 Latin Extended-A
3 Latin Extended-B
4 IPA Extensions
5 Spacing Modifier Letters
6 Combining Diacritical Marks
7 Basic Greek
8 Greek Symbols And Coptic
9 Cyrillic
10 Armenian
11 Basic Hebrew
12 Hebrew Extended (A and B blocks)
13 Basic Arabic
14 Arabic Extended
15 Devanagari
16 Bengali
17 Gurmukhi
18 Gujarati
19 Oriya
20 Tamil
21 Telugu
22 Kannada
23 Malayalam
24 Thai
25 Lao
26 Basic Georgian
27 Georgian Extended
28 Hangul Jamo
29 Latin Extended Additional
30 Greek Extended
31 General Punctuation
32 Superscripts And Subscripts
33 Currency Symbols
34 Combining Diacritical Marks For Symbols
35 Letterlike Symbols
36 Number Forms
37 Arrows
38 Mathematical Operators
39 Miscellaneous Technical
40 Control Pictures
41 Optical Character Recognition
42 Enclosed Alphanumerics
43 Box Drawing
44 Block Elements
45 Geometric Shapes
46 Miscellaneous Symbols
47 Dingbats
48 CJK Symbols And Punctuation
49 Hiragana
50 Katakana
51 Bopomofo
52 Hangul Compatibility Jamo
53 CJK Miscellaneous
54 Enclosed CJK Letters And Months
55 CJK Compatibility
56 Hangul
57 Reserved for Unicode SubRanges
58 Reserved for Unicode SubRanges
59 CJK Unified Ideographs
60 Private Use Area
61 CJK Compatibility Ideographs
62 Alphabetic Presentation Forms
63 Arabic Presentation Forms-A
64 Combining Half Marks
65 CJK Compatibility Forms
66 Small Form Variants
67 Arabic Presentation Forms-B
68 Halfwidth And Fullwidth Forms
69 Specials
70–127 Reserved for Unicode SubRanges


Codepage range bits

The following table is excerpted from the TrueType font specification, v1.1

Bit Code Page Description
ANSI:    
0 1252 Latin 1
1 1250 Latin 2: Eastern Europe
2 1251 Cyrillic
3 1253 Greek
4 1254 Turkish
5 1255 Hebrew
6 1256 Arabic
7 1257 Windows Baltic
8–15   Reserved for Alternate ANSI
ANSI and OEM:    
16 874 Thai
17 932 JIS/Japan
18 936 Chinese: Simplified chars--PRC and Singapore
19 949 Korean Wansung
20 950 Chinese: Traditional chars--Taiwan and Hong Kong
21 1361 Korean Johab
22–28   Reserved for Alternate ANSI & OEM
29   Macintosh Character Set (US Roman)
30   OEM Character Set
31   Symbol Character Set
OEM:    
32–47   Reserved for OEM
48 869 IBM Greek
49 866 MS-DOS Russian
50 865 MS-DOS Nordic
51 864 Arabic
52 863 MS-DOS Canadian French
53 862 Hebrew
54 861 MS-DOS Icelandic
55 860 MS-DOS Portuguese
56 857 IBM Turkish
57 855 IBM Cyrillic; primarily Russian
58 852 Latin 2
59 775 MS-DOS Baltic
60 737 Greek; former 437 G
61 708 Arabic; ASMO 708
62 850 WE/Latin 1
63 437 US


Internet Resources

 Character sets

 Character sets and codepages

 OpenType font file format

 Font properties extension


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.