Short URL: https://scripts.sil.org/TypeCasterUniCSTs

SIL Encore Fonts version 3.0

Unicode CSTs

Bob Hallissy, 1998-05-05

Introduction
Using the TypeCaster Editor and Catalog with Unicode CSTs
Access Codes and Default Fill
Codepage and Unicode coverage
Testing your font
Appendix

Introduction

If you put Trans Unicode at the top of your CST, the TypeCaster font compiler (version 3.0 or later) interprets access codes as Unicode values (rather than Windows or Ventura character codes). This makes it possible to build fonts that cover multiple codepages simultaneously. This is the same mechanism that allows the single Times New Roman font to supply Western, Greek, Turkish, Baltic, Central European, and Cyrillic scripts.

This document is neither a Unicode tutorial nor a codepage tutorial; it assumes you already know about these subjects. It covers how to use TypeCaster to generate fonts that cover multiple codepages by expressing the access codes as Unicode.

Using the TypeCaster Editor and Catalog with Unicode CSTs

The TypeCaster CST Editor cannot handle CST files that implement Unicode. You will have to edit such CSTs with Notepad or some other simple text editor. Even though the TypeCaster Editor cannot be used directly, there are several techniques to be aware of that will make creating a CST “by hand” a little easier.

Copy/Paste from the Catalog

You can use the TypeCaster Catalog application to obtain the SILID (and comment!) for insertion into a simple text editor. Select the glyph(s) you want in the catalog view and press Ctrl-C (or select Edit/Copy on the menu). Then switch to the text editor and press Ctrl-V (or select Edit/Paste on the menu) to paste the text in. The pasted text will include the SILID and comment field. For example, if you had both SILID 2403 and SILID 6103 selected in the catalog and used this technique, here is what would be pasted into your text editor:

2403    /* hooktop B */

6103    /* grave accent over upper case */

Now add in the access code you want for each of these, converting to a composite if needed:

X1234/    2403    /* hooktop B */

     /    6103    /* grave accent over upper case */
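When many glyphs are pasted from the Catalog, adding the access codes can be scripted. The following is a minimal Python sketch, not part of TypeCaster: the helper name `cst_entry` is hypothetical, it assumes the hexadecimal `x....` access code form, and it handles only simple (non-composite) entries.

```python
def cst_entry(codepoint: int, catalog_line: str) -> str:
    """Prefix a line pasted from the Catalog with a hex Unicode access code."""
    # Assumption: a simple entry is "<access code> <SILID> <comment>".
    return f"x{codepoint:04X}    {catalog_line.strip()}"


# U+0181 is LATIN CAPITAL LETTER B WITH HOOK.
print(cst_entry(0x0181, "2403    /* hooktop B */"))
# prints: x0181    2403    /* hooktop B */
```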

Use TypeCaster Editor, then open CST with Notepad

Another useful technique is to use the TypeCaster CST Editor to create a CST with the characters you need, including any composites. Save the CST, open it with a text editor like Notepad, and use cut and paste to copy the entries to your Unicode CST. Finally, change the access codes to the Unicode values that you want.

Opening CSTs With Notepad

If you try to open a Unicode CST with the TypeCaster CST Editor, the Editor will display an error about not being able to process the CST file, and offer to open it with Notepad. While this is one way to get to a text editor, it isn’t the most direct. Here are some tricks that you may want to use.

From the main window of TypeCaster Compiler you can select a CST, then right-click on it, and you will get a context menu that includes Open With Notepad. This allows you to bypass the TypeCaster Editor step.

If you use Unicode CSTs a lot, you can configure TypeCaster to launch a program other than the TypeCaster Editor when you double-click a CST in the main TypeCaster window. From TypeCaster, select the Options/Environment menu, and then the CST Editor tab. Now select Other and enter (or Browse for) the pathname of the desired application (e.g., C:\Windows\Notepad.exe).

Finally, if you like to open CSTs from Windows Explorer, the following registry hack will add an “Open with Notepad” entry to the context menu for CST files, allowing you to bypass the TypeCaster Editor by right clicking on a CST. This assumes you know what you are doing with the registry — please be cautious when editing your registry!

REGEDIT4
[HKEY_CLASSES_ROOT\Fedit.Document\shell\edit]
"EditFlags"=hex:01,00,00,00
@="Open with &Notepad"
[HKEY_CLASSES_ROOT\Fedit.Document\shell\edit\command]
@="C:\\Windows\\Notepad.exe %1"

A slightly safer way to get similar functionality is to put a shortcut to Notepad in your Windows\SendTo folder. This allows you to right-click on any file (including CST files), select SendTo/Notepad, and have the file opened by Notepad.

Access Codes and Default Fill

In most cases you will probably want to specify your Unicode access codes in hexadecimal since this is the way the Unicode standard documents the character set. If, for example, you were putting the IPA extensions into your font, you might have lines in your CST such as:

/* U+0250 - U+02AF  IPA Extensions */

x0250   1101    /* turned a */

x0251   1002    /* cursive a */

x0252   1202    /* turned cursive a */

x0253   1403    /* hooktop b */

x0254   1104    /* open o */

You may still specify access codes as decimal integers or single character ANSI constants for those values that make sense (usually 32-255). However, the Compiler does something special with 128-159.

Unicode reserves the block from character codes 128 (U+0080) to 159 (U+009F) as control codes, and so there isn’t a need to be able to define these in a font. Therefore, as a convenience to CST authors, access codes in this range (and only in this range) are assumed to be Windows character codes (codepage 1252) and are mapped to the equivalent Unicode character.
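The rule above can be sketched in Python. This is an illustration, not TypeCaster's own code: the function names are hypothetical, only the hexadecimal (`x0250`) and decimal forms are parsed because the syntax of single-character constants is not shown here, and Python's `cp1252` codec stands in for the Windows mapping. That codec leaves five bytes in this range unassigned; how TypeCaster treats them is not stated, so the sketch returns `None` for them.

```python
def parse_access_code(token: str) -> int:
    """Return the numeric value of a hex ("x0250") or decimal ("147") access code."""
    if token[:1] in ("x", "X"):
        return int(token[1:], 16)
    if token.isascii() and token.isdigit():
        return int(token)
    raise ValueError(f"unsupported access code syntax: {token!r}")


def resolve_access_code(code: int):
    """Map a numeric access code to a Unicode code point.

    Codes 128-159 are treated as codepage 1252 bytes; every other code
    is taken to be a Unicode value already.
    """
    if 128 <= code <= 159:
        try:
            return ord(bytes([code]).decode("cp1252"))
        except UnicodeDecodeError:
            return None  # no assignment in Python's cp1252 table
    return code


for token in ("x0250", "147", "x2219"):
    codepoint = resolve_access_code(parse_access_code(token))
    print(token, "->", f"U+{codepoint:04X}")
# prints:
# x0250 -> U+0250
# 147 -> U+201C
# x2219 -> U+2219
```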

One useful side effect of this is that if you have not included either an Encode Symbol or a Fill command in your Unicode CST, then any codepage 1252 character codes (except one) that you have not specified in your CST will automatically be included in your font with the correct glyph. The only exception is Windows character 183, which is an anomaly in the Windows system. In Windows 3.1, character 183 maps to Unicode U+2219 (BULLET OPERATOR) in TrueType fonts. For compatibility this is retained in present Windows systems, but if an application asks the OS to convert text from ANSI to Unicode, character 183 is mapped to U+00B7 (MIDDLE DOT) instead. TypeCaster chooses this latter mapping.

The bottom line is this: If you want to build a font that has all the standard Windows characters (i.e., codepage 1252) plus some extra Unicode characters (e.g., additional codepage coverage), start with the following CST and add the Unicode access codes that you need:

Trans Unicode

Encode Normal

CodepageRange    /* See below for help in completing the   */

UnicodeRange    /* CodepageRange and UnicodeRange entries */

/* This is included to complete the Windows character set: */

x2219 0002        /* 1252: B7 ;bullet operator */

Codepage and Unicode coverage

Now that you can build a font containing several hundred glyphs and covering several codepages, you must tell Windows what codepages are actually covered by this font. You must also identify what Unicode blocks are covered. By the way, “coverage” is not a well-defined term. Microsoft does not specify what characters, or even what percentage of characters, you have to provide within a given codepage or Unicode block to qualify as covering that range or block. Use your own judgment: if you think you have enough to be useful, then tell Windows that you do cover that range or block.

Codepage and Unicode coverage is specified using two fields in the OpenType font. Each field is a bit vector wherein each bit represents a codepage or Unicode block; a value of 1 indicates the font covers that codepage or block, and a value of zero indicates the font does not. The codepage range vector is 64 bits long, and the Unicode range vector is 128 bits long. The definitions for these bits, that is, what bit indicates what codepage or Unicode block, are given in the Appendix.

In a CST you specify the codepage and Unicode range vectors by two special commands at the top of the CST. Each command requires a comma-separated list of 32-bit numbers, each specified as 8 hexadecimal digits (leading zeros may be omitted). The CodePageRange command can accept up to two numbers (totaling 64 bits), and the UnicodeRange command can accept up to four numbers (totaling 128 bits). As an example, you might have the following commands at the top of your CST:

trans unicode

encode normal

codepage 97

unicode 8000027F,10006079

(All CST commands may be abbreviated to as few as two letters, so co is the same as codepage is the same as CodePageRange.)

The interpretation of the hexadecimal arguments is as follows. Each number represents 32 bits of the coverage vector. The first number represents bits 0-31, the next bits 32-63, etc. Within each number, the least significant bit is the lowest numbered, while the most significant bit is the highest numbered. Each hex digit, of course, represents 4 bits. In the Unicode range command in the above example, the first number, 8000027F, sets bits 0-6, 9, and 31; the second number, 10006079, sets bits 32, 35-38, 45, 46, and 60.

In this case, since only two of the 4 possible UnicodeRange parameters have been specified, the remaining two default to zero, so none of bits 64-127 are set.
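The bit numbering described above can be checked with a short Python sketch. The function name `set_bits` is hypothetical and the script is not part of TypeCaster; it only applies the stated rule that each comma-separated number holds 32 bits, least significant bit first.

```python
def set_bits(argument: str) -> list[int]:
    """Return the bit numbers set by a CodePageRange or UnicodeRange argument."""
    bits = []
    for word_index, word in enumerate(argument.split(",")):
        value = int(word, 16)  # each number is up to 8 hex digits
        for bit in range(32):
            if value & (1 << bit):
                # Word 0 holds bits 0-31, word 1 holds bits 32-63, and so on.
                bits.append(32 * word_index + bit)
    return bits


print(set_bits("8000027F,10006079"))
# prints: [0, 1, 2, 3, 4, 5, 6, 9, 31, 32, 35, 36, 37, 38, 45, 46, 60]
```

Looking these numbers up in the Unicode range table in the Appendix gives, for example, bit 4 (IPA Extensions), bit 9 (Cyrillic), and bit 60 (Private Use Area).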

Referring to the Codepage range bits table in the Appendix and reading the argument the same way, we see that the line

codepage 97

indicates that this font covers the codepages represented by bits 0, 1, 2, 4, and 7, which would be codepages 1252 (Latin 1), 1250 (Latin 2: Eastern Europe), 1251 (Cyrillic), 1254 (Turkish), and 1257 (Windows Baltic).

For an Excel workbook that can calculate the hexadecimal values for Unicode and OS/2 range values, see OS/2 table Range bit calculation workbook.
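The same calculation can be sketched in a few lines of Python. This is an independent illustration, not the workbook's method; the function name `range_argument` is hypothetical, and dropping trailing zero numbers assumes that omitted numbers default to zero for both commands, as described above for UnicodeRange.

```python
def range_argument(bits, words: int) -> str:
    """Format bit numbers as the comma-separated hex numbers used in a CST.

    `words` is the number of 32-bit numbers the command accepts
    (2 for CodePageRange, 4 for UnicodeRange).
    """
    values = [0] * words
    for bit in bits:
        if not 0 <= bit < 32 * words:
            raise ValueError(f"bit {bit} is out of range")
        values[bit // 32] |= 1 << (bit % 32)
    # Drop trailing zero numbers, keeping at least one.
    while len(values) > 1 and values[-1] == 0:
        values.pop()
    return ",".join(f"{value:X}" for value in values)


# Codepages 1252, 1250, 1251, 1254, and 1257 are bits 0, 1, 2, 4, and 7.
print("codepage", range_argument([0, 1, 2, 4, 7], words=2))
# prints: codepage 97
```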

Testing your font

If you have built and installed what you think is a Unicode font that covers multiple codepages, there are a number of ways you can test to see if you got the codepage and Unicode ranges right. First off, open WordPad and select your font from the pull-down menu. Then drop down the Font Script list and see what character sets are listed — it should match your CodepageRange data. Compare with the available Times New Roman fonts.

A useful tool available free from Microsoft’s web site is called the Font Properties extension. It adds property pages (“tabs”) to the dialog you get when you right-click on an OpenType font file and select Properties from the menu. In particular, one of the added pages enumerates the codepages and Unicode blocks supported by the font. This is the most direct way to see if you got the UnicodeRange and CodePageRange right.
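If that extension is not available, the two range fields can also be read straight from the font file. The Python sketch below is an assumption-laden illustration rather than a supported tool: it expects a single-font TrueType or OpenType file (not a collection), reads the `ulUnicodeRange1`-`4` and `ulCodePageRange1`-`2` fields from the OS/2 table at their standard offsets, and uses hypothetical function names.

```python
import struct


def read_os2_ranges(data: bytes):
    """Return (unicode_words, codepage_words) from raw font file bytes."""
    num_tables = struct.unpack_from(">H", data, 4)[0]
    # The table directory starts after the 12-byte header; 16 bytes per record.
    for index in range(num_tables):
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * index
        )
        if tag == b"OS/2":
            break
    else:
        raise ValueError("no OS/2 table found")
    version = struct.unpack_from(">H", data, offset)[0]
    # ulUnicodeRange1-4 start 42 bytes into the OS/2 table.
    unicode_words = struct.unpack_from(">4I", data, offset + 42)
    # ulCodePageRange1-2 exist only in OS/2 version 1 and later, at byte 78.
    codepage_words = (
        struct.unpack_from(">2I", data, offset + 78) if version >= 1 else ()
    )
    return unicode_words, codepage_words


def as_cst_argument(words) -> str:
    """Format 32-bit values the way they are written in a CST."""
    return ",".join(f"{word:X}" for word in words)


# Example use (the file name is a placeholder):
#   with open("MyFont.ttf", "rb") as handle:
#       unicode_words, codepage_words = read_os2_ranges(handle.read())
#   print("unicode", as_cst_argument(unicode_words))
#   print("codepage", as_cst_argument(codepage_words))
```

The printed values can then be compared with the UnicodeRange and CodePageRange lines in the CST; trailing zeros in the output correspond to numbers omitted from the CST.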

Appendix

Unicode Range bits

The following table is excerpted from the TrueType font specification, v1.1.

Bit	Description
0	Basic Latin
1	Latin-1 Supplement
2	Latin Extended-A
3	Latin Extended-B
4	IPA Extensions
5	Spacing Modifier Letters
6	Combining Diacritical Marks
7	Basic Greek
8	Greek Symbols And Coptic
9	Cyrillic
10	Armenian
11	Basic Hebrew
12	Hebrew Extended (A and B blocks)
13	Basic Arabic
14	Arabic Extended
15	Devanagari
16	Bengali
17	Gurmukhi
18	Gujarati
19	Oriya
20	Tamil
21	Telugu
22	Kannada
23	Malayalam
24	Thai
25	Lao
26	Basic Georgian
27	Georgian Extended
28	Hangul Jamo
29	Latin Extended Additional
30	Greek Extended
31	General Punctuation
32	Superscripts And Subscripts
33	Currency Symbols
34	Combining Diacritical Marks For Symbols
35	Letterlike Symbols
36	Number Forms
37	Arrows
38	Mathematical Operators
39	Miscellaneous Technical
40	Control Pictures
41	Optical Character Recognition
42	Enclosed Alphanumerics
43	Box Drawing
44	Block Elements
45	Geometric Shapes
46	Miscellaneous Symbols
47	Dingbats
48	CJK Symbols And Punctuation
49	Hiragana
50	Katakana
51	Bopomofo
52	Hangul Compatibility Jamo
53	CJK Miscellaneous
54	Enclosed CJK Letters And Months
55	CJK Compatibility
56	Hangul
57	Reserved for Unicode SubRanges
58	Reserved for Unicode SubRanges
59	CJK Unified Ideographs
60	Private Use Area
61	CJK Compatibility Ideographs
62	Alphabetic Presentation Forms
63	Arabic Presentation Forms-A
64	Combining Half Marks
65	CJK Compatibility Forms
66	Small Form Variants
67	Arabic Presentation Forms-B
68	Halfwidth And Fullwidth Forms
69	Specials
70–127	Reserved for Unicode SubRanges


Codepage range bits

The following table is excerpted from the TrueType font specification, v1.1.

Bit	Code Page	Description
ANSI:
0	1252	Latin 1
1	1250	Latin 2: Eastern Europe
2	1251	Cyrillic
3	1253	Greek
4	1254	Turkish
5	1255	Hebrew
6	1256	Arabic
7	1257	Windows Baltic
8–15		Reserved for Alternate ANSI
ANSI and OEM:
16	874	Thai
17	932	JIS/Japan
18	936	Chinese: Simplified chars--PRC and Singapore
19	949	Korean Wansung
20	950	Chinese: Traditional chars--Taiwan and Hong Kong
21	1361	Korean Johab
22–28		Reserved for Alternate ANSI & OEM
29		Macintosh Character Set (US Roman)
30		OEM Character Set
31		Symbol Character Set
OEM:
32-47		Reserved for OEM
48	869	IBM Greek
49	866	MS-DOS Russian
50	865	MS-DOS Nordic
51	864	Arabic
52	863	MS-DOS Canadian French
53	862	Hebrew
54	861	MS-DOS Icelandic
55	860	MS-DOS Portuguese
56	857	IBM Turkish
57	855	IBM Cyrillic; primarily Russian
58	852	Latin 2
59	775	MS-DOS Baltic
60	737	Greek; former 437 G
61	708	Arabic; ASMO 708
62	850	WE/Latin 1
63	437	US


Internet Resources

Character sets

Character sets and codepages

OpenType font file format

Font properties extension

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI).