NRSI: Computers & Writing Systems
Challenges in publishing with non-Roman scripts
There are a number of challenges in the typesetting of non-Roman scripts. These include problems of interaction between the font and typesetting system, problems of very large character sets, and considerations of typesetting the particular script. Happily, the advent of new computer technology has considerably lessened some of these problems.
Two parts of the needed technology, which are now available, are Unicode and smart fonts. If the application uses Unicode as its underlying encoding then document exchange is easier and the difficulties of very large character sets are lessened. With the advent of smart fonts, much of the behavior of non-Roman scripts can reside in the font rather than the typesetting system. This, of course, means that whatever typesetting application is used must have the ability to use such fonts.
It is very important to study the typesetting, printing, and typographic traditions and history in the region where each particular script is in use. Without this, the typesetter may impose his or her (most likely Roman) understanding of ‘good’ typesetting. If a published set of guidelines cannot be found, studying a variety of published books in that script and talking to publishers to see what the rules of good design are can be invaluable.
Although there may be a typesetting or publishing application already in existence for a particular script, it is unlikely to take into account the differences minority languages will have from the majority language that uses the same script.
This section addresses technical and design challenges that often arise when publishing texts in non-Roman scripts. It is by no means exhaustive, but is intended as a starting point for those interested in this area of publishing.
1. Technical Challenges
Computers were designed to use left-to-right (LTR) scripts. Examples of LTR scripts include Roman, Cyrillic, Ethiopic, and Indic scripts. There are also absolute right-to-left (RTL) scripts (some instances of Hebrew), mixed RTL (Assyrian and Arabic), top-to-bottom-RTL (Chinese and Japanese are examples of these), and top-to-bottom-LTR (Mongolian). These are the most common types in use today. There are, of course, other possibilities, such as boustrophedon, in which the direction of writing alternates from line to line. Typesetting historic texts may require handling such systems as well, but this document addresses only issues relating to the most common types.
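As a rough illustration of how software can tell horizontal directions apart, the sketch below reads the Unicode bidirectional class of the first strongly directional character in a string. It uses Python's standard unicodedata module; the function name dominant_direction is an assumption made for this example. Vertical direction is a layout decision rather than a character property, so it cannot be detected this way.

```python
import unicodedata


def dominant_direction(text):
    """Guess horizontal direction from the first strong bidi character.

    Returns "RTL", "LTR", or "neutral" (no strong character found).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "RTL"
        if bidi_class == "L":  # Roman, Cyrillic, Ethiopic, Indic, ...
            return "LTR"
    return "neutral"  # digits, punctuation, and spaces only


if __name__ == "__main__":
    for sample in ("Roman", "\u05e9\u05dc\u05d5\u05dd", "\u0633\u0644\u0627\u0645", "2001"):
        print(repr(sample), dominant_direction(sample))
```

A full implementation would follow the Unicode Bidirectional Algorithm, which also resolves numbers and punctuation inside mixed-direction lines.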
1.2. Page Design
1.2.1. Overall design
The setup of a page is especially affected in RTL and top-to-bottom books. Even-numbered pages now fall on the right side and odd-numbered pages on the left, as seen in Figure 1 (page numbers are on the lower outside margin) and Figure 2. Thus, even and odd pages have margins that are the opposite of those in an LTR book.
Figure 1. Vertical headers in a top-to-bottom-RTL script (Chinese)
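The mirrored page setup described above can be expressed as a small rule. This is only a sketch: the function names and the binding parameter are assumptions for illustration, and it assumes that page 1 is a right-hand page in an LTR book and a left-hand page in an RTL book.

```python
def page_side(page_number, binding="LTR"):
    """Return the side of the spread ("left" or "right") a page falls on.

    binding="LTR": odd pages on the right, even pages on the left.
    binding="RTL": the mirror image, as in RTL and top-to-bottom-RTL books.
    """
    is_odd = page_number % 2 == 1
    if binding == "LTR":
        return "right" if is_odd else "left"
    return "left" if is_odd else "right"


def outside_margin(page_number, binding="LTR"):
    """The outside margin lies on the same side as the page itself."""
    return page_side(page_number, binding)


if __name__ == "__main__":
    for number in (1, 2):
        print(number, page_side(number, "LTR"), page_side(number, "RTL"))
```

A page-layout program would use such a rule to decide where to place outside-margin page numbers and which margin to widen for the binding.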
1.2.2. Headers and Footers
Another issue that needs to be examined is what headers, footers and footnotes look like with vertical text. Do they run across the top and bottom of the page or do they run down the far left or far right of the page? The publishing application must be able to handle these unusual header and footer types.
Figure 2. Horizontal headers in a top-to-bottom-RTL script (Japanese)
A vertical text with vertical headers is seen in Figure 1 while the body of Figure 2 is vertical with vertical footnotes but has LTR horizontal headers and footers. Examples of Chinese and Korean with RTL horizontal headers have also been seen.
Figure 3. Vertical headers in a top-to-bottom-LTR script (Mongolian)
Figure 3 has a vertical body text with both a horizontal header for the page number and a vertical header on the outside pages (which contain first and last dictionary entries). Figure 4 has a vertical body text with a header across the top of the page while the words in the header are still running vertically LTR. What a rich variety of possibilities!
Figure 4. Columns in a top-to-bottom-LTR script (Xibo)
Columns, of course, must be handled differently when a vertical script is in use. Figure 4 shows a dictionary which is set in three columns. The text reads LTR, then flows to the next column and again begins at the left and flows right. Obviously, column behavior for RTL vertical scripts would be opposite to this.
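The column flow just described can be sketched as follows. The sketch assumes that each "column" of a vertical-script page is a band holding a fixed number of vertical lines and that bands follow one another in reading order; the function name and the (band, slot) representation are hypothetical.

```python
def place_vertical_lines(num_lines, lines_per_band, progression="LTR"):
    """Assign each vertical line a (band, slot) position.

    band: index of the column band, in reading order.
    slot: horizontal position inside the band, counted from the left edge.
    progression: "LTR" fills each band from the left (as in Xibo or
    Mongolian), "RTL" fills it from the right (as in Chinese or Japanese).
    """
    placements = []
    for i in range(num_lines):
        band, position = divmod(i, lines_per_band)
        if progression == "LTR":
            slot = position
        else:
            slot = lines_per_band - 1 - position
        placements.append((band, slot))
    return placements


if __name__ == "__main__":
    print(place_vertical_lines(5, 3, "LTR"))
    print(place_vertical_lines(5, 3, "RTL"))
```

Only the slot calculation differs between the two progressions, which is the sense in which RTL vertical scripts behave as the opposite of the Xibo example.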
The ability to typeset diglots is also very important. This could take the form of using the same script (Figure 5) or two differing scripts (Figure 6).
Figure 5. Diglot using the same script (Devanagari) in two languages (Gurung/Nepali)
Figure 6. Diglot using differing scripts (Lanna/Thai)
1.3. Paragraph Formatting
1.3.1. Bullets and Indents
RTL behavior needs to be properly implemented with regard to bullets (see Figure 7) and paragraph indents (see Figure 9). Vertical scripts also need proper bullets (see Figure 8) and paragraph indenting. As with any formatting decision, the typesetter needs to examine a great variety of examples to see what is considered normal and beautiful. A typesetter who comes to Arabic from a Roman perspective would probably choose smaller bullets than those used in Figure 7, but readers of Arabic may well consider the larger bullets beautiful.
Figure 7. Right aligned bullets in RTL text (Arabic)
Figure 8. Top aligned bullets in vertical text (Mongolian)
Figure 9. Right indented paragraph in RTL text (Arabic)
1.3.2. “Hanging” verse numbers
While it is fairly unusual to see “hanging” verse numbers in Roman scriptures today, they are commonly used in non-Roman scriptures. Hanging verse numbers are never easy to implement, whether in Roman or in non-Roman typesetting. The Lao New Testament in Figure 10 uses hanging chapter and verse numbers (which appear down the far left of the page). Another instance of hanging verse numbers is seen in Figure 11 (the small digits at the top of the page).
Figure 10. Hanging verse numbers in LTR text (Lao)
Figure 11. Hanging verse numbers in top-to-bottom RTL text (Chinese)
1.3.3. Mixed direction
Mixed direction text can be especially interesting. LTR behavior of numbers in Arabic text is seen in Figure 12. Figure 13 shows justification problems the application had when RTL text fell at the beginning of an LTR line but at the end of the RTL text: note the undesirable extra white space at the left edge of lines 4 and 6 and at the right edge of lines 3 and 5. Figure 14 illustrates how Roman text is set vertically with Mongolian. It appears to be standard in Chinese, Japanese and Mongolian to rotate the Roman text 90 degrees rather than stack letters vertically. Some texts stack rather than rotate, but generally only for very short runs of Roman text (3-4 glyphs).
Figure 12. LTR numbers in RTL text (Arabic)
Figure 13. LTR text (Cyrillic) with RTL words (Arabic)
Figure 14. Top-to-bottom-LTR script (Mongolian) containing text normally written horizontally (Roman)
When set horizontally, Chinese is typically LTR as seen in Figure 15, but Figure 16 has Chinese set RTL as a result of the RTL behavior of Qazaq, written with Arabic script. A Chinese text which is set LTR with IPA and Mongolian “in line” is seen in Figure 17.
Figure 15. LTR (Chinese/Roman) and RTL (Arabic)
Figure 16. Text normally set LTR (Chinese) is set RTL because of RTL paragraph (Arabic)
Figure 17. Mongolian, Chinese, and IPA all written LTR
1.3.4. Paragraph breaks
Some scripts do not break paragraphs as in Roman scripts. For example, ancient Ethiopic uses a paragraph separator mark (፨) rather than beginning the next paragraph on a new line.
Baselines can also vary: whether a baseline is sloping (Urdu), hanging (Devanagari, Tibetan), or centered (Chinese) will affect line spacing (otherwise known as leading) as well as the space above and below paragraphs. When the baseline is sloping, the height of a word grows as the length of the word increases. This makes setting the leading especially challenging.
The type of baseline will also affect underlining of text: if the baseline is hanging or sloping what is the best way to underline that text? One would need to study conventions used with each script.
In the Tibetan example in Figure 18, there appears to be a wavy underline almost as far down as the top of the next line. Without knowing the rules of the writing system one cannot know its purpose. A cursory look at The Unicode Standard Version 3.0 (The Unicode Consortium 2000) shows that the small circles under several of the glyphs actually represent something similar to underlining in Roman typography, thus the wavy underline most likely represents some other form of emphasis. It is important to verify such things rather than to make assumptions based on the design guidelines one knows best. Figure 19 shows another example of underlining. One can see in both the Lanna (left) and Thai (right) titles that the underline does not cross the descenders. Although technically more difficult, this is more aesthetically appealing than if the underline crossed the descenders or was set below them.
Figure 18. "Underlining" with long descenders (Tibetan)
Figure 19. Underlining with long descenders (Lanna/Thai)
1.5. Line Breaking
In typesetting with Roman script, text is justified on a line by first seeing how much fits on a line, then checking to see if there is a word break there (for example, a space), next checking to see if a word can be broken (hyphenated) at that point, then adding space between words and finally (although strongly deprecated!) between letters to fill out the line.
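The first and last steps of this procedure (fitting words on a line, then adding space between words) can be sketched in a few lines. This is a minimal greedy illustration under simplifying assumptions: widths are counted in characters, as with a monospaced font, and hyphenation and letterspacing are left out. The function name justify is chosen for this example only.

```python
def justify(text, width):
    """Greedy line filling followed by padding of the inter-word gaps."""
    lines, current = [], []
    for word in text.split():
        # Start a new line when adding this word would exceed the measure.
        if current and len(" ".join(current + [word])) > width:
            lines.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(current)

    result = []
    for number, words in enumerate(lines):
        gaps = len(words) - 1
        is_last = number == len(lines) - 1
        if is_last or gaps == 0:
            # The last line and single-word lines are left ragged.
            result.append(" ".join(words))
            continue
        spaces = width - sum(len(w) for w in words)
        base, extra = divmod(spaces, gaps)
        # Leftmost gaps absorb the remainder, one extra space each.
        padded = [w + " " * (base + (1 if i < extra else 0))
                  for i, w in enumerate(words[:-1])]
        result.append("".join(padded) + words[-1])
    return result


if __name__ == "__main__":
    for line in justify("the quick brown fox jumps over the lazy dog", 15):
        print("|" + line + "|")
```

Every step of this sketch relies on the space as both word break and stretchable element, which is exactly what the scripts discussed in the following sections do not provide.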
1.5.1. Word breaks
Line breaking becomes more difficult if scripts do not have word breaks, as in Tibetan and Thai, or if the word break is represented by a character rather than a space. Ethiopic is an example of using a character rather than a space: the Hulet Neteb (፡) is used between words in place of white space, unless there is other punctuation which acts as a word break. (This is also an excellent illustration of how computers have changed a script. With the advent of computing, Ethiopic is now usually written on computers with white space between words. When handwriting, however, people typically still use the Hulet Neteb.)
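A line breaker therefore has to treat the word separator character as a break opportunity. The sketch below splits text into breakable units, treating both white space and the Ethiopic wordspace character (U+1361) as word boundaries and keeping the separator attached to the word it follows, so that a line can end with it. The function name break_units is an assumption for this example, and other Ethiopic punctuation is ignored.

```python
import re

ETHIOPIC_WORDSPACE = "\u1361"  # the Hulet Neteb

# A unit is a run of non-separator characters plus an optional trailing
# Hulet Neteb; white space between units is dropped.
_UNIT = re.compile("[^\\s" + ETHIOPIC_WORDSPACE + "]+" + ETHIOPIC_WORDSPACE + "?")


def break_units(text):
    """Return the units after which a line break is allowed."""
    return _UNIT.findall(text)


if __name__ == "__main__":
    traditional = "\u1230\u120b\u121d\u1361\u1208\u12d3\u1208\u121d"
    modern = "\u1230\u120b\u121d \u1208\u12d3\u1208\u121d"
    print(break_units(traditional))
    print(break_units(modern))
```

Scripts such as Thai, which mark no word boundaries at all, need a dictionary or a statistical model instead of a rule like this.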
1.5.2. Justification and Hyphenation
As mentioned, Tibetan does not have visible word breaks and so the only variable white space is after the shad (།). In Figure 20 one can see that lines four, five and six are “completed” (or justified) with a series of syllable markers (་) while the other lines are justified via the variable space after each shad.
Figure 20. Syllable markers help "justify" the line (Tibetan)
Hyphenation is another tool used in Roman typesetting but is not always allowed in other writing systems. Since Arabic does not allow hyphenation, another method for justification is needed: kashidas are used to stretch the line. The kashida (or tatweel) is typeface-dependent and varies for each kind of Arabic script. Thus, the rules for inserting the kashida will be in the font, and the publishing application would need to have the capability of using the font (Milo 2001).
Figure 21. Use of kashida
Another interesting case of different hyphenation methods is Ethiopic. Because Ethiopic was traditionally written using the Hulet Neteb there was no need for a hyphenation character. If a line broke at the end of a word you would see the Hulet Neteb or other punctuation but, if a line broke in the middle of the word, the reader just saw the word break with no hyphen. In Figure 22, line two is “hyphenated,” line one is not.
The difficulty now arises that word breaking still sometimes occurs this way but, with software limitations encouraging the use of white space between words (rather than the Hulet Neteb), it is difficult to know if there is a complete word at the end of a line or if it is a “broken” word (compare Figure 23 with Figure 22).
Figure 22. "Hyphenation" in line two with use of Hulet Neteb (Ethiopic)
Figure 23. "Hyphenation" in line two with use of white space (Ethiopic)
Interesting hyphenation behaviors also occur in some Roman script use, as in the German word “backen” which becomes “bak-ken” when hyphenated (although there are moves for this to be abandoned). Hyphenation rules in the publishing application need to be fully customizable for these types of behaviors.
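A customizable rule of this kind might look like the following sketch, in which a break that falls inside a listed letter cluster rewrites the cluster on both sides of the hyphen. The rule table format and the function name are hypothetical, and only two-letter clusters are handled.

```python
# Each rule maps a two-letter cluster to the letters that end the first
# part and begin the second part when the word is broken inside it.
TRADITIONAL_GERMAN_RULES = {"ck": ("k", "k")}


def break_word(word, position, rules=TRADITIONAL_GERMAN_RULES):
    """Break `word` before index `position`, applying cluster rewrites.

    Returns a (first_part_with_hyphen, second_part) pair.
    """
    start = position - 1  # a straddled cluster begins one letter earlier
    if start >= 0:
        cluster = word[start:start + 2]
        if cluster in rules:
            left, right = rules[cluster]
            return word[:start] + left + "-", right + word[start + 2:]
    return word[:position] + "-", word[position:]


if __name__ == "__main__":
    print(break_word("backen", 3))   # break inside the "ck" cluster
    print(break_word("laufen", 3))   # ordinary break, no rewrite
```

Passing an empty rule table gives the plain behavior, which is the sort of switch an application needs where the "k-k" convention has been abandoned.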
1.6. Ordered Lists
Most of the behaviors mentioned in this section will be based on the national standard for a particular script. This type of behavior won’t reside in the font though, and the typesetting application will need to be customized to follow national standards.
Footnote numbering must be addressed. Most non-Roman scripts will only use numbers or symbols rather than alphabetic characters, as are often used in Roman typesetting. If symbols are used, decisions on order (dagger, double-dagger, asterisk?) need to be made and implemented. Figure 24 shows the use of Thai digits in footnotes. At small point sizes it becomes difficult to distinguish between digits.
Figure 24. Use of Thai digits in footnotes
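Both kinds of footnote marker can be generated from the footnote's ordinal number. The sketch below is illustrative only: the function names are assumptions, and the default symbol order (asterisk, dagger, double dagger) is just one of the possible orders, to be replaced by whatever the local standard requires.

```python
# Map ASCII digits 0-9 to Thai digits U+0E50 to U+0E59.
_THAI_DIGIT_TABLE = {ord("0") + i: chr(0x0E50 + i) for i in range(10)}


def to_thai_digits(number):
    """Render a non-negative integer with Thai digits."""
    return str(number).translate(_THAI_DIGIT_TABLE)


def symbol_marker(number, order=("*", "\u2020", "\u2021")):
    """Return the symbol marker for the given 1-based footnote number.

    Symbols are used in `order`; after the last one the sequence starts
    again with each symbol doubled, then tripled, and so on.
    """
    cycle, index = divmod(number - 1, len(order))
    return order[index] * (cycle + 1)


if __name__ == "__main__":
    print([to_thai_digits(n) for n in (1, 10, 24)])
    print([symbol_marker(n) for n in range(1, 5)])
```

Keeping the symbol order as a parameter leaves the decision with the typesetter rather than with the application.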
In an ordered list, whether an outline, footnotes or a glossary with alphabetic headers, the sort order must be addressed. Punctuation around outline points must also follow national standards.
Sample of outline in Roman text
Sample of outline in Thai text
A look through various Thai publications reveals that, as in English, there are a variety of ways to number outlines. The Thai example above uses both “alphabetic” characters as well as digits. Once again, the ability to customize an application’s rendering of ordered lists is important.
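One way to make ordered lists customizable is to let each outline level take its own label generator and punctuation template. The sketch below assumes this design; the function names, the choice of Thai consonants, and the punctuation are hypothetical and are not taken from any national standard.

```python
def thai_digits(number):
    """Render a number with Thai digits (U+0E50 to U+0E59)."""
    return str(number).translate({ord("0") + i: chr(0x0E50 + i) for i in range(10)})


def from_sequence(symbols):
    """Build a labeler that picks the n-th symbol (1-based) of a sequence."""
    return lambda number: symbols[number - 1]


def render_outline(items, styles, level=0):
    """Render nested (text, children) items as indented, labeled lines.

    styles[level] is a (labeler, template) pair: the labeler turns the
    item's 1-based position into a label, the template adds punctuation.
    """
    lines = []
    for position, (text, children) in enumerate(items, start=1):
        labeler, template = styles[level]
        lines.append("  " * level + template.format(labeler(position)) + " " + text)
        lines.extend(render_outline(children, styles, level + 1))
    return lines


if __name__ == "__main__":
    outline = [("First point", [("Detail", []), ("Detail", [])]),
               ("Second point", [])]
    # Assumed style: Thai digits at the top level, Thai consonants below.
    thai_style = [(thai_digits, "{}."),
                  (from_sequence("\u0e01\u0e02\u0e04\u0e07"), "{})")]
    print("\n".join(render_outline(outline, thai_style)))
```

Swapping the style list changes the numbering and punctuation without touching the outline itself.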
2. Design Issues
Although possibly not technically challenging for software developers, there are design differences in non-Roman scripts of which the typesetter must be aware. Many design issues are related to readability. The author and graphic designer, of course, want to have their manuscripts read. Emphasis, cultural design issues, optimum line length for a particular typeface and point size, leading, kerning and word spacing are all important in contributing to readability and will be different for each writing system.
2.1. Page Design
2.1.1. Overall design
There are some simple differences in the setup of a page. For instance, most of the world outside of the United States uses A4 paper. Another is that many non-Roman books use a line under the running header to set it off from the text. With sacred writings, especially in RTL scripts, one often needs to add decorative borders around each page (see Figure 25).
Figure 25. Decorative border around sacred text (Brahui)
2.1.2. Multilingual texts
When mixing scripts on a page it is very important to ensure that the body size and feel of the fonts are balanced. If one font is significantly heavier than the other, the page will look unbalanced.
Placement of illustrations will need to be different. Studies show that in reading Roman, the eye naturally scans from the top left to bottom right, and graphic designers keep the normal eye flow in mind when they are designing pages. Further study is needed to know whether there are similar guidelines for placing illustrations with RTL and vertical scripts.
2.2. Paragraph Formatting
2.2.1. Point size and Leading
There are differences in optimum point sizes between scripts. In general the point size needs to be larger for non-Roman scripts, either to reproduce complex characters successfully or because of longer ascenders and descenders. Clearly this affects leading as well and, in many computer applications, one would not want to allow the default leading to prevail. Extra leading is allowed for long descenders in the Tibetan text of Figure 26, while Figure 27 probably did not allow enough leading, as ascenders and descenders run together in places. Undoubtedly this makes the book more difficult to read.
Figure 26. Leading which is sufficient for long descenders (Tibetan)
Figure 27. Leading which is crowded with long ascenders and descenders (Lanna)
Figure 28 also shows how leading can be affected when one script is used within another: the line spacing in this case is increased when Turkic words are used but remains the same for the rest of the paragraph. If leading had been set to an exact amount, rather than letting the computer application decide, this odd behavior would not have occurred.
Figure 28. Uneven leading with mixed scripts (Cyrillic and Arabic)
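The difference between automatic and exact leading can be shown with a small calculation. The sketch assumes a simplified model in which automatic leading follows the tallest font on each line; the function name and the point values are invented for illustration.

```python
def line_advances(lines, exact_leading=None):
    """Return the vertical advance of each line, in points.

    lines: for each line, the natural line heights of the fonts used on it.
    exact_leading: if given, every line advances by this fixed amount;
    otherwise each line follows its tallest font (automatic leading).
    """
    if exact_leading is not None:
        return [exact_leading for _ in lines]
    return [max(heights) for heights in lines]


if __name__ == "__main__":
    # Hypothetical paragraph: the second line mixes a 12 pt Cyrillic font
    # with an Arabic font whose natural line height is 16 pt.
    paragraph = [[12.0], [12.0, 16.0], [12.0]]
    print(line_advances(paragraph))
    print(line_advances(paragraph, exact_leading=16.0))
```

With automatic leading only the mixed line grows, giving the uneven spacing of Figure 28; an exact value chosen for the taller script keeps the paragraph even.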
Larger point sizes and greater amounts of leading mean that the size of the finished document can increase dramatically. For example, an edition of the Thai Bible has only 24 lines of text per page, and weighs in at 2.2 kg!
2.2.2. Display type
It is common for non-Roman scripts to use a large variety of fonts for titles. Stylistic variations, such as serifed fonts, all caps, bold and especially italic do not always lend themselves to non-Roman scripts. Other methods of highlighting information, creating contrasts and emphasis are needed. The method used will vary greatly depending on the script. Figure 29 and Figure 30 show examples with use of various title fonts.
Figure 29. Use of title fonts in a vertical text (Chinese)
Figure 30. Use of title fonts in a RTL text (Arabic)
One would also need to study whether certain typefaces are only used with specific types of literature (for example, used only when talking about certain events), or whether they can be used anywhere. It would be important to know whether certain fonts (such as more ornate fonts) would be used in sacred writings, in newspapers, novels, etc.
At this point the reader will not be surprised to hear that punctuation behaviors vary between scripts as well.
2.3.1. Sentence ending
Although the reader’s eye may not register it, good Roman typography adds slightly more space between a full stop and the next word than between other words. A simple difference in Cyrillic is that full stops have no more space after them than there is between words (Kolodin et al. 2000), thus it would be important to be able to disable any additional spacing.
Tibetan (།), Devanagari (।), and Ethiopic (።) all have different symbols for sentence ending, each with a differing amount of space before and after the marker. Some Ethiopic languages use a question mark (፧) that differs from the more standard Roman style.
Some Roman languages (such as Spanish) use an opening and closing mark for questions and exclamations. If a non-Roman script did this, that would need to be implemented as well.
2.3.2. Abbreviated text
Punctuation around abbreviated text can be different. There may or may not be space between abbreviations (e.g. A.D. Jones or A. D. Jones), and the punctuation will quite likely not be a full stop, as is the case in Ethiopic (Yakob 1999).
There are undoubtedly many other punctuation issues not covered here. Most of these differences in punctuation are fairly straightforward, may already be addressed in smart fonts, and will not take a lot to implement.
2.4. Other Typesetting Concerns
The ability to kern between characters is very important in some scripts. It is more common in non-Roman scripts for glyphs to overlap the glyphs before and/or after them. Ethiopic is again an example: if there is no kerning, the reader could easily perceive a word space between some of the glyphs when in fact no space is intended.
Rearrangement (as in Indic and South Asian scripts), substitution, contextual forms, ligatures and diacritic placement can all be handled in smart fonts and thus are not covered in this paper. These directly affect typesetting insofar as they affect justification, hyphenation and leading.
The “Design Issues” mentioned are really just examples of a potentially unending list of things that the typesetter needs to be conscious of. Rather than simply following instincts that derive from another script (typically Roman) and culture (typically Western), one needs to learn culturally appropriate ways to handle these design issues. Mention has been made that with the advent of computers some changes have occurred within scripts, probably because of the difficulties of implementing a particular script behavior. While the typesetter may choose to diverge from traditional typesetting standards for a particular script, it should be done carefully and with good reason.
Many of the design issues are finer-grained than just script; different languages or countries using the same script may have significantly different typesetting conventions. This is true even in Roman script. Some issues, such as spacing of punctuation, may best be handled by fonts that are customized for a specific language or region, while others may mean that a book design needs to be re-thought when crossing boundaries (of whatever kind).
The “Technical Challenges” are then the specific things that are likely to present challenges when it comes to actually implementing a design. Most publishing programs are geared toward Roman typesetting, or at the very least toward horizontal typesetting. Mixing of scripts and directionality complicates the issue. There are currently, of course, publishing programs geared specifically to one type of need, e.g. vertical or RTL, but few, if any, which are able to handle the great variety of needs for all scripts. There remains a great need for a publishing system which can handle the various page layout possibilities, line breaking algorithms, mixed direction text and Unicode-encoded smart fonts, and which can be modified to fit local typographic standards.
Publishing can be a lot of fun, and with non-Roman scripts, never boring. Enjoy the challenges!
1. Yü, Ta-fu. Chung-kuo hsin wen i ta hsi, Vol. 2, pp. 534-535. Taipei, Ta han chu pan she, Tsung ching hsiao yuan liu chu pan she, min kuo 65- [1976- ].
2. Ebisawa, Arimichi. 1981. Nihon no Seisho (The Japanese Bible: A History), pp. 260-261. Tokyo: Nihon Kirisuto Kyodan Shuppankyoku.
3. Hasbaatar Saranbat. 1985. Baigal gazar zuin toli (Nature dictionary), pp. 2-3. Inner Mongolia: Inner Mongolian People’s Publication Committee.
4. Tong, Qing-funei (Mr.). Chief editor. 1994. fon koolingga sibe shu tacin gisun i buleku bithe (Modern Standardized Xibo Written Language Dictionary). Edited by “Xinjiang Uygur Autonomous Region Working Committee of National Languages and Writings”, p. 400. Urumqi, People’s Republic of China: Xinjiang People’s Publishing House.
5. 1982. Gurung-Nepali (common language version) New Testament, p. 148. New Delhi, India: World Home Bible League. Printed at Ambassador Press.
6. Phayomyong, Manee. 2533 (Buddhist calendar). Learning to Read Lanna Thai (translation), p. 1. Chiang Mai, Thailand: Chiang Mai University. Printed by Sap Karn Pim.
7. 15 June, 1999. Al Hayat Newspaper. Issue No 13247 p. 2.
8. G. Buyanbat. 1985. Mongoliin ertnii bilig surgal, p. 91. Inner Mongolia: Inner Mongolian People’s Publication Committee.
9. 15 June, 1999. Al Hayat Newspaper. 13247:23.
10. 1981. Lao New Testament, p. 726. KBS.
11. 1982. Bible in Chinese Union Version, “Shangti” Edition 2546, p. 619. Hong Kong: The Bible Society in Hong Kong.
12. 15 June, 1999. Al Hayat Newspaper. Issue No 13247 p. 2.
13. Stebleva, I.V. 1971. The Development of Turkic Poetic Forms in the Eleventh Century, p. 37. Moscow: Nauka. Academy of Sciences of the USSR. Institute of Oriental Studies.
14. 1999. Mongol zuv bichgiin toli (dictionary), p. 1595. Inner Mongolian Newspaper Publication Agency, Inner Mongolian People’s Publication Committee, Inner Mongolian national Printing house.
15. Qawuz, Qadir. 1990. Hanzučä, Ingliz čä Uyghur čä turaqluq ibarilar lughati. (Chinese-English-Uighur Dictionary of Idioms), p. 504. Urumqi: Xinjiang Minzu Chubanshe.
16. Nurbek. Chief editor. 1989. Ha-Han Cidian: Qazaq Xansushu Sözdik (Chinese-Qazak Dictionary), p. 4. Beijing: Minzu Chubanshe.
17. Surizhu. 1985. Mengguyu wenji (Anthology of Mongolian languages), p. 124. Xining: Qinghai People’s Publishing House.
18. China Tibetan Language Department Higher Buddhist Studies Institute (ed.) 1987. Work said to be the key to open the door of snowland wisdom. Book 1, p. 24. Beijing: Nationalities Publishing House.
19. Phayomyong, Manee. 2533 (Buddhist calendar). Learning to Read Lanna Thai (translation), p. 1. Chiang Mai, Thailand: Chiang Mai University. Printed by Sap Karn Pim.
20. Minzu Tushuguan bian (Library of Nationalities). 1984. “Zangwen dianji mulu: wenjilei zimu” (Bibliography of Ancient Tibetan books and documents). Vol. 1, p. 1. Chengdu: Sichuan Minority Press.
24. Pankhueankhat, Ruengdet. 2530 (Buddhist calendar). Aksorn Thai, p. 9. Salaya, Thailand: Mahidol University. Institute for the Study of Language and Culture for Rural Development.
25. 1998. The New Testament in Brahui, First Edition, p. 10. Lahore: Pakistan Bible Society.
26. China Tibetan Language Department Higher Buddhist Studies Institute (ed.) 1987. Work said to be the key to open the door of snowland wisdom. Book 1, p. 24. Beijing: Nationalities Publishing House.
27. Phayomyong, Manee. 2533 (Buddhist calendar). Learning to Read Lanna Thai (translation), p. 69. Chiang Mai, Thailand: Chiang Mai University. Printed by Sap Karn Pim.
28. Stebleva, I.V. 1971. The Development of Turkic Poetic Forms in the Eleventh Century, p. 37. Moscow: Nauka. Academy of Sciences of the USSR. Institute of Oriental Studies.
29. 7 May 1993. Dallas Chinese Times. D:8.
30. 15 June, 1999. Al Hayat Newspaper. 13247:23.
Haralambous, Yannis, John Plaice. November 22, 2000. p. 2. The Omega Typesetting and Document Processing System. ms. http://omega.cse.unsw.edu.au:8080/papers/bite.ps.
Kolodin, M.Y., O.V. Eterevsky, O.G. Lapko, I.A. Makhovaya. 2000. “‘Russian style’ with LaTeX and Babel: what does it look like and how does it work,” TUG 2000 Preprints.
MacKay, Pierre A. 1986. “Typesetting Problem Scripts,” Byte. pp. 201-216.
Milo, Thomas. 2001. “Creating Solutions for Arabic: A Case Study,” Eighteenth International Unicode Conference, Pre-Conference Tutorials. TC5:6.
Poirrier, Jerome. 1995. “Publishing for Middle Eastern Languages: Full-Featured DTP for Middle East Users,” MultiLingual Computing & Technology #8, Vol. 6, Issue 1. pp. 18,19.
Turley, James. 1998. “Computing in Tibetan,” MultiLingual Computing & Technology #22, Vol. 9, Issue 6, pp. 30-33.
The Unicode Consortium. 2000. The Unicode Standard, Version 3.0, p. 438. Reading, MA: Addison-Wesley Developers Press.
Yakob, Daniel. 1999. Notes on Ethiopic Localization. http://www.abyssiniacybergateway.net/fidel/l10n/.
----. 2000. “Computing in Amharic,” MultiLingual Computing & Technology #34, Vol. 11, Issue 6, pp. 29-32.
Copyright © 2001-2002 SIL International