NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/ThaiLaoSeq

Sequence Checking in Thai & Lao

Martin Hosken, 2008-04-25

Introduction

Because the Unicode character set is so large, system and application implementors naturally want some mechanism for flagging clearly illegal sequences of Unicode characters. For example, adding a Devanagari diacritic to a Thai base character makes no sense. Further, because modern smart fonts are designed with certain sequences in mind, it can seem sensible for applications to disallow other sequences as unrenderable. As a result, system and application developers increasingly build in rules for which sequences of Unicode characters their systems will support, and they do so with a particular language or set of languages in mind.

Unfortunately, representing a minority language in a script often requires novel character sequences to represent the sounds of that language. The language may belong to a very different language family from the national language and so may not be easily represented by the script in question. As a result, various sequences that implementors may consider illegal for the national language are in fact legal and necessary for a minority language using the national script.

On this page, we list some of the more problematic sequences that have been encountered so that system and application developers may take them into account as they design.

Thai

In Northern Khmer we find the following:

  • U+0E21 U+0E39 U+0E4D U+0E22
  • U+0E22 U+0E38 U+0E4D U+0E1A `night`
  • U+0E01 U+0E34 U+0E3A U+0E08 `step away`
  • U+0E25 U+0E37 U+0E3A U+0E2D
  • U+0E2D U+0E36 U+0E3A U+0E01 (syllable pattern)

The precise code order is not clear in the above cases: the diacritics could occur in the reverse order. See Mon-Khmer Studies 16-17:255-272 (1990).
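To make the ordering question concrete, here is a minimal sketch in Python that decodes one of the sequences above and compares it with the reverse diacritic order. The helper names `from_code_points` and `describe` are made up for this illustration, and the result depends on the Unicode data shipped with the Python version in use.

```python
import unicodedata

def from_code_points(text):
    """Turn a string such as 'U+0E01 U+0E34' into the characters it names."""
    return "".join(chr(int(token[2:], 16)) for token in text.split())

def describe(sequence):
    """Return (code point, general category, character name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in sequence
    ]

# Northern Khmer 'step away': the order listed above, and the reverse
# diacritic order that the text says is also possible.
listed = from_code_points("U+0E01 U+0E34 U+0E3A U+0E08")
reverse = from_code_points("U+0E01 U+0E3A U+0E34 U+0E08")

for row in describe(listed):
    print(*row)

# Normalization does not reorder these two marks, so the two spellings
# stay different strings and would not match in a search or a sort.
same_after_nfc = (
    unicodedata.normalize("NFC", listed) == unicodedata.normalize("NFC", reverse)
)
print("orders equivalent under NFC:", same_after_nfc)
```

Since normalization will not reconcile the two orders, whichever order a language community settles on has to be accepted by the sequence checker as written.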

In Bru we find the following:

  • U+0E41 U+0E15 U+0E47 U+0E48 U+0E07 `to spread`
  • U+0E40 U+0E08 U+0E3A U+0E48 U+0E2D `already`
  • U+0E40 U+0E1B U+0E23 U+0E3A U+0E34 U+0E48 U+0E2B U+0E4C `dirty`
  • U+0E42 U+0E08 U+0E4A U+0E48 `bunch of bamboo`
  • U+0E40 U+0E1B U+0E3A U+0E35 U+0E48 U+0E22 `to mix`

In addition, in So we find:

  • U+0E42 U+0E3A U+0E17 U+0E23 `So (the name of the language group)`

In Lwa we have:

  • U+0E2D U+0E32 U+0E37 `chase`
  • U+0E22 U+0E32 U+0E36 `mine`
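The effect of an allow-list checker on these words can be sketched as follows. The rule below is hypothetical and deliberately simplified: it is modelled only on ordinary Thai spelling and is not the rule set of any particular system. The names `STRICT_THAI`, `from_code_points` and `is_accepted` are likewise invented for the example.

```python
import re

# HYPOTHETICAL strict rule, modelled only on ordinary Thai spelling.
STRICT_THAI = re.compile(
    r"""(?:
        [\u0E40-\u0E44]?                  # optional leading vowel
        [\u0E01-\u0E2E]                   # consonant
        (?: \u0E3A                        # either phinthu on its own
          | [\u0E31\u0E34-\u0E39\u0E4D]?  # or one above/below vowel or nikhahit
            [\u0E47-\u0E4C\u0E4E]?        # then one maitaikhu, tone mark,
                                          # thanthakhat or yamakkan
        )
        [\u0E30\u0E32\u0E33\u0E45]?       # optional following vowel
    )+""",
    re.VERBOSE,
)

def from_code_points(text):
    """Turn a string such as 'U+0E01 U+0E34' into the characters it names."""
    return "".join(chr(int(token[2:], 16)) for token in text.split())

def is_accepted(word):
    """True if the whole word fits the hypothetical strict rule."""
    return STRICT_THAI.fullmatch(word) is not None

STANDARD_THAI = {
    "'water'": "U+0E19 U+0E49 U+0E33",
    "'to be'": "U+0E40 U+0E1B U+0E47 U+0E19",
}

MINORITY_WORDS = {
    "Northern Khmer 'step away'": "U+0E01 U+0E34 U+0E3A U+0E08",
    "Bru 'to spread'": "U+0E41 U+0E15 U+0E47 U+0E48 U+0E07",
    "Bru 'already'": "U+0E40 U+0E08 U+0E3A U+0E48 U+0E2D",
    "So (language name)": "U+0E42 U+0E3A U+0E17 U+0E23",
    "Lwa 'chase'": "U+0E2D U+0E32 U+0E37",
}

for label, code_points in {**STANDARD_THAI, **MINORITY_WORDS}.items():
    verdict = "accepted" if is_accepted(from_code_points(code_points)) else "rejected"
    print(f"{label}: {verdict}")
```

Under this rule the two standard Thai words pass and every minority-language word is rejected, even though each one is a legitimate spelling in its own orthography.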

Presentations

The following presentation, given at NECTEC in Bangkok, Thailand, concerns the impact that minority language use of the Thai script has on the script itself.

Presentation on Minority use of Thai Script, given to NECTEC, March 2006
Martin Hosken, 2006-03-03
Download "thailaoseq_mar06.pdf", Acrobat PDF document, 1MB

Lao

Very often the sequence rules for Thai are applied to the Lao script as well, so languages written in Lao script face similar problems.

In Bru we find:

  • U+0EA3 U+0EBD U+0ECD U+0E87 `like, as`
  • U+0E9E U+0ECD U+0EC9 U+0EBD U+0E81 or U+0E9E U+0EC9 U+0EBD U+0ECD U+0E81 `smoke`

In Phunoi we find:

  • U+0E9A U+0EBD U+0EB9 `flower bud`
  • U+0E9B U+0EC9 U+0EBD U+0EB2 U+0E99 `run`

In Tai Dam there are cases of two stacked tone marks:

  • U+0E81 U+0EC9 U+0ECB U+0EB2 U+0E99 `to collapse`
  • U+0E81 U+0EB1 U+0EC8 U+0ECB U+0E87 `to hit, to pound`
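A checker that assumes at most one tone mark per consonant would reject the Tai Dam words above. The sketch below, in Python, finds runs of two or more Lao tone marks (U+0EC8 to U+0ECB) so that such words can be identified in test data. The names `STACKED_TONES` and `stacked_tone_runs` are invented for this example.

```python
import re

# Lao tone marks: MAI EK, MAI THO, MAI TI, MAI CATAWA.
LAO_TONE_MARKS = "\u0EC8-\u0ECB"
STACKED_TONES = re.compile(f"[{LAO_TONE_MARKS}]{{2,}}")

def from_code_points(text):
    """Turn a string such as 'U+0E81 U+0EC9' into the characters it names."""
    return "".join(chr(int(token[2:], 16)) for token in text.split())

def stacked_tone_runs(word):
    """Return each run of adjacent tone marks as a list of code point labels."""
    return [
        [f"U+{ord(ch):04X}" for ch in match.group()]
        for match in STACKED_TONES.finditer(word)
    ]

WORDS = {
    "Tai Dam 'to collapse'": "U+0E81 U+0EC9 U+0ECB U+0EB2 U+0E99",
    "Tai Dam 'to hit, to pound'": "U+0E81 U+0EB1 U+0EC8 U+0ECB U+0E87",
    "Bru 'like, as'": "U+0EA3 U+0EBD U+0ECD U+0E87",
}

for label, code_points in WORDS.items():
    print(label, stacked_tone_runs(from_code_points(code_points)))
```

Both Tai Dam words contain one run of two tone marks, while the Bru word contains none; its difficulty for a strict checker lies elsewhere in the sequence.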

Conclusion

It is noticeable that the approach taken to sequence constraints in Thai and Lao is to disallow everything except what is clearly allowed. For the benefit of minority languages and their development, the opposite approach may provide longer-term stability as more scripts are developed and the needs of minority languages create software maintenance issues: first allow anything that can be rendered, and then perhaps add some further constraints.
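One way to picture the permissive approach is a baseline check that accepts almost anything, with further constraints attached per language rather than per script. The Python sketch below is only an illustration under stated assumptions: "can be rendered" is approximated by requiring every combining mark to have some base character before it, and the language profiles and their constraints in `EXTRA_CONSTRAINTS` are hypothetical.

```python
import re
import unicodedata

def has_orphan_mark(text):
    """True if a combining mark appears with no base character before it."""
    previous_can_carry_mark = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            if not previous_can_carry_mark:
                return True
        else:
            previous_can_carry_mark = not ch.isspace()
    return False

def has_double_tone(text):
    """True if two Thai or Lao tone marks occur in a row."""
    return re.search("[\u0E48-\u0E4B\u0EC8-\u0ECB]{2}", text) is not None

# HYPOTHETICAL language profiles: constraints are added per language,
# and a language with no entry gets the permissive baseline only.
EXTRA_CONSTRAINTS = {
    "lao": [("two tone marks in a row", has_double_tone)],
    "tai-dam": [],
}

def problems(text, language):
    """List the baseline and language-specific problems found in the text."""
    found = []
    if has_orphan_mark(text):
        found.append("combining mark without a base")
    for label, violates in EXTRA_CONSTRAINTS.get(language, []):
        if violates(text):
            found.append(label)
    return found

collapse = "\u0E81\u0EC9\u0ECB\u0EB2\u0E99"  # Tai Dam 'to collapse'
print(problems(collapse, "tai-dam"))      # baseline only
print(problems(collapse, "lao"))          # hypothetical Lao constraint applies
print(problems("\u0EC9\u0E81", "tai-dam"))  # tone mark before any base
```

With this design, supporting a new orthography means adding or omitting a profile rather than changing rules that every language using the script has to share.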




Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.

"Robert Batzinger", Fri, Jun 20, 2008 01:48 (CDT)

I am happy to see someone bringing this problem to light.

In my experience of working with minority languages of Asia, this use of characters in ways that violate the rules of the national majority language is to be expected. As many of the writing systems are phonetic in nature, the system of phonetics recorded in transcriptions of a minority language does not fit comfortably with the set of phonetics used for the national language. Even if no new diacritical marks are added, you will find that the interpretation of characters changes. For example, in the Roman script rendering of the Mien language in Chiang Rai, the symbols x, c, and q are used in syllable-final position as tone markers. The Cyrillic rendering of Mongolian uses some character code points to represent sounds that do not exist in Russian.

Restricting the set of input sequences to only acceptable character combinations in order to reduce typos is not useful if the goal is to allow all languages to share a common encoding. Checking for appropriate and correct sequences needs to be a language-specific feature and not a feature of the character set within Unicode. However, this highlights a weakness in the way Unicode has been implemented in most systems.




© 2003-2014 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.