NRSI: Computers & Writing Systems
abjad — a form of writing in which the vowels are omitted or optional, such as Hebrew and Arabic scripts.
abstract character — a unit of information used for the organization, control or representation of textual data. Abstract characters may be non-graphic characters used in textual information systems to control the organization of textual data (e.g. U+FFF9 INTERLINEAR ANNOTATION ANCHOR), or to control the presentation of textual data (e.g. U+200D ZERO WIDTH JOINER).
abstract character repertoire — a collection of abstract characters compiled for the purposes of encoding. See also charset.
abugida — a form of writing in which the consonants and vowels in a syllable are treated as a cluster or unit; typical of scripts from South Asia.
advance height — the amount by which the current display position is adjusted vertically after rendering a given glyph. This number is generally only meaningful for vertical writing systems, and is usually zero within fonts used for horizontal writing systems.
advance width — the amount by which the current display position is adjusted horizontally after rendering a given glyph.
affrication — the phonological process by which a simple stop, such as [t], is converted to an affricate, such as [tʃ]. For example, in some dialects of British English the word "tuna" is pronounced [tʃu:na], the first consonant having been affricated.
allophone — a variant of a phoneme. It is not distinctive; that is, substituting one allophone for another of the same phoneme will not change the meaning of the word, although it may sound unnatural. Broadly speaking, the test to determine whether two sounds are allophones of the same phoneme, or separate phonemes, is to see whether they are in complementary distribution, that is, whether each is found only in environments where the other is not. For example, in English [pʰ] only occurs syllable-initially when followed by a stressed vowel, but [p] occurs in all other environments. This is illustrated by the words pin [pʰɪn] and spin [spɪn]. Therefore, [pʰ] and [p] are seen to be in complementary distribution, and are therefore allophones of the phoneme /p/. This test is not foolproof; some sounds are in complementary distribution but are not considered to be allophones. For example, in English /h/ only occurs syllable-initially and /ŋ/ only occurs syllable-finally. However, they are phonetically so different that they are still considered to be separate phonemes. One allophone can be assigned to more than one phoneme, as illustrated in some North American English dialects, where the phonemes /t/ and /d/ can both be realized as the allophone [ɾ].
alphabet — a segmental writing system having symbols for individual sounds, rather than for syllables or morphemes. In a true alphabet, consonants and vowels are written as independent letters, in contrast to an abugida or an abjad. In a perfectly phonemic alphabet, phonemes and letters would be predictable in both directions; that is, the sound of a word could be predicted from its spelling and vice-versa. A phonetic alphabet is also predictable in this way; however, it uses separate letters for separate allophones, whereas a phonemic alphabet may describe allophones of the same phoneme using a single letter.
anchor point — see attachment point.
ASCII — a standard that defines the 7-bit numbers (codepoints) needed for most of the U.S. English writing system. The initials stand for American Standard Code for Information Interchange. Also specified as ISO 646-IRV.
attachment point — a point defined relative to a glyph outline such that if two attachment points on two glyphs are positioned on top of each other, the glyphs are positioned correctly relative to each other. For example, a base character may have an attachment point used to position a diacritic, which would also have an attachment point. Also called anchor point.
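As an illustration of the definition above, here is a minimal sketch of the arithmetic involved. It assumes glyph-local (x, y) coordinates in font units; the function name and the numbers are hypothetical and not taken from any particular font format.

```python
def attach_offset(base_anchor, mark_anchor):
    """Return the (dx, dy) shift that puts the mark's attachment point
    exactly on top of the base glyph's attachment point."""
    bx, by = base_anchor
    mx, my = mark_anchor
    return (bx - mx, by - my)

# Hypothetical values: an attachment point above a base letter, and the
# corresponding attachment point at the bottom of an acute accent.
base_top = (250, 700)
acute_bottom = (120, 650)
offset = attach_offset(base_top, acute_bottom)  # shift to apply to the accent
```

Shifting the diacritic by this offset makes the two attachment points coincide, which is the condition the definition describes.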
baseline — the vertical point of origin for all the glyphs rendered on a single line. Roman scripts have a baseline on which the glyphs appear to “sit,” with occasional descenders below. Many Indic scripts have a “hanging” baseline, in which the bulk of the letters are placed below the baseline, with occasional ascenders above the line. Some scripts, such as Chinese, use a centered baseline, where the glyphs are all positioned with their centers on the baseline.
Basic Multilingual Plane (BMP) — the portion of Unicode’s codespace in which all of the most commonly used characters are encoded, corresponding to codepoints U+0000 to U+FFFF. Also known as Plane 0. See also Supplementary Planes.
bicameral — describes a script with two sets of symbols that correspond to each phoneme, most often upper- and lower-case. See also unicameral. Examples of bicameral scripts include Roman (or Latin), Greek, and Cyrillic.
bidirectionality — the characteristic of some writing systems to contain ranges of text that are written left-to-right as well as ranges that are written right-to-left. Specifically, in Arabic and Hebrew scripts, most text is written right-to-left, but numbers are written left-to-right. This can also be used to refer to text containing runs in multiple writing systems, some RTL and some LTR.
BMP — see Basic Multilingual Plane.
BOM — see byte order mark.
boustrophedon — a way of writing in which successive lines of text alternate between left-to-right and right-to-left directionality.
byte order mark (BOM) — the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE when used as the first character in a UTF-16 or UTF-32 plain text file to indicate the byte serialization order, i.e. whether the least significant byte comes first (little-endian) or the most significant byte comes first (big-endian). Byte order is not an issue for UTF-8, though the byte order mark is sometimes added to the beginning of UTF-8 encoded files as an encoding signature that applications can look for to detect that the file is encoded in UTF-8. See http://www.unicode.org/unicode/faq/utf_bom.html.
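A small sketch of how an application might look for such a signature at the start of a byte stream. The BOM constants come from Python's standard codecs module; the function name sniff_bom and the returned labels are assumptions made for this illustration.

```python
import codecs

# Longer signatures are checked first, because the UTF-32 little-endian
# BOM begins with the same two bytes as the UTF-16 little-endian BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding_label, bom_length), or (None, 0) if no BOM is found."""
    for bom, label in _SIGNATURES:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0

label, skip = sniff_bom("\ufeffhi".encode("utf-16-le"))
text = "\ufeffhi".encode("utf-16-le")[skip:].decode(label)
```

A file without a BOM yields (None, 0), in which case the encoding has to be determined some other way.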
cascading style sheets (CSS) — one of two stylesheet languages used in Web-based protocols (the other is XSL). CSS is mainly used for rendering HTML, but can also be used for rendering XML. It is much less complex than XSL, i.e., it can only be used when the structure of the source document is already very close to what is desired in the final form.
character — (1) a symbol used in writing, distinguished from others by its meaning, not its specific shape; similar to grapheme. It relates to the domain of orthographies and writing. See orthographic character.
character encoding form — a system for representing the codepoints associated with a particular coded character set in terms of code values of a particular datatype or size. For many situations, this is a trivial mapping: codepoints are represented by bytes with the same integer value as the codepoint. Some encoding forms may represent codepoints in terms of 16- or 32-bit values, though, and some 8-bit encoding forms may be able to represent a codespace that has more than 256 codepoints by using multiple-byte sequences. Most encoding forms are designed specifically for use in connection with a particular coded character set; e.g. UTF-8 is used specifically for encoded representation of the Universal Character Set defined by Unicode and ISO/IEC 10646. Some encoding forms may be designed for use with multiple repertoires, however. For example, the ISO 2022 encoding form supports an open collection of coded character sets and specifies changes between character sets in a data stream using escape sequences.
character encoding scheme — a character encoding form with a specific byte order serialization (relevant mainly for 16- or 32-bit encoding forms).
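The difference between an encoding form and an encoding scheme can be seen by encoding one codepoint several ways. This sketch uses Python's built-in codecs; the choice of U+10400 as the sample character is arbitrary.

```python
ch = "\U00010400"  # a single codepoint outside the BMP

# Encoding forms: the same codepoint expressed in 8-, 16- or 32-bit code units.
utf8 = ch.encode("utf-8")          # four 8-bit code units
utf32_be = ch.encode("utf-32-be")  # one 32-bit code unit

# Encoding schemes: the same two 16-bit code units (0xD801, 0xDC00),
# serialized with different byte orders.
utf16_be = ch.encode("utf-16-be")  # most significant byte first
utf16_le = ch.encode("utf-16-le")  # least significant byte first

print(utf8.hex(" "), utf16_be.hex(" "), utf16_le.hex(" "), utf32_be.hex(" "), sep="\n")
```

The two UTF-16 results contain the same code units and differ only in the order of the bytes within each unit.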
character set encoding — a system for encoded representation of textual data that specifies the following: (1) a coded character set, (2) one or more character encoding forms and (3) one or more character encoding schemes.
charset — an identifier used to specify a set of characters. Used particularly in Microsoft Windows and TrueType fonts, and in HTML and other Internet or Web protocols to refer to identifiers for particular subsets of the Universal Character Set.
cmap — character-glyph map: the table within a font containing a mapping of codepoints (characters) to glyph ID numbers. In a Unicode-based font the codepoints are Unicode values; in other fonts they correspond to other encodings.
codepage — (1) synonym for coded character set.
(2) synonym for character set encoding; i.e. in some contexts, codepage is used to refer to a specification of a character repertoire and an encoding form for representing that repertoire.
(3) In some systems, a mapping between encoded characters in Unicode and a non-Unicode encoding form; e.g. Microsoft Windows codepage 1252.
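To illustrate sense (3), the sketch below uses Python's built-in "cp1252" codec, which implements the mapping for Windows codepage 1252. The two sample bytes are arbitrary.

```python
raw = b"\x80\xe9"            # two bytes as they might appear in a cp1252 file
text = raw.decode("cp1252")  # map each byte to its Unicode character

# Show the Unicode codepoint that each byte was mapped to.
codepoints = [f"U+{ord(c):04X}" for c in text]
print(codepoints)

# The mapping also runs in the other direction.
round_trip = text.encode("cp1252")
```

Byte 0x80 maps to a codepoint far outside the 8-bit range, which shows that the codepage is a mapping table and not a simple numeric identity.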
codepoint — a numeric value used as an encoded representation of some abstract character within a computer or information system. Codepoints are integer values used to represent particular characters within a particular encoding.
colometry — in writing, the distribution of text into sense lines, so that a new clause starts on a new line.
complex script — a script characterized by one or more of the following: a very large set of characters, right-to-left or vertical rendering, bidirectionality, contextual glyph selection (shaping), use of ligatures, complex glyph positioning, glyph reordering, and splitting characters into multiple glyphs.
conjunct — a ligature, in particular, a ligature representing a consonant cluster in an Indic script.
CSS — see cascading style sheets.
dead key — a key in a particular keyboard layout that does not generate a character, but rather changes the character generated by a following keystroke. Dead keys are commonly used to enter accented forms of letters in writing systems based on Roman script.
deep encoding — see semantic encoding.
defective — with regard to writing systems, a writing system which does not represent all the distinctive sounds of the language it represents.
determinative — in semantics, a class of words that indicates, specifies or limits a noun, such as the definite or indefinite article, the genitive (possessive) marker, or cardinal
diacritic — a written symbol which is structurally dependent upon another symbol; that is, a symbol that does not occur independently, but always occurs with and is visually positioned in relation to another character, usually above or below. Diacritics are also sometimes referred to as accents; examples include the acute, grave and circumflex.
digraph — a multigraph composed of two components.
diphthong — in phonetics, a complex speech sound occupying one syllable, which begins with one vowel and ends with another. For example [eɪ̯] in British (RP) pronunciation of the word lane. See also monophthong.
display encoding — see presentation-form encoding.
distinctive — also contrastive. An element which makes a distinction between units. In phonology, a process or a pair of sounds, the alternation of which changes the meaning of a word. See also phoneme, minimal pair. For example, voicing is distinctive in most non-tonal languages, as illustrated by the difference between English fan and van, or German Kern and gern.
document — a collection of information. This includes the common sense of the word, i.e. an organisation of primarily textual information that can be produced by a word processing or data processing application. It goes beyond this, however, to include structured information held within an XML file. Each XML file is considered to contain one document, whatever the structure and type of that information.
Document Type Definition (DTD) — a markup declaration used by SGML and XML that contains the formal specifications, or grammar, of an SGML or XML document. One use of the DTD is to run a validation process over an XML file, which indicates if it matches the DTD, or if not, provides a listing of each line at which the file fails some part of the required structure.
DTD — see Document Type Definition.
em square — the square grid which is the basis for the design of all glyphs within a given font; so called because it historically corresponded to the size of the letter M. When rendering, the requested point size specifies the size of the font’s em square to which all glyphs are scaled.
encoded character — an abstract character in some repertoire together with a codepoint to which it is assigned within a coded character set. Encoded characters do not necessarily correspond to graphemes.
encoding — (1) synonym for a character encoding form.
(2) synonym for a character set encoding. This usage is common, especially in cases in which the distinction between a coded character set and a character encoding form is not important (i.e. 8-bit, single-byte implementations). Someone might think of an encoding as simply a mapping between byte sequences and the abstract characters they represent, though this model is not adequate to describe some implementations, particularly CJKV standards, or Unicode and ISO/IEC 10646.
Extensible Markup Language (XML) — a standard for marking up data so as to clearly indicate its structure, generally in a way that indicates the meaning of different parts of it rather than how they will be displayed. See http://www.w3.org/XML/ for details.
Extensible Stylesheet Language (XSL) — a language for expressing stylesheets. It consists of two parts: XSL transformations (XSLT) and an XML vocabulary for specifying formatting semantics. See http://www.w3.org/Style/XSL for full details.
featural writing system — a writing system in which phonetic features, rather than phones (sounds), are represented. For example, there might be a symbol to represent the feature “bilabial” (a sound produced with both lips), a symbol to represent the feature “voiced”, and a symbol to represent the feature “stop”. These could be combined to represent the sound [b]. The closest functioning writing system to this is the Korean Hangul, in which many of the strokes making up the symbols represent place or manner of articulation. Some writing systems used for representing signed languages also contain symbols which stand for particular features of signs. In this case, the symbol often visually resembles the feature it represents, such as direction of movement.
GDL — see Graphite.
gemination — in phonetics, consonant lengthening, usually by about a time-and-a-half of the length of a “short” consonant. Geminated fricatives, trills, nasals and approximants are simply prolonged. In geminated stops, the “hold” is prolonged. In some languages, such as Japanese, Hungarian, Arabic, Italian and Finnish, gemination is distinctive, but in most it is not. In languages where it is distinctive, it is usually restricted to certain consonants. English contains very few words in which gemination affects the meaning; among these are unnamed vs. unaimed or, in some dialects, sixths /sɪksː/ vs. six /sɪks/ (source: John Lawler, University of Michigan). In some languages, consonant length and vowel length depend on each other. For example, in Swedish and Italian a short vowel must be followed by a long consonant (geminate), whereas a long vowel must be followed by a short consonant.
glyph — a shape that is the visual representation of a character. It is a graphic object stored within a font. Glyphs are objects that are recognizably related to particular characters and which are dependent on particular design (e.g. the letter g drawn in three different typefaces gives three distinct glyphs). Glyphs may or may not correspond to characters in a one-to-one manner. For example, a single character may correspond to multiple glyphs that have complementary distributions based upon context (e.g. final and non-final sigma in Greek), or several characters may correspond to a single glyph known as a ligature (e.g. conjuncts in Devanagari script). (For more information on glyphs and their relationship to characters, see ISO/IEC TR 15285.)
grapheme — anything that functions as a distinct unit within an orthography. A grapheme may be a single character, a multigraph, or a diacritic, but in all cases graphemes are defined in relation to the particular orthography.
Graphite — a package developed by SIL to provide “smart rendering” for complex writing systems in an extensible way. It is programmable using a language called Graphite Description Language (GDL). Because it is extensible, it can be used to provide rendering for minority languages not supported by Uniscribe.
heteronym — homographs which, although spelled the same way, are pronounced differently and have different meanings. For example, in English “wind” (noun, as in weather) and “wind” (verb, to coil something).
homograph — one of multiple words having the same spelling but different meanings. They may be pronounced differently (for example in English “tear: rip” and “tear: secreted when crying”), in which case they are also heteronyms, or they may be pronounced the same (for example in American English “tire: cause to be fatigued” and “tire: wheel of a car”), in which case they are also homophones.
homophone — one of multiple words having the same pronunciation but different meanings. They may be spelled differently (for example in English “write” and “right”), in which case they are called heterographs, or the same (for example in English “bark: on a tree” and “bark: of a dog”), in which case they are also homographs.
ideograph — see logograph.
IME — see input method editor.
input method — any mechanism used to enter textual data, such as keyboards, speech recognition or handwriting recognition. The most common form of input method is the keyboard. The term "input method" is intended to include all forms of keyboard handling, including but not limited to input methods that are available for Chinese and other very-large-character-set languages and that are commonly known as input method editors (IMEs). An IME is taken to be a specific type of the more general class of input methods.
input method editor (IME) — a special form of keyboard input method that makes use of additional windows for character editing or selection in order to facilitate keyboard entry of writing systems with very large character sets.
internationalization — a process for producing software that can easily be adapted for use in (almost) any cultural environment; i.e. a methodology for producing software that can be script-enabled and is localisable. Sometimes abbreviated as “I18N”.
kern — to adjust the display position whilst rendering in order to visually improve the spacing between two glyphs. For instance, kerning might be used on the word WAVE to reduce the illusion of white space between the diagonal strokes of the W, A, and V.
Keyman — an input method program which changes and rearranges incoming characters to allow easy ways of typing data in writing systems that would otherwise be difficult or inconvenient to type. See www.tavultesoft.com/keyman.
LANGID — in the Microsoft Win32 API, a 16-bit integer used to identify a language or locale. A LANGID is composed of a 10-bit primary language identifier together with a 6-bit sub-language identifier (the latter being used to indicate regional distinctions for locales that use the same language).
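The bit layout described above can be sketched as follows. The Python function names are hypothetical; they mirror the arithmetic of the Win32 MAKELANGID, PRIMARYLANGID and SUBLANGID macros, with the primary language in the low 10 bits and the sub-language in the high 6 bits.

```python
def make_langid(primary: int, sub: int) -> int:
    """Combine a 10-bit primary language ID and a 6-bit sub-language ID."""
    return ((sub & 0x3F) << 10) | (primary & 0x3FF)

def split_langid(langid: int):
    """Return (primary, sub) for a 16-bit LANGID."""
    return langid & 0x3FF, (langid >> 10) & 0x3F

# 0x0409 is the LANGID for English (United States):
# primary language 0x09 (English), sub-language 0x01 (US).
en_us = make_langid(0x09, 0x01)
primary, sub = split_langid(0x0409)
```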
language ID — a constant value within some system used for metadata identification of the language in which information is expressed. May be numeric or character based, depending on the system.
Latin script — see Roman script.
left side-bearing — the white space at the left edge of a glyph’s visual representation, or more specifically, the distance between the current horizontal display position and the left edge of the glyph’s bounding box. A positive left side-bearing indicates white space between the glyph and the previous one; a negative left side-bearing indicates overlap or overhang between them.
locale — a collection of parameters that affect how information is expressed or presented within a particular group of users, generally distinguished from one another on the basis of language or location (usually country). Locale settings affect things such as number formats, calendrical systems and date and time formats, as well as language and writing system.
localisability — the extent to which the design and implementation of a software product allows potential for localisation of the software.
localisation — the process of adapting software for use by users of different languages or in different geographic regions. For purposes of this document, localisation has to do with the language and script of users, and is distinct from script enabling, which has to do with the script in which language data is written. The localisation process may include such modifications as translating user-interface text, translating help files and documentation, changing icons, modifying the visual design of dialog boxes, etc. Sometimes abbreviated “L10N”.
logograph — also called a logogram or ideograph. A written symbol representing a whole word. Technically, this is distinct from an ideogram, which represents a concept independently of words, although the two are often used interchangeably.
logographic writing system — also known as an ideographic writing system. A writing system in which each symbol represents a complete word or morpheme. The symbols do not indicate the word's pronunciation, only its meaning. Historically, Sumerian cuneiform and Egyptian hieroglyphics were logographic, but today Chinese is the only known writing system in the world that remains logographic. See also logosyllabary.
logosyllabary — a writing system in which each sign is used primarily to represent words or morphemes, with some subsidiary usage to represent syllables. Most natural logosyllabaries employ the rebus principle to extend the character set so that syllables as well as morphemes can be represented. Logosyllabaries may also include determinatives to mark semantic categories which would otherwise be ambiguous. The extent to which syllabic sounds are represented varies from one writing system to another. In instances where a relatively large number of symbols represent syllabic sounds, a logosyllabary may evolve into an abugida or an abjad as the syllabic use overtakes the logographic use.
metathesis — a phonological change in which the order of segments, particularly successive sounds, in a word is reversed. For example, the English word 'ask' was pronounced [æks] between the 5th and 12th centuries, and some dialects have reverted to this pronunciation in modern times.
mnemonic keyboard — a keyboard layout based on the characters appearing on the keytops of the keyboard. See also positional keyboard.
monophthong — a vowel sound which does not change in quality as it is articulated. (Contrast with diphthong.) It can be short, as in English bed [bɛd], or long, as in English bead [biːd]. A single short monophthong is the shortest syllable in any language. The process by which monophthongs change to diphthongs or vice versa is an important factor in language change. Diphthongization in the 15th or 16th century changed the long German monophthong [iː] to [aɪ], as in Eis 'ice', and long [uː] to [aʊ] as in Haus 'house'. A characteristic of Southern American English is the monophthongization of certain diphthongs such as [aɪ] to long [aː] in words such as kite. (source: Wikipedia)
mora — a unit of rhythmic measurement based on syllable weight, which is distinctive in some languages. Japanese is one of the most well-documented of these languages. Short (or light) syllables are monomoraic, consisting of one mora. Long (or heavy) syllables are bimoraic, consisting of two morae. Some languages contain superheavy syllables, for example Hindi, in which a long vowel can be followed by a geminate consonant. These syllables are said to be trimoraic. The first consonant of a syllable does not represent any morae, as it does not constitute a syllable in itself. Syllable-final consonants can either form the final part of a bi- or trimoraic syllable, as is the case in Goidelic Irish, or they can represent a mora in themselves, as is the case in Japanese. Although there is a relation between syllables and morae, they are not necessarily interchangeable. For example, the Japanese word for “photograph”, [sjasin], consists of 2 syllables: sja + sin, but 3 morae: sja + si + n. (source: Jouji Miwa at Mora and Syllable)
multigraph — a combination of two or more written symbols or orthographic characters (e.g. letters) that are used together within an orthography to represent a single sound. (Combinations consisting of two characters are also known as digraphs.)
multi-language enabling — see script enabling.
multi-script encoding — an encoding implementation for some particular language that is designed to enable input to and rendering from that encoding using more than one writing system. When such an implementation is used, the different writing systems are normally based on different scripts.
multi-script enabling — see script enabling.
non-Roman script — a script using a set of characters other than those used by the ancient Romans. Non-Roman scripts include relatively simple ones such as Cyrillic, Georgian, and Vai, and complex scripts such as Arabic, Tamil, and Khmer.
normalization — transformation of data to a normal form. For historical reasons, the Unicode standard allows some characters to have more than one encoded representation. For example, á may be represented as a single codepoint, U+00E1 LATIN SMALL LETTER A WITH ACUTE, or two codepoints, U+0061 LATIN SMALL LETTER A and U+0301 COMBINING ACUTE ACCENT. A normalization scheme is used to standardize the codepoints so that every character is always represented by the same sequence of codepoints. Normalization is described in the Unicode Standard Section 5.7, Normalization.
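The á example above can be reproduced with Python's standard unicodedata module. NFC (composed) and NFD (decomposed) are two of the normalization forms defined by Unicode; this is only a sketch of the idea, not a full treatment of normalization.

```python
import unicodedata

precomposed = "\u00e1"  # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# The two strings display identically but are different codepoint sequences.
same_before = precomposed == decomposed

# After normalizing both to one form, the representations agree.
nfc = unicodedata.normalize("NFC", decomposed)   # composes to U+00E1
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to U+0061 U+0301
same_after = unicodedata.normalize("NFC", precomposed) == nfc
```

Comparing or searching text reliably generally requires that both sides have been normalized to the same form first.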
orthographic character — a written symbol that is conventionally perceived as a distinct unit of writing in some writing system or orthography.
PDF — see Portable Document Format.
PERL — see Practical Extraction and Reporting Language.
phone — a speech sound which is identified as the audible realization of a phoneme.
phoneme — the smallest distinctive segment of sound in any language. It is actually comprised of a group of similar sounds, called allophones, which native speakers of a language may perceive as being all the same. If a pair of words exist which differ only in one phonological element (known as a minimal pair), the element in which they differ is distinctive, and represents two phonemes in the language. For example, in English, bit and pit are a minimal pair; /b/ and /p/ are distinct phonemes. Phonemes are not consistent across languages; two sounds may be separate phonemes in one language and allophones in another.
phonemic inventory — an inventory of all the distinctive sounds (phonemes) in a given language, also called a phoneme inventory. A language's phonemic inventory is not fixed over time; as the language changes, sounds which were previously allophones may become phonemes. The smallest documented phoneme inventory belongs to the Rotokas language, which uses only 11 phonemes. The largest belongs to !Xóõ, with an estimated 112 phonemes. The number of phonemes used in speech does not necessarily correspond to the number of symbols used in writing for a given language. For example, the English alphabet contains 26 letters, but the phonemic inventory numbers between 35 and 47 depending on the dialect used (source: Wikipedia). In a true phonemic script the symbols should map on a one-to-one basis to the sounds in the phonemic inventory.
phonemic script — a writing system in which each symbol tends to correspond to one phoneme. For example, the N'ko alphabet assigns one symbol to each phoneme. Also sometimes called a phonetic script, although technically this is not accurate, as a true phonetic script should represent every allophone in a language.
phonetization — see the rebus principle.
plain text — textual data that contains no document-structure or format markup, or any tagging devices that are controlled by a higher-level protocol. The meaning of plain text data is determined solely by the character encoding convention used for the data.
plane — in Unicode, a range of 64K codepoints. Plane zero is the original 64K codepoints that can be represented in a single 16-bit character. See also Basic Multilingual Plane, supplementary planes, and surrogate pair.
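Since each plane holds 64K (65,536) codepoints, the plane number is simply the codepoint divided by 0x10000. A minimal sketch, with a hypothetical function name:

```python
def plane_of(codepoint: int) -> int:
    """Return the Unicode plane (0 to 16) that contains the codepoint."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    # Each plane spans 0x10000 codepoints, so the plane number is
    # everything above the low 16 bits.
    return codepoint >> 16

planes = [plane_of(cp) for cp in (0x0041, 0xFFFF, 0x10000, 0x10FFFF)]
```

Codepoints in plane 0 fit in 16 bits; anything in planes 1 to 16 does not.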
Portable Document Format (PDF) — a file format for the storage of electronic documents in a paged form, created by Adobe around their Adobe Acrobat product. PDF files are usually created from a Postscript page description.
positional keyboard — a keyboard layout defined in terms of the relative positions of keys rather than what they have printed on them. See also mnemonic keyboard.
Postscript — a page description language defined by Adobe. Originally implemented in laser printers so pages were described in terms of line drawing commands rather than as a bitmap.
Postscript font — a font in a format suitable for use within a Postscript document. There are many types. Type 1 is the most common and is what is meant most commonly when people refer to Postscript fonts. There are also ways of embedding other font formats into a Postscript document. For example, a Type 42 font is a TrueType font formatted for use within a Postscript document. Type 1 fonts differ from TrueType fonts in the way their outlines are described.
Practical Extraction and Reporting Language (PERL) — an interpreted programming language particularly strong for text processing.
presentation-form encoding — a character encoding system in which the abstract characters that are encoded match one-for-one with the glyphs required for text display. Such encodings allow correct rendering of writing systems on “dumb” rendering systems by having distinct codepoints for contextual forms, positional variants, etc. and are designed on the basis of rendering needs rather than on the basis of character semantics (the linguistically relevant information). Also known as glyph encoding, display encoding or surface encoding; distinguished from semantic encoding.
Private Use Area (PUA) — a range of Unicode codepoints (U+E000 to U+F8FF, plus planes 15 and 16) that are reserved for private definition and use within an organisation or corporation for creating proprietary, non-standard character definitions. For more information see The Unicode Consortium, 1996, pp. 619 ff.
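A short sketch of a range test for private-use codepoints. The function name is hypothetical. The upper bounds for planes 15 and 16 stop at xFFFD because the last two codepoints of every plane are noncharacters and are not available for private use.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
]

def is_private_use(codepoint: int) -> bool:
    """True if the codepoint lies in one of the private-use ranges."""
    return any(lo <= codepoint <= hi for lo, hi in PRIVATE_USE_RANGES)

checks = {hex(cp): is_private_use(cp) for cp in (0x41, 0xE000, 0xF900, 0xF0000)}
```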
PUA — see Private Use Area.
rasterising — converting a graphical image described in terms of lines and fills into a bitmap for display on an imaging device.
regression test — a test (usually a whole set of tests, often automated) designed to check that a program has not “regressed”, that is, that previous capabilities have not been compromised by introducing new ones.
render — to display or draw text on an output device (usually the computer screen or paper). This usually consists of two processes: transforming a sequence of characters to a set of positioned glyphs and rasterising those glyphs into a bitmap for display on the output device.
right side-bearing — the white space at the right edge of a glyph’s visual representation, or more specifically, the distance between the display position after a glyph is rendered and the right edge of the glyph’s bounding box. A positive right side-bearing indicates white space between the glyph and the following one; a negative right side-bearing indicates overlap or overhang between them.
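The definitions of left side-bearing, right side-bearing and advance width fit together arithmetically. The sketch below assumes a horizontal layout with the glyph's origin at the current display position, and a bounding box given by hypothetical x_min and x_max values in font units.

```python
def side_bearings(advance_width: int, x_min: int, x_max: int):
    """Return (left, right) side-bearings for a glyph whose origin is the
    current display position and whose bounding box spans x_min..x_max."""
    left = x_min                    # display position to left edge of the box
    right = advance_width - x_max   # right edge of the box to the next position
    return left, right

# An upright glyph with white space on both sides (hypothetical numbers).
plain = side_bearings(advance_width=600, x_min=50, x_max=580)

# A glyph that overhangs both of its neighbours, so both bearings are negative.
overhang = side_bearings(advance_width=500, x_min=-30, x_max=540)
```

Put the other way round, the advance width equals the left side-bearing plus the width of the bounding box plus the right side-bearing.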
Roman script — the script based on the alphabet developed by the ancient Romans ("A B C D E F G ..."), and used by most of the languages of Europe, including English, French, German, Czech, Polish, Swedish, Estonian, etc. Also called Latin script.
schema — in markup, a set of rules for document structure and content.
script — a maximal collection of characters used for writing languages or for transcribing linguistic data that share common characteristics of appearance, share a common set of typical behaviours, have a common history of development, and that would be identified as being related by some community of users. Examples: Roman (or Latin) script, Arabic script, Cyrillic script, Thai script, Devanagari script, Chinese script, etc.
Script Description File (SDF) — a file describing certain kinds of complex script behaviour, used to control a rendering engine to which it has given its name. Created by Tim Erickson and used in Shoebox, LinguaLinks, and ScriptPad.
script enabling — providing the capability in software to allow documents to include text in multiple languages or scripts, and to handle input, display, editing and other text-related operations of text data in multiple languages and scripts. Script enabling has to do with the script in which language data is written, as opposed to localisation, which has to do with the language and script of the user interface.
SDF — see Script Description File.
semantic encoding — an encoding that has the property of one codepoint for every semantically distinct character (the linguistically relevant units). In general, such encodings require the use of “smart” rendering systems for correct appearance to be achieved, but are more appropriate for all other operations performed on the text, especially for any form of analysis. Also known as deep encoding; distinguished from presentation-form encoding.
SFM — see Standard Format Marker.
SGML — see Standard Generalized Markup Language.
sort key — a sequence of numbers that when appropriately processed using a particular standard algorithm will position the corresponding string in the correct sort position in relation to other strings. The sort key need not correspond one number to one codepoint in the input string.
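The point that a sort key need not map one number to one codepoint can be sketched in Python. The alphabet below is hypothetical: it treats the digraph "ng" as a single letter that sorts after "n". The names `ALPHABET`, `WEIGHT` and `sort_key` are invented for this example, and the code is only a single-level sketch, not a full collation algorithm.

```python
# Hypothetical alphabet in which the digraph "ng" is one letter after "n".
ALPHABET = ["a", "b", "d", "e", "g", "i", "k", "l", "m", "n", "ng", "o", "s", "t", "u"]
WEIGHT = {letter: i + 1 for i, letter in enumerate(ALPHABET)}

def sort_key(word: str) -> tuple:
    """Build a tuple of weights, consuming two codepoints for a digraph."""
    key = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in WEIGHT:      # prefer the longest matching unit
            key.append(WEIGHT[word[i:i + 2]])
            i += 2
        else:                            # raises KeyError for unknown letters
            key.append(WEIGHT[word[i]])
            i += 1
    return tuple(key)

words = ["nga", "nu", "na", "ngo", "no"]
sorted_words = sorted(words, key=sort_key)
```

Here `sorted_words` is `["na", "no", "nu", "nga", "ngo"]`, whereas sorting by raw codepoints would place "nga" and "ngo" before "no".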
Standard Format Marker (SFM) — a tag in SIL's proprietary "standard format" for marking up text. A standard format marker begins with a backslash (\). For example, \p would represent a paragraph tag. It is possible (and even probable) that text marked by different SFMs in a single document uses different character encodings. When converting to one encoding (Unicode), such text must be converted with different mapping files.
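A rough Python sketch of how text tagged with backslash markers might be split into fields is given below. It assumes that each marker starts a line, that the marker name ends at the first space, and that unmarked lines continue the previous field. These are assumptions made for illustration, as is the marker `h` in the sample; only the paragraph marker `p` comes from the definition above. The function name `parse_sfm` is likewise invented.

```python
def parse_sfm(text: str) -> list:
    """Split marker-tagged text into (marker, content) pairs."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # Marker name runs from the backslash to the first space.
            marker, _, content = line[1:].partition(" ")
            fields.append((marker, content.strip()))
        elif fields and line.strip():
            # Assumed convention: an unmarked line continues the last field.
            marker, content = fields[-1]
            fields[-1] = (marker, (content + " " + line.strip()).strip())
    return fields

sample = "\\h Title\n\\p First paragraph\nstill first\n\\p Second paragraph\n"
fields = parse_sfm(sample)
```

Once the fields are separated in this way, each marker could be associated with its own mapping file when converting to Unicode.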
Standard Generalized Markup Language (SGML) — a notation for generalized markup developed by the International Organization for Standardization (ISO). It separates textual information from the processing function used for formatting. It was found difficult to parse, due to the many variants possible, and so XML was developed as a subset to resolve the ambiguities and to make parsing easier.
smart font — a font capable of performing transformations on complex patterns of glyphs, above and beyond the simple character-to-glyph mapping that is a basic function of font rendering (see cmap). The information specifying the smart behavior is typically in the form of extra tables embedded in the font, and will generally allow layered transformations involving one-to-many, many-to-one, and many-to-many mappings of glyphs.
supplementary planes — Unicode Planes 1 through 16, consisting of the supplementary code points, corresponding to codepoints U+10000 to U+10FFFF. In The Unicode Standard 3.1, characters were assigned in the supplementary planes for the first time, in Planes 1, 2 and 14. See also Basic Multilingual Plane.
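Because each plane holds 65,536 codepoints, the plane number is simply the codepoint value with its low 16 bits dropped. A minimal Python sketch, with function names invented for this example:

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0 to 16) of a single character."""
    return ord(ch) >> 16

def is_supplementary(ch: str) -> bool:
    """Return True if ch lies in planes 1 through 16."""
    return 0x10000 <= ord(ch) <= 0x10FFFF
```

For instance, `plane_of("A")` is 0 (the Basic Multilingual Plane) and `plane_of("\U00020000")` is 2.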
surface encoding — see presentation-form encoding.
surrogate pair — a mechanism in the UTF-16 encoding form of Unicode in which two 16-bit code units from the range 0xD800 to 0xDFFF (a high surrogate from 0xD800 to 0xDBFF followed by a low surrogate from 0xDC00 to 0xDFFF) are used to encode Unicode supplementary plane characters, i.e. those with Unicode scalar values in the range U+10000 to U+10FFFF.
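The arithmetic behind a surrogate pair can be sketched in Python as follows. The function names are invented for this example. The scalar value is reduced by 0x10000, which leaves a 20-bit number; the top 10 bits are added to 0xD800 and the bottom 10 bits to 0xDC00.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary-plane scalar value into (high, low) code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane values need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_surrogates(0x1F600)` gives `(0xD83D, 0xDE00)`, which matches the bytes produced by `"\U0001F600".encode("utf-16-be")`.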
syllabary — a form of writing in which the symbols represent syllables, most commonly a vowel-and-consonant combination. A syllabary differs from an abugida in that there are no distinct elements of the symbols to correspond to the syllable's phonemes.
symbol-encoded font — Windows supports two types of Unicode fonts: standard and symbol. Symbol-encoded fonts are used for either non-orthographic collections of shapes (such as Wingdings) or for legacy orthographies (e.g., SIL Ezra, SIL Galatia, SIL IPA) created prior to availability of Unicode-based solutions. Symbol-encoded fonts encode characters in the Private Use Area, typically U+F020 .. U+F0FF.
tokenisation — the process of analysing a string into a contiguous sequence of smaller units: for example, word breaking or syllable breaking or the creation of a sort key.
TrueType font — a font format used primarily in Windows and on the Mac that allows for glyph scaling and hinting.
unicameral — describes a script with only one set of symbols per phoneme. See also bicameral.
Unicode Scalar Value (USV) — a number written as a hexadecimal (base 16) value that serves as the codepoint for Unicode characters. Characters in the BMP are written with four hex digits, e.g. U+0061, U+AA32. Characters in supplementary planes use five or six digits.
Uniscribe (Unicode Script Processor) — due to technical limitations in OpenType, it is necessary to pre-process strings before applying OpenType smart behaviour. Microsoft uses a particular DLL (Dynamic Link Library) called Uniscribe to do this pre-processing. Uniscribe does all of the script specific, font generic processing of a string (such as reordering) leaving the font specific processing (such as contextual forms) to the OpenType lookups of a font.
USV — see Unicode Scalar Value.
UTF-8 — an encoding form for storing Unicode codepoints in terms of 8-bit bytes. Characters are encoded using sequences of 1-4 bytes. Characters in the ASCII character set are all represented using a single byte. See http://www.unicode.org/unicode/faq/utf_bom.html.
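The variable length of UTF-8 sequences can be observed with Python's built-in codec. This is only a small demonstration; the helper name `utf8_length` and the sample characters are chosen for this example.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# One sample character for each possible sequence length.
lengths = {
    "U+0041": utf8_length("\u0041"),       # ASCII letter A
    "U+00E9": utf8_length("\u00E9"),       # e with acute accent
    "U+20AC": utf8_length("\u20AC"),       # euro sign
    "U+1F600": utf8_length("\U0001F600"),  # supplementary-plane character
}
```

Here `lengths` maps the four samples to 1, 2, 3 and 4 bytes respectively.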
UTF-32 — an encoding form for storing Unicode codepoints in 32-bit words. Since 32 bits encompasses the entire range of Unicode, every codepoint is encoded as a single 32-bit word. See Unicode Technical Report #19.
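The fixed width of UTF-32 means that a string can be decoded by reading four bytes at a time. The following Python sketch shows this for the big-endian byte order; the function name `utf32_units` is invented for this example.

```python
def utf32_units(text: str) -> list:
    """Return the 32-bit code units of text, which equal its codepoints."""
    data = text.encode("utf-32-be")   # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]

units = utf32_units("A\U0001F600")
```

Here `units` is `[0x41, 0x1F600]`: one 32-bit word per character, whether or not the character lies in the BMP.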
virama — the generic name for a written symbol, particularly common in Brahmic abugidas, having the function of silencing the inherent vowel in every consonant character. The virama can be used either to represent a word-final consonant or the first consonant(s) in a consonant cluster. The shape of the symbol varies from script to script, but it is often a diacritic, written above, below or alongside the consonant which it modifies.
VOLT — see Visual OpenType Layout Tool.
writing system — an implementation of one or more scripts to form a complete system for writing a particular language. Most writing systems are based primarily upon a single script; writing systems for Japanese and Korean are notable exceptions. Many languages have multiple writing systems, however, each based on different scripts; e.g. the Mongolian language can be written using Mongolian or Cyrillic scripts. A writing system uses some subset of the characters of the script or scripts on which it is based with most or all of the behaviours typical to that script and possibly certain behaviours that are peculiar to that particular writing system.
x-height — the distance from the baseline of a line of text to the top of the main body of lower-case letters, that is, without ascenders or descenders. It is the height of a lower-case x, as well as a lower-case u, v, w, and z. Curved letters such as a, e, n, and s tend to be slightly taller than the x-height for aesthetic purposes.
XML — see Extensible Markup Language.
XSL — see Extensible Stylesheet Language.
XSLT — see Extensible Stylesheet Language Transformations.
Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.
I would add samples of the different things you talk about .... so Hebrew text, Arabic text, ....
© 2003-2013 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.