NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/SILConverters25_doc

SIL Converters 2.5 documentation

Kent Spielmann, 2007-02-21

SIL Encoding Converters 2.5 Setup

SIL Converters is available on the SIL Software CD-ROMs. Starting with version 2.5, there is also a new web-based Master Installer that lets you choose the features you want and then install them from the internet. The Master Installer is currently available from this link:  http://downloads.sil.org/EncodingConverters/setup.exe. It is a small program that runs on your machine and launches a series of installers:

  1. Software prerequisites: Necessary system updates and add-ons are installed on your computer.
  2. SIL Encoding Converters 2.5 Setup: Conversion applications are installed and conversion Maps and Tables are copied to your hard drive.
  3. SIL Converters for Office 2003: A misnamed installer that installs one additional operating system update.
  4. Converter Installer: A utility that allows you to activate the conversion Maps and Tables you want to use.

For further instructions on running the Master Installer and for the initial run of SIL Encoding Converters 2.5 Setup, refer to the installation documentation, SIL Converters 2.5 Installation.

The following instructions assume you have already installed SIL Encoding Converters 2.5 and want to understand and use additional features of the program.

Initial dialog windows

Warning

If you run the .msi installer without the Master Installer, you will be shown the following warning:

Figure 2. Installer Warning



  • If you have already run the Master Installer and are sure all the prerequisites are installed, you may "accept the risk" and click  Next .

Note

If, after installing this way, some functionality isn’t working, try re-running the installation with the Master Installer first to ensure that your system has the necessary prerequisites.

Application maintenance

Once Encoding Converters is installed, the setup program displays an application maintenance screen that gives you the option to  Modify ,  Repair  or  Remove  your installation.

  • To modify your installation simply click  Next .

Select Features

The following section briefly describes most of the boxes in Figure 1 and links you to further information about how to use the different utilities and applications for different text transduction tasks. The information is organized around the different components available in the SIL Converters Installer.

Feature overview

Figure 3. Select Features Tree



As you can see from Figure 3, there are four main categories of features that you can choose from when installing SIL Converters:

SIL Converters’ client application

This feature node contains some of the programs at the top layer of Figure 1, which are generally of the most interest to end users. These programs and utilities allow you to convert text data (e.g. Word documents, SFM documents, XML Documents, data on the system clipboard) using the text processing capabilities provided by the various transduction engines at the bottom of Figure 1.

Transduction Engines

This feature node contains the different transduction engine components that provide text processing capabilities at the lowest layer of Figure 1.

Most users should accept the defaults for this feature to ensure that the proper transduction engines are installed. Otherwise, you must make sure you have the required transduction engines installed for the different text processing tasks you want to do.

Examples:

  • If you intend to do encoding conversions, you probably need to install the TECkit and/or Consistent Changes (CC) transduction engines.
  • If you want to use an ICU transliterator, you need to install the IBM Components for Unicode transduction engine.
  • If you want to write Perl expressions or call Python script functions for text processing, you need to install one or both of those transduction engines (both of which require separately available program distributions; see below).

Maps and Tables

This feature node contains several groupings of instances of conversion maps and tables (e.g. for TECkit and/or CC) which provide the input to the transduction engines (e.g. the SIL IPA93<>UNICODE map)

A few of the subfeature items are useful for all users, such as the Basic Converters and ICU Transliterators sets. Beyond those, you can install only the converter sets you expect to need (e.g. based on your entity).

If you would like to add a package of converters to the SIL Converters’ installer, contact .

Additional TECkit applications

Since the SILConverters installer installs TECkit (a subfeature of the Transduction Engines feature node discussed above), this feature node adds the rest of the content of the TECkit download from the TECkit site (i.e. the documentation and other TECkit client applications). A new TECkit Map Unicode Editor, also available from this feature node, assists in the creation of TECkit maps.

The following sections describe the sub-features available in each of these four nodes.

SIL Converters’ client applications

Figure 4. SIL Converters’ Client Applications



This installer directly installs the following SIL Converters client applications (see Figure 1).

The FieldWorks and AdaptIt client applications have separate install programs.

Bulk SFM Converter

Use this application to convert the data in Standard Format Marker (SFM) fields using converters from the EncConverters’ repository and to convert the encoding of data in Shoebox, Toolbox, and Paratext (SFM) documents. You can also open multiple SFM documents for processing at the same time.
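
As a rough illustration of what converting the data in SFM fields involves, the sketch below parses lines of the form `\marker data` and applies a converter only to the fields whose markers are selected. The function name, the sample markers and the stand-in converter (a simple uppercase transform) are assumptions made for this example; the real application uses converters from the EncConverters’ repository.

```python
def convert_sfm(text, markers, convert):
    """Apply `convert` to the data of each SFM field whose marker is in `markers`."""
    out = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # Split "\lx kaba" into the marker "\lx" and the data "kaba".
            marker, sep, data = line.partition(" ")
            if marker[1:] in markers:
                data = convert(data)
            out.append(marker + sep + data)
        else:
            # Lines without a marker are left unchanged in this sketch.
            out.append(line)
    return "\n".join(out)


sample = "\\lx kaba\n\\ge house\n\\ps n"
sfm_result = convert_sfm(sample, {"lx"}, str.upper)
print(sfm_result)
```

Only the `\lx` field is converted here; the `\ge` and `\ps` fields pass through untouched, which is the point of selecting fields by marker.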

To use

  • Click Start… / All Programs / SIL Converters / Bulk SFM Converter.
  • For context sensitive help, click  ?  (top right program window) and then click the table in the main window.

Figure 5. SFM File Converter



Clipboard Encoding Converter

After starting up the Clipboard EncConverter, click the icon on the Windows Task Bar to convert text copied to the Windows clipboard.

To start it up

Clipboard Encoding Converter is a utility that you access from the Windows Task Bar. To use it, you first need to start it up.

  • To start it only when you need it: Click Start… / All Programs / SIL Converters / Clipboard EncConverter.
  • To start it every time you start Windows: Add a shortcut to Clipboard EncConverter to your All Programs / Startup folder.

To use

Clipboard Encoding Converter can be used in two ways:

  • Clipboard mode: To convert text that you've copied using  Copy  or  Ctrl  +  C  (in most applications), which you can then paste using  Paste  or  Ctrl  +  V .
  • SpellFixer mode: To add spelling corrections to a SpellFixer project from applications other than just Word.

Figure 6. EncConverter Clipboard Mode popup



To convert text using the Clipboard Mode

  1. Copy some text by selecting it and pressing  Ctrl  +  C 
  2. Right-click the Clipboard EncConverter Icon in the Windows task bar (System Tray) (see red arrow in Figure 6). Result: the available conversions will be listed in the top section of the popup window. If Preview is checked, a sample of your copied text will appear as it will be converted to the right of each conversion name.
  3. In the popup menu, select the conversion you want.
  4. Paste your converted text where you want it.

SpellFixer Mode

  1. Left-click the system tray icon to turn on One-click SpellFixer mode. Tip: This allows you to add spelling corrections from any application (instead of just from within Word while using the SpellFixer document template).
  2. Select the misspelled word in your application and copy it to the clipboard ( Ctrl  +  C ).
  3. Left-click the Clipboard EncConverter icon in the system tray, select the project to enable, and type the replacement spelling for the word on the clipboard. Result: The word will be corrected on the clipboard and the correction you entered will be added to the SpellFixer database for the selected project.
  4. Paste the corrected word into your application. Tip: For more details, see the SpellFixer.dot Word document template (in Word, open a new document using the SpellFixer.dot template).

XML Data Converter

Use this application to convert the data (attributes or elements) in an XML document using converters from the EncConverters’ repository, for example:

  • to convert the data in an AdaptIt Knowledge Base from a legacy encoding to Unicode, or
  • to convert the text in a Word document saved as XML. (This requires some knowledge of WordML and XPath syntax.)
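
The idea of converting only selected elements or attributes can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath; the element and attribute names (`KB`, `TU`, `RS`, `a`) are hypothetical and are not taken from the actual AdaptIt Knowledge Base format, and the uppercase transform stands in for a real converter.

```python
import xml.etree.ElementTree as ET


def convert_xml(xml_text, path, convert, attribute=None):
    """Convert the text (or one attribute) of every element matching `path`."""
    root = ET.fromstring(xml_text)
    for elem in root.findall(path):
        if attribute is None:
            if elem.text:
                elem.text = convert(elem.text)
        elif attribute in elem.attrib:
            elem.set(attribute, convert(elem.get(attribute)))
    return ET.tostring(root, encoding="unicode")


doc = '<KB><TU><RS a="kitabu">book</RS></TU></KB>'
converted_attr = convert_xml(doc, ".//RS", str.upper, attribute="a")
converted_text = convert_xml(doc, ".//RS", str.upper)
print(converted_attr)
print(converted_text)
```

The first call changes only the `a` attribute and the second changes only the element text, mirroring the choice between attributes and elements described above.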

To use

  1. Click Start… / All Programs / SIL Converters / XML Data Converters.
  2. For context sensitive help, click  ?  (top right of program window) and then click the different sections of the main window.

Figure 7. XML Data Converter Mappings



MS Word converters

You can use SILConverters directly in MS Word. The converters are macros contained in three Microsoft Word document templates (DOTs). These macros use the EncConverters repository to accomplish different tasks.

  • Data Conversion Macro in Data Conversion Macro xxxx.dot
  • SpellingFixer in SpellFixer.dot
  • Consistency Spelling Checker in Consistent Spelling Checker xxxsc.dot

If you select the WordDOTs feature node, the SILConverters’ installer will put these templates into your Templates folder (normally C:\Documents and Settings\<user>\Application Data\Microsoft\Templates).

To use

  1. To access the document template clients from within Microsoft Word, click Tools… / Templates and Add-Ins. Result: The Templates and Add-Ins dialog box will be displayed.
  2. In the Templates and Add-Ins dialog box, click  Add . Result: Your Templates folder will be opened.
  3. Select the Word .dot file you want to use.

Note

If multiple users on the machine want to use the document templates, you need to manually move the .DOT files to a common location, and each user will need to browse for them individually in Tools / Templates and Add-Ins. If you want one or more of these document templates to load automatically when Word starts, move them to the current user’s Startup folder (i.e. C:\Documents and Settings\<user>\Application Data\Microsoft\Word\STARTUP) or, for all users, to the global startup folder (e.g. C:\Program Files\Microsoft Office\OFFICE11\STARTUP).

Data Conversion Macro

Use the Data Conversion Macro to convert text in any arbitrary Word document based on Font name, Style, or even the current selection using converters from the EncConverters’ repository. It also supports SFM documents. Open the document template for more instructions.

Figure 8. Data Conversion Macro dialog



SpellingFixer

Use the SpellingFixer document template to correct misspelled words or make certain orthographic changes based on a user-defined database of bad-good spelling pairs. This is particularly useful when you want

  • to condition spelling changes to be at either a beginning or ending word boundary, or
  • to convert a portion of a word (as opposed to full word forms).
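
To make the idea of boundary-conditioned bad-good pairs concrete, here is a minimal sketch using Python regular expressions. It is not how SpellingFixer stores or applies its rules (it generates CC tables); the function names and sample pairs are assumptions for illustration only.

```python
import re


def build_rule(bad, good, at_start=False, at_end=False):
    """Compile one bad-good pair, optionally tied to a word boundary."""
    pattern = re.escape(bad)
    if at_start:
        pattern = r"\b" + pattern
    if at_end:
        pattern = pattern + r"\b"
    return re.compile(pattern), good


def apply_rules(text, rules):
    """Apply each rule in turn to the whole text."""
    for pattern, good in rules:
        # A function is used so the replacement is always taken literally.
        text = pattern.sub(lambda match, good=good: good, text)
    return text


rules = [
    build_rule("ph", "f", at_start=True),                  # word-initial only
    build_rule("teh", "the", at_start=True, at_end=True),  # whole word only
]
fixed = apply_rules("teh phone graph", rules)
print(fixed)
```

The first rule changes part of a word but only at the beginning of the word, so "phone" is affected while "graph" is not; the second rule matches full word forms only.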

Figure 9. Enter correction rules dialog



Once you have a database of such spelling fixes (or consistent changes), use one of the Correct Whole Document menu commands to go through all the words in the document to search for misspelled words. See the document template for instructions.

Consistency Spelling Checker

Use the Consistency Spelling Checker document template for a simple way of working with data (in any language, and any script) in Microsoft Word documents, Plain Text files or any Toolbox database to:

  • Check consistency of spelling (semi-automatically) based on linguistic principles
  • Apply global spelling changes:
    • to multiple documents which are currently open
    • by generating a CC table of changes to be applied to one or more plain text databases (such as Toolbox files)
  • Create a character inventory with frequency count
  • Create unique wordlists from one or more Word documents as:
    • a Word document table with frequency counts, or
    • a Toolbox (MDF-formatted) database for starting a lexicon
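
The character inventory and word list features amount to frequency counts. The following sketch shows the idea with Python's `collections.Counter`; the function names and sample text are assumptions, and the real template works on Word documents rather than plain strings.

```python
from collections import Counter
import re


def character_inventory(text):
    """Count every non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def word_list(text):
    """Count each distinct word form, ignoring case."""
    return Counter(re.findall(r"\w+", text.lower()))


sample_text = "Kaba kaba kabba"
chars = character_inventory(sample_text)
words = word_list(sample_text)
print(chars.most_common())
print(words.most_common())
```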

Figure 10. Spelling inconsistency parameter dialog



Note

This tool is not a full-fledged spelling checking tool. It does not use language-specific dictionaries, and therefore knows nothing about the languages it checks. It is only a consistency checking tool based on phonological similarity, or sets of user-defined ambiguous characters.
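
One way to picture consistency checking with user-defined ambiguous characters is to treat the characters in each set as interchangeable and then report word forms that become identical. The sketch below is an assumption about the general approach, not a description of the template's actual algorithm; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(words, ambiguous_sets):
    """Group words that differ only by characters from the same ambiguity set."""
    # Map every character in a set to a single representative.
    canon = {}
    for group in ambiguous_sets:
        representative = min(group)
        for ch in group:
            canon[ch] = representative

    groups = defaultdict(set)
    for word in words:
        key = "".join(canon.get(ch, ch) for ch in word)
        groups[key].add(word)

    # Only groups with more than one spelling are suspicious.
    return [sorted(group) for group in groups.values() if len(group) > 1]


suspects = find_inconsistencies(
    ["bata", "pata", "bada", "kilo"],
    [{"b", "p"}, {"t", "d"}],
)
print(suspects)
```

The tool only flags candidate groups; deciding which spelling is correct remains a human judgment, consistent with the note above.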

Prerequisites

The Spelling Consistency Checker macro requires that you install this software:

SIL Converters’ Transduction Engines

Several of the transduction engines in Figure 1 are provided by the EncConverters’ repository object itself (i.e. the code page converter, the AdaptIt Knowledge Base Lookup, and the Compound and Primary-Fallback meta converters) and are always available. The rest depend on external programs (SIL and other Open Source programs) and installation is optional, depending on your need.

Most end users will not need to concern themselves with these details except to be sure that the necessary transduction engine is installed for the converters they want to use. Chances are that someone in your entity has already created a map file that you can use to convert the encoding of your data. In this situation, you need to be sure that you install the proper transduction engine required by the map or table that implements the conversion you want.

Figure 11. Optional SIL Converters’ Transduction Engines



TECkit

Other applications use TECkit, a low-level toolkit, to perform encoding conversions (e.g., when importing legacy data into a Unicode-based application). The primary component of the TECkit package is a library that performs conversions. This is the “TECkit engine”. The engine relies on mapping tables in a specific binary format (see TECkit documentation). A compiler creates such tables from a human-readable mapping description (a simple text file).

In EncConverters, you can select either the compiled *.tec file or the uncompiled, human-readable *.map file to be the converter. If you choose the latter, EncConverters will automatically recompile the .tec file when it is out of date and the converter is used to convert data.
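
How EncConverters decides that a .tec file is out of date is not documented here. A common approach, assumed in this sketch, is to compare file modification times; the function name is hypothetical.

```python
import os


def needs_recompile(map_path, tec_path):
    """Return True when the .tec file is missing or older than the .map file."""
    if not os.path.exists(tec_path):
        return True
    return os.path.getmtime(tec_path) < os.path.getmtime(map_path)
```

Under this assumption, editing and saving the .map file is enough to trigger recompilation the next time the converter is used.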

See Adding converters: TECkit map below for details about adding TECkit maps to the system repository.

Consistent Changes (CC)

Use Consistent Change tables to find all occurrences of specified characters, words, or phrases in a string of text, and then change them in a consistent way. The change may be done in every occurrence or only when certain conditions are met.

CC is like the find-and-replace feature in a text editor, but much more powerful. It allows you

  • to make changes which take context into consideration, and
  • to make a whole set of changes at once.
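
The two points above can be illustrated with a short sketch. It uses Python regular expressions rather than CC table syntax, so it shows the concept only; the function name and the sample changes are assumptions. All changes are applied in a single pass, so the output of one change is never reprocessed by another.

```python
import re


def make_changer(changes):
    """Build a function that applies all (pattern, replacement) pairs in one pass."""
    combined = re.compile(
        "|".join("(?P<r%d>%s)" % (i, pattern) for i, (pattern, _) in enumerate(changes))
    )

    def change(text):
        def replace(match):
            # The name of the matching group tells us which change fired.
            index = int(match.lastgroup[1:])
            return changes[index][1]

        return combined.sub(replace, text)

    return change


change = make_changer([
    (r"n(?=[pb])", "m"),  # context: n becomes m only before p or b
    (r"a", "e"),          # these two changes swap a and e without
    (r"e", "a"),          # one undoing the other
])
changed = change("anpa tena")
print(changed)
```

A plain find-and-replace run three times in sequence would turn every "a" into "e" and then back into "a"; applying the set at once avoids that.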

SpellFixer is also available. This is a user-friendly graphical user interface for creating consistent change tables. This interface is primarily available via the SpellFixer.dot Microsoft Word document template mentioned above in SILConverters’ client applications.

See Adding converters: Consistent Changes (CC) below for details about adding CC tables to the system repository.

International Components for Unicode (ICU) 3.4

This feature contains three distinct EncConverters-related features, as well as other features of ICU used by other client applications. They must be installed as a unit.

For SILConverters, three transduction engines are included in this feature:

  • ICU Transliterators: provides a series of transliterators for various ranges of Unicode (e.g. Devanagari to Latin) as well as the ability to write custom rules for doing transliteration. See  http://icu.sourceforge.net/userguide/Transform.html for more details on the use and syntax of ICU Transliterators.
  • ICU Converters: provides comprehensive character set conversion services, mapping tables, and implementations for many encodings. Since ICU uses Unicode (UTF-16) internally, all converters convert between UTF-16 (with the endianness according to the current platform) and another encoding. This converter includes other Unicode encodings. These are typically of more interest to programmers than end-users. See  http://icu.sourceforge.net/userguide/conversion.html for more details on ICU converters.
  • Regular Expression: provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. See  http://icu.sourceforge.net/userguide/regexp.html for more details on the syntax of ICU Regular Expressions.

Perl Expressions 5.8.7

The Perl Expressions 5.8.7 transduction engine allows you to write Perl expressions to do text processing in EncConverter client applications.

Note

This feature requires a separate Perl 5.8.7 distribution to be installed.

The Perl plug-in has been tested with the following freely available Perl distributions: ActiveState Perl ( http://www.activestate.com/solutions/perl/ ) or PXPerl.

See below for a known issue with the PXPerl distribution.

Also note that this plug-in will not work (yet) with the most recent v5.8.8 distribution.

Python Script Functions 2.4

The Python Script Functions 2.4 transduction engine allows you to do text processing using Python functions in EncConverter client applications.
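
The kind of function such a converter might call can be as simple as the sketch below: it takes the input string and returns the processed string. The function name and the exact calling convention are assumptions for illustration, since the interface expected by the transduction engine is not described here. The code avoids newer syntax so that it also reads naturally for older Python versions.

```python
def strip_punctuation_and_lowercase(text):
    """Keep letters, digits and whitespace, then lowercase the result."""
    kept = [ch for ch in text if ch.isalnum() or ch.isspace()]
    return "".join(kept).lower()


processed = strip_punctuation_and_lowercase("Hello, World!")
print(processed)
```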

Note

This feature requires a separate Python 2.4 distribution to be installed.

The Python plug-in has been tested with the following freely available Python distributions:  ActiveState Python or  Python.org.

Note

This plug-in will not work (yet) with the most recent v2.5 distribution.

SIL Converters’ Maps and Tables

Most end-users are interested only in a small number of encodings. Typically, computer support people have created TECkit maps and/or CC tables for the various encodings used in each entity, relieving most end-users of the need to create their own maps and tables.

Because there are hundreds of possible encoding converters and transliterators that different end-users may be interested in, they are packaged into logically-related groups of converters and are available via a two-step process.

Steps

  1. Use the SILConverters’ installer to install the package(s) of converters likely to be useful to you (e.g. based on your entity). Result: During installation, all the converter maps/tables in the selected package(s) will be installed into a fixed location on your computer (i.e. C:\Documents and Settings\All Users\Application Data\SIL\SILConverters22\MapsTables).
  2. Use the Converter Installer application to install the few converters you want into the EncConverters’ repository. Result: They become available to EncConverters’ client applications.

Note

Installing maps and tables onto your computer with the SILConverters’ installer (step 1 above) will not make them available to EncConverters’ client applications unless you explicitly add them to the EncConverters’ repository using the Converter Installer or some other mechanism (see Adding converters).

Figure 12. Available optional maps and tables



The following sections give the details about fonts and encodings for the different maps and tables:

Basic Converters

Converters and Transliterators common to all SIL. This includes the following:

Converter Name | Encoding Name | Font Names
SIL IPA93<>UNICODE | SIL-IPA93-2001 | SILDoulos IPA93, SILManuscript IPA93, SILSophia IPA93
SIL-IPA-1990<>UNICODE | SIL-IPA-1990 | SILDoulosIPA, SILManuscriptIPA, SILSophiaIPA
SIL Galatia <>UNICODE | SIL-GREEK_GALATIA-2001 | SIL Galatia
ISO-8859<>UNICODE | ISO-8859-1 |
AMER PHON>UNICODE | (SIL)-Amer_Phon_SILDoulosL3-(2005) |
SIL PUA 3.2<>UNICODE 4.1 | |
SIL PUA 3.2<>UNICODE 5.0 | |
Symbol<>cp1252 | |
UTF8<>UTF16 | |
ReverseString | For reversing the bytes of a “narrow” (bytes) string |
null | No change to string, but can be used to apply a different font to some text (e.g. in the Data Conversion Macro) |
NFC | Convert to normal form composed |
NFD | Convert to normal form decomposed |
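
The NFC and NFD entries in the table above refer to the Unicode composed and decomposed normalization forms. As a small illustration, independent of SIL Converters itself, Python's standard `unicodedata` module shows the difference between the two forms for the letter é:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print([hex(ord(ch)) for ch in nfd])
print([hex(ord(ch)) for ch in nfc])
```

Both forms display identically, which is why normalizing to one form is useful before comparing or searching text.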
ICU Transliterators

Configuration information for the following ICU transliterators is for Unicode encodings only.

Note

These are not the only transliterators available via the ICU Transliterator transduction engine, but are only a few of the pre-defined latinizing (or romanizing) transliterators that can be useful in different client applications for different ranges of Unicode.

  • Devanagari to Latin (aka. Devanagari-Latin)
  • Bengali to Latin (aka. Bengali-Latin)
  • Gujarati to Latin (aka. Gujarati-Latin)
  • Gurmukhi to Latin (aka. Gurmukhi-Latin)
  • Kannada to Latin (aka. Kannada-Latin)
  • Malayalam to Latin (aka. Malayalam-Latin)
  • Oriya to Latin (aka. Oriya-Latin)
  • Tamil to Latin (aka. Tamil-Latin)
  • Telugu to Latin (aka. Telugu-Latin)
  • Arabic to Latin (aka. Arabic-Latin)
  • Cyrillic to Latin (aka. Cyrillic-Latin)
  • Greek to Latin (aka. Greek-Latin)
  • Han to Latin (aka. Han-Latin)
  • Hangul to Latin (aka. Hangul-Latin)
  • Hebrew to Latin (aka. Hebrew-Latin)
  • Hiragana to Latin (aka. Hiragana-Latin)
  • Katakana to Latin (aka. Katakana-Latin)
  • Jamo to Latin (aka. Jamo-Latin)
  • NumericPinyin to Latin (aka. NumericPinyin-Latin)
  • Any to Latin (aka. Any-Latin)

Note

These transliterators can be daisy-chained together to transliterate between non-Latin scripts using a Compound meta-converter. For example, chaining the ‘Devanagari-Latin’ transliterator (in the Forward direction) with the ‘Arabic-Latin’ transliterator (in the Reverse direction) gives a ‘Devanagari-Arabic’ transliterator.
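
The chaining described in this note can be sketched as function composition. The toy tables below map only two consonants each and ignore details such as inherent vowels, so they are in no way a substitute for the real ICU rules; the function names and the tables are assumptions for illustration.

```python
def make_transliterator(table):
    """Build a two-way transliterator from a native-to-Latin character table."""
    forward = dict(table)
    reverse = {latin: native for native, latin in table.items()}

    def run(text, direction="forward"):
        mapping = forward if direction == "forward" else reverse
        return "".join(mapping.get(ch, ch) for ch in text)

    return run


def compound(*steps):
    """Chain (transliterator, direction) steps, feeding each output to the next."""
    def run(text):
        for transliterator, direction in steps:
            text = transliterator(text, direction)
        return text

    return run


# Toy single-character tables, for illustration only.
devanagari_latin = make_transliterator({"\u0915": "k", "\u092e": "m"})
arabic_latin = make_transliterator({"\u0643": "k", "\u0645": "m"})

devanagari_arabic = compound(
    (devanagari_latin, "forward"),
    (arabic_latin, "reverse"),
)
chained = devanagari_arabic("\u0915\u092e")
print(chained)
```

Latin serves as the shared intermediate form: the first step goes forward into Latin and the second runs in reverse out of it.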

FindPhone to IPA converters

Adds the following converters for dealing with FindPhone encoded data:

  • FindPhone>SAG IPA93
  • FindPhone>UNICODE

SAG Indic

Contains encoding converter map(s) for the following encoding/font:

Converter Name | Encoding Name | Font Names
Annapurna<>UNICODE | SIL-ANNAPURNA_05-2002 | Annapurna

  • Includes an IPA transliterator for Unicode Devanagari to Unicode IPA (phonetic)

Cameroon

Contains encoding converter map(s) for the following encoding/fonts:

Converter Name | Encoding Name | Font Names
Cameroon <>UNICODE | Cameroon | Cam Cam SILDoulosL, Cam Cam SILSophiaL, Cam Cam SILManuscriptL, Cam2 Cam2 SILDoulos, Cam2 Cam2 SILSophia, Cam2 Cam2 SILManuscript, Cam Paratext SILDoulos, Cam Paratext SILSophia, Cam Paratext SILManuscript

West Africa

Contains encoding converter map(s) for the following encoding/fonts:

Converter Name | Encoding Name
SIL-93linb-2005<>UNICODE | SIL-93linb-2005
UBS-Abidjan-2005<>UNICODE | UBS-Abidjan-2005
Bambara SIL Charis<>UNICODE | Bambara SIL Charis
SIL-BF Font Family-2005<>UNICODE | SIL-BF_Font_Family-2005
SIL-BF_Times-2006<>UNICODE | SIL-BF_Times-2006
X-SIL-Fulfulde<>UNICODE | X-SIL-Fulfulde
SIL-Ghana Doulos-2005<>UNICODE | SIL-Ghana_Doulos-2005
SIL-Mali Standard Font Family<>UNICODE | Mali Standard SILDoulos-2005
RCI Standard Doulos/Sophia/Manuscript<>UNICODE | SIL-RCI Standard-1994
X-SIL-Senufo<>UNICODE | X-SIL-Senufo
SIL-Karaboro-2006<>UNICODE | SIL-Karaboro-2006
SIL Samogho Doulos/Sophia/Manuscript<>UNICODE | SIL-Samogho-2006
SIL-Songhai-2006<>UNICODE | SIL-Songhai-2006
Tombouctou-Dutch<>UNICODE | SIL-Tombouctou-Dutch-2006
Burkina Faso Winye-2003<>UNICODE | SIL-Burkina_Winye_Unknown_Font-2005

Eastern Congo Group

Contains encoding converter map(s) for the following encoding/fonts:

Converter Name | Encoding Name
Times African<>UNICODE | Times African
NdrunaASCII<>UNICODE | NdrunaASCII
Mayogo<>UNICODE | Mayogo
Komo<>UNICODE | Komo
KomoASCII to Unicode | KomoASCII
ECG<>UNICODE | ECG-Unicode(Jan.2005)
BuduASCII<>UNICODE | BuduASCII
BUDU<>UNICODE | BUDU
BheleASCII<>UNICODE | BheleASCII
Bantu Und<>UNICODE | Bantu Und

NLCI (India)

Contains encoding converter map(s) for the following encoding/font:

Converter Name | Encoding Name | Font Names
SL Oriya<>UNICODE | NLCI-SLOriya |
Winscript/iLeap Devanagari<>UNICODE | CDAC-ISFOC_DEVANAGARI | DEV Panini
Winscript/iLeap Gujarati<>UNICODE | CDAC-ISFOC_GUJARATI | GUJ Gir
Winscript Malayalam<>UNICODE | NLCI-Malayalam | MAL Vayalar
Winscript Oriya<>UNICODE | NLCI-Oriya | ORI Asika
Winscript Tamil<>UNICODE | NLCI-Tamil | TAM Thiruvalluvar
Winscript Telugu<>UNICODE | NLCI-Telugu | TEL Nirmal

Additional TECkit applications

TECkit Map Unicode Editor

The TECkit map Unicode Editor is one more EncConverters’ client application mentioned in Figure 1. Use this program to develop TECkit maps for encoding conversion or other text processing applications (e.g. Transliteration).

Steps

  1. Install this application by selecting the TECkit Map Unicode Editor feature under the Additional TECkit applications feature node.
  2. Start the program by clicking Start… / All Programs / SIL Converters / TECkit / TECkit Map Unicode Editor.

Figure 13. TECkit Map Unicode Editor



  3. Type (or paste) text into the Sample boxes in the lower left window. Result: The code point values and/or names of the characters in the string will be displayed in the table in the upper right.
  4. Click the cells in that table to insert those values into the map (in the upper left).
  5. To save your map to the default system repository, click File... / Add to System Repository, navigate to C:\Program Files\Common Files\SIL\MapsTables, and click  Save . Result: this will bring up the TECkit map transduction engine dialog box.

Tips

  • If you are creating a map involving a legacy encoding, the glyphs of the font for that encoding will be shown in the table in the lower right. Click to insert their values into the map or  Ctrl  + Click to insert them into the Sample boxes.
  • The map is automatically compiled as you make changes and you can click errors in the compiler results window (Figure 13, extreme top left) to jump to problem statements.
  • The program will automatically convert the data in the Left-side Sample box (in the forward direction) in order to check conversion as you work on it. It will also convert the Right-side Sample (in the reverse direction) in order to check the round-trip capability of the map.
  • Context-sensitive help is available—select a portion of the window and press  F1  or click  ?  (right-hand window).

Final dialog windows

  1. When you have finished making selections, click  Next  again to begin installing and/or uninstalling.
  2. When you see the SIL Encoding Converters 2.5 has been successfully installed message, click  Finish .
  3. Add any new converters to the Encoding Converters repository:
    1. If you run the SIL Encoding Converters 2.5 Setup .msi file directly, the program will close and you will need to add any new converters to the Encoding Converters system repository. See Adding Converters to the System Repository.
    2. If you run SIL Encoding Converters 2.5 Setup from the Master Installer, the Master Installer sequence will continue. If this is not an initial installation, you may cancel the SIL Converters for Office 2003 Setup (see the installation documentation) and proceed to the Converter Installer to add any new converters to the Encoding Converters repository.

Adding Converters to the System Repository

There are two primary ways of adding converters to the System Repository, by using either the

  • Converter Installer or
  • Transduction Engine configuration dialogs.

Converter Installer

If the converter you want to install into the system repository comes as part of the Maps and Tables features in the SILConverters installer (e.g. the SIL IPA93<>UNICODE converter that comes as part of the Basic Converters package), you can install it into the system repository by running the Converter Installer application.

How to get there

  • When running the Master Installer, this utility automatically launches as the last item in the installer sequence.
  • From the Windows taskbar, click Start… / All Programs / SIL Converters / Launch Converter Options Installer.
  • From the Clipboard EncConverter popup by selecting Launch Converter Installer.

Figure 14. Converter Installer



Installing converters

  • To install one or more of these converters into the system repository, check the box next to the converters you want and click  Commit .
  • To remove converters from the system repository, clear the check box and click  Commit .

For detailed instructions, see the Converter Installer section in the installation documentation.

Choose a Transduction Engine dialog box

If you have your own converter map (e.g. created with the TECkit map Unicode Editor) or one given to you not as part of an installer feature, you can add it to the system repository via the Choose a Transduction Engine dialog box.

Figure 15. Choose Transduction Engine dialog box



  • Select the item from the list that matches the type of converter you want to add and click  Create .

How to get there

  • If you are running the Clipboard EncConverter application, right-click the icon in the system tray and choose the Add Converter item.
  • If you are using AdaptIt, Data Conversion Macro, Bulk SFM Converter, XML Data Converter, or SpellFixer; open the Select Converter dialog window (Figure 16) and click  Add New .

Figure 16. Select Converter dialog



Transduction Engine Details

TECkit map

  • To add a TECkit map to the system repository, select TECkit map from the Transduction Engine dialog box, and click  Create . Result: The TECkit map Setup dialog will be displayed.

Figure 17. TECkit Setup



  1. Browse with the  ...  button for the TECkit .map or .tec file.
  2. To permanently add the converter to the System Repository, click  Save in System Repository .
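  3. Click Test Area to test the converter with some sample data.

Consistent Changes (CC)

  • To add a CC table to the system repository, select Consistent Changes (CC) from the Transduction Engine dialog box, and click  Create . Result: The CC table Setup dialog will be displayed.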

Figure 18. CC Table Setup



  1. Browse with the  ...  button for the CC table file.
  2. For CC Table expects and CC Table returns, select the desired encoding and click  Apply .

Tip:

If it expects Unicode-encoded data, select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.

  3. If you installed the SpellFixer plug-in, click Add or Edit a SpellFixer CC Table. Result: This will allow you to create or edit an existing SpellFixer project.

Tip:

Though it is primarily a Microsoft Word-based tool, you can use the SpellFixer application to create a CC table. Use the SpellFixer graphical user interface to configure Bad Spelling and Good Spelling couplets, which are then put into a CC table. The Microsoft Word document template also has macros for processing the text in a file word by word, so you can use it in a Find First/Next fashion to correct spelling errors. The SpellFixer.dot file has further usage information.

  4. To permanently add the converter to the System Repository, click  Save in System Repository .

Note

You do not need to add a SpellFixer project to the System Repository, since it will be added automatically by the Project editor.

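  5. Click Test Area to test the converter with some sample data.

ICU Transliterator

  • To add an ICU transliterator to the system repository, select ICU Transliterator from the Transduction Engine dialog box, and click  Create . Result: The ICU Transliterator Setup dialog will be displayed.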

Figure 19. ICU Transliterator Setup



  1. To use one of the built-in transliterators, choose Built-in transliterator.
  2. To write a custom transliterator using the syntax described on the webpage referred to above, choose Custom transliterator and enter the transliterator syntax in the box. Result: Previous Custom Rules will be enabled, showing examples of useful custom rules and others that you wrote (Figure 19, number 3).
  3. Click  Delete  to remove unwanted rules.
  4. To permanently add the converter to the System Repository, click  Save in System Repository .
  5. Click Test Area to test the converter with some sample data.
ICU Converters

The ICU Converter Setup dialog will be displayed:

Figure 20. ICU Converter Setup



  1. Choose the desired converter from the drop-down combo box as shown above.
  2. If you want the converter to be permanently added to the System Repository, then you must click  Save in System Repository .
  3. You can click the Test Area tab to test the converter with some sample data.
Regular Expression Find and Replace (ICU)

ICU's Regular Expressions package provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. See  http://icu.sourceforge.net/userguide/regexp.html for more details on the syntax of ICU Regular Expressions.

  • If you want to add an ICU Regular Expression Find and Replace converter to the system repository, choose Regular Expression Find and Replace (ICU) from the Transduction Engine dialog box and click  Create .

The Regular Expression Find and Replace (ICU) Setup dialog will be displayed:

Figure 21. ICU Regular Expression Setup



  1. In the Search for box, enter the Regular Expression search string you want to use. Tip: The search string can contain ICU Regular Expression Meta Characters and Operators defined below.
  2. Click the right wedge. Result: You will see a pop-up list of commonly used Regular Expression search operators. If selected, they will be inserted into Search for.

Figure 22. Commonly used regular expressions pop-up



  • 3. In the Replace with box, enter the string or operator that represents the text to replace the Search for string (see Replacement Text defined below).
  • 4. Check the Ignore Case box to have the ICU search algorithm ignore the case of the input text.
  • 5. The Previous Searches combo box includes a few example Regular Expressions and remembers any new ones you add. Click  Delete  to remove the selected search item.
  • 6. If you want the converter to be permanently added to the System Repository, then you must click  Save in System Repository .
  • 7. Click the Test Area tab to test the converter with some sample data.
Regular Expression Metacharacters
Character Description
\a Match a BELL, \u0007.
\A Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a [Set] Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. For better word boundaries, see  ICU Boundary Analysis.
\b, within a [Set] Match a BACKSPACE, \u0008.
\B Match if the current position is not a word boundary.
\cX Match a control-X character.
\d Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D Match any character that is not a decimal digit.
\e Match an ESCAPE, \u001B.
\E Terminates a \Q ... \E quoted sequence.
\f Match a FORM FEED, \u000C.
\G Match if the current position is at the end of the previous match.
\n Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME} Match the named character.
\p{UNICODE PROPERTY NAME} Match any character with the specified Unicode Property.
\P{UNICODE PROPERTY NAME} Match any character not having the specified Unicode Property.
\Q Quotes all following characters until \E.
\r Match a CARRIAGE RETURN, \u000D.
\s Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S Match a non-white space character.
\t Match a HORIZONTAL TABULATION, \u0009.
\uhhhh Match the character with the hex value hhhh.
\Uhhhhhhhh Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W Match a non-word character.
\x{hhhh} Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh Match the character with two-digit hex value hh.
\X Match a  Grapheme Cluster.
\Z Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z Match if the current position is at the end of input.
\n Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions.
[pattern] Match any one character from the set. See  UnicodeSet for a full description of what may appear in the pattern.
. Match any character.
^ Match at the beginning of a line.
$ Match at the end of a line.
\ Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /

Regular Expression Operators
Operator Description
| Alternation. A|B matches either A or B.
* Match 0 or more times. Match as many times as possible.
+ Match 1 or more times. Match as many times as possible.
? Match zero or one times. Prefer one.
{n} Match exactly n times
{n,} Match at least n times. Match as many times as possible.
{n,m} Match between n and m times. Match as many times as possible, but not more than m.
*? Match 0 or more times. Match as few times as possible.
+? Match 1 or more times. Match as few times as possible.
?? Match zero or one times. Prefer zero.
{n}? Match exactly n times
{n,}? Match at least n times, but no more than required for an overall pattern match
{n,m}? Match between n and m times. Match as few times as possible, but not less than n.
*+ Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match)
++ Match 1 or more times. Possessive match.
?+ Match zero or one times. Possessive match.
{n}+ Match exactly n times
{n,}+ Match at least n times. Possessive Match.
{n,m}+ Match between n and m times. Possessive Match.
( ... ) Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?: ... ) Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?> ... ) Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>"
(?# ... ) Free-format comment (?# comment ).
(?= ... ) Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... ) Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... ) Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
(?<! ... ) Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
(?ismx-ismx: ... ) Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismx-ismx) Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
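The difference between "as many times as possible" and "as few times as possible", and the way look-ahead and look-behind assertions leave the input position alone, can be seen in a short experiment. The sketch below uses Python's `re` module as a stand-in: it is a different engine from ICU, but it shares the operators used here (`*`, `*?`, `(?<= ... )`, `(?= ... )`). The sample text is made up for the illustration.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as possible, so the match runs to the last </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Reluctant: .*? takes as little as possible, stopping at the first </b>.
reluctant = re.search(r"<b>.*?</b>", text).group(0)

# Look-behind and look-ahead check the context without consuming it,
# so only the text between the tags is returned.
inner = re.findall(r"(?<=<b>).*?(?=</b>)", text)

print(greedy)
print(reluctant)
print(inner)
```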
Replacement Text

The replacement text for find-and-replace operations may contain references to capture-group text from the find. References are of the form $n, where n is the number of the capture group.

Character Descriptions
$n The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $.
\ Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\', but may be used on any other character without bad effects.
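As a rough illustration of capture-group references in replacement text, the following sketch swaps two captured words. It uses Python's `re` module rather than ICU, so the group references are written `\2 \1`; in the ICU dialog the corresponding Replace with text would use the `$n` form described above (`$2 $1`). The sample strings are invented for the example.

```python
import re

# Search for: two capture groups, a surname and a given name.
search_for = r"(\w+), (\w+)"

# ICU replacement text would be "$2 $1"; Python's re spells it \2 \1.
replace_with = r"\2 \1"

swapped = re.sub(search_for, replace_with, "Doe, Jane")
print(swapped)

# Rough equivalent of checking the Ignore Case box.
ignore_case = re.sub(r"sil", "SIL", "Sil converters", flags=re.IGNORECASE)
print(ignore_case)
```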
Perl Expression

Note

This feature requires a separate Perl 5.8.7 distribution to be installed.

The Perl plug-in has been tested with the following freely available Perl distributions: ActiveState Perl (http://www.activestate.com/solutions/perl/) or PXPerl.

See "Unable to add converters with PXPerl installed" in the installation document for a known issue with the PXPerl distribution.

The Perl Expression Setup dialog will be displayed:

Figure 23. Perl Expression Setup



  1. This area is for entering your Perl expression.
  2. For Perl expression expects and Perl expression returns, select the desired encoding type and click  Apply . Tip: If your data is Unicode-encoded, select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.
  3. The Previous Expressions combo box includes a few example Perl expressions and remembers any new ones you add. Click  Delete  to remove unwanted expressions.
  4. Click  Distro Config  to set up the path to your Perl Distribution's library folders (e.g. C:\PXPerl, C:\PXPerl\lib, C:\PXPerl\site\lib) and to specify certain Perl modules to be automatically loaded for all expressions (e.g. Win32, or SIL:RTF:Unicode). Tip: If you get an error message saying that some *.pm file can't be found, then you probably don't have the correct path to the file configured.
  5. To permanently add the expression to the System Repository, click  Save in System Repository .
  6. Click the Test Area tab to test the converter with some sample data.
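The tip in step 2 about choosing the right "expects" and "returns" encoding can be illustrated with a small experiment. The sketch below is written in Python, not Perl, and does not involve the plug-in at all; it only shows what happens when Unicode (UTF-8) bytes are interpreted through a code page. Windows code page 1252 is an assumed example of a default system code page.

```python
# A short Unicode string and its UTF-8 byte representation.
text = "\u0938\u0947"          # Devanagari SA + vowel sign E
utf8_bytes = text.encode("utf-8")

# Correct: the bytes are interpreted as Unicode (UTF-8).
right = utf8_bytes.decode("utf-8")

# Incorrect: the same bytes are treated as legacy (byte) data and pushed
# through a code page (cp1252 is only an assumed example).
wrong = utf8_bytes.decode("cp1252", errors="replace")

print(right == text)   # the round trip preserves the text
print(wrong == text)   # the code-page reading does not
print(len(text), len(wrong))
```

Each original character turns into several unrelated Latin characters, which is the kind of incorrect conversion the tip warns about.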
Python Script Function

Note

This feature requires a separate Python 2.4 distribution to be installed.

The Python plug-in has been tested with the following freely available Python distributions:  ActiveState Python or  Python.org.

To add a Python script function converter to the system repository:

The Python Script Setup dialog will be displayed:

Figure 24. Python Script Setup



  1. Browse with the ... button for the Python script file.
  2. If the file contains valid Python functions, then the function names will be put into the Function Name combo box. Choose the function desired.
  3. If you want to pass some static information to your Python function, then define the function as shown below and put the static information in the Additional parameters field. For example, with a Python function prototype of:
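def ChangeLanguage(sLang, uI):
    # sLang comes from the Additional parameters field; uI is the input data.
    if not isinstance(uI, unicode):
        raise UnicodeError(u'Input Data not Unicode! (%s)' % uI)
    # Return the input unchanged unless a recognized language is requested.
    uO = uI
    if sLang == u'Chinese':
        # do some Chinese processing and put result in uO
        uO = ProcessChinese(uI)
    return uO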

The Additional parameters field would be enabled and you could enter the fixed string, Chinese, in order to trigger the script properly.

Note

If you have more than one additional parameter, the static strings should be separated by a semicolon (i.e. ";").
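How several semicolon-separated static strings might line up with a function's leading parameters can be sketched as follows. This is an illustration only, not the plug-in's code: `split_additional_parameters` and `change_language` are hypothetical names, the ordering (static parameters first, input data last) is assumed from the ChangeLanguage(sLang, uI) prototype above, and the sketch is written for Python 3 rather than the Python 2.4 the plug-in requires.

```python
def split_additional_parameters(field_text):
    """Split the Additional parameters text on semicolons."""
    return field_text.split(";")

def change_language(language, variant, text):
    # Hypothetical function taking two static parameters before the input.
    return "[%s/%s] %s" % (language, variant, text)

static_args = split_additional_parameters("Chinese;Simplified")

# Static parameters come first, followed by the data to convert
# (an assumption based on the prototype shown in the documentation).
result = change_language(*(static_args + ["some input"]))
print(result)
```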

  • 4. This area is for indicating how your expression expects and returns data. If your data is Unicode-encoded, be sure to select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.
  • After configuring these four items, click  Apply  to accept the Python script function.

The Setup tab also has the following options:

  • 5. As you move through the functions listed in the Function Name combo box, the Function Prototype window shows the prototype of the selected Python function.

Note

If a particular function allows additional (static) parameters, then the proper order of parameters will also be shown in this window.

  • 6. If you want the Python function to be permanently added to the System Repository, then you must click  Save in System Repository .
  • 7. Click the Test Area tab to test the converter with some sample data.

AdaptIt Knowledge Base Lookup Converter

The AdaptIt Knowledge Base Lookup transduction engine is contained in the EncConverters assembly itself and therefore is always available without requiring an installer selection.

This transducer allows you to do lookups on words in either the adaptation or glossing Knowledge Base of an AdaptIt project.

To add an AdaptIt Knowledge Base Lookup converter to the system repository:

The AdaptIt Knowledge Base Converter Setup dialog will be displayed:

Figure 25. AdaptIt Knowledge Base Lookup Converter Setup



  • 1. Choose the version of AdaptIt (i.e. Non-Roman/Unicode vs. Legacy/Ansi) that you used to create the (XML) Knowledge Base.
  • 2. Choose the project desired in the Projects list box. This is automatically populated from the projects available on the local machine for the current user.

Note

For an AdaptIt Transliteration Project, the transliteration data will be in the normal project Knowledge Base file; not the Glossing Knowledge Base. However, it is possible to access a Glossing Knowledge Base if desired.

Note that if you access an Adaptation Knowledge Base (i.e. from an AdaptIt project used to adapt texts from one language to another—which most likely will contain ambiguities), then the converter will return a string containing all the ambiguities for the given lookup word in the Ample ambiguity format (i.e. %count%form1%form2%...%).

For example, if your AdaptIt Knowledge Base has an ambiguity for the word /से/ 'from, with' in the Source language, which sometimes means /ते/ 'from' and sometimes /कन्‍ने/ 'with' in the target language, then if you attempt to process the word /से/ with this converter, it will return the string /%2%ते%कन्‍ने%/.

If you have such values in a document readable by Microsoft Word, then you can use the Word Pick Document Template to simplify disambiguating these tokens.

  • 3. If you want the converter to be permanently added to the System Repository, then you must click  Save in System Repository .
  • 4. Click the Test Area tab to test the converter with some sample data.
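A client that receives strings in the Ample ambiguity format described above (%count%form1%form2%...%) may want to split them back into their alternative forms. The following Python sketch shows one way to do that; `parse_ambiguity` is a hypothetical helper, not part of SIL Converters, and it assumes the forms themselves never contain a percent sign.

```python
def parse_ambiguity(result):
    """Return the list of forms in a %count%form1%form2%...% string.

    A string that is not in this format is returned as a single form.
    """
    if not (result.startswith("%") and result.endswith("%")):
        return [result]
    fields = result.strip("%").split("%")
    count, forms = int(fields[0]), fields[1:]
    if count != len(forms):
        raise ValueError("count does not match number of forms: %r" % result)
    return forms

print(parse_ambiguity("%2%from%with%"))
print(parse_ambiguity("from"))
```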
Compound “daisy-chained” converters

Two final converter types to be discussed are actually meta converter types; that is, they allow you to combine two or more existing converters in the system repository in a serial or parallel fashion.

The Compound Converter type can be used to combine 2 or more converters together in a serial fashion so that the output of one step will become the input to the next step automatically. This can be helpful when you have multiple, different conversions to apply to your data to get it in the ultimate form you need without requiring separate conversions.

For example, you may have one converter that goes from FindPhone IPA to SIL-IPA93, and another converter that converts from SIL-IPA93 to Unicode IPA. In order to perform the end-to-end conversion from FindPhone IPA to Unicode IPA, you can create a daisy-chain of the two existing converters (a “virtual converter”) so that the data is converted in one step.

Note

When creating or using a compound converter, all n+1 converters must be in the system repository (i.e. the n steps plus the compound converter itself). If you create a compound converter and then subsequently delete one of the steps, it will not work.
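Conceptually, a daisy-chain is function composition: each step's output becomes the next step's input. The Python sketch below illustrates this under stated assumptions; the two step functions are hypothetical stand-ins that do trivial string replacement, not real FindPhone or SIL-IPA93 mappings, and `make_compound` is not part of the EncConverters API.

```python
from functools import reduce

# Hypothetical stand-ins for two converters already in the repository.
def findphone_to_silipa93(text):
    return text.replace("F", "S")

def silipa93_to_unicode(text):
    return text.replace("S", "U")

def make_compound(*steps):
    """Return a converter that feeds each step's output to the next step."""
    def compound(text):
        return reduce(lambda data, step: step(data), steps, text)
    return compound

findphone_to_unicode = make_compound(findphone_to_silipa93,
                                     silipa93_to_unicode)
print(findphone_to_unicode("F-data"))
```

The compound function holds references to its steps, which mirrors the note above: if a step is removed, the chain can no longer run.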

To add a Compound converter to the system repository:

The Compound (daisy-chained) Converter Setup dialog will be displayed:

Figure 26. Compound Converter Setup



  • 1. Use this combo box to choose the converter to become one of the steps of the compound conversion.
  • 2. If the selected converter is bi-directional, then the Reverse box will be enabled allowing you to select the reverse direction, if needed. For example, to go from Devanagari to Arabic, note that the above configured converter goes forward (by default) from Devanagari to Latin. This is followed by a second step that goes in the reverse direction from Latin to Arabic.
  • 3. If you need to explicitly normalize the output data of any step (i.e. to Fully Composed or Fully Decomposed), you can check the Normalization box and the compound converter will do this before continuing with the next step.
  • 4. Click  Add Step  to add it to the queue of steps in the compound converter.
  • 5. This area shows what steps are configured, the direction of conversion, and whether any normalization is requested.
  • 6. If you make a mistake in the steps, click  Remove Steps  to clear out the compound converter and start over.

Note

Compound converters may not be temporary converters.

    • Once you click  Apply , the Enter Converter Name dialog box will appear:

Note

The converter friendly name you enter here is for the Compound converter itself, which is distinct from the names of the converter steps. For Compound converters, the default name will be a concatenation of the individual steps’ names. However, you can change it to something more meaningful if desired (e.g. “Devanagari to Arabic”).

  • 7. Once you complete the previous step, the converter name will be displayed in the Compound converter name box.
  • 8. Click the Test Area tab to test the converter with some sample data.
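The effect of the Normalization option in step 3 can be previewed with Python's standard `unicodedata` module. This sketch assumes that "Fully Composed" and "Fully Decomposed" correspond to the Unicode normalization forms NFC and NFD; `normalize_step_output` is a hypothetical helper, not the converter's own code.

```python
import unicodedata

def normalize_step_output(text, form=None):
    """Optionally normalize a step's output before the next step runs."""
    if form is None:
        return text
    return unicodedata.normalize(form, text)

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

print(normalize_step_output(decomposed, "NFC") == composed)
print(normalize_step_output(composed, "NFD") == decomposed)
print(normalize_step_output(decomposed) == decomposed)
```

Normalizing between steps matters when the next converter's rules are written for only one of the two forms.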
Primary-Fallback Compound Converter

The Primary-Fallback Compound Converter type allows you to specify two existing converters: one to be a primary, and the other, a fallback converter.

The configured primary converter is first called to do a conversion. If the primary converter doesn’t change the data, then, and only then, the fallback converter is called.

This can be useful for transliteration where a character-based transliterator (e.g. TECkit, ICU, or CC) does most of the work, but certain words (or character sequences) are otherwise unpredictable from the context. In this case, you might want a lexicon-based approach to supply the special case forms.

In this scenario, you would configure the lexicon-based transliterator (e.g. a SpellFixer CC table or an AdaptIt Knowledge Base Lookup converter) to be the primary converter and the character-based transliterator as the fallback converter. If the text isn’t modified by the primary converter (i.e. if it isn’t an exception), then the fallback converter is called to do the conversion.
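The primary-fallback rule described above (call the fallback only when the primary leaves the data unchanged) is small enough to sketch directly. In this Python illustration the lexicon and the character rule are invented examples, and `make_primary_fallback` is a hypothetical name rather than part of the EncConverters API.

```python
def make_primary_fallback(primary, fallback):
    """Combine two converters using the primary-fallback rule."""
    def convert(text):
        result = primary(text)
        # The fallback runs only when the primary left the data unchanged.
        if result == text:
            return fallback(text)
        return result
    return convert

# Hypothetical lexicon-based primary converter (exceptions only).
exceptions = {"colonel": "kernel"}

def lexicon_lookup(word):
    return exceptions.get(word, word)

# Hypothetical character-based fallback converter.
def character_rules(word):
    return word.replace("ph", "f")

convert = make_primary_fallback(lexicon_lookup, character_rules)
print(convert("colonel"))
print(convert("phone"))
```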

To add a Primary-Fallback converter to the system repository:

The Primary-Fallback Converter Setup dialog will be displayed:

Figure 27. Primary-Fallback Converter Setup



  • 1. Use this combo box to choose the existing EncConverter which is to be the Primary converter for this conversion (e.g. the lexical lookup converter).
  • 2. If the selected Primary converter is bi-directional, then the Reverse box will be enabled allowing you to select the reverse direction, if needed.
  • 3. Use this combo box to choose the existing EncConverter which is to be the Fallback converter for this conversion (e.g. the character-based transliterator).
  • 4. If the selected Fallback converter is bi-directional, then the Reverse box will be enabled allowing you to select the reverse direction, if needed.

Note

Primary-Fallback converters may not be temporary converters.

    • Once you click  Apply , the Enter Converter Name dialog box will appear:

Figure 28. Enter Converter Name dialog



Note

The converter friendly name you enter here is for the combined Primary-Fallback converter itself, which is distinct from the names of the Primary and Fallback converters.

  • 5. Once you complete the previous step, the converter name will be displayed in the Compound converter name box.
  • 6. Click the Test Area tab to test the converter with some sample data.

Saving the converter in the System Repository

On any of the Transduction Engine Setup dialogs, by default, if you click  Apply  or  OK , the configured converter will be returned to the client application as a temporary converter; once the client application (e.g. FieldWorks or Word) is closed or releases the converter, it will no longer be available. If you want the converter to be permanently available to client applications, then you must explicitly add it to the System Repository using the  Save in System Repository  button (or the  Update in System Repository  button when editing a map).

  • When you click  Save in System Repository , the following dialog box will be displayed to query for a friendly name by which the converter will be known in client applications:

Figure 29. Enter Converter Name dialog box



  • Click  Advanced...  to enter further, optional information about this converter. Result: The Advanced EncConverter Configuration dialog box will be displayed. The information you enter there is also put into the System Repository and can be used by various client applications.

Figure 30. Advanced Configuration dialog box



Though these values are not necessary for the operation of the converter, they can be helpful to various client applications. For example, the Clipboard EncConverter can be configured to filter the list of displayed converters based on the Encoding Name and/or the Transduction Type configured here.

Known Issues

SIL Converters has the option of displaying all fonts in a Word document. However, it will only show you the fonts that are installed on your computer. It does not warn you of any fonts that have been used in the document but are not currently installed on your computer. To find out which fonts the Word document needs that are not on your computer, open Tools / Options in Word, select the Compatibility tab, and click  Font Substitution... .


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.