Short URL: https://scripts.sil.org/SILConverters25_doc
SIL Converters 2.5 documentation
Kent Spielmann, 2007-02-21
SIL Encoding Converters 2.5 Setup
Starting with version 2.5, SIL Converters is available not only from the SIL Software CD-ROMs but also through a new web-based Master Installer that lets you choose the features you want and then install them from the internet. The Master Installer is currently available from this link:
http://downloads.sil.org/EncodingConverters/setup.exe. The Master Installer is a small program that runs on your machine and launches a series of installers:
- Software prerequisites: Necessary system updates and add-ons are installed on your computer.
- SIL Encoding Converters 2.5 Setup: Conversion applications are installed and conversion Maps and Tables are copied to your hard drive.
- SIL Converters for Office 2003: A misnamed installer that installs one additional operating system update.
- Converter Installer: A utility that allows you to activate the conversion Maps and Tables you want to use.
For further instructions on running the Master Installer and on the initial run of SIL Encoding Converters 2.5 Setup, refer to the installation documentation, SIL Converters 2.5 Installation.
The following instructions assume you have already installed SIL Encoding Converters 2.5 and want to understand and use additional features of the program.
Initial dialog windows
Warning
If you run the .msi installer without the Master Installer, you will be shown the following warning:
- If you have already run the Master Installer program and are sure all the prerequisites are installed, you may "accept the risk" and click Next.
Note
If some functionality isn’t working after you install this way, try re-running the installation with the Master Installer first, in order to ensure that your system has the necessary prerequisites.
Application maintenance
Once Encoding Converters is installed, the setup program displays an application maintenance screen that gives you the option to Modify, Repair or Remove your installation.
- To modify your installation simply click Next .
Select Features
The following section briefly describes most of the boxes in Figure 1 and links you to further information about how to use the different utilities and applications for different text transduction tasks. The information in this table is organized around the different components available in the SIL Converters Installer.
Feature overview
As you can see from Figure 3, there are four main categories of features that you can choose from when installing SIL Converters:
SIL Converters’ client application
This feature node contains some of the programs at the top layer of Figure 1, which are generally of the most interest to end users. These programs and utilities allow you to convert text data (e.g. Word documents, SFM documents, XML Documents, data on the system clipboard) using the text processing capabilities provided by the various transduction engines at the bottom of Figure 1.
Transduction Engines
This feature node contains the different transduction engine components that provide text processing capabilities at the lowest layer of Figure 1.
Most users should accept the defaults for this feature to ensure that the proper transduction engines are installed. Otherwise, you must make sure you have the required transduction engines installed for the different text processing tasks you want to do.
Examples:
- If you intend to do encoding conversions, you probably need to install the TECkit and/or Consistent Changes (CC) transduction engines.
- If you want to use an ICU transliterator, you need to install the IBM Components for Unicode transduction engine.
- If you want to write Perl expressions or call Python script functions for text processing, you need to install one or both of those transduction engines (both of which require separately available program distributions—see below)
Maps and Tables
This feature node contains several groupings of instances of conversion maps and tables (e.g. for TECkit and/or CC) which provide the input to the transduction engines (e.g. the SIL IPA93<>UNICODE map).
A few of the subfeature items are useful for all users, such as the Basic Converters and ICU Transliterators sets. Otherwise, you can install only those converter sets you expect to need (e.g. based on your entity).
If you would like to add a package of converters to the SIL Converters’ installer, contact .
Additional TECkit applications
Since the SILConverters installer already installs TECkit (a subfeature of the Transduction Engines feature node discussed above), this feature node adds the rest of the content of the TECkit download from the TECkit site (i.e. the documentation and other TECkit client applications). A new TECkit Map Unicode Editor, which assists in the creation of TECkit maps, is also available from this feature node.
The following sections describe the sub-features available in each of these four nodes.
SIL Converters’ client applications
This installer installs the following SIL Converters client applications directly (see Figure 1).
The FieldWorks and AdaptIt client applications have separate install programs.
Bulk SFM Converter
Use this application to convert the data in Standard Format Marker (SFM) fields using converters from the EncConverters’ repository and to convert the encoding of data in Shoebox, Toolbox, and Paratext (SFM) documents. You can also open multiple SFM documents for processing at the same time.
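As a rough illustration of what per-field conversion involves, the following Python sketch applies a converter function only to the fields whose markers are selected. The function name `convert_sfm`, the sample markers and the stand-in converter (`str.upper`) are assumptions made for illustration; this is not how Bulk SFM Converter itself is implemented.

```python
def convert_sfm(text, converters):
    """Apply a converter to the data of selected SFM fields.

    `converters` maps a marker such as "\\lx" to a function str -> str.
    Fields whose marker is not listed are left unchanged.
    """
    out = []
    for line in text.splitlines():
        # An SFM field is a backslash marker, a space, then the field data.
        marker, sep, data = line.partition(" ")
        convert = converters.get(marker)
        if line.startswith("\\") and convert is not None:
            out.append(marker + sep + convert(data))
        else:
            out.append(line)
    return "\n".join(out)


sample = "\\lx kitabu\n\\ge book\n\\dt 21/Feb/2007"
# Stand-in converter: upper-casing takes the place of a real encoding map.
result = convert_sfm(sample, {"\\lx": str.upper})
print(result)
```

Only the `\lx` field is changed here; the `\ge` and `\dt` fields pass through untouched, which mirrors the idea of choosing a converter per marker.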
To use
- Click .
- For context sensitive help, click ? (top right program window) and then click the table in the main window.
Clipboard Encoding Converter
After starting up the Clipboard EncConverter, click its icon on the Windows Task Bar to convert text copied to the Windows clipboard.
To start it up
Clipboard Encoding Converter is a utility that you access from the Windows Task Bar. To use it, you first need to start it up.
- To start it only when you need it: Click .
- To start it every time you start Windows: Add a shortcut to Clipboard EncConverter to your .
To use
Clipboard Encoding Converter can be used in two ways:
- Clipboard mode: To convert text that you've copied using Copy or Ctrl + C (in most applications), which you can then paste using Paste or Ctrl + V .
- SpellFixer mode: To add spelling corrections to a SpellFixer project from applications other than just Word.
To convert text using the Clipboard Mode
- Copy some text by selecting it and pressing Ctrl + C
- Right-click the Clipboard EncConverter icon in the Windows task bar (System Tray) (see red arrow in Figure 6).
Result: the available conversions will be listed in the top section of the popup window. If is checked, a sample of your copied text will appear as it will be converted to the right of each conversion name.
- In the popup menu, select the conversion you want.
- Paste your converted text where you want it.
SpellFixer Mode
- Left-click the system tray icon, to turn on .
- Tip: This allows you to add spelling corrections from any arbitrary application (instead of just from within Word, while using the SpellFixer document template).
- Select the misspelled word in your application and copy it to the clipboard ( Ctrl + C ).
- Left-click the Clipboard EncConverter icon in the system tray, select the project to enable, and type the replacement spelling for the word on the clipboard.
Result: The word will be corrected on the clipboard and the correction you entered will be added to the SpellFixer database for the selected project.
- Paste the correct word into your application.
Tip: For more details, see the SpellFixer.dot Word document template.
- In Word open a New document using SpellFixer.dot Word document template.
XML Data Converter
Use this application to convert the data (attributes or elements) in an XML document using converters from the EncConverters’ repository, for example:
- to convert the data in an AdaptIt Knowledge Base from a legacy encoding to Unicode, or
- to convert the text in a Word document saved as XML. (This requires some knowledge of WordML and XPath syntax.)
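The idea of converting only selected attributes or elements can be sketched in Python with the standard library's ElementTree, which supports a limited subset of XPath. The element and attribute names below (`kb`, `entry`, `src`, `gloss`) are hypothetical and are not the real AdaptIt Knowledge Base schema; `str.upper` stands in for a real converter.

```python
import xml.etree.ElementTree as ET


def convert_xml(xml_text, path, attribute, convert):
    """Convert one attribute (or the element text if attribute is None)
    on every element matched by `path`."""
    root = ET.fromstring(xml_text)
    for elem in root.findall(path):
        if attribute is None:
            if elem.text:
                elem.text = convert(elem.text)
        elif attribute in elem.attrib:
            elem.set(attribute, convert(elem.get(attribute)))
    return ET.tostring(root, encoding="unicode")


doc = '<kb><entry src="abc"><gloss>xyz</gloss></entry></kb>'
converted = convert_xml(doc, ".//entry", "src", str.upper)       # an attribute
converted = convert_xml(converted, ".//gloss", None, str.upper)  # element text
print(converted)
```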
To use
- Click .
- For context sensitive help, click ? (top right of program window) and then click the different sections of the main window.
MS Word converters
You can use SILConverters directly in MS Word. The converters are macros contained in three Microsoft Word document templates (DOTs). These macros use the EncConverters repository to accomplish different tasks.
- Data Conversion Macro in Data Conversion Macro xxxx.dot
- SpellingFixer in SpellFixer.dot
- Consistency Spelling Checker in Consistent Spelling Checker xxxsc.dot
If you select the WordDOTs feature node, the SILConverters’ installer will put these templates into your Templates folder (normally C:\Documents and Settings\<user>\Application Data\Microsoft\Templates).
To use
- To access the document template clients from within Microsoft Word, click .
Result: The Templates and Add-Ins dialog box will be displayed.
- In the Templates and Add-Ins dialog box, click Add .
Result: Your Templates folder will be opened.
- Select the Word .dot file you want to use.
Note
If multiple users on the machine want to use the document template, you need to manually move the .DOT files to some common location, and each user will need to browse for them individually in . If you want one or more of these document templates to start up automatically when Word starts, move them to the current user’s Startup folder (i.e. C:\Documents and Settings\<user>\Application Data\Microsoft\Word\STARTUP) or, for all users, to the global startup folder (e.g. C:\Program Files\Microsoft Office\OFFICE11\STARTUP).
Data Conversion Macro
Use the Data Conversion Macro to convert text in any arbitrary Word document based on Font name, Style, or even the current selection using converters from the EncConverters’ repository. It also supports SFM documents. Open the document template for more instructions.
SpellingFixer
Use the SpellingFixer document template to correct misspelled words or make certain orthographic changes based on a user-defined database of bad-good spelling pairs. This is particularly useful when you want
- to condition spelling changes to be at either a beginning or ending word boundary, or
- to convert a portion of a word (as opposed to full word forms).
Once you have a database of such spelling fixes (or consistent changes), use one of the menu commands to go through all the words in the document to search for misspelled words. See the document template for instructions.
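The notion of bad-good spelling pairs conditioned on word boundaries can be sketched with Python regular expressions. The helper names `build_rule` and `apply_rules` and the sample pairs are assumptions for illustration; SpellingFixer stores its pairs in its own project database and CC table, not in this form.

```python
import re


def build_rule(bad, good, at_start=False, at_end=False):
    """Compile one bad -> good pair, optionally tied to a word boundary."""
    pattern = re.escape(bad)
    if at_start:
        pattern = r"\b" + pattern   # only at the beginning of a word
    if at_end:
        pattern = pattern + r"\b"   # only at the end of a word
    return re.compile(pattern), good


def apply_rules(text, rules):
    for pattern, good in rules:
        # A function replacement keeps `good` literal (no backslash escapes).
        text = pattern.sub(lambda match, g=good: g, text)
    return text


rules = [
    build_rule("teh", "the", at_start=True, at_end=True),  # full word form
    build_rule("ph", "f", at_start=True),                   # word-initial part
]
fixed = apply_rules("teh phone graph", rules)
print(fixed)
```

The second rule changes "phone" but leaves "graph" alone, because the "ph" in "graph" is not at the beginning of a word.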
Consistency Spelling Checker
Use the Consistency Spelling Checker document template for a simple way of working with data (in any language, and any script) in Microsoft Word documents, Plain Text files or any Toolbox database to:
- Check consistency of spelling (semi-automatically) based on linguistic principles
- Apply global spelling changes:
- to multiple documents which are currently open
- by generating a CC table of changes to be applied to one or more plain text databases (such as Toolbox files)
- Create a character inventory with frequency count
- Create unique wordlists from one or more Word documents as:
- a Word document table with frequency counts, or
- a Toolbox (MDF-formatted) database for starting a lexicon
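A character inventory and a unique wordlist with frequency counts are both simple tallies. The following Python sketch shows the idea on a short string; it splits words on white space only and is not the macro's actual implementation.

```python
from collections import Counter


def character_inventory(text):
    """Count every non-space character in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def word_frequencies(text):
    """Count each distinct word (split on white space, case-folded)."""
    return Counter(text.lower().split())


text = "baba na mama na baba"
chars = character_inventory(text)
words = word_frequencies(text)

for word, count in sorted(words.items()):
    print(word, count)
```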
Note
This tool is not a full-fledged spelling checking tool. It does not use language-specific dictionaries, and therefore knows nothing about the languages it checks. It is only a consistency checking tool based on phonological similarity, or sets of user-defined ambiguous characters.
Prerequisites
The Spelling Consistency Checker macro requires that you install this software:
SIL Converters’ Transduction Engines
Several of the transduction engines in Figure 1 are provided by the EncConverters’ repository object itself (i.e. the code page converter, the AdaptIt Knowledge Base Lookup, and the Compound and Primary-Fallback meta converters) and are always available. The rest depend on external programs (SIL and other Open Source programs) and installation is optional, depending on your need.
Most end users will not need to concern themselves with these details except to be sure that the necessary transduction engine is installed for the converters they want to use. Chances are that someone in your entity has already created a map file that you can use to convert the encoding of your data. In this situation, you need to be sure that you install the proper transduction engine required by the map or table that implements the conversion you want.
TECkit
Other applications use TECkit, a low-level toolkit, to perform encoding conversions (e.g., when importing legacy data into a Unicode-based application). The primary component of the TECkit package is a library that performs conversions. This is the “TECkit engine”. The engine relies on mapping tables in a specific binary format (see TECkit documentation). A compiler creates such tables from a human-readable mapping description (a simple text file).
In EncConverters, you can select either the compiled file (*.tec) or the uncompiled, human-readable file (*.map) to be the converter. If you choose the latter, EncConverters will automatically recompile an out-of-date .tec file when it is used to convert data.
See Adding converters: TECkit map below for details about adding TECkit maps to the system repository.
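One plausible reading of "out-of-date" is that the compiled table is missing or older than its source map. The Python sketch below checks this by comparing file modification times. This is an assumption about the behaviour, not a description of how EncConverters actually decides, and the function name `needs_recompile` is hypothetical.

```python
import os


def needs_recompile(map_path, tec_path):
    """Return True if the compiled table is missing or older than the map."""
    if not os.path.exists(tec_path):
        return True
    return os.path.getmtime(tec_path) < os.path.getmtime(map_path)
```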
Consistent Changes (CC)
Use Consistent Change tables to find all occurrences of specified characters, words, or phrases in a string of text, and then change them in a consistent way. The change may be done in every occurrence or only when certain conditions are met.
CC is like the find-and-replace feature in a text editor, but much more powerful. It allows you
- to make changes which take context into consideration, and
- to make a whole set of changes at once.
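The two points above, context-sensitive changes and a whole set of changes applied at once, can be illustrated in Python. This sketch uses Python regular expressions rather than CC table syntax, and the sample changes are invented for illustration. All changes are applied in a single pass, so the output of one change is never fed into another.

```python
import re

# Each change is (pattern, replacement); lookaheads express the context.
changes = [
    (r"c(?=[ei])", "s"),  # "c" becomes "s" only before "e" or "i"
    (r"c", "k"),          # any other "c" becomes "k"
    (r"ph", "f"),
]


def apply_changes(text, changes):
    # One capturing group per change; earlier changes take priority.
    combined = re.compile("|".join("(%s)" % pattern for pattern, _ in changes))

    def replace(match):
        # lastindex is the number of the group (i.e. the change) that matched.
        return changes[match.lastindex - 1][1]

    return combined.sub(replace, text)


changed = apply_changes("cecil photocopy", changes)
print(changed)
```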
SpellFixer is also available. This is a user-friendly graphical user interface for creating consistent change tables. This interface is primarily available via the SpellFixer.dot Microsoft Word document template mentioned above in SILConverters’ client applications.
See Adding converters: Consistent Changes (CC) below for details about adding CC tables to the system repository.
International Components for Unicode (ICU) 3.4
This feature comprises three distinct EncConverters-related features, as well as other features of ICU used by other client applications, all of which must be installed as a unit.
For SILConverters, three transduction engines are included in this feature:
- ICU Transliterators: provides a series of transliterators for various ranges of Unicode (e.g. Devanagari to Latin) as well as the ability to write custom rules for doing transliteration. See http://icu.sourceforge.net/userguide/Transform.html for more details on the use and syntax of ICU Transliterators.
- ICU Converters: provides comprehensive character set conversion services, mapping tables, and implementations for many encodings. Since ICU uses Unicode (UTF-16) internally, all converters convert between UTF-16 (with the endianness according to the current platform) and another encoding. The available converters include those for other Unicode encodings. These are typically of more interest to programmers than end-users. See http://icu.sourceforge.net/userguide/conversion.html for more details on ICU converters.
- Regular Expression: provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. See http://icu.sourceforge.net/userguide/regexp.html for more details on the syntax of ICU Regular Expressions.
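To make the converter idea concrete, the sketch below moves text from a legacy code page to Unicode and then to two Unicode encodings. It uses Python's built-in codecs as a stand-in, not ICU itself, and the code page (cp1252) and sample word are chosen only for illustration.

```python
legacy_bytes = b"caf\xe9"               # "café" encoded in code page 1252

text = legacy_bytes.decode("cp1252")    # legacy bytes -> Unicode string
utf16 = text.encode("utf-16-le")        # Unicode -> UTF-16, little-endian
utf8 = text.encode("utf-8")             # Unicode -> UTF-8

# Converting back recovers the original legacy bytes (a round trip).
round_trip = utf16.decode("utf-16-le").encode("cp1252")
print(text, utf16, utf8)
```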
Perl Expressions 5.8.7
The Perl Expressions 5.8.7 transduction engine allows you to write Perl expressions to do text processing in EncConverter client applications.
Note
This feature requires a separate Perl 5.8.7 distribution to be installed.
The Perl plug-in has been tested with the following freely available Perl distributions:
ActiveState Perl (http://www.activestate.com/solutions/perl/) or
PXPerl
See below for a known issue with the PXPerl distribution.
Also note that this plug-in will not work (yet) with the most recent v5.8.8 distribution.
Python Script Functions 2.4
The Python Script Functions 2.4 transduction engine allows you to do text processing using Python functions in EncConverter client applications.
Note
This feature requires a separate Python 2.4 distribution to be installed.
The Python plug-in has been tested with the following freely available Python distributions:
ActiveState Python or
Python.org.
Note
This plug-in will not work (yet) with the most recent v2.5 distribution.
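A text-processing function of the kind this engine might call could look like the sketch below. It assumes, without confirmation from this documentation, that the engine passes the input text to a named function and uses the return value as the converted text. The function name `strip_diacritics` is hypothetical. The sketch is written for current Python 3 and would need adjusting for the Python 2.4 distributions listed above.

```python
import unicodedata


def strip_diacritics(text):
    """Return the text with combining marks removed."""
    # Decompose first so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("caf\u00e9"))
```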
SIL Converters’ Maps and Tables
Most end-users are interested only in a small number of encodings. Typically, computer support people have created TECkit maps and/or CC tables for the various encodings used in each entity, relieving most end-users of the need to create their own maps and tables.
Because there are hundreds of possible encoding converters and transliterators that different end-users may be interested in, they are packaged into logically-related groups of converters and are available via a two-step process.
Steps
- Use the SILConverters’ installer to install the package(s) of converters likely to be useful to you (e.g. based on your entity).
Result: During installation, all the converter maps/tables in the selected package(s) will be installed into a fixed location on your computer (i.e. C:\Documents and Settings\All Users\Application Data\SIL\SILConverters22\MapsTables).
- Use the Converter Installer application to install the few converters you want into the EncConverters’ repository.
Result: They become available to EncConverters’ client applications.
Note
Installing maps and tables onto your computer with the SILConverters’ installer (step 1 above) will not make them available to EncConverters’ client applications unless you explicitly add them to the EncConverters’ repository using the Converter Installer or some other mechanism (see Adding converters).
The following sections give the details about fonts and encodings for different maps and tables:
Basic Converters
Converters and Transliterators common to all SIL. This includes the following:
Converter | Encoding name or description | Font(s)
SIL IPA93<>UNICODE | SIL-IPA93-2001 | SILDoulos IPA93, SILManuscript IPA93, SILSophia IPA93
SIL-IPA-1990<>UNICODE | SIL-IPA-1990 | SILDoulosIPA, SILManuscriptIPA, SILSophiaIPA
SIL Galatia<>UNICODE | SIL-GREEK_GALATIA-2001 | SIL Galatia
ISO-8859<>UNICODE | ISO-8859-1 |
AMER PHON>UNICODE | (SIL)-Amer_Phon_SILDoulosL3-(2005) |
SIL PUA 3.2<>UNICODE 4.1 | |
SIL PUA 3.2<>UNICODE 5.0 | |
Symbol<>cp1252 | |
UTF8<>UTF16 | |
ReverseString | For reversing the bytes of a “narrow” (bytes) string |
null | No change to string, but can be used to apply a different font to some text (e.g. in the Data Conversion Macro) |
NFC | Convert to normal form composed |
NFD | Convert to normal form decomposed |
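The difference between the composed (NFC) and decomposed (NFD) normal forms can be seen with a single accented letter. The sketch below uses Python's standard `unicodedata` module rather than the converters themselves.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# The two forms look the same on screen but differ in length.
print(len(nfc), len(nfd))
```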
ICU Transliterators
Configuration information for the following ICU transliterators is for Unicode encodings only.
Note
These are not the only transliterators available via the ICU Transliterator transduction engine, but are only a few of the pre-defined latinizing (or romanizing) transliterators that can be useful in different client applications for different ranges of Unicode.
- Devanagari to Latin (aka. Devanagari-Latin)
- Bengali to Latin (aka. Bengali-Latin)
- Gujarati to Latin (aka. Gujarati-Latin)
- Gurmukhi to Latin (aka. Gurmukhi-Latin)
- Kannada to Latin (aka. Kannada-Latin)
- Malayalam to Latin (aka. Malayalam-Latin)
- Oriya to Latin (aka. Oriya-Latin)
- Tamil to Latin (aka. Tamil-Latin)
- Telugu to Latin (aka. Telugu-Latin)
- Arabic to Latin (aka. Arabic-Latin)
- Cyrillic to Latin (aka. Cyrillic-Latin)
- Greek to Latin (aka. Greek-Latin)
- Han to Latin (aka. Han-Latin)
- Hangul to Latin (aka. Hangul-Latin)
- Hebrew to Latin (aka. Hebrew-Latin)
- Hiragana to Latin (aka. Hiragana-Latin)
- Katakana to Latin (aka. Katakana-Latin)
- Jamo to Latin (aka. Jamo-Latin)
- NumericPinyin to Latin (aka. NumericPinyin-Latin)
- Any to Latin (aka. Any-Latin)
Note
These transliterators can be daisy-chained together to transliterate between non-Latin scripts using a Compound meta-converter. For example, chaining the ‘Devanagari-Latin’ transliterator (in the Forward direction) with the ‘Arabic-Latin’ transliterator (in the Reverse direction) gives a ‘Devanagari-Arabic’ transliterator.
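The daisy-chaining idea can be sketched with toy transliterators in Python. The three-letter tables below are invented for illustration and are far simpler than the real ICU transliterators; `transliterator`, `reverse` and `compound` are hypothetical helper names, not part of EncConverters.

```python
def transliterator(table):
    """Build a converter that maps each character through `table`."""
    def convert(text):
        return "".join(table.get(ch, ch) for ch in text)
    return convert


def reverse(table):
    """Swap source and target to get the reverse direction."""
    return {target: source for source, target in table.items()}


def compound(*steps):
    """Chain converters so each one feeds the next."""
    def convert(text):
        for step in steps:
            text = step(text)
        return text
    return convert


greek_latin = {"\u03b1": "a", "\u03b2": "b", "\u03b3": "g"}     # alpha, beta, gamma
cyrillic_latin = {"\u0430": "a", "\u0431": "b", "\u0433": "g"}  # a, be, ghe

# Greek-Latin (forward) followed by Cyrillic-Latin (reverse).
greek_to_cyrillic = compound(
    transliterator(greek_latin),
    transliterator(reverse(cyrillic_latin)),
)
print(greek_to_cyrillic("\u03b2\u03b1\u03b3"))
```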
FindPhone to IPA converters
Adds the following converters for dealing with FindPhone encoded data:
- FindPhone>SAG IPA93
- FindPhone>UNICODE
SAG Indic
Contains encoding converter map(s) for the following encoding/font:
Converter | Encoding | Font
Annapurna<>UNICODE | SIL-ANNAPURNA_05-2002 | Annapurna
- Includes an IPA transliterator for Unicode Devanagari to Unicode IPA (phonetic)
Cameroon
Contains encoding converter map(s) for the following encoding/fonts:
Converter | Encoding | Fonts
Cameroon<>UNICODE | Cameroon | Cam Cam SILDoulosL, Cam Cam SILSophiaL, Cam Cam SILManuscriptL, Cam2 Cam2 SILDoulos, Cam2 Cam2 SILSophia, Cam2 Cam2 SILManuscript, Cam Paratext SILDoulos, Cam Paratext SILSophia, Cam Paratext SILManuscript
West Africa
Contains encoding converter map(s) for the following encoding/fonts:
Converter | Encoding
SIL-93linb-2005<>UNICODE | SIL-93linb-2005
UBS-Abidjan-2005<>UNICODE | UBS-Abidjan-2005
Bambara SIL Charis<>UNICODE | Bambara SIL Charis
SIL-BF Font Family-2005<>UNICODE | SIL-BF_Font_Family-2005
SIL-BF_Times-2006<>UNICODE | SIL-BF_Times-2006
X-SIL-Fulfulde<>UNICODE | X-SIL-Fulfulde
SIL-Ghana Doulos-2005<>UNICODE | SIL-Ghana_Doulos-2005
SIL-Mali Standard Font Family<>UNICODE | Mali Standard SILDoulos-2005
RCI Standard Doulos/Sophia/Manuscript<>UNICODE | SIL-RCI Standard-1994
X-SIL-Senufo<>UNICODE | X-SIL-Senufo
SIL-Karaboro-2006<>UNICODE | SIL-Karaboro-2006
SIL Samogho Doulos/Sophia/Manuscript<>UNICODE | SIL-Samogho-2006
SIL-Songhai-2006<>UNICODE | SIL-Songhai-2006
Tombouctou-Dutch<>UNICODE | SIL-Tombouctou-Dutch-2006
Burkina Faso Winye-2003<>UNICODE | SIL-Burkina_Winye_Unknown_Font-2005
Eastern Congo Group
Contains encoding converter map(s) for the following encoding/fonts:
Converter | Encoding
Times African<>UNICODE | Times African
NdrunaASCII<>UNICODE | NdrunaASCII
Mayogo<>UNICODE | Mayogo
Komo<>UNICODE | Komo
KomoASCII to Unicode | KomoASCII
ECG<>UNICODE | ECG-Unicode(Jan.2005)
BuduASCII<>UNICODE | BuduASCII
BUDU<>UNICODE | BUDU
BheleASCII<>UNICODE | BheleASCII
Bantu Und<>UNICODE | Bantu Und
NLCI (India)
Contains encoding converter map(s) for the following encoding/font:
Converter | Encoding | Font
SL Oriya<>UNICODE | NLCI-SLOriya |
Winscript/iLeap Devanagari<>UNICODE | CDAC-ISFOC_DEVANAGARI | DEV Panini
Winscript/iLeap Gujarati<>UNICODE | CDAC-ISFOC_GUJARATI | GUJ Gir
Winscript Malayalam<>UNICODE | NLCI-Malayalam | MAL Vayalar
Winscript Oriya<>UNICODE | NLCI-Oriya | ORI Asika
Winscript Tamil<>UNICODE | NLCI-Tamil | TAM Thiruvalluvar
Winscript Telugu<>UNICODE | NLCI-Telugu | TEL Nirmal
Additional TECkit applications
TECkit Map Unicode Editor
The TECkit map Unicode Editor is one more EncConverters’ client application mentioned in Figure 1. Use this program to develop TECkit maps for encoding conversion or other text processing applications (e.g. Transliteration).
Steps
- Install this application by selecting the feature under the Additional TECkit applications feature node.
- Start the program by clicking .
- Type (or paste) text into the Sample boxes in the lower left window.
Result: The code point values and/or names of the characters in the string will be displayed in the table in the upper right.
- Click the cells in that table to insert those values into the map (in the upper left).
- To save your map to the default system repository, click , navigate to C:\Program Files\Common Files\SIL\MapsTables, and click Save .
Result: this will bring up the TECkit map transduction engine dialog box.
Tips
- If you are creating a map involving a Legacy encoding, the font glyphs for that font will be shown in the table in the lower right. Click to insert their values into the map or Ctrl + Click to insert them into the Sample boxes.
- The map is automatically compiled as you make changes and you can click errors in the compiler results window (Figure 13, extreme top left) to jump to problem statements.
- The program will automatically convert the data in the Left-side Sample box (in the forward direction) in order to check conversion as you work on it. It will also convert the Right-side Sample (in the reverse direction) in order to check the round-trip capability of the map.
- Context-sensitive help is available—select a portion of the window and press F1 or click ? (right-hand window).
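The code point values and character names that the editor displays for sample text can be reproduced with Python's standard `unicodedata` module. The sketch below only illustrates what such a table contains and is unrelated to the editor's own code; the function name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """List (code point, character name) for each character in the text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


rows = describe("a\u00e9")
for code_point, name in rows:
    print(code_point, name)
```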
Final dialog windows
- When you have finished making selections, click Next again to begin installing and/or uninstalling.
- When you see the SIL Encoding Converters 2.5 has been successfully installed message, click Finish .
- Add any new converters to the Encoding Converters repository:
- If you run the SIL Encoding Converters 2.5 Setup .msi file directly, the program will close and you will need to add any new converters to the Encoding Converters system repository yourself. See Adding Converters to the System Repository.
- If you run SIL Encoding Converters 2.5 Setup from the Master Installer, the Master Installer sequence will continue. If this is not an initial installation, you may cancel the SIL Converters for Office 2003 Setup (see the installation documentation) and proceed to the Converter Installer to add any new converters to the Encoding Converters repository.
Adding Converters to the System Repository
There are two primary ways of adding converters to the System Repository, by using either the
- Converter Installer or
- Transduction Engine configuration dialogs.
Converter Installer
If the converter you want to install into the system repository comes as part of the Maps and Tables features in the SILConverters installer (e.g. the SIL IPA93<>UNICODE converter that comes as part of the Basic Converters package), you can install it into the system repository by running the Converter Installer application.
How to get there
- When running the Master Installer, this utility automatically launches as the last item in the installer sequence.
- From the Windows taskbar, click .
- From the Clipboard EncConverter popup by selecting .
Installing converters
- To install one or more of these converters into the system repository, check the box next to the converters you want and click Commit .
- To remove converters from the system repository, clear the check box and click Commit .
For detailed instructions see the Converter Installer section in the installation documentation
Choose a Transduction Engine dialog box
If you have your own converter map (e.g. created with the TECkit map Unicode Editor) or one that was given to you rather than installed as part of an installer feature, you can add it to the system repository via the Choose a Transduction Engine dialog box.
- Select the item from the list that matches the type of converter you want to add and click Create .
How to get there
- If you are running the Clipboard EncConverter application, right-click the icon in the system tray and choose the item.
- If you are using AdaptIt, Data Conversion Macro, Bulk SFM Converter, XML Data Converter, or SpellFixer, open the dialog window (Figure 16) and click Add New .
Transduction Engine Details
TECkit map
- To add a TECkit map to the system repository, select TECkit map from the Transduction Engine dialog box, and click Create . Result: The TECkit map Setup dialog will be displayed.
- Browse with the ... button for the TECkit .map or .tec file.
- To permanently add the converter to the System Repository, click Save in System Repository .
- Click to test the converter with some sample data.
Consistent Changes (CC)
Result: The CC table Setup dialog will be displayed:
- Browse with the ... button for the CC table file.
- For and , select the desired encoding and click Apply .
Tip:
If it expects Unicode-encoded data, select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.
- If you installed the SpellFixer plug-in, click .
Result: This will allow you to create a new SpellFixer project or edit an existing one.
Tip:
Though SpellFixer is primarily a Microsoft Word-based tool, you can use the SpellFixer application to create a CC table. Use the SpellFixer graphical user interface to configure Bad Spelling and Good Spelling couplets, which are then put into a CC table. The Microsoft Word document template also has macros for processing the text in a file word by word, so you can use it in a Find First/Next fashion to correct spelling errors. The SpellFixer.dot file has further usage information.
- To permanently add the converter to the System Repository, click Save in System Repository .
Note
You do not need to add a SpellFixer project to the System Repository, since it will be added automatically by the Project editor.
- Click to test the converter with some sample data.
ICU Transliterator
Result: The ICU Transliterator Setup dialog will be displayed:
- To use one of the built-in transliterators, choose .
- To write a custom transliterator using the syntax described on the webpage referred to above, choose and enter the transliterator syntax in the box. Result: will be enabled, showing examples of useful custom rules and others that you wrote (Figure 19, number 3).
- Click Delete to remove unwanted rules.
- To permanently add the converter to the System Repository, click Save in System Repository .
- Click to test the converter with some sample data.
ICU Converters
The ICU Converter Setup dialog will be displayed:
- Choose the desired converter from the drop-down combo box as shown above.
- If you want the converter to be permanently added to the System Repository, then you must click Save in System Repository .
- You can click the tab to test the converter with some sample data.
Regular Expression Find and Replace (ICU)
ICU's Regular Expressions package provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. See
http://icu.sourceforge.net/userguide/regexp.html for more details on the syntax of ICU Regular Expressions.
- If you want to add an ICU Regular Expression Find and Replace converter to the system repository, choose from the Transduction Engine dialog box and click Create .
The Regular Expression Find and Replace (ICU) Setup dialog will be displayed:
- In the box, enter the Regular Expression search string you want to use. Tip: The search string can contain ICU Regular Expression Meta Characters and Operators defined below.
- Click the right wedge.
Result: You will see a pop-up list of commonly used Regular Expression search operators. If selected, they will be inserted into Search for.
- In the box, enter the string or operator that represents the text to replace the Search for string (see Replacement Text defined below).
- Check the box to have the ICU search algorithm ignore the case of the input text.
- The combo box includes a few example Regular Expressions and remembers any new ones you add.
Click Delete to remove the selected search item.
- If you want the converter to be permanently added to the System Repository, then you must click Save in System Repository .
- Click the tab to test the converter with some sample data.
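The search string, replacement text and ignore-case option described above can be sketched with Python's `re` module. Python's regular expressions are a stand-in here: ICU's syntax is similar but not identical, and the function name `find_and_replace` is hypothetical.

```python
import re


def find_and_replace(text, search_for, replace_with, ignore_case=False):
    """Replace every match of `search_for` in `text` with `replace_with`."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(search_for, replace_with, text, flags=flags)


sample = "Colour and COLOUR"
insensitive = find_and_replace(sample, "colour", "color", ignore_case=True)
sensitive = find_and_replace(sample, "colour", "color")
print(insensitive)
print(sensitive)
```

With the ignore-case option off, neither "Colour" nor "COLOUR" matches the lower-case search string, so the text is returned unchanged.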
Regular Expression Metacharacters
\a |
Match a BELL, \u0007. |
\A |
Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. |
\b, outside of a [Set] |
Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. For better word boundaries, see ICU Boundary Analysis. |
\b, within a [Set] |
Match a BACKSPACE, \u0008. |
\B |
Match if the current position is not a word boundary. |
\cX |
Match a control-X character. |
\d |
Match any character with the Unicode General Category of Nd (Number, Decimal Digit). |
\D |
Match any character that is not a decimal digit. |
\e |
Match an ESCAPE, \u001B. |
\E |
Terminates a \Q ... \E quoted sequence. |
\f |
Match a FORM FEED, \u000C. |
\G |
Match if the current position is at the end of the previous match. |
\n |
Match a LINE FEED, \u000A. |
\N{UNICODE CHARACTER NAME} |
Match the named character. |
\p{UNICODE PROPERTY NAME} |
Match any character with the specified Unicode Property. |
\P{UNICODE PROPERTY NAME} |
Match any character not having the specified Unicode Property. |
\Q |
Quotes all following characters until \E. |
\r |
Match a CARRIAGE RETURN, \u000D. |
\s |
Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. |
\S |
Match a non-white space character. |
\t |
Match a HORIZONTAL TABULATION, \u0009. |
\uhhhh |
Match the character with the hex value hhhh. |
\Uhhhhhhhh |
Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. |
\w |
Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. |
\W |
Match a non-word character. |
\x{hhhh} |
Match the character with hex value hhhh. From one to six hex digits may be supplied. |
\xhh |
Match the character with two digit hex value hh. |
\X |
Match a Grapheme Cluster. |
\Z |
Match if the current position is at the end of input, but before the final line terminator, if one exists. |
\z |
Match if the current position is at the end of input. |
\n, where n is a number |
Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions. |
[pattern] |
Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern |
. |
Match any character. |
^ |
Match at the beginning of a line. |
$ |
Match at the end of a line. |
\ |
Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . / |
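A few of the metacharacters above can be tried out quickly in a scripting language. The sketch below uses Python's built-in re module purely as a stand-in; it is not the ICU engine that the converter uses, and the two differ in places, but \b, \d, \A and ^ behave comparably for this sample. The sample text is invented for illustration.

```python
import re

# Invented sample text: two lines, with ASCII and Devanagari digits.
text = "cat concat\ncat 42 ४२"

# \b is a word boundary, so the "cat" inside "concat" is not matched.
whole_words = re.findall(r"\bcat\b", text)

# \d matches decimal digits (category Nd), including non-ASCII digits.
digit_runs = re.findall(r"\d+", text)

# With the multiline flag, ^ matches at the start of every line,
# whereas \A matches only at the very start of the input.
line_starts = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
input_starts = [m.start() for m in re.finditer(r"\Acat", text, re.MULTILINE)]

print(whole_words, digit_runs, line_starts, input_starts)
```

The Test tab of the setup dialog remains the authoritative way to check how a pattern behaves in the converter itself.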
Regular Expression Operators
| |
Alternation. A|B matches either A or B. |
* |
Match 0 or more times. Match as many times as possible. |
+ |
Match 1 or more times. Match as many times as possible. |
? |
Match zero or one times. Prefer one. |
{n} |
Match exactly n times |
{n,} |
Match at least n times. Match as many times as possible. |
{n,m} |
Match between n and m times. Match as many times as possible, but not more than m. |
*? |
Match 0 or more times. Match as few times as possible. |
+? |
Match 1 or more times. Match as few times as possible. |
?? |
Match zero or one times. Prefer zero. |
{n}? |
Match exactly n times |
{n,}? |
Match at least n times, but no more than required for an overall pattern match |
{n,m}? |
Match between n and m times. Match as few times as possible, but not less than n. |
*+ |
Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match) |
++ |
Match 1 or more times. Possessive match. |
?+ |
Match zero or one times. Possessive match. |
{n}+ |
Match exactly n times |
{n,}+ |
Match at least n times. Possessive Match. |
{n,m}+ |
Match between n and m times. Possessive Match. |
( ... ) |
Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
(?: ... ) |
Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
(?> ... ) |
Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>" |
(?# ... ) |
Free-format comment (?# comment ). |
(?= ... ) |
Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?! ... ) |
Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<= ... ) |
Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) |
(?<! ... ) |
Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) |
(?ismx-ismx: ... ) |
Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. |
(?ismx-ismx) |
Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. |
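The difference between the "as many as possible" and "as few as possible" operators, and the way the look-around assertions leave the input position alone, may be easier to see with sample data. The following is a minimal sketch using Python's re module as a stand-in for ICU (an assumption made for illustration; possessive operators are left out because older Python versions do not support them).

```python
import re

tagged = "<b>one</b> and <b>two</b>"

# * matches as many times as possible: one match spanning both tags.
greedy = re.findall(r"<b>.*</b>", tagged)

# *? matches as few times as possible: one match per tag pair.
reluctant = re.findall(r"<b>.*?</b>", tagged)

# (?= ... ) look-ahead: digits followed by "kg"; "kg" itself is not consumed.
ahead = re.findall(r"\d+(?=kg)", "12kg 7lb 30kg")

# (?<= ... ) look-behind: digits immediately preceded by a literal "$".
behind = re.findall(r"(?<=\$)\d+", "$15 and 20 and $7")

# (?i) flag setting: the rest of the pattern ignores case.
insensitive = re.findall(r"(?i)sil", "SIL, Sil and sil")

print(greedy, reluctant, ahead, behind, insensitive)
```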
Replacement Text
The replacement text for find-and-replace operations may contain references to capture-group text from the find. References are of the form $n, where n is the number of the capture group.
$n |
The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $. |
\ |
Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\', but may be used on any other character without bad effects. |
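To illustrate how $n references and backslash escaping in the replacement text work together, here is a small sketch that mimics those two rules on top of Python's re module. The function name apply_icu_style_replacement is hypothetical, the sketch handles single-digit group numbers only, and it is not how the converter itself is implemented.

```python
import re

def apply_icu_style_replacement(pattern, replacement, text):
    """Find-and-replace where the replacement text uses $n and backslash escapes."""
    def expand(match):
        out = []
        i = 0
        while i < len(replacement):
            ch = replacement[i]
            nxt = replacement[i + 1] if i + 1 < len(replacement) else ""
            if ch == "\\" and nxt:
                # Backslash: the following character is taken literally.
                out.append(nxt)
                i += 2
            elif ch == "$" and nxt and nxt in "0123456789":
                # $n: substitute the text of capture group n (one digit here).
                out.append(match.group(int(nxt)) or "")
                i += 2
            else:
                # Anything else, including a $ not followed by a digit, is literal.
                out.append(ch)
                i += 1
        return "".join(out)
    return re.sub(pattern, expand, text)

# Swap two captured words.
swapped = apply_icu_style_replacement(r"(\w+), (\w+)", "$2 $1", "Smith, Anna")

# An escaped \$ gives a literal dollar sign, followed by capture group 1.
priced = apply_icu_style_replacement(r"(\d+) dollars", r"\$$1", "15 dollars")

print(swapped)
print(priced)
```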
Perl Expression
Note
This feature requires a separate Perl 5.8.7 distribution to be installed.
The Perl plug-in has been tested with the following freely available Perl distributions:
ActiveState Perl (http://www.activestate.com/solutions/perl/) or
PXPerl
See "Unable to add converters with PXPerl installed" in the installation document for a known issue with the PXPerl distribution.
The Perl Expression Setup dialog will be displayed:
- This area is for entering your Perl expression.
- For Perl expression expects and Perl expression returns, select the desired encoding type and click Apply .
Tip: If your data is Unicode-encoded, select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.
- The combo box includes a few example Perl expressions and remembers any new ones you add.
Click Delete to remove unwanted expressions.
- Click Distro Config to set up the path to your Perl Distribution's library folders (e.g. C:\PXPerl, C:\PXPerl\lib, C:\PXPerl\site\lib) and to specify certain Perl modules to be automatically loaded for all expressions (e.g. Win32, or SIL:RTF:Unicode).
Tip: If you get an error message saying that some *.pm file can't be found, then you probably don't have the correct path to the file configured.
- To permanently add the expression to the System Repository, click Save in System Repository .
- Click the tab to test the converter with some sample data.
Python Script Function
Note
This feature requires a separate Python 2.4 distribution to be installed.
The Python plug-in has been tested with the following freely available Python distributions:
ActiveState Python or
Python.org.
If you want to add a Python script function converter to the system repository, choose it from the Transduction Engine dialog box and click Create.
The Python Script Setup dialog will be displayed:
- Browse with the ... button for the Python script file.
- If the file contains valid Python functions, then the function names will be put into the combo box. Choose the function desired.
- If you want to pass some static information to your Python function, then define the function as shown below and put the static information in the field. For example, with a Python function prototype of:
    def ChangeLanguage(sLang, uI):
        # sLang is the static parameter; uI is the input data
        if not isinstance(uI, unicode):
            raise UnicodeError(u'Input Data not Unicode! (%s)' % uI)
        uO = uI  # default: return the input unchanged
        if sLang == u'Chinese':
            # do some Chinese processing and put result in uO
            uO = ProcessChinese(uI)
        return uO
The field would be enabled and you could enter the fixed string, Chinese, in order to trigger the script properly.
Note
If you have more than one additional parameter, the static strings should be separated by a semicolon (i.e. ";").
- 4. This area is for indicating how your expression expects and returns data. If your data is Unicode-encoded, be sure to select that option or your data may be incorrectly converted. For Non-Unicode (byte) data, the default system code page will be used to convert your data when necessary.
- After configuring these four items, click Apply to accept the Python script function.
The Setup tab also has the following options:
- 5. As you iterate through the functions listed in the combo box, the Function Prototype window shows what the prototype of the selected Python function looks like.
Note
If a particular function allows additional (static) parameters, then the proper order of parameters will also be shown in this window.
- 6. If you want the Python function to be permanently added to the System Repository, then you must click Save in System Repository .
- 7. Click the tab to test the converter with some sample data.
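As a rough illustration of the semicolon-separated static parameters described above, the sketch below defines a function with two static parameters and simulates how the static field might be split and passed ahead of the input data. Both TagLanguage and call_with_static_field are hypothetical names, and the parameter order is an assumption based on the ChangeLanguage example; the Function Prototype window shows the actual order for your function. The sketch is written so that it also runs on current Python, where str takes the place of unicode.

```python
def TagLanguage(sLang, sVariant, uI):
    """Hypothetical converter function: two static parameters, then the input."""
    if not isinstance(uI, str):  # would be `unicode` on the Python 2.4 the plug-in targets
        raise TypeError("Input data is not Unicode text")
    return "[%s-%s] %s" % (sLang, sVariant, uI)

def call_with_static_field(func, static_field, data):
    """Stand-in for the plug-in: split the static field on ';' and pass the parts first."""
    static_args = static_field.split(";") if static_field else []
    return func(*(static_args + [data]))

# The static field holds two values separated by a semicolon.
result = call_with_static_field(TagLanguage, "Chinese;Simplified", "你好")
print(result)
```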
AdaptIt Knowledge Base Lookup Converter
The AdaptIt Knowledge Base Lookup transduction engine is contained in the EncConverters assembly itself and therefore is always available without requiring an installer selection.
This transducer allows you to do lookups on words in either the adaptation or glossing Knowledge Base of an AdaptIt project.
To add an AdaptIt Knowledge Base Lookup converter to the system repository:
The AdaptIt Knowledge Base Converter Setup dialog will be displayed:
- 1. Choose which version of AdaptIt you use to create the (XML) Knowledge Base.
- 2. Choose the project desired in the list box. This is automatically populated from the projects available on the local machine for the current user.
Note
For an AdaptIt Transliteration Project, the transliteration data will be in the normal project Knowledge Base file; not the Glossing Knowledge Base. However, it is possible to access a Glossing Knowledge Base if desired.
Note that if you access an Adaptation Knowledge Base (i.e. from an AdaptIt project used to adapt texts from one language to another—which most likely will contain ambiguities), then the converter will return a string containing all the ambiguities for the given lookup word in the Ample ambiguity format (i.e. %count%form1%form2%...%).
For example, if your AdaptIt Knowledge Base has an ambiguity for the word /से/ 'from, with' in the Source language, which sometimes means /ते/ 'from' and sometimes /कन्ने/ 'with' in the target language, then if you attempt to process the word /से/ with this converter, it will return the string /%2%ते%कन्ने%/.
If you have such values in a document readable by Microsoft Word, then you can use the Word Pick Document Template to simplify disambiguating these tokens.
- 3. If you want the converter to be permanently added to the System Repository, then you must click Save in System Repository .
- 4. Click the tab to test the converter with some sample data.
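If you post-process the converter's output yourself, the Ample ambiguity format (%count%form1%form2%...%) is simple to take apart. The following is a minimal sketch; the function name parse_ample_ambiguity is hypothetical, and treating any value that is not in this format as a single unambiguous form is an assumption made for illustration.

```python
def parse_ample_ambiguity(value):
    """Return the list of candidate forms in a lookup result.

    A value in the %count%form1%form2%...% format yields its forms;
    any other value is treated as a single, unambiguous form.
    """
    if not (len(value) > 2 and value.startswith("%") and value.endswith("%")):
        return [value]
    parts = value[1:-1].split("%")
    if not parts[0].isdecimal():
        return [value]
    count, forms = int(parts[0]), parts[1:]
    if count != len(forms):
        raise ValueError("declared count %d does not match %d forms" % (count, len(forms)))
    return forms

# The example from the text: two target-language candidates for one source word.
candidates = parse_ample_ambiguity("%2%ते%कन्ने%")
print(candidates)
```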
Compound “daisy-chained” converters
Two final converter types to be discussed are actually meta converter types; that is, they allow you to combine two or more existing converters in the system repository in a serial or parallel fashion.
The Compound Converter type can be used to combine 2 or more converters together in a serial fashion so that the output of one step will become the input to the next step automatically. This can be helpful when you have multiple, different conversions to apply to your data to get it in the ultimate form you need without requiring separate conversions.
For example, you may have one converter that goes from FindPhone IPA to SIL-IPA93, and another converter that converts from SIL-IPA93 to Unicode IPA. In order to perform the end-to-end conversion from FindPhone IPA to Unicode IPA, you can create a daisy-chain of the two existing converters (a “virtual converter”) so that the data is converted in one step.
Note
When creating or using a compound converter, all n+1 converters must be in the system repository (i.e. the n steps plus the compound converter itself). If you create a compound converter and then subsequently delete one of the steps, it will not work.
To add a Compound converter to the system repository:
The Compound (daisy-chained) Converter Setup dialog will be displayed:
- 1. Use this combo box to choose the converter to become one of the steps of the compound conversion.
- 2. If the selected converter is bi-directional, then the box will be enabled allowing you to select the reverse direction, if needed.
For example, to go from Devanagari to Arabic, note that the above configured converter goes forward (by default) from Devanagari to Latin. This is followed by a second step that goes in the reverse direction from Latin to Arabic.
- 3. If you need to explicitly normalize the output data of any step (i.e. to Fully Composed or Fully Decomposed), you can check the box and the compound converter will do this before continuing with the next step.
- 4. Click Add Step to add it to the queue of steps in the compound converter.
- 5. This area shows what steps are configured, the direction of conversion, and whether any normalization is requested.
- 6. If you make a mistake in the steps, click Remove Steps to clear out the compound converter and start over.
Note
Compound converters may not be temporary converters.
- Once you click Apply , the dialog box will appear:
Note
The converter friendly name you enter here is for the Compound converter itself, which is distinct from the names of the converter steps. For Compound converters, the default name will be a concatenation of the individual steps’ names. However, you can change it to something more meaningful if desired (e.g. “Devanagari to Arabic”).
- 7. Once you complete the previous step, the converter name will be displayed in the box.
- 8. Click the tab to test the converter with some sample data.
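The daisy-chain idea, including the optional normalization after a step, can be sketched in a few lines. This is only an illustration of the concept, not the EncConverters implementation: make_compound and the two step functions are hypothetical, and mapping "Fully Composed" to NFC and "Fully Decomposed" to NFD is an assumption.

```python
import unicodedata

def make_compound(steps):
    """Chain converters so the output of each step feeds the next.

    steps is a list of (convert, form) pairs, where form is None, "NFC"
    (assumed here to mean Fully Composed) or "NFD" (Fully Decomposed).
    """
    def convert(text):
        for step, form in steps:
            text = step(text)
            if form is not None:
                text = unicodedata.normalize(form, text)
        return text
    return convert

# Hypothetical steps: the first writes an apostrophe as a combining acute
# accent, the second upper-cases the text.
def apostrophe_to_acute(text):
    return text.replace("'", "\u0301")

def to_upper(text):
    return text.upper()

composed = make_compound([(apostrophe_to_acute, "NFC"), (to_upper, None)])
unnormalized = make_compound([(apostrophe_to_acute, None), (to_upper, None)])

print(composed("cafe'"))      # precomposed letter with acute
print(unnormalized("cafe'"))  # base letter followed by a combining accent
```

Whether the intermediate normalization matters depends on what the next step expects as input; a step written for precomposed characters may miss decomposed sequences, and vice versa.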
Primary-Fallback Compound Converter
The Primary-Fallback Compound Converter type allows you to specify two existing converters: one to be a primary, and the other, a fallback converter.
The configured primary converter is first called to do a conversion. If the primary converter doesn’t change the data, then, and only then, the fallback converter is called.
This can be useful for transliteration where a character-based transliterator (e.g. TECkit, ICU, or CC) does most of the work, but certain words (or character sequences) are otherwise unpredictable from the context. In this case, you might want a lexicon-based approach to supply the special case forms.
In this scenario, you would configure the lexicon-based transliterator (e.g. a SpellFixer CC table or an AdaptIt Knowledge Base Lookup converter) to be the primary converter and the character-based transliterator as the fallback converter. If the text isn’t modified by the primary converter (i.e. if it isn’t an exception), then the fallback converter is called to do the conversion.
To add a Primary-Fallback converter to the system repository:
The Primary-Fallback Converter Setup dialog will be displayed:
- 1. Use this combo box to choose the existing EncConverter which is to be the Primary converter for this conversion (e.g. the lexical lookup converter).
- 2. If the selected Primary converter is bi-directional, then the Reverse box will be enabled allowing you to select the reverse direction, if needed.
- 3. Use this combo box to choose the existing EncConverter which is to be the Fallback converter for this conversion (e.g. the character-based transliterator).
- 4. If the selected Fallback converter is bi-directional, then the Reverse box will be enabled allowing you to select the reverse direction, if needed.
Note
Primary-Fallback converters may not be temporary converters.
- Once you click Apply , the dialog box will appear:
Note
The converter friendly name you enter here is for the combined Primary-Fallback converter itself, which is distinct from the names of the Primary and Fallback converters.
- 5. Once you complete the previous step, the converter name will be displayed in the box.
- 6. Click the tab to test the converter with some sample data.
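The primary-then-fallback rule described above (call the fallback only when the primary leaves the data unchanged) can be sketched as follows. The names make_primary_fallback, lexicon_lookup and character_rules, and the tiny exception lexicon, are hypothetical stand-ins for real converters.

```python
def make_primary_fallback(primary, fallback):
    """Return a converter that uses fallback only if primary changes nothing."""
    def convert(text):
        result = primary(text)
        if result == text:
            # The primary converter did not modify the data: use the fallback.
            return fallback(text)
        return result
    return convert

# Hypothetical lexicon-based primary converter (the exception list).
exceptions = {"colour": "color"}

def lexicon_lookup(word):
    return exceptions.get(word, word)

# Hypothetical character-based fallback converter.
def character_rules(word):
    return word.replace("ph", "f")

convert = make_primary_fallback(lexicon_lookup, character_rules)

print(convert("colour"))  # handled by the lexicon
print(convert("phone"))   # not in the lexicon, so the character rules apply
```

One consequence of this rule is that a primary converter which legitimately maps a word to itself cannot be told apart from one that has no entry for it; in both cases the fallback is called.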
Saving the converter in the System Repository
On any of the Transduction Engine Setup dialogs, by default, if you click Apply or OK , the configured converter will be returned to the client application as a temporary converter; once the client application (e.g. FieldWorks or Word) is closed or releases the converter, it will no longer be available. If you want the converter to be permanently available to client applications, then you must explicitly add it to the System Repository using the Save in System Repository button (or the Update in System Repository button when editing a map).
- When you click Save in System Repository , the following dialog box will be displayed to query for a friendly name by which the converter will be known in client applications:
- Click Advanced... to bring up the Advanced EncConverter Configuration dialog, where you can enter further, optional information about this converter.
Result: This information is also put into the System Repository and can be used by various client applications.
Though these values are not necessary for the operation of the converter, they can be helpful to various client applications. For example, the Clipboard EncConverter can be configured to filter the list of displayed converters based on the Encoding Name and/or the Transduction Type configured here.
Known Issues
SIL Converters has the option of displaying all fonts in a Word document. However, it will only show you the fonts that are installed on your computer. It does not warn you of any fonts that have been used in the document but are not currently installed on your computer. You can find out what fonts the Word document needs that are not on your computer by looking at the tab of , click on Font Substitution... .
© 2003-2023 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI).