NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/UTW01

Introduction to Encoding Conversion Tutorial

David Rowe and Lorna A. Priest, 2009-02-17

Contents

Mapping introduction

Before beginning, complete the steps in the document Computer setup for encoding conversion.

Note

The object of this tutorial is to introduce you to the tools we'll be using for conversion of data encoded in legacy fonts to Unicode. We'll use a trivial example to see an overview of the steps in the process.

First we'll look at the source file

  • Open Explorer and navigate to wherever you unzipped the file Demo.txt. Double-click on this file. Notepad opens and displays the three characters SIp in the Times New Roman font (or whatever you have as your default font).

Note

If your computer is set to use a different program than Notepad to open .txt files, you will have to manually launch Notepad and open this file, or figure out the equivalent process with your program.

  • Select Format / Font, then choose SILDoulos IPA93 from the list and click OK.
  • The Notepad program now displays the three characters as IPA in the SILDoulos IPA93 font.
  • Select Format / Font, then choose Times New Roman (or whatever the font was before you changed it!) from the list and click OK. This is to reverse the font change done above. Select File / Close to close Notepad.
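
Changing the font does not change the file; only the glyph drawn for each byte changes. As a minimal sketch (assuming Demo.txt holds exactly the three bytes for "SIp", as described above; the literal below stands in for reading the file, and byte_values is a hypothetical helper name), the stored values can be listed in Python:

```python
# Stand-in for the contents of Demo.txt (assumption: the file holds only "SIp").
legacy_bytes = b"SIp"

def byte_values(data):
    """Return each byte as a 0xNN string, the notation used in mapping files."""
    return [f"0x{b:02X}" for b in data]

print(byte_values(legacy_bytes))  # ['0x53', '0x49', '0x70']
```

These are the same three byte values that appear on the left-hand side of the mapping file in the next section.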

Now we'll look at the mapping file

In order to keep this demonstration simple, we have only included mappings for the three characters that occur in the source file. This is not at all realistic!

  • From Windows Explorer, right-click on Demo.map, select Open With, and choose the TECkit Mapping Unicode Editor.
  • At the Select Conversion Type dialog box, make sure Legacy to Unicode and Bidirectional are selected.
  • Click on OK.
  • Where it says "Select the font for the left-hand side encoding", click on OK.
  • Select "SILDoulos IPA93" for the font and click on OK.
  • Where it says "Select the font for the right-hand side encoding", click on OK.
  • Select "Doulos SIL" for the font and click on OK.
  • The TECKit Mapping Editor program will start with the Demo.map file loaded. The display should have:
EncodingName            "demo"

0x49     <>     U+026A     ; latin_letter_small_capital_i
0x53     <>     U+0283     ; latin_small_letter_esh
0x70     <>     U+0070     ; latin_small_letter_p
  • If you do not already see a message saying "Compiled successfully!" then select File / Compile. The TECKit Mapping Editor creates the compiled mapping in the folder C:UTTutorial1-IntroDemo.tec.
  • Near the bottom, in the Left-side Sample: box, carefully type "SIp". Below it you'll see what the characters are converted to, followed by the round-trip conversion. All three should look the same.
  • Delete the "SIp" in the top box and replace it with "RST". You'll see "FFFD 0283 FFFD" displayed in the middle box. U+FFFD is the Unicode character which is "...used to replace an incoming character whose value is unknown or unrepresentable in Unicode." Since the Demo.map file contains no mapping for 0x52 ("R") or 0x54 ("T"), these characters are replaced by U+FFFD. In the conversion back to bytes, each U+FFFD is converted to a glottal character.
  • Click on File / Exit to close the TECKit Mapping Editor.
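
The behaviour seen in the sample boxes can be imitated with a short Python sketch. This is only an illustration of a bidirectional lookup table, not how TECkit itself is implemented. The three pairs come from Demo.map; the function names are hypothetical, and the choice of byte 0x3F ("?") for unmappable characters on the way back is an assumption made for this example.

```python
# Byte-to-code-point pairs copied from Demo.map.
LEGACY_TO_UNICODE = {
    0x49: 0x026A,  # latin_letter_small_capital_i
    0x53: 0x0283,  # latin_small_letter_esh
    0x70: 0x0070,  # latin_small_letter_p
}
# The "<>" in the map means each rule also applies in reverse.
UNICODE_TO_LEGACY = {cp: b for b, cp in LEGACY_TO_UNICODE.items()}

REPLACEMENT_CHAR = 0xFFFD  # Unicode REPLACEMENT CHARACTER
REPLACEMENT_BYTE = 0x3F    # assumption: "?" stands in for unmappable characters

def to_unicode(data):
    """Map each legacy byte to a character; unmapped bytes become U+FFFD."""
    return "".join(chr(LEGACY_TO_UNICODE.get(b, REPLACEMENT_CHAR)) for b in data)

def to_legacy(text):
    """Map each character back to a byte; unmapped characters become 0x3F."""
    return bytes(UNICODE_TO_LEGACY.get(ord(ch), REPLACEMENT_BYTE) for ch in text)

def code_points(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

print(code_points(to_unicode(b"SIp")))  # 0283 026A 0070
print(code_points(to_unicode(b"RST")))  # FFFD 0283 FFFD
print(to_legacy(to_unicode(b"SIp")))    # b'SIp'
print(to_legacy(to_unicode(b"RST")))    # b'?S?'
```

The last line shows why unmapped input does not survive a round trip: once "R" and "T" have become U+FFFD, the original bytes cannot be recovered.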

Using the compiled mapping

Now we will use the compiled mapping Demo.tec to convert the text in our source file Demo.txt using the DropTEC program.

  • Click on All Programs / SIL Converters / TECkit / DropTEC to open the program.
  • From Explorer, drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse button and navigate to the file.)
  • From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:". (Alternatively, you can use the adjacent Browse button and navigate to the file.) When you have loaded the file, DropTEC prompts you for a destination file. Click Save and accept the default name of Demo-U.txt.
  • The output file is created as soon as you supply its name. If you want a different "Unicode output form", select it before specifying the input file and the output file name.
  • Click on File / Exit of the "DropTEC" window to close the program.
  • Next, open Demo-U.txt in Notepad. Change the font to "Doulos SIL" and check that the characters display properly. They should.
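
If you would rather check the converted text by code point than by eye, the following Python sketch is one way to do it. It assumes the output file begins with a Byte Order Mark and falls back to UTF-8 when none is found; decode_with_bom is a hypothetical helper name, and the sample bytes are built in the script rather than read from Demo-U.txt.

```python
import codecs

# Byte Order Marks and the encodings they signal. The UTF-32 LE mark is
# tested before the UTF-16 LE mark because it begins with the same two bytes.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_with_bom(data, default="utf-8"):
    """Strip a leading BOM, if any, and decode with the encoding it signals."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)  # no BOM found: assume the default encoding

# Stand-in for the bytes of a converted file (assumption: UTF-8 with a BOM).
sample = codecs.BOM_UTF8 + "\u0283\u026A\u0070".encode("utf-8")
text = decode_with_bom(sample)
print([f"U+{ord(ch):04X}" for ch in text])  # ['U+0283', 'U+026A', 'U+0070']
```

To test a real file, read it in binary mode (for example with open("Demo-U.txt", "rb")) and pass the bytes to decode_with_bom.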

Encoding Forms

Note

The object of this tutorial is to look at several different Unicode encoding forms. We'll use the example file from the Introduction tutorial.

  • Click on All Programs / SIL Converters / TECkit / DropTEC to open the program.
  • With Explorer navigate to the C:UTTutorials2-Encoding Forms folder and drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse... button in DropTEC and navigate to the file.)
  • From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:". (Alternatively, you can use the adjacent Browse... button and navigate to the file.) When you have loaded the file, DropTEC prompts you for a destination file. Replace the default name of Demo-U.txt with Demo-UTF8.txt. Make sure that UTF8 is selected in the "Unicode output form" box. Click on Save.

Note

The output file is created as soon as you supply the target file name. If you want a different "Unicode output form", select it before specifying the input file and the output file name.

  • Click on UTF16BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16BE.txt. Click on Save.
  • Click on UTF16LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16LE.txt. Click on Save.
  • Click on UTF32BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32BE.txt. Click on Save.
  • Click on UTF32LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32LE.txt. Click on Save.
  • Close the "DropTEC" window.
  • Use Explorer to double-click on the DumpDemo.bat file in the C:UTTutorials2-Encoding Forms folder. (This will not work on Windows 95.) This generates a hex dump of the source file and the five files you just created. When you're done looking at the hexadecimal dump, press a key to close the box. The display should look like:

The table below shows how the three characters (plus the initial Byte Order Mark) appear in each of the different encodings.

Note

Although, strictly speaking, the Byte Order Mark is not needed for UTF-8, some programs include it as an indication that the text is encoded in UTF-8.
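
The byte sequences for the five encoding forms can also be worked out with a few lines of Python. This is a sketch rather than a replacement for DumpDemo.bat (whose contents are not shown here). It assumes each output file starts with a Byte Order Mark, as described above; the labels are the DropTEC output form names, and the helper names are hypothetical.

```python
import codecs

# The three converted characters: esh, small capital I, p.
TEXT = "\u0283\u026A\u0070"

# (DropTEC label, Python codec name, Byte Order Mark for that form)
FORMS = [
    ("UTF8", "utf-8", codecs.BOM_UTF8),
    ("UTF16BE", "utf-16-be", codecs.BOM_UTF16_BE),
    ("UTF16LE", "utf-16-le", codecs.BOM_UTF16_LE),
    ("UTF32BE", "utf-32-be", codecs.BOM_UTF32_BE),
    ("UTF32LE", "utf-32-le", codecs.BOM_UTF32_LE),
]

def encode_with_bom(text, encoding, bom):
    """Encode text in the given form, preceded by its Byte Order Mark."""
    return bom + text.encode(encoding)

def hex_bytes(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)

for label, encoding, bom in FORMS:
    print(f"{label:<8} {hex_bytes(encode_with_bom(TEXT, encoding, bom))}")
# First two lines printed:
# UTF8     EF BB BF CA 83 C9 AA 70
# UTF16BE  FE FF 02 83 02 6A 00 70
```

Comparing the UTF16BE and UTF16LE lines shows the same 16-bit values with their two bytes swapped, which is exactly what the Byte Order Mark lets a reader detect.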


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative.