Short URL: https://scripts.sil.org/UTW01
Introduction to Encoding Conversion Tutorial
David Rowe and Lorna A. Priest, 2009-02-17
Mapping introduction
Before beginning, work through the document Computer setup for encoding conversion.
Note
The object of this tutorial is to introduce you to the tools we'll be using for conversion of data encoded in legacy fonts to Unicode. We'll use a trivial example to see an overview of the steps in the process.
First we'll look at the source file
- Open Explorer and navigate to wherever you unzipped the file Demo.txt. Double-click on this file. The Notepad program comes up displaying the three characters: SIp in the Times New Roman font (or whatever you have as your default font).
Note
If your computer is set to use a different program than Notepad to open .txt files, you will have to manually launch Notepad and open this file, or figure out the equivalent process with your program.
- Open Notepad's font dialog, choose SILDoulos IPA93 from the list and click OK.
- The Notepad program now displays the three characters as IPA in the SILDoulos IPA93 font.
- Open the font dialog again, choose Times New Roman (or whatever the font was before you changed it!) from the list and click OK. This reverses the font change made above.
- Close Notepad.
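The point of the font switch is that the stored data never changes; only the glyphs drawn for each byte do. A minimal sketch of that idea, assuming Demo.txt holds just the three single-byte characters described above (the variable names are illustrative, not part of the tutorial files):

```python
# Assumption: Demo.txt contains exactly the three characters "SIp",
# stored one byte per character with no Byte Order Mark.
legacy_bytes = "SIp".encode("latin-1")

# The byte values are the same whichever font Notepad uses to draw them.
hex_view = " ".join(f"0x{b:02X}" for b in legacy_bytes)
print(hex_view)  # 0x53 0x49 0x70
```

These are the same three byte values that appear on the left-hand side of the mapping file in the next section.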
Now we'll look at the mapping file
In order to keep this demonstration simple, we have only included mappings for the three characters that occur in the source file. This is not at all realistic!
- From Windows Explorer, right-click on Demo.map and select and the .
- At the dialog box, make sure and are selected.
- Click on OK .
- Where it says click on OK .
- Select "SILDoulos IPA93" for the font and click on OK .
- Where it says click on OK .
- Select "Doulos SIL" for the font and click on OK .
- The TECKit Mapping Editor program will start with the Demo.map file loaded. The display should have:
EncodingName "demo"
0x49 <> U+026A ; latin_letter_small_capital_i
0x53 <> U+0283 ; latin_small_letter_esh
0x70 <> U+0070 ; latin_small_letter_p
- If you do not already see a message saying "Compiled successfully!" then select . The TECKit Mapping Editor creates the compiled mapping in the folder C:UTTutorial1-IntroDemo.tec.
- Near the bottom in the box, carefully type "SIp". Below it you'll see what the characters are converted to, followed by the round-trip conversion. All three should look the same.
- Delete the "SIp" in the top box and replace it with "RST". You'll see "FFFD 0283 FFFD" displayed in the middle box. U+FFFD is the Unicode character which is "...used to replace an incoming character whose value is unknown or unrepresentable in Unicode." Since the Demo.map file contains no mapping for 0x52 ("R") or 0x54 ("T"), these two characters are replaced by U+FFFD, while 0x53 ("S") still maps to U+0283. In the conversion back to bytes, each U+FFFD is then converted to a glottal character.
- Close the TECKit Mapping Editor.
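The behaviour seen in the editor can be approximated in a few lines of Python. This is only a sketch of the idea, not how TECKit itself is implemented: the three pairs come from the Demo.map display above, while the function names and the choice of 0x3F ("?") as the byte written for an unmappable character are assumptions made for illustration.

```python
# Byte <-> Unicode pairs copied from the Demo.map display.
BYTE_TO_UNICODE = {
    0x49: "\u026A",  # latin_letter_small_capital_i
    0x53: "\u0283",  # latin_small_letter_esh
    0x70: "\u0070",  # latin_small_letter_p
}
UNICODE_TO_BYTE = {char: byte for byte, char in BYTE_TO_UNICODE.items()}

REPLACEMENT_CHAR = "\uFFFD"
# Assumption: unmappable characters are written back as 0x3F ("?"),
# which the legacy IPA font is assumed to draw as a glottal character.
REPLACEMENT_BYTE = 0x3F


def to_unicode(data: bytes) -> str:
    """Map each legacy byte to Unicode, using U+FFFD for unmapped bytes."""
    return "".join(BYTE_TO_UNICODE.get(b, REPLACEMENT_CHAR) for b in data)


def to_bytes(text: str) -> bytes:
    """Map each Unicode character back to a legacy byte."""
    return bytes(UNICODE_TO_BYTE.get(ch, REPLACEMENT_BYTE) for ch in text)


def code_points(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(code_points(to_unicode(b"SIp")))  # 0283 026A 0070
print(code_points(to_unicode(b"RST")))  # FFFD 0283 FFFD
print(to_bytes(to_unicode(b"RST")))     # b'?S?'
```

The last line shows why "RST" does not survive the round trip: once a byte has become U+FFFD, the original value is lost.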
Using the compiled mapping
Now we will use the compiled mapping Demo.tec to convert the text in our source file Demo.txt using the DropTEC program.
- Open the DropTEC program.
- From Explorer, drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse button and navigate to the file.)
- From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:" (alternatively, you can use the adjacent Browse button and navigate to the file). When you have loaded the file, DropTEC prompts you for a destination file. Click Save and accept the default name of Demo-U.txt.
- The output file is created as soon as you supply its name, so if you want a different "Unicode output form", select it before specifying the input file (and giving the name for the output file).
- Close the "DropTEC" window.
- Next, open the Demo-U.txt in Notepad. Change the font to "Doulos SIL" and see if the characters display properly. They should.
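For readers who want to see the file-level step spelled out, here is a rough Python equivalent. It is a sketch under stated assumptions, not DropTEC's actual code: `convert_file` and `MAPPING` are hypothetical names, the mapping is retyped from Demo.map rather than read from Demo.tec, and writing a leading Byte Order Mark is assumed.

```python
from pathlib import Path

# Pairs retyped from Demo.map (hypothetical stand-in for the compiled Demo.tec).
MAPPING = {0x49: "\u026A", 0x53: "\u0283", 0x70: "\u0070"}


def convert_file(legacy_path, unicode_path, codec="utf-8"):
    """Convert a legacy single-byte text file to a Unicode text file."""
    data = Path(legacy_path).read_bytes()
    # Unmapped bytes become U+FFFD, as in the mapping editor demonstration.
    text = "".join(MAPPING.get(b, "\uFFFD") for b in data)
    # Assumption: the output starts with a Byte Order Mark (U+FEFF).
    Path(unicode_path).write_bytes(("\uFEFF" + text).encode(codec))
    return text
```

With the tutorial files in the current folder, `convert_file("Demo.txt", "Demo-U.txt")` would produce a UTF-8 file that Notepad can display in Doulos SIL.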
Encoding Forms
Note
The object of this tutorial is to look at several different Unicode encoding forms. We'll use the example file from the Introduction tutorial.
- Open the DropTEC program.
- With Explorer navigate to the C:UTTutorials2-Encoding Forms folder and drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse... button in DropTEC and navigate to the file.)
- From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:". (Alternatively, you can use the adjacent Browse... button and navigate to the file.) When you have loaded the file, DropTEC prompts you for a destination file. Replace the default name of Demo-U.txt with Demo-UTF8.txt. Make sure that UTF8 is selected in the "Unicode output form" box. Click on Save .
Note
Note that once you supply the target file name, the file is created, so if you want to select a different "Unicode output form", you'll need to do it before specifying the input file (and giving the name for the output file).
- Click on UTF16BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16BE.txt. Click on Save .
- Click on UTF16LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16LE.txt. Click on Save .
- Click on UTF32BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32BE.txt. Click on Save .
- Click on UTF32LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32LE.txt. Click on Save .
- Close the "DropTEC" window.
- Use Explorer to double-click on the DumpDemo.bat file in the C:UTTutorials2-Encoding Forms folder. (This will not work on Windows 95.) This generates a hex dump of the source file and the five files you just created. When you're done looking at the hexadecimal dump, press a key to close the box. The display should look like:
The table below shows how the three characters (plus the initial Byte Order Mark) appear in each of the different encodings.
Note
Note that although, strictly speaking, the Byte Order Mark is not needed for UTF-8, some programs include it as an indication that the text is encoded in UTF-8.
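The byte sequences for each encoding form can also be worked out without the tutorial files. The sketch below uses Python's built-in codecs on the three converted characters preceded by the Byte Order Mark; the dictionary keys mirror the labels in the "Unicode output form" box, and the helper name `hex_dump` is illustrative.

```python
# Byte Order Mark followed by the three converted characters (esh, small capital I, p).
text = "\uFEFF\u0283\u026A\u0070"

# "Unicode output form" label -> Python codec name.
# The -be/-le codecs add no BOM of their own, so the explicit U+FEFF is the only one.
CODECS = {
    "UTF8": "utf-8",
    "UTF16BE": "utf-16-be",
    "UTF16LE": "utf-16-le",
    "UTF32BE": "utf-32-be",
    "UTF32LE": "utf-32-le",
}


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)


dumps = {form: hex_dump(text.encode(codec)) for form, codec in CODECS.items()}
for form, dump in dumps.items():
    print(form, dump)
```

For the demo text this gives, for example, `EF BB BF CA 83 C9 AA 70` for UTF8 and `FF FE 83 02 6A 02 70 00` for UTF16LE. The legacy source file, by contrast, is just `53 49 70`.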
© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.