Short URL: https://scripts.sil.org/StructuredDataConversion
Structured data conversion
Lorna Priest and David Rowe, 2003-03-03
Warning!
These instructions are out-of-date. They may still be useful to give an overview of the process.
Please use SILConverters 4.0 for converting your data to Unicode. Its tools will help you with structured data conversion.
Note
By the end of this tutorial you should be able to convert (roundtrip) structured data and test it by bringing it into various applications. Any issues discovered in this process should be fixed in the mapping files.
Structured Data Conversion for sfm files
Before beginning, work through the document “Computer setup for encoding conversion.”
Now we want to convert some Scripture text (formatted with standard format markers) into Unicode.
- Navigate to: C:UTWork5b-Structured Data Conversion.
- We will be converting a file called BN1J.sfm. This is a file from Ethiopia which uses the Ethiopic script (Bench scripture used by permission of the Bible Society of Ethiopia). Take a minute to open this file (BN1J.sfm) in a text editor.
- You may be asked what encoding to use. Just choose “plain text.” Glance through the file. Although the text appears in Latin characters, it is really Ethiopic, with the exception of the English text in the “id” line and in the “pic” line which is right above verse 5 of chapter 1. This is text we do not want to convert to the Ethiopic script.
- Write down the names of the sfms which contain English words.
- Search for |gk and look at the syntax.
- We already have a mapping file prepared for you called SILEthBench.map. Open this file with the TECkit Mapping Unicode Editor.
- Choose legacy to Unicode encoding.
- Choose for the left-hand side encoding and for the right-hand side encoding.
- The first Ethiopic word in BN1J.sfm is the header: \h 1 yohanis. The rules in the mapping file which will be used for converting this word are the following:
'yo' <> ethiopic_syllable_yo
'ha' <> ethiopic_syllable_ha ; ha
'ni' <> ethiopic_syllable_ni
's' <> ethiopic_syllable_se
- From the menu, open the “Test Mapping” window, type “yohanis” in the top box, then click Do Mapping. You should see “12EE 1200 1292 1235” displayed in the middle box and “yohanis” in the bottom box. Close the “Test Mapping” window.
- From the menu, compile the mapping file. This creates a compiled mapping file called SILEthBench.tec. Close the TECkit Mapping Editor.
- Open and drag and drop SILEthBench.tec into the “Mapping file” window. Leave the “Unicode output form” set to UTF8.
- Drag and drop BN1J.sfm to the “Legacy text file” window, and accept the Unicode file name suggestion of BN1J-U.sfm.
- Now open BN1J-U.sfm in Microsoft Word. When you are asked to choose an encoding, select UTF-8. You may need to select the entire text and set the font to Code 2000. What do you notice? Have your sfms been preserved? Can you read the “id” line and the “pic” lines? Does the inline marker (|gk{}) look like Greek?
- Close Word without saving any changes.
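The mapping test performed above with “yohanis” can be sketched in a few lines of Python. This is only an illustration of how the four quoted rules behave in both directions. The `RULES` table is a hypothetical subset of SILEthBench.map, the function names are invented for this sketch, and trying the longest rule first is an assumption about matching order, not a description of how TECkit is implemented.

```python
# Hypothetical subset of the mapping rules quoted above. The real
# SILEthBench.map has many more rules and is compiled and applied by TECkit;
# this table and the functions below are illustrative only.
RULES = {
    "yo": "\u12EE",  # ethiopic_syllable_yo
    "ha": "\u1200",  # ethiopic_syllable_ha
    "ni": "\u1292",  # ethiopic_syllable_ni
    "s": "\u1235",   # ethiopic_syllable_se
}
REVERSE = {uni: legacy for legacy, uni in RULES.items()}


def to_unicode(legacy: str) -> str:
    """Convert legacy text to Unicode, trying the longest rule first (assumed)."""
    longest = max(len(key) for key in RULES)
    out = []
    pos = 0
    while pos < len(legacy):
        for size in range(longest, 0, -1):
            chunk = legacy[pos:pos + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                pos += len(chunk)
                break
        else:
            raise ValueError(f"no rule matches at position {pos}: {legacy[pos:]!r}")
    return "".join(out)


def to_legacy(text: str) -> str:
    """Convert Unicode back to legacy text, one character at a time."""
    return "".join(REVERSE[ch] for ch in text)


def code_points(text: str) -> str:
    """Format each character as a four-digit hexadecimal code point."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


converted = to_unicode("yohanis")
print(code_points(converted))  # 12EE 1200 1292 1235
print(to_legacy(converted))    # yohanis
```

The two printed lines correspond to the middle and bottom boxes of the “Test Mapping” window: the Unicode code points, then the text converted back to the legacy encoding.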
SFConv to the rescue
For sfm files we really do not want to take them into Microsoft Word to convert them. Word makes all kinds of unexpected modifications, so where possible it is better to do your conversion outside of Word.
SFConv is a command-line utility which can convert sfm files to and from Unicode. Different sfms can be converted with different mapping files.
- Find the file called GNT-map.xml. Open this file in a text editor. This is a sample SFconv control file which you will adapt to work with this data.
- Save this file (GNT-map.xml) as BEN-map.xml.
- Change the line:
<sfConversion defaultMapping="SILGreek">
to our Bench mapping file name:
<sfConversion defaultMapping="SILEthBench">
This changes the default mapping to the one for Bench, an Ethiopian language.
- Next we will change the lines for sfms which will use a different mapping file. Remember the sfms you wrote down above? This is where you will put them.
- Change the following lines:
<sfMarkers escape="\"
chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="id" mapping="ISO-8859-1" />
<marker name="fe" mapping="ISO-8859-1" />
</sfMarkers>
to (the only thing you will change is "fe" to "pic"):
<sfMarkers escape="\"
chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="id" mapping="ISO-8859-1" />
<marker name="pic" mapping="ISO-8859-1" />
</sfMarkers>
The first line tells TECkit what characters form sfms and what mapping to use to convert the text of the markers to Unicode (for example, the “p” of a marker should not be changed to an Ethiopic character). The next lines tell TECkit that we do not want to use SILEthBench for the “id” and “pic” sfms; we want to use ISO-8859-1 instead. If you wanted to use a totally different encoding for one of these sfms you could do that too. You can also add lines here for other sfms.
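As a rough illustration of how these settings fit together, the following Python sketch parses the edited fragment and looks up which mapping applies to the text after a given sfm. It is not part of SFconv. The function name `mapping_for` is hypothetical, the backslash escape character is an assumption based on sfm syntax, and the fallback to `defaultMapping` follows the description above.

```python
import xml.etree.ElementTree as ET

# The control-file fragment from the tutorial, wrapped in its root element.
# The backslash escape character is an assumption based on sfm syntax.
CONTROL_XML = r'''<sfConversion defaultMapping="SILEthBench">
  <sfMarkers escape="\"
      chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
    <marker name="id" mapping="ISO-8859-1" />
    <marker name="pic" mapping="ISO-8859-1" />
  </sfMarkers>
</sfConversion>'''


def mapping_for(root: ET.Element, marker: str) -> str:
    """Return the mapping name for the text that follows the given sfm."""
    for entry in root.findall("./sfMarkers/marker"):
        if entry.get("name") == marker:
            return entry.get("mapping")
    # Markers that are not listed fall back to the default mapping.
    return root.get("defaultMapping")


root = ET.fromstring(CONTROL_XML)
for name in ("id", "pic", "h"):
    print(name, "->", mapping_for(root, name))
```

In this sketch “id” and “pic” resolve to ISO-8859-1, while an unlisted marker such as “h” resolves to SILEthBench.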
Next we will look at inline markers.
Remember looking at the inline marker |gk which was in your sfm file? This is an unrealistic example of how you might use an inline marker, but it will clearly show you how other encodings can be used.
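To make the idea concrete, here is a minimal Python sketch of how a line containing an inline marker could be split into segments that each use their own mapping. It assumes the |gk{...} syntax seen in the sfm file. The sample text “logos”, the table `INLINE_MAPPINGS`, and the function `split_segments` are hypothetical, and the real parsing done by SFconv may differ.

```python
import re

# Assumed syntax, following the |gk{...} marker seen in the sfm file:
# "|" as escape, then the marker name, then text enclosed in "{" and "}".
INLINE = re.compile(r"\|([A-Za-z_]+)\{([^}]*)\}")

# Hypothetical inline-marker table; only "gk" appears in the tutorial data.
INLINE_MAPPINGS = {"gk": "SILGreek"}


def split_segments(text: str, default: str) -> list[tuple[str, str]]:
    """Split text into (mapping name, segment) pairs."""
    segments = []
    pos = 0
    for match in INLINE.finditer(text):
        # Text before the inline marker uses the default mapping.
        if match.start() > pos:
            segments.append((default, text[pos:match.start()]))
        name, content = match.group(1), match.group(2)
        # Unknown inline markers fall back to the default mapping.
        segments.append((INLINE_MAPPINGS.get(name, default), content))
        pos = match.end()
    if pos < len(text):
        segments.append((default, text[pos:]))
    return segments


for mapping, segment in split_segments("yohanis |gk{logos} yohanis", "SILEthBench"):
    print(f"{mapping}: {segment!r}")
```

Each segment would then be converted with the mapping named beside it, so only the text inside the braces is treated as Greek.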
- Take a look at the following lines in the xml file. Notice that there is an inline marker for Greek.
<inlineMarkers escape="|" start="{" end="}"
chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="rm" mapping="ISO-8859-1" />
<marker name="gk" mapping="SILGreek" />
</inlineMarkers>
- Finally, to do the actual conversion from 8-bit to Unicode (UTF-8), go to the Command Prompt (MS-DOS) and type the following (adjust the path to match where you have put sfconv.exe):
"\Program Files\TECkit\sfconv" -8u -utf8 -c ben-map.xml -i bn1j.sfm -o bn1j-utf8.txt
- And to convert Unicode (UTF-8) back to 8-bit text:
"\Program Files\TECkit\sfconv" -u8 -utf8 -c ben-map.xml -i bn1j-utf8.txt -o test.sf
Note
There is also a batch file called test.bat provided for your use. You may need to change the path.
- If you run these two conversions, test.sf should be identical to bn1j.sfm. The Unicode version can be viewed in a Unicode text editor or in Word. You might notice some square boxes. These are characters in the PUA (Private Use Area of Unicode), and unless you have the SIL Abyssinica font you will not be able to view them. Notice also that the actual markers, as well as the content of the “id” line and the “pic” line, should all be in roman script; the content of the |gk{} marker should be in Greek, and everything else should be in Ethiopic.
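The round-trip check described above can also be done with a short script instead of by eye. The following Python sketch is one possible approach, not a tool supplied with TECkit. The function names are hypothetical, and the file names are the ones used in the commands above. It compares the original and round-tripped files byte by byte and lists any Private Use Area characters (U+E000 to U+F8FF) found in the Unicode file.

```python
from pathlib import Path

# File names taken from the sfconv commands in this tutorial.
ORIGINAL = Path("bn1j.sfm")
ROUND_TRIPPED = Path("test.sf")
UNICODE_FILE = Path("bn1j-utf8.txt")


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One file is a prefix of the other.
        return min(len(a), len(b))
    return None


def private_use_chars(text: str) -> set[str]:
    """Collect characters in the BMP Private Use Area (U+E000 to U+F8FF)."""
    return {ch for ch in text if "\uE000" <= ch <= "\uF8FF"}


def check_round_trip(original: Path, round_tripped: Path, unicode_file: Path) -> None:
    """Report whether the round trip preserved the file, and list PUA characters."""
    diff = first_difference(original.read_bytes(), round_tripped.read_bytes())
    if diff is None:
        print("round trip OK: files are identical")
    else:
        print(f"files differ at byte offset {diff}")
    pua = private_use_chars(unicode_file.read_text(encoding="utf-8"))
    print("PUA characters:", " ".join(f"U+{ord(ch):04X}" for ch in sorted(pua)))


# Only run the check when all three files are present in the current folder.
if all(path.exists() for path in (ORIGINAL, ROUND_TRIPPED, UNICODE_FILE)):
    check_round_trip(ORIGINAL, ROUND_TRIPPED, UNICODE_FILE)
```

A reported byte offset points to the place where the mapping file does not round-trip cleanly, which is where to start looking for a rule that needs fixing.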
Structured Data Conversion for Word documents
- With Explorer navigate to the C:UTTutorials5a-Structured Data Conversion folder and locate the file Ex1.doc. Double-click on this file to open it with Word. If Word asks about macros, you should enable them. (We promise that as they are distributed these files do not contain any viruses.)
Note
You will be closing this file without saving the changes.
- From the menu, select the Data Conversion menu item.
- For “Name:”, select “SIL IPA93 <> UNICODE”.
- For “Scope of change”, select “A specific regular font:” and choose “SILDoulos IPA93” from the list.
- For “Target Font”, check the “Apply specific Font” box and choose “Doulos SIL” from the list.
- Click OK.
- Observe that a number of characters (the open “o” characters for example) did not convert correctly.
- Close this file without saving changes.
- Open the Ex1.doc file with the WordPad program. You may be able to right-click on the file, select , and from the context menu. Otherwise you need to start the WordPad program, navigate to the file and open it.
- In WordPad, select . In the dialog box that comes up, make sure that the “Save as type:” field is set to “Rich Text Format (RTF)”.
- Change the “File name:” field to “Ex1.rtf”.
- Click the Save button and then close WordPad.
- Now double-click on the newly created Ex1.rtf file to open it in Word.
- From the menu, select the Data Conversion menu item.
- For “Name:”, select “SIL IPA93 <> UNICODE”.
- For “Scope of change”, select “A specific regular font:” and choose “SILDoulos IPA93” from the list.
- For “Target Font”, check the “Apply specific Font” box and choose “Doulos SIL” from the list.
- Click OK.
- You should now see the IPA data converted and in the new font. However, some of the original text was in Times New Roman. We now want to convert it to the Doulos SIL font, so that the whole text is in the same font.
- Press Ctrl+Home to return to the top of the file.
- Select .
- If necessary, click More to display the Search Options, including Format, which will be needed in the next steps.
- Make sure that the “Find what:” field is empty, then click Format, select Font..., and choose “Times New Roman”.
- Now make sure the “Replace with:” field is empty, then click Format, select Font..., and choose “Doulos SIL”.
- Click Replace All. Word displays a dialog box with the message “Word has completed its search of the document and has made 11 replacements.” Click OK to acknowledge this message, then close the dialog box.
- You may notice that the data in the table no longer fits properly. Click in the table somewhere, then from the menu, select .
Note
The exact sequence of commands may vary depending on your version of Word.
- From the menu, choose .
- From the menu, choose .
© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI).