This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. Please consult our other sites for more current information: software.sil.org, ScriptSource, FDBP, and silfontdev



Short URL: https://scripts.sil.org/StructuredDataConversion

Structured data conversion

Lorna Priest and David Rowe, 2003-03-03

Warning!

These instructions are out-of-date. They may still be useful to give an overview of the process.

Please use SILConverters 4.0 for converting your data to Unicode. The Bulk SFM Converter, Bulk Word Document Converter, XML Data Converters, and the SILConverters for Office tools will all help you with structured data conversion.

Note

By the end of this tutorial you should be able to convert (roundtrip) structured data and test it by bringing it into various applications. Any issues discovered in this process should be fixed in the mapping files.

Structured Data Conversion for sfm files

Before beginning, you will need to complete the steps in the document Computer setup for encoding conversion.

Now we want to convert some Scripture text (formatted with standard format markers) into Unicode.

  • Navigate to: C:\UT\Work\5b-Structured Data Conversion.
  • We will be converting a file called BN1J.sfm. This is a file from Ethiopia which uses the Ethiopic script (Bench scripture used by permission of the Bible Society of Ethiopia). Take a minute to open this file (BN1J.sfm) in a text editor.
    • You may be asked what encoding to use. Just choose “plain text.” Glance through the file. Although the text appears in Latin characters, it is really Ethiopic, with the exception of the English text in the “id” line and in the “pic” line which is right above verse 5 of chapter 1. This is text we do not want to convert to the Ethiopic script.
  • Write down the names of the sfms which contain English words.
  • Search for |gk and look at the syntax.
  • We already have a mapping file prepared for you called SILEthBench.map. Open this file with the TECkit Mapping Unicode Editor (Start / All Programs / SIL Converters / TECkit / TECkit Map Unicode Editor).
    • Choose legacy to Unicode encoding
    • Choose Arial for the left-hand side encoding and Code2000 for the right-hand side encoding.
  • The first Ethiopic word in BN1J.sfm is in the header line: \h 1 yohanis. The rules in the mapping file which will be used for converting this word are the following:
'yo' <> ethiopic_syllable_yo
'ha' <> ethiopic_syllable_ha    ; ha
'ni' <> ethiopic_syllable_ni
's' <> ethiopic_syllable_se
  • From the File menu, select Test Mapping..., type “yohanis” in the top box, then click Do Mapping. You should see “12EE 1200 1292 1235” displayed in the middle box and “yohanis” in the bottom box. Close the “Test Mapping” window.
  • From the File menu, select Compile to compile the mapping file. This creates a compiled mapping file called SILEthBench.tec. Close the TECkit Mapping Editor.
  • Open DropTEC and drag and drop SILEthBench.tec into the “Mapping file” window. Leave the “Unicode output form” set to UTF8.
  • Drag and drop BN1J.sfm to the “Legacy text file” window, and accept the Unicode file name suggestion of BN1J-U.sfm.
  • Now open BN1J-U.sfm in Microsoft Word. When you are asked to choose an encoding, select UTF-8. You may need to select the entire text and set the font to Code2000. What do you notice? Have your sfms been preserved? Can you read the “id” line and the “pic” line? Does the inline marker (|gk{}) look like Greek?
  • Close Word without saving any changes.
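The four rules quoted above can be imitated in a few lines of Python to see why “yohanis” produces those four code points and why the mapping roundtrips. This is only an illustrative sketch, not the TECkit engine: the names RULES, to_unicode, to_legacy and code_points are hypothetical, and the longest-match-first strategy is an assumption made for the sketch.

```python
# Illustrative only: imitates the four bidirectional rules quoted above.
# It is not the TECkit engine, and all names here are hypothetical.
RULES = {
    "yo": "\u12EE",  # ethiopic_syllable_yo
    "ha": "\u1200",  # ethiopic_syllable_ha
    "ni": "\u1292",  # ethiopic_syllable_ni
    "s": "\u1235",   # ethiopic_syllable_se
}


def to_unicode(legacy: str) -> str:
    """Convert legacy text, trying the longest legacy sequence first."""
    keys = sorted(RULES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(legacy):
        for key in keys:
            if legacy.startswith(key, i):
                out.append(RULES[key])
                i += len(key)
                break
        else:
            # No rule covers this position; a real mapping file needs more rules.
            raise ValueError(f"no rule matches at position {i}: {legacy[i:]!r}")
    return "".join(out)


def to_legacy(text: str) -> str:
    """Apply the same rules in the Unicode-to-legacy direction."""
    reverse = {uni: leg for leg, uni in RULES.items()}
    return "".join(reverse[ch] for ch in text)


def code_points(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(code_points(to_unicode("yohanis")))  # 12EE 1200 1292 1235
print(to_legacy(to_unicode("yohanis")))    # yohanis
```

The two printed lines correspond to the middle and bottom boxes of the Test Mapping window.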

SFConv to the rescue

For sfm files, we do not want to open them in Microsoft Word to convert them. Word makes unexpected modifications to the text, so if possible it is better to do the conversion outside of Word.

SFConv is a command-line utility which can convert sfm files to and from Unicode. Various sfms can be converted with different mapping files.

  • Find the file called GNT-map.xml. Open this file in a text editor. This is a sample SFconv control file which you will adapt to work with this data.
  • Save this file (GNT-map.xml) as BEN-map.xml.
  • Change the line:
<sfConversion defaultMapping="SILGreek">

to our Bench mapping file name:

<sfConversion defaultMapping="SILEthBench">

This changes the default mapping to the one for Bench, a language of Ethiopia.

  • Next we will change the lines for sfms which will use a different mapping file. Remember the sfms you wrote down above? This is where you will put them.
  • Change the following lines:
<sfMarkers escape="\" 
     chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="id" mapping="ISO-8859-1" /> 
<marker name="fe" mapping="ISO-8859-1" /> 
</sfMarkers>

to (the only thing you will change is "fe" to "pic"):

<sfMarkers escape="\" 
     chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="id" mapping="ISO-8859-1" /> 
<marker name="pic" mapping="ISO-8859-1" /> 
</sfMarkers>

The first line tells TECkit what characters form sfms and what mapping to use to convert the text of the markers to Unicode (i.e. we do not want to change “p” to an Ethiopic character). The next lines tell TECkit that we do not want to use SILEthBench for the “id” and “pic” sfms; we want to use ISO-8859-1 instead. If you wanted to use a totally different encoding for one of these sfms, you could do that too. You can also add lines here for other sfms.
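The lookup described above (a default mapping, overridden for named sfms) can be sketched in Python. The XML below is a reduced fragment assembled from the snippets shown; the real BEN-map.xml may contain further elements, and the escape character is assumed to be a backslash. The function names are hypothetical, and this only models the selection logic, not what SFconv does internally.

```python
import xml.etree.ElementTree as ET

# Reduced fragment assembled from the snippets above (assumption: the real
# control file has more content, and the sfm escape character is a backslash).
CONTROL = r'''
<sfConversion defaultMapping="SILEthBench">
  <sfMarkers escape="\"
       chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
    <marker name="id" mapping="ISO-8859-1" />
    <marker name="pic" mapping="ISO-8859-1" />
  </sfMarkers>
</sfConversion>
'''


def load_field_mappings(xml_text: str):
    """Return (default mapping, {sfm name: mapping}) from a control file."""
    root = ET.fromstring(xml_text.strip())
    default = root.get("defaultMapping")
    overrides = {}
    sf_markers = root.find("sfMarkers")
    if sf_markers is not None:
        for marker in sf_markers.findall("marker"):
            overrides[marker.get("name")] = marker.get("mapping")
    return default, overrides


def mapping_for(marker: str, default: str, overrides: dict) -> str:
    """Pick the mapping for one sfm: its override if listed, else the default."""
    return overrides.get(marker, default)


default, overrides = load_field_mappings(CONTROL)
for name in ("id", "pic", "h", "v"):
    print(name, "->", mapping_for(name, default, overrides))
```

Any sfm not listed, such as “h” or “v”, falls through to the default mapping, which is why only the fields containing English need their own marker lines.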

Next we will look at inline markers.

Remember looking at the inline marker |gk which was in your sfm file? This is an unrealistic example of how you might use an inline marker, but it will clearly show you how other encodings can be used.

  • Take a look at the following lines in the xml file. Notice that there is an inline marker for Greek.
<inlineMarkers escape="|" start="{" end="}" 
     chars="abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ" mapping="ISO-8859-1">
<marker name="rm" mapping="ISO-8859-1" />
<marker name="gk" mapping="SILGreek" /> 
</inlineMarkers>
  • Finally, to do the actual conversion from 8-bit to Unicode (UTF8), we need to go to the Command Prompt (MS-DOS) and type this (you will need to add the right path, depending on where you have put sfconv.exe):
"\Program Files\TECkit\sfconv" -8u -utf8 -c ben-map.xml -i bn1j.sfm -o bn1j-utf8.txt
  • and to convert Unicode (UTF8) back to 8-bit text:
"\Program Files\TECkit\sfconv" -u8 -utf8 -c ben-map.xml -i bn1j-utf8.txt -o test.sf
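The inlineMarkers element shown above declares an escape character (|), a start character ({) and an end character (}). A rough Python sketch of how a line of text might be split into runs, each with its own mapping, follows. The regular expression, the function name split_runs, and the sample word “logos” are all hypothetical; this illustrates the idea and is not SFconv's actual parser.

```python
import re

# Matches |name{content}, following the escape/start/end characters declared
# in the inlineMarkers element. Nested markers are not handled in this sketch.
INLINE = re.compile(r"\|([A-Za-z_]+)\{(.*?)\}")

# Inline marker names and mappings as listed in the control file snippet.
INLINE_MAPPINGS = {"rm": "ISO-8859-1", "gk": "SILGreek"}


def split_runs(text: str, default_mapping: str = "SILEthBench"):
    """Split text into (mapping, content) runs around inline markers."""
    runs = []
    pos = 0
    for match in INLINE.finditer(text):
        if match.start() > pos:
            # Text before the marker uses the surrounding (default) mapping.
            runs.append((default_mapping, text[pos:match.start()]))
        name, content = match.group(1), match.group(2)
        runs.append((INLINE_MAPPINGS.get(name, default_mapping), content))
        pos = match.end()
    if pos < len(text):
        runs.append((default_mapping, text[pos:]))
    return runs


# "logos" is a made-up sample, not text from BN1J.sfm.
for mapping, content in split_runs("yohanis |gk{logos} yohanis"):
    print(f"{mapping}: {content!r}")
```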

Note

There is also a batch file called test.bat provided for your use. You may need to change the path.

  • If you run these two conversions, test.sf should be identical to bn1j.sfm. The Unicode version can be viewed in a Unicode text editor or in Word. You might notice some square boxes. These are characters in the PUA (Private Use Area of Unicode), and unless you have the Abyssinica SIL font you will not be able to view them. Notice also that the actual markers, as well as the content of the “id” line and the “pic” line, should all be in roman script; the content of the |gk{} marker should be in Greek, and everything else should be in Ethiopic.
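One way to confirm that the roundtrip really is lossless is to compare the two files byte for byte rather than by eye. The following is a minimal Python sketch; the function name first_difference is hypothetical, and the file names in the usage note are the ones used in the commands above.

```python
from pathlib import Path


def first_difference(original, roundtripped):
    """Return None if the files are byte-identical, else the first differing offset."""
    a = Path(original).read_bytes()
    b = Path(roundtripped).read_bytes()
    if a == b:
        return None
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file is a prefix of the other: they differ where the shorter one ends.
    return min(len(a), len(b))
```

Calling first_difference("bn1j.sfm", "test.sf") in the working folder should return None after a clean roundtrip. Any other value is the byte offset at which to start looking for a character that the mapping file does not handle symmetrically.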

Structured Data Conversion for Word documents

  • With Explorer, navigate to the C:\UT\Tutorials\5a-Structured Data Conversion folder and locate the file Ex1.doc. Double-click on this file to open it with Word. If Word asks about macros, you should enable them. (We promise that as they are distributed these files do not contain any viruses.)

Note

You will be closing this file without saving the changes.

  • From the Tools menu, select the Data Conversion menu item.
  • For “Name:”, select “SIL IPA93 <> UNICODE”.
  • For “Scope of change”, select “A specific regular font:” and choose “SILDoulos IPA93” from the list.
  • For “Target Font”, check the “Apply specific Font” box and choose “Doulos SIL” from the list.
  • Click OK.
  • Observe that a number of characters (the open “o” characters for example) did not convert correctly.

Note

It turns out that these characters were inserted into the Word document using Insert / Symbol, and are encoded in a way that prevents the macro from identifying them. See Encoding Conversion Frequently Asked Questions.

  • Close this file without saving changes.
  • Open the Ex1.doc file with the WordPad program. You may be able to right-click on the file, select Open With, and WordPad from the context menu. Otherwise you need to start the WordPad program, navigate to the file and open it.
  • In WordPad, select File / Save As. In the dialog box that comes up, make sure that the “Save as type:” field is set to “Rich Text Format (RTF)”.
  • Change the “File name:” field to “Ex1.rtf”.
  • Click the Save button and then close WordPad. Now double-click on the newly created Ex1.rtf file to open it in Word. From the Tools menu, select the Data Conversion menu item.
  • For “Name:”, select “SIL IPA93 <> UNICODE”.
  • For “Scope of change”, select “A specific regular font:” and choose “SILDoulos IPA93” from the list.
  • For “Target Font”, check the “Apply specific Font” box and choose “Doulos SIL” from the list.
  • Click OK.
  • You should now see the IPA data converted and in the new font. However, some of the original text was in Times New Roman. We now want to convert it to the Doulos SIL font, so that the whole text is in the same font. Press Ctrl+Home to return to the top of the file.
  • Select Edit / Replace.
  • If necessary, click on More to display the Search Options, including Format, which will be needed in the next steps.
  • Make sure that the “Find what:” field is empty, then click Format, select Font..., and choose “Times New Roman”.
  • Now make sure the “Replace with:” field is empty, then click Format, select Font..., and choose “Doulos SIL”.
  • Click Replace All. Word displays a dialog box with the message “Word has completed its search of the document and has made 11 replacements.” Click OK to acknowledge this message, then close the dialog box.
  • You may notice that the data in the table no longer fits properly. Click in the table somewhere, then from the Table menu, select AutoFit / AutoFit to Contents.

Note

The exact sequence of commands may vary depending on your version of Word.

  • From the File menu, choose Save.
  • From the File menu, choose Exit.

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI). Read our Privacy Policy. Contact us here.