Short URL: https://scripts.sil.org/SFM2OSISConv
An Experiment in Converting Legacy Data to Unicode and XML
These instructions are out of date for current versions of OSIS, but they may still be useful as an overview of the process. For a USFM to Sword module conversion, see USFMtoSwordModule.
This is a quick example to give a general overview of the complete process of converting a legacy file to Unicode (UTF-8) and XML (we have chosen to use OSIS markup format). Learning about the various utilities will take time, but this should give you the big picture.
We’ll use a sample text from Ethiopia called BN1J.sfm. The sfm file must be normalized, as usual, before beginning this process. Part of normalization is finding out which SFMs are used in your data. You can download all the files specific to this project here (Bench scripture used by permission of the Bible Society of Ethiopia):
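As a rough aid for the normalization step, the sketch below lists which standard format markers occur in a file and how often. It assumes that a marker is a backslash followed by non-whitespace characters at the start of a line, and that reading the legacy file as latin-1 is acceptable because only the ASCII markers are inspected. The function names are illustrative and are not part of any SIL utility.

```python
import re
from collections import Counter

# Assumption: a marker is a backslash plus non-whitespace characters
# at the start of a line (for example \id, \c, \v).
MARKER_RE = re.compile(r"^\\(\S+)", re.MULTILINE)


def count_markers(text):
    """Return a Counter of the markers found at the start of lines."""
    return Counter(MARKER_RE.findall(text))


def count_markers_in_file(path, encoding="latin-1"):
    """Count the markers in a file on disk.

    latin-1 maps every byte to a character, so a legacy-encoded file can
    be read without decode errors; the markers themselves are ASCII.
    """
    with open(path, encoding=encoding) as handle:
        return count_markers(handle.read())


if __name__ == "__main__":
    # In-memory sample so the sketch runs without the project files.
    sample = "\\id 1JN\n\\c 1\n\\v 1 text\n\\v 2 more text\n"
    for marker, total in sorted(count_markers(sample).items()):
        print(f"\\{marker}\t{total}")
```

For the sample file in this walkthrough, `count_markers_in_file("BN1J.sfm")` would give the marker inventory to compare against your conversion configuration.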
For all other utilities referenced in this document, follow the links to install.
Converting the SFM legacy file to SFM Unicode
First, a mapping table for TECkit must be created. Normally, you’ll have to figure out the mapping to Unicode. Tutorials for learning about mapping tables are found under the heading “Encoding Conversion” at Unicode Transition Tutorials.
In this case we’ll use SILEthBench.map.
- Open SILEthBench.map in the “TECkit mapping Editor”.
- Click the compile command in the editor to create a compiled mapping of this file. The compiled mapping file is called SILEthBench.tec.
- Double-click on Bench.bat. Because we want to preserve the sfm markers, we will not be using the DropTec utility, which would convert the sfm markers to the same encoding as the text. Instead we run this batch file, which I’ve already created; it runs “sfconv”, a utility that comes with TECkit.
- This batch file uses a configuration file called BEN-map.xml. If you were to do this with your own data this file would need to be modified for your purposes. It calls different mapping files for different standard format markers. One of the mapping files it calls is SILEthBench.tec.
- You may want to open BN1J-utf8.txt in Microsoft Word to see what it looks like. When asked to choose an encoding, choose UTF-8. You will probably notice some square boxes. These are PUA (Private Use Area) characters. If you install the “SIL Abyssinica U” font (included in the file you downloaded), then select the whole document and apply that font, the square boxes should disappear.
- Compare with BN1J.SFM. The new file is in the UTF-8 form of Unicode. Notice that the SFM markers have been preserved.
- Close the file and do not save changes.
- This batch file also converts your new Unicode file back to the legacy encoding as test.sf. This lets you compare the original file (BN1J.SFM) with test.sf to see whether your conversion routine works properly. They should be identical. (Because I chose not to normalize BN1J.SFM before working with it, you will find the files are not identical.)
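Two of the checks described in the steps above can be approximated with a short script: locating Private Use Area characters in the UTF-8 output, and finding where the round-tripped file first differs from the original. This is a minimal sketch, not part of TECkit; the function names are illustrative, and only the Basic Multilingual Plane PUA range (U+E000 to U+F8FF) is checked.

```python
def find_pua(text):
    """Return (offset, code point) pairs for Private Use Area characters.

    Only the Basic Multilingual Plane range U+E000..U+F8FF is checked.
    """
    return [
        (offset, f"U+{ord(char):04X}")
        for offset, char in enumerate(text)
        if 0xE000 <= ord(char) <= 0xF8FF
    ]


def first_difference(original, round_tripped):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (left, right) in enumerate(zip(original, round_tripped)):
        if left != right:
            return offset
    if len(original) != len(round_tripped):
        # One input is a prefix of the other.
        return min(len(original), len(round_tripped))
    return None


if __name__ == "__main__":
    # In-memory samples so the sketch runs without the project files.
    print(find_pua("abc\ue000def"))
    print(first_difference(b"\\v 1 text", b"\\v 1 test"))
```

With the files from this walkthrough, you would read BN1J-utf8.txt with `encoding="utf-8"` and pass its contents to `find_pua`, and read BN1J.SFM and test.sf in binary mode (`"rb"`) and pass both byte strings to `first_difference`.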
Converting the Unicode file from sfm format to XML
Now we are ready to convert this new file to XML using the OSIS markup format.
We have to create a batch file to run the conversion. Read the SFM to OSIS Conversion Utility documentation. You will need to follow the installation instructions (including the installation of Python) before proceeding.
- Double-click BENRunOne.bat. (When using your own data you will need to edit this file and make changes.) The SFM to OSIS Conversion Utility documentation explains the process. This batch file uses a configuration file called conf.txt, which maps standard format markers to OSIS markup. If your data contains sfms that are not included in this file, you need to either edit conf.txt to reflect your sfms or modify your data to match conf.txt. I found that I had sfms in my file which were not in the conf.txt file, and I chose to modify my sfm file rather than conf.txt.
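Before running the conversion, it can help to know which markers in your data the configuration does not cover. The sketch below reports markers that appear in an SFM text but not in a set of known markers. It does not parse conf.txt, whose format is not described here; the `known` set is a hypothetical list that you would collect by hand from your own configuration.

```python
import re

# Assumption: a marker is a backslash plus non-whitespace characters
# at the start of a line.
MARKER_RE = re.compile(r"^\\(\S+)", re.MULTILINE)


def unmapped_markers(sfm_text, known_markers):
    """Return, sorted, the markers used in sfm_text but not in known_markers."""
    used = set(MARKER_RE.findall(sfm_text))
    return sorted(used - set(known_markers))


if __name__ == "__main__":
    # Hypothetical marker list, gathered by hand from the configuration.
    known = {"id", "c", "v", "p", "s"}
    sample = "\\id 1JN\n\\c 1\n\\ip intro\n\\v 1 text\n"
    print(unmapped_markers(sample, known))
```

Any marker reported this way needs either an entry in the configuration or a change in the data, which is the choice described above.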
If the conversion succeeded, you will now have a file in OSIS markup called Bible.BEN.1John.xml. If there were any errors, you will find them in the Bible.BEN.1John.log file.
If you want to see it in Internet Explorer you will need to:
- Create an XSL stylesheet (OSIS2HTML.1.2.1.BEN.xsl) based on the one in the SFM to OSIS Conversion Utility.
- The only change made to this stylesheet was to modify the font used.
- Put a link in the XML file (Bible.BEN.1John.xml) to the stylesheet. For example:
<?xml-stylesheet type="text/xsl" href="OSIS2HTML.1.2.1.BEN.xsl"?>
This should be the second line of the XML file.
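If you prefer not to edit the XML file by hand, the stylesheet link can be added with a few lines of Python. This is a minimal sketch that assumes the file begins with an XML declaration on its first line (and has no byte order mark in the text passed in); the function name is illustrative.

```python
def add_stylesheet_link(xml_text, href):
    """Insert an xml-stylesheet instruction after the XML declaration."""
    instruction = f'<?xml-stylesheet type="text/xsl" href="{href}"?>'
    if "<?xml-stylesheet" in xml_text:
        # Already linked to a stylesheet; leave the text unchanged.
        return xml_text
    if xml_text.startswith("<?xml"):
        # Place the instruction on its own line after the declaration.
        end = xml_text.index("?>") + 2
        return xml_text[:end] + "\n" + instruction + xml_text[end:]
    # No XML declaration: the instruction goes at the very top instead.
    return instruction + "\n" + xml_text


if __name__ == "__main__":
    sample = '<?xml version="1.0" encoding="UTF-8"?>\n<osis/>'
    print(add_stylesheet_link(sample, "OSIS2HTML.1.2.1.BEN.xsl"))
```

To apply it to Bible.BEN.1John.xml, read the file as UTF-8, pass the text through the function, and write the result back as UTF-8.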
As long as you have the “SIL Abyssinica U” font (found in the zip file you downloaded) installed on your computer, you should see 1 John in the Bench language of Ethiopia!
I found that the stylesheet OSIS2HTML.1.2.1.xsl did not handle introductory paragraphs and so stripped them out. I had to add some lines to the .xsl file. There will always be situations of this sort to handle (although I expect this one to be corrected). The lines I added would not take care of more complicated introductory sections. You will probably never be able to use the stylesheet straight “out of the box” without making some changes.
© 2003-2019 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI).