Computers & Writing Systems
An Experiment in Converting Legacy Data to Unicode and XML
These instructions are out of date for current versions of OSIS, though they may still be useful as an overview of the process. A USFM to Sword module page is available here: USFMtoSwordModule.
This quick example gives a general overview of the complete process of converting a legacy file to Unicode (UTF-8) and XML, using the OSIS markup format. Learning the various utilities will take time, but this should give you the big picture.
We’ll use a sample text from Ethiopia called BN1J.sfm. Before beginning this process, normalize the SFM file as you usually would. Part of normalization is finding out which SFMs (standard format markers) are used in your data. You can download all the files specific to this project here (Bench scripture used by permission of the Bible Society of Ethiopia):
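As a rough illustration of the marker inventory step, here is a minimal Python sketch that counts the markers in a piece of SFM text. The function name `count_markers` and the sample text are invented for this example (the sample is not taken from BN1J.sfm), and the sketch assumes a marker is simply a backslash followed by a run of non-space characters.

```python
import re
from collections import Counter

# Assumption: a standard format marker is a backslash followed by
# one or more characters that are neither whitespace nor a backslash.
MARKER_RE = re.compile(r"\\([^\s\\]+)")

def count_markers(sfm_text):
    """Return a Counter mapping each marker name (without the backslash) to its frequency."""
    return Counter(MARKER_RE.findall(sfm_text))

# Invented sample text, for illustration only.
sample = "\\id 1JN\n\\c 1\n\\p\n\\v 1 text\n\\v 2 more text\n"

for marker, n in sorted(count_markers(sample).items()):
    print(f"\\{marker}\t{n}")
```

For a real legacy file, one option is to read it with `open(path, encoding="latin-1")`, since that codec accepts every byte value and leaves the ASCII markers intact.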
For all other utilities referenced in this document, follow the links to install.
Converting the SFM legacy file to SFM Unicode
First, a mapping table for TECkit must be created. Normally, you’ll have to figure out the mapping to Unicode. Tutorials for learning about mapping tables are found under the heading “Encoding Conversion” at:
In this case we’ll use SILEthBench.map.
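To give a feel for what a mapping table does, the following Python sketch converts legacy bytes to Unicode with a simple one-to-one lookup. The table contents are hypothetical and are not the mappings in SILEthBench.map, and the function name `convert_legacy` is invented for this example. A real TECkit mapping can also express contextual and multi-byte rules, which this sketch does not attempt.

```python
# Hypothetical byte-to-Unicode table, for illustration only.
# These are NOT the actual mappings found in SILEthBench.map.
LEGACY_TO_UNICODE = {
    0x80: "\u1200",  # ETHIOPIC SYLLABLE HA
    0x81: "\u1208",  # ETHIOPIC SYLLABLE LA
}

def convert_legacy(data: bytes) -> str:
    """Map each legacy byte to a Unicode string; ASCII bytes pass through unchanged."""
    out = []
    for b in data:
        if b in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[b])
        elif b < 0x80:
            out.append(chr(b))  # SFM markers and digits are plain ASCII
        else:
            raise ValueError(f"no mapping for byte 0x{b:02X}")
    return "".join(out)

legacy_line = b"\\v 1 \x80\x81\n"          # invented legacy-encoded verse line
unicode_line = convert_legacy(legacy_line)  # now a Unicode string
utf8_bytes = unicode_line.encode("utf-8")   # the form that would be written to disk
print(unicode_line, end="")
```

Raising an error on an unmapped byte, rather than silently passing it through, makes gaps in the table visible early.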
Converting the Unicode file from sfm format to XML
Now we are ready to convert this new file to XML using the OSIS markup format.
We have to create a batch file to run the conversion. Read the SFM to OSIS Conversion Utility documentation first; you will need to follow its installation instructions (including the installation of Python) before proceeding.
If the conversion succeeded, you will now have a file in OSIS markup called Bible.BEN.1John.xml. Any errors are written to the Bible.BEN.1John.log file.
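A quick sanity check on the output is to confirm that it parses as well-formed XML and to see which elements it contains. The sketch below uses Python's standard `xml.etree.ElementTree` module; the function name `summarize_xml` and the sample document are invented for this example and do not reflect the actual contents of Bible.BEN.1John.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def summarize_xml(xml_data):
    """Parse XML (str or bytes) and count elements by local tag name.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(xml_data)
    counts = Counter()
    for elem in root.iter():
        # ElementTree writes namespaced tags as "{namespace}name"; keep only the name.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts

# Invented miniature document, for illustration only.
sample = (
    '<osis xmlns="urn:example"><div>'
    '<verse osisID="1John.1.1">a</verse>'
    '<verse osisID="1John.1.2">b</verse>'
    '</div></osis>'
)

for name, n in sorted(summarize_xml(sample).items()):
    print(name, n)
```

To check the real file, the bytes returned by `pathlib.Path("Bible.BEN.1John.xml").read_bytes()` could be passed to the same function.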
If you want to see it in Internet Explorer, you will need to add the following stylesheet reference to the XML file:
<?xml-stylesheet type="text/xsl" href="OSIS2HTML.1.2.1.BEN.xsl"?>
This should be the second line of the file.
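If you would rather not edit the file by hand, the insertion can be scripted. This is a minimal sketch, assuming the XML declaration sits alone on the first line of the file; the function name `add_stylesheet_line` and the sample document are invented for this example.

```python
STYLESHEET_PI = '<?xml-stylesheet type="text/xsl" href="OSIS2HTML.1.2.1.BEN.xsl"?>'

def add_stylesheet_line(xml_text, pi=STYLESHEET_PI):
    """Insert the stylesheet processing instruction as the second line of xml_text."""
    lines = xml_text.splitlines()
    # Assumption: line 1 is the XML declaration, on a line of its own.
    if not lines or not lines[0].startswith("<?xml "):
        raise ValueError("expected an XML declaration on the first line")
    if len(lines) > 1 and lines[1].strip() == pi:
        return xml_text  # already present; leave the text untouched
    lines.insert(1, pi)
    return "\n".join(lines) + "\n"

# Invented miniature document, for illustration only.
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<osis>\n</osis>\n'
print(add_stylesheet_line(sample), end="")
```

Because the `href` is a relative reference, the browser looks for the stylesheet in the same folder as the XML file.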
As long as you have the “SIL Abyssinica U” font (found in the zip file you downloaded) installed on your computer, you should see 1 John in the Bench language of Ethiopia!
I found that the stylesheet OSIS2HTML.1.2.1.xsl did not handle introductory paragraphs and stripped them out, so I had to add some lines to the .xsl file. There will always be situations of this sort to handle (although I expect this one to be corrected). The lines I added would not handle more complicated introductory sections. You will probably never be able to use the stylesheet straight “out of the box” with no changes.
© 2003-2023 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.