NRSI: Computers & Writing Systems
XSEM: XML Scripture Encoding Model
This page is out of date. It may still be useful for historical purposes. XSEM was the guiding concept that led to the development of the Open Scripture Information Standard (OSIS) and the OXES (Open XML for Editing Scripture) format.
When a piece of literature is translated from one language to another, something special takes place. When the translation is done accurately, ideas, knowledge, and even emotions are transferred to the receptor language. Concepts and ideas that were previously contained in only one language are now free to be explored by a reader in another language.
In a sense, the same holds true for computers when it comes to delivering content to a user. Different software applications have different ways of containing and expressing information. This makes things inherently difficult for content producers who want their material available in several different media. For example, a publisher of a medical journal may wish to have a volume that was published in book format also available on the World Wide Web. The source text of the book will need to be converted from the format used to produce the book into an electronic delivery format suitable for the Internet. There are several electronic delivery formats to choose from, and the one that is chosen becomes the "receptor" language. How much the two media, book and electronic, differ will determine how difficult it is to convert, or translate, this volume from one format to the other.
Different forms of content delivery mechanisms have different advantages and are tailored to different audiences. The obvious advantage of a book is that once it is produced, other than light, it requires no power to deliver its content. However, it is more difficult to perform a word search on a paper book than an HTML web page. For this reason it is beneficial for publishers to have their content published in as many content delivery forms as reasonable so that the widest audience can be reached. This means that there can be many copies of the content in as many delivery formats.
The problem comes when content needs to be changed or updated. For example, let us suppose that the author and publisher of the medical journal discover that a certain section in their material contains false information. This material would need to be updated quickly to assure quality patient care. There are basically two commonly used ways to update the material. One way would be to edit the content of each delivery format and cross-check to be sure that the content in each form is accurately represented. The chance of corrupting the content is high with this method.
A second way would be to edit just one of the delivery formats and then convert it, using special processes, into the other delivery formats. This method, once perfected, would be the most efficient, with little chance of error. However, the cost of a conversion system of this nature is high, and it would be a proprietary system. If the author ever needed to change to a different publisher, the expense of creating another proprietary system would be incurred.
What is needed is a foundational standard that encodes the content so that it can be transformed easily into additional delivery formats. By employing such a standard, publishers can far more easily update and deliver content, and also exchange content when necessary. A standard such as this would benefit everyone. Such a standard does exist, and it is called XML (Extensible Markup Language).
The XML standard was developed in the late 1990s by the World Wide Web Consortium, or W3C (http://www.w3.org/). The industry has begun the process of adopting this standard and it is amazing to see how widely it is being embraced. Software developers are implementing XML anywhere they need to transfer data or content between applications or systems.
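The single-source idea described above can be sketched in a few lines of Python. This is only an illustration: the element names (article, title, para), the sample text, and the two helper functions are invented for this example and do not come from any real publishing system. One XML source is held in memory and rendered into two delivery formats, so a correction made in the source reaches both outputs.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document; element names are invented for illustration.
SOURCE = """<article>
  <title>Dosage Guidelines</title>
  <para>Give 5 mg twice daily.</para>
  <para>Review the dose after two weeks.</para>
</article>"""


def to_html(root):
    """Render the source as a simple HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in root.findall("para")]
    return "\n".join(parts)


def to_text(root):
    """Render the same source as plain text with an underlined title."""
    title = root.findtext("title")
    lines = [title, "=" * len(title)]
    lines += [p.text for p in root.findall("para")]
    return "\n".join(lines)


root = ET.fromstring(SOURCE)
html_out = to_html(root)
text_out = to_text(root)
print(html_out)
print(text_out)
```

Editing a para element in SOURCE and re-running both functions updates the HTML and the plain-text output together, which is the property the second update method above relies on.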
XSEM is a markup standard that reflects the structure of a particular type of literature, Scripture. More properly, it is a data schema. A data schema defines the structure of a specific data set, in this case, the Bible. It allows Bible text to be encoded in a standard way regardless of the language the text may be written in. This benefits all by allowing a more standardized approach to translating and publishing Scripture.
There are different ways to express a data schema. Here, I will refer to the way a schema is expressed as its notation. There are several schema notations used by those who implement XML-based systems. The most common is one called Document Type Definition (DTD). DTD notation has been in use for quite some time. However, for XSEM, we chose to use the W3C XML Schema notation (http://www.w3.org/XML/Schema.html) for encoding data structure because it had some distinct advantages over the DTD notation, notably that documentation can be embedded directly in the schema.
So, what sets XSEM apart from other markup languages? XSEM has several distinctions. First, this standard revolves around one kind of literature, the Bible. Building processes around this focused standard will be simpler than building them around a more generalized markup language. Second, though there may be other schemas that allow markup of Scripture text, they typically rely on the Book-Chapter-Verse structure. This is good for referencing systems, but the developers of XSEM did not feel it would work well in a translation and publishing environment. XSEM uses a Book-Section-Paragraph structure, which better fits our view of Scripture in SIL.
We feel that XSEM will work well in our environment and we invite others to look at it. It may help solve some problems in the area of content management as it relates to Scripture. The XSEM source is available here:
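To make the Book-Section-Paragraph idea concrete, here is a minimal Python sketch. The element and attribute names (book, section, title, paragraph, verse, n) and the placeholder text are assumptions made up for this illustration; they are not taken from the XSEM schema itself. The point is only that sections and paragraphs form the containers, while verse numbers can be carried as empty marker elements inside the running text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these names are NOT the real XSEM element names.
SAMPLE = """<book id="MRK">
  <section>
    <title>Sample section heading</title>
    <paragraph><verse n="1.1"/>First verse text. <verse n="1.2"/>Second verse text.</paragraph>
    <paragraph><verse n="1.3"/>Third verse text.</paragraph>
  </section>
</book>"""

book = ET.fromstring(SAMPLE)

# Walk the document by section and paragraph, the units a translator or
# typesetter works with, and note which verse markers fall in each paragraph.
outline = []
for section in book.findall("section"):
    heading = section.findtext("title")
    for para in section.findall("paragraph"):
        refs = [v.get("n") for v in para.findall("verse")]
        text = "".join(para.itertext())
        outline.append((heading, refs, text))

for heading, refs, text in outline:
    print(heading, refs, text)
```

In this arrangement a paragraph can hold several verses, or part of one, without the verse boundaries dictating the document structure; a reference lookup can still be built by scanning for the verse markers.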
To see XSEM represented in DTD form you can download the following file:
In order to implement any kind of XML language, good documentation is a necessity. XSEM is somewhat unique compared to other projects in that its documentation is part of its source code. XML Schema allows documentation to be embedded inside its structure code. XSEM has taken advantage of this capability, so the task of producing accurate documentation has been greatly simplified.
Because XML Schema uses the XML standard to encode data structure, extracting the embedded documentation becomes a matter of applying an XSL (Extensible Stylesheet Language) process to the source, and from that comes documentation that is more readable by ordinary people. This is what we have done with XSEM.
We are unable to provide direct on-line viewing of the documentation at this time. However, below, we provide an archived version that you can download and view off-line. Just uncompress the archive on your local drive and open the documentation.html file in your browser. Choose the appropriate archive for your platform.
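As a rough illustration of how documentation embedded in a schema can be pulled out, the sketch below uses Python's standard ElementTree module rather than XSL, so it is a stand-in for the process described above and not the stylesheet used for XSEM. The schema fragment, its element names, and the extract_docs function are invented for the example, and the namespace shown is that of the final 2001 XML Schema Recommendation, which may differ from the draft namespace the XSEM source used.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2001 XML Schema Recommendation (an assumption for this sketch).
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment; not copied from the XSEM source.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="section">
    <xs:annotation>
      <xs:documentation>A titled group of paragraphs.</xs:documentation>
    </xs:annotation>
  </xs:element>
  <xs:element name="paragraph">
    <xs:annotation>
      <xs:documentation>A block of running text.</xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>"""


def extract_docs(schema_xml):
    """Map each documented element name to its embedded documentation text."""
    root = ET.fromstring(schema_xml)
    docs = {}
    for element in root.iter(XS + "element"):
        note = element.find(f"{XS}annotation/{XS}documentation")
        if note is not None and note.text:
            docs[element.get("name")] = note.text.strip()
    return docs


docs = extract_docs(SCHEMA)
for name, text in docs.items():
    print(f"{name}: {text}")
```

Because the documentation lives in xs:annotation elements next to each declaration, the same walk could just as well emit HTML, which is essentially what an XSL stylesheet would do.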
To help interested developers we are able to provide a limited number of XSL stylesheets to render XSEM. We hope the demonstration contained in this zipped archive will be helpful. It is, however, only usable on the Windows platform. Adaptations for other platforms have not yet been produced.
To get started, unzip the package on your local drive and look for the ReadMe.txt file. Please note that this file may refer to programs that have been superseded by newer versions.
The XSEM Project
The project proposal for XSEM was made in the fall of 1999. The project, proposed by Dennis Drescher, the project manager, was to be simply a DTD. However, after some research by Eric Albright, the author of XSEM, it was determined that using the XML Schema notation would prove to be the best way to capture both the structure of Biblical text and document it at the same time. One unique feature of XSEM is that it has no external documentation. All documentation is contained within the main schema file.
This project had a workgroup made up of six people. One of the first tasks for the workgroup was to identify all the textual elements found in Scripture needed for this encoding model. The text model the group used was the Good News Bible, Today's English Version, Second Edition, published by Thomas Nelson Publishers. The list of elements derived from the text model was then compared with existing element inventories used with the current markup system. It was determined that there were some missing elements. Even though these elements were not found in the text model, they were in use in other models and it was felt they needed to be included in this encoding model in order for it to be complete. At the end of this phase of the project there were over 125 elements identified. Jim Albright, Nathan Miles and John Carr worked hard to finalize the element list.
The next step was to catalog the elements and find examples of each one. If an element did not exist in the text model, it needed to be located in another one. Each element example was digitally scanned. These images were used in the documentation. At this point the list was also made available for review by a broader audience.
With a complete element inventory in hand, this information was turned over to Eric Albright, who was to be the actual author of XSEM. It was decided that only one person would be responsible for encoding the model and Eric was designated as that person. During the early stages of the encoding process Eric worked closely with Gary Simons, who was the markup consultant. As the project progressed, Eric's work was reviewed by a larger group, which helped to refine the model further. During the refining process it was determined that some of the elements originally identified were not necessary in an XML environment. The original list of elements was reduced to the current list, which numbers 111.
Before the pre-release at the Computer Technical Conference held at JAARS in Waxhaw, North Carolina in November 2000, Jim Albright began the task of marking up model text. Through no small effort on his part, he completed the task on time, which provided a good first test of the model. Also during this time, John Carr edited the XSEM documentation.
In May 2001 Eric completed work on XSEM and Level 1 was finalized. This web site and the available material are the result of all this effort. It is our hope that software developers who work in this area will find XSEM a standard that will enhance future software products and make the task of translating and publishing Scripture easier.