Computers & Writing Systems
Converting RTF to SFM using RTF2SFM Perl script
Introduction

This tutorial is based on Bob Hallissy's RTF2SFM program. RTF2SFM is a useful tool for converting a styled Word .RTF file to UTF-8-encoded SFM (Standard Format Marker). Unlike SFConv, it correctly handles Unicode characters.

Please read this first!

This tutorial was written for an older version of RTF2SFM, which required Perl 5.6. For Windows users, the latest version is available as a standalone Windows executable and no longer requires Perl to be installed first. Additionally, there is now more extended help (see the -h option), so you no longer have to look into the Perl source to understand the control file options. For those not on Windows, or who want the Perl source for other reasons, it now requires Perl 5.8 or newer.

Download and Install Perl 5.6.1

Download Perl

Install Perl 5.6.1, as this is the version required by the RTF2SFM version this tutorial describes.

Note: If you already had a “500” version installed (this one is a “600”), installing this will fail. You will need to uninstall the prior version and do some registry maintenance to get this to work. This is all documented in the HTML Release Notes. Also, the download file is rather large at over 8 MB. For downloading and installing Perl you will need at least 50 MB free on your hard drive.
This downloads the file ActivePerl-5.6.1.nnn-MSWin32-x86.msi, which you can install by double-clicking it in Windows Explorer.

Note: Perl 5.6.1 can also be found on the “CTC02 Resource Collection” CD.

Install Perl

Install Perl to c:\Perl.
Verify Perl Installed
Install RTF2SFM

Install

Download the RTF2SFM ZIP archive for the program (current version 0.7) to your hard drive and unzip it to a directory of your choice, such as: c:\Jobs\Paratext\RTF2SFM\SIL-RTF-0.07

Follow the website’s installation instructions as in the example below. This should create RTF2SFM.PLX and RTF2SFM.BAT in c:\Perl\bin, so that typing RTF2SFM at the Command Prompt actually runs the program.

Verify

At the Command Prompt, type RTF2SFM and press Enter. You should get the following message:

Converting RTF to USFM

Download and unzip sample files
Get list of styles used in the RTF file to convert to USFMs

Open the file to be converted in Word. 41MAT.RTF is provided for you as an example. Confirm which styles are used. You can do this in Word 2000 (9.0) by selecting , under List text box . Creating macros similar to those that follow is better, as they generate a list of used styles that can be copied later. These macros search to see which of the styles available to this document are actually used in it. They also check the list against the styles found in RTF2SFM.plx to see if you need to do anything else.

Sub AreRTFStylesDefinedInRTF2SFMConversionFile()
'
' AreRTFStylesDefinedInRTF2SFMConversionFile Macro
'
' Description:  Creates a list of styles found in an open RTF file.
'               Reports which listed style(s) is NOT found in a
'               designated RTF2SFM conversion file,
'
' Prerequisite: 1. The RTF file must be open in the Active Window and
'                  should already be saved.
'               2. Word thinks the RTF file has been modified, so make
'                  sure you DON'T SAVE the RTF file, just to be safe.

    ListStylesInDoc
    VerifyStylesInConversion
End Sub

Sub ListStylesInDoc()
'
' ListStylesInDoc Macro
'
' This macro will create a list of the styles used in a Word Document,
' It will display a maximum of 500 styles. It was created to analyze a
' scripture word document/rtf file. It will run on the Active Word
' Window and:
'
' 1. make sure window is set to "show all" so hidden styles will be processed
' 2. look through list of styles in the document
' 3. find which styles actually exist in the document
' 4. create a new document containing the list of styles used
' 5. restore "show all" setting for the original document
'    and NOT show all for the newly created List of Styles doc

    Dim ActiveDoc As String
    Dim CurrStyle As String
    Dim FoundStyle As Boolean
    Dim IgnoreStyleCnt As Integer
    Dim J As Integer
    Dim Msg As String
    Dim NameOfDoc As String
    Dim StyleCount As Integer
    Dim StyleList(500) As String
    Dim WasShowAll As Boolean
    Dim X As Long

    Msg = "This macro uses Find to locate styles used in this " + vbCr + _
          "RTF file, specifically Find / Format / Style. Using " + vbCr + _
          "this feature somehow makes Word think that a " + vbCr + _
          "change has occurred. Then, when you go to close " + vbCr + _
          "the document, Word asks if you want to save the " + vbCr + _
          "changes to the file." + vbCr + _
          vbCr + _
          "While the macro has made no actual changes to " + vbCr + _
          "the data, it is still safest NOT TO SAVE the RTF" + vbCr + _
          "file when closing it."
    MsgBox Msg, vbInformation, "WARNING"

    ' 1. make sure window is set to "show all" so hidden styles will be processed
    '-----------------------------------------------------------------------
    NameOfDoc = ActiveWindow.ActivePane.Document.Name
    If ActiveWindow.ActivePane.View.ShowAll Then
        ' save whether the document was set to Show All
        WasShowAll = True
    Else
        WasShowAll = False
        ActiveWindow.ActivePane.View.ShowAll = True
    End If

    ' 2. look through list of styles in the document
    '-----------------------------------------------------------------------
    StyleCount = 0
    X = ActiveDocument.Styles.Count     ' count number of styles available to this document
    For J = 1 To X                      ' for each available style: search to see if it is in this doc
        CurrStyle = ActiveDocument.Styles.Item(J).NameLocal     ' capture style name

        ' 3. find which styles actually exist in the document
        '-----------------------------------------------------------------------
        With Selection.Find
            .ClearFormatting            ' remove any previous formatting from match
            .Text = ""                  ' match any text with the Current Style
            .Replacement.Text = ""      ' no replace will take place
            .Forward = True             ' search from current location forward
            .Wrap = wdFindContinue      ' wrapping past the end to the start of the document
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
            .Format = False             ' special format options will be part of find
            .Style = CurrStyle          ' look for CurrStyle style name
        End With
        If Selection.Find.Execute Then  ' If CurrStyle found
            Select Case CurrStyle
                Case "Default Paragraph Font"   ' skip these Word ONLY Fonts
                Case "No List"                  ' this font is a category of fonts in Word 2003
                Case Else
                    StyleCount = StyleCount + 1 ' add it to a list of found styles
                    StyleList(StyleCount) = CurrStyle
            End Select
        End If
    Next J

    ' 4. create a new document containing the list of styles used
    '-----------------------------------------------------------------------
    Documents.Add
    For J = 1 To StyleCount
        Selection.TypeText Text:=StyleList(J)
        Selection.TypeParagraph
    Next J

    ' 5. restore "show all" setting for the original document
    '    and NOT show all for the newly created List of Styles doc
    '-----------------------------------------------------------------------
    ActiveWindow.ActivePane.View.ShowAll = False
    If Not WasShowAll Then
        Windows(NameOfDoc).ActivePane.View.ShowAll = False
    End If
    StatusBar = StyleCount & " style(s) used in the document"
End Sub

Sub VerifyStylesInConversion()
'
' VerifyStylesInConversion Macro
'
' Description: Checks if styles in a textual List of Styles doc are
'              defined in the conversion file used by the RTF2SFM
'              Perl script. This macro will:
'
' 1. Count the styles listed in the List of styles doc and move
'    cursor to first entry in the list
' 2. Locate the RTF2SFM configuration file to open
' 3. See if the opened file appears to be an RTF2SFM configuration file
' 4. Identify styles in the List of Styles document that are NOT defined
'    in the specified RTF2SFM conversion/configuration file.
' 5. When stylename not found update entry in List of Styles doc to say
'    NOT FOUND and format entry in BOLD RED
' 6. End of Process wrap up
'
' Prerequisite: 1. ListStylesInDoc which creates a list of styles used in
'                  a selected document. That document is the input to this
'               2. Either the RTF2SFM.PLX Perl script with self contained
'                  conversion values for Word styles to SFMs
'                  OR
'                  a hand made RTF2SFM.INI file where you specify the
'                  Word Styles and what SFMs to convert them to.

    Dim StyleCount As Integer
    Dim J As Integer
    Dim LSID As Integer
    Dim Msg As String
    Dim R2S As Integer
    Dim R2S_File As String
    Dim R2S_LookFor As String
    Dim StyleName As String
    Dim X As Long

    R2S_LookFor = "rtf2sfm.plx"     ' modify to request a specific starter filename

    ' 1. Count the styles listed in the List of styles doc and move
    '    cursor to first entry in the list
    '-----------------------------------------------------------------------
    LSID = ActiveWindow.Index
    X = ActiveDocument.Paragraphs.Count
    Selection.HomeKey Unit:=wdStory

    ' 2. Locate the RTF2SFM configuration file to open
    '-----------------------------------------------------------------------
    Msg = "In the OPEN dialog that follows locate" + vbCr + _
          "the RTF2SFM conversion file to use." + vbCr + vbCr + _
          "This is either the .INI file you created OR" + vbCr + _
          "is the default conversion contained in the " + vbCr + _
          "RTF2SFM.PLX Perl Script installed in the BIN" + vbCr + _
          "subdirectory where Perl was installed."
    MsgBox Msg, vbOKOnly + vbInformation, "RTF2SFM Conversion File"
    With Dialogs(wdDialogFileOpen)
        .Name = R2S_LookFor
        If .Display = 0 Then
            MsgBox "Open Canceled—Macro Terminated", vbExclamation
            End
        End If
        ' If no name is entered in the File Name text box,
        ' Word will open the file with gray background in
        ' the browser window.
        .Execute    ' opens the configuration file
        ' This is the full path of the file that was opened
        R2S_File = ActiveDocument.Path & Application.PathSeparator & ActiveDocument.Name
    End With
    R2S = ActiveWindow.Index

    ' 3. See if the opened file appears to be an RTF2SFM configuration file
    '-----------------------------------------------------------------------
    With Selection.Find
        .ClearFormatting
        .Forward = True
        .Wrap = wdFindContinue
        .Text = "[styles]"      ' a key element of configuration file
    End With
    If Not Selection.Find.Execute Then
        Msg = R2S_File + vbCr + vbCr + _
              "MISSING [styles] entry." + vbCr + vbCr + _
              "Does not appear to be an RTF2SFM configuration file!" + vbCr + vbCr + _
              "Close file and Rerun." + vbCr + _
              "Select a valid RTF2SFM configuration file."
        MsgBox Msg, vbCritical
        End
    End If

    ' 4. Identify styles in the List of Styles document that are NOT defined
    '    in the specified RTF2SFM conversion/configuration file.
    '-----------------------------------------------------------------------
    ' cursor position starts with the first style in the list
    For J = 1 To X - 1
        Windows(LSID).Activate
        ' if not the first time through move to next style in list
        If (J <> 1) Then
            Selection.MoveDown Unit:=wdLine, Count:=1
        End If
        ' capture the listed stylename and switch to the conversion file
        StyleName = Selection.Paragraphs(1).Range.Text
        StyleName = Left(StyleName, Len(StyleName) - 1) + "="
        Windows(R2S).Activate
        ' see if the stylename followed by an = sign exists in the conversion file
        With Selection.Find
            .ClearFormatting
            .Forward = True
            .Wrap = wdFindContinue
            .Text = StyleName
        End With

        ' 5. When stylename not found update entry in List of Styles doc to say
        '    NOT FOUND and format entry in BOLD RED
        '-----------------------------------------------------------------------
        If Not Selection.Find.Execute Then
            StyleCount = StyleCount + 1             ' count number of NOT FOUND styles
            Windows(LSID).Activate                  ' switch to List of Styles and
            Selection.EndKey Unit:=wdLine           ' modify the entry
            Selection.TypeText Text:=vbTab & "NOT FOUND"
            Selection.HomeKey Unit:=wdLine, Extend:=wdExtend
            With Selection.Font
                .Bold = True
                .ColorIndex = wdRed
            End With
        End If
    Next J

    ' 6. End of Process wrap up
    '-----------------------------------------------------------------------
    Windows(R2S).Close SaveChanges:=wdDoNotSaveChanges  ' close config file
    Windows(LSID).Activate                              ' switch to List of Styles
    ' if no styles found maybe this is not a valid RTF2SFM configuration file
    If StyleCount = X - 1 Then
        Msg = "NONE of the styles found in configuration file" + _
              vbCr + vbCr + R2S_File + vbCr + vbCr + _
              "Is this an RTF2SFM configuration file?"
        MsgBox Msg, vbCritical
    ' otherwise indicate how many styles weren't defined in the config file
    Else
        Msg = StyleCount & " style(s) NOT FOUND in configuration file" + _
              vbCr + vbCr + R2S_File
        MsgBox Msg, vbInformation
    End If
End Sub

Creating the AreRTFStylesDefinedInRTF2SFMConversionFile, ListStylesInDoc, and VerifyStylesInConversion macros

To create these macros you will need to open RTF2SFM_ChkStylesMacro.txt (part of the package you just downloaded) in Word and open the Visual Basic Editor. When done, the macros will be stored in Word’s Normal document template so they can be executed from within any Word document/RTF file.
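The check that VerifyStylesInConversion performs can also be sketched outside Word. The Python below is a minimal illustration and is not part of the RTF2SFM package: like the macro, it looks for each style name followed by an equals sign anywhere in the configuration text. The function name `styles_missing_from_config` and the sample strings are hypothetical, and the case-insensitive comparison is an assumption about how Word's Find behaves when MatchCase is off.

```python
def styles_missing_from_config(style_names, config_text):
    """Return the style names for which 'name=' does not occur in config_text.

    Mirrors the macro's approach: a plain substring search for the style
    name followed by '='. Case-insensitive matching is an assumption.
    """
    haystack = config_text.lower()
    missing = []
    for name in style_names:
        needle = name.strip().lower() + "="
        if needle not in haystack:
            missing.append(name)
    return missing


# Hypothetical sample data in the style of the configuration file.
sample_config = """[styles]
Bookname=h
Main Title=mt
Section Head=s
"""

sample_styles = ["Bookname", "Main Title", "Poetry Left"]
print(styles_missing_from_config(sample_styles, sample_config))
# prints ['Poetry Left']
```

Because this is a substring test, a name such as Reference is also satisfied by an entry for Footnote Reference. The macro's Find-based search, which does not match whole words only, has the same limitation.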
Note: This macro sets the window to show format codes so that it can process hidden text. If it was set to hide before the macro executed, it will be restored to hide when it is finished.

Getting the List of Styles found in the RTF document

Open the RTF file to be converted in Word and run the AreRTFStylesDefinedInRTF2SFMConversionFile macro (this macro also runs the other two macros you created).
This will first run the ListStylesInDoc macro and then the VerifyStylesInConversion macro.

You will get a dialog box titled WARNING. Read it and then click OK.

Next you will get another dialog box, titled RTF2SFM Conversion File. Read it and then click OK.

As the previous dialog box mentioned, the next dialog box asks for the location of rtf2sfm.plx, which you will find in C:\Perl\bin if you installed Perl as per the instructions. Find it and click Open.

Next you will get another dialog box. Read it. If any styles were found that are not listed in the configuration file, it will say so. Click OK.

Now you can look at the temporary Word document that was produced, which contains a list of the style names found in the RTF file, similar to the following (which was run on the sample file 41MAT.RTF):

Bookname
Chapter Number
Footnote Reference
Identification
Main Title
Paragraph
Poetry Left
Quote / Poetry
Quote 2
Reference
Section Head
Verse Num

If there were any styles not listed in the configuration file, they will be marked NOT FOUND here. Make a note of those. You may close the sample .rtf file, but remember not to save it.

Build the configuration file to convert the RTF to USFM

An .ini configuration file is used to map the Word style names to USFM codes. A sample can be found in RTF2SFM.PLX. [As per the Options section of Bob Hallissy’s RTF2SFM Perl script description on the SIL scripts web site mentioned at the beginning of this paper.] You can create an .ini file from RTF2SFM.PLX as follows:
You associate a style name in the .ini file with a USFM marker by simply adding an equal sign followed by the USFM marker. See the sample configuration file that follows.

Note: It is OK to leave all the unused styles in the .ini file.

Sample RTF2SFM.INI configuration file

; Note: The following data is assumed to be in UTF8!
[options]
usebarcodes=2
addfootnoteclosing=1
inlinefootnotes=1
[styles]
; stylename = <tag>,<type>,<marker>,<textbefore>,<textafter>,<endtag>
; see %sf for details note: <textafter> is ignored.
; following styles were in the sample book of Matthew but not in the RTF2SFM Sample
;Default Paragraph Font
RefSec=r
RefSecChp=r
SecChp Head=s
SecRef Head=s
SecRefChp Head=s
; the following were part of the sample .ini under __DATA__ at the end RTF2SFM.PLX
header=,i
footer=,i
Footnote reference=
Identification=id
Bookname=h
Main Title=mt
Secondary Title=st
Chapter Number=c,c
Section Head=s
Reference=r
Paragraph=p
Paragraph Cont=m
Verse Num=v,v
Quote / Poetry=q
Quote 2=q2
Quote 3=q3
Poetry Left=qm
Footnote Text=f,f,*f*,s+,,f*
Emphasized Word(s)=|emph,e

Running the Conversion

You will convert the RTF file(s) to USFM file(s) by running the conversion program from the Command Prompt.
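Before running the conversion, it can help to check how each `[styles]` entry in the configuration file breaks down. The sketch below is an illustration in Python, not RTF2SFM's own parser: it splits each value on commas into the six fields named in the sample's comment line (tag, type, marker, textbefore, textafter, endtag). The function name `parse_styles` and the handling of blank lines and `;` comment lines are assumptions.

```python
FIELDS = ("tag", "type", "marker", "textbefore", "textafter", "endtag")


def parse_styles(ini_text):
    """Parse the [styles] section into {stylename: {field: value}}."""
    styles = {}
    in_styles = False
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' comment lines
        if line.startswith("[") and line.endswith("]"):
            in_styles = line.lower() == "[styles]"
            continue
        if in_styles and "=" in line:
            name, _, value = line.partition("=")
            parts = value.split(",")
            # Pad so that every entry carries all six fields.
            parts += [""] * (len(FIELDS) - len(parts))
            styles[name.strip()] = dict(zip(FIELDS, parts))
    return styles


sample = """[options]
usebarcodes=2
[styles]
; stylename = <tag>,<type>,<marker>,<textbefore>,<textafter>,<endtag>
Chapter Number=c,c
Footnote Text=f,f,*f*,s+,,f*
"""

for name, fields in parse_styles(sample).items():
    print(name, fields)
```

For the Footnote Text entry this yields tag `f`, type `f`, marker `*f*`, textbefore `s+`, an empty textafter, and endtag `f*`, which makes a misplaced comma easy to spot.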
Note: You may have to wait 10 or 15 seconds before you see something happening.

Check residue

When converting an RTF file to SFM, the program may find information that it doesn't know what to do with. Sometimes this information is superfluous, e.g., when Microsoft adds extra information to the RTF file structure that isn't of interest. But the extra information may be from a style that wasn't identified in the configuration file. After converting an RTF file, you should always check the residue file (same name as the output SFM file except with extension .RES). If you see data in there that is part of your project and you wanted it in the SFM file, you may need to modify your configuration file and try again. One of the messages you will see in the test file is:

UNHANDLED dest: '*pn', '', '4'
end dest: '*pn', '4'

Just ignore these messages for now.

Annotations (comments and revision tracking)

When editing a document that will eventually be converted to SFM, it may be helpful in some project situations to use Word's comments capability or to turn on change tracking. RTF2SFM will extract such annotations (into a separate file) if you supply a file name by using the -a parameter.

RTF2SFM.PLX Tag Replacement Information

The PLX file header that follows gives some clues to the content of the replacements for the tag data on the right of the equal signs. Pay attention to the bolded text.
my %sf;
# $sf{<stylename>}{tag} is the standard format marker, e.g., c
# $sf{<stylename>}{type} is one of:
# $sf{<stylename>}{marker} is used if {type} eq 'f', and indicates the
#     placeholder mark to be in the text, e.g, '|fn' or perhaps '*f*'
# $sf{<stylename>}{textbefore} any initial text that should be stripped
#     Works on paragraph styles only
# $sf{<stylename>}{textafter} any final text that should be stripped
#     Not yet implemented ### TODO
# --------------------
# Bar code control:
my $useBarCodes;
# Whether to look for special SFConvert character styles Bar-i, Bar-b, and Bar-u
# and map them to |i, |b, and |u markers. Possible values:
#     0 or undef: do not map
#     1 do map, using |i...|r style
# --------------------
# Footnote control:
my $addFootNoteClosing;
# Whether to mark the end of footnotes using a capitalized marker, e.g., F or
# Possible values = 1 (true) or 0 (false)
my $inlineFootnotes;
# Whether footnotes are inline a la Paratext or a marker left in the text stream
# and footnotes output later. Possible values
#     0 Leave marker in the stream and output footnotes later
# --------------------
# Special destinations
# RTF destinations whose content we are not interested in:
#     %skipDest are things the parser doesn't need, so we skip it by forcing parser
#         to search for matching brace
#     %ignoreDest are things the parser needs, but we don't
my %skipDest = ( map { $_ => 1 } (qw(info *panose colortbl *pnseclvl *falt *ts
    *rsidtbl *generator header headerl headerr headerf footer footerl footerr
    footerf *ftnsep *ftnsepc *aftnsep *aftnsepc *template *bkmkstart *bkmkend
    *listtable *listoverridetable *revtbl *atnid *atnauthor *latentstyles)) );
my %ignoreDest = ( map { $_ => 1 } (qw(fonttbl stylesheet *cs)) ) ;
# --------------------
# Parser object
my $p;  # as passed in as first param to handler routines

Conclusion

You may have to do some cleanup if RTF2SFM includes some data not related to stylesheets.
This was necessary after several trial-and-error passes on scripture RTF code that came from the FolioViews info base. It included some superfluous reference data, but eventually it was all discovered and a CC table was written that was able to clean most of it out of the data. All this to say that RTF2SFM is a tool that can do a pretty good job of converting RTF code to SFM format when the RTF file is not generated by SF Converter or by the PNG Scripture template with embedded SFMs.

Note: This conversion is to the UTF-8 encoding of Unicode. All keyboard characters (character codes under 128) are already UTF-8 compliant. Upper ASCII characters like acute-a “á” (character code 225) will be converted to UTF-8 format, which takes multiple bytes. To get converted SFM files into Paratext, you must specify 65001 UNICODE (UTF-8) as the encoding for the project and then import the SFM files.
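The note about byte lengths can be checked directly. This short Python sketch (an illustration only, separate from RTF2SFM; the helper name `utf8_bytes` is hypothetical) encodes an ASCII letter and acute-a as UTF-8 and prints the character code, the byte count, and the bytes in hexadecimal.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


for ch in ("a", "\u00e1"):  # "a" (code 97) and acute-a (code 225)
    data = utf8_bytes(ch)
    print(ord(ch), len(data), data.hex())
# prints:
# 97 1 61
# 225 2 c3a1
```

Codes under 128 come out as a single byte identical to ASCII, while code 225 becomes the two bytes C3 A1.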
© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.