Computers & Writing Systems
Test Your Mapping
Goals for this Step
To learn how to test whether your file converted to Unicode correctly. Several options are presented.
This step is part of the procedure How to Write a Conversion Mapping for your Legacy Font.
This page was expanded to include testing Word documents, although converting Word documents is not part of this procedure. For those users who are advanced enough to use SILConverters, we hope you will find this section useful. We welcome further input and corrections. The section "What do Mistakes Look Like?" can probably use considerable expansion. We welcome input on pitfalls not addressed and how to avoid them.

Test your Results
You will now want to verify your new file. To view your Unicode file, you must use an application that supports Unicode. Some suggestions are Microsoft Word, WordPad or Notepad.

Visual Check
The first and simplest test is a side-by-side visual comparison of the legacy .txt file and the Unicode .txt file.
Verify with the Unicode Macros
If you are uncertain what you are seeing (whether a given text is legacy or Unicode), open the Unicode file in Word. You can then check some text with the Unicode Macros button.

Verify with the Unicode BMP font
You can also do a general check using the Unicode BMP Fallback font. This font displays a small box containing a hex number for each character, rather than the character itself. Select the text and change the font to Unicode BMP Fallback SIL.
Legacy data that was saved in Microsoft Word will always display hex numbers starting with F0.
Legacy data that has not been saved in Word should display numbers only in the x0000-x00FF range.
Unicode data will display the corresponding Unicode hex numbers, starting with 0 (for Roman data). Note that for standard ABCs, this is also x0000-x00FF. Only a visual check will tell you whether your Unicode file is identical to a legacy file in the x0000-x00FF range.
A Unicode file displayed with a legacy font will display empty boxes for characters starting with F. (That is, the lower F0.. range, not the legal Unicode F9.. and above characters.)

Verify with Test File
The following file contains a single example of each of the 256 characters available in a legacy font (excluding the invisible control characters like tab or line-feed). Convert a copy of this file to Unicode using your mapping. Display the original file in your legacy font side-by-side with the new Unicode file you just created. They should be identical if your mapping is correct.
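The range check described for the fallback font can also be approximated with a short script. The following is a minimal sketch in Python, not part of the SIL tools: the function names and label wording are hypothetical, and it assumes you have already read the file into a string with the right encoding. It counts how many characters fall into the x0000-x00FF range, the F0.. range that marks legacy data saved in Word, and everything else.

```python
from collections import Counter

# Labels are illustrative only; the ranges follow the fallback-font notes above.
LOW = "0000-00FF (legacy bytes, or Roman Unicode)"
WORD_LEGACY = "F000-F0FF (legacy data saved in Word)"
OTHER = "other Unicode"


def classify(ch):
    """Return the range label for a single character."""
    cp = ord(ch)
    if cp <= 0x00FF:
        return LOW
    if 0xF000 <= cp <= 0xF0FF:
        return WORD_LEGACY
    return OTHER


def range_counts(text):
    """Count characters per range, skipping tab, carriage return and line-feed."""
    return Counter(classify(ch) for ch in text if ch not in "\t\r\n")


if __name__ == "__main__":
    # Hypothetical sample: three Roman letters, one F0.. character, one IPA schwa.
    sample = "abc\uf028\u0259\n"
    for label, count in sorted(range_counts(sample).items()):
        print(f"{count:6d}  {label}")
```

A converted file that still reports characters in the F000-F0FF range probably contains legacy data the mapping did not reach. As the notes above say, a file that reports only 0000-00FF cannot be classified this way and still needs the visual check.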
Verify Character Counts (What's in Your Unicode File?)
Compare the character counts of your legacy and Unicode text files. Create a chart of the Unicode characters in your newly converted file by following What's in Your Unicode File?. Compare the results with the chart you created in What's in Your File?.

Testing Word documents
Although not addressed in this procedure, if you have advanced to using SILConverters in Microsoft Word for converting legacy data in Word documents, here are some ways to test your results.

Compare Font Lists
Create a PDF of your Word document. Open it in Adobe Reader and open the document properties (File > Properties in recent versions), then select the Fonts tab. This will show you a list of the fonts used in creating this PDF.
However, if Microsoft Word has substituted a font for one that is not installed on your machine, the PDF won't tell you this. Compare this list to the Font Substitution list in Word. Open the original document in Word. Click Tools > Options. Select the Compatibility tab. Click Font Substitution. This will give a list of any fonts that were not available and what font was used to replace them.
Your goal is to create a new Word document that does no font substitution for legacy fonts.

Beware of Insert Symbol
Note: There is now a solution to this problem. Just use the Bulk Word document converter (which is now a part of this package: SILConverters 4.0). It does convert Insert Symbol occurrences, and you should not have any of the problems listed in this section and the next.
Any data which was entered using Word's Insert Symbol menu will not convert correctly using the Word macro in SILConverters 4.0. That is because any inserted character is not a "real" character. It is read as a left parenthesis and converted to whatever Unicode value is assigned for d40 (U+0028 or U+F028). Moreover, you cannot reliably search on these characters once they have been converted. See Encoding Conversion Frequently Asked Questions and Known Issues for more information.
Keep in mind that if you save your Word document as an RTF file, it will remove all footnotes, headers, and footers. In addition, autonumbering will be unreliable. You may prefer to hand-edit all characters.

Hand editing Insert Symbol characters
Make sure you have an absolutely reliable paper copy of your document. You will find it easier to refer to it once you are in the process of conversion.
The biggest difficulty is identifying which characters were entered with Insert Symbol. The Unicode Macros button will identify some characters as being "0028". It will show "0028" regardless of what the actual character is. You can then copy each unique one into Word's Find box and search on that. See Encoding Conversion Frequently Asked Questions and Known Issues for more information. (Question: I have a lot of autonumbering...)
See Encoding Conversion Frequently Asked Questions and Known Issues for more information on converting your character once you've identified it. (Question: When I try to convert...)

Remove Legacy Fonts Test
You can check that you have found every reference to a legacy font in a Word document by exiting Word, uninstalling the legacy font, and then opening your converted Word document. Check whether Word used Font Substitution for your legacy font. (See the procedure above.) If so, you haven't yet removed every reference to the legacy font in this file. Sometimes you can notice these font substitutions by slight differences in size or typeface, or because the text is displayed with completely wrong characters. Visually compare your converted document to the printout of the original.

Check Styles
Styles may contain references to legacy fonts which will show up in the Font Substitution or PDF font list even if you don't have a single character with that style. Remove legacy fonts from your styles and replace them with the Unicode font you are now using.
The Styles and Formatting pane can sometimes help you find legacy data you missed. Click the down arrow next to a legacy style and choose "Select all # instances".
Hand-edit these as needed.

Create an RTF
Save a copy of your converted Word document as type "RTF". Open the RTF document in an editor that can display the raw RTF code, such as a DOS editor or Notepad. Search for the word "SYMBOL". After some preliminary fontname references you can ignore, it will show you the decimal codepoint of all characters still left that were entered with "Insert Symbol". Here is an example:
SYMBOL 234 f "SILSophiaIPA"
This tells you that you still have a character d234 in the SILSophiaIPA font in your Word document that was entered using "Insert Symbol". You may also notice "insrsid13054641" in the code preceding this. Use your original document and search on the codepoint shown (234 in this example) to find the location of your problem symbol. Delete this RTF file when you are done. Note that this procedure cannot check footnotes, headers, or footers.

Other Suggestions
Some other suggestions on verifying data:
What do mistakes look like?
Testing Different Spaces

Note: How to Delete Legacy Characters
If your legacy font uses thin spaces for improving display (adding spaces between letters, etc.), this should now be handled in the Unicode font's smart code. When you write your mapping, thin spaces should be deleted rather than converted to U+2009. You can do this by leaving the right-hand side of the TECkit mapping for that character empty.

The End
This ends the instructions for How to Write a Conversion Mapping for your Legacy Font.
Some additional basic information for writing TECkit mappings can be found in the tutorial using the IPA93 font TECkit mapping language conversion. The tutorial gives specifics about issues with Encore2Unicode mappings that you may find useful.
TECkit documentation is included in the TECkit Documentation folder where you installed TECkit. It contains a full description of TECkit mapping syntax.
For working with Word documents and more advanced conversions, go to SILConverters.
Finally, the Utilities page contains numerous references to other conversion means and methods.

Page History
2008-02-28 JW: reviewed, updated
2006-10-28 JW: created

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.