NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/UTTTestMap

Test Your Mapping

Joan Wardell, 2006-10-28

Goals for this Step

To learn how to test whether your file converted to Unicode correctly. Several options are presented.

This step is part of the procedure How to Write a Conversion Mapping for your Legacy Font.

This page was expanded to include testing Word documents, although converting Word documents is not part of this procedure. For those users who are advanced enough to use SILConverters, we hope you will find this section useful. We welcome further input and corrections.

The section "What do Mistakes Look Like?" can probably use considerable expansion. We welcome input on pitfalls not addressed and how to avoid them.

Test your Results

You will now want to verify your new file. To view your Unicode file, you must use an application that supports Unicode. Some suggestions are Microsoft Word, WordPad or NotePad.

Visual Check

The first and simplest test is a side-by-side visual test of the legacy .txt file and the Unicode .txt file.

  1. Open your legacy file in NotePad or WordPad and set the font to your legacy font.
  2. Choose a point size that is easy to view.
  3. Open the Unicode .txt file in NotePad or WordPad. Set the font to a Unicode font that contains the Unicode characters you are using. For Roman-script (ABC) text, you can use Doulos SIL or Gentium. For non-Roman text, you must choose a Unicode font that covers that script.
  4. Display these side-by-side and look for differences.

Verify with the Unicode Macros

If you are uncertain what you are seeing (whether a given text is legacy or Unicode), open the Unicode file in Word. You can then check some text with the Unicode Macros button to verify the actual codepoints. This step was described here: What Number is Your Character?

Verify with the Unicode BMP font

You can also do a general check using the Unicode BMP Fallback SIL font. This font displays a small box containing a hex number for each character, rather than the character itself. Select the text and change the font to Unicode BMP Fallback SIL.

Legacy data that was saved in Microsoft Word will always display hex numbers starting with F0. Legacy data that has not been saved in Word should display numbers only in the x0000-x00FF range. Unicode data will display with the corresponding Unicode hex numbers, starting with 0 (for Roman data). Note that for standard ABCs, this is also x0000-x00FF, so only a visual check will tell you whether your Unicode file is identical to a legacy file in the x0000-x00FF range.

A Unicode file displayed with a legacy font will display empty boxes for characters starting with F. (That is, the lower F0.. range, not the legal Unicode F9.. and above characters.)
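If you prefer a quick count over reading hex boxes, the same ranges can be tallied with a short script. This is only a sketch: the function name is made up for illustration, and treating "starting with F0" as the range U+F000-U+F0FF is an assumption based on the description above.

```python
from collections import Counter

def classify_codepoints(text):
    """Tally characters by the hex ranges discussed above."""
    counts = Counter()
    for ch in text:
        if ch in "\t\r\n":
            continue  # skip invisible control characters
        cp = ord(ch)
        if cp <= 0x00FF:
            counts["0000-00FF"] += 1      # plain legacy bytes or basic Roman Unicode
        elif 0xF000 <= cp <= 0xF0FF:
            counts["F000-F0FF"] += 1      # assumed range for legacy data saved in Word
        else:
            counts["other"] += 1          # anything else, e.g. IPA or non-Roman
    return counts

# Illustrative sample: three ASCII letters, two F0xx characters, one IPA letter.
sample = "abc\uf061\uf062\u0254"
print(dict(classify_codepoints(sample)))
```

A file that reports only "0000-00FF" still needs the visual check, because legacy and Unicode data can both fall in that range.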

Verify with Test File

The following file contains a single example of each of the 256 characters available in a legacy font (excluding the invisible control characters like tab or line-feed). Convert a copy of this file to Unicode using your mapping. Display the original file in your legacy font side-by-side with the new Unicode file you just created. They should be identical if your mapping is correct.

All Legacy Characters
Joan Wardell, 2005-10-17
Download "Chars.txt", Text document, 1KB
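If you cannot download the file, a similar test file can be generated. The sketch below assumes that "each character" means byte values 32 through 255 and lays them out 16 per row; the actual layout of Chars.txt may differ, and the function names are hypothetical.

```python
def legacy_rows(first=32, last=255, per_row=16):
    """Return the byte values first..last as rows of per_row bytes each."""
    values = bytes(range(first, last + 1))
    return [values[i:i + per_row] for i in range(0, len(values), per_row)]

def legacy_test_file():
    """Join the rows with CR LF line ends, ready to be written in binary mode."""
    return b"\r\n".join(legacy_rows()) + b"\r\n"

rows = legacy_rows()
print(len(rows), "rows,", sum(len(r) for r in rows), "characters")

# To save the file (the file name is only an example):
# with open("my_chars.txt", "wb") as f:
#     f.write(legacy_test_file())
```

The file is written as raw bytes so that no encoding is applied before your mapping sees the data.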


Verify Character Counts (What's in Your Unicode File?)

Compare the character counts of your legacy and Unicode text files.

Create a chart of the Unicode characters in your newly converted file by following the instructions in What's in Your Unicode File?.

Compare the results with the chart you created in What's in Your File?.
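As a rough cross-check alongside the two charts, the counts can also be tallied with a script. This is a minimal sketch, not the tool described in those pages; the sample data and function names are invented for illustration.

```python
from collections import Counter

def count_legacy(data: bytes) -> Counter:
    """Count each byte value in a legacy file, ignoring control characters."""
    return Counter(b for b in data if b >= 32)

def count_unicode(text: str) -> Counter:
    """Count each code point in a Unicode string, ignoring control characters."""
    return Counter(ch for ch in text if ord(ch) >= 32)

def chart(counts, fmt):
    """Return one 'code  count' line per character, in code order."""
    return [f"{fmt(key)}  {n}" for key, n in sorted(counts.items())]

# Hypothetical data: legacy byte d233 is assumed to map to U+0259.
legacy = b"a\xe9b a\xe9\r\n"
converted = "a\u0259b a\u0259\r\n"

legacy_counts = count_legacy(legacy)
unicode_counts = count_unicode(converted)
print("\n".join(chart(legacy_counts, lambda b: f"d{b}")))
print("\n".join(chart(unicode_counts, lambda ch: f"U+{ord(ch):04X}")))
print(sum(legacy_counts.values()), sum(unicode_counts.values()))
```

The totals match only when every legacy character maps to exactly one Unicode character. If your mapping deletes characters or maps one legacy character to several code points, expect the totals to differ by a predictable amount.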

Testing Word documents

Although not addressed in this procedure, if you have advanced to using SILConverters in Microsoft Word for converting legacy data in Word documents, here are some ways to test your results.

Compare Font Lists

Create a PDF of your Word document. Open it in Adobe Reader. Click File>Document Properties.... This will show you a list of the fonts used in creating this PDF.

However, if Microsoft Word has substituted a font for one that is not installed on your machine, the PDF font list won't tell you this. Compare the PDF font list to the Font Substitution list in Word. Open the original document in Word. Click Tools>Options.... Select the Compatibility tab. Click Font Substitution.... This will give a list of any fonts that were not available and what font was used to replace them.

Your goal is to create a new Word document that does no font substitution for legacy fonts.

Beware of Insert Symbol

Note

There is now a solution to this problem. Just use the Bulk Word document converter (which is now a part of this package: SILConverters 4.0). It does convert Insert / Symbol occurrences, and you should not have any of the problems listed in this section and the next.

Any data which was entered using Word's Insert / Symbol menu will not convert correctly using the Word macro in SILConverters 4.0. That is because any inserted character is not a "real" character. It is read as a left parenthesis and converted to whatever Unicode value is assigned for d40 (U+0028 or U+F028). Moreover, you cannot reliably search on these characters once they have been converted. See Encoding Conversion Frequently Asked Questions and Known Issues for more information. Keep in mind that if you save your Word document as an RTF file, it will remove all footnotes, headers, and footers. In addition, autonumbering will be unreliable. You may prefer to hand-edit all Insert / Symbol characters.

Hand editing Insert Symbol characters

Make sure you have an absolutely reliable paper copy of your document. You will find it easier to refer to it, once you are in the process of conversion.

The biggest difficulty is identifying which characters were entered with Insert / Symbol.

The Unicode Macros Show Unicode button will identify some Insert / Symbol characters as being "0028". It will show "0028" regardless of what the actual character is. You can then copy each unique one into Word's Find box and search on that. See Encoding Conversion Frequently Asked Questions and Known Issues for more information. (Question: I have a lot of autonumbering...)

See Encoding Conversion Frequently Asked Questions and Known Issues for more information on converting your character once you've identified it. (Question: When I try to convert...)

Remove Legacy Fonts Test

You can check that you have found every reference to a legacy font in a Word document by exiting Word, uninstalling the legacy font, and then opening your converted Word document. Check whether Word used Font Substitution for your legacy font. (See the procedure above.) If so, you haven't yet removed every reference to the legacy font in this file.

Sometimes you can notice these font substitutions by slight differences in size, typeface, or because they are substituted with completely wrong characters. Visually compare your converted document to the printout of the original.

Check Styles

Styles may contain references to legacy fonts which will show up in the Font Substitution or PDF font list even if you don't have a single character with that style. Remove legacy fonts from your styles and replace them with the Unicode font you are now using. Styles and Formatting... can sometimes help you find legacy data you missed. Click the down arrow next to a legacy style and choose "Select all # instances". Hand-edit these as needed.

Create an RTF

Save a copy of your converted Word document as type "RTF". Open the RTF document in an editor that can display the raw RTF code, such as a DOS editor or Notepad. Search for the word "SYMBOL". After some preliminary fontname references you can ignore, it will show you the decimal codepoint of all characters still left that were entered with "Insert Symbol".

Here is an example: SYMBOL 234 f "SILSophiaIPA"

This tells you that you still have a character d234 in the SILSophiaIPA font in your Word document that was entered using "Insert Symbol". You may also notice "insrsid13054641" in the code preceding this.

Use your original document and search on the codepoint shown (234 in this example) to find the location of your problem symbol. Delete this RTF file when you are done. Note that this procedure cannot check footnotes, headers, or footers.
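The search for "SYMBOL" can also be scripted. The sketch below is an assumption-laden illustration: the regular expression accepts the form shown in the example above with or without backslashes before the f, the sample RTF fragment is made up, and real RTF files may lay out these fields differently.

```python
import re

# Matches e.g.  SYMBOL 234 f "SILSophiaIPA"  and  SYMBOL 234 \\f "SILSophiaIPA"
SYMBOL_FIELD = re.compile(r'SYMBOL\s+(\d+)\s+\\*f\s+"([^"]+)"')

def find_symbols(rtf_text):
    """Return (decimal codepoint, font name) for each SYMBOL field found."""
    return [(int(code), font) for code, font in SYMBOL_FIELD.findall(rtf_text)]

# Illustrative fragment only; not taken from a real document.
sample = r'{\field{\*\fldinst SYMBOL 234 \\f "SILSophiaIPA" \\s 10}{\fldrslt\f3\fs20}}'
for code, font in find_symbols(sample):
    print(f"d{code} in font {font}")
```

Font name references near the top of the RTF file do not have this numeric form, so they are not reported.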

Other Suggestions

Some other suggestions on verifying data:

  • Do word counts (may be difficult for symbol data).
  • Try counting base character plus diacritics (may be difficult in Unicode).
  • Do alphabetical word lists. (Easy to identify spelling differences).
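Word counts and alphabetical word lists are easy to produce with a script. This is a minimal sketch with invented sample data; it assumes words are separated only by space, tab, or line ends, and it decodes the legacy bytes as Latin-1 purely so that each byte becomes one character.

```python
import re

def words(text):
    """Split on space, tab, CR and LF only, so legacy bytes such as d160 stay inside words."""
    return [w for w in re.split(r"[ \t\r\n]+", text) if w]

def alphabetical_word_list(text):
    """Return the unique words in code point order."""
    return sorted(set(words(text)))

# Hypothetical data: legacy byte d233 is assumed to map to U+0259.
legacy_text = b"ba\xe9 ab ba\xe9 cab".decode("latin-1")
unicode_text = "ba\u0259 ab ba\u0259 cab"

print(len(words(legacy_text)), len(words(unicode_text)))
print(alphabetical_word_list(legacy_text))
print(alphabetical_word_list(unicode_text))
```

The two lists will not necessarily sort in the same order, because the legacy and Unicode codes differ. Matching word counts and matching list lengths are the more reliable signals.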

What do mistakes look like?

  1. Square boxes. A square box will show up for any character that cannot be displayed in your new Unicode file. This is often due to choosing the wrong font when you display the file. Make sure you have assigned a Unicode font to the text. If so, check the text with the Unicode Macros or the Unicode BMP Fallback font to see what the hex values are. If they are "F0..", this is not Unicode data. Use your legacy font. If the Unicode BMP Fallback font displays the correct hex number, choose a different Unicode font.
  2. SimSun font or Font Substitution. Microsoft will substitute characters from another font if they are missing in the one you have chosen. This is commonly the SimSun font. Watch for this by noting changes in the typeface and other irregularities in a line of text which should be visually similar. Watch numbers and superscript numbers for irregularities, particularly if you are converting Word documents using SILConverters.
  3. Spaces. Don't skip any characters in your font just because they are invisible. You should map every slot from 32-255, unless the original font has a square box in that slot. Check this by looking at your "Your Font" chart. An empty slot in the chart probably indicates a space of some kind. Find out what any empty slots contain by checking your data files for examples, asking colleagues, or doing visual comparisons in a test file, such as shown below.
     Unicode   Test between x's   Character
     U+0020    x x x x x          spaces
     U+2003    x x x x x          em-spaces
     U+2009    x x x x x          thin-spaces

Testing Different Spaces
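A test file like this one can be generated rather than typed. The sketch below builds one line per space character with that character placed between the x's; the tab-separated layout and the function name are illustrative choices, not part of the original procedure.

```python
SPACES = [
    ("U+0020", "\u0020", "spaces"),
    ("U+2003", "\u2003", "em-spaces"),
    ("U+2009", "\u2009", "thin-spaces"),
]

def space_test_lines(count=5):
    """Return one line per space type: code, x's joined by that space, name."""
    lines = []
    for code, space, name in SPACES:
        sample = space.join("x" * count)   # e.g. x<space>x<space>x...
        lines.append(f"{code}\t{sample}\t{name}")
    return lines

for line in space_test_lines():
    print(line)

# To save as a Unicode text file (the file name is only an example):
# with open("space_test.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(space_test_lines()) + "\n")
```

Open the saved file in your Unicode font and compare the spacing with the empty slots of your legacy font.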

Note: How to Delete Legacy Characters

If your legacy font uses thin spaces for improving display (adding spaces between letters, etc.), this should now be handled in the Unicode font's smart code. When you write your mapping, thin spaces should be deleted rather than converted to U+2009. You can do this by leaving the right-hand side of the TECkit mapping for that character empty.
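The idea of deleting a character by giving it an empty right-hand side can be pictured with a small table. Note that this is Python, not TECkit mapping syntax, and every entry in the table is hypothetical (including the choice of d160 as the legacy thin space). It also shows a simple check for slots from 32-255 that have not been mapped yet.

```python
# Hypothetical fragment of a legacy-byte -> Unicode table (NOT TECkit syntax).
MAPPING = {
    0x20: "\u0020",   # space stays a space
    0x61: "a",
    0xE9: "\u0259",   # assumed legacy slot for schwa
    0xA0: "",         # assumed legacy thin space: empty right-hand side deletes it
}

def convert(data: bytes, mapping) -> str:
    """Convert legacy bytes; control characters pass through unchanged."""
    out = []
    for b in data:
        if b < 32:
            out.append(chr(b))        # tab, CR, LF and so on
        else:
            out.append(mapping[b])    # raises KeyError for an unmapped slot
    return "".join(out)

def unmapped_slots(mapping):
    """List the slots from 32-255 that the table does not cover yet."""
    return [b for b in range(32, 256) if b not in mapping]

print(repr(convert(b"a\xa0\xe9 a", MAPPING)))
print(len(unmapped_slots(MAPPING)), "slots still unmapped")
```

An empty string in the table removes the character from the output, while a missing entry is reported as an error, which keeps a forgotten slot from slipping through silently.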

The End

This ends the instructions for How to Write a Conversion Mapping for your Legacy Font.

Some additional basic information for writing TECkit mappings can be found in the tutorial that uses the IPA93 font: TECkit mapping language conversion. The tutorial gives specifics about issues with Encore2Unicode mappings that you may find useful.

TECkit documentation is included in the TECkit Documentation folder, in the location where you installed TECkit. It contains a full description of TECkit mapping syntax.

For working with Word documents and more advanced conversions, go to SILConverters.

Finally, the Utilities page contains numerous references to other conversion tools and methods.

Page History

2008-02-28 JW: reviewed, updated

2006-10-28 JW: created


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.