Computers & Writing Systems
What's in Your File?
Goals for this Step

Before converting your data to Unicode, it may be useful to find out what's in your file. By the end of these instructions you should have an accurate count of the codepoints that occur in your legacy plain text, Word document, or RTF file.

The count identifies the characters that are actually used in your file. These characters must be accurately identified in order to map them to Unicode correctly. Other characters in the font (those not used in your documents) may be ignored if necessary.

This step is part of the procedure How to Write a Conversion Mapping for your Legacy Font.

Quick Check

You can do an easy check of your file by opening it in Microsoft Word.

Does it display correctly? If so, Save As a different name, and set the file type to "Plain Text".

When the File Conversion window comes up, choose the radio button "Other Encoding" and select "Unicode" from the list. Can you view all the data correctly? Data that turns into square boxes is legacy data, which will likely need conversion.

Close and then open the new file. Does it display correctly? If so, your data has already been correctly converted to Unicode. This process is built in to Microsoft Word. The File Conversion window may also come up when you first open the file.

You may delete the plain text file just created.

Preparing the file

Now we want to prepare the legacy file. First you need to get just your legacy data into a plain text file (with the ".txt" extension).

Note: This is not a completely automated procedure. You must be able to identify and select the data you wish to count. In addition, the RTF procedure is moderately to highly difficult because of the software installations it requires.
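The Quick Check above relies on Microsoft Word. As a rough alternative for a prepared .txt file, the sketch below guesses from the raw bytes whether the file is already Unicode. It is not part of this procedure: the function names `classify_text` and `classify_file` are hypothetical, and the heuristic (look for a byte-order mark, then try ASCII, then UTF-8) is an assumption that can misjudge short or unusual files.

```python
import codecs
from pathlib import Path

# Byte-order marks that signal a Unicode text file (assumed list; UTF-32 omitted).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def classify_text(data):
    """Guess whether raw bytes are ASCII, Unicode, or 8-bit legacy data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return f"unicode ({name})"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Bytes above 127 that form valid UTF-8 sequences are very likely Unicode.
        data.decode("utf-8")
        return "unicode (utf-8)"
    except UnicodeDecodeError:
        # Bytes above 127 that are not valid UTF-8: probably a legacy 8-bit encoding.
        return "legacy 8-bit"


def classify_file(path):
    """Apply classify_text to the contents of a file."""
    return classify_text(Path(path).read_bytes())
```

A result of "ascii" does not prove there is nothing to convert: a legacy font may reuse ASCII codepoints for other glyphs, so the character count described below is still worth running.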
Counting the characters

Once you have your file in plain text format (with extension ".txt"), you can count the characters in your file using either the Perl Character Count utility or the PadTools Character Count program.

Counting the characters with Perl

If you have Perl installed, use Perl Character Count. This is a much easier process than the one below and will save you hours of work. Then skip to Verifying the Results below.

Counting the characters with DOS (Installation)

Before beginning you will need a copy of the old Windows Character Count program:
To install, double-click the .exe file, then click through the two installer prompts that follow. All the files will be placed in a subdirectory of Program Files called PadTools. The programs do not run independently of the other files included in the download, so keep them all together.

Note: You may wish to create a shortcut to the Character Count (wccount.exe) program. To do this, see Microsoft's How to Create a Shortcut on the Desktop.

Counting the Characters with DOS

Double-click the program name: C:\Program Files\PadTools\wccount.exe.

Note: You can also run the program by hand.

This program will present a small window and ask for your filename.

(Figure: Open File window)

Browse to your .txt file. You may need to change the drive and folder name.

Note: Wccount is an older program that does not recognize long filenames. It will work with them, but it shortens them using a tilde and a numeric value, so that they are at most 8 characters with a 3-character extension.

As soon as you identify the file, Character Count counts all the codepoints and displays the results, based on the standard Latin set. Choose the number of columns you would like displayed (1-4). It may be easiest to work with 1.

(Figure: Character Counts)

Note: Character Count only counts the codepoints in your file. No information is available in a plain text file except the codepoint numbers. The report will not show you your legacy characters, but rather the standard Latin characters.

Save the results by clicking the button that copies them to the clipboard.

Run Microsoft Word or another editor. Paste the data you're holding in the clipboard from Character Count into a new document.

(Figure: Raw Counts)

Note: If Copy & Paste fails, or it pastes the wrong data, press the button on the clipboard in Microsoft Word. Then retry the copy and paste.

Close the Character Count program. If an icon for Character Count stays in your tray, right-click it and close it again.

Rearrange as you prefer. Microsoft Word is a good editor for this file because it allows you to cut and paste columns.
Select All and set the font to a monospace font, like Courier New. This should help make the columns line up. Then press and hold the Left-Alt key while selecting a column with the mouse. Cut and paste where desired. You may find it easier to set Number of Columns to 1 instead of 4 before copying to the clipboard.

Here is a text file (rawcount.txt) showing each decimal number and its corresponding standard value. You can use this as a starting point or cut and paste from it. Your document will probably use fewer than half of these characters. Cut and paste each character's count into the last column of the rawcount file.

The first column gives the decimal value of the character that will need to be mapped in your TECkit mapping. Determining this set of characters is the primary goal of doing a character count.
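The same kind of three-column table (decimal value, standard character, count) can also be produced directly from the bytes of the file. The following is a minimal Python sketch, not the Perl or PadTools utility described on this page. The function names `count_codepoints` and `format_report` are hypothetical, and the sketch assumes a legacy file with one codepoint per byte and uses Latin-1 as a stand-in for the "standard Latin set".

```python
import sys
from collections import Counter
from pathlib import Path


def count_codepoints(path):
    """Return a Counter mapping each byte value (0-255) to its frequency."""
    # Read raw bytes: an 8-bit legacy file is assumed to hold one codepoint per byte.
    return Counter(Path(path).read_bytes())


def format_report(counts):
    """Build report lines: decimal value, Latin-1 character, count."""
    lines = []
    for value in sorted(counts):
        # Show the Latin-1 character for printable values; leave control codes blank.
        printable = 32 <= value < 127 or value >= 160
        char = bytes([value]).decode("latin-1") if printable else ""
        lines.append(f"{value:3d}  {char:1}  {counts[value]}")
    return lines


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical script name): python charcount.py legacy.txt
    for line in format_report(count_codepoints(sys.argv[1])):
        print(line)
```

Only codepoints that actually occur in the file appear in the report, so the output is already limited to the characters that need a mapping.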
Here is a sample of hand-edited results:

(Figure: Hand edited results)

On the left, you have a set of decimal numbers for a standard font. These are only the ones we are concerned with in conversion. The second column shows the standard character found at each codepoint. The last column, which you have added, should show the number of times this codepoint was used in your legacy file. This is the set of characters which must have a mapping in your TECkit mapping file, to be created in a later step.

If only a small set of characters is used, you may find it easier to copy and paste from rawcount.txt into your character count results file.

Verifying the Results

You can check that the counts are accurate by doing the following:
Uses for Character Counts

You may find running a character count useful for various reasons. In this How to Write a Conversion Mapping for your Legacy Font procedure, there are two uses for running a character count of your legacy file.

The first is to get an idea of how many of the 256 characters in your legacy font are actually used. Many legacy fonts were developed from standard ASCII or ANSI encoded fonts. A small number of changes may have been made, leaving most of the font unused. Counting the characters will quickly show you which characters are regularly being used. When you go to write the mapping, these are the characters that require a mapping. Others can be ignored. Just make sure your Character Count is done with enough test data that you get all the characters.

The second is to provide a means of verifying that your mapping worked properly. This is done in step 10, Test Your Mapping. For a simple one-to-one mapping, the counts before and after conversion should be identical; a more complex mapping may change them.

Page History
2008-02-22 JW: reviewed, minor updates
2005-06-24 JW: Page created

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.