NRSI: Computers & Writing Systems

Short URL: http://scripts.sil.org/UTTCCount

What's in Your File?

How to do a Character Count of a Legacy File

Joan Wardell, 2005-06-24

Goals for this Step

Before converting your data to Unicode, it helps to find out what is in your file. By the end of these instructions you should have an accurate count of the codepoints that occur in your legacy plain text file, Word document, or RTF file. The count identifies the characters that are actually used in your file. These characters must be accurately identified in order to map them to Unicode correctly. Other characters in the font (those not used in your documents) may be ignored if necessary.

This step is part of the procedure How to Write a Conversion Mapping for your Legacy Font.

Quick Check

You can do an easy check of your file by opening it in Microsoft Word. Does it display correctly? If so, choose Save As, give the file a different name, and set the file type to "Plain Text". When the File Conversion window comes up, choose the radio button "Other Encoding" and select "Unicode" from the list. Can you view all the data correctly? Data that turns into square boxes is legacy data, which will likely need conversion. Close the new file and then open it again. Does it display correctly? If so, your data has already been correctly converted to Unicode. This conversion is built into Microsoft Word.

The File Conversion window may also come up when you first open the file.

You may delete the plain text file just created.
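
The quick check can also be approximated outside Word. The following is a minimal sketch and not part of the original procedure. It assumes that a file saved as Unicode begins with a byte order mark, or at least decodes cleanly as UTF-8; the function name `guess_encoding` and the labels it returns are hypothetical.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Return a rough, heuristic label for the encoding of some raw bytes."""
    # A byte order mark at the start is the strongest hint of Unicode text.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16 (with BOM)"
    # Only 7-bit values: plain ASCII, or legacy data that reuses ASCII codepoints.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Bytes above 127 that form valid UTF-8 sequences are probably Unicode.
    try:
        data.decode("utf-8")
        return "utf-8 (no BOM)"
    except UnicodeDecodeError:
        return "8-bit legacy"

if __name__ == "__main__":
    print(guess_encoding(b"plain text"))
    print(guess_encoding(b"\xe9\xe8"))
```

A result of "ascii" does not prove that the data is free of legacy characters, because a legacy font can display its own glyphs at ordinary Latin codepoints. Treat the label as a hint and confirm by viewing the file.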

Preparing the file

Now we want to prepare the legacy file. First you need to get just your legacy data into a plain text file (with the ".txt" format).

Note

This is not a completely automated procedure. You must be able to identify and select the data you wish to count. In addition, the RTF procedure is moderately to very difficult because of the software installations it requires.

  1. Text file: If your entire file is text (it normally has the extension ".txt") and contains only data from a single legacy font, you can skip this section. If your data is mixed, edit as needed and save under a different name.
  2. Microsoft Word document: If your file is a Word document (with the extension ".doc"):
    • Open the file in Word.
    • Select only the legacy data. This will normally be all the data which uses a single IPA font, for example.
    • Copy & Paste to a new blank WordPad document. (Do not substitute another program.)
    • Clean up as necessary, so that only data from a single legacy font is in the new document.
    • Save the WordPad document as "text document" with the extension ".txt" chosen.
  3. RTF document: If your file is an RTF document (with the extension ".rtf"):
    • Open the file in Word.
    • Select only the legacy data. This will normally be all the data which uses a single IPA font, for example.
    • Copy & Paste to a new blank Word document.
    • Clean up as necessary, so that only data from a single legacy font is in the new document.
    • Convert the data in this new file out of PUA and back to legacy by this method:
      • Click Tools|Data Conversion. If you do not have this menu item, see SILConverters 4.0 to install. This is a moderate to difficult installation. You may need on-site assistance.
      • Select SymboltoPUA as the converter. If you haven't previously installed this, see Mapping Files to download this converter. Select and Add New the converter named SymboltoPUA. Choose "TECkit map" from the list and browse to the SymboltoPUA.tec file you've just downloaded. Add it to the System Repository and click Apply. This is a Legacy<>Unicode Encoding Conversion.
      • Checkmark the Reverse direction of conversion table option.
      • Select the radio button for Whole Document.
      • Go to Apply specific font: and choose "Times New Roman" or something similar. You just need a font with the standard Latin characters.
      • Click OK to run the conversion. All data should now be in the new font and in standard Latin characters. If the conversion is incomplete or looks odd, you may need to reopen the original RTF file and re-try the conversion.
      • Once the conversion is right, click File | Save As. Choose the File Type txt. Give the file a new name. Word will add the extension ".txt".

Counting the characters

Once you have your file in plain text format (with the extension ".txt"), you can count the characters in it using either the Perl Character Count utility or the PadTools Character Count program.

Counting the characters with Perl

If you have Perl installed, use Perl Character Count. This is much easier than the process below and will save you hours of work. Then skip to Verifying the Results below.
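
To show what a scripted count involves, here is a minimal Python sketch. It is not the Perl Character Count utility and does not reproduce its output. It only illustrates the idea of tallying every codepoint (byte value 0-255) in a legacy plain text file; the function names are hypothetical.

```python
from collections import Counter

def count_codepoints(data: bytes) -> Counter:
    """Count how often each byte value (0-255) occurs in the data."""
    # Iterating over a bytes object yields integers, one per byte.
    return Counter(data)

def count_file(path: str) -> Counter:
    """Count the byte values in a file, read in binary mode so nothing is decoded."""
    with open(path, "rb") as handle:
        return count_codepoints(handle.read())

if __name__ == "__main__":
    counts = count_codepoints(b"banana")
    for value in sorted(counts):
        print(value, counts[value])
```

Reading in binary mode matters here: a legacy file stores only codepoint numbers, so decoding it with some assumed encoding could change or reject bytes before they are counted.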

Counting the characters with DOS (Installation)

Before beginning you will need a copy of the old Windows Character Count program:

Character and Diacritic Count programs
Joan Wardell, 2005-06-24
Download "PADTools.exe", Windows application, 467KB [2515 downloads]


To install, double-click the .exe file. Click OK. Click Start. All the files will be placed in a subdirectory of Program Files called PadTools. The programs do not run independently of the other files included in the download, so keep them all together.

Note

You may wish to create a shortcut to the Character Count (wccount.exe) program. To do this, see Microsoft's How to Create a Shortcut on the Desktop.

Counting the Characters with DOS

Double-click the program name: C:\Program Files\PadTools\wccount.exe.

Note

You can also run the program by hand:

Click Start menu | Run.
To run the program, type the program name: C:\Program Files\PadTools\wccount.exe. Windows may offer you a list of files as you type and you can select the filename above.

This program will present a small window and ask for your filename.

Open File window



Browse to your .txt file. You may need to change the drive and folder name.

Note

Wccount is an older program that does not recognize long filenames. It will work with them, but it displays them in shortened form, using a tilde and a number, so that each name has at most 8 characters plus a 3-character extension.

As soon as you identify the file, Character Count counts all the codepoints and displays the results, based on the standard Latin set.

Choose the number of columns you would like displayed (1-4). It may be easiest to work with 1.

Character Counts



Note

Character Count is only counting the codepoints in your file. No information is available in a plain text file except the codepoint numbers. The report will not show you your legacy characters, but rather the standard Latin characters.

Save the results by clicking the Copy to Clipboard button.

Run Microsoft Word or another editor. Paste the data you're holding in the clipboard from Character Count into a new document.

Raw Counts



Note

If Copy & Paste fails, or it pastes the wrong data, press the Clear All button on the clipboard in Microsoft Word. Then re-try the copy and paste.

Close the Character Count program. If an icon for Character Count stays in your tray, right-click it and Close again.

Rearrange as you prefer. Microsoft Word is a good editor for this file because it allows you to cut and paste columns. Select All and set the font to a monospace font, like Courier New. This should help make the columns line up. Then press the Left-Alt key and hold while selecting a column with the mouse. Cut and paste where desired. You may find it easier to set Number of Columns to 1 instead of 4, before Copying to Clipboard.

Here is a text file showing each decimal number and its corresponding standard value. You can use this as a starting point or cut and paste from it. Your document will probably have fewer than half of these characters.

Cut and paste each character's count into the rawcount file, last column. The first column gives the decimal value of the character that will need to be mapped in your TECkit mapping. Determining this set of characters is the primary goal of doing a character count.

rawcount.txt
Joan Wardell, 2005-07-08
Download "rawcount.txt", Text document, 2KB [2689 downloads]


Here is a sample of hand-edited results:

Hand edited results



On the left, you have a set of decimal numbers for a standard font. These are only the ones we are concerned with in conversion. The second column shows the standard character found in each codepoint. The last column, which you have added, should show the number of times this codepoint was used in your legacy file. This is the set of characters which must have a mapping in your TECkit mapping file, to be created in a later step.
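
The three-column layout described above can be sketched in Python. This is an illustration only: it assumes that the "standard" character for each codepoint is the one Windows code page 1252 assigns, and the column widths and the `format_report` name are hypothetical rather than the actual layout of rawcount.txt.

```python
from collections import Counter

def standard_label(value: int, encoding: str = "cp1252") -> str:
    """Return a printable label for one byte value in the assumed standard encoding."""
    if value < 32:
        return "(control)"
    if value == 32:
        return "(space)"
    # Byte values the code page leaves undefined become U+FFFD.
    return bytes([value]).decode(encoding, errors="replace")

def format_report(counts: Counter, encoding: str = "cp1252") -> str:
    """Build lines of: decimal value, standard character, number of occurrences."""
    lines = []
    for value in sorted(counts):
        label = standard_label(value, encoding)
        lines.append(f"{value:>3}  {label:<9}  {counts[value]:>6}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(format_report(Counter(b"aab c")))
```

Only codepoints that actually occur are listed, which is the set that needs an entry in the mapping file.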

You may find it easier to copy & paste from rawcount.txt into your character count results file, if you have only a small set of characters used.

Verifying the Results

You can check that the counts are accurate by doing the following:

  • Select a codepoint from your results document which has 5 or fewer occurrences.
  • Determine the decimal number.
  • Convert this number to its Windows-assigned PUA number using the Windows Calculator. To find out how, work through the short How to Convert tutorial and then find your PUA number with its final procedure, Converting a decimal codepoint to PUA.
  • Open your .txt or original legacy document in Microsoft Word. You can set your .txt document to the legacy font if you wish.
  • Using the Unicode Macros Find icon, search for the hex codepoint value. Count each hit in the file from beginning to end. It should be equal to the result given by wccount for that item. If it isn't, you may have missed copying some data or made an error in the procedure at some point.
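
The calculator step can be sketched in Python. This is an assumption-laden illustration, not the tutorial's own procedure: it assumes the common Windows convention of placing symbol-font characters in the Private Use Area at U+F000 plus the byte value, so confirm the result against the How to Convert tutorial. The function name is hypothetical.

```python
def decimal_to_pua_hex(value: int) -> str:
    """Return the assumed PUA codepoint for a legacy byte value, as four hex digits."""
    if not 0 <= value <= 255:
        raise ValueError("legacy codepoints must be in the range 0-255")
    # Assumption: PUA codepoint = 0xF000 + byte value.
    return f"{0xF000 + value:04X}"

if __name__ == "__main__":
    for decimal in (65, 128, 255):
        print(decimal, decimal_to_pua_hex(decimal))
```

Under that assumption, decimal 65 (hex 41) becomes F041, which is the value to type into the search.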

Uses for Character Counts

You may find running a character count useful for various reasons. In this How to Write a Conversion Mapping for your Legacy Font procedure, there are two uses for running a character count of your legacy file. The first is for you to get an idea of how many of the 256 characters in your legacy font are actually used. Many legacy fonts were developed from standard ASCII or ANSI encoded fonts. A small number of changes may have been made, leaving most of the font unused. Counting the characters will quickly show you which characters are regularly being used. When you go to write the mapping, these are the characters that require a mapping. Others can be ignored. Just make sure your Character Count is done with enough test data that you get all the characters.

The second is to provide a means of verifying that your mapping worked properly. This is done in step 10 Test Your Mapping. Depending on the complexity of your mapping, the counts should be identical.
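
The comparison of counts before and after conversion can be sketched as follows. This is a simplified illustration: `mapping` is a hypothetical Python dictionary from each legacy byte value to the Unicode string it should become, not a TECkit map, and the sketch ignores contextual rules and reordering, which are the cases where counts may legitimately differ.

```python
from collections import Counter

def expected_unicode_counts(legacy_counts: Counter, mapping: dict) -> Counter:
    """Predict Unicode character counts from legacy byte counts and a simple mapping."""
    expected = Counter()
    for byte_value, occurrences in legacy_counts.items():
        # One legacy codepoint may map to several Unicode characters.
        # A byte value missing from the mapping raises KeyError.
        for char in mapping[byte_value]:
            expected[char] += occurrences
    return expected

def compare_counts(legacy_counts: Counter, mapping: dict, converted_text: str) -> dict:
    """Return {character: (expected, actual)} for every count that does not match."""
    expected = expected_unicode_counts(legacy_counts, mapping)
    actual = Counter(converted_text)
    return {
        char: (expected[char], actual[char])
        for char in set(expected) | set(actual)
        if expected[char] != actual[char]
    }

if __name__ == "__main__":
    legacy = Counter(b"aab")
    mapping = {97: "\u0251", 98: "b\u0301"}
    print(compare_counts(legacy, mapping, "\u0251\u0251b\u0301"))
```

An empty result means every count matched. Any entry points to a character worth inspecting in the mapping or in the test data.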

Page History

2008-02-22 JW: reviewed, minor updates

2005-06-24 JW: Page created


© 2003-2017 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Non-Roman Script Initiative. Contact us here.