This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. Please consult our other sites for more current information: software.sil.org, ScriptSource, FDBP, and silfontdev



Short URL: https://scripts.sil.org/UnicodeCharacterCount

Unicode Character Count utility

Bob Hallissy, 2004-11-16

Updated 2005-11-16

New sorting options -f and -r added.

UnicodeCCount is a quick-and-dirty Unicode-aware replacement for CCount, the character count utility. Written in Perl, the program is available both as the Perl source (requires Perl 5.8.1 or newer) and as a stand-alone Windows EXE.


Syntax

UnicodeCCount is a command-line utility. When run without any parameters, it prints a short help message:

Usage:
    UnicodeCCount [-e encoding] [-o outputfile] [-c|-d] [-m] [-u|-f] [-r] file ...
    UnicodeCCount -l

A quick and dirty character counter that understands various encodings.

Input defaults to utf8, but you can choose other encodings with -e.
Data is converted from the specified encoding to Unicode as it is
read, and the output data is always utf-8.

-l outputs a list of available encodings.

-c or -d enforce Unicode normalization (NFC or NFD) as data is read.

-m combining mark sequences (base + diacritics) counted separately.

-u use the Unicode Collation Algorithm (UCA) rather than the default sort.

-f sort by frequency

-r reverse sort order

Version 0.3
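To illustrate what the options above describe, here is a minimal Python sketch of the same idea: decode the input from a named encoding, optionally normalize it to NFC or NFD, and count what is left. This is not the Perl source of UnicodeCCount. The function name `count_characters` and its parameters are hypothetical, and the `group_marks` behavior is only one reading of the `-m` option (each base character plus its following combining marks is counted as a single unit).

```python
import unicodedata
from collections import Counter
from typing import Optional


def count_characters(data: bytes,
                     encoding: str = "utf-8",
                     normalization: Optional[str] = None,
                     group_marks: bool = False) -> Counter:
    """Count characters in encoded bytes (hypothetical helper, not the tool's API)."""
    # Convert from the source encoding to Unicode as the data is read.
    text = data.decode(encoding)

    # Optionally force a normalization form, e.g. "NFC" or "NFD".
    if normalization is not None:
        text = unicodedata.normalize(normalization, text)

    if not group_marks:
        # One count per Unicode code point.
        return Counter(text)

    # Assumed reading of -m: attach each combining mark (general category
    # M*) to the character before it, and count the whole sequence.
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return Counter(units)


if __name__ == "__main__":
    sample = "e\u0301e".encode("utf-8")  # e + COMBINING ACUTE ACCENT + e
    for unit, n in sorted(count_characters(sample, normalization="NFC").items()):
        codes = " ".join("U+%04X" % ord(c) for c in unit)
        print("%s\t%s\t%d" % (codes, unit, n))
```

With NFC the sample collapses to U+00E9 plus a plain "e", while with NFD and `group_marks=True` the pair "e" + U+0301 is counted as one unit.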

Example

Suppose I have the first paragraph of the Russian translation of the Universal Declaration of Human Rights in a plain-text UTF-8 file called mytext.txt. The text looks like this:

Принимая во внимание, что признание достоинства, присущего всем членам человеческой семьи, и равных и неотъемлемых прав их является основой свободы, справедливости и всеобщего мира; и

Then the following command:

UnicodeCCount mytext.txt >counts.txt

(note that standard output is redirected so that the results are captured in a file) would result in the following UTF-8 data in counts.txt:

Character count for 'mytext.txt':
    U+000A         1
    U+000D         1
    U+0020         25
    U+002C    ,    4
    U+003B    ;    1
    U+041F    П    1
    U+0430    а    9
    U+0431    б    2
    U+0432    в    13
    U+0433    г    2
    U+0434    д    3
    U+0435    е    16
    U+0437    з    1
    U+0438    и    17
    U+0439    й    2
    U+043A    к    1
    U+043B    л    5
    U+043C    м    8
    U+043D    н    10
    U+043E    о    16
    U+043F    п    4
    U+0440    р    7
    U+0441    с    12
    U+0442    т    6
    U+0443    у    1
    U+0445    х    3
    U+0447    ч    4
    U+0449    щ    2
    U+044A    ъ    1
    U+044B    ы    3
    U+044C    ь    1
    U+044F    я    4
    U+FEFF         1

Note that the output is tab-separated.
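Because the report is plain tab-separated text, it is easy to post-process. The following Python sketch reads a report like the one above and reorders it by count. It assumes that each data line contains a `U+XXXX` code point and ends with the decimal count, and it only covers the default single-code-point output; the names `parse_counts` and `by_frequency` are hypothetical and are not part of UnicodeCCount.

```python
import re
from typing import Dict, List, Tuple

# A data line: optional indentation, "U+" plus 4-6 hex digits, then
# anything (the character column), then the count at the end of the line.
LINE_RE = re.compile(r"^\s*U\+([0-9A-Fa-f]{4,6})\b.*?(\d+)\s*$")


def parse_counts(report: str) -> Dict[int, int]:
    """Map code point -> count; lines that do not match are skipped."""
    counts = {}
    # Split on "\n" only, so a stray carriage return inside a line
    # does not break it in two.
    for line in report.split("\n"):
        m = LINE_RE.match(line)
        if m:
            counts[int(m.group(1), 16)] = int(m.group(2))
    return counts


def by_frequency(counts: Dict[int, int]) -> List[Tuple[int, int]]:
    """Most frequent first; ties are ordered by code point."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    with open("counts.txt", encoding="utf-8") as f:
        for cp, n in by_frequency(parse_counts(f.read())):
            print("U+%04X\t%d" % (cp, n))
```

The header line ("Character count for ...") does not start with a code point, so the parser ignores it. The ordering here is a choice made for the sketch and is not claimed to match what the `-f` and `-r` options produce.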

Downloads

Unicode Character Count Windows executable v0.3
Bob Hallissy, 2005-11-16
Download "UnicodeCCount.exe", Windows application, 2MB [4576 downloads]
Unicode Character Count Perl source v0.3
Bob Hallissy, 2005-11-16
Download "UnicodeCCount-0_3.zip", ZIP archive, 2KB [5893 downloads]

Previous versions

Unicode Character Count Windows executable v0.2
Bob Hallissy, 2004-09-08
Download "UnicodeCCount.exe", Windows application, 2MB [3600 downloads]
Unicode Character Count Perl source v0.2
Bob Hallissy, 2004-09-08
Download "UnicodeCCount-0_2.zip", ZIP archive, 2KB [3024 downloads]
Unicode Character Count Windows executable v0.1
Bob Hallissy, 2004-08-03
Download "UnicodeCCount.exe", Windows application, 2MB [3402 downloads]
Unicode Character Count Perl source v0.1
Bob Hallissy, 2004-08-03
Download "UnicodeCCount.zip", ZIP archive, 1KB [2943 downloads]

Related Resources

LetterMeter, a text analysis tool (Mac OS X only)

Support

As this program is distributed at no cost, I am unable to provide a commercial level of personal technical support. I am interested in hearing from you, however, and will try to resolve problems that are reported to me. You can send feedback to me here.


© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.
Provided by SIL's Writing Systems Technology team (formerly known as NRSI). Read our Privacy Policy. Contact us here.