Introduction
Counting occurrences of symbols can answer the question: "How many blanks are in this text?". Among other amenities, it could be used as the preliminary step in some data compression algorithms.
Here, I am going to show the implementation of a trivial algorithm using the programming languages C and Lua. In addition, a C++ implementation is available for downloading. For the sake of brevity, I don't report it here.
The Problem
Given some input data, say "hello world!
", produce an output similar to:
symbol | code | occurrences |
o | 111 | 2 |
w | 119 | 1 |
r | 114 | 1 |
h | 104 | 1 |
e | 101 | 1 |
d | 100 | 1 |
| 32 | 1 |
Where 'code' is the ASCII code of the character (the value of the byte). Since we want to process large binary files, let's state it this way: "count the occurrences of each byte value (0..255) in a binary file, then sort them by occurrences".
The Algorithm
The algorithm is very simple: suppose we are reading data, one byte at time and the value of the current one is, say, 104
(i.e., the 'h'
character was just read): we have to increment the occurrences of 104
and repeat such operation with the next byte, until the whole input data has been processed. A fast way for "incrementing the occurrences of a byte value" uses an array of occurrences, with the implicit assumption that the occurrences of a value are stored by the item having index equal to the value itself (e.g. the occurrences of 104
are stored in occurrences[104]
). The array is eventually sorted in order to obtain the expected output.
C Code
The C implementation is pretty straightforward: The program initializes the array of occurrences, then reads data from stdin
using a 4KiB
buffer. After each read operation, a function is called for updating the occurrences array. Finally, the sort step is performed using the C library qsort
function.
Here follows the struct
holding a symbol code and its occurrences (we need to store the symbol code in order to perform the sort step).
#define BYTES 256
typedef struct TagSymbolInfo
{
int symbol; int occurrences; } SymbolInfo;
The initialization function (sets the symbol codes, resets the occurrences).
void reset_occurrences(SymbolInfo si[])
{
int n;
for (n = 0; n < BYTES; ++n)
{
si[n].symbol = n;
si[n].occurrences = 0;
}
}
The following function is the workhorse of the algorithm: it processes the input data in order to update the occurrences array:
int increment_occurrences(SymbolInfo si[], const void * buffer, size_t size)
{
unsigned char * p = (unsigned char *) buffer;
int success = 0;
while ( size-- )
{
si[p[size]].occurrences++; if ( ! si[p[size]].occurrences )
success = -1; }
return success;
}
The function used for ordering items by occurrences (and code, when occurrences are equal)
int compare_symbols(const void * symbola, const void * symbolb)
{
const SymbolInfo * a = (const SymbolInfo *) symbola;
const SymbolInfo * b = (const SymbolInfo *) symbolb;
if ( a->occurrences > b->occurrences ||
(a->occurrences == b->occurrences && a->symbol > b->symbol)) return -1;
return 1;
}
Finally the main function, with some boilerplate code:
#define INBUFSIZE 4*1024
int main()
{
long total = 0; unsigned char buffer[INBUFSIZE]; int n;
SymbolInfo si[BYTES];
reset_occurrences(si);
for(;;)
{
int nread = fread (buffer, sizeof(buffer[0]), INBUFSIZE, stdin);
if ( ! nread ) break;
if ( increment_occurrences(si, buffer, nread) )
printf("occurrences counter(s) overflow\n");
}
qsort( si, BYTES, sizeof(si[0]), compare_symbols);
for (n=0; n<BYTES; ++n)
{
char c = si[n].symbol;
if (! si[n].occurrences) break;
total += si[n].occurrences;
c = isprint(c) ? c : '.';
printf("{symbol ='%c', code= %3d,
occurrences = %d}\n", c, si[n].symbol, si[n].occurrences);
}
printf("total occurrences = %ld\n", total);
return 0;
}
Lua Code
Lua is a wonderful scripting language, lightweight, fast and easily embeddable (more information can be found at Lua website).
Since it is a somewhat 'exotic language' here at CodeProject, I am going to give some details of the posted code.
local INBUFSIZE = 4096
local function init_occurrences()
local si = {}
for i=1,256 do
si[i] = { occurrences = 0, symbol = (i-1) }
end
return si
end
The init_occurrences
function creates the occurrences array si
and initializes it.
In more detail, the local si = {}
statement creates a table
, the only data structure Lua provides.
A table is a kind of associative array. Inside the for
loop, a fresh table is created, initialized and assigned to every item of si.
The statement...
si[i] = { occurrences = 0, symbol = (i-1) }
...is a shortcut for:
s[i] = {}
s[i]["occurrences"] = 0
s[i]["symbol"] = i-1
or...
s[i] = {}
s[i].occurrences = 0
s[i].symbol = i-1
...using the 'syntactic sugar' provided with the dot operator.
Finally return si
terminates the function and makes si
available to the caller, avoiding its removal on next garbage collection.
The increment_occurrences
function, pretty straightforward, shows the lack of the handy increment operators:
local function increment_occurrences(si, buffer)
for i=1,#buffer do
local index = buffer:byte(i)+1
si[index].occurrences = si[index].occurrences + 1
end
end
The remaining part of the program, boilerplate code...
local function main()
local total = 0
local si = init_occurrences()
repeat
local buf = io.read(INBUFSIZE)
if buf then
increment_occurrences(si, buf)
end
until not buf
table.sort(si,
function (a,b)
if (a.occurrences > b.occurrences or (a.occurrences == b.occurrences and a.symbol > b.symbol)) then
return true
end
end)
for i=1,256 do
if si[i].occurrences == 0 then break end
local c = string.char(i):match("([%g])") or "-"
print(string.format("{symbol ='%s', code= %3d, occurrences = %d}", c, si[i].symbol, si[i].occurrences));
total = total + si[i].occurrences
end
print("total occurrences = " .. total)
end
main()
With a nice twist: the following statements sort the si
table, using the table.sort
function fed with an anonymous function.
table.sort(si,
function (a,b)
if (a.occurrences > b.occurrences or (a.occurrences == b.occurrences and a.symbol > b.symbol)) then
return true
end
end)
C++ Code
David Delaune, aka Randor
suggested me a neat approach for reading input (you may find his comments in the thread at bottom of the page). Now it makes sense to show the C++ code:
struct SymbolInfo
{
int symbol; int occurrences; };
int main()
{
std::cin >> std::noskipws;
std::array<SymbolInfo , 256> si={{0}};
std::istream_iterator<unsigned char> input(std::cin),end;
std::for_each(input,end,[&](unsigned char c){si[c].occurrences++;});
for (int n=0; n<256; ++n) si[n].symbol=n;
std::sort(si.begin(), si.end(),
[](const SymbolInfo & a, const SymbolInfo &b) {return (a.occurrences > b.occurrences || (a.occurrences == b.occurrences && a.symbol > b.symbol));}
);
}
That is! (I have just omitted the include
directives and the boilerplate printout code).
Using the Code
At the moment, the C/C++ code is ready to run on a Linux console (although modifying it for running on a Windows console is trivial) .
Lua code requires the Lua intepreter, available on Ubuntu-like distributions as package lua5.2 ( apt-get install lua5.2
). In addition, binaries, also for Windows, are available here: Lua binaries at SourceForge.
C code
Compile with:
gcc -Wall -O3 -o occurrences occurrences.c
Run with:
./occurrences < <FILE PATH>
(e.g. ./occurrences < myfile.bin
)
The input redirection operator (<
) is mandatory in order to count occurrences of a file (otherwise the commands would count occurrences on stdin
, that is characters typed by the user until CTRL^D
.
C++ code
Compile progrma (1) with:
g++ -std=c++11 -Wall -O3 -o occurrences occurrences.cpp
and program (2) with:
g++ -std=c++11 -Wall -O3 -o occurrences_ran occurrences_ran.cpp
(The C++ programs requires a fairly recent version of g++, I used the 4.7.3 one)
Run the same way the C one does.
Lua code
Of course it requires no compilation. You may run it this way:
lua occurrences.lua < <FILE PATH>
Benchmark
I know it is not fair to compare the performances of Lua and C/C++ code using a rather low-level algorithm, however I am not fair, so here you are a little test.
Details
- Executed on a
Lubuntu 13.04
system. - Input file:
lubuntu-13.10-desktop-i386.iso
(729,808,896 bytes). g++
and gcc
optimization level -O3
. Lua
version 5.2.1
. - Reported timings are generated with the Linux
time
command.
language | file size (MiB) | real (s) | user (s) | sys (s) | speed (MiB/s) |
C++ (1) | 696 | 9.07 | 1.31 | 1.51 | 76.74 |
C++ (2) | 696 | 41.47 | 37.20 | 3.50 | 16.78 |
C | 696 | 9.94 | 0.62 | 1.63 | 70.02 |
Lua | 696 | 165.38 | 155.87 | 4.38 | 4.21 |
Notes
- 'Traditional' C++ program (the source file occurrences.cpp is available in the download section)
- More elegant program (the one shown in the article, available for download as occurrences_ran.cpp)
Remarks
- Apparently C and C++ (1) performances are very similar. They are dominated by the I/O as the
real
vs user
and sys
figures show. - C++ (2) is noticeably slower that C++ (1).
Lua
is about 17
times slower than C/C++
, its execution time being dominated by the user
part.
History
- 8 Feb 2014 - Added the C++ code (thanks David Delaune aka
Randor
) - 6 Feb 2014 - Added the missing source files for download
- 6 Feb 2014 - First release