Counting Occurrences

CPallini

4.91/5 (7 votes)

6 Feb 2014CPOL5 min read

32.7K

A simple algorithm for counting occurrences of symbols in files

Download C, C++, Lua source files - 3.7 KB

Introduction

Counting occurrences of symbols can answer the question: "How many blanks are in this text?". Among other amenities, it could be used as the preliminary step in some data compression algorithms.

Here, I am going to show the implementation of a trivial algorithm using the programming languages C and Lua. In addition, a C++ implementation is available for downloading. For the sake of brevity, I don't report it here.

The Problem

Given some input data, say "hello world!", produce an output similar to:

symbol	code	occurrences
o	111	2
w	119	1
r	114	1
h	104	1
e	101	1
d	100	1
	32	1

Where 'code' is the ASCII code of the character (the value of the byte). Since we want to process large binary files, let's state it this way: "count the occurrences of each byte value (0..255) in a binary file, then sort them by occurrences".

The Algorithm

The algorithm is very simple: suppose we are reading data, one byte at time and the value of the current one is, say, 104 (i.e., the 'h' character was just read): we have to increment the occurrences of 104 and repeat such operation with the next byte, until the whole input data has been processed. A fast way for "incrementing the occurrences of a byte value" uses an array of occurrences, with the implicit assumption that the occurrences of a value are stored by the item having index equal to the value itself (e.g. the occurrences of 104 are stored in occurrences[104]). The array is eventually sorted in order to obtain the expected output.

C Code

The C implementation is pretty straightforward: The program initializes the array of occurrences, then reads data from stdin using a 4KiB buffer. After each read operation, a function is called for updating the occurrences array. Finally, the sort step is performed using the C library qsort function.

Here follows the struct holding a symbol code and its occurrences (we need to store the symbol code in order to perform the sort step).

C++

#define BYTES 256
 
typedef struct TagSymbolInfo
{
  int symbol; // the byte [0..255] value
  int occurrences; // number of occurrences of the symbol in the given buffer
} SymbolInfo;

The initialization function (sets the symbol codes, resets the occurrences).

C++

// initialise the symbol array
void reset_occurrences(SymbolInfo si[])
{
  int n;
  for (n = 0; n < BYTES; ++n)
  {
    si[n].symbol = n;
    si[n].occurrences = 0;
  }
}

The following function is the workhorse of the algorithm: it processes the input data in order to update the occurrences array:

C++

 // counts the occurrences in the given buffer, updating accordingly the array of symbols
int increment_occurrences(SymbolInfo si[], const void * buffer, size_t size)
{
  unsigned char * p = (unsigned char *) buffer;
  int success = 0;
 
  while ( size-- )
  {
    si[p[size]].occurrences++; // since si[n].symbol == n, we can use p[size] as index.
    if ( ! si[p[size]].occurrences )
      success = -1; // error on counter overflow
  }
  return success;
}

The function used for ordering items by occurrences (and code, when occurrences are equal)

C++

// this is the predicate for qsort.
int compare_symbols(const void * symbola, const void * symbolb)
{
  const SymbolInfo * a = (const SymbolInfo *) symbola;
  const SymbolInfo * b = (const SymbolInfo *) symbolb;
 
  if ( a->occurrences > b->occurrences || 
  (a->occurrences == b->occurrences && a->symbol > b->symbol)) return -1;
  return 1;
}

Finally the main function, with some boilerplate code:

C++

 #define INBUFSIZE 4*1024
 
int main()
{
  long total = 0; // total number of occurrences, must be equal to file size
  unsigned char buffer[INBUFSIZE]; // the read buffer
  int n;
  SymbolInfo si[BYTES]; // the array of symbols

  reset_occurrences(si); // initialization

  // reading stdin until EOF (or error)
  for(;;)
  {
    int nread = fread (buffer, sizeof(buffer[0]), INBUFSIZE, stdin);
    if ( ! nread ) break;
    if ( increment_occurrences(si, buffer, nread) )
      printf("occurrences counter(s) overflow\n");
  }
  // calling qsort for the sorting step
  qsort( si, BYTES, sizeof(si[0]), compare_symbols);
 
 
  // some output
  for (n=0; n<BYTES; ++n)
  {
    char c = si[n].symbol;
    if (! si[n].occurrences) break;
    total += si[n].occurrences;
 
    c = isprint(c) ?  c : '.';
 
    printf("{symbol ='%c', code= %3d,  
    occurrences = %d}\n", c, si[n].symbol, si[n].occurrences);
  }
  printf("total occurrences = %ld\n", total);
 
  return 0;
}

Lua Code

Lua is a wonderful scripting language, lightweight, fast and easily embeddable (more information can be found at Lua website).

Since it is a somewhat 'exotic language' here at CodeProject, I am going to give some details of the posted code.

Lua

local INBUFSIZE = 4096
 
local function init_occurrences()
  local si = {}
  for i=1,256 do
    si[i] = { occurrences = 0, symbol = (i-1) }
  end
  return si
end

The init_occurrences function creates the occurrences array si and initializes it.

In more detail, the local si = {} statement creates a table, the only data structure Lua provides.

A table is a kind of associative array. Inside the for loop, a fresh table is created, initialized and assigned to every item of si.

The statement...

si[i] = { occurrences = 0, symbol = (i-1) }

...is a shortcut for:

Lua

s[i] = {}
s[i]["occurrences"] = 0
s[i]["symbol"] = i-1

or...

Lua

s[i] = {}
s[i].occurrences = 0
s[i].symbol = i-1

...using the 'syntactic sugar' provided with the dot operator.

Finally return si terminates the function and makes si available to the caller, avoiding its removal on next garbage collection.

The increment_occurrences function, pretty straightforward, shows the lack of the handy increment operators:

Lua

 local function increment_occurrences(si, buffer)
  for i=1,#buffer do
    local index = buffer:byte(i)+1
    si[index].occurrences = si[index].occurrences + 1
  end
end

The remaining part of the program, boilerplate code...

Lua

 local function main()
 
  local total = 0
  local si = init_occurrences()
 
  repeat
    local buf = io.read(INBUFSIZE)
    if  buf then
      increment_occurrences(si, buf)
    end
  until not buf
 
  table.sort(si,
    function (a,b)
      if (a.occurrences > b.occurrences or (a.occurrences == b.occurrences and a.symbol > b.symbol)) then
        return true
      end
    end)
 
  for i=1,256 do
    if si[i].occurrences == 0 then break end
    local c = string.char(i):match("([%g])") or "-"
 
    print(string.format("{symbol ='%s', code= %3d,  occurrences = %d}", c, si[i].symbol, si[i].occurrences));
 
    total = total + si[i].occurrences
  end
 
  print("total occurrences = " .. total)
end 

main()

With a nice twist: the following statements sort the si table, using the table.sort function fed with an anonymous function.

table.sort(si,
  function (a,b)
    if (a.occurrences > b.occurrences or (a.occurrences == b.occurrences and a.symbol > b.symbol)) then
      return true
    end
  end)

C++ Code

David Delaune, aka Randor suggested me a neat approach for reading input (you may find his comments in the thread at bottom of the page). Now it makes sense to show the C++ code:

C++

struct SymbolInfo
{
  int symbol; // the byte value [0..255]
  int occurrences; // the number of occurrences of the symbol
};

int main()
{
  std::cin >> std::noskipws;
  std::array<SymbolInfo , 256> si={{0}};
  std::istream_iterator<unsigned char> input(std::cin),end;
  std::for_each(input,end,[&](unsigned char c){si[c].occurrences++;});

  for (int n=0; n<256; ++n) si[n].symbol=n;

  std::sort(si.begin(), si.end(),
    [](const SymbolInfo & a, const SymbolInfo &b) {return (a.occurrences > b.occurrences || (a.occurrences == b.occurrences && a.symbol > b.symbol));}
  );
}

That is! (I have just omitted the include directives and the boilerplate printout code).

Using the Code

At the moment, the C/C++ code is ready to run on a Linux console (although modifying it for running on a Windows console is trivial) .

Lua code requires the Lua intepreter, available on Ubuntu-like distributions as package lua5.2 ( apt-get install lua5.2). In addition, binaries, also for Windows, are available here: Lua binaries at SourceForge.

C code

Compile with:

gcc -Wall -O3 -o occurrences occurrences.c

Run with:

./occurrences < <FILE PATH>

(e.g. ./occurrences < myfile.bin)

The input redirection operator (<) is mandatory in order to count occurrences of a file (otherwise the commands would count occurrences on stdin, that is characters typed by the user until CTRL^D.

C++ code

Compile progrma (1) with:

g++ -std=c++11 -Wall -O3 -o occurrences occurrences.cpp

and program (2) with:

g++ -std=c++11 -Wall -O3 -o occurrences_ran occurrences_ran.cpp

(The C++ programs requires a fairly recent version of g++, I used the 4.7.3 one)

Run the same way the C one does.

Lua code

Of course it requires no compilation. You may run it this way:

lua occurrences.lua < <FILE PATH>

Benchmark

I know it is not fair to compare the performances of Lua and C/C++ code using a rather low-level algorithm, however I am not fair, so here you are a little test.

Details

Executed on a Lubuntu 13.04 system.
Input file: lubuntu-13.10-desktop-i386.iso (729,808,896 bytes).
g++ and gcc optimization level -O3.
Lua version 5.2.1.
Reported timings are generated with the Linux time command.

language	file size (MiB)	real (s)	user (s)	sys (s)	speed (MiB/s)
C++ (1)	696	9.07	1.31	1.51	76.74
C++ (2)	696	41.47	37.20	3.50	16.78
C	696	9.94	0.62	1.63	70.02
Lua	696	165.38	155.87	4.38	4.21

Notes

'Traditional' C++ program (the source file occurrences.cpp is available in the download section)
More elegant program (the one shown in the article, available for download as occurrences_ran.cpp)

Remarks

Apparently C and C++ (1) performances are very similar. They are dominated by the I/O as the real vs user and sys figures show.
C++ (2) is noticeably slower that C++ (1).
Lua is about 17 times slower than C/C++ , its execution time being dominated by the user part.

History

8 Feb 2014 - Added the C++ code (thanks David Delaune aka Randor)
6 Feb 2014 - Added the missing source files for download
6 Feb 2014 - First release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)