What's New?
GetData: The new parameter Pristine indicates whether you want to get the data in the same order as it was entered. That can be important in the case of time-dependent data. The new Add method stores the position(s) of each value using a Dictionary<T, List<int>>.
Introduction
The exploration of empirical data is a common task in many fields. The computational analysis of such data is often complicated by the nature (the type) of the data. So - while writing statistical routines of my own - I decided to develop a "Frequency Table Class" that accepts any possible type of data.
Requirements
The class has to satisfy the following requirements:
- Accept any type of data (especially multiple precision types)
- Accept a given array
- Provide methods for adding and removing single values
- Automatically update the absolute frequency when a value is added/removed
- Provide simple access to the mode, the highest frequency, and so on
- Provide a method to get the table as an array
- Returned arrays must be sortable by frequency and value
- Provide fields/properties to describe the table
Code
The backbone of this class is the FrequencyTableEntry<T> structure:
// A generic structure storing the frequency information for each value
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
    // Constructor
    // val:        the value counted
    // absFreq:    the absolute frequency
    // relFreq:    the relative frequency
    // percentage: the percentage
    public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }

    public T Value;
    public int AbsoluteFreq;
    public double RelativeFreq;
    public double Percentage;
}
The specified type T has to implement the IComparable<T> interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new Dictionary<T, int>(). The _entries.Keys collection contains the values to count, and the _entries.Values collection contains the absolute frequency of each particular value:
public FrequencyTable(int initialCapacity)
{
    _entries = new Dictionary<T, int>(initialCapacity);
    // ...
}
To provide easy access to table entries, the implemented enumerator returns the structure above:
public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    // the structure to return
    FrequencyTableEntry<T> _output;
    // the frequency
    int int_f;
    // the "double-typed" frequency
    double dbl_f;
    foreach (T key in _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double)int_f;
        // fill the structure
        _output = new FrequencyTableEntry<T>
            (key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
        // yielding - cool thing that
        yield return _output;
    }
}
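Thanks to GetEnumerator, the table can be consumed directly with foreach. Here is a minimal sketch (the values and output format are illustrative only):
// iterate over a filled table - each iteration yields a
// FrequencyTableEntry<T> as defined above
FrequencyTable<int> table = new FrequencyTable<int>(10);
table.Add(1);
table.Add(2);
table.Add(2);

foreach (FrequencyTableEntry<int> entry in table)
{
    Console.WriteLine("{0}: {1} ({2}%)",
        entry.Value, entry.AbsoluteFreq, entry.Percentage);
}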
The general Add(T value) method looks like this:
public void Add(T value)
{
    List<int> _tempPos;
    // if the Dictionary already contains value, then
    // we have to update frequency and _count
    if (_entries.ContainsKey(value))
    {
        // update the frequency
        _entries[value]++;
        // update mode and highest frequency
        if (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        // add 1 to sample size
        _count++;
        // update relative frequencies
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // store the actual position of the entry in the dataset;
        // the position is equal to _count
        _positions.TryGetValue(value, out _tempPos);
        _tempPos.Add(_count);
        // remove old entry
        _positions.Remove(value);
        // store new entry
        _positions.Add(value, _tempPos);
    }
    else // the Dictionary does not contain value - add a new entry
    {
        // if the highest frequency is still zero, set it to one
        if (_high < 1)
        {
            _high = 1;
            _mode = value;
        }
        // add a new entry - frequency is one
        _entries.Add(value, 1);
        // add 1 to table length
        _length++;
        // add 1 to sample size
        _count++;
        // update relative frequencies
        _relFrequencies.Add(value, 0.0);
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // create a new entry and set its position to _count
        _tempPos = new List<int>();
        _tempPos.Add(_count);
        // store it
        _positions.Add(value, _tempPos);
    }
}
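A short sketch of this bookkeeping in action (using properties listed further below):
// mode and highest frequency are kept up to date on every Add
FrequencyTable<string> words = new FrequencyTable<string>(10);
words.Add("apple");
words.Add("pear");
words.Add("apple");

Console.WriteLine(words.Mode);             // apple
Console.WriteLine(words.HighestFrequency); // 2
Console.WriteLine(words.SampleSize);       // 3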
To simplify the analysis of a given text, I have implemented a special constructor:
// Constructor - the created instance analyzes the
// frequency of characters in a given string
// Text: the string to analyze
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new Dictionary<T, List<int>>();
    // if T is not string -> exception
    if (!(Text is string))
        throw new ArgumentException();
    // the table itself
    _entries = new Dictionary<T, int>();
    _relFrequencies = new Hashtable();
    // number of entries in _entries
    _length = 0;
    // sample size
    _count = 0;
    // description of the table
    _description = "";
    // a user defined tag
    _tag = 0;
    // the highest frequency
    _high = 0;
    _dblSum = double.NaN;
    _mean = double.NaN;
    _alpha = double.NaN;
    AnalyzeString(Text, mode);
    _p = double.NaN;
}
The associated Add method:
public void Add(T Text, TextAnalyzeMode mode)
{
    if (!(Text is string))
        throw new ArgumentException();
    AnalyzeString(Text, mode);
}
In my opinion, it is useful to provide different modes for the analysis of text. These modes are defined by the TextAnalyzeMode enumeration:
public enum TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}
The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode):
private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
    // character strings
    string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
    string str_Numbers = "0123456789";
    // add the entries according to mode
    switch (mode)
    {
        case TextAnalyzeMode.AllCharacters:
            foreach (char v in Text.ToString())
                this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            break;
        case TextAnalyzeMode.LettersOnly:
            foreach (char v in Text.ToString())
            {
                if ((str_specialChars.IndexOf(v) == -1) &&
                    (str_Numbers.IndexOf(v) == -1))
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoNumerals:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoSpecialCharacters:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NumeralsOnly:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.SpecialCharactersOnly:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
    }
}
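A usage sketch (T has to be string here, as the constructor enforces; each counted entry is a single-character string):
// count the letter frequencies of a string, ignoring
// numerals and special characters (including spaces)
FrequencyTable<string> letters =
    new FrequencyTable<string>("Hello World", TextAnalyzeMode.LettersOnly);

foreach (FrequencyTableEntry<string> entry in letters)
    Console.WriteLine("{0}: {1}", entry.Value, entry.AbsoluteFreq);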
Test for Normality
The question of whether given data are "Gaussian-distributed" is often raised. There are several robust and valid tests to answer this question. I have implemented the "good old" Kolmogorov-Smirnov test (KS test). Alternatively, one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:
- IsGaussian: Returns true if the data are numerical and the computed p-value is greater than Alpha (see below)
- Alpha: Defines the significance level for the KS test
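A short usage sketch (treating Alpha as writable is an assumption based on its description):
// test numerical data for normality at a 5% significance level
FrequencyTable<double> sample = new FrequencyTable<double>(100);
double[] data = { 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0 };
foreach (double d in data)
    sample.Add(d);

sample.Alpha = 0.05;
Console.WriteLine(sample.IsGaussian
    ? "Gaussian (p = " + sample.P_Value + ")"
    : "not Gaussian at this level (p = " + sample.P_Value + ")");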
The KS-Test method is shown below. The method returns true if the test is applicable. In the case of non-numerical data, the method returns false. The out parameter p contains the p-value on exit; this value can also be accessed through the P_Value property.
private bool KS_Test(out double p)
{
    // D statistic
    double D = double.NaN;
    CumulativeFrequencyTableEntry<T>[] empCDF =
        GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
    // stores the test CDF
    double testCDF;
    // array to store the datapoints
    double[] data = new double[empCDF.Length];
    FrequencyTableEntry<T>[] table =
        GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
    int i = 0;
    // prevent exceptions if T is not numerical
    try
    {
        foreach (FrequencyTableEntry<T> entry in table)
        {
            data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
            i++;
        }
    }
    catch
    {
        p = double.NaN;
        return false;
    }
    // estimate the parameters of the expected Gaussian distribution
    // first: compute the mean
    double mean = Mean;
    // compute the bias-corrected variance as an estimator for the
    // population variance (actually we need the square root)
    double _sqrt_var = Math.Sqrt(VariancePop);
    // now we have to determine the greatest difference between the
    // sample cumulative distribution function (empCDF) and
    // the distribution function to test (testCDF)
    double _sqrt2 = Math.Sqrt(2.0);
    double _erf;
    double max1 = 0.0;
    double max2 = 0.0;
    double _temp;
    for (i = 0; i < empCDF.Length; i++)
    {
        // compute the expected distribution using the error function
        _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
        testCDF = 0.5 * (1.0 + _erf);
        _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
        if (_temp > max1)
            max1 = _temp;
        if (i > 0)
            _temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
        else
            _temp = testCDF;
        if (_temp > max2)
            max2 = _temp;
    }
    // the statistic to use is max{max1, max2}
    D = max1 > max2 ? max1 : max2;
    // now compute the p-value using a z-transformation
    if (!Double.IsNaN(D))
    {
        double z = Math.Sqrt((double)SampleSize) * D;
        p = KS_Prob_Smirnov(z);
    }
    else
        p = double.NaN;
    return true;
}
To compute the "test distribution" (the Gaussian CDF in this case), we need the so-called error function. I have used the Erf implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.
Descriptive Statistics
I think it is useful to implement some fundamental statistical properties inside the class.
Cumulative Frequencies
First of all, we need a method that returns the empirical distribution function (the cumulative distribution function) of the given data:
public CumulativeFrequencyTableEntry<T>[]
    GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
    CumulativeFrequencyTableEntry<T>[] _output = null;
    // get the frequency table as an array for easier processing
    FrequencyTableEntry<T>[] _freqTable =
        GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
    // temporary values
    double tempCumRelFreq = 0.0;
    int tempCumAbsFreq = 0;
    int i, k;
    switch (Format)
    {
        // each datapoint will be returned
        case CumulativeFrequencyTableFormat.EachDatapoint:
            // initialize the result
            _output = new CumulativeFrequencyTableEntry<T>[SampleSize];
            for (i = 0; i < _freqTable.Length; i++)
            {
                // update the cumulative frequency - relative and absolute
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                // fill the array
                for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq;
                     k < tempCumAbsFreq; k++)
                {
                    _output[k] = new CumulativeFrequencyTableEntry<T>
                        (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
                }
            }
            break;
        // here each different entry will be returned once
        case CumulativeFrequencyTableFormat.EachDatapointOnce:
            // initialize the result
            _output = new CumulativeFrequencyTableEntry<T>[Length];
            for (i = 0; i < _freqTable.Length; i++)
            {
                // update the cumulative frequency - relative and absolute
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                // fill the array
                _output[i] = new CumulativeFrequencyTableEntry<T>
                    (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
            }
            break;
    }
    // done
    return _output;
}
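A short sketch (Value and CumulativeRelativeFrequency are the members used in KS_Test above):
// print the empirical CDF, one line per distinct value
FrequencyTable<int> t = new FrequencyTable<int>(10);
t.Add(1); t.Add(2); t.Add(2); t.Add(3);

CumulativeFrequencyTableEntry<int>[] cdf = t.GetCumulativeFrequencyTable
    (CumulativeFrequencyTableFormat.EachDatapointOnce);
foreach (CumulativeFrequencyTableEntry<int> e in cdf)
    Console.WriteLine("{0}: {1}", e.Value, e.CumulativeRelativeFrequency);
// output: 1: 0.25 / 2: 0.75 / 3: 1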
Where Are My Data??
OK - you need an array of the added data? This is the way to get it:
public T[] GetData(bool Pristine)
{
    T[] result = new T[SampleSize];
    // if the order is not important
    if (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf =
            GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapoint);
        for (int i = 0; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else // return the data in the same order as entered
    {
        List<int> l;
        foreach (T key in _positions.Keys)
        {
            _positions.TryGetValue(key, out l);
            foreach (int k in l)
            {
                result[k - 1] = key;
            }
        }
    }
    return result;
}
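A short sketch of both modes:
// retrieve the raw data sorted by value or in input order
FrequencyTable<int> t = new FrequencyTable<int>(10);
t.Add(3); t.Add(1); t.Add(3); t.Add(2);

int[] sorted   = t.GetData(false); // 1, 2, 3, 3
int[] original = t.GetData(true);  // 3, 1, 3, 2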
What Else?
There are some public properties concerning descriptive statistics (a short usage sketch follows the list):
- Mean
- Median
- Mode
- Minimum
- Maximum
- VarianceSample
- VariancePop (unbiased estimator)
- StandardDevSample
- StandardDevPop (unbiased estimator)
- StandardError
- Sum
- SampleSize - the number of data points (read only)
- HighestFrequency - the highest frequency observed
- SmallestFrequency - the smallest frequency observed
- ScarcestValue - the scarcest value
- Kurtosis
- KurtosisExcess
- Skewness
If the data are not numerical, all of the properties above will return double.NaN.
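A short sketch:
// a few of the descriptive properties in action
FrequencyTable<double> stats = new FrequencyTable<double>(10);
stats.Add(1.0); stats.Add(2.0); stats.Add(2.0); stats.Add(4.0);

Console.WriteLine(stats.Mean);    // 2.25
Console.WriteLine(stats.Median);  // 2
Console.WriteLine(stats.Minimum); // 1
Console.WriteLine(stats.Maximum); // 4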
Miscellaneous
Here is a list of the remaining public properties and methods.
Properties
- Length - the number of table entries (read only)
- Tag - an object which can be set by the user (writable)
- Description - the description of the table (writable)
- P_Value - contains the p-value computed by the Kolmogorov-Smirnov test
Methods
- Add(T Value) and Add(T Text, TextAnalyzeMode mode)
- Remove(T Value)
- GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting is done using the Quicksort algorithm)
- GetEnumerator()
- ContainsValue(T value)
- GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
- GetData(bool Pristine) - returns the data as an array (sorted or in input order)
- GetRelativeFrequency(T value, out double relFreq) - see the sketch below
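A short sketch of GetRelativeFrequency and GetTableAsArray:
// query a single relative frequency and fetch a table
// sorted by value
FrequencyTable<char> chars = new FrequencyTable<char>(10);
chars.Add('a'); chars.Add('b'); chars.Add('a');

double relFreq;
chars.GetRelativeFrequency('a', out relFreq);
Console.WriteLine(relFreq); // 0.6666...

FrequencyTableEntry<char>[] byValue =
    chars.GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);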
The code is (I think) well documented, so you can use it to get a detailed insight into my solution. I am sure that this solution is not perfect, but it is a good starting point.
For a better overview, I have added a compiled help file (see the download at the top of this page).
History
- Version 1.0 - 18 Jan '07
- Version 1.5 - 04 Feb '07
  - Minor bug fixes (highest frequency was not set correctly)
  - Added normality testing
  - Added descriptive statistics
- 09 Feb '07 - P_Value added, release number not changed
- Version 2.0 - 26 Feb '07 - GetData(bool Pristine) added