What's New?

GetData: the new parameter Pristine indicates whether you want to get the data in the same order as it was entered. That can be important for time-dependent data. The new Add method stores the position(s) of each value using a Dictionary<T, List<int>>.
Introduction
The exploration of empirical data is a common task in many fields. The computational analysis of such data is often hampered by the nature (the type) of the data. So, while writing my own statistical routines, I decided to develop a "Frequency Table Class" which accepts any possible type of data.
Requirements
My class has to satisfy the following requirements:
- Accept any type of data (especially multiple precision types)
- Accept a given array
- Provide methods for adding and removing single values
- Automatically update the absolute frequency when a value is added/removed
- Provide simple access to the mode, the highest frequency, etc.
- Provide a method to get the table as an array
- Returned arrays must be sortable by frequency and value
- Provide fields/properties to describe the table
Code
The backbone of this class is the FrequencyTableEntry<T> structure:
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
    public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }

    public T Value;
    public int AbsoluteFreq;
    public double RelativeFreq;
    public double Percentage;
}
The specified type T has to implement the IComparable<T> interface (needed for the sorting routine). The class stores the data in a generic Dictionary: _entries = new Dictionary<T, int>(). The _entries.Keys collection contains the values to count; the _entries.Values collection contains the absolute frequency of each particular value:
public FrequencyTable(int initialCapacity)
{
    _entries = new Dictionary<T, int>(initialCapacity);
    ...
}
To provide easy access to the table entries, the implemented enumerator returns the structure above:
public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    FrequencyTableEntry<T> _output;
    int int_f;
    double dbl_f;
    foreach (T key in _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double)int_f;
        _output = new FrequencyTableEntry<T>
            (key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
        yield return _output;
    }
}
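The pattern above can be illustrated with a small stand-alone sketch (MiniTable is a hypothetical name, not the article's class): counts live in a Dictionary, and the enumerator computes the relative frequency lazily from the running total.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of the counting + yield-return pattern described above.
class MiniTable<T> where T : IComparable<T>
{
    private readonly Dictionary<T, int> _entries = new Dictionary<T, int>();
    private int _count;

    public void Add(T value)
    {
        _entries.TryGetValue(value, out int f); // f is 0 if the key is absent
        _entries[value] = f + 1;
        _count++;
    }

    // One entry per distinct value: (value, absolute, relative) frequency.
    public IEnumerable<(T Value, int AbsFreq, double RelFreq)> Entries()
    {
        foreach (KeyValuePair<T, int> kv in _entries)
            yield return (kv.Key, kv.Value, (double)kv.Value / _count);
    }
}
```

Because the relative frequency is derived at enumeration time, no per-key update is needed when a value is added.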
The general Add(T value) method looks like this:
public void Add(T value)
{
    List<int> _tempPos;
    if (_entries.ContainsKey(value))
    {
        _entries[value]++;
        if (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        _count++;
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        _positions.TryGetValue(value, out _tempPos);
        _tempPos.Add(_count);
        _positions.Remove(value);
        _positions.Add(value, _tempPos);
    }
    else
    {
        if (_high < 1)
        {
            _high = 1;
            _mode = value;
        }
        _entries.Add(value, 1);
        _length++;
        _count++;
        _relFrequencies.Add(value, 0.0);
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        _tempPos = new List<int>();
        _tempPos.Add(_count);
        _positions.Add(value, _tempPos);
    }
}
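The position bookkeeping inside Add can be sketched in isolation (PositionTracker is a hypothetical helper, not part of the class): every call records the 1-based insertion index of the value, which is what later allows the input order to be reconstructed.

```csharp
using System;
using System.Collections.Generic;

// Stand-alone sketch of the Dictionary<T, List<int>> position bookkeeping.
class PositionTracker<T>
{
    private readonly Dictionary<T, List<int>> _positions = new Dictionary<T, List<int>>();
    private int _count; // running number of added values

    public void Add(T value)
    {
        _count++; // 1-based insertion index of this value
        if (!_positions.TryGetValue(value, out List<int> pos))
        {
            pos = new List<int>();
            _positions.Add(value, pos);
        }
        pos.Add(_count);
    }

    public IReadOnlyList<int> PositionsOf(T value) => _positions[value];
}
```

Note that appending to the retrieved List<int> in place makes the Remove/Add pair on the dictionary (as in the article's code) unnecessary; the list is a reference type.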
To simplify the analysis of a given text, I have implemented a special constructor:
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new Dictionary<T, List<int>>();
    if (!(Text is string))
        throw new ArgumentException();
    _entries = new Dictionary<T, int>();
    _relFrequencies = new Hashtable();
    _length = 0;
    _count = 0;
    _description = "";
    _tag = 0;
    _high = 0;
    _dblSum = double.NaN;
    _mean = double.NaN;
    _alpha = double.NaN;
    AnalyzeString(Text, mode);
    _p = double.NaN;
}
The associated Add method:

public void Add(T Text, TextAnalyzeMode mode)
{
    if (!(Text is string))
        throw new ArgumentException();
    AnalyzeString(Text, mode);
}
In my opinion, it is useful to provide different modes of character analysis. These modes are defined by the TextAnalyzeMode enumeration:
public enum TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}
The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode):
private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
    string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
    string str_Numbers = "0123456789";
    switch (mode)
    {
        case TextAnalyzeMode.AllCharacters:
            foreach (char v in Text.ToString())
                this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            break;
        case TextAnalyzeMode.LettersOnly:
            foreach (char v in Text.ToString())
            {
                if ((str_specialChars.IndexOf(v) == -1) &&
                    (str_Numbers.IndexOf(v) == -1))
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoNumerals:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoSpecialCharacters:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NumeralsOnly:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.SpecialCharactersOnly:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
    }
}
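Each branch boils down to a membership test against the two character lists. A compact stand-alone sketch of the LettersOnly case (TextFilter is a hypothetical name, and the character lists are shortened here for illustration):

```csharp
using System;
using System.Linq;

// Sketch of the LettersOnly branch: keep a character only if it appears
// in neither the special-character list nor the numeral list.
static class TextFilter
{
    const string SpecialChars = "!$%&/()=?,.;:-_#'*+~ "; // shortened list
    const string Numbers = "0123456789";

    public static string LettersOnly(string text) =>
        new string(text.Where(c => SpecialChars.IndexOf(c) == -1
                                && Numbers.IndexOf(c) == -1).ToArray());
}
```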
Test for Normality

The question whether given data are Gaussian-distributed is often raised. There are several robust and valid tests to answer it. I have implemented the "good old" Kolmogorov-Smirnov test (KS test); alternatively, one can use the D'Agostino-Pearson test. There are two new properties concerning normality testing:

- IsGaussian: Returns true if the data are numerical and the computed p-value is greater than Alpha (see below)
- Alpha: Defines the significance level for the KS test

The KS test method is shown below. The method returns true if the test is applicable; in case of non-numerical data, it returns false. The out parameter p contains the p-value on exit. This value can also be accessed via the P_Value property.
private bool KS_Test(out double p)
{
    double D = double.NaN;
    CumulativeFrequencyTableEntry<T>[] empCDF =
        GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
    double testCDF;
    double[] data = new double[empCDF.Length];
    FrequencyTableEntry<T>[] table =
        GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
    int i = 0;
    try
    {
        foreach (FrequencyTableEntry<T> entry in table)
        {
            data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
            i++;
        }
    }
    catch
    {
        p = double.NaN;
        return false;
    }
    double mean = Mean;
    double _sqrt_var = Math.Sqrt(VariancePop);
    double _sqrt2 = Math.Sqrt(2.0);
    double _erf;
    double max1 = 0.0;
    double max2 = 0.0;
    double _temp;
    for (i = 0; i < empCDF.Length; i++)
    {
        _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
        testCDF = 0.5 * (1.0 + _erf);
        _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
        if (_temp > max1)
            max1 = _temp;
        if (i > 0)
            _temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
        else
            _temp = testCDF;
        if (_temp > max2)
            max2 = _temp;
    }
    D = max1 > max2 ? max1 : max2;
    if (!Double.IsNaN(D))
    {
        double z = Math.Sqrt((double)SampleSize) * D;
        p = KS_Prob_Smirnov(z);
    }
    else
        p = double.NaN;
    return true;
}
To compute the test distribution (the Gaussian CDF in this case), we need the so-called error function. I have used the Erf implementation written by Miroslav Stampar (see Special Function(s) for C#), which is a translation of the Cephes Math Library by Stephen L. Moshier.
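For reference, a routine like KS_Prob_Smirnov typically evaluates the asymptotic Kolmogorov distribution Q(z) = 2 * sum over j >= 1 of (-1)^(j-1) * exp(-2 j^2 z^2). A sketch of such a routine is shown below; the article's exact implementation may differ.

```csharp
using System;

// Hedged sketch of the asymptotic Kolmogorov-Smirnov p-value series.
// Accuracy degrades for very small z (the series converges slowly there).
static class KS
{
    public static double ProbSmirnov(double z)
    {
        if (z <= 0.0) return 1.0; // limiting case
        double sum = 0.0;
        double sign = 1.0;
        for (int j = 1; j <= 100; j++)
        {
            double term = sign * Math.Exp(-2.0 * j * j * z * z);
            sum += term;
            if (Math.Abs(term) < 1e-12) break; // series has converged
            sign = -sign;
        }
        // Clamp to [0, 1] to guard against rounding at the extremes.
        return Math.Min(1.0, Math.Max(0.0, 2.0 * sum));
    }
}
```

Large z (a large discrepancy D) yields a small p-value, i.e., evidence against normality.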
Descriptive Statistics
I think it is useful to implement some fundamental statistical properties inside the class.
Cumulative Frequencies
First of all, we need a method that returns the empirical distribution function (the cumulative distribution function) of the given data:
public CumulativeFrequencyTableEntry<T>[]
    GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
    CumulativeFrequencyTableEntry<T>[] _output = null;
    FrequencyTableEntry<T>[] _freqTable =
        GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
    double tempCumRelFreq = 0.0;
    int tempCumAbsFreq = 0;
    int i, k;
    switch (Format)
    {
        case CumulativeFrequencyTableFormat.EachDatapoint:
            _output = new CumulativeFrequencyTableEntry<T>[SampleSize];
            for (i = 0; i < _freqTable.Length; i++)
            {
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq; k < tempCumAbsFreq; k++)
                {
                    _output[k] = new CumulativeFrequencyTableEntry<T>
                        (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
                }
            }
            break;
        case CumulativeFrequencyTableFormat.EachDatapointOnce:
            _output = new CumulativeFrequencyTableEntry<T>[Length];
            for (i = 0; i < _freqTable.Length; i++)
            {
                tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
                tempCumRelFreq += _freqTable[i].RelativeFreq;
                _output[i] = new CumulativeFrequencyTableEntry<T>
                    (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
            }
            break;
    }
    return _output;
}
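The EachDatapointOnce format is essentially a running sum over the value-sorted (value, absolute frequency) pairs. A minimal stand-alone sketch (Cumulative.Table is a hypothetical helper, not the class's method):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the "EachDatapointOnce" format: accumulate the absolute
// frequencies while walking the value-sorted pairs.
static class Cumulative
{
    public static (T Value, int CumAbs)[] Table<T>(IEnumerable<(T Value, int AbsFreq)> sorted)
    {
        var result = new List<(T, int)>();
        int running = 0;
        foreach (var (value, abs) in sorted)
        {
            running += abs;             // running total so far
            result.Add((value, running));
        }
        return result.ToArray();
    }
}
```

The EachDatapoint variant simply repeats each row AbsoluteFreq times, so the output has one entry per data point instead of one per distinct value.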
Where Are My Data??
OK, you need an array of the added data? This is the way:
public T[] GetData(bool Pristine)
{
    T[] result = new T[SampleSize];
    if (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf =
            GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapoint);
        for (int i = 0; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else
    {
        List<int> l;
        foreach (T key in _positions.Keys)
        {
            _positions.TryGetValue(key, out l);
            foreach (int k in l)
            {
                result[k - 1] = key;
            }
        }
    }
    return result;
}
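The Pristine branch can be sketched in isolation: given each distinct value's 1-based insertion positions, the original input sequence falls out directly (Pristine.Reconstruct is a hypothetical helper, not the class's method):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the Pristine branch: rebuild the input order from the
// recorded 1-based insertion positions of each distinct value.
static class Pristine
{
    public static T[] Reconstruct<T>(Dictionary<T, List<int>> positions, int sampleSize)
    {
        var result = new T[sampleSize];
        foreach (KeyValuePair<T, List<int>> kv in positions)
            foreach (int k in kv.Value)
                result[k - 1] = kv.Key; // positions are 1-based
        return result;
    }
}
```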
What Else?

There are some public properties concerning descriptive statistics:

- Mean
- Median
- Mode
- Minimum
- Maximum
- VarianceSample
- VariancePop (unbiased estimator)
- StandardDevSample
- StandardDevPop (unbiased estimator)
- StandardError
- Sum
- SampleSize - the number of data points (read only)
- HighestFrequency - the highest frequency observed
- SmallestFrequency - the smallest frequency observed
- ScarcestValue - the scarcest value
- Kurtosis
- KurtosisExcess
- Skewness

If the data are not numerical, all of the properties above return double.NaN.
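As a side note on the "unbiased estimator" remark: the unbiased variant divides the sum of squared deviations by n - 1 instead of n. A stand-alone sketch (VarianceSketch is a hypothetical helper, not the class's code):

```csharp
using System;
using System.Linq;

// Illustrates the biased (divide by n) vs. unbiased (divide by n - 1)
// variance estimators mentioned in the property list above.
static class VarianceSketch
{
    public static double Biased(double[] x)
    {
        double mean = x.Average();
        return x.Sum(v => (v - mean) * (v - mean)) / x.Length;
    }

    public static double Unbiased(double[] x)
    {
        double mean = x.Average();
        return x.Sum(v => (v - mean) * (v - mean)) / (x.Length - 1);
    }
}
```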
Miscellaneous

Here is a list of the remaining public properties and methods.

Properties

- Length - the number of table entries (read only)
- Tag - an object which can be set by the user (writable)
- Description - the description of the table (writable)
- P_Value - contains the p-value computed by the Kolmogorov-Smirnov test

Methods

- Add(T Value) and Add(T Text, TextAnalyzeMode mode)
- Remove(T Value)
- GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting is done using the Quicksort algorithm)
- GetEnumerator()
- ContainsValue(T value)
- GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
- GetData(bool Pristine) - returns the data as an array (sorted or in input order)
- GetRelativeFrequency(T value, out double relFreq)
The code is (I think) well documented, so you can use it to get a detailed insight into my solution. I am sure this solution is not perfect, but it is a good starting point.
For a better overview, I have added a compiled help file (see download at the top of this page).
History

- Version 1.0 - 18 Jan '07
- Version 1.5 - 04 Feb '07
  - Minor bug fixes (highest frequency was not set correctly)
  - Added normality testing
  - Added descriptive statistics
- 09 Feb '07 - P_Value added, release number not changed
- Version 2.0 - 26 Feb '07 - GetData(bool Pristine) added