Introduction
The 80-20 rule applies: despite the advances in statistics, most of our work requires only univariate descriptive statistics, which involve calculating the mean, standard deviation, range, skewness, kurtosis, percentiles, quartiles, and so on. This article describes a simple way to construct a set of classes that implement descriptive statistics in C#. The emphasis is on ease of use at the user's end.
Requirements
To run the code, you need to have the following:
- .NET Framework 2.0 or above
- Microsoft Visual Studio 2005, if you want to open the project files included in the download
- NUnit 2.4, if you want to run the unit tests included in the download
The download included in this article is implemented as a class library. You will need to add a reference to the project to use its functionality.
The download also includes NUnit tests in case you want to change the code and run your own unit tests.
The Code
The goal of the code design is to simplify usage. We envisage that the user will write code like the following to get the desired results. This involves a simple three-step process:
- Instantiate a Descriptive object
- Invoke its .Analyze() method
- Retrieve results from its .Result object
Here is a typical user’s code:
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
desp.Analyze();
Console.WriteLine("Result is: " + desp.Result.FirstQuartile.ToString());
Two classes are implemented:
- DescriptiveResult
- Descriptive
DescriptiveResult is the class that holds the analysis results. The .Result member of a Descriptive object is an instance of this class, which is defined as follows:
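public class DescriptiveResult
{
    // Sorted copy of the data, filled in by Descriptive.Analyze()
    internal double[] sortedData;

    public DescriptiveResult() { }

    public uint Count;
    public double Sum;
    public double Mean;
    public double GeometricMean;
    public double HarmonicMean;
    public double Min;
    public double Max;
    public double Range;
    public double Variance;
    public double StdDev;
    public double Skewness;
    public double Kurtosis;
    public double IQR;
    public double Median;
    public double FirstQuartile;
    public double ThirdQuartile;

    internal double SumOfError;
    internal double SumOfErrorSquare;

    // The listing breaks off here in the source. The Percentile member
    // described in the text is sketched as an assumption: it forwards to
    // the static percentile helper of the Descriptive class.
    public double Percentile(double percent)
    {
        return Descriptive.percentile(sortedData, percent);
    }
}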
For simplicity, most members are implemented as public fields. The only member function, Percentile, takes a percentage as its argument (e.g. 30 for 30%) and returns the corresponding percentile.
The following table lists the available results (assuming that the Descriptive object you use is named desp):
Result | Result stored in variable |
Number of data points | desp.Result.Count |
Minimum value | desp.Result.Min |
Maximum value | desp.Result.Max |
Range of values | desp.Result.Range |
Sum of values | desp.Result.Sum |
Arithmetic mean | desp.Result.Mean |
Geometric mean | desp.Result.GeometricMean |
Harmonic mean | desp.Result.HarmonicMean |
Sample variance | desp.Result.Variance |
Sample standard deviation | desp.Result.StdDev |
Skewness of the distribution | desp.Result.Skewness |
Kurtosis of the distribution | desp.Result.Kurtosis |
Interquartile range | desp.Result.IQR |
Median (50th percentile) | desp.Result.Median |
First quartile (25th percentile) | desp.Result.FirstQuartile |
Third quartile (75th percentile) | desp.Result.ThirdQuartile |
Percentile | desp.Result.Percentile() * |
* The argument of Percentile() is a value from 0 to 100 that indicates the desired percentile.
Descriptive Class
The Descriptive class does all the analysis, and it is implemented as follows:
public class Descriptive
{
    private double[] data;
    private double[] sortedData;

    public DescriptiveResult Result = new DescriptiveResult();

    #region Constructors
    public Descriptive() { }

    public Descriptive(double[] dataVariable)
    {
        data = dataVariable;
    }
    #endregion // Constructors
Note that we need a sortedData array to support the percentile and quartile-related statistics. It stores a sorted copy of the user's data.
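To illustrate why a separate sorted copy is kept, here is a minimal sketch. The names SortedCopyDemo and MakeSortedCopy are illustrative and are not part of the download. The sketch sorts a clone of the input so that the caller's array keeps its original order:

```csharp
using System;

// Illustrative helper; not part of the article's download.
public static class SortedCopyDemo
{
    // Returns an ascending-sorted copy and leaves the input array untouched.
    public static double[] MakeSortedCopy(double[] data)
    {
        double[] copy = (double[])data.Clone();
        Array.Sort(copy);
        return copy;
    }
}
```

For an input of {8, 1, 4}, MakeSortedCopy returns {1, 4, 8}, while the first element of the input array is still 8.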
The constructor of the Descriptive class allows the user to assign the data array during object instantiation:
double[] x = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
Once the Descriptive object is instantiated, the user only needs to call the .Analyze() method to perform the analysis. Subsequently, the user can retrieve the analysis results from the .Result object of the Descriptive object.
The Analyze() method is implemented as follows:
public void Analyze()
{
    // Reset the basic results before recomputing them
    Result.Count = 0;
    Result.Min = Result.Max = Result.Range = Result.Mean =
        Result.Sum = Result.StdDev = Result.Variance = 0.0d;

    double sumOfSquare = 0.0d;
    double sumOfESquare = 0.0d;
    double[] squares = new double[data.Length];
    double cumProduct = 1.0d;     // running product, for the geometric mean
    double cumReciprocal = 0.0d;  // running sum of reciprocals, for the harmonic mean

    // First pass: minimum, maximum, sum, product and sum of reciprocals
    for (int i = 0; i < data.Length; i++)
    {
        if (i == 0)
        {
            Result.Min = data[i];
            Result.Max = data[i];
            Result.Mean = data[i];
            Result.Range = 0.0d;
        }
        else
        {
            if (data[i] < Result.Min) Result.Min = data[i];
            if (data[i] > Result.Max) Result.Max = data[i];
        }
        Result.Sum += data[i];
        squares[i] = Math.Pow(data[i], 2);
        sumOfSquare += squares[i];
        cumProduct *= data[i];
        cumReciprocal += 1.0d / data[i];
    }

    Result.Count = (uint)data.Length;
    double n = (double)Result.Count;
    Result.Mean = Result.Sum / n;
    Result.GeometricMean = Math.Pow(cumProduct, 1.0 / n);
    Result.HarmonicMean = 1.0d / (cumReciprocal / n);
    Result.Range = Result.Max - Result.Min;

    // Second pass: sums of powers of the deviations from the mean
    double m1 = 0.0d;
    double m2 = 0.0d;
    double m3 = 0.0d;
    double m4 = 0.0d;
    for (int i = 0; i < data.Length; i++)
    {
        double m = data[i] - Result.Mean;
        double mPow2 = m * m;
        double mPow3 = mPow2 * m;
        double mPow4 = mPow3 * m;
        m1 += Math.Abs(m);
        m2 += mPow2;
        m3 += mPow3;
        m4 += mPow4;
    }
    Result.SumOfError = m1;
    Result.SumOfErrorSquare = m2;
    sumOfESquare = m2;

    // Sample variance and standard deviation (n - 1 in the denominator)
    Result.Variance = sumOfESquare / ((double)Result.Count - 1);
    Result.StdDev = Math.Sqrt(Result.Variance);

    // Sample skewness, from the cubes of the standardized deviations
    double skewCum = 0.0d;
    for (int i = 0; i < data.Length; i++)
    {
        skewCum += Math.Pow((data[i] - Result.Mean) / Result.StdDev, 3);
    }
    Result.Skewness = n / (n - 1) / (n - 2) * skewCum;

    // Sample excess kurtosis
    double m2_2 = Math.Pow(sumOfESquare, 2);
    Result.Kurtosis = ((n + 1) * n * (n - 1)) / ((n - 2) * (n - 3)) *
        (m4 / m2_2) -
        3 * Math.Pow(n - 1, 2) / ((n - 2) * (n - 3));

    // Sort a copy so that the caller's array keeps its original order
    sortedData = new double[data.Length];
    data.CopyTo(sortedData, 0);
    Array.Sort(sortedData);

    // Give the result object its own sorted copy for Percentile() calls
    Result.sortedData = new double[data.Length];
    sortedData.CopyTo(Result.sortedData, 0);

    // Quartiles, median and interquartile range
    Result.FirstQuartile = percentile(sortedData, 25);
    Result.ThirdQuartile = percentile(sortedData, 75);
    Result.Median = percentile(sortedData, 50);
    Result.IQR = Result.ThirdQuartile - Result.FirstQuartile;
}
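For reference, the quantities that the Analyze() listing evaluates can be written out as formulas. This is a reading of the code above rather than an independent derivation, and the symbols are notation introduced here: $x_i$ for the data values, $n$ for Result.Count, $\bar{x}$ for the arithmetic mean, $G$ and $H$ for the geometric and harmonic means, $s$ for the standard deviation, and $g_1$, $g_2$ for skewness and kurtosis.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
G &= \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n} \\
H &= \frac{n}{\sum_{i=1}^{n} 1/x_i} \\
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2, \qquad s = \sqrt{s^2} \\
g_1 &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\Bigl(\frac{x_i-\bar{x}}{s}\Bigr)^{3} \\
g_2 &= \frac{n(n+1)(n-1)}{(n-2)(n-3)}\cdot
       \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{\bigl(\sum_{i=1}^{n}(x_i-\bar{x})^2\bigr)^{2}}
       - \frac{3(n-1)^2}{(n-2)(n-3)}
\end{align*}
```

The denominators show where small samples run into trouble: the variance needs at least 2 data points, the skewness at least 3, and the kurtosis at least 4. The harmonic mean also assumes that no value is zero, and the geometric mean is meaningful only for positive data.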
The calculations of the descriptive statistics are quite straightforward, except for the percentile function (and the quartile calculations that depend on it), which is a little tricky. Therefore, I have a separate function to handle it, as follows:
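internal static double percentile(double[] sortedData, double p)
{
    // A percentile of 100 or more always maps to the largest value
    if (p >= 100.0d) return sortedData[sortedData.Length - 1];

    // A single value has no neighbour to interpolate with; without this
    // check the index arithmetic below would read past the end of the array
    if (sortedData.Length == 1) return sortedData[0];

    double position = (double)(sortedData.Length + 1) * p / 100.0;
    double leftNumber = 0.0d, rightNumber = 0.0d;

    // 1-based fractional rank of the requested percentile
    double n = p / 100.0d * (sortedData.Length - 1) + 1.0d;

    if (position >= 1)
    {
        // The two neighbouring values that bracket rank n
        leftNumber = sortedData[(int)System.Math.Floor(n) - 1];
        rightNumber = sortedData[(int)System.Math.Floor(n)];
    }
    else
    {
        // Very low percentiles fall between the first two values
        leftNumber = sortedData[0];
        rightNumber = sortedData[1];
    }

    if (leftNumber == rightNumber)
        return leftNumber;
    else
    {
        // Linear interpolation by the fractional part of the rank
        double part = n - System.Math.Floor(n);
        return leftNumber + part * (rightNumber - leftNumber);
    }
}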
The percentile algorithm is derived from Amir Aczel’s book "Complete Business Statistics".
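To make the interpolation step concrete, here is a condensed sketch that follows the same arithmetic as the listing above for percentages between 0 and 100. The names PercentileSketch and Interpolate are illustrative and are not part of the download, and the sketch assumes a sorted array with at least two elements:

```csharp
using System;

// Illustrative restatement of the interpolation step; not part of the download.
public static class PercentileSketch
{
    // Assumes sortedData is sorted ascending, has at least two elements,
    // and that p lies in the range 0 to 100.
    public static double Interpolate(double[] sortedData, double p)
    {
        if (p >= 100.0) return sortedData[sortedData.Length - 1];

        // 1-based fractional rank: p = 0 maps to rank 1, p = 100 to rank Length.
        double rank = p / 100.0 * (sortedData.Length - 1) + 1.0;
        int lower = (int)Math.Floor(rank);   // 1-based index of the left neighbour
        double fraction = rank - lower;      // how far the rank lies past it

        double left = sortedData[lower - 1];
        double right = sortedData[lower];
        return left + fraction * (right - left);
    }
}
```

For the sample data {1, 2, 4, 7, 8, 9, 10, 12}, the 25th percentile has rank 0.25 * 7 + 1 = 2.75, which lies three quarters of the way from the second value (2) to the third value (4), giving 3.5. The same steps give 7.5 for the median and 9.25 for the third quartile.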
Conclusion
The descriptive statistics program presented here provides a simple way to obtain commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.
History
- 28th June, 2008: Initial post
About the Author
Jan Low, PhD, is a senior software architect at Foundasoft.com, Malaysia. He is also the author of various text analysis software, statistical libraries, image processing libraries, and security encryption components. He programs primarily in C#, C++ and VB.NET.