Introduction
Have you ever wondered how those big tables of P-Values for the Chi-Squared Test are Calculated?
Have you ever wanted to calculate those values yourself?
Have you ever asked for help online to calculate those values, and someone gave you an annoying link to a webpage that had either the tables or a form applet that would calculate it for you, with no mention What. So. Ever of the actual FORMULA!?!?!?!?
Then you're in good company. Keep on reading, this is for you.
Background
The Chi-Squared Distribution is probably the most widely used distribution in Statistics today.
It is most well known for its use in Pearson's Chi-Squared Test which is used to measure goodness of fit.
int Observed[256];
int Expected[256];
double CriticalValue = 0.0;
double XSqr;
for(int I = 0; I < 256; I++)
{
XSqr = Observed[I] - Expected[I];
CriticalValue += ((XSqr * XSqr) / Expected[I]);
}
Above is a brief example of the first part of Pearson's Chi-Squared Test. This demonstrates how we obtain the two values we need to calculate the P-Value: the Degrees of Freedom and the Critical Value.
What is this 'P', and why is it so Valuable?
For those of you who are new to this, the P-Value is the probability that the difference between the Observed and the Expected values would happen purely by chance. For instance, each side of a fair coin should have an equal probability of
occurring.
Heads = 50.0%
Tails = 50.0%
So in theory if you flip a coin 1,000 times, Heads and Tails should each occur 500 times. In reality, this rarely happens. You're much more likely to have Heads land head up 507 times, and Tails 493, or vice versa. So if it's a little bit different, that's understandable.
But when does that difference become just to much. A fair coin will probably land Heads up 507/1000. But what if it's 515? What if it's 600?
This is what the P-Value is for. It allows you to Objectively assign a probability to an Observed value
randomly occurring by comparing it with the Expected Value.
Using the code
Above I gave a brief example of how you calculate the Critical Value and Degrees of Freedom for a Chi-Squared Test. This was for demonstration only, and is not in the included code, as you will undoubtedly have to customize it for your own situation. If you have any problems with this, there are plenty of excellent tutorials online demonstrating how to calculate these values. I would highly recommend any number of these.
But that's not what I'm here to look at. I'm here to show you how to Calculate the actual P-Value. So let's look at some code.
double chisqr(int Dof, double Cv)
{
if(Cv < 0 || Dof < 1)
{
return 0.0;
}
double K = ((double)Dof) * 0.5;
double X = Cv * 0.5;
if(Dof == 2)
{
return exp(-1.0 * X);
}
double PValue = igf(K, X);
if(isnan(PValue) || isinf(PValue) || PValue <= 1e-8)
{
return 1e-14;
}
PValue /= gamma(K);
return (1.0 - PValue);
}
This is the main function that you will be using, found in chisqr.c. It's very simple; you put in your values, and it spits out the P-Value; You'll notice,
if you only have 2 Degrees of Freedom then you're in luck. There is a shortcut to calculating the P-Value using only one call to exp()
,
which is considerably faster than chisqr()
would be otherwise
There are two other functions that you will need to in order to finish calculating the P-Value.
static double igf(double S, double Z)
{
if(Z < 0.0)
{
return 0.0;
}
double Sc = (1.0 / S);
Sc *= pow(Z, S);
Sc *= exp(-Z);
double Sum = 1.0;
double Nom = 1.0;
double Denom = 1.0;
for(int I = 0; I < 200; I++)
{
Nom *= Z;
S++;
Denom *= S;
Sum += (Nom / Denom);
}
return Sum * Sc;
}
Next up is the Incomplete Gamma Function. This is what computes the first part of the P-Value. If you're looking for an explanation as to how this relates to calculating the Chi-Squared Value, you can find it in this section on Wiki's Chi Square Page.
As a side note, in chisqr.c. I give a brief explanation as to why I have the for loop go for 200
iterations. The short answer is, this is what I arrived at by trial and error. If you have any experience in this area and know anything more about this, please tell me in the comments below.
And last but not least...
double gamma(double N)
{
const long double SQRT2PI = 2.5066282746310005024157652848110452530069867406099383;
long double Z = (long double)N;
long double Sc = powl((Z + A), (Z + 0.5));
Sc *= expl(-1.0 * (Z + A));
Sc /= Z;
long double F = 1.0;
long double Ck;
long double Sum = SQRT2PI;
for(int K = 1; K < A; K++)
{
Z++;
Ck = powl(A - K, K - 0.5);
Ck *= expl(A - K);
Ck /= F;
Sum += (Ck / Z);
F *= (-1.0 * K);
}
return (double)(Sum * Sc);
}
Say hello to the Gamma Function. For those of you who are unfamiliar with it, the Gamma Function allows you to compute the Factorial of decimals (i.e. 5.5!).
Quick Sidenote. If you've never used the Gamma Function before, there's something you should know about it. It will actually compute the factorial of the Number you put in, Minus 1. Gamma(X)
= (X - 1)! So if you put in Gamma(5)
you won't get 120, you'll get 24. "Why", you ask? I don't know: mathematicians are just weird like that. I think it's how they punish society for calling them nerds for all those years, but that's just my theory.
Depending whatever language you're using, you're standard math library might already have the gamma function built in. For instance in C,
math.h you can calculate the gamma function by calling tgamma()
. I would suggest you check to see if you have it, as the implementations in the standard libraries are usually more efficient, faster, and more accurate.
Now the Gamma Function you see above is an implementation of Spouge's Approximation. The reason that I like it, is that it allows you to compute the Gamma Function to any accuracy you desire, so long as you have a datatype big enough to hold the required data. You can implement this using an Arbitrarily Precise Library (in fact, I think it already is) and calculate your desired accuracy.
It is not, however, the fastest version of the Gamma Function. There are several approximations that actually give a value that is more than accurate enough for a Chi-Squared Test.
double approx_gamma(double Z)
{
const double RECIP_E = 0.36787944117144232159552377016147;
const double TWOPI = 6.283185307179586476925286766559;
double D = 1.0 / (10.0 * Z);
D = 1.0 / ((12 * Z) - D);
D = (D + Z) * RECIP_E;
D = pow(D, Z);
D *= sqrt(TWOPI / Z);
return D;
}
This is a simple version of Stirling's Approximation. This will give you around 6-7 digits of accuracy.
I compared the P-Value I got using this one versus using my other implementation and the one in math.h.
chisqr(255, 290.285192) = 0.06364235 chisqr(255, 290.285192) = 0.06364235 chisqr(255, 290.285192) = 0.06364235
So yeah, tried the gamma function in the math library, my implementation of Spouge's Approximation, and Stirling's
Approximation printed them out to the screen: no difference. Some of the lower bits are probably different, but this is already much more accurate than any table or calculator you'd find online. So unless you are going for pure accuracy,
I'd suggest sticking with approx_gamma()
.
Conclusion
Jeez, that's a lot of code. Well thank you everyone for sticking with me through all of this. I've found the algorithms to be very important, and I'm happy I get to share them with you all. If you have any questions about, please feel free to ask and I'll do my best to answer them. If you want to know anything about the specific formulas themselves, then I'd suggest looking in the links I gave. If you have any comments, criticism, suggestions, or improvements, please feel free to post a comment.
One last thing. Every once in a while over the last few years I've been in need of these functions and have scrambled about trying to find them, often to find them placed under heavy and restrictive licenses. This in and of itself isn't a bad thing, but considering how important these are and how popular they are in math and testing equipment, I was simply amazed that there were so few implementations freely available.
So, to make sure this is absolutely clear.
This code is place in the Public Domain, and may be used by anyone for any reason for any purpose with no charge, no permission needed, etc.
I'll admit these aren't the best implementations of these functions, but they work, and they're better than nothing at all.
Thanks,
Jacob W.
Update
8/2/2012: I was doing some testing when I realized I had forgotten to account for when the PValue
is nan
or inf
. I've updated the example with a corrected version, as well as the ZIP file. My apologies if this confused anybody.
11/6/2012: Uploaded chisqr_v2.zip
. ChiSqr() has had a significant upgrade. I've modified the code so that it now uses Natural Logarithms to help calculate the ChiSqr P-Value. This gives it a huge increase in range and accuracy. It is slower though. Sadly I haven't found away to improve this one aspect. Despite that, I would highly recommend using this new Version over the original, as I believe it is more than worth it. I'll update the examples in the Article in a few days. Once again, all code is placed in the Public Domain.