Introduction
The MongoDB Aggregation Framework is well suited to analysing survey data. One of the most common analyses performed on such data is calculating the conditional probability of an event, that is, the probability of one event given that another has occurred. This article demonstrates, by way of an example, how to calculate such a probability using Bayes' theorem.
Background
Bayes' theorem is stated mathematically as the following equation:
P(A|X) = [P(X|A) P(A)]/P(X) = [P(X|A) P(A)] / [P(X|A) P(A) + P(X | ~A) P(~A)]
where A and X are events.
- P(A) and P(X) are the probabilities of A and X without regard to each other.
- P(A | X), a conditional probability, is the probability of observing event A given that X is true.
- P(X | A) is the probability of observing event X given that A is true.
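The second form of the equation follows from splitting X into the case where A holds and the case where it does not (the law of total probability). Written out step by step, with ~A written as \lnot A:

```latex
\begin{aligned}
P(X) &= P(X \cap A) + P(X \cap \lnot A) \\
     &= P(X \mid A)\,P(A) + P(X \mid \lnot A)\,P(\lnot A), \\
P(A \mid X) &= \frac{P(X \cap A)}{P(X)}
             = \frac{P(X \mid A)\,P(A)}{P(X \mid A)\,P(A) + P(X \mid \lnot A)\,P(\lnot A)}.
\end{aligned}
```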
Suppose we want to know the probability of having breast cancer given we have a positive test result.
i.e.:
- A = the event of having breast cancer
- X = the event of testing positive
Using the Code
Let’s assume we have tested 200 volunteers, some of whom have a preexisting condition. The records are kept in a MongoDB database as follows:
{ "_id" : ObjectId("56a92adab2326d187c099531"), "patientName" : "cxb yyy", "mammogramResult" : 0, "diagnosedBefore" : 0 }
{ "_id" : ObjectId("56a92adab2326d187c09953d"), "patientName" : "pxx yyy", "mammogramResult" : 0, "diagnosedBefore" : 1 }
{ "_id" : ObjectId("56a92adab2326d187c099540"), "patientName" : "sxx yyy", "mammogramResult" : 1, "diagnosedBefore" : 0 }
{ "_id" : ObjectId("56aa4cda358f09b19d447bab"), "patientName" : "fxx rrr", "mammogramResult" : 1, "diagnosedBefore" : 1 }
Based on the device manufacturer’s specification, let’s assume the mammogram has the following attributes:
- 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
- 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).
We want to know the probability of having breast cancer given that the mammogram returned a positive result.
Let:
- A: the event of having breast cancer
- X: the event of a positive mammogram result
- P(X) is the probability of a positive mammogram result
- P(A) is the probability of having breast cancer
- P(A | X) is the probability of having breast cancer given a positive test result
- P(X | A) is the probability of a positive test result given that you have breast cancer
From the given mammogram specification, we have the following figures:
- Mammogram true positive rate = 80%
- Mammogram false negative rate = 20%
- Mammogram false positive rate = 9.6%
- Mammogram true negative rate = 90.4%
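In terms of the events A and X, the manufacturer's specification fixes the four conditional probabilities of the test outcome. Each row below sums to 1, because a test on a given patient is either positive or negative:

```latex
\begin{aligned}
P(X \mid A) &= 0.800, & P(\lnot X \mid A) &= 0.200, \\
P(X \mid \lnot A) &= 0.096, & P(\lnot X \mid \lnot A) &= 0.904.
\end{aligned}
```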
Let’s read our sample data from the survey collection (the database is called bayesdb) and count the records:
// Requires the MongoDB.Driver NuGet package.
using MongoDB.Bson;
using MongoDB.Driver;

var connectionString = "mongodb://localhost:27017";
var client = new MongoClient(connectionString);
var db = client.GetDatabase("bayesdb");
var col = db.GetCollection<BsonDocument>("survey");
// Load every survey document; an empty filter matches all of them.
var surveyList = await col.Find(new BsonDocument()).ToListAsync();
var totalSurveyCount = surveyList.Count;
Console.WriteLine("Count of collected survey records: {0}", totalSurveyCount);
Next, determine how many of those tested have a preexisting diagnosis of breast cancer by grouping on diagnosedBefore. To find out how many tested positive and how many tested negative, we group on mammogramResult in the same way.
// Count the documents for each diagnosedBefore value (0 or 1).
var previousDiagnosis = col.Aggregate()
    .Group(new BsonDocument { { "_id", "$diagnosedBefore" },
                              { "count", new BsonDocument("$sum", 1) } });
var results = await previousDiagnosis.ToListAsync();
// Count the documents for each mammogramResult value (0 or 1).
var mammogramTestResults = await col.Aggregate()
    .Group(new BsonDocument { { "_id", "$mammogramResult" },
                              { "count", new BsonDocument("$sum", 1) } })
    .ToListAsync();
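To make the grouping concrete, the sketch below reproduces the same count-per-key logic in memory with LINQ instead of MongoDB. The sample tuples and the CountBy helper are hypothetical names used only for illustration; they are not part of the article's code or of the MongoDB driver.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a few survey documents:
// (mammogramResult, diagnosedBefore), each 0 or 1.
var sample = new List<(int MammogramResult, int DiagnosedBefore)>
{
    (0, 0), (0, 1), (1, 0), (1, 1), (0, 0),
};

// Same shape as { $group: { _id: "$key", count: { $sum: 1 } } }:
// one entry per distinct key, holding the number of items with that key.
Dictionary<int, int> CountBy(IEnumerable<int> keys) =>
    keys.GroupBy(k => k).ToDictionary(g => g.Key, g => g.Count());

var diagnosedCounts = CountBy(sample.Select(s => s.DiagnosedBefore));
var mammogramCounts = CountBy(sample.Select(s => s.MammogramResult));

Console.WriteLine("diagnosedBefore = 1: {0}", diagnosedCounts[1]);
Console.WriteLine("mammogramResult = 1: {0}", mammogramCounts[1]);
```

Looking counts up by key, rather than by position in the result list, avoids relying on the order in which the groups happen to be returned.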
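From the survey data, we calculate the following parameters:
P(A): the probability of having breast cancer:
// $group does not guarantee the order of its output, so pick the
// diagnosedBefore == 1 group explicitly instead of taking the first one.
var PreviousDiagnosisPositive = results.First(g => g["_id"].ToInt32() == 1);
int PositiveCount = PreviousDiagnosisPositive["count"].ToInt32();
double PrA = (double)PositiveCount / totalSurveyCount;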
P(~A): the probability of NOT having cancer = 1 - P(A)
double PrnA = 1 - PrA;
P(X|A) P(A): the probability of a true positive, i.e., a positive result from someone with a preexisting condition of breast cancer, weighted by the probability of having that condition
var PrXA = mammogramConstants.mammogramTP * PrA;
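The struct mammogramConstants contains all the necessary constants.
public struct mammogramConstants
{
    public const double mammogramTP = 0.800; // true positive rate, P(X | A)
    public const double mammogramFN = 0.200; // false negative rate, P(~X | A)
    public const double mammogramFP = 0.096; // false positive rate, P(X | ~A)
    public const double mammogramTN = 0.904; // true negative rate, P(~X | ~A)
}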
P(X | ~A) P(~A): the probability of a false positive, i.e., a positive result from someone without breast cancer, weighted by the probability of not having it
var PrXnA = mammogramConstants.mammogramFP * PrnA;
P(X): the probability of getting any positive result (True Positive + False Positive)
i.e., P(X) = P(X | A) P(A) + P(X | ~A) P(~A)
var PrX = PrXA + PrXnA;
Finally:
P(A|X): the probability of having cancer given a positive test result
var PrAX = (mammogramConstants.mammogramTP * PrA) / PrX;
Show the result:
Console.WriteLine("Based on the given survey data, a positive result means:");
Console.WriteLine("you have a {0:0.00}% chance of having cancer.", PrAX * 100);
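The individual steps can also be collected into a single function, which makes the calculation easy to check without a database. This is only a sketch: the Posterior function, its parameter names, and the prior of 2 in 200 are assumptions made for illustration and do not come from the survey data.

```csharp
using System;

// Posterior P(A|X) from the prior P(A), the true positive rate P(X|A)
// and the false positive rate P(X|~A).
double Posterior(double prior, double truePositiveRate, double falsePositiveRate)
{
    if (prior < 0 || prior > 1)
        throw new ArgumentOutOfRangeException(nameof(prior));

    double jointTruePositive = truePositiveRate * prior;          // P(X|A) P(A)
    double jointFalsePositive = falsePositiveRate * (1 - prior);  // P(X|~A) P(~A)
    double evidence = jointTruePositive + jointFalsePositive;     // P(X)

    // If a positive result can never occur, the posterior is undefined.
    if (evidence == 0)
        throw new InvalidOperationException("P(X) is zero.");

    return jointTruePositive / evidence;
}

// Assumed prior for illustration only: 2 of 200 volunteers previously diagnosed.
double surveyPrior = 2.0 / 200;
double posterior = Posterior(surveyPrior, 0.800, 0.096);
Console.WriteLine("P(A|X) = {0:0.00}%", posterior * 100);
```

Keeping the rates as parameters rather than constants makes it straightforward to try the same calculation with a different test specification or a different prior.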
Points of Interest
Even though the mammogram specification states that the test detects 80% of breast cancers, that figure is P(X | A), not P(A | X). Based on Bayes' theorem and the data in the survey collection, a positive result means only a ~7.76% chance of having breast cancer.
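As a check on that figure, ~7.76% is what Bayes' theorem gives when the prior is P(A) = 0.01, for example 2 previously diagnosed volunteers out of 200. This prior is an assumption inferred from the result, since the survey counts themselves are not listed here:

```latex
\begin{aligned}
P(A \mid X) &= \frac{0.800 \times 0.01}{0.800 \times 0.01 + 0.096 \times 0.99} \\
            &= \frac{0.008}{0.008 + 0.09504} \\
            &= \frac{0.008}{0.10304} \approx 0.0776
\end{aligned}
```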
History