
Calculating the Probability of Related Events Based on Bayes' Theorem Using MongoDB Aggregation Framework in C#

30 Jan 2016

Introduction

The MongoDB Aggregation Framework is useful for analysing survey data. One of the most common analyses of survey data is calculating the conditional probability of an event. This article demonstrates, with an example, how to calculate the probability of dependent events using Bayes' theorem.

Background

Bayes' theorem is stated mathematically as the following equation:

P(A|X) = [P(X|A) P(A)]/P(X) = [P(X|A) P(A)] / [P(X|A) P(A) + P(X | ~A) P(~A)]

Where A and X are events.

  • P(A) and P(X) are the probabilities of A and X without regard to each other.
  • P(A | X), a conditional probability, is the probability of observing event A given that X is true.
  • P(X | A) is the probability of observing event X given that A is true.

Suppose we want to know the probability of having breast cancer given we have a positive test result.

i.e.,  A  =  the event of having breast cancer
       X  =  the event of testing positive

Using the Code

Let’s assume we have tested 200 volunteers, some of whom have a preexisting condition. We keep the records in a MongoDB database as follows:

{ "_id" : ObjectId("56a92adab2326d187c099531"), "patientName" : "cxb yyy", "mammogramResult" : 0, "diagnosedBefore" : 0 }
{ "_id" : ObjectId("56a92adab2326d187c09953d"), "patientName" : "pxx yyy", "mammogramResult" : 0, "diagnosedBefore" : 1 }
{ "_id" : ObjectId("56a92adab2326d187c099540"), "patientName" : "sxx yyy", "mammogramResult" : 1, "diagnosedBefore" : 0 }
{ "_id" : ObjectId("56aa4cda358f09b19d447bab"), "patientName" : "fxx rrr", "mammogramResult" : 1, "diagnosedBefore" : 1 }

Based on the device manufacturer’s specification, let’s assume the mammogram has the following attributes:

  • 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
  • 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

We want to know the probability of having breast cancer given that the mammogram gave a positive result.

Let:

  • A: the event of having breast cancer
  • X: the event of a positive mammogram result
  • P(X) is the probability of a positive mammogram result
  • P(A) is the probability of having breast cancer
  • P(A | X) is the probability of having breast cancer given a positive test result
  • P(X | A) is the probability of a positive test result given breast cancer

From the given mammogram specification, we have the following figures:

  • Mammogram true positive rate = 80%
  • Mammogram false negative rate = 20%
  • Mammogram false positive rate = 9.6%
  • Mammogram true negative rate = 90.4%

Let’s aggregate our sample data in survey collection (the database is called bayesdb):

// Requires: using System.Linq; using MongoDB.Bson; using MongoDB.Driver;
var connectionString = "mongodb://localhost:27017";
var client = new MongoClient(connectionString);
var db = client.GetDatabase("bayesdb");
var col = db.GetCollection<BsonDocument>("survey");
// Load every document; an empty filter matches the whole collection.
var surveyList = await col.Find(new BsonDocument()).ToListAsync();
var totalSurveyCount = surveyList.Count;
Console.WriteLine("Count of collected survey: {0}", totalSurveyCount);

Next, determine how many of those tested have a preexisting diagnosis of breast cancer by grouping on diagnosedBefore. To find how many tested positive and how many tested negative, group on mammogramResult.

// Count documents per diagnosedBefore value.
var previousDiagnosis = col.Aggregate()
    .Group(new BsonDocument { { "_id", "$diagnosedBefore" },
                              { "count", new BsonDocument("$sum", 1) } });
var results = await previousDiagnosis.ToListAsync();
// Count documents per mammogramResult value.
var mammogramTestResults = await col.Aggregate()
    .Group(new BsonDocument { { "_id", "$mammogramResult" },
                              { "count", new BsonDocument("$sum", 1) } })
    .ToListAsync();
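Each $group stage above produces one output document per distinct value of the grouped field, with a running count. As a rough illustration only, the same computation over in-memory data could be sketched with LINQ. `GroupCountSketch` and `CountBy` are hypothetical names introduced here, not part of the MongoDB driver.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the MongoDB driver.
public static class GroupCountSketch
{
    // In-memory analogue of { $group: { _id: "$field", count: { $sum: 1 } } }:
    // returns one entry per distinct key with the number of rows sharing it.
    public static Dictionary<int, int> CountBy<T>(IEnumerable<T> rows, Func<T, int> key)
    {
        return rows.GroupBy(key).ToDictionary(g => g.Key, g => g.Count());
    }
}
```

A key that never occurs in the data is simply absent from the result. The same holds for $group, so code that reads a specific group should allow for it being missing.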

From the survey data, we calculate the following parameters:
P(A): the probability of having breast cancer:

// Pr(A) is the probability of having cancer => cancerExistingConditionsCount / totalCount = ~1%
// $group output order is not guaranteed, so select the diagnosedBefore == 1 group explicitly.
    var PreviousDiagnosisPositive = results.First(g => g["_id"].ToInt32() == 1);
    int PositiveCount = PreviousDiagnosisPositive["count"].ToInt32();
    double PrA = (double)PositiveCount / totalSurveyCount;

P(~A): the probability of NOT having cancer = 1 - P(A)

// Pr(~A) is the complement of Pr(A)
    double PrnA = 1 - PrA;

P(X|A) P(A): the probability of testing positive and having breast cancer, i.e. the true positive rate P(X|A) weighted by P(A)

var PrXA = mammogramConstants.mammogramTP * PrA;

The struct mammogramConstants contains all the necessary constants.

public struct mammogramConstants
{
    public const double mammogramTP = 0.800;
    public const double mammogramFN = 0.200;
    public const double mammogramFP = 0.096;
    public const double mammogramTN = 0.904;
}

P(X | ~A) P(~A): the probability of testing positive and not having breast cancer, i.e. the false positive rate P(X | ~A) weighted by P(~A)

var PrXnA = mammogramConstants.mammogramFP * PrnA;

P(X): the probability of getting any positive result (True Positive + False Positive)

i.e., P(X) = P(X | A) P(A) + P(X | ~A) P(~A)

var PrX = PrXA + PrXnA;

Finally:

P(A|X): the probability of having cancer given a positive test result

var PrAX = (double)(mammogramConstants.mammogramTP * PrA) / PrX;

Show Result:

Console.WriteLine("Based on the given survey data, a positive result means:");
Console.WriteLine("you are {0:0.00}% likely to have cancer.", PrAX * 100);
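The steps above can also be gathered into a single function that has no database dependency, which makes the calculation easier to check in isolation. This is only a sketch: `BayesSketch`, `Posterior` and the parameter names are hypothetical and do not appear in the article's download.

```csharp
using System;

// Hypothetical helper gathering the article's steps into one place.
public static class BayesSketch
{
    // prior             = P(A)
    // sensitivity       = P(X | A), the true positive rate
    // falsePositiveRate = P(X | ~A)
    // Returns P(A | X).
    public static double Posterior(double prior, double sensitivity, double falsePositiveRate)
    {
        if (prior < 0 || prior > 1)
            throw new ArgumentOutOfRangeException(nameof(prior));

        double truePositiveMass = sensitivity * prior;               // P(X|A) P(A)
        double falsePositiveMass = falsePositiveRate * (1 - prior);  // P(X|~A) P(~A)
        double evidence = truePositiveMass + falsePositiveMass;      // P(X)

        if (evidence == 0)
            throw new ArgumentException("P(X) is zero, so P(A|X) is undefined.");

        return truePositiveMass / evidence;
    }
}
```

Calling it with the survey's P(A) and the mammogramTP and mammogramFP constants should give the same value as PrAX.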

Points of Interest

Although the mammogram specification states an 80% chance of a positive result given breast cancer, Bayes' theorem and the data in the survey collection show that a positive result means only a ~7.76% chance of having breast cancer. The gap arises because the condition is rare in the sample (P(A) is about 1%), so false positives from the much larger cancer-free group outnumber true positives.
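As a worked check, assume the survey gives P(A) ≈ 0.01, the ~1% noted in the code comment; the exact value depends on the contents of the collection. With the 80% and 9.6% rates from the specification, the arithmetic runs as follows:

```latex
\begin{aligned}
P(A \mid X) &= \frac{P(X \mid A)\,P(A)}{P(X \mid A)\,P(A) + P(X \mid \lnot A)\,P(\lnot A)} \\
            &= \frac{0.80 \times 0.01}{0.80 \times 0.01 + 0.096 \times 0.99} \\
            &= \frac{0.008}{0.008 + 0.09504} = \frac{0.008}{0.10304} \approx 0.0776
\end{aligned}
```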

History

  • Version 2.0

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
