Introduction
Apriori is an algorithm for learning frequent itemsets and association rules. It is a bottom-up method: at each iteration it extends frequent itemsets by one item, and it terminates when no itemset can be extended any further.
Apriori Model
Frequent Itemset
A frequent itemset is a set of items that frequently occur together in the data, i.e., whose support (the fraction of records containing all of its items) is at least a user-defined minimum support.
A naive method would traverse the dataset once for every candidate itemset, which is expensive. To reduce this computation, researchers identified the Apriori principle, which shrinks the number of candidate itemsets that need to be examined. The Apriori principle states that if an itemset is frequent, then all of its subsets are frequent. This does not seem useful at first sight, but its contrapositive is powerful: if an itemset is not frequent, then none of its supersets are frequent. In the bottom-up process, this means that once an itemset at the bottom is found to be infrequent, none of the sets built on top of it need to be processed. For example, consider the set A = {0, 1, 2, 3}. All possible itemsets generated from A are shown below:
If the itemset {0, 1} is not frequent, then every itemset containing {0, 1} is also not frequent, so all of them (marked with blue circles) can be skipped.
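To make the pruning concrete, here is a minimal sketch that enumerates all 15 non-empty itemsets of A and, supposing {0, 1} turned out to be infrequent, counts how many of them can be skipped without ever computing their support:

from itertools import chain, combinations

items = [0, 1, 2, 3]
# every non-empty candidate itemset of A
all_itemsets = list(chain.from_iterable(
    combinations(items, r) for r in range(1, len(items) + 1)))

infrequent = {0, 1}  # suppose counting showed {0, 1} is infrequent
# Apriori principle: every superset of an infrequent itemset is infrequent
pruned = [s for s in all_itemsets if infrequent.issubset(s)]
print(len(all_itemsets), len(pruned))  # 15 total, 4 can be skipped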
The generation of frequent itemsets includes the following steps:
- First, build the singleton sets: every distinct item in the data is regarded as an initial candidate set.
def createSingletonSet(self, data):
    # collect every distinct item as a 1-element candidate set
    singleton_set = []
    for record in data:
        for item in record:
            if [item] not in singleton_set:
                singleton_set.append([item])
    singleton_set.sort()
    # frozensets are hashable, so they can serve as dict keys later
    singleton_set = list(map(frozenset, singleton_set))
    return singleton_set
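As a quick check, assuming the methods in this section are collected on a class instance (called model here; a complete sketch follows this list), each distinct item becomes one frozenset:

data = [[1, 3, 4], [2, 3, 5]]  # hypothetical toy transactions
print(model.createSingletonSet(data))
# [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]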
- Next, merge frequent (k-1)-itemsets to generate larger candidate sets.
def createCandidateSet(self, frequent_items, k):
    # merge frequent (k-1)-itemsets into candidate k-itemsets
    candidate_set = []
    items_num = len(frequent_items)
    for i in range(items_num):
        for j in range(i + 1, items_num):
            # if the first k-2 items (in sorted order) are equal,
            # the union of the two sets has exactly k items
            L1 = sorted(frequent_items[i])[:k-2]
            L2 = sorted(frequent_items[j])[:k-2]
            if L1 == L2:
                candidate_set.append(frequent_items[i] | frequent_items[j])
    return candidate_set
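For example, merging the frequent 2-itemsets {0, 1}, {0, 2} and {1, 2} with k = 3 compares only the first (k-2 = 1) element of each sorted pair, so the union {0, 1, 2} is produced exactly once rather than three times:

pairs = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
print(model.createCandidateSet(pairs, k=3))
# [frozenset({0, 1, 2})]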
- Calculate the support degree of all the candidate sets, and keep the candidate sets whose support degree is at least the minimum support degree as the frequent itemsets.
def calculateSupportDegree(self, data, candidate_set):
    # count how many records contain each candidate set
    sample_sum = len(data)
    records = map(set, data)
    frequency = {}
    for record in records:
        for element in candidate_set:
            if element.issubset(record):
                frequency[element] = frequency.get(element, 0) + 1
    support_degree = {}
    frequent_items = []
    for key in frequency:
        support = frequency[key] / sample_sum
        if support >= self.min_support:
            frequent_items.insert(0, key)
        # keep the support of every candidate for rule generation later
        support_degree[key] = support
    return frequent_items, support_degree
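As a worked check on a toy dataset (hypothetical, as before): {2, 5} appears in three of the four records, so its support is 3/4 = 0.75 and it survives a minimum support of 0.5:

data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
frequent, support = model.calculateSupportDegree(data, [frozenset({2, 5})])
print(support[frozenset({2, 5})])  # 0.75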
- Combining the above functions:
def findFrequentItem(self, data):
    singleton_set = self.createSingletonSet(data)
    sub_frequent_items, sub_support_degree = \
        self.calculateSupportDegree(data, singleton_set)
    frequent_items = [sub_frequent_items]
    support_degree = sub_support_degree
    k = 2
    # grow itemsets level by level until no new frequent itemsets appear
    while len(frequent_items[k-2]) > 0:
        candidate_set = self.createCandidateSet(frequent_items[k-2], k)
        sub_frequent_items, sub_support_degree = \
            self.calculateSupportDegree(data, candidate_set)
        support_degree.update(sub_support_degree)
        if len(sub_frequent_items) == 0:
            break
        frequent_items.append(sub_frequent_items)
        k += 1
    return frequent_items, support_degree
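Putting the pieces together requires a class that carries min_support (and, for the rule-generation step below, min_confidence). The class name Apriori and the constructor in this sketch are assumptions for illustration; the article only shows the methods:

class Apriori:
    # hypothetical wrapper class; the constructor is assumed, not from the article
    def __init__(self, min_support=0.5, min_confidence=0.7):
        self.min_support = min_support
        self.min_confidence = min_confidence

    # paste createSingletonSet, createCandidateSet,
    # calculateSupportDegree and findFrequentItem here

model = Apriori(min_support=0.5)
data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]  # toy transactions
frequent_items, support_degree = model.findFrequentItem(data)
for k, level in enumerate(frequent_items, start=1):
    print(k, level)  # the frequent k-itemsets found at each level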
Association Rule
An association rule states that the occurrence of one set of items implies the occurrence of another. For example, given the frequent itemset {beer, diaper}, there may be an association rule diaper -> beer, meaning that someone who buys diapers is likely to also buy beer. The strength of a rule X -> Y is measured by its confidence, support(X ∪ Y) / support(X). Note that rules are directional: diaper -> beer holding does not mean that beer -> diaper holds.
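As a worked example on the toy dataset used earlier: support({1}) = 0.5, support({3}) = 0.75, and support({1, 3}) = 0.5, so confidence(1 -> 3) = 0.5 / 0.5 = 1.0, while confidence(3 -> 1) = 0.5 / 0.75 ≈ 0.67. With a minimum confidence of 0.7, the rule 1 -> 3 is kept but 3 -> 1 is discarded.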
The generation of association rules includes the following steps:
- Calculate the confidence degree of each candidate rule derived from a frequent itemset, and keep the rules whose confidence degree is at least the minimum confidence degree.
def calculateConfidence(self, frequent_item, support_degree, candidate_set, rule_list):
    # candidate_set holds candidate consequents; for each consequent `item`,
    # test the rule (frequent_item - item) -> item
    rule = []
    for item in candidate_set:
        # confidence(X -> Y) = support(X | Y) / support(X)
        confidence = support_degree[frequent_item] / support_degree[frequent_item - item]
        if confidence >= self.min_confidence:
            rule_list.append((frequent_item - item, item, confidence))
            rule.append(item)
    return rule
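Note the two outputs here: rule_list is mutated in place and accumulates every accepted rule as an (antecedent, consequent, confidence) tuple, while the return value collects only the consequents that passed the threshold, which is what the merging step below recurses on.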
- Merge the surviving consequents to generate rules with larger consequents:
def mergeFrequentItem(self, frequent_item, support_degree, candidate_set, rule_list):
    # recursively grow the consequent side of the rules
    item_num = len(candidate_set[0])
    if len(frequent_item) > item_num + 1:
        candidate_set = self.createCandidateSet(candidate_set, item_num + 1)
        rule = self.calculateConfidence(frequent_item, support_degree,
                                        candidate_set, rule_list)
        # at least two surviving consequents are needed for another merge
        if len(rule) > 1:
            self.mergeFrequentItem(frequent_item, support_degree, rule, rule_list)
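To trace the recursion on a hypothetical frequent itemset {0, 1, 2}: starting from the single-item consequents {0}, {1} and {2}, createCandidateSet merges them into the two-item consequents {0, 1}, {0, 2} and {1, 2}, and calculateConfidence then tests the rules {2} -> {0, 1}, {1} -> {0, 2} and {0} -> {1, 2}. A three-item itemset cannot support a three-item consequent, so the recursion stops there.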
- Combining the above functions:
def generateRules(self, frequent_set, support_degree):
    rules = []
    # frequent_set[0] holds the singletons, which cannot form rules
    for i in range(1, len(frequent_set)):
        for frequent_item in frequent_set[i]:
            candidate_set = [frozenset([item]) for item in frequent_item]
            if i > 1:
                # itemsets with three or more items: grow consequents recursively
                self.mergeFrequentItem(frequent_item, support_degree,
                                       candidate_set, rules)
            else:
                # 2-itemsets: only single-item consequents are possible
                self.calculateConfidence(frequent_item, support_degree,
                                         candidate_set, rules)
    return rules
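Continuing the hypothetical sketch from the frequent-itemset section, rule generation consumes the output of findFrequentItem directly:

frequent_items, support_degree = model.findFrequentItem(data)
rules = model.generateRules(frequent_items, support_degree)
for antecedent, consequent, confidence in rules:
    # each rule reads: antecedent -> consequent with the given confidence
    print(set(antecedent), '->', set(consequent), round(confidence, 3))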
Conclusion and Analysis
Apriori is a simple algorithm for generating frequent itemsets and association rules, but it is better suited to sparse datasets: every level of itemset growth requires another full scan of the data, and dense datasets produce many long frequent itemsets. Now, given the following dataset, let's look at the result of Apriori. Each rule in the result is a tuple (X, Y, confidence degree), read as the rule X -> Y holding with the given confidence.
The code and dataset used in this article can be found in MachineLearning.