
Step-by-Step Guide To Implement Machine Learning II - Decision Tree

5 May 2019 · CPOL · 3 min read
Easy to implement machine learning


Introduction

The decision principle is not complex. Starting from the root node, compare the feature value stored in the node with the corresponding feature value of the test object. Then, turn to the left subtree or the right subtree recursively according to the comparison result. Finally, the label of the leaf node reached is the prediction for the test object.

For example, consider the decision tree below, where the feature set is {hungry, money} and the label set is {go to restaurant, go to buy a hamburger, go to sleep}.

Image 1

In the decision process, first check whether I am hungry. If not, turn to the right subtree, which means I will go to sleep. Otherwise, turn to the left subtree and check how much money is in my wallet: if I have more than $25, I will go to a restaurant; else, I can only go to buy a hamburger.
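As an illustration only (this function is not part of the classifier implemented in the rest of the article), the traversal of this example tree can be written as two nested comparisons:

Python
def toy_decision(hungry, money):
    # root node: test the "hungry" feature
    if not hungry:
        return "go to sleep"
    # child node: test the "money" feature against the threshold of 25$
    if money > 25:
        return "go to restaurant"
    return "go to buy a hamburger"

print(toy_decision(hungry=True, money=30))   # go to restaurant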

Decision Tree Model

The decision tree model consists of feature selection, generation of the decision tree, and classification.

Feature Selection

The aim of feature selection is to choose features that are able to classify the training data, so as to make the learning process more efficient. Feature selection includes two parts, namely, selecting the feature and selecting the feature value. The selected tuple (feature, feature value) is then used as a node in the decision tree.

Feature selection is usually based on information gain or the information gain ratio. Information gain is defined as the difference between the empirical entropy of set D, H(D), and the conditional entropy of D given feature A, H(D|A), which is calculated as:

g\left( D,A \right)=H\left( D \right)-H\left( D|A \right)

Specifically, when the data set D is divided into two subsets by feature A, the information gain is expressed as:

g\left( D,A \right)=H\left( D \right)-\alpha H\left( D_{1} \right)-\left( 1-\alpha\right) H\left( D_{2} \right)

where \alpha is the proportion of samples that fall into the first subset, namely:

\alpha=\frac{\left|D_{1}\right|}{\left|D\right|}
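For example, if D contains 4 samples with H(D) = 1 and feature A splits it into D_1 (2 samples of one class, so H(D_1) = 0) and D_2 (2 samples of the other class, so H(D_2) = 0), then \alpha = 2/4 = 0.5 and g(D, A) = 1 - 0.5·0 - 0.5·0 = 1, i.e., this split separates the two classes perfectly.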

The code for calculating the information gain is shown below:

Python
left_set, right_set = self.divideData(data, i, value)
# calculate information gain
ratio = float(len(left_set)) / sample_num
if ratio == 0.0:
    info_gain = init_entropy - (1 - ratio) * self.getEntropy(right_set[:, -1])
elif ratio == 1.0:
    info_gain = init_entropy - ratio * self.getEntropy(left_set[:, -1])
else:
    info_gain = init_entropy - ratio * self.getEntropy(left_set[:, -1]) \
                - (1 - ratio) * self.getEntropy(right_set[:, -1])

So far, we have seen how to calculate the information gain. But how do we calculate the empirical entropy? The empirical entropy is the entropy of the labels of the training set, which is given by:

H\left( D\right)=-\sum_{k=1}^{K}\frac{\left| C_{k}\right|}{\left| D\right|}\log_{2}\frac{\left| C_{k}\right|}{\left|D\right|}

The above equation looks a bit complex, but it is very easy to implement. Let's look at the code:

Python
def getEntropy(self, labels):
    labels_num = len(labels)
    label_count = self.uniqueCount(labels)      # {label: number of occurrences}

    entropy = 0.0
    for j in label_count:
        prop = float(label_count[j]) / labels_num   # probability of label j
        entropy = entropy + (-prop * math.log(prop, 2))

    return entropy
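As a quick check, for the labels [0, 0, 1, 1] both classes have probability 0.5, so getEntropy returns -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1.0, the maximum for two classes; for [0, 0, 0, 0] it returns 0.0.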

Generation of Decision Tree

There are many algorithms for generating a decision tree, such as ID3 and C4.5. In this article, we take the ID3 algorithm as the example to generate the decision tree.

First, let's figure out the split process that follows feature selection. As we know, feature selection is meant to make the data classifiable. Thus, the split process divides the training data according to the selected feature index and its selected value. Specifically, if the value of feature index in a sample is larger than or equal to value, the sample is added to the right subtree and the index feature is deleted from the sample; otherwise, the sample is added to the left subtree and the index feature is deleted from the sample as well. The code of the split process is:

Python
def divideData(self, data, index, value):
    left_set = []
    right_set = []
    # select feature in index with value
    for temp in data:
        if temp[index] >= value:
            # delete this feature
            new_feature = np.delete(temp, index)
            right_set.append(new_feature)
        else:
            new_feature = np.delete(temp, index)
            left_set.append(new_feature)
    return np.array(left_set), np.array(right_set)
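For example, splitting the two rows [30, 1, 0] and [10, 1, 1] (where the last column is the label) on feature index 0 with value 25 sends the first row to the right set and the second to the left set, each with the first column removed, i.e., right_set = [[1, 0]] and left_set = [[1, 1]].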

Before generating a decision tree, we define a data structure to store a node of the decision tree:

Python
class DecisionNode:
    def __init__(self, index=-1, value=None, results=None, right_tree=None, left_tree=None):
        self.index = index                    # the index of feature
        self.value = value                    # the value of the feature with index
        self.results = results                # current decision result
        self.right_tree = right_tree
        self.left_tree = left_tree

Then, we can generate the decision tree recursively. If there are no samples left in the data set, stop. If the best information gain is smaller than a given threshold, stop and create a leaf node. Else, split the data set according to the best selected feature and its value, as shown below:

Python
def createDecisionTree(self, data):
    # if there are no samples left, stop the division
    if len(data) == 0:
        self.tree_node = DecisionNode()
        return self.tree_node

    best_gain = 0.0
    best_criteria = None
    best_set = None

    feature_num = len(data[0]) - 1        # the last column holds the label
    sample_num = len(data[:, -1])
    init_entropy = self.getEntropy(data[:, -1])

    # get the best division
    for i in range(feature_num):
        uniques = np.unique(data[:, i])
        for value in uniques:
            left_set, right_set = self.divideData(data, i, value)
            # calculate information gain
            ratio = float(len(left_set)) / sample_num
            if ratio == 0.0:
                info_gain = init_entropy - (1 - ratio) * self.getEntropy(right_set[:, -1])
            elif ratio == 1.0:
                info_gain = init_entropy - ratio * self.getEntropy(left_set[:, -1])
            else:
                info_gain = init_entropy - ratio * self.getEntropy(left_set[:, -1]) \
                            - (1 - ratio) * self.getEntropy(right_set[:, -1])
            if info_gain > best_gain:
                best_gain = info_gain
                best_criteria = (i, value)
                best_set = (left_set, right_set)

    # create the decision tree
    if best_gain < self.t:
        # not enough gain: create a leaf node storing the label counts
        self.tree_node = DecisionNode(results=self.uniqueCount(data[:, -1]))
        return self.tree_node
    else:
        ltree = self.createDecisionTree(best_set[0])
        rtree = self.createDecisionTree(best_set[1])
        self.tree_node = DecisionNode(index=best_criteria[0],
                                      value=best_criteria[1], left_tree=ltree, right_tree=rtree)
        return self.tree_node

Classification

The principle of classification is like that of a binary search tree, namely, comparing the feature value stored in the node with the corresponding feature value of the test object, and then turning to the left subtree or the right subtree recursively, as shown below:

Python
def classify(self, sample, tree):
    if tree.results is not None:
        # leaf node: return the stored label counts
        return tree.results
    else:
        value = sample[tree.index]
        # the split removed the used feature column from the training data,
        # so remove it from the sample as well before descending
        reduced_sample = np.delete(sample, tree.index)
        if value >= tree.value:
            branch = tree.right_tree
        else:
            branch = tree.left_tree
        return self.classify(reduced_sample, branch)
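The listings above refer to a few members that are never shown in this article, namely the constructor that stores the threshold self.t and the uniqueCount helper. A minimal sketch of this missing glue, together with a usage example on toy data, could look like the following (the class name DecisionTree, the constructor signature, and the dictionary-based uniqueCount are assumptions, not the author's actual code):

Python
import math
import numpy as np

class DecisionTree:
    def __init__(self, t=0.01):
        self.t = t              # minimum information gain required to keep splitting
        self.tree_node = None

    def uniqueCount(self, labels):
        # count how many samples carry each label
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        return counts

    # getEntropy, divideData, createDecisionTree and classify
    # from the listings above go here unchanged.

# usage sketch: columns are [money, hungry, label]
# data = np.array([[30, 1, 0], [10, 1, 1], [40, 0, 2], [5, 0, 2]])
# tree = DecisionTree(t=0.01)
# root = tree.createDecisionTree(data)
# print(tree.classify([35, 1], root))   # a dict of label counts, e.g. {0: 1}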

Conclusion and Analysis

After the decision tree is generated, a pruning process (which can be carried out by dynamic programming) can be applied to obtain a better decision tree. Moreover, the Classification and Regression Tree (CART) is a type of decision tree that can be applied not only to classification but also to regression. Finally, let's compare our decision tree with the one in Sklearn; the detection performance is displayed below:

Image 9

From the figure, we can see that our decision tree does not perform as well as sklearn's, which may be because we do not apply a pruning process.
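For reference, a comparison along these lines can be set up with a few lines of scikit-learn. The snippet below is only a sketch, not the benchmark actually used for the figure: the data is a random placeholder, and the split parameters are arbitrary.

Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# placeholder data; replace with the dataset from the MachineLearning repository
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "entropy" makes sklearn use an information-gain criterion, similar to ID3
sk_tree = DecisionTreeClassifier(criterion="entropy")
sk_tree.fit(X_train, y_train)
print("sklearn accuracy:", accuracy_score(y_test, sk_tree.predict(X_test)))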

The related code and dataset for this article can be found in MachineLearning.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)