Credit card fraud detection is an important application of machine learning techniques, including Decision Trees. The goal is to identify fraudulent transactions and separate them from valid ones to prevent financial loss and protect user accounts.
Introduction
In this article, we'll implement the Decision Tree algorithm for credit card fraud detection. The Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks.
Background
The Decision Tree algorithm builds a tree-like model of decisions based on the features of the data. Each internal node of the tree represents a decision based on a feature, and each leaf node represents a class label or a predicted value.
Please refer to my Medium article "Machine Learning - Decision Tree" to understand the Decision Tree concept in detail.
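To make the node and leaf structure described above concrete, here is a minimal sketch that fits a tree on a few made-up transactions and prints its rules. The toy data and the feature names amount and hour_of_day are assumptions for illustration only and are not taken from the dataset used later.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [amount, hour_of_day]; 1 = fraudulent, 0 = legitimate.
X_toy = [[20, 10], [35, 14], [15, 9], [900, 3], [1200, 2], [40, 16]]
y_toy = [0, 0, 0, 1, 1, 0]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Internal nodes print as feature thresholds; leaves print as "class: ...".
print(export_text(toy_tree, feature_names=["amount", "hour_of_day"]))
```

Because the toy classes can be separated by a single threshold, the printed tree has one internal node and two leaves.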
High Level Steps
Below is an overview of the high-level steps involved in detecting credit card fraud using the Decision Tree algorithm:
Data Collection: Collect a labeled dataset that includes historical credit card transactions, where each transaction is labeled as either fraudulent or legitimate. The dataset should contain relevant features such as transaction amount, merchant information, transaction time, and other related variables.
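Data Preprocessing: Preprocess the dataset by performing tasks such as data cleaning, handling missing values, feature selection and normalization. Ideally the dataset is balanced, meaning it has a similar number of fraudulent and valid transactions, to prevent bias in the model. Fraud datasets rarely are (the one used below has 492 fraudulent records out of 284,807), so techniques such as resampling or class weighting may be needed.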
Splitting the Dataset: Split the preprocessed dataset into training and testing sets. The training set will be used to build the Decision Tree model, while the testing set will be used to evaluate the model's performance.
Decision Tree Model: Build a Decision Tree model on the training data. The features of the dataset will serve as inputs, and the label (fraudulent or legitimate) will be the target variable. The Decision Tree algorithm will learn patterns and decision rules based on the features to classify transactions as either fraudulent or legitimate.
Model Training: Train the Decision Tree model on the training data, using a suitable metric such as Information Gain or Gini Impurity to determine the best feature to split the data at each node.
Model Evaluation: Evaluate the trained model using the testing data. Calculate metrics such as accuracy, precision, recall, and F1-score to assess the model's performance in correctly identifying fraudulent transactions and minimizing false positives and false negatives.
Fine Tuning: Adjust the Decision Tree model's hyperparameters, such as maximum depth, minimum samples per leaf and splitting criterion, to reduce overfitting and improve the model's generalization ability.
Prediction: Use the trained Decision Tree model to make predictions on new, unseen credit card transactions. The model will classify each transaction as either fraudulent or legitimate based on the learned decision rules.
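The splitting metrics named in the Model Training step can be worked through by hand. The following is a minimal sketch, not scikit-learn's internal implementation; the helper names gini_impurity, entropy and information_gain and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 4 legitimate (0) and 2 fraudulent (1) transactions,
# and a candidate split that separates the two classes perfectly.
parent = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0], [1, 1]

print("Gini impurity of parent:", gini_impurity(parent))
print("Entropy of parent:      ", entropy(parent))
print("Information gain:       ", information_gain(parent, left, right))
```

A pure node scores zero under both measures, so a split that separates the classes perfectly recovers the full entropy of the parent as information gain. In scikit-learn, the two measures correspond to criterion="gini" and criterion="entropy" on DecisionTreeClassifier.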
Using the Code
Below is the implementation of the algorithm. The code is written in Python using a Jupyter notebook.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
You can use any dataset containing credit card transactions. The dataset used in this implementation was downloaded from Kaggle.
creditdata_df = pd.read_csv("~path~//creditcard.csv")
print("Dataset Shape :-")
print(creditdata_df.shape)
Output
Dataset Shape :-
(284807, 31)
After loading the creditcard.csv data into a dataframe, let us inspect the data.
creditdata_df.head(10)
Output
Let us separate the legitimate and fraudulent records in the dataset:
# Class == 1 marks fraudulent transactions; Class == 0 marks legitimate ones.
false = creditdata_df[creditdata_df['Class'] == 1]   # fraudulent
true = creditdata_df[creditdata_df['Class'] == 0]    # legitimate
# Ratio of fraudulent to legitimate transactions
n = len(false) / float(len(true))
print(n)
print('False Detection : {}'.format(len(false)))
print('True Detection:{}'.format(len(true)), "\n")
Output
0.0017304750013189597
False Detection : 492
True Detection:284315
Check the summary statistics of the transaction amount for both types of records:
print("False Detection Transaction")
print("============================")
print(false.Amount.describe(),"\n")
print("True Detection Transaction")
print("============================")
print(true.Amount.describe(),"\n")
Output
Now it's time to separate features and target variable:
X = creditdata_df.drop('Class', axis=1)
y = creditdata_df['Class']
Split data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
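With so few fraudulent records, a purely random split can leave the training and testing sets with noticeably different fraud rates. One option, which the split above does not use, is the stratify parameter of train_test_split. The sketch below uses made-up stand-in data (the names X_demo and y_demo are assumptions) rather than the Kaggle file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, 2% labeled fraudulent (1).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo keeps the fraud proportion the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Fraud share in train:", y_tr.mean())
print("Fraud share in test: ", y_te.mean())
```

Applied to the article's data, this would amount to adding stratify=y to the existing call; the reported results below were produced without it.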
Create a Decision Tree classifier:
classifier = DecisionTreeClassifier()
Now let us train the classifier:
classifier.fit(X_train, y_train)
Now let us try to make predictions on the test set:
y_pred = classifier.predict(X_test)
Calculate accuracy of the model:
accuracy = accuracy_score(y_test, y_pred) * 100
print("Accuracy:", accuracy)
confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)
Output
Accuracy: 99.90695551420245
Confusion Matrix:
[[56833 31]
[ 22 76]]
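To help read the confusion matrix, the sketch below copies the four counts from the output above and unpacks them. In scikit-learn's layout, rows are actual classes and columns are predicted classes. It also computes the accuracy a trivial model would reach by labeling every transaction as legitimate; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class).
cm = np.array([[56833, 31],
               [22, 76]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

total = cm.sum()
accuracy = (tp + tn) / total
# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {accuracy:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Missed frauds (false negatives): {fn}")
print(f"False alarms (false positives):  {fp}")
```

The small gap between the two accuracy figures shows why accuracy alone says little on data this imbalanced.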
Finally, let us evaluate the model using precision, recall and F1 score:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
precision=precision_score(y_test, y_pred, pos_label=1)*100
print('\n Score Precision :\n',precision )
recall=recall_score(y_test, y_pred, pos_label=1)*100
print("\n Recall Score :\n", recall)
fscore=f1_score(y_test, y_pred, pos_label=1)*100
print("\n F1 Score :\n", fscore)
Output
Score Precision :
71.02803738317756
Recall Score :
77.55102040816327
F1 Score :
74.14634146341463
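These three scores follow directly from the confusion matrix counts. As a check, the sketch below recomputes them by hand from the true positives, false positives and false negatives reported earlier; the variable names are illustrative.

```python
# Counts taken from the confusion matrix reported earlier in the article.
tp, fp, fn = 76, 31, 22

precision = tp / (tp + fp)          # share of flagged transactions that are fraud
recall = tp / (tp + fn)             # share of actual frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision * 100:.2f}")
print(f"Recall:    {recall * 100:.2f}")
print(f"F1 score:  {f1 * 100:.2f}")
```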
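As you can see, the Decision Tree algorithm applied to the creditcard.csv dataset achieved 99.91% accuracy. Because fraudulent transactions are so rare in this dataset, the precision (71.03%) and recall (77.55%) on the fraud class give a more realistic picture of the model than accuracy alone.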
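The Fine Tuning step listed earlier is not exercised in the code above. One possible way to approach it is a grid search over the hyperparameters mentioned there, sketched below with scikit-learn's GridSearchCV. The synthetic data from make_classification, the grid values and the choice of F1 as the scoring metric are all assumptions for illustration, not settings used to produce the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic stand-in for the credit card data (about 5% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Illustrative grid over depth, leaf size and splitting criterion.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training set, scored by F1 on the positive class.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated F1:", search.best_score_)
print("Held-out F1:", search.score(X_te, y_te))
```

Scoring by F1 rather than accuracy keeps the search focused on the rare fraud class. The testing set is held back until the end so that the tuned model is still evaluated on unseen data.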
Conclusion
In conclusion, our credit card fraud detection system, powered by a decision tree classifier, holds great potential in safeguarding financial transactions from fraudulent activities.
History
- 15th August, 2023: Initial version