
How to Quickly Compare Multiple ML Models on your Data

Find the best way to predict column value based on other columns, quickly and simply

Introduction

You have a data file with multiple columns.

You want to predict the value of one column based on the values of the other columns.

You want to find the best algorithm to do it.

Using the Code

Download the complete code example.

In this example, all the data lives in a single CSV file, data.csv, which looks like this:

* c1, c2 and c3 are continuous features between 0 and 1; we want to use them to predict the binary label.

[Image: the first rows of data.csv, showing the c1, c2, c3 and label columns]
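
If you prefer not to download the sample file, you can generate a CSV with the same shape yourself. The snippet below is only an illustration: the values and the rule that produces the label are made up, it is not the dataset used in this article.

Python
# Illustrative only: create a data.csv with three continuous features in [0, 1]
# and a binary label, matching the column layout described above.
import random
import pandas as pd

rows = []
for _ in range(1000):
    c1, c2, c3 = random.random(), random.random(), random.random()
    # Made-up rule linking the features to the label, just so there is some signal.
    label = int(c1 + c2 + c3 + random.gauss(0, 0.5) > 1.5)
    rows.append({"c1": c1, "c2": c2, "c3": c3, "label": label})

pd.DataFrame(rows).to_csv("data.csv", index=False)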

Make sure you have all the required Python packages:

Shell
pip install pandas
pip install scikit-learn
pip install json-config-expander
pip install xgboost

You can download the complete code example, or follow the instructions to understand each step.

Open a new Python file and add these imports:

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from json_config_expander import expand_configs
from xgboost import XGBClassifier
import pandas as pd

Read the data:

Python
df = pd.read_csv('data.csv')

Split it into X (the feature columns the prediction will be based on) and y (the column we want to predict):

Python
X = df[[column for column in df.columns if column != 'label']]
y = df['label']

Split the data into train and test sets:

Python
X_train_complete, X_test, y_train_complete, y_test = train_test_split(X, y, test_size=0.2)
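
A small optional tweak, not part of the original example: if you want the split to be reproducible between runs, train_test_split accepts a random_state argument.

Python
# Optional: fix the random seed so the train/test split is the same on every run.
X_train_complete, X_test, y_train_complete, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)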

Because we are going to tune the algorithms' parameters, it is a good idea to set the test set aside for later steps and perform another split on the training set to create a validation set. If you are not familiar with the purpose of a validation set, it is worth reading up on train/validation/test splits first.

Python
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_complete, y_train_complete, test_size=0.2)

Now we are going to train each candidate model on the training set and evaluate it on the validation set:

Python
results = expand_configs(BASE_CONFIG,
   lambda config: evaluate_model(config, X_train, X_validation, y_train, y_validation))

Two things in the code above are still missing: BASE_CONFIG and the evaluate_model function, so we need to define them. BASE_CONFIG describes all the different models (and parameter values) we want to test, and evaluate_model is a function that returns a measure of how well a given model performed.

Define BASE_CONFIG:

Python
BASE_CONFIG = {"classifier*": [
   {
      "name": "random_forest",
      "parameters": {"max_depth*": [3, 5], "n_estimators*": [50, 100, 200]}
   },
   {
      "name": "logistic_regression",
      "parameters": {"max_iter*": [10, 100, 1000], "C*": [0.1, 0.5, 1]}
   },
   {
      "name": "xgboost",
      "parameters": {"max_depth*": [3, 4, 5], "n_estimators*": [50, 100, 200], 
      "learning_rate*": [0.01, 0.05, 0.1]}
   }]
}

We are going to use three different models: Random Forest, Logistic Regression, and XGBoost. Every algorithm has its own parameters; I picked a few of them in the code above to test what works best on this data. Every key that ends with * holds a list of options, and expand_configs runs every possible combination of those options (disclaimer: expand_configs is part of an open-source library that I co-authored).
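
To make the expansion concrete, here is a tiny, hypothetical config using the same call pattern as above, with a callback that simply echoes the concrete config it receives. Based on how expand_configs is used in this article, the result should contain one entry per combination, six in this case:

Python
# Hypothetical mini-config: 2 max_depth values x 3 n_estimators values = 6 combinations.
TINY_CONFIG = {"classifier*": [
   {
      "name": "random_forest",
      "parameters": {"max_depth*": [3, 5], "n_estimators*": [50, 100, 200]}
   }]
}

# The callback just returns the expanded config, so 'expanded' lists all six of them.
expanded = expand_configs(TINY_CONFIG, lambda config: config)
for config in expanded:
    print(config["classifier"]["parameters"])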

Add a mapping from the model names used in the configuration to the corresponding classifier classes:

Python
CLASSIFIER_MAPPINGS = {"random_forest": RandomForestClassifier, 
    "xgboost": XGBClassifier, "logistic_regression": LogisticRegression}

Now, we need to write the evaluation function:

Python
def evaluate_model(config, X_train, X_test, y_train, y_test):
   classifier = CLASSIFIER_MAPPINGS[config["classifier"]["name"]](
      **config["classifier"]["parameters"])
   classifier.fit(X_train, y_train)
   test_scores = classifier.predict_proba(X_test)[:, 1]
   roc_auc = roc_auc_score(y_test, test_scores)
   return {"config": config, "roc_auc": roc_auc}

This function receives a model configuration, a training set and a test set; it trains the model on the training data and evaluates it on the test set. To evaluate, we need to decide which metric to use, and that depends on the real problem we want to solve.

For the purpose of this example, I decided to use ROC-AUC to evaluate the model.
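
As a quick reminder of what ROC-AUC measures, here is a toy example (not related to the dataset): a score of 1.0 means the predicted probabilities rank every positive above every negative, while roughly 0.5 means the ranking is no better than chance.

Python
# Toy illustration of roc_auc_score.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0 - positives scored above negatives
print(roc_auc_score(y_true, [0.9, 0.1, 0.8, 0.2]))  # 0.5 - scores do not separate the classes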

After the expand_configs function runs, we get the results for all the possible parameter combinations:

[Image: the ROC-AUC results for every model and parameter combination]

Select the best model with the best parameters, and keep its configuration:

Python
best_result = max(results, key=lambda res: res["roc_auc"])
best_config = best_result["config"]

And the best classifier with the best parameters is:

Python
'classifier': {
    'name': 'xgboost', 
    'parameters': {'max_depth': 3, 'n_estimators': 50, 'learning_rate': 0.01}
} 

We got a ROC-AUC of 0.89, which is higher than any other combination I ran.
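
If you want to see how the other combinations compared, rather than only the single winner, you can sort all the results by their validation score (a small optional addition):

Python
# Optional: print the five best configurations by validation ROC-AUC.
top_results = sorted(results, key=lambda res: res["roc_auc"], reverse=True)[:5]
for res in top_results:
    print(round(res["roc_auc"], 3), res["config"]["classifier"])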

Do you remember the test set we put aside before? Now we can use it to see how well the model we trained performs on it:

Python
result_on_test = evaluate_model(
    best_config, X_train_complete, X_test, y_train_complete, y_test)

We got a ROC-AUC of 0.89 on the validation set and 0.84 on the test set. This is a common result: because the model was chosen for its performance on the validation set, it usually scores somewhat better there than on the test set.

Complete Example

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from json_config_expander import expand_configs
from xgboost import XGBClassifier
import pandas as pd

CLASSIFIER_MAPPINGS = {
   "random_forest": RandomForestClassifier,
   "xgboost": XGBClassifier,
   "logistic_regression": LogisticRegression}

BASE_CONFIG = {"classifier*": [
   {
      "name": "random_forest",
      "parameters": {"max_depth*": [3, 5], "n_estimators*": [50, 100, 200]}
   },
   {
      "name": "logistic_regression",
      "parameters": {"max_iter*": [10, 100, 1000], "C*": [0.1, 0.5, 1]}
   },
   {
      "name": "xgboost",
      "parameters": {"max_depth*": [3, 4, 5], 
      "n_estimators*": [50, 100, 200], "learning_rate*": [0.01, 0.05, 0.1]}
   }]
}

def evaluate_model(config, X_train, X_test, y_train, y_test):
   classifier = CLASSIFIER_MAPPINGS[config["classifier"]["name"]](
      **config["classifier"]["parameters"])
   classifier.fit(X_train, y_train)
   test_scores = classifier.predict_proba(X_test)[:, 1]
   roc_auc = roc_auc_score(y_test, test_scores)
   return {"config": config, "roc_auc": roc_auc}

if __name__ == '__main__':
   df = pd.read_csv('data.csv')

   X = df[[column for column in df.columns if column != 'label']]
   y = df['label']

   X_train_complete, X_test, y_train_complete, y_test = train_test_split(
      X, y, test_size=0.2)
   X_train, X_validation, y_train, y_validation = train_test_split(
      X_train_complete, y_train_complete, test_size=0.2)

   results = expand_configs(BASE_CONFIG,
      lambda config: evaluate_model(config, X_train, X_validation, y_train, y_validation))

   best_result = max(results, key=lambda res: res["roc_auc"])
   best_config = best_result["config"]

   result_on_test = evaluate_model(
      best_config, X_train_complete, X_test, y_train_complete, y_test)

   print(f"Best config: {best_config}")
   print(f"ROC AUC on Validation: {best_result['roc_auc']}")
   print(f"ROC AUC on Test: {result_on_test['roc_auc']}")

History

  • 17th January, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The MIT License