This tutorial was part of the Cloud AI Challenge with SAP HANA and Amazon SageMaker. The contest is now over, so the instructions in this article on connecting to the SAP instance will no longer work.
Introduction
Data science, machine learning, and big data analytics are hot topics. It seems like they’re always in the news, with companies using these technologies to try to gain an edge over their competitors.
But what do these things mean to us, as developers? There’s so much breathless hype and marketing fluff on these topics that it can feel like there’s no substance. Fortunately, that’s not the case. There are some great tools in this space that developers can use to deliver amazing applications quickly, and we’re going to learn how to use two of them together: SAP HANA and Amazon SageMaker.
In this tutorial, you’ll learn how to load a dataset from SAP HANA and explore it using an Amazon SageMaker notebook. Then, you’ll learn how to copy data to Amazon S3 and train a machine learning model with SageMaker. Finally, you’ll learn how to query data in HANA that was imported from a Tensorflow Serving model.
What is SAP HANA?
SAP describes HANA as a ‘business data platform that processes transactions and analytics at the same time on any data type.’ That sounds great, but it doesn’t tell us much about what HANA can do or how we can use it.
HANA is essentially a high-speed in-memory database that offers multiple ways to access the data it stores. HANA’s SQL interface lets you treat it as a relational database, which is something most developers are familiar with. You can even access it using an ORM like Entity Framework in .NET, or JPA in Java.
Relational data isn’t all HANA can handle. It also supports spatial queries and can act as a graph database like Neo4J or a document database like MongoDB. It can also handle full-text search and time-series analysis.
It might sound like this is a case of “jack of all trades, master of none”. And while specialized tools are sometimes the right choice, HANA has the advantage of being fast - very fast. As an in-memory database, HANA can perform real-time streaming analysis on large quantities of data.
Even better, HANA has some very cool built-in integrations with Python and Tensorflow that help it fit nicely into a data science workflow. Since we don’t want this tutorial to take 10 hours to complete, we’ll be limiting ourselves to a relatively small subset of HANA’s capabilities. However, if you’re interested in diving more in-depth, I strongly recommend installing HANA Express Edition and exploring its capabilities.
What is Amazon SageMaker?
Amazon SageMaker is an AWS service that helps developers and data scientists analyze data and then use that data to build, train, and deploy machine learning models in the cloud.
SageMaker can help you automatically label training data, which is typically a time-consuming activity. Then, with built-in Jupyter notebooks, it’s easy to analyze and visualize your data to decide what kind of model you’d like to create.
Next, it’s simple to spin up a training instance from within your notebooks. You can even ask SageMaker to use many instances - or even many GPUs - simultaneously to reduce the time you spend waiting for your model to train.
Then, by calling a couple of SageMaker SDK functions from your notebook, you can start up a serving instance that lets you call your trained model from AWS Lambda functions and other AWS services.
Put together, this workflow results in much faster progression from raw data to analysis to model training to deployment than you’re likely to achieve with a more traditional data science workflow.
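As a quick taste of that last step, here’s a minimal sketch of what calling a deployed model looks like from any AWS SDK client. The endpoint name and payload format are assumptions for illustration, not something we’ll build in this tutorial:
import json
import boto3
# Invoke a deployed SageMaker endpoint ('iris-endpoint' is a hypothetical name)
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='iris-endpoint',
    ContentType='application/json',
    Body=json.dumps({'instances': [[5.1, 3.5, 1.4, 0.2]]})
)
print(response['Body'].read().decode('utf-8'))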
How Do They Work Together?
At this point, SAP HANA and Amazon SageMaker sound great as individual tools, but you’re probably wondering how to use them together. After all, they seem to serve different purposes.
Fortunately, a couple of the HANA features I mentioned earlier - integration with Python and Tensorflow - are useful here.
HANA’s Python library includes a dataframe that can be pulled into a Pandas dataframe with a single method call. This is great because most Python machine learning tools know how to work with Pandas dataframes. We’ll be using this functionality to read data directly from a HANA instance and explore it inside a SageMaker notebook.
We’ll also be taking a look at the results generated by HANA’s Tensorflow integration. Although we’ll be using a read-only HANA instance and won’t be able to call Tensorflow ourselves during this tutorial, I’ll show you the query you’d need to run in HANA to call a Tensorflow model and ask it to perform inference on a dataset. I think you’ll be impressed by how simple it is.
Creating a SageMaker Notebook
The first step on our SageMaker + HANA journey is the creation of a SageMaker notebook. To do this, you’ll need an AWS account. If you don’t have one yet, navigate to https://aws.amazon.com/ and click on the big orange “Create an AWS Account” button at the top of the page:
Next, sign in to your AWS account and open the AWS Management Console. Click on the “All Services” dropdown to expand it and see the full list of services. You’ll find Amazon SageMaker under the “Machine Learning” heading. I’ve outlined it in red below:
Click on Amazon SageMaker, and you’ll land on the SageMaker dashboard. From the dashboard, click on Notebook Instances:
And then, click on the “Create notebook instance” button:
On the next screen, enter a notebook instance name, and choose the ml.t2.medium instance type. Make sure you choose this instance type because you can use it for free for 250 hours if you are a new SageMaker user. Also, make sure you shut down and delete your SageMaker notebook when you're finished with it.
When you’re done, click “Create notebook instance” at the bottom of the screen. At this point, you’ll need to wait a few minutes for SageMaker to spin up your notebook instance. While this is happening, the notebook’s status will show up as “Pending”:
This is the perfect time to go grab a coffee to provide your brain with a little extra kick to power through the rest of the tutorial.
Now that we’ve had a coffee break, let’s check the status of our SageMaker notebook. When it is ready, you’ll see that its status has changed to “InService”:
Next, click on the “Open JupyterLab” link. You’ll see a fancy spinning JupyterLab loading screen for a few seconds, then you’ll land on the Launcher screen:
You might notice that the JupyterLab UI is less than helpful here because it cuts off the names of the notebook types so you can’t actually read them.
Fortunately, if you hover over each notebook type, the full name will pop up. The one you want to choose is called “conda_tensorflow_p36”. In the screenshot above, it’s the last option before R (Beta). When you’ve located “conda_tensorflow_p36”, click it and you’ll be taken to an empty Python notebook.
If you’ve never used a Jupyter notebook before, you might be wondering what the heck this thing is. Fortunately, it’s simple to understand and use. A Jupyter notebook provides an interactive environment for writing and running code. In a notebook, you enter code in cells. Your new notebook contains an empty cell that’s ready for you to use:
Let’s try it out by entering and running some code. Click on the cell, and then enter the following code:
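print("Hello, world!")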
Then click the button to run your code, and you’ll see the following result:
That’s pretty neat! Our notebook runs our code, shows the result inline, and then creates a new cell for us to run more code.
With our quick Jupyter tutorial out of the way, let’s move on and give our notebook the ability to connect to an instance of SAP HANA. All we have to do is tell our notebook to download and install a Python library that will let us import HANA data into a Pandas-compatible dataframe.
On the menu bar, select File > New > Terminal:
In the terminal that opens, paste the following code:
mkdir -p ~/SageMaker/downloads/hanaclient
cd ~/SageMaker/downloads
wget --no-cookies \
--header "Cookie: eula_3_1_agreed=tools.hana.ondemand.com/developer-license-3_1.txt" \
"https://tools.hana.ondemand.com/additional/hanaclient-2.4.162-linux-x64.tar.gz" \
-P ~/SageMaker/downloads
tar -xvf ~/SageMaker/downloads/hanaclient-*-linux-x64.tar.gz
rm -f ~/SageMaker/downloads/hanaclient-*-linux-x64.tar.gz
Note that you’ll have to press Ctrl-Shift-V to paste into the terminal; Ctrl-V won’t work. These commands will download and unzip the HANA client. Next, we’ll paste in the following commands to install the HANA Python client library:
mkdir -p ~/SageMaker/sap/hdbclient
cd ~/SageMaker/downloads/client
./hdbinst --path=/home/ec2-user/SageMaker/sap/hdbclient/
cd ~
rm -Rf ~/SageMaker/downloads/client
source activate tensorflow_p36
pip install /home/ec2-user/SageMaker/sap/hdbclient/hdbcli-*.tar.gz
Finally, run these commands to install the HANA Machine Learning Python library:
source activate tensorflow_p36
pip install /home/ec2-user/SageMaker/sap/hdbclient/hana_ml-*.tar.gz
And with that, our notebook is ready to connect to SAP HANA. We’re ready to move on to analyzing a data set in our Jupyter notebook.
Exploring the Data
In JupyterLab, click on the tab containing your notebook - it should be called “Untitled.ipynb”. If you’d like to give your notebook a fancier name, you can find “Rename Notebook” in the file menu. If not, that’s fine too. We can get by just fine with an untitled notebook.
To begin, click the cell that contains the print function call we added above. Then, click the scissor icon to delete this cell. Doing this will leave you with a fresh blank cell to write new code in. In your new cell, type or paste the following code:
!pip install -q /home/ec2-user/SageMaker/sap/hdbclient/hdbcli-*.tar.gz
!pip install -q /home/ec2-user/SageMaker/sap/hdbclient/hana_ml-*.tar.gz
import os
import pandas as pd
import seaborn as sns
import numpy as np
from hana_ml import dataframe
from matplotlib import pyplot as plt
%matplotlib inline
This code ensures that the HANA libraries are ready to use, and then imports a few dependencies: Pandas is a data processing library, Seaborn and matplotlib are data visualization libraries, and Numpy is a scientific computing library. After adding the code, click the Run button to execute it.
Next, we’re going to import our Iris data from HANA. Add the following code to a new cell:
col_features = ['SEPALLENGTH', 'SEPALWIDTH', 'PETALLENGTH', 'PETALWIDTH']
col_label = 'SPECIES'
col_label_class = ['setosa', 'versicolor', 'virginica']
host = "54.208.22.69"
port = 39015
user = "CONTEST_USER"
pswd = "CodeProject2019"
conn = dataframe.ConnectionContext(host, port, user, pswd)
training_data = conn.table("IRIS_TRAINING", schema="ML_DATA").collect()
training_data.SPECIES.replace(sorted(training_data.SPECIES.unique()),
                              col_label_class, inplace=True)
print(type(training_data))
This code loads the data from HANA into a Pandas dataframe, then replaces the numeric values in the label column with their corresponding species names. Finally, we print the type of the object we got back, confirming that it really is a Pandas dataframe.
Next, paste the following code into a cell and run it:
training_data.head(5)
When you do, you’ll get the following result:
How nice! Here, we see that the output we can display isn’t limited to text. We can also get good-looking tables. And as we’ll soon see, we can do even better!
Next, let’s take a look at how each of the iris features is distributed. Paste this into a new cell and run it:
features_data = training_data.loc[:, col_features]
fig, axes = plt.subplots(figsize=(10,10),
                         nrows=2,
                         ncols=2,
                         )
for i, ax in enumerate(axes.flat):
    features_data[col_features[i]].value_counts().sort_index().plot(
        kind='line', legend=True, ax=ax)
plt.show()
When you do, you’ll see this set of charts:
How interesting! Sepal width appears to be normally distributed, while the other features appear to have a peak and then trail off.
To complete our analysis of the dataset, we’re going to create scatterplots that compare two features at a time for each of the three iris species. Run this code in a new cell:
fig, axes = plt.subplots(figsize=(15,15),
nrows=len(col_features),
ncols=len(col_features),
sharex='col',
sharey='row'
)
fig.suptitle("Edgar Anderson's Iris Data", fontsize=18)
plot_colors = ['blue', 'white', 'red']
for x in range(len(col_features)):
    for y in range(len(col_features)):
        ax = axes[x,y]
        if x == y:
            # Diagonal cells: show the feature name as a label
            ax.text(0.5, 0.5, col_features[x],
                    transform = ax.transAxes,
                    horizontalalignment = 'center',
                    verticalalignment = 'center',
                    fontsize = 18
                    )
        else:
            # Off-diagonal cells: scatter one feature against another,
            # with one color per species
            for idx_label, color in zip(range(len(col_label_class)), plot_colors):
                ax.scatter(
                    training_data.loc[training_data.SPECIES == col_label_class[idx_label],
                                      col_features[x]],
                    training_data.loc[training_data.SPECIES == col_label_class[idx_label],
                                      col_features[y]],
                    c = color,
                    label = col_label_class[idx_label],
                    edgecolor = 'black',
                    s = 20
                )
            ax.legend()
plt.show()
This code will give the following result:
Again, a very interesting result! When looking at two features, there are many cases where each species forms a fairly distinct cluster. This is a good sign; it means there’s a great chance that we’ll be able to build a machine learning model that can learn to determine which species an iris belongs to based on its sepal and petal measurements.
I strongly encourage you to look through the documentation and examples sections of both matplotlib and seaborn. See what other interesting visualizations you can come up with using this data set!
At this point, I’d like to do a quick recap. Without much code, we were able to read data directly from HANA into a Pandas dataframe and then work with it in a Jupyter notebook. This might not seem like a monumental achievement, but it’s great that we were able to accomplish it so easily. It would typically take more work to get data from a database into a dataframe.
Keep in mind that due to space limitations, we’re only looking at a small slice of what HANA can do. In a typical usage scenario, you’d have a huge data set in HANA already because HANA makes it easy to slice, dice, and analyze big data very quickly. Given that your data is already in HANA for analysis, the ability to plug it directly into a SageMaker notebook for further exploration is a huge win.
Interlude - Training a Model with SageMaker
Due to space constraints, we’re not going to walk through the entire process of training a machine learning model with SageMaker. It’s not that the process is difficult. In fact, it’s very easy! We just want to make sure you can get through this tutorial in under an hour.
The basic process of training in SageMaker involves:
- Saving your data to an AWS S3 bucket
- Setting up your ML classifier
- Asking SageMaker to train your model
All of these steps can be accomplished by running a bit of code in a SageMaker notebook. To see what it looks like to train a Tensorflow model using the same iris data set we’ve been exploring, check out this awesome tutorial by SAP.
SageMaker also offers you the option to deploy your model and call it via a REST API. For a complete end-to-end tutorial that teaches you how to train a model and deploy it, I recommend SageMaker’s Get Started Guide.
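To give you a feel for it, here’s a condensed sketch of those three steps using the SageMaker Python SDK. The training script name, data file, and instance types are illustrative assumptions rather than code from either tutorial:
import sagemaker
from sagemaker.tensorflow import TensorFlow
sess = sagemaker.Session()
# 1. Save your data to an S3 bucket ('iris_training.csv' is assumed to exist locally)
train_s3 = sess.upload_data(path='iris_training.csv', key_prefix='iris-data')
# 2. Set up your classifier ('iris_dnn.py' is a hypothetical training script)
estimator = TensorFlow(entry_point='iris_dnn.py',
                       role=sagemaker.get_execution_role(),
                       train_instance_count=1,
                       train_instance_type='ml.m5.large',
                       framework_version='1.12.0',
                       py_version='py3')
# 3. Ask SageMaker to train your model
estimator.fit(train_s3)
# Optionally, deploy the trained model behind a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.t2.medium')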
Tensorflow Data in HANA
We’re going to jump forward a bit now. Using the data that we previously analyzed in our SageMaker notebook, I trained a model in SageMaker and then deployed it using Tensorflow Serving on Amazon’s Elastic Container Service.
One of the neat features of HANA’s built-in machine learning support is that it can access Tensorflow Serving instances via a gRPC connection. It can pass test data directly from HANA to Tensorflow for classification and then write the results to a new table in HANA.
Since we’re using a shared, read-only instance of HANA, you won’t be able to do this yourself. Instead, I’ve pre-populated a HANA table with the results that were received from Tensorflow so you can see what the results look like.
For the sake of thoroughness, let’s take a look at the call you’d make to send data from HANA to Tensorflow. You don’t need to enter this code anywhere; it’s just for reference:
CREATE TYPE TT_IRIS_FEATURES AS TABLE (
SEPALLENGTH FLOAT,
SEPALWIDTH FLOAT,
PETALLENGTH FLOAT,
PETALWIDTH FLOAT
);
CREATE TYPE TT_IRIS_PARAMS AS TABLE (
"Parameter" VARCHAR(100),
"Value" VARCHAR(100)
);
CREATE TYPE TT_IRIS_RESULTS AS TABLE (
PREDICTED_CLASS_ID INTEGER,
CLASSES VARCHAR(100),
BIAS_0 FLOAT, BIAS_1 FLOAT, BIAS_2 FLOAT,
PROBABILITY_0 FLOAT, PROBABILITY_1 FLOAT, PROBABILITY_2 FLOAT
);
CREATE COLUMN TABLE IRIS_PROC_PARAM_TABLE (
POSITION INTEGER,
SCHEMA_NAME NVARCHAR(256),
TYPE_NAME NVARCHAR(256),
PARAMETER_TYPE VARCHAR(7)
);
INSERT INTO IRIS_PROC_PARAM_TABLE VALUES (1, CURRENT_SCHEMA, 'TT_IRIS_PARAMS' , 'in');
INSERT INTO IRIS_PROC_PARAM_TABLE VALUES (2, CURRENT_SCHEMA, 'TT_IRIS_FEATURES' , 'in');
INSERT INTO IRIS_PROC_PARAM_TABLE VALUES (3, CURRENT_SCHEMA, 'TT_IRIS_RESULTS' , 'out');
CALL SYS.AFLLANG_WRAPPER_PROCEDURE_DROP(CURRENT_SCHEMA, 'MY_IRIS');
CALL SYS.AFLLANG_WRAPPER_PROCEDURE_CREATE('EML', 'PREDICT', CURRENT_SCHEMA, 'MY_IRIS', IRIS_PROC_PARAM_TABLE);
CREATE TABLE IRIS_RESULTS LIKE TT_IRIS_RESULTS;
CREATE TABLE IRIS_PARAMS LIKE TT_IRIS_PARAMS;
INSERT INTO IRIS_PARAMS VALUES ('Model', 'iris%predict');
INSERT INTO IRIS_PARAMS VALUES ('RemoteSource', 'TensorFlow');
INSERT INTO IRIS_PARAMS VALUES ('Deadline', '10000');
CALL MY_IRIS (IRIS_PARAMS, IRIS_TEST, IRIS_RESULTS) WITH OVERVIEW;
That looks like a lot of code, but it’s all fairly easy to follow. We’re defining a few new types, creating a wrapper procedure, creating a table to store results in, then running the procedure that causes HANA to call Tensorflow.
As we can see, HANA is asked to save the results into the IRIS_RESULTS table. This is the table I’ve pre-populated with results in our shared HANA database, so let’s run a few queries to explore the data.
There's one small catch here: to connect to our HANA instance, we're going to have to add an entry to our HOSTS file. All of HANA's web-based tools expect to be running on hxehost, so we'll need a HOSTS file entry that maps all calls to hxehost to the HANA instance's IP address.
On Windows, you can find your HOSTS file in C:\Windows\System32\drivers\etc\HOSTS. On Linux and MacOS, you can find it at /etc/hosts. On whatever OS you're using, you'll need administrator or root privileges to edit the file. Once you've opened your HOSTS file, add the following line:
54.208.22.69 hxehost
To connect to our read-only HANA instance, open a web browser and navigate to https://hxehost:53075. Your browser will probably issue a security warning, because HANA is using a self-signed TLS certificate. Normally, when you get an alert like this, you'll want to run as fast as you can in the other direction.
In this case, there's no risk. Even if the certificate were compromised, we're not going to be transferring any confidential information. I'm posting the login credentials here for everyone to see, and the read-only information in the database is freely available to anyone on the internet who would like to sign in and take a look.
At the login prompt, use the following credentials:
- User name: CONTEST_USER
- Password: CodeProject2019
It might take 30 seconds or so for the HANA Web IDE to load completely. When it finishes, you'll see a panel at the left listing a single database: HXE@localhost. Right-click on the database, and in the menu that appears, choose "Open SQL Console".
Now, we're going to do a query to look at the results that HANA retrieved from our Tensorflow model. Run the following query in the HANA Web IDE:
SET SCHEMA ML_DATA;
SELECT
D.SPECIES AS "CURRENT_SPECIES",
R.PREDICTED_CLASS_ID AS "PREDICTED_SPECIES",
    CASE WHEN D.SPECIES != R.PREDICTED_CLASS_ID THEN 'INCORRECT'
         ELSE 'CORRECT' END AS "STATUS",
ROUND (PROBABILITY_0, 5) AS "PROBABILITY_0",
ROUND (PROBABILITY_1, 5) AS "PROBABILITY_1",
ROUND (PROBABILITY_2, 5) AS "PROBABILITY_2"
FROM
(SELECT *, ROW_NUMBER() OVER() AS RN FROM IRIS_RESULTS) R
JOIN
(SELECT *, ROW_NUMBER() OVER() AS RN FROM IRIS_TEST) D ON R.RN = D.RN;
This query joins the IRIS_RESULTS table, which holds the predictions from Tensorflow, with the IRIS_TEST table, which contains the test data that was submitted to Tensorflow. This lets us display both the test data and the resulting Tensorflow prediction.
For convenience, we're also adding an extra STATUS field that tells us whether each Tensorflow prediction was correct or incorrect.
And that’s it! With a simple query, we were able to explore the data that HANA imported directly from Tensorflow. Keep in mind that HANA is a high-speed, in-memory database; there’s plenty of further transformation and analysis you can do now that the data is in HANA. What we’ve looked at is just a starting point.
Consider how convenient this is: without built-in ML tools, you’d typically have to write a separate program to read the test data, send it to Tensorflow, receive the results, and write them to your database. HANA saves you a ton of time here; instead of writing that extra code, you can spend the time working with your data.
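For contrast, here’s a rough sketch of the glue code that HANA’s Tensorflow integration saves you from writing. It uses the hdbcli driver we installed earlier plus Tensorflow Serving’s REST API; the serving host, the MY_IRIS_RESULTS table, and the assumption that the model returns one probability vector per row are all hypothetical (and our shared HANA instance is read-only), so treat this purely as an illustration:
import requests
from hdbcli import dbapi
# Read the test data out of HANA
conn = dbapi.connect(address='hxehost', port=39015,
                     user='CONTEST_USER', password='CodeProject2019')
cursor = conn.cursor()
cursor.execute('SELECT SEPALLENGTH, SEPALWIDTH, PETALLENGTH, PETALWIDTH '
               'FROM ML_DATA.IRIS_TEST')
instances = [list(row) for row in cursor.fetchall()]
# Send it to a (hypothetical) Tensorflow Serving REST endpoint
response = requests.post('http://tf-serving-host:8501/v1/models/iris:predict',
                         json={'instances': instances})
predictions = response.json()['predictions']
# Write the predicted class and its 3 class probabilities back to a
# (hypothetical, writable) results table
for probabilities in predictions:
    predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
    cursor.execute('INSERT INTO MY_IRIS_RESULTS VALUES (?, ?, ?, ?)',
                   [predicted_class] + list(probabilities))
conn.commit()
Even in sketch form, that’s a fair amount of ceremony compared to a single CALL statement in HANA.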
Entry Code
To generate an entry code for this part of the challenge, run the following code in the HANA Web IDE:
CALL GENERATE_ENTRY_CODE(123456789);
But replace 123456789 with your CodeProject member number. You can find your member number on your CodeProject profile page, directly above your avatar image.
Once you've generated an entry code, go to the contest entry page. Select contest round 2 in the drop-down box, punch in your entry code, and click Submit.
Conclusion
Wow! We’ve been on a real whirlwind tour of SAP HANA and AWS SageMaker, so let’s pause for a moment to catch our breath.
We’ve covered a lot in a short time: we created our own SageMaker notebook, and then used it to visualize data pulled directly from HANA. We then walked through how easy it would be to drop this data into S3 and automatically train a model on it using SageMaker. Next, we looked at how we’d deploy this model to make it easy to use from services like AWS Lambda, AWS EC2, and AWS Lightsail. Finally, we dove back into HANA to see how easy it is to ask HANA to send data to Tensorflow, store the inference results in a new table, and then explore the results using a SQL query.
Since we’ve just scratched the surface of what’s possible with HANA and SageMaker, I strongly recommend trying them both in more depth.
For SageMaker, the Getting Started documentation is a great place to begin. It gives you an easy-to-understand overview of how to use AWS SageMaker through the entire lifecycle of a typical data science project.
For HANA, I recommend the HANA Express Edition official tutorials. These tutorials will walk you through the setup of HANA and then guide you through the creation of several different types of applications that will help you explore all of HANA’s capabilities.
History
- 31st August, 2019: Initial version