Introduction
In my research, we do a lot of data mining and machine learning. While we do not employ SQL Server 2005 to perform the operations, the underlying concepts are the same. Data mining can reveal statistical patterns in large amounts of data that we may have never even known existed. This article is intended to be used as a starting point and introduction into the world of data mining.
In this article, I will demonstrate basic data mining by predicting what grade a student will earn based on the student's gender, age, and grade level (like 4th grade, 5th grade etc.). We will train our classifier using known grades from existing students. Of course, there are lots of other real-world uses for data mining. The "Customers who bought this also bought..." link on Amazon.com is a perfect example. A more fun example is employing data mining on Subversion repositories to determine what files are usually changed together, or analyzing software profiles of passed and failed executions.
This is my first CodeProject article, so please be nice. ;) My intentions are that this will be the first article in a series of articles pertaining to data mining and SQL Server 2005. I did not see many articles on data mining on CodeProject, so perhaps this will be helpful. I listed this article as Intermediate, because it does not require much programming knowledge, but it does require a basic understanding of SQL Server 2005 and data mining, which I do not present.
Background
The book Data Mining with SQL Server 2005 (ISBN 0471462616) is a good starting point for anyone wanting to learn about data mining with SQL Server 2005. If you want more general information on data mining (and its statistical algorithms), then pick a different book. Like the book, this article requires that you have SQL Server 2005 Analysis Services and the SQL Server Business Intelligence Development Studio available.
Using the Code
Before we dive into the code part, let's set up our database of students. In the database (called Grades), we'll have two tables: PostedGrades and NewGrades. PostedGrades has the grades that have already been entered, and NewGrades will have the grades that we are going to predict. To create the PostedGrades table:
CREATE TABLE [dbo].[PostedGrades](
[studentid] [nchar](2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[age] [int] NULL,
[gender] [bit] NULL,
[gradelevel] [int] NULL,
[grade] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_PostedGrades] PRIMARY KEY CLUSTERED (
[studentid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
To create the NewGrades table:
CREATE TABLE [dbo].[NewGrades](
[studentid] [nchar](2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[age] [int] NULL,
[gender] [bit] NULL,
[gradelevel] [int] NULL,
CONSTRAINT [PK_NewGrades] PRIMARY KEY CLUSTERED (
[studentid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Note that the two tables are identical, except that the NewGrades table does not have a grade
field because this is the field we will be predicting. The gender
field is a Boolean
value, and your imagination can decide whether TRUE = boy or if TRUE = girl. Next, we'll populate the PostedGrades table with some records:
INSERT INTO PostedGrades VALUES ('1' , 10, TRUE, 4, 'A')
INSERT INTO PostedGrades VALUES ('2' , 9, TRUE, 4, 'B')
INSERT INTO PostedGrades VALUES ('3' , 10, TRUE, 4, 'A')
Now, we'll populate the NewGrades table with a record, too:
INSERT INTO NewGrades VALUES ('10' , 10, TRUE, 4)
That's all for the database setup. There are two projects associated with this article - an ASP.NET project to demonstrate data mining (source given below), and an Analysis Services project (instructions given below on how to create this project). First, let's consider the Analysis Services project. Here's how to create this project:
- Create a new Analysis Services project using SQL Server Business Intelligence Development Studio.
- Add a new Data Source, call it Grades, and have it point to the newly created Grades database.
- Add a new Data Source View employing the new data source (from the last step) and name it Grades, too.
- Add a new Mining Structure and have it use the Microsoft Decision Trees algorithm. When prompted, the
studentid
field is the key, and all remaining fields are inputs. gender
should be marked as Boolean
, age
and gradelevel
as Long
, and the remaining fields as Text
.
Now, publish the project to the database. You will need to open the project's properties, click the Deployment options, and enter in the name of the database you want to publish the Analysis Services project to (in this example, the database name is Grades). We are now ready to utilize the data mining on our NewGrades table. To do so, we'll use a very simple ASP.NET page with two buttons, TrainModelButton
and PredictButton
, which do exactly what you'd expect:
Imports System.Data.OleDb
Partial Class _Default
Inherits System.Web.UI.Page
Private cs As String = "Provider=MSOLAP.3;Data Source=localhost;Initial Catalog=Grades"
Private conn As New OleDbConnection()
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs)
Handles Me.Load
conn.Open()
End Sub
Protected Sub Page_Unload(ByVal sender As Object, ByVal e As System.EventArgs)
Handles Me.Unload
conn.Close()
End Sub
Protected Sub TrainModelButton_Click(ByVal sender As Object, _
BVal e As System.EventArgs) Handles TrainModelButton.Click
Dim SQL As String = "DELETE FROM MINING STRUCTURE PostedGrades"
Dim CMDDelete As OleDbCommand = New OleDbCommand(SQL, conn)
CMDDelete.ExecuteNonQuery()
CMDDelete.Dispose()
Dim PipeDataToModel As String = "INSERT INTO MINING STRUCTURE PostedGrades " _
& "(studentid, gender, age, gradelevel, grade) " _
& "OPENQUERY (Grades, 'SELECT studentid, gender, age, gradelevel, grade " _
& FROM PostedGrades
Dim CMD As New OleDbCommand(PipeDataToModel, conn)
CMD.ExecuteNonQuery()
CMD.Dispose()
End Sub
Protected Sub PredictButton_Click(ByVal sender As Object, _
BVal e As System.EventArgs) Handles PredictButton.Click
Dim Query As String = "SELECT T.studentid, PostedGrades.grade, " _
& PredictProbability(grade) FROM " _
& "PostedGrades NATURAL PREDICTION JOIN OPENQUERY(Grades, " _
& 'SELECT * FROM NewGrades') AS T"
Dim CMD As New OleDbCommand(Query, conn)
Dim myReader As OleDbDataReader = CMD.ExecuteReader()
If myReader.HasRows = True Then
While myReader.Read()
Response.Write(myReader(0).ToString & " " & _
& myReader(1).ToString & " " & _
myReader(2).ToString & "<BR>")
End While
End If
myReader.Close()
CMD.Dispose()
End Sub
End Class
The purpose of the code in TrainModelButton_Click()
is to read the data from the PostedGrades table into the mining structure (train the classifier). Once that is done, PredictButton_Click()
will predict the grades for the students in the NewGrades table and write the results to the page.
The mining structure PostedGrades
is first dropped in TrainModelButton_Click()
because we can not incrementally train a classifier - we just have to delete it and make it again. Re-training the classifier is probably not something we would do every day in a real-world application. (You do not have to re-train it every time - just the first time.)
For the possible exception of the OpenQuery
function, the SQL used is pretty straight-forward. The OpenQuery
function lets you perform a query on a linked server. (See the MSDN for a more detailed description.) In our case, the linked server is just the database server.
Points of Interest
Data mining works best with large amounts of data. Using three records to train a classifier is essentially worthless, but it works for illustrative purposes. If you want to try this on your own, the hardest part can often be coming up with a large amount of data to use. I recommend using logs to learn with, because they typically have lots of data in them (web logs, email access logs, and so on). Sequence clustering will be best for long lists of ordered events.
History
- June 12, 2008 - Initial publish.