Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Data Mining with SQL Server 2005

0.00/5 (No votes)
12 Jun 2008 1  
An introduction to basic data mining with SQL Server 2005.

Introduction

In my research, we do a lot of data mining and machine learning. While we do not employ SQL Server 2005 to perform the operations, the underlying concepts are the same. Data mining can reveal statistical patterns in large amounts of data that we may have never even known existed. This article is intended to be used as a starting point and introduction into the world of data mining.

In this article, I will demonstrate basic data mining by predicting what grade a student will earn based on the student's gender, age, and grade level (like 4th grade, 5th grade etc.). We will train our classifier using known grades from existing students. Of course, there are lots of other real-world uses for data mining. The "Customers who bought this also bought..." link on Amazon.com is a perfect example. A more fun example is employing data mining on Subversion repositories to determine what files are usually changed together, or analyzing software profiles of passed and failed executions.

This is my first CodeProject article, so please be nice. ;) My intentions are that this will be the first article in a series of articles pertaining to data mining and SQL Server 2005. I did not see many articles on data mining on CodeProject, so perhaps this will be helpful. I listed this article as Intermediate, because it does not require much programming knowledge, but it does require a basic understanding of SQL Server 2005 and data mining, which I do not present.

Background

The book Data Mining with SQL Server 2005 (ISBN 0471462616) is a good starting point for anyone wanting to learn about data mining with SQL Server 2005. If you want more general information on data mining (and its statistical algorithms), then pick a different book. Like the book, this article requires that you have SQL Server 2005 Analysis Services and the SQL Server Business Intelligence Development Studio available.

Using the Code

Before we dive into the code part, let's set up our database of students. In the database (called Grades), we'll have two tables: PostedGrades and NewGrades. PostedGrades has the grades that have already been entered, and NewGrades will have the grades that we are going to predict. To create the PostedGrades table:

CREATE TABLE [dbo].[PostedGrades](
    [studentid] [nchar](2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [age] [int] NULL,
    [gender] [bit] NULL,
    [gradelevel] [int] NULL,
    [grade] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_PostedGrades] PRIMARY KEY CLUSTERED (
    [studentid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
    ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

To create the NewGrades table:

CREATE TABLE [dbo].[NewGrades](
    [studentid] [nchar](2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [age] [int] NULL,
    [gender] [bit] NULL,
    [gradelevel] [int] NULL,
CONSTRAINT [PK_NewGrades] PRIMARY KEY CLUSTERED (
    [studentid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
    ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Note that the two tables are identical, except that the NewGrades table does not have a grade field because this is the field we will be predicting. The gender field is a Boolean value, and your imagination can decide whether TRUE = boy or if TRUE = girl. Next, we'll populate the PostedGrades table with some records:

INSERT INTO PostedGrades VALUES ('1' , 10, TRUE, 4, 'A')
INSERT INTO PostedGrades VALUES ('2' , 9, TRUE, 4, 'B')
INSERT INTO PostedGrades VALUES ('3' , 10, TRUE, 4, 'A')

Now, we'll populate the NewGrades table with a record, too:

INSERT INTO NewGrades VALUES ('10' , 10, TRUE, 4)

That's all for the database setup. There are two projects associated with this article - an ASP.NET project to demonstrate data mining (source given below), and an Analysis Services project (instructions given below on how to create this project). First, let's consider the Analysis Services project. Here's how to create this project:

  1. Create a new Analysis Services project using SQL Server Business Intelligence Development Studio.
  2. Add a new Data Source, call it Grades, and have it point to the newly created Grades database.
  3. Add a new Data Source View employing the new data source (from the last step) and name it Grades, too.
  4. Add a new Mining Structure and have it use the Microsoft Decision Trees algorithm. When prompted, the studentid field is the key, and all remaining fields are inputs. gender should be marked as Boolean, age and gradelevel as Long, and the remaining fields as Text.

Now, publish the project to the database. You will need to open the project's properties, click the Deployment options, and enter in the name of the database you want to publish the Analysis Services project to (in this example, the database name is Grades). We are now ready to utilize the data mining on our NewGrades table. To do so, we'll use a very simple ASP.NET page with two buttons, TrainModelButton and PredictButton, which do exactly what you'd expect:

Imports System.Data.OleDb

Partial Class _Default
  Inherits System.Web.UI.Page

  ' Just change the Initial Catalog to your database (our example database is Grades).
  Private cs As String = "Provider=MSOLAP.3;Data Source=localhost;Initial Catalog=Grades"
  Private conn As New OleDbConnection()

  Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) 
    Handles Me.Load 
    conn.Open()
  End Sub

  Protected Sub Page_Unload(ByVal sender As Object, ByVal e As System.EventArgs) 
    Handles Me.Unload
    conn.Close()
  End Sub

  Protected Sub TrainModelButton_Click(ByVal sender As Object, _
    BVal e As System.EventArgs) Handles TrainModelButton.Click
    
    ' Delete the mining structure so we can re-train it.
    Dim SQL As String = "DELETE FROM MINING STRUCTURE PostedGrades"

    Dim CMDDelete As OleDbCommand = New OleDbCommand(SQL, conn)
    CMDDelete.ExecuteNonQuery()
    CMDDelete.Dispose()

    ' Train the mining structure with data from the PostedGrades table.
    Dim PipeDataToModel As String = "INSERT INTO MINING STRUCTURE PostedGrades " _
    & "(studentid, gender, age, gradelevel, grade) " _
    & "OPENQUERY (Grades, 'SELECT studentid, gender, age, gradelevel, grade " _
    & FROM PostedGrades')"

    Dim CMD As New OleDbCommand(PipeDataToModel, conn)
    CMD.ExecuteNonQuery()
    CMD.Dispose()
    
  End Sub

  Protected Sub PredictButton_Click(ByVal sender As Object, _
    BVal e As System.EventArgs) Handles PredictButton.Click

    ' Make our query to predict grades for the students in the NewGrades table.
    Dim Query As String = "SELECT T.studentid, PostedGrades.grade, " _
    & PredictProbability(grade) FROM " _
    & "PostedGrades NATURAL PREDICTION JOIN OPENQUERY(Grades, " _
    & 'SELECT * FROM NewGrades') AS T"        
        
    Dim CMD As New OleDbCommand(Query, conn)

    Dim myReader As OleDbDataReader = CMD.ExecuteReader()

    If myReader.HasRows = True Then
    While myReader.Read()
      ' Just output the results of the query.
      Response.Write(myReader(0).ToString & " " & _
        & myReader(1).ToString & " " & _
        myReader(2).ToString & "<BR>")
    End While
    End If

    myReader.Close()
    CMD.Dispose()

  End Sub

End Class

The purpose of the code in TrainModelButton_Click() is to read the data from the PostedGrades table into the mining structure (train the classifier). Once that is done, PredictButton_Click() will predict the grades for the students in the NewGrades table and write the results to the page.

The mining structure PostedGrades is first dropped in TrainModelButton_Click() because we can not incrementally train a classifier - we just have to delete it and make it again. Re-training the classifier is probably not something we would do every day in a real-world application. (You do not have to re-train it every time - just the first time.)

For the possible exception of the OpenQuery function, the SQL used is pretty straight-forward. The OpenQuery function lets you perform a query on a linked server. (See the MSDN for a more detailed description.) In our case, the linked server is just the database server.

Points of Interest

Data mining works best with large amounts of data. Using three records to train a classifier is essentially worthless, but it works for illustrative purposes. If you want to try this on your own, the hardest part can often be coming up with a large amount of data to use. I recommend using logs to learn with, because they typically have lots of data in them (web logs, email access logs, and so on). Sequence clustering will be best for long lists of ordered events.

History

  • June 12, 2008 - Initial publish.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here