Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / WinForms

Data Mining with SQL Server 2005 without Analysis Services

4.56/5 (4 votes)
31 Dec 2008Public Domain3 min read 39.9K   1.6K  
A C# implementation of CHAID for MSSQL 2005 à la Cognos scenario.

Image 1

Introduction

Long time ago, I worked with a Cognos data mining product called Scenario. I liked it so much that I decided to write my own implementation of the underlying CHAID algorithm (see Background below). Back then, I wrote it in Borland C++ Builder v. 4 (see BCB source download above) and tested it against Scenario. Having got what I wanted (identical results), I lost further interest and put my code in cold storage ...

After almost a decade, I am facing a data mining task again. In the meantime, the world has changed quite a bit. The old Cognos Scenario seems to have been forgotten (and I do not have access to it any more), and databases have evolved, too. At work, we have a SQL Server 2005 database suitable for data mining, and CHAID looks like a good candidate for the method. Naturally, I remembered my code, and looked for a way to reuse it as much as possible. I could not use it directly because of the change in the database platform; my Borland Database Engine license does not include any support for MS SQL Server. I briefly considered trying to bridge the gap with a third-party component, but I failed to find a free one on the web. Then, I discovered Microsoft Chart Controls for .NET Framework 3.5, and decided to re-write my application in C#. I must admit I was also inspired by the "Data Mining with SQL Server 2005" article here on CodeProject. Likewise, I listed this article as Intermediate, because I am not a professional programmer myself. The tough part might be understanding the CHAID algorithm and the statistical background, Bonferonni adjustments, etc.

Background

CHAID stands for Chi-square Automatic Interaction Detection. My implementation is based on an older article by G. V. Kass (An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics 29, No. 2, 1980, pp. 119-127). The Cognos documentation is publicly unavailable, but referring to it would be a major advantage, too. (I think I used to have access to a document entitled something like "How Scenario Works".)

Using the code

The code can be used to analyze categorical data where all predictors are "free" (not monotonic). Both dependent and predictor variables should be represented as strings in the database. I have only tested my application against a single MSSQL table named Records, but I believe database views could be used as well. For a quick start, I would recommend setting up the Records table as follows:

SQL
USE [DatabaseName]
GO
/****** Object:  Table [dbo].[Records]    Script Date: 12/31/2008 22:27:05 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Records](
    [ID] [int] NOT NULL,
    [FlavorPreference] [nvarchar](255) NULL,
    [FreeGoProductSelected] [nvarchar](255) NULL,
    [WrappingColorPreference] [nvarchar](255) NULL,
    [PricePoint] [nvarchar](255) NULL,
    [WouldBuy] [nvarchar](255) NULL,
    [SurveyLocation] [nvarchar](255) NULL,
    [Age] [nvarchar](255) NULL,
    [FlavorCode] [nvarchar](255) NULL,
    [Gender] [nvarchar](255) NULL,
    [WebAccess] [nvarchar](255) NULL,
    [LikesCamping] [nvarchar](255) NULL,
    [GoKnowledge] [nvarchar](255) NULL,
PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON)
ON [PRIMARY]
) ON [PRIMARY]

I have not includes any data in the downloads, but I am sure you can fill in your table with your own test data. Please do not forget to edit the app.config file.

I do not think anyone would want to actually use the original C++ implementation; it is only meant for reference. Please note, there is no configuration available, e.g., table and column names are hard-coded there.

If you succeed in building the C# application and populating your MSSQL database, you should be able to see something like this, with the corresponding BCB screenshot at the top.

Image 2

Points of interest

If you compare the C# code (MSSQL) with the original C++ version (BDE, MS Access 97), you will notice that the new SQL SELECT statements (auto-generated behind the scenes) include "order by" clauses as well as "group by" ones. For me, it was quite a surprise to find out it was necessary to specify both clauses in the case of SQL Server. Another lesson learned was the importance of the MultipleActiveResultSets=True specification in the database connection string, see app.config.

History

  • December 31, 2008 - Initial publishing.

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication