Introduction
Long time ago, I worked with a Cognos data mining product called Scenario. I liked it so much that I decided to write my own implementation of the underlying CHAID algorithm (see Background below). Back then, I wrote it in Borland C++ Builder v. 4 (see BCB source download above) and tested it against Scenario. Having got what I wanted (identical results), I lost further interest and put my code in cold storage ...
After almost a decade, I am facing a data mining task again. In the meantime, the world has changed quite a bit. The old Cognos Scenario seems to have been forgotten (and I do not have access to it any more), and databases have evolved, too. At work, we have a SQL Server 2005 database suitable for data mining, and CHAID looks like a good candidate for the method. Naturally, I remembered my code, and looked for a way to reuse it as much as possible. I could not use it directly because of the change in the database platform; my Borland Database Engine license does not include any support for MS SQL Server. I briefly considered trying to bridge the gap with a third-party component, but I failed to find a free one on the web. Then, I discovered Microsoft Chart Controls for .NET Framework 3.5, and decided to re-write my application in C#. I must admit I was also inspired by the "Data Mining with SQL Server 2005" article here on CodeProject. Likewise, I listed this article as Intermediate, because I am not a professional programmer myself. The tough part might be understanding the CHAID algorithm and the statistical background, Bonferonni adjustments, etc.
Background
CHAID stands for Chi-square Automatic Interaction Detection. My implementation is based on an older article by G. V. Kass (An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics 29, No. 2, 1980, pp. 119-127). The Cognos documentation is publicly unavailable, but referring to it would be a major advantage, too. (I think I used to have access to a document entitled something like "How Scenario Works".)
Using the code
The code can be used to analyze categorical data where all predictors are "free" (not monotonic). Both dependent and predictor variables should be represented as strings in the database. I have only tested my application against a single MSSQL table named Records, but I believe database views could be used as well. For a quick start, I would recommend setting up the Records table as follows:
USE [DatabaseName]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Records](
[ID] [int] NOT NULL,
[FlavorPreference] [nvarchar](255) NULL,
[FreeGoProductSelected] [nvarchar](255) NULL,
[WrappingColorPreference] [nvarchar](255) NULL,
[PricePoint] [nvarchar](255) NULL,
[WouldBuy] [nvarchar](255) NULL,
[SurveyLocation] [nvarchar](255) NULL,
[Age] [nvarchar](255) NULL,
[FlavorCode] [nvarchar](255) NULL,
[Gender] [nvarchar](255) NULL,
[WebAccess] [nvarchar](255) NULL,
[LikesCamping] [nvarchar](255) NULL,
[GoKnowledge] [nvarchar](255) NULL,
PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
) ON [PRIMARY]
I have not includes any data in the downloads, but I am sure you can fill in your table with your own test data. Please do not forget to edit the app.config file.
I do not think anyone would want to actually use the original C++ implementation; it is only meant for reference. Please note, there is no configuration available, e.g., table and column names are hard-coded there.
If you succeed in building the C# application and populating your MSSQL database, you should be able to see something like this, with the corresponding BCB screenshot at the top.
Points of interest
If you compare the C# code (MSSQL) with the original C++ version (BDE, MS Access 97), you will notice that the new SQL SELECT
statements (auto-generated behind the scenes) include "order by
" clauses as well as "group by
" ones. For me, it was quite a surprise to find out it was necessary to specify both clauses in the case of SQL Server. Another lesson learned was the importance of the MultipleActiveResultSets=True
specification in the database connection string, see app.config.
History
- December 31, 2008 - Initial publishing.