High Level Data Profiler Script

Gene R. Browning

4.86/5 (4 votes)

3 Mar 2016CPOL2 min read

20.9K

Get a quick look at your data characteristics using this tool.

Introduction

In working with data, we frequently need to decipher unfamiliar files and tables and often we have the need for a high level understanding of our existing datasets. When these situations arise, a data profiling tool can come in very handy. The script I present here generates dynamic SQL to provide column level measures for Avg/Min/Max length (character data) or value (numeric and date data), distinct value counts and null value counts. The schema/table names analyzed can be controlled by simply changing the WHERE clause for the database schema selection. An example output follows:

Using the code

The script uses a cursor against the INFORMATION_SCHEMA views to loop through the selected schemas, tables and views to construct and execute a profiling SELECT statement for each column. The SELECT statement is constructed based on the generic data type of the column. Predefined strings exist for each data type and contain the arbitrary string of “@@replace” which is substituted with the correct column name using the REPLACE function.

The result of each select execution is stored in a table variable for final presentation.

SQL

SET NOCOUNT ON
SET ANSI_WARNINGS OFF
-- Suppresses the "Null value is eliminated by an aggregate..." warning

DECLARE @TableName sysname
DECLARE @TableType VARCHAR(15)
DECLARE @Schema sysname
DECLARE @Catalog sysname
DECLARE @ColumnName sysname
DECLARE @OrdinalPosition INT
DECLARE @DataType sysname
DECLARE @char INT
DECLARE @num TINYINT
DECLARE @date SMALLINT
DECLARE @sql VARCHAR(MAX)
DECLARE @stmtString VARCHAR(MAX)
DECLARE @stmtNum VARCHAR(MAX)
DECLARE @stmtDate VARCHAR(MAX)
DECLARE @stmtOther VARCHAR(MAX)
DECLARE @stmtUnsup VARCHAR(MAX)
DECLARE @q CHAR(1)
 -- single quote
DECLARE @qq CHAR(2)
 -- double quote

-----------------------------------------------------------------------------------
-- Table variable to collect the final results
-----------------------------------------------------------------------------------
DECLARE @Results TABLE
    (
      [Schema] sysname ,
      [Catalog] sysname ,
      [Table Name] sysname ,
      [Table Type] VARCHAR(10) ,
      [Column Name] sysname ,
      [Seq] INT ,
      [Data Type] sysname ,
      [Avg Len/Val] NUMERIC ,
      [Min Len/Val] NUMERIC ,
      [Max Len/Val] NUMERIC ,
      [Min Date] DATETIME ,
      [Max Date] DATETIME ,
      [Distinct Values] NUMERIC ,
      [Num NULL] NUMERIC
    )

-----------------------------------------------------------------------------------
-- quote char
-----------------------------------------------------------------------------------
SET @q = ''''
SET @qq = @q + @q

-----------------------------------------------------------------------------------
-- The dynamic replacement strings for various data types
-----------------------------------------------------------------------------------
SET @stmtUnsup = 'null, null, null, null, null, null, 0'
SET @stmtString = 'avg(len([@@replace])), ' + 'min(len([@@replace])), ' + 'max(len([@@replace])), ' + 'null, null, count(distinct [@@replace]), '
    + 'sum(case when [@@replace] is null then 1 else 0 end)'
SET @stmtNum = 'avg(CAST(isnull([@@replace], 0) AS FLOAT)), ' + 'min([@@replace]) AS [Min @@replace], ' + 'max([@@replace]) AS [Max @@replace], '
    + 'null, null, count(distinct @@replace) AS [Dist Count @@replace], ' + 'sum(case when @@replace is null then 1 else 0 end) AS [Num Null @@replace]'
SET @stmtDate = 'null, null, null, min([@@replace]) AS [Min @@replace], ' + 'max([@@replace]) AS [Max @@replace], '
    + 'count(distinct @@replace) AS [Dist Count @@replace], ' + 'sum(case when @@replace is null then 1 else 0 end) AS [Num Null @@replace]'
SET @stmtOther = 'null, null, null, null, null, count(distinct @@replace) AS [Dist Count @@replace], '
    + 'sum(case when @@replace is null then 1 else 0 end) AS [Num Null @@replace]'

-----------------------------------------------------------------------------------
-- The cursor to read through the schema.  Change the WHERE clause to control the tables/views used
-----------------------------------------------------------------------------------
DECLARE TableCursor CURSOR
FOR
    SELECT  c.TABLE_SCHEMA ,
            c.TABLE_CATALOG ,
            c.TABLE_NAME ,
            t.TABLE_TYPE ,
            c.COLUMN_NAME ,
            c.ORDINAL_POSITION ,
            c.DATA_TYPE ,
            c.CHARACTER_MAXIMUM_LENGTH ,
            c.NUMERIC_PRECISION ,
            c.DATETIME_PRECISION
    FROM    INFORMATION_SCHEMA.COLUMNS c
            INNER JOIN INFORMATION_SCHEMA.TABLES t ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
                                                      AND t.TABLE_NAME = c.TABLE_NAME
    WHERE   c.TABLE_SCHEMA IN ( 'dbo' )  --------------------------------
            AND c.TABLE_NAME LIKE 't_per%'  -- <<< Schema, table and view names to analyze go here
            AND t.TABLE_TYPE NOT IN ( 'VIEW' ) --------------------------------
    ORDER BY c.TABLE_NAME ,
            c.ORDINAL_POSITION

OPEN TableCursor

FETCH NEXT
FROM TableCursor
INTO @Schema, @Catalog, @TableName, @TableType, @ColumnName, @OrdinalPosition, @DataType, @char, @num, @date

-----------------------------------------------------------------------------------
-- Process through the database schema
-----------------------------------------------------------------------------------
WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = CASE WHEN @DataType = 'image' THEN @stmtUnsup
                        WHEN @DataType = 'text' THEN @stmtUnsup
                        WHEN @DataType = 'ntext' THEN @stmtUnsup
                        WHEN @char IS NOT NULL THEN @stmtString
                        WHEN @num IS NOT NULL THEN @stmtNum
                        WHEN @date IS NOT NULL THEN @stmtDate
                        ELSE @stmtOther
                   END
        SET @sql = REPLACE(@sql, <a href="mailto:'@@replace'">'@@replace'</a>, @ColumnName)

        IF @sql <> ''
            BEGIN
                SET @Schema = @q + @Schema + @q
                SET @Catalog = @q + @Catalog + @q
                SET @TableName = @q + @TableName + @q
                SET @TableType = @q + @TableType + @q
                SET @ColumnName = @q + REPLACE(@ColumnName, @q, @qq) + @q
                SET @DataType = @q + @DataType + @q
  
                SET @sql = 'SELECT ' + @Schema + ', ' + @Catalog + ', ' + @TableName + ', ' + @TableType + ', ' + @ColumnName + ', '
                    + CONVERT(VARCHAR(5), @OrdinalPosition) + ', ' + @DataType + ', ' + @sql + ' FROM [' + REPLACE(@Schema, '''', '') + '].['
                    + REPLACE(@TableName, '''', '') + ']'

                PRINT @sql

                INSERT  INTO @Results
                        EXECUTE ( @sql
                               )
            END

        FETCH NEXT
  FROM TableCursor
  INTO @Schema, @Catalog, @TableName, @TableType, @ColumnName, @OrdinalPosition, @DataType, @char, @num, @date
    END

-----------------------------------------------------------------------------------
-- Clean-up
-----------------------------------------------------------------------------------
CLOSE TableCursor

DEALLOCATE TableCursor

-----------------------------------------------------------------------------------
-- Display the results
-----------------------------------------------------------------------------------
SELECT  [Schema] ,
        [Catalog] ,
        [Table Name] ,
        CASE [Table Type]
          WHEN 'BASE TABLE' THEN 'TABLE'
          ELSE [Table Type]
        END AS 'Table Type' ,
        [Column Name] ,
        [Seq] ,
        [Data Type] ,
        [Avg Len/Val] ,
        [Min Len/Val] ,
        [Max Len/Val] ,
        [Min Date] ,
        [Max Date] ,
        [Distinct Values] ,
        [Num NULL]
FROM    @Results
ORDER BY [Table Name] ,
        [Seq] ,
        [Column Name]

 
-----------------------------------------------------------------------------------
-- Reset
-----------------------------------------------------------------------------------
SET NOCOUNT OFF
SET ANSI_WARNINGS ON

Additional Remarks

My intended usage for this report is to provide a big-picture view of the data by pasting the results into Excel. This is why I share the Avg/Min/Max columns with length or value based on data type to maximize real estate in the spreadsheet. This is also why I include spaces in the column names to allow word wrap in the Excel column headers.

I used the ISO standard INFORMATION_SCHEMA views as opposed to other system schema tables to facilitate the conversion of the script for use with other databases that support ISO compliant views.

The PRINT statement in the loop displays the individual SELECT statements to aid as a starting point for more detailed column analysis.

I hope this helps out in the seemingly endless quest for clean data!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)