A continuation of the data dictionary discussion with information about how the documentation is done.
Today, I’m going to continue discussing the data dictionary tables we’ve set up at the Boston Public Schools. The SQL for creating the tables and inserting the baseline example records is in my previous data dictionary post. The data inserted uses a few tables in the AdventureWorks
database, and I’ll continue with this example.
At BPS, there were well over a hundred TableInfo
records and several thousand DataInfo
records that needed to be updated. We started with documenting the tables – here’s a SQL script for getting the necessary update
statements:
SELECT ' update TableInfo set tableGroup = '''', description = ''' +
description + ''' where tablename = ''' + tablename + ''''
FROM TableInfo
UPDATE TableInfo SET tableGroup = 'Location',
description = 'Address' WHERE tablename = 'Address'
UPDATE TableInfo SET tableGroup = 'Person',
description = 'Contact' WHERE tablename = 'Contact'
UPDATE TableInfo SET tableGroup = 'Location',
description = 'StateProvince' WHERE tablename = 'StateProvince'
We spent a few days focusing on building this out as any data analysis effort will begin at the table level before the field level. At Boston Public Schools, we divided tables into a number of different groups, including Registration, Assignment, Transportation, Enrollment, Attendance, Assessment, Administrative, and a handful of others. The groupings will help identify what tables to focus on when we’re working on a particular functional area for integration into the new Student Information System. For the description column, we generally provide a 1-2 sentence high-level description of what the table is.
After finalizing the table dictionary, we started focusing on the individual fields in these tables. Here’s how we got the update
statements for DataInfo
along with some example output:
SELECT ' update DataInfo set [description] = ''' + columnname + ''',
tableref = '''', lutyperef = '''',
comments = '''' where tableid = ' +
CAST(ti.tableid AS VARCHAR(5)) + ' and columnname = ''' +
columnname + '''' + ' -- Group: ' + tableGroup + ', Table: ' + tablename
FROM DataInfo di
INNER JOIN TableInfo ti ON ti.tableid = di.tableid
WHERE di.[description] = columnname
AND ti.tablename NOT IN
(
SELECT tablename FROM DataInfo d
INNER JOIN TableInfo t ON d.tableid = d.tableid
WHERE [columnname] <> d.[description]
GROUP BY tablename
HAVING COUNT(*) > 0
)
ORDER BY tableGroup, ti.tablename, columnorder
UPDATE DataInfo SET [description] = 'AddressID', _
tableref = '', lutyperef = '', comments = ''
WHERE tableid = 1 AND columnname = 'AddressID'
UPDATE DataInfo SET [description] = 'AddressLine1', _
tableref = '', lutyperef = '', comments = ''
WHERE tableid = 1 AND columnname = 'AddressLine1'
UPDATE DataInfo SET [description] = 'ContactID', _
tableref = '', lutyperef = '', comments = ''
WHERE tableid = 2 AND columnname = 'ContactID'
UPDATE DataInfo SET [description] = 'NameStyle', _
tableref = '', lutyperef = '', comments = ''
WHERE tableid = 2 AND columnname = 'NameStyle'
Notice that there are four values being updated: description
, tableref
, lutyperef
, and comments
. Here’s some information about each:
Description
is a phrase or sentence describing the field. We left this field unchanged if the field name was self-evident for what the column held. TableRef
is used to specify if the field is referenced elsewhere [typically but not always linked by a foreign key constraint]. For DataInfo
, the TableRef
column would be populated for column tableid
. LuTypeRef
is used to specify if the field is a value we have stored in our primary lookup table. We have hundreds of types, and many fields throughout the database store codes that map to a particular type in this lookup table. For example, in our student language information table, all first language codes map to the ‘Language
’ type in the lookup table. - Comments are used if there is some complicated or subtle information about the field that isn’t self-evident from the field name or the brief description. For example, the sequence number field in our primary enrollment table has a 4-5 sentence comment for what the field is used for and how to pull the active enrollment record based on the sequence number.
Have we finished the documentation process at Boston Public Schools? No, it is still ongoing, although at this point, only lower priority tables remain. So far, it has been a very interesting exercise – I’ve learned a good deal about the data and table groupings that I didn’t know previously.