Introduction
This article gives a quick introduction to the concept of Data Quality, and presents a solution to help you integrate Data Quality into your own solutions in a (reasonably) pain-free manner. Your users and customers will thank you - read on!
Background
Most developers who don't have their heads stuck in the sand will have at least some passing experience of code testing. It's one of those things that you can struggle with for a while, but once you get into it and realise the benefits, you wonder how you ever did without it.
It's also the case that many times you ignore code testing because of a big ole list of excuses:
- it's too hard
- it's a waste of time
- my manager won't agree to it
- the client thinks it's a waste of time
- I think it's a waste of time
- my manager thinks it's a waste of time
- I don't understand it
- I don't have time to learn it (too busy watching YouTube, ahem!)
- etc... you know the way it goes ... don't say you don't :)
When you don't code test, and a project starts to get big, things can run amok very quickly and get out of control. Before you know where you are, one change to code over here has an unintended impact over there, and you have a big ole grumpy legacy application that's only a few months old and seems to keep getting problem after problem.
Understandably, the customer is annoyed and shouts 'X was working last week, then you changed/fixed some other thing, and now X is broken...', and you end up stressed and unhappy, working on this monster 'legacy' app (that in reality is only a youngster in the grand scheme of things).
It's a mess, you know it... but it's still too hard, and a waste of time, and your manager really won't agree to it,
let alone the customer... <sigh, cry, oh dear, here comes another day...>
Just like code testing, Data Quality is one of those things that we generally don't pay attention to until it comes and bites us, and when it does, it's usually a customer that notices it, and as always, we poor beleaguered developers get to pay the price. I'm starting into a Data Quality project, so I thought it might be good to talk about what it is, and how we can put some simple checks and balances in place to help us manage our data and improve its quality.
Don't change data
We can approach Data Quality on a number of levels. First of all, we can put some simple, low-hanging things in place that will help us maintain data integrity, and thus data quality, should things go wrong. The first golden rule is simply: don't change data. By this I mean once data is committed, always keep a solid record of the state of the data before you make any change. This can be on a macro level, like data backups (before big changes), or it can be a data audit trail system that treats data as immutable, storing the original and making a fresh copy of the actual 'working data' each time a change is detected. Some databases have built-in facilities for this (MS SQL's new Temporal Table feature is extremely powerful in this regard) - if you don't have that option, you can always implement your own table/record audit system (my article on the subject of making a simple audit trail in a .NET application may be useful).
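To make that idea concrete, here is a minimal sketch of a roll-your-own audit approach. The class and method names are purely illustrative (not from any particular library), and a real system would write the audit rows to a table rather than an in-memory list - the point is simply that the 'before' state gets recorded before the working copy is ever touched:

using System;
using System.Collections.Generic;

// A hypothetical record we want to protect (illustration only)
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Balance { get; set; }
}

// One audit row per change - the 'before' state is kept and never overwritten
public class AuditEntry
{
    public int CustomerId { get; set; }
    public string SnapshotBefore { get; set; }
    public DateTime ChangedUtc { get; set; }
    public string ChangedBy { get; set; }
}

public static class AuditedStore
{
    static readonly List<AuditEntry> AuditTrail = new List<AuditEntry>();

    // Record the previous state first, only then apply the change
    public static void Update(Customer current, Action<Customer> applyChange, string changedBy)
    {
        AuditTrail.Add(new AuditEntry
        {
            CustomerId = current.Id,
            SnapshotBefore = string.Format("Name={0};Balance={1}", current.Name, current.Balance),
            ChangedUtc = DateTime.UtcNow,
            ChangedBy = changedBy
        });

        applyChange(current);   // the working copy only changes after the audit row exists
    }
}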
The great benefit of using an auditing/log system, especially in a large system, is that you at least have a means of working backwards from an event to find out what it actually was that caused *that* piece of data to change (you know, the one the customer spotted and is simply _killing_ you over...). The thing is, a data backup is all well and good. Sure, you can reset the state of the database back to a previous point in time, but what else changed? What else are you actually resetting with your restore? And if you only restore the particular bit that went foo-bar, are you also causing some other unintended effect? ... If you've been in this situation, you know that unfortunately your questions seem to raise even more questions, and we still have an unhappy customer, and that *_makes_us_sad_*
Once you have a methodology that acts as a solid foothold in case something goes wrong, you can then get a little bit more exotic (swish!) and start looking at implementing a rules engine that is based around the core rules of Data Quality.
Set out rules
A lot of clever folk have addressed the issue of data quality over the years - a quick search for 'data quality' in Google Scholar brings up over 6.3 million academic references alone, the top few of which have a few thousand citations each (a pretty good indicator of academic quality, as it were!). One of the themes you will come across when you start looking into the field of data quality is what's referred to as the 'Core rules of Data Quality'. The rules can be looked at from a number of different perspectives. From our point of view for this article, we will be looking at them from a validation point of view, either when we bring data into our system or when we are evaluating it prior to use. Depending on our particular needs, we can evaluate the quality of the data and either decide it's simply go/no-go, good/bad data, or we might decide to implement a traffic light kind of system where things pass, have warnings, or fail the quality check.
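If it helps to picture that, here is a tiny sketch of the go/no-go versus traffic-light idea - the QualityStatus/QualityResult names and the thresholds are just mine for illustration:

// Hypothetical result types for a quality check - either treat it as pass/fail,
// or use the full pass/warn/fail 'traffic light'
public enum QualityStatus { Pass, Warning, Fail }

public class QualityResult
{
    public string CheckName { get; set; }
    public QualityStatus Status { get; set; }
    public string Message { get; set; }
}

// Example: a completeness check that warns below 95% populated and fails below 80%
public static QualityResult CheckCompleteness(int populated, int total)
{
    double ratio = total == 0 ? 0 : (double)populated / total;

    QualityStatus status;
    if (ratio >= 0.95) status = QualityStatus.Pass;
    else if (ratio >= 0.80) status = QualityStatus.Warning;
    else status = QualityStatus.Fail;

    return new QualityResult
    {
        CheckName = "Completeness",
        Status = status,
        Message = string.Format("{0:P0} of expected values present", ratio)
    };
}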
Categories of rules
Data Quality rules fall into four main categories:
- Business entity rules
- Data elements
- Dependencies
- Validity
As developers, we are pretty used to business entity rules. These relate to the fact that we expect entities to have some aspect of uniqueness to them (name, customer_id, etc.), and they describe the relationships between entities. For example, an entity may be the parent company of another, and thus there is a relationship between them.
Data elements are generally supporting players to the data itself. The two main things here are the inheritance relationship between elements (the data version of object 'animal', animal with four legs vs two legs...), and the collections or sets of allowable values. For the latter, think of buying something online and having to choose the shipping country from a restricted list.
Dependencies are, as you would expect, rules that govern the data. In other words, depending on the state or value of the data, a particular rule may apply. It would be reasonable to say a customer should not be allowed to overdraw their checking account unless the data showed an agreement in place to allow this.
Through coding we are also used to testing user data input in data-entry forms for validity, so we're aware of this one. We check for the data being complete, correct, accurate, etc.
Those are all useful and, mostly, depend on how the particular business you are developing for operates. What we are now going to look at are what are called the basic quality checks. To bring your system to the next level and make it really robust, you could consider building these kinds of checks into your system whenever data is changed or ingested. While you can get very detailed and domain-specific with the following, in general it's possible to be quite generic about data at this level and combine these rules and checks to dramatically improve the quality of your data. The bottom line is we are seeking to ensure our data is in a clean state before allowing it to proceed into production or analysis.
Top Basic Checks for Data Quality / Data Cleansing
Value distribution
One of the first things we do is check whether our data values are distributed in what, given the domain, is a reasonable manner. Distributions refer to where our data values lie within a range. If we were considering, say, ATM card withdrawals, we might expect that 70% of withdrawals would be in the range $10-100, 20% in the range $100-250, and 10% greater than this (that's the time you bought your mammy a birthday present!). In this check, you either put in expected ranges beforehand, or you specify a sliding window of value distribution. By this I mean that you expect your data to cluster into a certain number of ranges, each having a particular relationship to the other.
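As a rough sketch of what that might look like in code - the bucket boundaries and percentages below are just the ATM example figures, and the 5% tolerance is an arbitrary choice:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: bucket withdrawals into ranges and compare the observed share of each bucket
// against what we expect, within a tolerance
public static bool DistributionLooksReasonable(IEnumerable<decimal> withdrawals)
{
    var values = withdrawals.ToList();
    if (values.Count == 0) return false;

    double pctLow  = values.Count(v => v >= 10 && v < 100)  / (double)values.Count; // expect ~70%
    double pctMid  = values.Count(v => v >= 100 && v < 250) / (double)values.Count; // expect ~20%
    double pctHigh = values.Count(v => v >= 250)            / (double)values.Count; // expect ~10%

    const double tolerance = 0.05; // allow each bucket to drift by 5 percentage points

    return Math.Abs(pctLow - 0.70) <= tolerance
        && Math.Abs(pctMid - 0.20) <= tolerance
        && Math.Abs(pctHigh - 0.10) <= tolerance;
}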
Unexpected values
Here we are trying to identify data that is not as expected. To do this, we need foreknowledge of what we expect and build this into our observations of the data. Consider for example that we know we are going to get some unexpected data, but we only allow a particular threshold of it. In this case we could say that we are happy to allow a 3% discrepancy from what we expect, but anything further raises an alert. Looking for both clumps of data (e.g. 14% of the data is in a particular range) and gaps in data (a distribution *not* what was expected) are good flags to work on.
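A sketch of that threshold idea - the predicate decides what counts as 'expected', and anything over the allowed discrepancy raises the alert (the method name and the 3% default are mine, purely for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: flag a batch when more than a threshold (3% by default) of its values
// fall outside what we consider 'expected' for the field
public static bool TooManyUnexpected<T>(IEnumerable<T> values, Func<T, bool> isExpected,
                                        double allowedDiscrepancy = 0.03)
{
    var list = values.ToList();
    if (list.Count == 0) return false;

    double unexpectedShare = list.Count(v => !isExpected(v)) / (double)list.Count;
    return unexpectedShare > allowedDiscrepancy;   // true => raise an alert
}

// e.g. alert if more than 3% of recorded ages fall outside 0..120:
// bool alert = TooManyUnexpected(ages, a => a >= 0 && a <= 120);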
It's impossible!
There are rules you can put in place that will save a lot of 'egg on face' moments and deal with data that, for the system in question, should not be possible. These might include a person's date of birth being greater than today, or a discount value of greater than 100%.
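For example, a couple of 'should never happen' checks might look like this (illustrative only):

using System;

// Sketch: impossible-value checks
public static bool IsImpossibleDateOfBirth(DateTime dateOfBirth)
{
    return dateOfBirth.Date > DateTime.Today;             // born in the future?
}

public static bool IsImpossibleDiscount(decimal discountPercent)
{
    return discountPercent < 0 || discountPercent > 100;  // paying the customer to buy?
}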
Type check
This one is easy - here we monitor the data types we expect versus what we are presented with: for example, we are expecting a numeric value and instead the data is text, or a date.
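In C#, the TryParse family does the heavy lifting here - a minimal sketch:

using System;

// Sketch: check the raw text actually parses as the type we expect before accepting it
public static bool IsNumeric(string raw)
{
    decimal parsed;
    return decimal.TryParse(raw, out parsed);
}

public static bool IsDate(string raw)
{
    DateTime parsed;
    return DateTime.TryParse(raw, out parsed);
}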
The mis-used, mis-informed and mis-understood
Data can come to us in strange ways. I was once told 'never underestimate the ability of a user to (ab)use a system to suit their particular needs'. I have seen, for example, a common misuse of a particular field, 'customer name', where things like 'no credit, owes money' were input beside the customer's name... (and yes, that one did get them into trouble when someone sent out a mass mail shot!). Here you need to check for things out of the ordinary, in the context of the meaning of the data. Things to look at would be what 'NA', 'n/a' or 'TBC' mean, and whether a '%', a semi-colon ';' or an asterisk '*' means something and should be present (or not).
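A simple sketch of this kind of 'suspicious value' check - the token and character lists are purely illustrative and would be tuned to the domain:

using System;
using System.Linq;

public static class PlaceholderCheck
{
    // Values that usually mean 'someone bent the field to their own purposes'
    static readonly string[] SuspectTokens = { "NA", "N/A", "TBC", "TBD" };
    static readonly char[] SuspectChars = { '%', ';', '*' };

    public static bool LooksSuspicious(string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return true;

        var trimmed = value.Trim();
        return SuspectTokens.Any(t => trimmed.Equals(t, StringComparison.OrdinalIgnoreCase))
            || trimmed.IndexOfAny(SuspectChars) >= 0;
    }
}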
Out of scope/range
There are many circumstances where we expect data to be in a certain range: temperature in Celsius perhaps from -50 (brrr!) to +50 (OMG... where's the air-con!), or the number of office floors in a building, from 1..163 (yes, really... check it out). If the value is outside this scope, it fails the check.
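The check itself is trivial - the interesting part is deciding the limits, which come from the domain:

// Sketch: simple range/scope check
public static bool IsInRange(double value, double min, double max)
{
    return value >= min && value <= max;
}

// e.g. IsInRange(temperatureCelsius, -50, 50);  IsInRange(floorNumber, 1, 163);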
Outlier values
Very similar to out of scope, outliers can be within scope but a long way from the median, or outside the standard deviation for the range being tested. They can also, of course, simply be completely outside the range/scope.
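One simple way to flag these is a crude z-score test - anything more than a few standard deviations from the mean gets reported. This is just a sketch (real outlier detection can get far more sophisticated), and the maxZ default of 3 is an arbitrary choice:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: report values more than maxZ standard deviations away from the mean
public static List<double> FindOutliers(IList<double> values, double maxZ = 3.0)
{
    if (values.Count < 2) return new List<double>();

    double mean = values.Average();
    double stdDev = Math.Sqrt(values.Sum(v => (v - mean) * (v - mean)) / (values.Count - 1));
    if (stdDev == 0) return new List<double>();

    return values.Where(v => Math.Abs(v - mean) / stdDev > maxZ).ToList();
}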
Gone missing
Many times we expect certain values to be present in a dataset, and they are actually conspicuous by their absence. This type of check can be useful to tell us what *didn't* happen when we were expecting some values to follow a trend, for example. It may also be the case, in data of mixed type, that one part of the mix is missing. Let's say we were expecting a currency symbol and a decimal number to indicate a unit price... we expected this '$10', but got this instead: '10'.
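For the '$10' versus '10' case, a regular expression is a quick way to assert that both parts of the mixed value are actually present (the pattern and names below are just for this illustration):

using System.Text.RegularExpressions;

public static class UnitPriceCheck
{
    // We expect a currency symbol plus a decimal amount, e.g. "$10" or "$10.50";
    // a bare "10" means part of the value has gone missing
    static readonly Regex CurrencyPattern = new Regex(@"^\$\d+(\.\d{1,2})?$", RegexOptions.Compiled);

    public static bool HasExpectedCurrencyFormat(string value)
    {
        return !string.IsNullOrWhiteSpace(value) && CurrencyPattern.IsMatch(value);
    }
}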
Invalid logic
This one is generally domain-specific, but can be common sense as well. For example, in a survey we generally wouldn't expect to get a response rate over 100%, and seeing a duplicate customer record (it's amazing how these things seem to slip in!) would also raise a flag.
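Two quick illustrative checks along those lines:

using System.Collections.Generic;
using System.Linq;

// Sketch: 'common sense' logic checks
public static bool ResponseRateIsSuspicious(int responses, int surveysSent)
{
    return surveysSent > 0 && responses > surveysSent;   // more than a 100% response rate
}

public static IEnumerable<string> FindDuplicateCustomers(IEnumerable<string> customerKeys)
{
    return customerKeys.GroupBy(k => k)
                       .Where(g => g.Count() > 1)
                       .Select(g => g.Key);              // keys that appear more than once
}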
Spell check
Really! ... it's amazing the kind of stuff that can slip through. While it's good to do a standard spell check, I am always interested in doing one against terms that are specific to the domain, and would cause problems for data integration and mapping at a later stage.
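A domain-specific 'spell check' can be as simple as comparing free text against a list of known good terms - the dictionary contents below are obviously just placeholders:

using System;
using System.Collections.Generic;
using System.Linq;

public static class DomainSpellCheck
{
    // The terms that matter for later integration and mapping (illustrative entries only)
    static readonly HashSet<string> KnownTerms = new HashSet<string>(
        new[] { "Widget", "Gadget", "Sprocket" }, StringComparer.OrdinalIgnoreCase);

    // Returns the words that aren't in the domain dictionary, for review
    public static IEnumerable<string> UnknownTerms(string text)
    {
        return text.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries)
                   .Where(word => !KnownTerms.Contains(word));
    }
}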
Automating data quality
So, now we have a bit of an understanding of the general concepts (and, as always, it's only the tip of the veritable iceberg!)... let's look at how we might approach automating the monitoring of quality.
There are two things to consider when thinking about automation of data quality monitoring. The first is the identification of the issue, and the second is the action to take on problems identified.
Identifying the issue in an automagic way...
One of the things we don't really want to do is put together a solution that requires a lot of maintenance. It's important, as always, to make things as robust and future-proof as possible (but at the same time, not going overboard!). One way of approaching the problem is to put together a series of IF/THEN statements. Used in isolation, this can get messy quickly, especially when rules start to get added and specific edge cases start to emerge.
IF rule says 'value must be string'
    THEN TryParse for string...
    On Fail .. Do X
A tidier approach is to use expression trees. The actual approach you take of course depends on your own particular requirements. Here is one method of putting together a basic rule engine (based on code from Stack Overflow).
An expression tree allows runtime evaluation of an expression. This means we don't need to know what we are going to ask before we compile our code (useful). Instead, we can allow a user, for example, to generate a set of rules, store them in a database table, and then run them against some data when required. The concept is simple. We use the expression tree to create a function at runtime, which we then run against our input data, giving us a 'go/no-go'. This means we can hand off the decision of 'what to ask' to a user, and it leaves us free to potter along and do other coding-type tasks without having to worry about constant tweaks to QA code.
Let's look at an example of how this works with some basic evaluations:
First, we set up a simple user class:
public class User
{
public int Age { get; set; }
public string UserName { get; set; }
public string SkillLevel { get; set; }
}
Add some sample data:
var user1 = new User
{
Age = 13,
UserName = "Jacques",
SkillLevel = "1"
};
var user2 = new User
{
Age = 33,
UserName = "Fred",
SkillLevel = "2"
};
var user3 = new User
{
Age = 65,
UserName = "Prakesh",
SkillLevel = "1"
};
var user4 = new User
{
Age = 34,
UserName = "Mary",
SkillLevel = "ONE"
};
Now we will create a class to store our quality RULES:
public class Rule
{
public string RuleName { get; set; }
public string FieldMatch { get; set; }
public string Operator { get; set; }
public string Value { get; set; }
}
As you can see above, we can give the rule a name (e.g. 'Age must be > 15'), a FieldMatch (this maps to the field/member name of the record/object we are evaluating, for example 'Age'), an Operator (greater than, equal to...), and a Value ('15'). Here are some examples of rules:
| Rule name | Field name match | Operator | Value |
| --- | --- | --- | --- |
| Legal age to drive | Age | GreaterThan | 15 |
| Qualification to fly | Profession | Equal | Pilot |
| Minimum bank balance | Balance | GreaterThan | 100 |
In order to create our rules, we need some way of telling the compiler what our operators are and how they are to be used. In this example, the way it's done is to use the existing operators from the LINQ ExpressionType enumeration. We're going to keep this example simple and create two rules.
Rule 1 - Age Limit.
In this rule we are using the 'GreaterThan' enum to say there is an age limit of 20 for this QA test.
var rule1 = new Rule();
rule1.RuleName = "Age limit";
rule1.FieldMatch = "Age";
rule1.Operator = ExpressionType.GreaterThan.ToString();
rule1.Value = "20";
Rule 2 - Test for name value.
Here we are only allowing through people with the name of "Prakesh" (lucky chap!).
var rule2 = new Rule();
rule2.RuleName = "Name match";
rule2.FieldMatch = "UserName";
rule2.Operator = ExpressionType.Equal.ToString();
rule2.Value = "Prakesh";
Before we move on to show how this thing works, let's look at the implementation of a rule. What we do is this... first, we send our rule in to be 'compiled' - this gives us back, of all things (brace yourself), a function()... yep!... it's like we are wrangling with JavaScript or F#! Once we have the compiled function back, we can then use it by sending in data to be evaluated against that new compiled rule.
See here - we send in 'rule1' (our 'Age greater than' rule), and it gives us back a function 'compiledRule_Age' that contains that rule, ready to evaluate against data. We then run it for 'user1' and 'user2':
Func<User, bool> compiledRule_Age = CompileRule<User>(rule1);
var ageResult1 = compiledRule_Age(user1);
var ageResult2 = compiledRule_Age(user2);
The rule says 'pass, where field "Age" has a value GreaterThan "20"'.
As user1.Age = 13, the rule fails.
As user2.Age = 33, the rule passes.
It makes sense. Let's look at how 'CompileRule' actually works.
public static Func<T, bool> CompileRule<T>(Rule r)
{
    // the parameter the compiled rule will operate on (an instance of T, e.g. User)
    var param = Expression.Parameter(typeof(T));
    // build the 'left operator right' body of the rule
    Expression expr = BuildExpr<T>(r, param);
    // wrap it in a lambda T -> bool and compile it into a callable function
    return Expression.Lambda<Func<T, bool>>(expr, param).Compile();
}
What's happening is that the method takes in a rule and builds a lambda function that maps 'User' -> 'bool', then compiles it. The expression itself is built using a separate function that uses reflection:
static Expression BuildExpr<T>(Rule rule, ParameterExpression param)
{
    // left side: the property on T named by the rule (e.g. 'Age')
    var left = Expression.Property(param, rule.FieldMatch);
    // the type of that property, so the rule value can be converted to match it
    var tProp = typeof(T).GetProperty(rule.FieldMatch).PropertyType;
    // parse the operator string (e.g. "GreaterThan") into an ExpressionType enum value
    ExpressionType tBinary;
    ExpressionType.TryParse(rule.Operator, out tBinary);
    // right side: the rule's value, converted to the property's type
    var right = Expression.Constant(Convert.ChangeType(rule.Value, tProp));
    // combine into 'left <operator> right'
    return Expression.MakeBinary(tBinary, left, right);
}
There are three parts to the expression we are building: left eval-operator right
- We define the left side of the equation/comparison as being a param that maps to the rule's 'FieldMatch' (e.g. 'Age').
- The operator used to evaluate the equation/rule is extracted from rule.Operator, which is parsed to an ExpressionType enum.
- Finally, the right side is the value we are considering in the equation (which we convert to match the property type of the rule.FieldMatch we use on the left side).
Now that everything is lined up, we can compare apples for apples, and the method returns an Expression, which can be compiled and used.
This is a very basic introduction - I have attached an example project you can download to check it out running.
For further information on building a rules engine for your own particular requirements, have a look at these fine examples:
Building a rule engine in C# (a biggie/series, but well worth a read)
Building a MicroRuleEngine using LinqExpressions (tidy, does the job)
How to implement a rule engine in C# (make sure you read the comments, good info in there)
Taking auto-action
Once we have identified problems in our data, we either reject the data or attempt to fix it. If we decide to fix the data, we may have some hard-coded solutions; here are some examples:
| Problem | Solution |
| --- | --- |
| 'N/A' found in post-code | Replace with String.Empty |
| 'O' (uppercase letter 'o') found where there should be a numeric zero | Replace with the number zero |
| '\n' line feed found in line | Replace with null |
That will work well. There's a lot of things we can think of upfront (like the above) that we can code in. There will, however, be other things that only get noticed as data starts to flow over time. For these situations I have found an auto-mapping/replace solution can work well. This is where you identify a problem and refer it to a user for a resolution. If the user determines they can create a rule for the problem, they can then create a mapping from the rule to a resolution. On a simple level this might be:
| Problem | Solution |
| --- | --- |
| Find ALAN as surname | Replace with ALLAN |
| '\n' line feed found in line | Replace with <br/> |
A very useful tool you can fall back on is Regular Expressions. They are an incredibly flexible way to both detect and fix problems with data. If you aren't comfortable with RegEx, well, here's a good excuse to dig into them a bit more!
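Here is a sketch of how the mapping idea and RegEx can work together. In a real system the pattern/replacement pairs would live in a table the user can edit; they are hard-coded here (and the patterns themselves are just illustrations) to keep the example short:

using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AutoFixer
{
    // 'Problem pattern' -> 'replacement' mappings, built up over time as issues are spotted
    static readonly Dictionary<string, string> Fixes = new Dictionary<string, string>
    {
        { @"\bN/?A\b", "" },         // 'N/A' or 'NA' found in a post-code -> empty string
        { @"\bALAN\b", "ALLAN" },    // known misspelled surname
        { "\n",        "<br/>" }     // line feed found in line -> HTML break
    };

    public static string Apply(string input)
    {
        foreach (var fix in Fixes)
        {
            input = Regex.Replace(input, fix.Key, fix.Value, RegexOptions.IgnoreCase);
        }
        return input;
    }
}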
Some useful resources for learning RegEx:
Regex Storm - a .net focused online playground/tool for testing and learning RegEx patterns
RegexR - comprehensive learning playground - quite visual
Wrap-up
Hopefully we have learned a bit about Data Quality, and how putting some checks and balances in your solution can help improve not only your data, but also your own stress levels :)
History
Version 1 - 21 Dec 2016