Measuring the Immeasurable
In this rough-and-tumble world of software, have you ever stopped to consider what makes for a memorable dining experience? Yes, I do mean dining experience. What could possibly be the connection between software and cuisine, you ask? Well, let us see.
What is it that makes a meal pleasurable and memorable? Perhaps the chef is a genius, with a retinue of brilliant young culinary disciples, all feverishly striving to create and execute the most ingenious recipes under the most unyielding standards. Suppose the kitchen is a gleaming tribute to the modern age. State-of-the-art appliances along with all the finest cookware, in the right hands, would surely improve your chances of a great meal. Maybe the dining room is special, with enough charm and ambiance to make you want to move into the restaurant. Is there prompt and courteous service? That is always important. The best ingredients are certainly necessary. All of these things and more go into making an evening out at your favorite restaurant one to remember. Moreover, if your expectations as a customer are not met, the pleasure of your experience is that much lessened and you will not be as likely to return.
I believe something like this is also true with software. In the best of worlds, people do not simply use software, they experience it.
We in software equate a sense of quality with the number and severity of the defects we find. The more, and more severe, defects we find and fix, the more confident we are about the quality of our software. This is a reasonable thing to believe, as shipping software with many severe defects will almost certainly lower our sales. So we measure the number and severity of defects throughout the lifecycles of our projects, with the confidence that the lower these two things are, the higher the quality of our software. But this confidence supposes that we understand what our customers want and have completely and thoroughly tested our software against this understanding. This understanding is defined by the requirements we establish. Supposedly, requirements come from a mature, thorough knowledge of our customers' expectations, and even the expectations of our potential customers. In reality, requirements are all too often established in partial or complete isolation from customers, or are in some way incomplete or too shallow, or are even disregarded as a necessary evil that must be gotten out of the way before the real work can commence.
To continue our culinary analogy, brilliant chefs of the software world all too often produce complex, elegant dishes that taste wonderful, but are served to people who were expecting something a little different. Maybe there are side dishes that they wanted that they did not get, or flavors that ruin an otherwise great dish. Sometimes the chef will add something to a dish that, while very clever, is completely unnecessary and, worse, interferes with the rest of the meal. Even if these problems were not so common, and software requirements were more often than not written thoroughly and completely with the participation of many customers and potential customers, and then followed with reverence, there would still be something missing. Just as a great dining experience is about more than a well-written recipe, so too is software about more than well-established requirements. Requirements are important, but they only give us part of the picture. If we want our customers to do more than use our software, if we want them to experience it, and if we understand that for this to happen we need more than a well-written and well-followed recipe, then we must conclude that there is more to software quality than counting defects. Any one of us might be able to discern a great omelet from a bad one, but that knowledge alone will not tell you if the restaurant that produced it will succeed.
Suppose two separate pieces of software are created by two different software teams for the same set of purposes and for the same set of customers. Now imagine that both pieces of software are adequately tested against the same, thoroughly defined requirements with equal rigor and no defects are found in either. Further, imagine that the customers are free to choose which piece of software they want to use and that both titles are made freely available and equally accessible to all customers. Suppose that, over some time, a majority of customers are found to use one piece of software more than the other. We are not interested in the case where both pieces of software have an exactly equal number of supporters, as we want to produce software that is more desirable than our competitors'. What is wrong with the less popular piece of software?
There can be only two conclusions:
- The two pieces of software are identical, so it is mere chance that one is chosen over the other, or
- some or all of the differences between the two software titles cause the majority of users to choose one over the other.
The two software titles cannot be identical, despite being created from the exact same set of requirements. If they were, it would mean that the requirements from which both were built had been communicated so thoroughly and clearly as to leave the two different software teams no alternative but to construct identical products. This is not possible for software of any real complexity. Interpretation will always affect the outcome, and not always for the worse. There must be some difference or differences that will tend to win more people over, and these differences are not captured by the requirements, nor are they always represented by defect counts or severities. This conclusion only becomes stronger when we lessen the unrealistic restrictions of our thought experiment; it becomes even truer in the real world.
The thought experiment is an extreme example. But it is a simplified view of what every software company faces with their competitors. What is it about a software experience that makes customers choose one title over the other? It is unsettling to think that defect counts and well-followed requirements might not be enough to help us find out. Where do we go from here?
As engineers and technically-minded individuals, through both training and experience, we know that we cannot adequately react to a situation without measured quantities. This is why we cling so tightly to defect counts, and not without good reason. It is a measured value — feedback that rises and falls, guiding us and showing us where to correct our course. If we accept that even with well-established written requirements, adequately addressing the defects found against those requirements will not, alone, assure the success of our software, then what are we to do? By what measurement must we be guided if defects and requirements alone are not enough? To truly understand how to make great software, we must learn how to measure, and take seriously, whole software experiences.
Software experiences, like dining experiences, are very subjective. While it is true that no single piece of software will appeal to everyone, there are many instances where some software is preferred by the overwhelming majority, and for different reasons. It is this phenomenon that is so hard to capture with mere requirements. For the most successful software, most customers have positive experiences, but they do not always have exactly the same experiences. As producers of great software, how are we to measure the experiences our customers have with our software and use those measurements to change our software in positive ways when software experiences are, themselves, so subjective?
In the dining rooms of many restaurants, often even on the tables themselves, customers may find small survey cards which invite them to evaluate their dining experience. These cards are usually filled out by the customer and handed over to a member of the wait staff or placed in some designated receptacle. For large, chain restaurants, dining experience evaluation often takes the form of an invitation to participate in an on-line or telephone survey. Ideally, this dining experience information is carefully scrutinized by restaurant management in an ongoing effort to improve the dining experience for everyone. The restaurant wants to keep its customers and acquire more customers. It is simple business. However, they know that your satisfaction as a customer is about more than just your food, and this understanding is always reflected in the kinds of questions you are asked: the promptness of your service, the price of the meal, the cleanliness of the dining room, etc. In short, your whole experience with the restaurant. For each item under evaluation, you are very often given a set of values to correlate to your level of satisfaction in that area. Do you not think that this information is counted, sorted, correlated, i.e., measured? It is, in fact, an excellent way of measuring large collections of subjective experiences and turning them into actionable items for improvement.
I know it sounds simple, almost simplistic, but why do software companies not do the same? If a carefully and thoroughly constructed survey about experience with your software were established and filled out by a number of your customers, would that not help you on your way to making great software, beyond mere requirements or defects alone?
Before you respond, consider the following. In academia, this same kind of technique has been used and respected for years as a means of turning subjective evaluations into actionable measured quantities and of establishing standardized evaluations of student coursework from kindergarten through graduate school. What we have so far identified as a kind of experience survey is actually known within academia as a "rubric." Rubrics are used to evaluate papers, projects and presentations in fair and transparent ways. Students can use rubrics to find out exactly where they need to improve, and to what extent. Teachers use rubrics to make certain that, for assignments requiring subjective evaluation, everyone's work is measured with the same standard and in the same way. It is a widely accepted method of turning subjective evaluations into fair, actionable measurements.
What does a rubric look like? It can take many forms, but the most common consists of a grid, where the rows represent the areas under evaluation and the columns represent the possible scores for each area. The individual scores for each area can be summed or averaged, and even weighted in whatever way the evaluator thinks is important, depending upon how the rubric is established. This same kind of thing could be used in software. In fact, it already has.
The IBM Rational Performance Tester testing team collaborated with their development team to establish a rubric that measures the experiences they and their customers have with IBM Rational Performance Tester.
Figure 1: Example of a rubric
Each area of interest is defined in its own row of the far-left column. The corresponding experience ratings for each item extend to the right, from most positive to least positive. A numerical value is associated with each rating and is weighted by the importance of its associated item. The weighted values for all items are then averaged to calculate an overall score.
Figure 2: Close up of a quality element of the rubric
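To make the scoring concrete, here is a minimal sketch, in Python, of how such a weighted rubric score might be computed. The area names, weights, and rating labels are invented for illustration only; they are not the contents of the actual Rational Performance Tester rubric.

```python
# A minimal sketch of a weighted rubric score calculation.
# The areas, weights, and rating scale below are hypothetical examples,
# not the actual IBM Rational Performance Tester rubric.

# Each area of evaluation carries a weight reflecting its importance.
RUBRIC_AREAS = {
    "installation": 1.0,
    "responsiveness": 2.0,
    "documentation": 0.5,
    "error messages": 1.5,
}

# Ratings run from most positive (4) to least positive (1).
RATING_SCALE = {"delighted": 4, "satisfied": 3, "tolerable": 2, "frustrated": 1}


def score_rubric(ratings):
    """Compute a weighted average score from one filled-out rubric.

    `ratings` maps each area name to one of the labels in RATING_SCALE.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for area, weight in RUBRIC_AREAS.items():
        value = RATING_SCALE[ratings[area]]
        weighted_total += value * weight
        weight_sum += weight
    return weighted_total / weight_sum


if __name__ == "__main__":
    one_experience = {
        "installation": "satisfied",
        "responsiveness": "frustrated",
        "documentation": "delighted",
        "error messages": "tolerable",
    }
    print(f"Overall experience score: {score_rubric(one_experience):.2f}")
```

Weighting lets a team emphasize the areas it cares most about without changing the rating scale itself.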
The Rational Performance Tester testing team produced a Web tool that aids in the collection of this subjective data. Any tester, developer or visiting customer can access the Web page and fill out a rubric according to their experience with the product at that time. Information about the date, time and nature of the experience can also be recorded within the rubric, as well as any comments the user might have for each item. As a rule, each tester fills out a rubric at the completion of a test scenario, in the context of their most recent experience with the product.
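For illustration only, a filled-out rubric collected by such a tool might be stored as a record along these lines; the field names and values are assumptions made for the sake of the sketch, not the actual schema of the team's Web tool.

```python
# A hypothetical record for one filled-out rubric, as such a Web tool
# might store it; the field names and values are illustrative only.
from datetime import datetime

filled_out_rubric = {
    "submitted": datetime(2008, 3, 14, 16, 30),
    "role": "tester",
    "scenario": "schedule a large playback run and review the report",
    "ratings": {
        "installation": "satisfied",
        "responsiveness": "frustrated",
        "documentation": "delighted",
        "error messages": "tolerable",
    },
    "comments": {
        "responsiveness": "Report generation felt sluggish on a large run.",
    },
}
```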
Figure 3: Comparison graph generated from rubric data
At any time, the collection of existing, filled-out rubrics can be summarized and constitutes an overall measurement of the experience the testing team, and others, have had with the product within a certain time frame. Any single filled-out rubric is not as valuable as the larger collection of them taken together. The data can be presented in such a way as to show areas of the product experience that require improvement or are exceeding expectations. In this way, entire collections of software experiences, as well as requirements and defect counts, can be used to improve the overall quality of the product.
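As a sketch of how such a summary might be produced, assuming records shaped like the hypothetical one above, the following averages each area's ratings over a chosen time frame and flags areas that fall below an arbitrarily chosen threshold.

```python
from datetime import datetime

# The same illustrative rating scale used in the earlier sketches.
RATING_SCALE = {"delighted": 4, "satisfied": 3, "tolerable": 2, "frustrated": 1}


def summarize(rubrics, start, end, needs_attention_below=2.5):
    """Average each area's rating across the rubrics submitted within
    [start, end], and flag areas that fall below a chosen threshold."""
    totals, counts = {}, {}
    for rubric in rubrics:
        if not (start <= rubric["submitted"] <= end):
            continue
        for area, label in rubric["ratings"].items():
            totals[area] = totals.get(area, 0) + RATING_SCALE[label]
            counts[area] = counts.get(area, 0) + 1

    averages = {area: totals[area] / counts[area] for area in totals}
    flagged = sorted(a for a, avg in averages.items() if avg < needs_attention_below)
    return averages, flagged


if __name__ == "__main__":
    collected = [
        {"submitted": datetime(2008, 3, 14, 16, 30),
         "ratings": {"responsiveness": "frustrated", "documentation": "delighted"}},
        {"submitted": datetime(2008, 3, 17, 10, 5),
         "ratings": {"responsiveness": "tolerable", "documentation": "satisfied"}},
    ]
    averages, flagged = summarize(collected, datetime(2008, 3, 1), datetime(2008, 3, 31))
    print(averages)
    print("Needs attention:", flagged)
```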
When testers experience issues with the product that are not quite explicit violations of written requirements, or that are in other ways difficult to articulate in simple defect reports, but that nonetheless produce negative experiences with the product, rubric evaluations become an ideal way to capture that otherwise vague and subjective information in a form that can still be acted upon.
There is an added benefit to creating a rubric that goes beyond capturing subjective data. The process the IBM Rational Performance Tester team underwent to produce each version of their rubric was a collaborative one. Testers and developers came together to discuss not just specific requirements, but full expectations about the customer experience. A much clearer and fuller understanding developed of what both developers and testers wanted and expected for the customer experience: not just the technical details of particular features, but real experiences. This connection and mutual understanding was, by itself, worth the effort of producing the rubric, even if the finished version were never filled out by a single person.
I encourage you to work with your team to produce a rubric. If you can, get input from a wide range of stakeholders, especially customers. Your written requirements are a good place to start, but try to think beyond them to the whole customer experience. So start using rubrics and find out what your customers are hungry for: learn more about the whole experience and don't just concentrate on the recipe. Use rubrics to measure the immeasurable. Don't just taste the food you're cooking; take a seat in the dining room and experience it.
Further Reading
Though the vast majority of books, papers and articles I have found deal with the use of rubrics only in an academic setting, the following may provide interesting reading on the topic of rubrics and rubric-based evaluations.
- Goodrich Andrade, Heidi. Understanding Rubrics [http://learnweb.harvard.edu/ALPS/thinking/docs/rubricar.htm].
- Moskal, Barbara M. Scoring Rubrics Part I: What and When [http://www.ericdigests.org/2001-2/scoring.html].
- Wikipedia article on academic rubrics [http://en.wikipedia.org/wiki/Rubric_%28academic%29].