Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database

Cross-Tabulation (Pivot Tables) with MySQL

0.00/5 (No votes)
10 Apr 2012CPOL4 min read 66.3K  
Example of using code-based query generation, transposition, cross-tabulation, and normalized data to produce a query result that can feed a simple chart from data which was not originally designed with reporting or charting in mind.

Overview

So in my years being trained on data architecture and database design, I learned to “normalize, normalize, normalize”. Unfortunately, normalization doesn’t lend itself to very simple statistical calculations–such as those used to generate graphs or a statistical scorecard.

My Problem

In the case of Angel, I’ve normalized the hell out of the data. Now it comes down to needing to report on that data. At first, my reporting attempts were terrible–tons of nested loops, many queries, etc. Then after I spoke to a colleague of mine at work, he said "If I was running on an SAS Mainframe, the work would already be done for me using a function called “PROC TABULATE”." He sort of dated himself by saying that, but I won't hold it against him. :)

Then I broke the news to him: I’m not using SAS, or any other mainframe technology. He replied by suggesting I look into some sort of MySQL equivalent to that function. So I started hunting and eventually came across this very nice article which describes how to create a nice “statistical” report using cross-tabulation techniques with normalized source data. The queries are sort of monstrous if you have a large number of fields on which to report, but at that point you’d be using code to generate the parameters for the SELECT statement.

The Data

My source data looks like the following--please note that some results are truncated for brevity:

mysql> select uid, Email from players;
+-----+--------------------------+
| uid | Email                    |
+-----+--------------------------+
|  10 | brian@venomgamelabs.com  |
|   4 | cale@caledunlap.com      |
|   1 | cale@venomgamelabs.com   |
|   9 | daniel@venomgamelabs.com |
|  11 | dave@venomgamelabs.com   |
|   5 | jane@venomgamelabs.com   |
|   8 | janes@venomgamelabs.com  |
|  12 | jill@venomgamelabs.com   |
|   6 | john@venomgamelabs.com   |
|   7 | johns@venomgamelabs.com  |
+-----+--------------------------+
mysql> select * from gamedata LIMIT 10;
+----------+---------+------+
| playerid | fieldid | data |
+----------+---------+------+
|        1 |       3 | 60   |
|        1 |       4 | 1    |
|        1 |       5 | 1    |
|        1 |       6 | 286  |
|        1 |       7 | 1    |
|        1 |       8 | 13   |
|        1 |       9 | 1    |
|        1 |      10 | 1    |
|        1 |      11 | 261  |
|        1 |      12 | 13   |
+----------+---------+------+

mysql> select FieldID, FriendlyName, GameID, FieldTypeID FROM gamefields;
+---------+---------------------------------+--------+-------------+
| FieldID | FriendlyName                    | GameID | FieldTypeID |
+---------+---------------------------------+--------+-------------+
|       1 | Play Time                       |      1 |           2 |
|       2 | Last Character                  |      1 |           1 |
|       3 | Total Eggs                      |      4 |           2 |
|       4 | Total Playtime                  |      4 |           2 |
|       5 | Total King Of The Hill Captures |      4 |           2 |
|       6 | Total Kills                     |      4 |           2 |
|       7 | Total Deaths                    |      4 |           2 |
|       8 | Engineer Playtime               |      4 |           2 |
|      10 | Mongo Playtime                  |      4 |           2 |
|      11 | Grunt Playtime                  |      4 |           2 |
|      12 | Gladiator Playtime              |      4 |           2 |
|      13 | Total Powerup Uses              |      4 |           2 |
|      14 | Total Powerups Received         |      4 |           2 |
+---------+---------------------------------+--------+-------------+
mysql> select typeid, typename from field_types;
+--------+------------------+
| typeid | typename         |
+--------+------------------+
|      1 | Text             |
|      2 | Numeric          |
|      3 | Game Coordinates |
|      4 | Boolean          |
|      5 | Physical Address |
|      6 | IP Address       |
+--------+------------------+

Desired Result & My Solution

What I want to know from that data is how many players have field values within specific ranges–to generate some input data for a chart output. The code I’m using does the following math (from a separate query):

  1. Find the minimum value out of all of the player data for each field (SQL Query)
  2. Find the maximum value out of all of the player data for each field (Same SQL query)
  3. Get a “standard deviation” between the number $StandardDeviation = floor( ($High - $Low)/NUM_DEVIATION_RANGES );
  4. Begins a SQL query and assembles a list of parameters to select (based on the deviation ranges)
  5. Runs the query, obtains a single result per field and feeds it to the chart to render a per-field graphical analysis chart.

Query String Generation

I used PHP to generate the query described above in step 4, but here is some pseudo code (though similar to PHP) which can be used to model a function or subroutine that generates the query:

 for $i = 0 to NUM_DEVIATION_RANGES
	$Min = $Low + ($StandardDeviation * $i)
	$Max = ($Min + $StandardDeviation) – 1;
	$Query .= "SUM( IF( CAST(gamedata.Data AS UNSIGNED) >= $Min _
              AND CAST(gamedata.Data AS UNSIGNED) <= $Max, 1, 0) ) AS ‘$Min-$Max’,";
end for 

Keep in mind that the code above is only used to generate the SELECT clause of the query string and not the whole thing.

Generated Query String

The result of executing the code above will produce a query string like the one below. This query string is then passed to the database in order to fetch the cross-tabulated result.

SQL
SELECT
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 60 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 87,1,0) ) 
        AS '60-87',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 88 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 115,1,0) ) 
        AS '88-115',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 116 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 143,1,0) ) 
        AS '116-143',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 144 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 171,1,0) ) 
        AS '144-171',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 172 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 199,1,0) ) 
        AS '172-199',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 200 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 227,1,0) ) 
        AS '200-227',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 228 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 255,1,0) ) 
        AS '228-255',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 256 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 283,1,0) ) 
        AS '256-283',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 284 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 311,1,0) ) 
        AS '284-311',
    SUM( IF( 
        CAST(gamedata.Data AS UNSIGNED) >= 312 AND 
        CAST(gamedata.Data AS UNSIGNED) <= 339,1,0) ) 
        AS '312-339'
FROM gamedata, gamefields, field_types
WHERE gamedata.fieldid = gamefields.FieldID
    AND gamefields.FieldTypeID = field_types.typeid
    AND field_types.typename = 'Numeric'
    AND gamefields.FriendlyName = 'Total Eggs' 

Final Result

Now that query runs once per number of fields requested. The above query was just a single query for a single field. Note how the code generated the SELECT statement parameters (min and max) and set them as the column header in the result?

The result of that query looks like this:

+-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+
| 60-87 | 88-115 | 116-143 | 144-171 | 172-199 | 200-227 | 228-255 | 256-283 | 284-311 | 312-339 |
+-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+
|     2	|      0 |	 2 |	   0 |	     1 |      0	 |	2  |	   0 |	     1 |       0 |
+-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+

So the output is exactly what I need for my chart. The result fields represent the X axis of the chart, and the result values represent the Y axis of the chart, all from a single query against normalized data using cross-tabulation.

Conclusion

I urge you to click on the link I mentioned near the top of the article if you want to learn how to do this. My example is fairly specific to my situation. That article showed me exactly what I needed to do, but I needed to apply it to my data and my reports. You’ll notice that I’ve used CAST’s in my query. The reason being that in order to accept any type of data, I’ve used the ‘TEXT’ type on my game data fields–that’s why I also have a type table and a type ID in the field types. I then query what types of fields are numeric, and only perform my calculations against them.

Doing cross-tabluation/pivot tables isn’t too bad once you know what you’re doing. I had a tough time wrapping my head around it, but after a few hours of tinkering, I finally got what I was looking for. Another thing to note is that I’m summing up 1s and 0s, basically a tally. If you want to sum up your real data, don’t use 1s and 0s; use your actual data field.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)