When you are looking at data, it can be helpful to place it into categories based on using what I call a category tree breakdown. This can give us additional information that may not be apparent without looking at the breakdown.
In this simple example, say we have gathered a simple dataset
on various political candidates, including: First Name, Last Name, Party Affiliation, Gender, State Affiliation, and Job. Maybe, we would want to see which genders were associated with which jobs or what the similarities are across party or state affiliations.
Each of these could be done individually in SQL, but that would be inefficient and wouldn't expose the power of being able to perform a breakdown on the fly.
The Dataset in PHP
$rows = array(
array('fname' => 'Ted', 'lname' => 'Cruz',
'party' => 'Republican', 'gender' => 'male',
'state' => 'Texas', 'job' => 'Senator'),
array('fname' => 'John', 'lname' => 'Kasich',
'party' => 'Republican', 'gender' => 'male',
'state' => 'Ohio', 'job' => 'Governor'),
array('fname' => 'Donald', 'lname' => 'Trump',
'party' => 'Republican', 'gender' => 'male',
'state' => 'New York', 'job' => 'Chairman'),
array('fname' => 'Hillary', 'lname' => 'Clinton',
'party' => 'Democrat', 'gender' => 'female',
'state' => 'New York', 'job' => 'Secretary of State'),
array('fname' => 'Bernie', 'lname' => 'Sanders',
'party' => 'Democrat', 'gender' => 'male',
'state' => 'Vermont', 'job' => 'Senator'),
);
All I need to do to perform a breakdown is to supply the function with the full dataset and specify which keys to look at to start building categories. The order of the breakdown is important and determines the structure of the tree that is generated as a result.
For example, let's say I want to look at party affiliation. Then for each party, I want to know what are the positions held by those candidates. Then for each of those jobs I want to see which states have an affiliation.
That can be done by specifying the breakdown array as follows:
$breakdown_arr = array('party', 'job', 'state');
show_breakdown($breakdown_arr, $rows);
I like to use named keys in my array because that makes it easier to read than integer indexes.
When this breakdown is applied to the dataset, the following result displays:
Republican
Senator
Texas
Governor
Ohio
Chairman
New York
Democrat
Secretary of State
New York
Senator
Vermont
Specify Any Type of Breakdown
The breakdown can be specified in any order we want. Say we want to look again at the party affiliation, but this time by gender and first name:
$breakdown_arr = array('party', 'gender', 'fname');
show_breakdown($breakdown_arr, $rows);
This shows me a completely different take on the data. We see here that while the Republican party is dominated by males, the Democratic party is split evenly by gender.
Republican
male
Ted
John
Donald
Democrat
female
Hillary
male
Bernie
The Algorithm For Performing the Breakdown
function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
if (count($breakdown_indexes) < 1) {
return array();
}
if (!$data) {
$data = new datum();
}
$metric_index = array_shift($breakdown_indexes);
$to_add = array();
foreach ($rows as $row) {
$val = $row[$metric_index];
$to_add[$val][] = $row;
}
foreach ($to_add as $key => $to_add_val) {
$datum = new datum();
$datum->name = $key;
$datum->data = $to_add_val;
$data->addChild($datum);
get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
}
return $data;
}
The way this code works is that it continually slices off a category index to look for in the dataset
. Then, it loops through the passed rows and puts them into groups based on their values for the index. These newly categorized values (which are stored in the $to_add
array) need to be again gone through for the next breakdown (if there is one) and so are passed along to the next iteration so the next breakdown can be checked. The remaining breakdown list, the unchecked values that were just added, and the $data
object containing the result are all passed back into the method as a recursive call.
When there are no more rows to be added and subsequently checked, the recursive call stops being made and the resulting $data
object is returned to the original calling function.
Unfortunately, this call can be expensive when many rows and categories are being parsed. If you have 'n' rows and 'm' columns to review, so the big O(n) notation and where k is the number of distinct entries for each category is something like O(n) = n+k^m.
Try this algorithm to find interesting facts on your own datasets.
Complete Example For Performing Category Tree Breakdowns in PHP
<?php
$rows = array(
array('fname' => 'Ted', 'lname' => 'Cruz',
'party' => 'Republican', 'gender' => 'male',
'state' => 'Texas', 'job' => 'Senator'),
array('fname' => 'John', 'lname' => 'Kasich',
'party' => 'Republican', 'gender' => 'male',
'state' => 'Ohio', 'job' => 'Governor'),
array('fname' => 'Donald', 'lname' => 'Trump',
'party' => 'Republican', 'gender' => 'male',
'state' => 'New York', 'job' => 'Chairman'),
array('fname' => 'Hillary', 'lname' => 'Clinton',
'party' => 'Democrat', 'gender' => 'female',
'state' => 'New York', 'job' => 'Secretary of State'),
array('fname' => 'Bernie', 'lname' => 'Sanders',
'party' => 'Democrat', 'gender' => 'male',
'state' => 'Vermont', 'job' => 'Senator'),
);
$breakdown_indexes = array('party', 'job', 'state');
show_breakdown($breakdown_indexes, $rows);
$breakdown_indexes = array('party', 'gender', 'fname');
show_breakdown($breakdown_indexes, $rows);
function show_breakdown($breakdown_indexes, $rows) {
$new_data = get_array_breakdown($breakdown_indexes, $rows);
echo 'For breakdown in form of: '. json_encode($breakdown_indexes).PHP_EOL;
echo 'Breakdown result is: '.PHP_EOL;
echo $new_data;
echo PHP_EOL;
return $new_data;
}
function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
if (count($breakdown_indexes) < 1) {
return array();
}
if (!$data) {
$data = new datum();
}
$metric_index = array_shift($breakdown_indexes);
$to_add = array();
foreach ($rows as $row) {
$val = $row[$metric_index];
$to_add[$val][] = $row;
}
foreach ($to_add as $key => $to_add_val) {
$datum = new datum();
$datum->name = $key;
$datum->data = $to_add_val;
$data->addChild($datum);
get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
}
return $data;
}
class datum {
public $name = '';
public $data;
public $children = array();
private $depth = 0;
public function addChild($child) {
$child->depth = $this->depth + 1;
$this->children[] = $child;
}
public function __toString() {
$str = $this->toStringHelper();
foreach ($this->children as $child) {
$str .= $child;
}
return $str;
}
private function toStringHelper() {
$str = str_pad(' ', $this->depth * 4);
$str .= $this->name;
$str .= PHP_EOL;
return $str;
}
}