Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / All-Topics

Break Down Data Into a Tree of Categories - Easy Category Tree Breakdown

5.00/5 (1 vote)
14 Apr 2016CPOL3 min read 4.5K  
Break Down Data Into a Tree of Categories - Easy Category Tree Breakdown

When you are looking at data, it can be helpful to place it into categories based on using what I call a category tree breakdown. This can give us additional information that may not be apparent without looking at the breakdown.

In this simple example, say we have gathered a simple dataset on various political candidates, including: First Name, Last Name, Party Affiliation, Gender, State Affiliation, and Job. Maybe, we would want to see which genders were associated with which jobs or what the similarities are across party or state affiliations.

Each of these could be done individually in SQL, but that would be inefficient and wouldn't expose the power of being able to perform a breakdown on the fly.

The Dataset in PHP

PHP
$rows = array(
    array('fname' => 'Ted', 'lname' => 'Cruz', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'Texas', 'job' => 'Senator'),
    array('fname' => 'John', 'lname' => 'Kasich', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'Ohio', 'job' => 'Governor'),
    array('fname' => 'Donald', 'lname' => 'Trump', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'New York', 'job' => 'Chairman'),
    array('fname' => 'Hillary', 'lname' => 'Clinton', 
    'party' => 'Democrat', 'gender' => 'female', 
    'state' => 'New York', 'job' => 'Secretary of State'),
    array('fname' => 'Bernie', 'lname' => 'Sanders', 
    'party' => 'Democrat', 'gender' => 'male', 
    'state' => 'Vermont', 'job' => 'Senator'),
);

All I need to do to perform a breakdown is to supply the function with the full dataset and specify which keys to look at to start building categories. The order of the breakdown is important and determines the structure of the tree that is generated as a result.

For example, let's say I want to look at party affiliation. Then for each party, I want to know what are the positions held by those candidates. Then for each of those jobs I want to see which states have an affiliation.

That can be done by specifying the breakdown array as follows:

PHP
$breakdown_arr = array('party', 'job', 'state');
show_breakdown($breakdown_arr, $rows);

I like to use named keys in my array because that makes it easier to read than integer indexes.

When this breakdown is applied to the dataset, the following result displays:

Republican
    Senator
        Texas
    Governor
        Ohio
    Chairman
        New York
Democrat
    Secretary of State
        New York
    Senator
        Vermont

Specify Any Type of Breakdown

The breakdown can be specified in any order we want. Say we want to look again at the party affiliation, but this time by gender and first name:

PHP
$breakdown_arr = array('party', 'gender', 'fname');
show_breakdown($breakdown_arr, $rows);

This shows me a completely different take on the data. We see here that while the Republican party is dominated by males, the Democratic party is split evenly by gender.

Republican
    male
        Ted
        John
        Donald
Democrat
    female
        Hillary
    male
        Bernie

The Algorithm For Performing the Breakdown

JavaScript
function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
    if (count($breakdown_indexes) < 1) {
        return array();
    }
    if (!$data) {
        $data = new datum();
    }
    $metric_index = array_shift($breakdown_indexes);
    $to_add = array();
    foreach ($rows as $row) {
        $val = $row[$metric_index];
        $to_add[$val][] = $row;
    }
    //Now loop through the children to continue the breakdown
    foreach ($to_add as $key => $to_add_val) {
        $datum = new datum();
        $datum->name = $key;
        $datum->data = $to_add_val;
        $data->addChild($datum);
        get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
    }
    return $data;
}

The way this code works is that it continually slices off a category index to look for in the dataset. Then, it loops through the passed rows and puts them into groups based on their values for the index. These newly categorized values (which are stored in the $to_add array) need to be again gone through for the next breakdown (if there is one) and so are passed along to the next iteration so the next breakdown can be checked. The remaining breakdown list, the unchecked values that were just added, and the $data object containing the result are all passed back into the method as a recursive call.

When there are no more rows to be added and subsequently checked, the recursive call stops being made and the resulting $data object is returned to the original calling function.

Unfortunately, this call can be expensive when many rows and categories are being parsed. If you have 'n' rows and 'm' columns to review, so the big O(n) notation and where k is the number of distinct entries for each category is something like O(n) = n+k^m.

Try this algorithm to find interesting facts on your own datasets.

Complete Example For Performing Category Tree Breakdowns in PHP

PHP
<?php
//first name, last name, position
$rows = array(
    array('fname' => 'Ted', 'lname' => 'Cruz', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'Texas', 'job' => 'Senator'),
    array('fname' => 'John', 'lname' => 'Kasich', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'Ohio', 'job' => 'Governor'),
    array('fname' => 'Donald', 'lname' => 'Trump', 
    'party' => 'Republican', 'gender' => 'male', 
    'state' => 'New York', 'job' => 'Chairman'),
    array('fname' => 'Hillary', 'lname' => 'Clinton', 
    'party' => 'Democrat', 'gender' => 'female', 
    'state' => 'New York', 'job' => 'Secretary of State'),
    array('fname' => 'Bernie', 'lname' => 'Sanders', 
    'party' => 'Democrat', 'gender' => 'male', 
    'state' => 'Vermont', 'job' => 'Senator'),
);

$breakdown_indexes = array('party', 'job', 'state');
show_breakdown($breakdown_indexes, $rows);
/*
Output:
    For breakdown in form of: ["party","job","state"]
    Breakdown result is:

        Republican
            Senator
                Texas
            Governor
                Ohio
            Chairman
                New York
        Democrat
            Secretary of State
                New York
            Senator
                Vermont
*/
$breakdown_indexes = array('party', 'gender', 'fname');
show_breakdown($breakdown_indexes, $rows);
/*
Output:
    For breakdown in form of: ["party","gender","fname"]
    Breakdown result is:

        Republican
            male
                Ted
                John
                Donald
        Democrat
            female
                Hillary
            male
                Bernie
*/

function show_breakdown($breakdown_indexes, $rows) {
    $new_data = get_array_breakdown($breakdown_indexes, $rows);
    echo 'For breakdown in form of: '. json_encode($breakdown_indexes).PHP_EOL;
    echo 'Breakdown result is: '.PHP_EOL;
    echo $new_data;
    echo PHP_EOL;

    return $new_data;
}

function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
    if (count($breakdown_indexes) < 1) {
        return array();
    }
    if (!$data) {
        $data = new datum();
    }
    $metric_index = array_shift($breakdown_indexes);
    $to_add = array();
    foreach ($rows as $row) {
        $val = $row[$metric_index];
        $to_add[$val][] = $row;
    }
    //Now loop through the children to continue the breakdown
    foreach ($to_add as $key => $to_add_val) {
        $datum = new datum();
        $datum->name = $key;
        $datum->data = $to_add_val;
        $data->addChild($datum);
        get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
    }

    return $data;
}

class datum {
    public $name = '';
    public $data;
    public $children = array();

    private $depth = 0;

    public function addChild($child) {
        $child->depth = $this->depth + 1;
        $this->children[] = $child;
    }

    public function __toString() {
        $str = $this->toStringHelper();
        foreach ($this->children as $child) {
            $str .= $child;
        }

        return $str;
    }

    private function toStringHelper() {
        $str = str_pad(' ', $this->depth * 4);
        $str .= $this->name;
        $str .= PHP_EOL;

        return $str;
    }
}

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)