WordCloud - A Squarified Treemap of Word Frequency

Alex D. Mawhinney

10 Aug 2007

A squarified treemap of word frequency

Introduction

WordCloud is a visual depiction of word frequency: how many times each word is used within a given body of text. It reads in plain text, filters out "stop words", counts how many times each remaining word is used, and displays the results in a squarified treemap. (In the images above, the larger a node and the more saturated its color, the more frequently the word is used.)

Background

I was really impressed, and inspired, by Chirag Mehta's cool web-based tag cloud generator of US Presidential Speeches. So I took a shot at doing a simplified version using .NET.

At best, I'm a hobbyist with the technologies used in this example, so I'm deferring to the various articles I read that led to creating WordCloud.

The Squarified Treemap

Display is handled by Microsoft's TreemapGenerator, part of the Data Visualization Components suite. While a true treemap utilizes both hierarchical and proportional attributes, WordCloud only uses proportional attributes to show word count.
Wikipedia's Treemapping overview is a good place to start for understanding the origins.
Jonathan Hodgson's Squarified Treemaps Code Project article is an excellent in-depth look at this subject.
WordCloud performs the same basic function as a tag cloud.
Newsmap is a very impressive Flash-based squarified treemap of Google News.
The Internet tagging site del.icio.us has a treemap of its most popular tags.
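
To make "proportional attributes only" concrete, here is a minimal sketch of how a flat list of word counts could be turned into node areas. It is not the squarified layout itself and not TreemapGenerator's code; the class and method names (ProportionalAreas, Compute) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of WordCloud or TreemapGenerator.
public static class ProportionalAreas
{
    // Gives every word a share of the canvas area that is proportional
    // to its count. There is no hierarchy: all words are siblings.
    public static Dictionary<string, double> Compute(
        IDictionary<string, int> counts, double width, double height)
    {
        double total = 0;
        foreach (int c in counts.Values)
            total += c;

        Dictionary<string, double> areas = new Dictionary<string, double>();
        if (total <= 0)
            return areas; // nothing to lay out

        double canvas = width * height;
        foreach (KeyValuePair<string, int> pair in counts)
            areas[pair.Key] = canvas * pair.Value / total;

        return areas;
    }
}
```

With "dream" counted 3 times and "freedom" once on a 400 x 100 canvas, "dream" gets an area of 30000 and "freedom" 10000. A squarified layout then decides how to cut those areas into rectangles that are as close to square as possible.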

Stemming

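WordCloud uses the Porter stemming algorithm to reduce words with a common origin to a single stem, so that variants of the same word are counted together. The node is labeled with the first form of the word that was encountered.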
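
The effect of stemming on the counts can be sketched as follows. Note the assumption: ToyStem below is a deliberately crude suffix stripper that I made up for illustration. It is NOT the Porter algorithm, and the class name StemGrouping is hypothetical too. The point is only to show how words that share a stem end up in one bucket.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; WordCloud itself uses a Porter stemmer.
public static class StemGrouping
{
    // Toy stand-in for a real stemmer: strips one of a few common
    // suffixes, keeping at least three characters of the word.
    public static string ToyStem(string word)
    {
        string[] suffixes = { "ing", "ed", "s" };
        foreach (string suffix in suffixes)
        {
            if (word.Length - suffix.Length >= 3 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Counts words by stem, so "walk", "walks" and "walked" share a count.
    public static Dictionary<string, int> CountByStem(IEnumerable<string> words)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            string stem = ToyStem(word.ToLowerInvariant());
            int current;
            counts.TryGetValue(stem, out current); // current is 0 when absent
            counts[stem] = current + 1;
        }
        return counts;
    }
}
```

Feeding in "walk", "walks", "walked" and "walking" produces a single entry, "walk", with a count of 4. A real stemmer handles far more cases than these three suffixes.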

Stop words

Stop words are common words that say little about a text on their own. WordCloud filters them out before counting.
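
As a rough sketch of the filtering step, assuming the stop words are available as a collection of strings: the class name StopWordFilter is hypothetical, and copying the list into a dictionary for fast lookups is my choice for this example rather than what WordCloud does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; shown only to illustrate stop word filtering.
public static class StopWordFilter
{
    // Returns the words that are not in the stop word list,
    // comparing without regard to case.
    public static List<string> Filter(
        IEnumerable<string> words, ICollection<string> stopWords)
    {
        // Copy into a dictionary so each lookup is a hash probe
        // instead of a scan over the whole list.
        Dictionary<string, bool> stopSet =
            new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);
        foreach (string s in stopWords)
            stopSet[s] = true;

        List<string> kept = new List<string>();
        foreach (string word in words)
        {
            if (!stopSet.ContainsKey(word))
                kept.Add(word);
        }
        return kept;
    }
}
```

Filtering "I", "have", "a", "dream" against the stop words "i", "have", "a" leaves only "dream".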

The Code

To build WordCloud, you'll need to grab the latest version of Microsoft's Data Visualization Components, and update WordCloud's project references to include TreemapGenerator. You'll find this reference in \VisualizationComponents\Treemap\Latest\Components\TreemapGenerator.dll. NOTE: WordCloud needs .NET Framework 2.0 or greater to build and run.

TreemapPanel.cs

TreemapPanel handles node rendering. Nodes are preprocessed into an ArrayList collection and then added to the TreemapGenerator. Object data is stored within each node in the form of NodeInfo.

// Treemap drawing engine in TreemapPanel.cs
protected TreemapGenerator m_TreemapGenerator;

...

public void PopulateTreeMap(Hashtable wordsHash, Hashtable stemmedWordsHash)
{
    AssertValid();

    ArrayList nodes = new ArrayList();
    ArrayList aKeys = new ArrayList(stemmedWordsHash.Keys);
    aKeys.Sort();

    foreach (string key in aKeys)
    {
        //build each node element
        int count = (int)stemmedWordsHash[key];
        string name = (string)wordsHash[key];

        //show count in node?
        if(m_bShowWordCount)
            name += String.Format(" ({0})", count);
        NodeInfo nodeinfo = new NodeInfo(name, count);
        nodes.Add(nodeinfo);
    }
    m_nodes = nodes;
    RepopulateTreeMap();
}

...

private void RepopulateTreeMap()
{
    if(m_nodes.Count == 0)
        return;

    Nodes TreemapGeneratorNodes;

    //reset treemap
    m_TreemapGenerator.Clear();

    TreemapGeneratorNodes = m_TreemapGenerator.Nodes;

    foreach(NodeInfo n in m_nodes)
    {
        //does this node have enough to display?
        if(n.Count >= m_nDisplayCount)
        {
            //Create node with basic default size and color
            Node oWordNode = new Node(n.Name, n.Count * 50.0f, 0F);

            //set object data
            oWordNode.Tag = n;

            //add category to tree
            TreemapGeneratorNodes.Add(oWordNode);

            //used later for determining node color; the two bounds are
            //tested independently so one node can update both
            if (n.Count > m_nLargestCount)
                m_nLargestCount = n.Count;
            if (n.Count < m_nSmallestCount)
                m_nSmallestCount = n.Count;
        }
    }
}

Drawing Nodes

The treemap uses owner drawing for its nodes: OnPaint tells the TreemapGenerator to draw, which fires the DrawItem event once for each node.

// We want to do owner drawing, so handle the DrawItem event.
m_TreemapGenerator.DrawItem += 
    new TreemapGenerator.TreemapDrawItemEventHandler(DrawItem);

...

protected override void OnPaint(PaintEventArgs e)
{
    AssertValid();

    // Save the Graphics object so it can be accessed by DrawItem().
    m_Graphics = e.Graphics;

    // Tell the TreemapGenerator to draw the treemap using owner-
    // implemented code.  This causes the DrawItem event to get fired for
    // each node in the treemap.
    m_TreemapGenerator.Draw(this.ClientRectangle);

    // All DrawItem events have been fired.  Make sure the Graphics object
    // doesn't get used again.
    m_Graphics = null;
}

Node rendering is handled in DrawItem(). Within this method we extract the NodeInfo object, get name and count, set color and text size based on count, and then draw the node. Final node result: the greater the count, the larger the text and more saturated the color.

private void DrawItem(Object sender, TreemapDrawItemEventArgs e)
{
    AssertValid();

    Node oNode = e.Node;
    float fontSize = m_FontSize;
    int count = 0;

    // Retrieve the NodeInfo object from the node's tag.
    if (oNode.Tag is NodeInfo)
    {
        //get word count
        NodeInfo oInfo = (NodeInfo)oNode.Tag;
        count = oInfo.Count;

        //if we're using text scaling, increment font size
        if(m_bUseTextScaling)
            fontSize += oInfo.Count;
    }
    else
    {
        //should never get here
        Debug.WriteLine("DrawItem: Skipping node - bad");
        return;
    }
    //set color alpha based on frequency
    Color newStartColor = GetColor(count, m_startColor);
    Color newEndColor = GetColor(count, m_endColor);

    //set gradient colors and gamma
    LinearGradientBrush nodeBrush = new LinearGradientBrush(e.Bounds,
        newStartColor, newEndColor, LinearGradientMode.Vertical);

    nodeBrush.GammaCorrection = true;

    m_Graphics.FillRectangle(nodeBrush, e.Bounds);

    // Create font and align in the center
    Font newfont = new Font(m_FontName, fontSize, m_FontStyle);
    StringFormat sf = new StringFormat();
    sf.Alignment = StringAlignment.Center;
    sf.LineAlignment = StringAlignment.Center;

    //draw the text
    SolidBrush textBrush = new SolidBrush(m_FontColor);
    m_Graphics.DrawString(e.Node.Text, newfont, textBrush, e.Bounds, sf);

    // Draw a black border around each node
    Pen blackPen = new Pen(Color.Black, 2);
    m_Graphics.DrawRectangle(blackPen, e.Bounds);

    //clean up every GDI+ object created for this node
    nodeBrush.Dispose();
    newfont.Dispose();
    sf.Dispose();
    textBrush.Dispose();
    blackPen.Dispose();
}
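
GetColor() is not listed in this article. As a minimal sketch of the idea "color alpha based on frequency", the helper below maps a count onto an alpha value using the smallest and largest counts collected earlier. The class name FrequencyAlpha, the method Scale and the minAlpha parameter are all assumptions for illustration, and the linear mapping may differ from what WordCloud actually does.

```csharp
using System;

// Hypothetical helper; not the GetColor() used by WordCloud.
public static class FrequencyAlpha
{
    // Maps a count in [smallest, largest] linearly onto [minAlpha, 255].
    // minAlpha is assumed to lie between 0 and 255.
    public static int Scale(int count, int smallest, int largest, int minAlpha)
    {
        if (largest <= smallest)
            return 255; // every node has the same count: fully opaque

        // Clamp so an out-of-range count cannot produce an invalid alpha.
        int clamped = Math.Max(smallest, Math.Min(largest, count));
        double t = (double)(clamped - smallest) / (largest - smallest);
        return minAlpha + (int)Math.Round(t * (255 - minAlpha));
    }
}
```

With counts ranging from 1 to 10 and a minimum alpha of 55, a count of 1 maps to 55 and a count of 10 maps to 255. The result could then be combined with a base color through System.Drawing's Color.FromArgb(alpha, baseColor).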

"Massaging" The Text

A worker thread method in the main form, DoWordProcessing(), processes the input document word by word. Stemming (word suffix stripping) is also performed in this method.

private void DoWordProcessing(object obj)
{
    //unpack array
    object[] objArray = (object[])obj;
    IProgressCallback callback = (IProgressCallback)objArray[0];
    StringBuilder sbRawText = (StringBuilder)objArray[1];
    ArrayList arrStopWords = (ArrayList)objArray[2];

    try
    {
        //Build a hash of words and their frequency
        Hashtable wordsHash = new Hashtable();
        Hashtable stemmedWordsHash = new Hashtable();
        PorterStemmer ps = new PorterStemmer();

        //construct our document from the input text
        Document doc = new Document(sbRawText.ToString());

        callback.Begin(0, doc.Words.Count);

        for (int i = 0; i < doc.Words.Count; ++i)
        {
            //cancel button clicked? (the finally block calls End())
            if (callback.IsAborting)
            {
                return;
            }
            //update progress dialog
            callback.SetText(String.Format("Reading word: {0}", i));
            callback.StepTo(i);

            //Don't do numbers
            if (!IsNumeric(doc.Words[i]))
            {
                // normalize each word to lowercase
                string key = doc.Words[i].ToLower();

                //check stop words list
                if (!arrStopWords.Contains(key))
                {
                    //set our stemming term
                    ps.stemTerm(key);

                    //get the stem word
                    string stemmedKey = ps.getTerm();

                    //either add to hash or increment frequency
                    if (!stemmedWordsHash.Contains(stemmedKey))
                    {
                        //add new word
                        stemmedWordsHash.Add(stemmedKey, 1);
                        wordsHash.Add(stemmedKey, key);
                    }
                    else
                    {
                        //increment word count
                        stemmedWordsHash[stemmedKey] = 
                            (int)stemmedWordsHash[stemmedKey] + 1;
                    }
                }
            }
        }
        //now let the treemap load the information
        this.TreePanel.PopulateTreeMap(wordsHash, stemmedWordsHash);
    }
    catch (System.Threading.ThreadAbortException)
    {
        // noop
    }
    catch (System.Threading.ThreadInterruptedException)
    {
        // noop
    }
    finally
    {
        if (callback != null)
        {
            callback.End();
        }
    }
}
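
IsNumeric() is called above but not listed. One way it might be written, purely as an assumption on my part, is with double.TryParse. The class name WordChecks is hypothetical, and the choice of the invariant culture and of which number styles to accept is mine.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; WordCloud's own IsNumeric() is not shown in the article.
public static class WordChecks
{
    // True when the word parses as a finite number, e.g. "1963" or "1,588".
    public static bool IsNumeric(string word)
    {
        double value;
        bool parsed = double.TryParse(
            word,
            NumberStyles.Float | NumberStyles.AllowThousands,
            CultureInfo.InvariantCulture,
            out value);

        // "NaN" and "Infinity" can parse successfully but are words
        // here, not numbers, so they are rejected explicitly.
        return parsed && !double.IsNaN(value) && !double.IsInfinity(value);
    }
}
```

Under these assumptions "1963" and "1,588" are skipped as numbers, while "dream", "28th" and "Infinity" are kept as words.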

The Demo Application

Controls

Description of the toolbar buttons (in order from left to right):

Open Text File: Open a text file document to visualize
Input Text: Paste text into this dialog from another document to visualize (128k max, but can be changed to your liking)
Stop Words: A dialog allowing you to modify the set of stop words**
Font: A dialog allowing you to set the display font
Node Color: A dialog allowing you to set the gradient colors for node display
Scale Text: Toggle for scaling text relative to count
Show Count: Toggle for showing/hiding word count in nodes**
Minimum word count slider: Dynamically controls how many nodes to display based on word frequency
Save as image: Save the treemap as a GIF image

**NOTE: Document text is not retained in memory; it's only parsed, added to the treemap as nodes, and then discarded. So the Show Count and Stop Words settings only take effect if they are set before opening or inputting text; changing them afterwards does not show or hide the counts, or reapply stop words, in the treemap already on screen.

Input Data

I've tried various document sizes, ranging from 400 to 6000 words - mostly presidential speeches and the like. In the project, I've included two text files: mlk.txt and kennedy.txt. These are Martin Luther King's "I Have a Dream" address at the March on Washington, August 28, 1963, and former United States President John F. Kennedy's 1961 State of the Union Address - 1,588 and 5,184 words respectively.

Another issue to be aware of is stop words. I've added a default set of stop words which is user configurable and greatly affects word parsing. The 430 stop words provided are fairly standard and cover a wide number of stop words without getting too aggressive.

Conclusion

While crude, un-optimized, not web-based, and entry level at best when compared to other tag/word cloud generators, the example could perhaps be a starting point for someone interested in the idea. It also may serve as a basic example using Microsoft's TreemapGenerator from the Data Visualization Components suite.

Attribution

Tony Capone's Google Groups posting for the TreemapGenerator code

Matthew Adams's Progress Dialog

Leif Azzopardi's port of the Porter stemming algorithm

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.