Background
This control is inspired by Wordle, the free web-based word cloud generator. In fact, the control is a spin-off of my project at http://sourcecodecloud.codeplex.com.
I really loved the visualizations produced by Wordle, but my goal was to write a local, non-web-based solution for processing large amounts of sensitive data. I found a number of components on the web, but most of them either performed very poorly when processing text and rendering the visualization, or produced a layout that was not what I expected.
Architecture and usage
There are four phases when visualizing the word cloud:
Processing data such as text, HTML, or source code, and extracting the relevant words while ignoring others. As an example, I have implemented three extractors. TextExtractor extracts all words from a text string, ignoring spaces and all non-letter characters. FileExtractor is able to process large text files line by line. UriExtractor fetches the content of a URL and tries to strip HTML tags and JavaScript (to be honest, I implemented it only as a showcase, and its filtering capabilities are very poor).
To tap your own data source, just implement the IEnumerable<string> interface or derive from BaseExtractor.
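As a rough illustration of tapping a custom data source, here is a minimal sketch of a class that implements IEnumerable<string> directly. The class name LogLineExtractor and its constructor are hypothetical and not part of the control; deriving from BaseExtractor is not shown because its members are not described here.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Text;

// Hypothetical data source (not part of the control):
// yields the letter-only words found in a sequence of lines.
public sealed class LogLineExtractor : IEnumerable<string>
{
    private readonly IEnumerable<string> _lines;

    public LogLineExtractor(IEnumerable<string> lines)
    {
        _lines = lines;
    }

    public IEnumerator<string> GetEnumerator()
    {
        var word = new StringBuilder();
        foreach (string line in _lines)
        {
            foreach (char c in line)
            {
                if (char.IsLetter(c))
                {
                    word.Append(c);
                }
                else if (word.Length > 0)
                {
                    // Any non-letter character ends the current word.
                    yield return word.ToString();
                    word.Clear();
                }
            }

            // The end of a line also ends the current word.
            if (word.Length > 0)
            {
                yield return word.ToString();
                word.Clear();
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Because the words are produced lazily with yield return, a source like this can be enumerated without holding all of its words in memory at once.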
Counting words and ignoring those on the blacklist.
The result is an enumeration of pairs, each consisting of a term (word) and an integer representing the number of occurrences of that word in the text. In the first implementation, I used KeyValuePair<string, int> to represent them. In this version, I switched to the IWord interface.
public interface IWord : IComparable<IWord>
{
    string Text { get; }
    int Occurrences { get; }
    string GetCaption();
}
I have also moved to LINQ and given up my own classes for word counting, grouping, and sorting. I loved them very much, but using LINQ increased readability, reduced complexity, and shortened the code. Getting all this at the price of a negligible performance penalty was a really good deal.
IBlacklist blacklist = new CommonWords();
IProgressIndicator progress = new ProgressBarWrapper(progressBar);
IEnumerable<string> terms = new StringExtractor(textBox.Text, progress);
cloudControl.WeightedWords =
    terms
        .Filter(blacklist)
        .CountOccurences()
        .SortByOccurences();
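To give an idea of what such a pipeline does, the following sketch expresses filtering, counting, and sorting with standard LINQ operators only. It is not the control's implementation of Filter, CountOccurences, or SortByOccurences; the class WordCountSketch, the method CountWords, and the use of a plain ISet<string> as the blacklist are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: standard LINQ operators standing in for the
// control's own extension methods.
public static class WordCountSketch
{
    // Returns (word, occurrences) pairs, most frequent first,
    // skipping every word found in the blacklist.
    public static List<KeyValuePair<string, int>> CountWords(
        IEnumerable<string> terms, ISet<string> blacklist)
    {
        return terms
            .Where(term => !blacklist.Contains(term))
            // Treat "Cloud" and "cloud" as the same word.
            .GroupBy(term => term, StringComparer.OrdinalIgnoreCase)
            .Select(group => new KeyValuePair<string, int>(group.Key, group.Count()))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }
}
```

The blacklist passed in should use a case-insensitive comparer, for example new HashSet<string>(StringComparer.OrdinalIgnoreCase), if blacklisted words are to be ignored regardless of case.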
Layout: I use a QuadTree data structure to create a non-overlapping map of words on the control's graphics surface. The same data structure is also used to ask the control which words lie under a certain rectangular area or point. This query is used to redraw only a particular area when needed, or to perform some action when the control is clicked. Knowing which word was clicked makes it possible to perform a word-related action, such as showing statistics or navigating to a URL.
private void cloudControl_Click(object sender, EventArgs e)
{
    LayoutItem itemUnderMouse;
    // Convert the screen position of the mouse to control coordinates.
    Point mousePositionRelativeToControl =
        cloudControl.PointToClient(new Point(MousePosition.X, MousePosition.Y));
    if (!cloudControl.TryGetItemAtLocation(
        mousePositionRelativeToControl, out itemUnderMouse))
    {
        // No word under the mouse pointer.
        return;
    }
    MessageBox.Show(itemUnderMouse.Word);
}
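The idea behind the quadtree can be sketched roughly as follows. This is a simplified illustration, not the QuadTree class used by the control: the Rect struct, the QuadTreeNode<T> class, and the MinSize limit are all hypothetical. Each node stores the rectangles that do not fit entirely into one of its four quadrants, so a query only has to descend into quadrants that overlap the queried area.

```csharp
using System.Collections.Generic;

// Hypothetical axis-aligned rectangle standing in for word bounds.
public struct Rect
{
    public readonly float X, Y, Width, Height;

    public Rect(float x, float y, float width, float height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    // True if 'other' lies completely inside this rectangle.
    public bool Contains(Rect other)
    {
        return other.X >= X && other.Y >= Y
            && other.X + other.Width <= X + Width
            && other.Y + other.Height <= Y + Height;
    }

    // True if the two rectangles overlap (touching edges do not count).
    public bool Intersects(Rect other)
    {
        return other.X < X + Width && X < other.X + other.Width
            && other.Y < Y + Height && Y < other.Y + other.Height;
    }
}

public sealed class QuadTreeNode<T>
{
    private const float MinSize = 1f; // stop subdividing below this size
    private readonly Rect _bounds;
    private readonly List<KeyValuePair<Rect, T>> _items = new List<KeyValuePair<Rect, T>>();
    private QuadTreeNode<T>[] _children;

    public QuadTreeNode(Rect bounds)
    {
        _bounds = bounds;
    }

    public void Insert(Rect area, T item)
    {
        if (_bounds.Width > MinSize && _bounds.Height > MinSize)
        {
            if (_children == null) _children = CreateChildren();
            foreach (QuadTreeNode<T> child in _children)
            {
                if (child._bounds.Contains(area))
                {
                    // The item fits entirely into one quadrant: push it down.
                    child.Insert(area, item);
                    return;
                }
            }
        }
        // The item spans several quadrants (or the node is too small): keep it here.
        _items.Add(new KeyValuePair<Rect, T>(area, item));
    }

    // Returns every item whose rectangle overlaps the given area.
    public IEnumerable<T> Query(Rect area)
    {
        foreach (KeyValuePair<Rect, T> entry in _items)
        {
            if (entry.Key.Intersects(area)) yield return entry.Value;
        }
        if (_children == null) yield break;
        foreach (QuadTreeNode<T> child in _children)
        {
            if (!child._bounds.Intersects(area)) continue;
            foreach (T item in child.Query(area)) yield return item;
        }
    }

    private QuadTreeNode<T>[] CreateChildren()
    {
        float w = _bounds.Width / 2f, h = _bounds.Height / 2f;
        return new[]
        {
            new QuadTreeNode<T>(new Rect(_bounds.X, _bounds.Y, w, h)),
            new QuadTreeNode<T>(new Rect(_bounds.X + w, _bounds.Y, w, h)),
            new QuadTreeNode<T>(new Rect(_bounds.X, _bounds.Y + h, w, h)),
            new QuadTreeNode<T>(new Rect(_bounds.X + w, _bounds.Y + h, w, h)),
        };
    }
}
```

In a sketch like this, a layout would call Query with a candidate rectangle and place the word only if nothing comes back, while a click handler would call it with a small rectangle around the mouse position.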
Configuring the Word Cloud Control
There are several things you may vary on this control:
You can change the font type and size.
cloudControl.MinFontSize = 6;
cloudControl.MaxFontSize = 60;
cloudControl.Font = new Font(new FontFamily("Verdana"), 8, FontStyle.Regular);
Use different colours:
cloudControl.Palette = new Brush[] {Brushes.DarkRed, Brushes.Red, Brushes.LightPink};
Use a different layout. Currently, there are two layouts implemented. You can implement your own by deriving from BaseLayout or just by implementing the ILayout interface on your own.
cloudControl.LayoutType = LayoutType.Typewriter;
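To illustrate what the MinFontSize and MaxFontSize settings above could mean for an individual word, here is a small sketch that maps an occurrence count linearly onto that font size range. Whether the control uses linear scaling is an assumption; the class FontSizeSketch and the method Scale are hypothetical names, not part of the control.

```csharp
using System;

// Illustrative only: one plausible way to derive a font size from a word count.
public static class FontSizeSketch
{
    // Maps 'occurrences' from [minOccurrences, maxOccurrences]
    // linearly onto [minFontSize, maxFontSize].
    public static float Scale(
        int occurrences, int minOccurrences, int maxOccurrences,
        float minFontSize, float maxFontSize)
    {
        if (maxOccurrences <= minOccurrences)
        {
            // All words are equally frequent: no range to interpolate over.
            return maxFontSize;
        }

        float ratio = (float)(occurrences - minOccurrences)
                      / (maxOccurrences - minOccurrences);
        ratio = Math.Max(0f, Math.Min(1f, ratio)); // clamp out-of-range counts
        return minFontSize + ratio * (maxFontSize - minFontSize);
    }
}
```

With the settings shown above (6 and 60), the rarest word would be drawn at size 6 and the most frequent one at size 60 under this assumption.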
The layout logic and the drawing of graphics are strictly separated by the IGraphicEngine interface, so I think it would not be a big deal to port the control to WPF or Silverlight in the future.
For experts
By digging in the code, you will discover the following extra features:
- Creating your own blacklist - IBlacklist interface or the CommonBlacklist base class.
- Loading a blacklist from a file - CommonBlacklist.CreateFromFile(...) method.
- Grouping words that share a common stem, like departed, depart, departing.
- You are even able to see statistics on it.
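Grouping by common stem can be pictured with the following sketch. It uses a deliberately naive suffix-stripping rule in place of a real stemming algorithm, and it does not reflect how the control implements this feature; StemGroupingSketch, NaiveStem, and GroupByStem are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: naive suffix stripping stands in for a real stemmer.
public static class StemGroupingSketch
{
    private static readonly string[] Suffixes = { "ing", "ed", "s" };

    // Strips the first matching suffix, keeping at least three characters.
    public static string NaiveStem(string word)
    {
        string lower = word.ToLowerInvariant();
        foreach (string suffix in Suffixes)
        {
            if (lower.Length > suffix.Length + 2
                && lower.EndsWith(suffix, StringComparison.Ordinal))
            {
                return lower.Substring(0, lower.Length - suffix.Length);
            }
        }
        return lower;
    }

    // Groups the given words under their naive stem.
    public static Dictionary<string, List<string>> GroupByStem(IEnumerable<string> words)
    {
        return words
            .GroupBy(word => NaiveStem(word))
            .ToDictionary(group => group.Key, group => group.ToList());
    }
}
```

With this rule, departed, depart, and departing all end up in one group keyed by "depart", so their occurrences could be counted together.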
Credits