
A Nearly Self-organizing Tag Cloud

4 Jan 2010
Show and use items in a tag cloud for data entry.

Figure 1: The TagCloud Control

Introduction

I've often seen tag clouds on websites, but I needed one in Windows Forms, so I developed a control that you can easily add to a form. It visualizes the strings of a text or a collection according to their importance. You can also click the items to use them for data entry. Every click increases the clicked item's importance and may change how the items are displayed, because each item is weighted relative to all the other items in the cloud.

Background

The strings (I also call them 'items') are stored in an XML file with a simple layout. Every item is stored in a Tag element, which has three attributes: Text, Occurrence, and Last. Example:

<Tag Text="ABC of programming" Occurrence="5" Last="633923147566312334" />

Text contains the item's text, Occurrence stores how often the item has been used (or how many times it appears in a text), and Last contains the time of the last access (in DateTime.Ticks). The control can access a default XML file called 'Tags.xml', which is located in the program directory. You can also load and save any user-created tag file.
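As a minimal sketch of how such an element can be read, the following uses LINQ to XML. It is not taken from the control's XMLHandling class; the class name TagXmlSketch and the method ParseTag are hypothetical and only illustrate the three attributes described above.

```csharp
using System;
using System.Xml.Linq;

public static class TagXmlSketch
{
    // Hypothetical helper: parses one <Tag .../> element into its three values.
    public static (string Text, double Occurrence, DateTime Last) ParseTag(string xml)
    {
        XElement tag = XElement.Parse(xml);
        string text = (string)tag.Attribute("Text");
        double occurrence = (double)tag.Attribute("Occurrence");
        // 'Last' holds DateTime.Ticks, so it converts directly into a DateTime.
        long ticks = (long)tag.Attribute("Last");
        return (text, occurrence, new DateTime(ticks));
    }
}
```

Storing ticks keeps the file independent of any date format or culture setting, at the cost of not being human-readable.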

To demonstrate the control's use, I've written a wrapper which shows you most of the control's abilities. Please be aware that the wrapper is not a masterpiece in GUI programming. Its only purpose is to show some effects in connection with the tag cloud.

In the screenshot (figure 2), the button 'Read Default Tags XML' has been pressed, so the contents of the default file 'Tags.xml' are shown in the tag cloud. On the right, 'How many items shown in cloud' tells you how many items are displayed in the cloud, and 'How many items at all' tells you how many items exist in total. In our example, 34 items can be displayed. The number of items visible in the tag cloud depends on the control's size (you can change it in steps of '+10' in the wrapper) and on the interaction of the item weights, that is, their occurrences.

The item weights are visualized by five different output designs (font style and color). You can change them through the control's SetDesign methods (described later). The wrapper demonstrates how to change the output design of the least important items with 'Set Output design 1', and of those in the midrange with 'Set Output design 3'. You can reset to the defaults using 'Set Standard Output designs'.

Figure 2: The TagCloud Control, Embedded in TagCloudWrapper

You can also change the control's background color (the default is Color.Azure) and the look of the tags. Setting the property 'Underline on' underlines an item on mouse hover, and setting 'Frame on' draws frames around the items (the frame is also the area that reacts to a click on the item).

Figure 3: The TagCloud Control with Frame On

Clicking an item raises an event. You can see this in the wrapper's 'Tags Information' area: the clicked text is shown in the textbox 'Text' and its occurrence in the textbox 'Occurrence'. Every click on an item increases its weight by 1 and, depending on the other items, may change its output design in the cloud (or not!). Which output design an item gets is calculated statistically from the mean occurrence of all items and the standard deviation. You can also find out the 'Most' and 'Least' clicked items as well as the 'Youngest' and 'Oldest' clicked ones.

In the area 'Some Tag Manipulations', you can see how items are added (please study the source code to see what really happens there), and you can 'Clear all items' in the cloud. With 'Add this item', you can add an item to the cloud; enter its text in the textbox to the left of the button. A new item's weight (occurrence) is set to the average value of all items so that it actually appears. If it started with weight 1, it might not be displayed because other items would dominate.

To manipulate the cloud control's contents, you can open a context menu by clicking the right mouse button over an item. You can add a new item, remove an item, or change an item text.

Figure 4: Context Menu in TagCloud Control

Another way to show tags in the cloud is to open a text file or an HTML file. This is demonstrated in the area 'Read Tags from text and HTML files'. The three buttons there demonstrate opening a text or HTML file, evaluating the important tags inside, and displaying the tags in the cloud control. As many words in English (or German) are filler words, I've added two files, ExcludeList-en.txt and ExcludeList-de.txt, which contain words that are ignored when the list of words for the cloud is built. These are words like: I, you, are, if, them, here, about, never... The list that is read depends on the expected language. In my example, that is English or German (no French, Spanish, Italian... sorry). I know that the lists are not complete, so feel free to add or remove words as you like, or even to create an exclude list for a language you need. Another exclude list, ExcludeList-html.txt, is used when an HTML file is shown. It contains some of the syntax of HTML and web design (but is not complete either).

Furthermore, the cloud control supports drag and drop. You can take any text or HTML file and drop it onto the control. Please avoid very large files. I've implemented some checks to prevent files of the wrong format from being loaded, so you may get a message like this:

Figure 5: Unsuccessful Drag and Drop

Below, you can see an English text file (example: majorca_en.txt) which has been dragged and dropped over the cloud control. The height and width have been extended manually to show more items in the control.

Figure 6: Successful Drag and Drop with File majorca_en.txt

Using the Code

To use the cloud control, you have to add 'TagCloud.dll' to your references. You can see it best in the TagCloudWrapper project, which is part of the solution.

The following sections describe the classes of the 'TagCloud' project.

TXTHandling

This class is responsible for reading text or HTML files. The words in a file are separated at delimiters such as ', . : " ;'. In the next step, each word is checked to see whether it is 'wanted', meaning that it is not part of the applicable exclude list.

But let's start from the beginning. First, the file is checked to determine whether it is a text file or an HTML file at all. A rudimentary algorithm determines whether the content is English or German, and whether it is plain text or HTML:

void GetFileTypeAndLanguage(string filename, 
   out bool ishtml, 
   out TagCloudControl.TextLanguage language)

Depending on the result, either (ishtml == false):

bool ReadTextFile(out SortedList<string, TagCloudControl.StringItem> sc, 
   string filename, 
   TagCloud.TagCloudControl.TextLanguage language)

or (ishtml == true):

bool ReadHTMLFile(out SortedList<string, TagCloudControl.StringItem> sc, 
  string filename, 
  TagCloud.TagCloudControl.TextLanguage language)

is called. The appropriate exclude list is applied to the words, depending on the language (English or German). For an HTML file, the HTML exclude list is applied as well to filter out HTML-specific words. The result is a list of 'wanted' words, stored in the SortedList sc. Parts of this list are then displayed in the cloud control.
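The splitting and counting step can be sketched roughly as follows. This is an illustration under assumptions, not the code of ReadTextFile: the class WordCountSketch, the delimiter set, and the use of a plain double as the list value (instead of TagCloudControl.StringItem) are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Assumed delimiter set; the real class may use a different one.
    private static readonly char[] Delimiters =
        { ' ', '\t', '\r', '\n', ',', '.', ':', ';', '"', '!', '?', '(', ')' };

    // Splits the text into words and counts every word accepted by 'wanted'.
    public static SortedList<string, double> CountWords(string text, Predicate<string> wanted)
    {
        var counts = new SortedList<string, double>(StringComparer.Ordinal);
        foreach (string word in text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!wanted(word)) continue;
            counts.TryGetValue(word, out double n); // n stays 0 for a new word
            counts[word] = n + 1;
        }
        return counts;
    }
}
```

The 'wanted' predicate is where an exclude list would plug in, which keeps the tokenizing independent of the language-specific filtering.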

One short comment on the exclude lists: as mentioned above, they are called: ExcludeList-en.txt, ExcludeList-de.txt, and ExcludeList-html.txt.

They are normal text files, which you can change on your own.

The syntax is:

Exclude:

word1, word2, word3, word4,..
wordxxx,wordyyy, wordzzz,...
...

Include:

word1, word2

Example part of 'ExcludeList-en.txt':

Exclude:

a,able,about,above,ah,ain't,alas,am,an,any,anybody,
anyone,anywhere,after,again,against,ago,all,also,
although,among,and,are,around,as,at,away,
...

Include:

May,US,

You can see that there is also an 'Include' part. Its purpose is to keep words that would otherwise be excluded. The words in the 'Exclude' part should be written in lower case and separated by commas. As the words 'may' and 'May' have different meanings, list in the 'Include' part every word that should not be excluded when it is written exactly as given there (the 'Include' entries are case sensitive). Please feel free to add or delete words. I'm not a native English speaker, so I think there is plenty of room for improvement.
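To make the syntax concrete, here is a small sketch of a parser for such a list together with a lookup. It reflects my reading of the rules above (lower-case match against 'Exclude', exact match against 'Include') and is not the code of FillExcludeIncludeList or WantedWord; the names ExcludeListSketch and IsWanted are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class ExcludeListSketch
{
    private readonly HashSet<string> exclude = new HashSet<string>(StringComparer.Ordinal);
    private readonly HashSet<string> include = new HashSet<string>(StringComparer.Ordinal);

    // Reads the lines of an exclude list file ("Exclude:" / "Include:" sections).
    public ExcludeListSketch(IEnumerable<string> lines)
    {
        HashSet<string> current = null;
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Equals("Exclude:", StringComparison.OrdinalIgnoreCase)) { current = exclude; continue; }
            if (line.Equals("Include:", StringComparison.OrdinalIgnoreCase)) { current = include; continue; }
            if (current == null) continue; // text before the first section header is ignored

            foreach (string part in line.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string word = part.Trim();
                if (word.Length == 0) continue;
                // Exclude entries are normalized to lower case; Include entries keep their casing.
                current.Add(current == exclude ? word.ToLowerInvariant() : word);
            }
        }
    }

    // A word is wanted if it is listed under Include exactly as written,
    // or if its lower-case form is not listed under Exclude.
    public bool IsWanted(string word)
    {
        if (include.Contains(word)) return true;
        return !exclude.Contains(word.ToLowerInvariant());
    }
}
```

With 'may' under 'Exclude' and 'May' under 'Include', the month survives while the modal verb is dropped.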

Some methods of the class 'TXTHandling' that deal with word processing are:

int FillExcludeIncludeList(
    TagCloud.TagCloudControl.TextLanguage language, 
    bool clear)
bool WantedWord(string newword)
bool WantedHTMLWord(string newword)
bool IsRomanNumber(string number)
bool ContainsDigit(string word)
bool IsTimeSpan(string timespan)
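The article shows only the signatures of these helpers. As a rough idea of what two of them might do, here is one possible way to write such checks; the bodies below are my own assumptions (including the choice to accept only upper-case Roman numerals, so that a word like 'mix' is not filtered out), and the class name WordCheckSketch is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCheckSketch
{
    // Standard pattern for Roman numerals from I to MMMCMXCIX (upper case only).
    private static readonly Regex Roman = new Regex(
        "^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");

    public static bool IsRomanNumber(string number)
    {
        // The pattern also matches the empty string, so exclude that case explicitly.
        return !string.IsNullOrEmpty(number) && Roman.IsMatch(number);
    }

    public static bool ContainsDigit(string word)
    {
        return word.Any(char.IsDigit);
    }
}
```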

XMLHandling

The class 'XMLHandling' contains the methods for reading and writing a tag file, which holds the tags (items) of a cloud. The methods 'ReadTagFile' and 'WriteTagFile' are overloaded: they access either a given filename or the default tag file 'Tags.xml'. The default filename is returned by the method 'GetDefaultFileName'.

FileFormats

The class 'FileFormats' contains methods to check file formats. Formats from the following categories are examined:

  • Graphic file formats
  • Audio file formats
  • Video file formats
  • Compressed file formats
  • Program and development file formats
  • Document file formats
  • System file formats
  • Database file formats

The list is not exhaustive - feel free to add further file formats there. The files are recognized in a simple way by reading their magic numbers. These are sequences of bytes, most of them at the beginning of the file, which allow you to get a general idea of which kind of file it could be. In this solution, the methods are used to avoid 'wrong' files being dragged and dropped over the cloud control.
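A magic number check can be sketched as follows. The signatures listed are the well-known ones for PNG, JPEG, GIF, ZIP and PDF; the class MagicNumberSketch and its methods are hypothetical and do not mirror the actual 'FileFormats' class.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MagicNumberSketch
{
    // A few well-known signatures found at the very start of a file.
    private static readonly KeyValuePair<string, byte[]>[] Signatures =
    {
        new KeyValuePair<string, byte[]>("PNG",  new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        new KeyValuePair<string, byte[]>("JPEG", new byte[] { 0xFF, 0xD8, 0xFF }),
        new KeyValuePair<string, byte[]>("GIF",  new byte[] { 0x47, 0x49, 0x46, 0x38 }),
        new KeyValuePair<string, byte[]>("ZIP",  new byte[] { 0x50, 0x4B, 0x03, 0x04 }),
        new KeyValuePair<string, byte[]>("PDF",  new byte[] { 0x25, 0x50, 0x44, 0x46 }),
    };

    // Returns the name of the first matching format, or null if none matches.
    public static string Identify(byte[] header)
    {
        foreach (KeyValuePair<string, byte[]> sig in Signatures)
        {
            if (StartsWith(header, sig.Value)) return sig.Key;
        }
        return null;
    }

    // Reads only the first bytes of the file, so large files are not a problem.
    public static string IdentifyFile(string path)
    {
        var buffer = new byte[8];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(buffer, 0, buffer.Length);
            Array.Resize(ref buffer, read);
        }
        return Identify(buffer);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A drop handler could reject the file whenever Identify returns a non-null name, since a recognized binary format is neither plain text nor HTML.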

Statistics

The class 'Statistics' is responsible for delivering the mean value of a collection of values:

double Mean(IEnumerable<double> values)

and the standard deviation:

double StandardDeviation(IEnumerable<double> values, out double mean)

Both values are needed to evaluate an item in the context of the other items, in order to decide whether it is shown in the cloud control and which weight it has there. Depending on its weight, it is assigned one of 5 output designs (font, color).
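The following sketch shows how the two values could be computed and used to pick one of the five designs. The article does not state whether the population or the sample standard deviation is used, nor where the boundaries between the designs lie; the population formula and the thresholds of 0.5 and 1.0 standard deviations are assumptions, and the class StatisticsSketch with its method DesignFor is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StatisticsSketch
{
    public static double Mean(IEnumerable<double> values)
    {
        return values.Average();
    }

    // Population standard deviation (assumption); also returns the mean.
    public static double StandardDeviation(IEnumerable<double> values, out double mean)
    {
        List<double> list = values.ToList();
        double m = list.Average();
        mean = m;
        return Math.Sqrt(list.Sum(v => (v - m) * (v - m)) / list.Count);
    }

    // Maps an occurrence to a design number 1..5 by its distance from the mean.
    // The thresholds are illustrative only.
    public static int DesignFor(double occurrence, double mean, double sd)
    {
        if (sd == 0) return 3; // all items equal: use the middle design
        double z = (occurrence - mean) / sd;
        if (z < -1.0) return 1;
        if (z < -0.5) return 2;
        if (z <= 0.5) return 3;
        if (z <= 1.0) return 4;
        return 5;
    }
}
```

Because the design depends on the mean and the deviation of the whole collection, a single click can move other items into a different design even though their own occurrence did not change.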

AddItem and RenameItem

These classes are simple dialog boxes that are shown when the corresponding entry is clicked in the context menu. 'AddItem' lets you add a new item to the cloud; 'RenameItem' lets you change an existing item's text.

TagCloudControl

This is the central class for managing and displaying the cloud control. It contains many properties and methods to configure and drive the control. In the following section, I will describe some of them. The best way to understand the functionality is to study the method and property calls in TagCloudWrapper.

Let's start with the properties:

// Sets or gets the control's back color
public Color ControlBackColor

// Sets or gets the condition for drawing a rectangle
public bool ControlTextFrame

// Sets or gets the condition for drawing
// the text strings underlined on mouse hover
public bool ControlTextUnderline

// Sets or gets the control's height.
// Changing the height rebuilds and repaints the cloud
public int ControlHeight

// Sets or gets the control's width.
// Changing the width rebuilds and repaints the cloud
public int ControlWidth

// Gets the most clicked text item from
// the list with text items (StringCollection)
public string MostClickedItem

// Gets the least clicked text item from
// the list with text items (StringCollection)
public string LeastClickedItem

// Gets the oldest clicked text item from
// the list with text items (StringCollection)
public string OldestClickedItem 

// Gets the youngest clicked text item from
// the list with text items (StringCollection)
public string YoungestClickedItem

// Gets the number of the items in StringCollection
public int ItemsCount 

// Gets the number of the items shown in the cloud
public int ShownItemsCount 

The control raises only one event, on a mouse click on an item:

[Description("Delegates a click onto the user control to the wrapper")]
// delegate declaration
public delegate void OnUserControlClick(string Text, double Occurrence);
// event declaration
public event OnUserControlClick clickHandler;

The event passes the text and the occurrence of the clicked item to the handler.
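Subscribing to the event could look like the sketch below. To keep it self-contained, it uses a stand-in class that only mirrors the delegate and event declared above; TagCloudStandIn, SimulateClick, and ClickDemo are hypothetical names. With the real control, you would attach the handler to the TagCloudControl instance on your form in the same way.

```csharp
using System;

// Stand-in for the real control: only the delegate and the event from the article are mirrored.
public class TagCloudStandIn
{
    public delegate void OnUserControlClick(string Text, double Occurrence);
    public event OnUserControlClick clickHandler;

    // Hypothetical helper that simulates a click on an item for this sketch.
    public void SimulateClick(string text, double occurrence)
    {
        clickHandler?.Invoke(text, occurrence);
    }
}

public static class ClickDemo
{
    public static string LastMessage;

    // Attaches a handler that records the clicked text and its occurrence.
    public static void Attach(TagCloudStandIn cloud)
    {
        cloud.clickHandler += (text, occurrence) =>
            LastMessage = text + " (" + occurrence + ")";
    }
}
```

In a form, the handler body would typically copy the two values into controls such as the wrapper's 'Text' and 'Occurrence' textboxes.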

Public Methods for Output Design Management

// Sets the 5 predefined output designs
public void SetAllDesigns(bool update)
 
// Sets an output design
public bool SetDesign(int number, string font, float size, Color color)

// Overloaded, sets an output design
public bool SetDesign(int number, OutputDesign od)

Public Methods for String Management

// Adds or replaces a text item to the list of items (StringCollection).
// If the item is new, its occurrence is set to 1,
// else its previous occurrence is incremented by 1.
// The attribute 'last' (last access) will be set to 'now'.
public void AddItem(string text)
// Overloaded, adds a text item to the list of items (StringCollection).
// The attribute 'last' (last access) will be set to 'now'.
public void AddItem(string text, double occ)
// Overloaded, adds a text item to the list of items (StringCollection).
// The attribute 'last' (last access) will be set by last.
public void AddItem(string text, double occ, long last)
// Overloaded, adds a complete list sc to the list of items (StringCollection).
public void AddItem(SortedList<string, StringItem> sc)
// Overloaded, adds or replaces a text item to the list of items (StringCollection).
// If the item exists already in StringCollection,
// the variable regardmean will be interpreted:
// - if regardmean is true and the item is not yet member of CloudCollections, 
// the occurrence will be set to the mean occurrence
// of the items in the StringCollection
// - otherwise its occurrence is incremented by 1.
// If the item does not exist, the variable regardmean will be interpreted:
// - false: the occurrence will be set to 1
// - true: the occurrence will be set to the mean occurrence
// of the items in the StringCollection.
// The attribute 'last' (last access) will be set to 'now'.
public void AddItem(string text, bool regardmean)
// Removes the item text from the item list (StringCollection).
public void RemoveItem(string text)
// Clears the list with the items (StringCollection) completely.
public void ClearAllItems()
// Returns all the items from the StringCollection.
public SortedList<string, StringItem> GetAllItems()
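The rules in the comment of AddItem(string text, bool regardmean) can be condensed into a few lines. The sketch below follows my reading of that comment and is not the control's code: ItemStoreSketch is a hypothetical class, a dictionary stands in for StringCollection, a set stands in for the items currently shown in the cloud, the 'last' attribute is left out, and the fallback to 1 for an empty collection is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ItemStoreSketch
{
    // Stand-in for StringCollection: item text -> occurrence.
    public readonly Dictionary<string, double> Items = new Dictionary<string, double>();

    // Stand-in for the set of items currently shown in the cloud.
    public readonly HashSet<string> Shown = new HashSet<string>();

    public void AddItem(string text, bool regardMean)
    {
        // Assumption: with an empty collection, the mean falls back to 1.
        double mean = Items.Count > 0 ? Items.Values.Average() : 1.0;
        bool exists = Items.TryGetValue(text, out double current);

        if (exists && !(regardMean && !Shown.Contains(text)))
        {
            // Existing item that is already shown, or regardMean is false: count one more use.
            Items[text] = current + 1;
        }
        else
        {
            // New item, or existing item not yet shown with regardMean set.
            Items[text] = regardMean ? mean : 1.0;
        }
    }
}
```

Starting at the mean gives an item a realistic chance to be displayed right away instead of being crowded out by items with a long click history.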

Private Methods to Build the Cloud

These methods are all private, as they are only called when:

  • strings are added to or removed from the string collection
  • the weights of the items change
  • the output design of the items changes
  • the design of the cloud changes

The methods are responsible for building the cloud's contents and showing the items:

private void SetCloud()
private void CopyStringsToCloud()
private int BuildCloud(ref StringItem si)

The algorithm works as follows: the items are processed in alphabetical order. Each item gets an output design (one of 5) according to its weight. Then the control calculates whether the item's text dimensions fit into the cloud. If the text would exceed the right border, a new line is started. The height of a line is the maximum height of the items on that line. This is easiest to see in figure 3, where the text borders are shown (frame on). The new line is filled in the same way, and this continues until the lower border of the cloud control is reached. If the list of items has not yet reached its end at that point, the 'weakest' item is removed from the list of items to be displayed. The 'weakest' item is the one that has the lowest occurrence and has not been accessed for the longest time.

Now the filling of the cloud control starts again in a new cycle. As an item has been removed, the weights of the other items may have changed. So the output designs may have changed due to the new statistical calculations, and accordingly, the text dimensions may have changed too. The whole thing is a recursive process that aims to display as many of the 'important' items as possible. Here, 'important' means having a high occurrence and having been accessed recently. For details, please study the code. It can probably be optimized, but on my machine I have no performance problems.
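The fill-and-remove cycle described above can be sketched in a simplified form. This is not the code of SetCloud or BuildCloud: the types CloudItem and CloudLayoutSketch are hypothetical, the cycle is written as a loop instead of a recursion, and each item carries a fixed text size, whereas the real control recalculates designs (and therefore sizes) after every removal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CloudItem
{
    public string Text;
    public double Occurrence;
    public long Last;   // DateTime.Ticks of the last access
    public int Width;   // measured text width in pixels (fixed in this sketch)
    public int Height;  // measured text height in pixels (fixed in this sketch)
}

public static class CloudLayoutSketch
{
    // Places the items left to right, wrapping at cloudWidth; true if all of them fit.
    public static bool Fits(IList<CloudItem> items, int cloudWidth, int cloudHeight)
    {
        int x = 0, y = 0, lineHeight = 0;
        foreach (CloudItem item in items)
        {
            if (x > 0 && x + item.Width > cloudWidth)
            {
                // Right border exceeded: start a new line below the current one.
                y += lineHeight;
                x = 0;
                lineHeight = 0;
            }
            x += item.Width;
            lineHeight = Math.Max(lineHeight, item.Height);
            if (item.Width > cloudWidth || y + lineHeight > cloudHeight) return false;
        }
        return true;
    }

    // Removes the weakest item until the remaining items fit into the cloud.
    public static List<CloudItem> SelectShown(IEnumerable<CloudItem> all, int cloudWidth, int cloudHeight)
    {
        List<CloudItem> shown = all.OrderBy(i => i.Text, StringComparer.OrdinalIgnoreCase).ToList();
        while (shown.Count > 0 && !Fits(shown, cloudWidth, cloudHeight))
        {
            // Weakest: lowest occurrence, and among those the oldest access.
            CloudItem weakest = shown.OrderBy(i => i.Occurrence).ThenBy(i => i.Last).First();
            shown.Remove(weakest);
        }
        return shown;
    }
}
```

Each pass removes exactly one item, so the loop ends after at most as many passes as there are items.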

Points of Interest

I plan to use the cloud control in two future applications.

First, I want to write a program that applies some basic modifications to images taken with a digital camera. One feature will be renaming the files, and I think I will use the tag cloud to supply keywords for semi-automatic file renaming. I plan to build a set of keywords from our last holidays, just as suggested in the example in figure 6.

Second, I plan to write a program that helps you learn by asking you questions, which you have to answer correctly. The questions are asked in random order, but I could imagine it would be a good thing to additionally offer a tag cloud containing theme-based keywords you can click.

I hope you have many more good ideas - let me know!

History

  • Initial version 1.0: December 2009.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
