Introduction
I recently had a conversation with Srini (aka Mike) Vasan (CEO at Quantum Ventura) on the subject of semantic analysis that led to this fun little project. The idea is to take a Twitter stream (using the wonderful open source Tweetinvi C# API) and marry the output to a word cloud, which I've actually implemented as a Force Directed Graph using, as a starting point, the code that Bradley Smith blogged about in 2010. I looked at a few word cloud generators, but none were suited for real-time updating. A force directed graph, however, is a perfect way of creating a living, dynamic view of tweets as they happen in real time.
A (silent) video shows this best, so I've posted one here: https://www.youtube.com/watch?v=vEH_1h0jrZY
Wait until the 10 second mark for fun stuff to start happening.
The salient points of this applet are:
- Word hit counts are shown in different font sizes (8pt plus the hit count, capped at 36pt)
- Word hit counts are also reflected in the colorization, from blue (representing 1 hit) to red (representing 24 or more hits)
- A maximum of 100 words are shown
- To accommodate new words once the maximum is reached, the oldest existing word with 1 hit is removed
- To prevent saturation of words with more than 1 hit, the counts on those words are slowly decremented over time
Source Code
The source code is on GitHub, here: https://github.com/cliftonm/TwitterWordCloud-WinForm
Accessing Twitter with the Tweetinvi API
This is very simple. You'll need to first obtain a consumer key and consumer secret from Twitter here: https://apps.twitter.com/. Then, get an access token and access token secret from here: https://api.twitter.com/oauth/request_token
Once you've done that, the credentials are set up in the API with the call:
TwitterCredentials.SetCredentials("Access_Token", "Access_Token_Secret", "Consumer_Key", "Consumer_Secret");
To get this working in the app, you'll need to place these keys into a file called "twitterauth.txt" in the bin\Debug (or bin\Release) folder. The format should be:
[Access Token]
[Access Token Secret]
[Consumer Key]
[Consumer Secret]
For example (made up numbers):
bl6NVMpfD
bxrhfA8v
svdaQ86mNTE
lvGwXzG3MJnN
The values you get from Twitter will be much longer.
I read these four lines and initialize the credentials with:
protected void TwitterAuth()
{
    string[] keys = File.ReadAllLines("twitterauth.txt");
    TwitterCredentials.SetCredentials(keys[0], keys[1], keys[2], keys[3]);
}
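Because a missing or short file would otherwise surface as an index-out-of-range error, it can help to validate the file before handing the values to the API. The following is a minimal sketch, not part of the project: AuthFile and ReadKeys are hypothetical names, and it assumes that blank lines and surrounding whitespace in the file can safely be ignored.

```csharp
using System;
using System.IO;
using System.Linq;

public static class AuthFile
{
    // Returns the four non-empty lines in file order
    // (access token, access token secret, consumer key, consumer secret),
    // or throws an exception with a descriptive message.
    public static string[] ReadKeys(string path)
    {
        string[] keys = File.ReadAllLines(path)
                            .Select(line => line.Trim())
                            .Where(line => line.Length > 0)
                            .ToArray();

        if (keys.Length != 4)
        {
            throw new InvalidDataException(
                "Expected 4 non-empty lines in " + path + " but found " + keys.Length);
        }

        return keys;
    }
}
```

The returned array can then be passed to SetCredentials in the same order as shown above.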
Starting / Stopping a Filtered Stream
The code gracefully handles starting and stopping a stream. By graceful, I mean that, if a stream exists, we shut down the current stream, wait for the event that indicates it has stopped, and then start a new stream.
protected void RestartStream(string keyword)
{
    if (stream != null)
    {
        Clear();
        // Start the new stream only after the old one reports that it has stopped.
        stream.StreamStopped += (sender, args) => StartStream(keyword);
        stream.StopStream();
    }
    else
    {
        StartStream(keyword);
    }
}

protected void StartStream(string keyword)
{
    stream = Stream.CreateFilteredStream();
    stream.AddTrack(keyword);
    stream.MatchingTweetReceived += (sender, args) =>
    {
        // Only English tweets are fed into the graph.
        if (args.Tweet.Language == Language.English)
        {
            UpdateFdg(args.Tweet.Text);
        }
    };
    stream.StartStreamMatchingAllConditionsAsync();
}

protected void OnStop(object sender, EventArgs e)
{
    if (stream != null)
    {
        stream.StreamStopped += (s, args) => stream = null;
        stream.StopStream();
    }
}

protected void Clear()
{
    // Detach every node from the diagram, then forget the words.
    wordNodeMap.ForEach(kvp => kvp.Value.Diagram = null);
    wordNodeMap.Clear();
}
Parsing the Tweet
There are lots of parts of speech that need to be filtered out of a tweet. At the moment, the dictionary of words to exclude is hardcoded:
protected List<string> skipWords = new List<string>(new string[] { "a", "an", "and", "the", "it", ... etc ...
You get the idea.
We also want to remove punctuation (a somewhat brute force approach):
protected List<string> punctuation = new List<string>(new string[] { ".", ",", ";", "?", "!" });
public static class Extensions
{
    public static string StripPunctuation(this string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (!char.IsPunctuation(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
and filter out specific components of the tweet and the words in our dictionary:
protected bool EliminateWord(string word)
{
    bool ret = false;
    int n;
    if (int.TryParse(word, out n))
    {
        ret = true;
    }
    else if (word.StartsWith("#"))
    {
        ret = true;
    }
    else if (word.StartsWith("http"))
    {
        ret = true;
    }
    else
    {
        ret = skipWords.Contains(word);
    }
    return ret;
}
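As a self-contained illustration of the filtering rules above, here is a minimal sketch that tokenizes a tweet and applies them in one pass. TweetFilter, Tokenize and the five-word stop list are assumptions made for this example, not code from the project. The sketch deliberately tests for hashtags and links before stripping punctuation, because char.IsPunctuation returns true for '#', so a word that has already been stripped no longer starts with "#". It also uses a HashSet for the stop words, which keeps lookups cheap as the list grows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TweetFilter
{
    // Hypothetical, abbreviated stop-word list for the example only.
    private static readonly HashSet<string> SkipWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "the", "it" };

    public static string StripPunctuation(string s)
    {
        return new string(s.Where(c => !char.IsPunctuation(c)).ToArray());
    }

    // Returns the lower-cased words of a tweet that survive all filters.
    public static List<string> Tokenize(string tweet)
    {
        var result = new List<string>();
        string[] parts = tweet.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string raw in parts)
        {
            // Check the raw token first: stripping punctuation would remove the '#'.
            if (raw.StartsWith("#") || raw.StartsWith("http", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            string word = StripPunctuation(raw).ToLowerInvariant();
            int n;

            // Skip tokens that were pure punctuation, plain numbers, or stop words.
            if (word.Length == 0 || int.TryParse(word, out n) || SkipWords.Contains(word))
            {
                continue;
            }

            result.Add(word);
        }

        return result;
    }
}
```

For example, Tokenize("The oil train derailed! #news http://t.co/x 42") yields "oil", "train" and "derailed".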
Avoiding Saturation and Accommodating New Tweets
As mentioned earlier, once we reach the limit of 100 words, we remove stale words to make room for new words:
protected void RemoveAStaleWord()
{
    if (wordNodeMap.Count > MaxWords)
    {
        // Only single-hit words are candidates; the one updated longest ago comes first.
        var candidates = wordNodeMap.Where(w => w.Value.Count == 1).
            OrderBy(w => w.Value.UpdatedOn);
        // Guard: First() would throw if every word currently has more than one hit.
        if (candidates.Any())
        {
            KeyValuePair<string, TextNode> tnode = candidates.First();
            tnode.Value.Diagram = null;
            wordNodeMap.Remove(tnode.Key);
            wordTweetMap.Remove(tnode.Key);
        }
    }
}
The above algorithm applies only to words with a hit count of one. If we don't do this, high volume streams like "Obama" result in words never gaining any traction, because of the huge volume of tweets coming in. By eliminating only the oldest "rabble", we get a nice word cloud of the concerns around President Obama.
Saturation is avoided by reducing word counts over time, based on the number of tweets (iterations) modulo some saturation value, currently set to 20. In other words, every 20 tweets, the count of every word with more than one hit is decremented:
protected void ReduceCounts()
{
    if (iteration % SaturationCount == 0)
    {
        iteration = 0;
        wordNodeMap.Where(wc => wc.Value.Count > 1).Select(wc => wc.Key).ForEach(w => wordNodeMap[w].DecrementCount());
    }
}
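The two housekeeping rules above (evict the oldest single-hit word when the limit is reached, and decrement counts every N tweets) can be modeled independently of the graph code. The sketch below is a simplified stand-in, not the project's implementation: WordStats, BeginTweet, Add and HitsFor are hypothetical names, and a logical tick counter is used in place of DateTime.Now. It also evicts before inserting, so the cap is exact.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class WordStats
{
    private class Entry { public int Count; public long UpdatedTick; }

    private readonly Dictionary<string, Entry> words = new Dictionary<string, Entry>();
    private readonly int maxWords;
    private readonly int saturationCount;
    private long tick;       // logical clock, standing in for DateTime.Now
    private int iteration;   // tweets seen since the last decay pass

    public WordStats(int maxWords, int saturationCount)
    {
        this.maxWords = maxWords;
        this.saturationCount = saturationCount;
    }

    public int Count { get { return words.Count; } }
    public bool Contains(string word) { return words.ContainsKey(word); }
    public int HitsFor(string word) { Entry e; return words.TryGetValue(word, out e) ? e.Count : 0; }

    // Call once per tweet: every saturationCount tweets, counts above 1 drop by one.
    public void BeginTweet()
    {
        if (++iteration % saturationCount != 0) return;
        iteration = 0;
        foreach (Entry e in words.Values)
        {
            if (e.Count > 1) e.Count--;
        }
    }

    public void Add(string word)
    {
        Entry e;
        if (words.TryGetValue(word, out e))
        {
            e.Count++;
            e.UpdatedTick = ++tick;
            return;
        }
        if (words.Count >= maxWords) EvictOldestSingleHit();
        words[word] = new Entry { Count = 1, UpdatedTick = ++tick };
    }

    // Only single-hit words are candidates, so popular words survive bursts of new ones.
    private void EvictOldestSingleHit()
    {
        List<string> candidates = words.Where(kv => kv.Value.Count == 1)
                                       .OrderBy(kv => kv.Value.UpdatedTick)
                                       .Select(kv => kv.Key)
                                       .ToList();
        if (candidates.Count > 0) words.Remove(candidates[0]);
    }
}
```

With a limit of 2, adding "oil" twice, then "train", then "wreck" evicts "train" (the oldest single-hit word) and keeps "oil" at 2 hits. Three subsequent calls to BeginTweet with a saturation count of 3 bring "oil" back down to 1.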
Queueing the Tweet
The tweets are received asynchronously, so we put a lock around adding them to the queue:
protected void UpdateFdg(string text)
{
    lock (this)
    {
        tweetQueue.Enqueue(text);
    }
}
De-Queueing the Tweet
The entire process of updating the FDG is done in the application's main thread, specifically in the OnPaint method, which is called 20 times a second by invalidating the owner-draw panel:
timer = new System.Windows.Forms.Timer();
timer.Interval = 1000 / 20;
timer.Tick += (sender, args) => pnlCloud.Invalidate(true);
In the Paint event handler, we dequeue the tweet, update the nodes in the graph, execute a single iteration cycle of the FDG and draw the results:
pnlCloud.Paint += (sender, args) =>
{
    Graphics gr = args.Graphics;
    gr.SmoothingMode = System.Drawing.Drawing2D.SmoothingMode.HighQuality;
    ++paintIteration;
    if (!overrun)
    {
        overrun = true;
        int maxTweets = 20;
        // Process a bounded batch per frame (at most 19 tweets, because of the pre-decrement).
        while (tweetQueue.Count > 0 && (--maxTweets > 0))
        {
            string tweet;
            lock (this)
            {
                tweet = tweetQueue.Dequeue();
            }
            SynchronousUpdate(tweet);
        }
        // One iteration of the force directed layout, then draw the result.
        diagram.Iterate(Diagram.DEFAULT_DAMPING, Diagram.DEFAULT_SPRING_LENGTH, Diagram.DEFAULT_MAX_ITERATIONS);
        diagram.Draw(gr, Rectangle.FromLTRB(12, 24, pnlCloud.Width - 12, pnlCloud.Height - 36));
        overrun = false;
    }
    else
    {
        gr.DrawString("overrun", font, brushBlack, new Point(3, 3));
    }
};
I've never seen the application emit an overrun message, so I assume that everything processes fast enough for that not to be an issue. (Since the timer tick and the Paint handler both run on the UI thread, the handler should not normally be re-entered in any case.) The processing of the incoming tweets, filtering the words, etc., could all be done in separate threads. However, for simplicity, and because a lot of the existing FDG code would need refactoring to be more thread-friendly, I decided to perform all the processing synchronously.
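As an aside on the lock (this) pattern used for the queue: locking on this is generally discouraged, because outside code can lock on the same object. One alternative is System.Collections.Concurrent.ConcurrentQueue<T>, which needs no explicit lock. The sketch below shows how the hand-off between the stream callback and the paint loop could look under that assumption. TweetBuffer and DrainUpTo are hypothetical names, not part of the project.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class TweetBuffer
{
    private readonly ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

    // Safe to call from the stream's callback thread.
    public void Enqueue(string text)
    {
        queue.Enqueue(text);
    }

    // Intended for the UI thread: removes at most 'max' tweets per frame, oldest first.
    public List<string> DrainUpTo(int max)
    {
        var batch = new List<string>();
        string tweet;
        while (batch.Count < max && queue.TryDequeue(out tweet))
        {
            batch.Add(tweet);
        }
        return batch;
    }

    public int Pending { get { return queue.Count; } }
}
```

The paint handler would then call DrainUpTo with its per-frame limit and pass each returned tweet to SynchronousUpdate, leaving any remaining tweets queued for the next frame.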
Updating the Counters and Tweet Buffers
The workhorse function is really SynchronousUpdate. Here, we remove any punctuation, eliminate words we don't care about, replace stale words with any new words in the tweet, and update the word hit counters. We also record up to "MaxTweets" tweets for each word so that (as I'll show next) you can see the tweet text on mouse-over. Here's the method:
protected void SynchronousUpdate(string tweet)
{
    string[] words = tweet.Split(' ');
    ++iteration;
    ReduceCounts();
    foreach (string w in words)
    {
        string word = w.StripPunctuation();
        string lcword = word.ToLower();
        TextNode node;
        // Skip tokens that were empty or pure punctuation, as well as unwanted words.
        if (lcword.Length > 0 && !EliminateWord(lcword))
        {
            if (!wordNodeMap.TryGetValue(lcword, out node))
            {
                // New word: make room if necessary, then add a node at the root's location.
                ++totalWordCount;
                PointF p = rootNode.Location;
                RemoveAStaleWord();
                TextNode n = new TextNode(word, p);
                rootNode.AddChild(n);
                wordNodeMap[lcword] = n;
                wordTweetMap[lcword] = new Queue<string>(new string[] { tweet });
            }
            else
            {
                // Known word: bump the count and remember the tweet, dropping the oldest if full.
                wordNodeMap[lcword].IncrementCount();
                Queue<string> tweets = wordTweetMap[lcword];
                if (tweets.Count > MaxTweets)
                {
                    tweets.Dequeue();
                }
                tweets.Enqueue(tweet);
            }
        }
    }
}
Mouse-Overs
Mouse-overs are handled by two events:
pnlCloud.MouseMove += OnMouseMove;
pnlCloud.MouseLeave += (sender, args) =>
{
    if (tweetForm != null)
    {
        tweetForm.Close();
        tweetForm = null;
        mouseWord = String.Empty;
    }
};
When the mouse leaves the owner-draw panel, we close the form displaying the tweets and reset everything to a "not showing tweets" state.
When the user moves the mouse over the owner-draw panel, we check whether the mouse coordinates are inside the rectangle displaying a word. There's some logic to update the existing tweet form or create a new one if one isn't displayed:
protected void OnMouseMove(object sender, MouseEventArgs args)
{
    var hits = wordNodeMap.Where(w => w.Value.Region.Contains(args.Location));
    Point windowPos = PointToScreen(args.Location);
    windowPos.Offset(50, 70);
    if (hits.Count() > 0)
    {
        string word = hits.First().Key;
        TextNode node = hits.First().Value;
        if (mouseWord == word)
        {
            tweetForm.Location = windowPos;
        }
        else
        {
            if (tweetForm == null)
            {
                tweetForm = new TweetForm();
                tweetForm.Location = windowPos;
                tweetForm.Show();
                tweetForm.TopMost = true;
            }
            tweetForm.tbTweets.Clear();
            ShowTweets(word);
            mouseWord = word;
        }
    }
    else
    {
        if (tweetForm != null)
        {
            tweetForm.Location = windowPos;
            tweetForm.TopMost = true;
        }
    }
}
The result is a popup window that moves with the mouse as the user moves around the owner-draw panel.
The Force Directed Graph
If you look at Bradley Smith's original FDG code, you'll notice that I've changed a few things. For one, I'm not drawing the force lines, only the nodes:
foreach (Node node in nodes)
{
    PointF destination = ScalePoint(node.Location, scale);
    Size nodeSize = node.Size;
    RectangleF nodeBounds = new RectangleF(center.X + destination.X - (nodeSize.Width / 2), center.Y + destination.Y - (nodeSize.Height / 2), nodeSize.Width, nodeSize.Height);
    node.DrawNode(graphics, nodeBounds);
}
The original code was also simply drawing spots, so I extended the SpotNode class to be able to draw text as well:
public class TextNode : SpotNode
{
    protected int count;

    public int Count
    {
        get { return count; }
    }

    public Rectangle Region { get; set; }
    public DateTime CreatedOn { get; set; }
    public DateTime UpdatedOn { get; set; }

    // Fonts are cached by size so they are created only once.
    public static Dictionary<int, Font> fontSizeMap = new Dictionary<int, Font>();

    protected string text;

    public TextNode(string text, PointF location)
        : base()
    {
        this.text = text;
        Location = location;
        count = 1;
        CreatedOn = DateTime.Now;
        UpdatedOn = CreatedOn;
    }

    public void IncrementCount()
    {
        ++count;
        UpdatedOn = DateTime.Now;
    }

    public void DecrementCount()
    {
        if (count > 1)
        {
            --count;
        }
    }

    public override void DrawNode(Graphics gr, RectangleF bounds)
    {
        Font font;
        // Font size grows with the hit count, capped at 36pt.
        int fontSize = Math.Min(8 + Count, 36);
        if (!fontSizeMap.TryGetValue(fontSize, out font))
        {
            font = new Font(FontFamily.GenericSansSerif, fontSize);
            fontSizeMap[fontSize] = font;
        }
        // Color shifts from blue toward red as the count approaches 24.
        int count2 = Math.Min(count, 24);
        if (count2 >= twitterWordCloud.AppForm.CountThreshold)
        {
            int blue = 255 * (24 - count2) / 24;
            int red = 255 - blue;
            Brush brush = new SolidBrush(Color.FromArgb(red, 0, blue));
            SizeF strSize = gr.MeasureString(text, font);
            PointF textCenter = PointF.Subtract(bounds.Location, new Size((int)strSize.Width / 2 - 5, (int)strSize.Height / 2 - 5));
            // Remember where the text was drawn so mouse-overs can hit-test it.
            Region = Rectangle.FromLTRB((int)textCenter.X, (int)textCenter.Y, (int)(textCenter.X + strSize.Width), (int)(textCenter.Y + strSize.Height));
            gr.DrawString(text, font, brush, textCenter);
            brush.Dispose();
        }
    }
}
This class also colorizes the text. Each node, being a unique word, keeps track of its hit count and created/updated dates.
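To make the sizing and coloring arithmetic in DrawNode easier to check, here it is pulled out into pure functions with no dependency on System.Drawing. NodeStyle, FontSizeFor and ColorFor are hypothetical names for this sketch; the formulas themselves are copied from the method above.

```csharp
using System;

public static class NodeStyle
{
    // Hit count at which the color is fully red.
    public const int MaxColorCount = 24;

    // 8pt plus the hit count, capped at 36pt.
    public static int FontSizeFor(int count)
    {
        return Math.Min(8 + count, 36);
    }

    // Computes the red and blue channels for a hit count; green is always 0.
    public static void ColorFor(int count, out int red, out int blue)
    {
        int capped = Math.Min(count, MaxColorCount);
        blue = 255 * (MaxColorCount - capped) / MaxColorCount; // integer division
        red = 255 - blue;
    }
}
```

With these formulas, a count of 1 gives a 9pt font with red = 11 and blue = 244, a count of 12 gives 20pt with red = 128 and blue = 127, and a count of 24 gives 32pt in pure red. The font size stops growing at a count of 28.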
I also removed the asynchronous behavior of the FDG that Bradley had originally implemented. Also removed was the detection for when to stop iterating -- the graph iterates forever, which is evident in the constant wiggle of the central spot. Various tweaks were also made to better support adding / removing nodes.
Conclusion
This was a fun little project to throw together using some great existing work and just writing a bit of logic around the Twitter and FDG pieces to glue them together. I found some disappointing things:
- The vast majority of the "newsy" tweets are re-tweeted, often disproportionately skewing the hit counts.
- "Original" tweets are for the most part rather boring, being simply paraphrases of other tweets.
- People really only tweet about mainstream things. You won't find people tweeting about global warming or alternative currencies.
I also found some interesting things:
- You can discover interesting linkages between subjects. For example, when watching the feed on "Obama", I saw "BigBird" and discovered that Michelle Obama was meeting with Sesame Street's BigBird. A good thing for the First Lady to be doing!
- I "read" about the oil tanker train wreck in West Virginia first through this program as a result of filtering on "oil" and saw keywords like "train" and "derailment" having significant hit counts.
- It definitely looks possible to perform sentiment analysis on tweets -- there are many single hit count words that convey sentiment: "angry", "happy", "scared", "disappointed", and so forth.
- Just because the media makes a hoopla of something doesn't mean people are tweeting about it. For example, the tweet volume on Apple getting into the electric car market was non-existent, leading me to the conclusion that most people have a "who cares?" attitude to that news event.
There are also some other fun things that could be done, such as plotting the tweets on a map, filtering tweets by geolocation, extending the filtering mechanism for "and" vs. "or" filter tracks, and so forth. This applet really just scratches the surface of what can be done.