Source Code and Running the Program
The source code for this article is hosted on GitHub: https://github.com/cliftonm/nlpvisualizer
To run the program, you will need to obtain an API key on AlchemyAPI's registration page. The free account permits you to perform 1000 queries per day. You can put the key directly into the source code or, as I have done, create the file "alchemyapikey.txt" in the bin\debug folder and copy your key into the first line of that file.
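A minimal sketch of loading the key from that file is shown below. The class and method names (ApiKeyLoader, LoadKey) are hypothetical and are not taken from the project source; the relative path assumes the program's working directory is the bin\debug folder.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: not part of the published source.
public static class ApiKeyLoader
{
    // Reads the first line of the key file and trims surrounding whitespace.
    // Throws if the file is missing or empty so the problem is obvious at startup.
    public static string LoadKey(string path = "alchemyapikey.txt")
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("API key file not found.", path);
        }

        string firstLine = File.ReadLines(path).FirstOrDefault();
        string key = firstLine == null ? "" : firstLine.Trim();

        if (key.Length == 0)
        {
            throw new InvalidOperationException("API key file is empty: " + path);
        }

        return key;
    }
}
```

Keeping the key in a file rather than in the source means it is less likely to be committed to a public repository by accident.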
Using the Program
Basic Operation
- Enter a URL in the URL textbox and click Process.
- Once the keywords are displayed, you can click on the keyword list to display sentences containing that keyword and update the selected visualizer for that keyword.
- If there are multiple sentences, double-click on a sentence in the RichTextBox to narrow the scope of the visualization down to that one sentence.
- Navigate previous and next sentences with the "Prev. Sentence" and "Next Sentence" buttons.
Visualization
- Right-click and drag to move the entire visualization surface.
- Use the mouse wheel to zoom in and out. This feature is only available in the Keyword Directed Graph visualization, and it is a bit problematic because of the awkward way the mouse wheel event depends on which control in the form has focus. Alternatively (or if you don't have a mouse wheel), left-click and drag up/right or down/left to zoom in/out.
- In the "Neighboring Sentence Keywords" visualization, double-click on a keyword to select it and navigate to it in the text and visualization.
- In the "Keyword Directed Graph" visualization, double-click on a node (blue circle) to select that keyword and navigate to it in the text and visualization.
Introduction
Natural Language Processing (NLP) aims to enable computers to derive meaning from human (natural) language input. In my article reviewing three NLP services, we saw that these services extract entities, keywords, topics, events, themes and concepts. Other than themes and concepts, the results are essentially keywords or phrases. The extracted "strings" often have an associated relevance or strength, count or frequency, and/or sentiment value. In another article, I used the features of one NLP provider, AlchemyAPI, to filter RSS feeds, enabling the user to create filters based on the extracted strings and additional values.
Meaning in Concepts
Still, I found myself rather dissatisfied with the results. My first issue is with concept extraction. When analyzing a short publication on The Threefold Social Order and Education Reform 1, AlchemyAPI's "concept" extraction is very high level:
- sociology
- education
- soul
- meaning of life
- religion
- human
- school
- life
As is Semantria's "themes":
- economic organization
- social organism
- human nature
- economic system
- bourgeois world view
However, given this sentence in the document:
"Rather, the spiritual-cultural organ of the social organism should, following the dictates of its own independent administration, bring those who are suitably gifted to a certain level of cultivation, and the state and economic life should organize themselves in accordance with the results of work in the spiritual-cultural sphere."
None of the NLP services that I reviewed earlier determines that this sentence is dealing with the concept of "Meritocracy."2
Meaning in Relationships
My second concern is that meaning is closely related to relationships between keywords or concepts. This article discusses two approaches for extracting relational meaning from keywords within a single document, creating a kind of semantic mind map or concept map. The two approaches use two different kinds of visualizations -- one is a simple "keywords in adjacent sentences" visualization, and the other is a force directed graph3 (FDG) of the relationships between keywords in the sentences in which the selected and related keywords occur. How to read the FDG will be explained in more detail later. The FDG code was originally written and posted by Bradley Smith in his July 2, 2010 blog entry4 -- I have made some minor modifications to that code to improve processing and to render text nodes.
Visualizations
There are two visualizations: adjacent sentence keywords and keyword associations. For these examples, I am using the Wikipedia page on Founding Fathers of the United States5.
Adjacent Sentence Keyword Visualization
In the sentence containing the keyword "national affairs":
The previous sentence contains the keyword "well-educated men" and the next sentence contains the keywords "American Revolution" and "Continental Army":
This visualization is surprisingly useful. While it should not be interpreted as showing a causal relationship, it can be interpreted as showing a conceptual relationship. In the above keyword relationships, for example, the three sentences together are:
- Almost all of them were well-educated men of means who were leaders in their communities.
- Many were also prominent in national affairs.
- Virtually every one had taken part in the American Revolution; at least 29 had served in the Continental Army, most of them in positions of command.
One can quickly determine that the concept here is that "these men (in this case, delegates of the Federal Convention in Philadelphia, determined by inspecting prior sentences) were well educated, prominent in national affairs, and almost all had taken part in the American Revolution or served in the Continental Army and most were in positions of command." When working with a complex document, keyword adjacency allows you to quickly create a concept from the surrounding text, which may have been missed in the overall complexity of the text.
Also note that double-clicking on a keyword in the visualization shows all the sentences containing that keyword as well as updating the visualization. For example, when double-clicking on "well-educated men", the program reveals:
Keyword Directed Graph
The second visualization is a directed graph of keyword associations. To explain this, let's start with something basic using the sentence "Many were also prominent in national affairs":
What this graph shows is that this sentence has only one keyword, which is "national affairs." Because this keyword does not appear in any other sentences, there are no further links.
Now let's look at this sentence, a little bit earlier in the text:
"As a sanctuary for Baptists, Rhode Island's absence at the Convention in part explains the absence of Baptist affiliation among those who did attend."
This sentence can also be found by clicking on the keyword "Baptist affiliation."
Here we have a more complex graph. Starting with the sentence "As a sanctuary..." we see that it has two keywords:
- Baptist affiliation
- Rhode Island
"Baptist affiliation" is not contained in any other sentences and therefore does not have any child nodes. However, "Rhode Island" is contained in one or more sentences, having two other keywords:
- delegates
- Convention delegates
The keyword "delegates" is used in one or more sentences containing keywords:
- United States Declaration
- Constitutional Convention
- large group
- Founding Fathers
- United States
These graphs can become complex, as illustrated by the starting text:
"The Founding Fathers of the United States of America were political leaders and statesmen who participated in the American Revolution by signing the United States Declaration of Independence, taking part in the American Revolutionary War, and establishing the United States Constitution."
There are two constants, not exposed at the moment in the UI, that limit the depth and breadth of the directed graph:
const int FDG_DEPTH_LIMIT = 3;
const int FDG_NODE_KEYWORD_LIMIT = 5;
The keyword association directed graph is a very interesting way of mapping out the relationship between concepts that occur within sentences. One can quickly discover additional paths for investigating concepts based on how keywords are associated with each other, which I've found helps to build a broader picture of what the text is discussing. So, for example, while adjacent keywords usually stay within a closely knit thought process, the keyword association graph allows one to explore more loosely coupled concepts around the central theme.
Double-clicking on a keyword's node (the blue circle) in the visualization shows all the sentences containing that keyword as well as updating the visualization.
Relevance Weighting
Keyword font size reflects the relevance (as determined by AlchemyAPI) of the keyword. So, for example, because the keyword "United States" has the highest relevance (0.92971), it is displayed in a large font. The relevance scale is from 0 to 1 and adjusts the font by multiplying the relevance (minus the minimum relevance) by 16 and adding that value to the base font size of 8:
font = new Font(
    FontFamily.GenericSansSerif,
    (float)(8.0 + (Program.app.keywordRelevanceMap[keyword] - Program.app.minRelevance) * FONT_WEIGHT_MULTIPLIER));
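As a worked illustration of that formula, here is a small sketch that isolates the calculation from the drawing code. The class, method, and BASE_FONT_SIZE names are hypothetical; the multiplier of 16 and base size of 8 come from the description above. The minimum relevance depends on the document being analyzed, so the 0.25 used in the note below is purely an assumed value.

```csharp
using System;

// Hypothetical helper that mirrors the font-size expression shown above.
public static class FontSizing
{
    const double BASE_FONT_SIZE = 8.0;           // base size stated in the text
    const double FONT_WEIGHT_MULTIPLIER = 16.0;  // multiplier stated in the text

    // Maps a relevance in [minRelevance, 1] to a font size in points.
    public static float SizeFor(double relevance, double minRelevance)
    {
        return (float)(BASE_FONT_SIZE + (relevance - minRelevance) * FONT_WEIGHT_MULTIPLIER);
    }
}
```

With an assumed minimum relevance of 0.25, "United States" at 0.92971 would be drawn at 8 + (0.92971 - 0.25) * 16, or roughly 18.9 points. Because relevance ranges from 0 to 1, the size can never fall below 8 or exceed 24.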
The Code
While there's nothing complex about the code, I'll discuss the basic processes here.
Document Analysis
The program analyzes web pages (as opposed to document text that you input yourself) from the URL that you enter on the main form. You may discover that you will get a "content exceeded" error message for some pages, as there is a size limit to content that AlchemyAPI processes.
The processing has three parts:
- Obtaining the scraped content using AlchemyAPI's URLGetText method.
- Obtaining the keywords from that content using AlchemyAPI's TextGetRankedKeywords method.
- Performing a keyword-sentence relationship lookup map pre-process.
protected async void Process(object sender, EventArgs args)
{
    // Disable the button so a second request cannot start while this one is running.
    btnProcess.Enabled = false;
    ClearAllGrids();
    string url = tbUrl.Text;
    sbStatus.Text = "Acquiring page content...";

    try
    {
        // Fetch the scraped page text off the UI thread.
        pageText = await Task.Run(() => GetUrlText(url));
        pageSentences = ParseOutSentences(pageText);
        sbStatus.Text = "Acquiring keywords from AlchemyAPI...";
        dsKeywords = GetKeywords(url, pageText);
        sbStatus.Text = "Processing results...";
        dvKeywords = new DataView(dsKeywords.Tables["keyword"]);

        // Build the lookup maps used by the visualizations.
        CreateSortedKeywordList(dvKeywords);
        CreateSentenceKeywordMaps(dvKeywords);
        CreateKeywordRelevanceMap(dvKeywords);
        dgvKeywords.DataSource = dvKeywords;
        lblAlchemyKeywords.Text = String.Format("Keywords: {0}", dvKeywords.Count);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message, "Processing Error", MessageBoxButtons.OK);
    }
    finally
    {
        // Runs on success and on failure, so the status and button are reset here only.
        sbStatus.Text = "Ready";
        btnProcess.Enabled = true;
    }
}
Several "mappings" are created between keywords, sentence indices, and relevance values to facilitate visualization of selected keywords:
protected void CreateSentenceKeywordMaps(DataView dvKeywords)
{
    sentenceKeywordMap.Clear();
    keywordSentenceMap.Clear();

    pageSentences.ForEachWithIndex((s, idx) =>
    {
        // Every sentence gets an entry, even if it contains no keywords.
        List<string> keywordsInSentence = new List<string>();
        sentenceKeywordMap[idx] = keywordsInSentence;
        string sl = s.ToLower();

        dvKeywords.ForEach(row =>
        {
            string keyword = row[0].ToString();

            // Case-insensitive substring match of the keyword within the sentence.
            if (sl.Contains(keyword.ToLower()))
            {
                keywordsInSentence.Add(keyword);
                List<int> sentences;

                // Reverse map: keyword -> indices of the sentences that contain it.
                if (!keywordSentenceMap.TryGetValue(keyword, out sentences))
                {
                    sentences = new List<int>();
                    keywordSentenceMap[keyword] = sentences;
                }

                sentences.AddIfUnique(idx);
            }
        });
    });
}
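The listings in this article rely on a few small extension methods (ForEachWithIndex, AddIfUnique, and, in the graph code, RemoveDuplicates) that are not shown here. The following is only a sketch of how such helpers could be written, inferred from the way they are called; the versions in the repository may differ, and the SketchExtensions class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Assumed implementations, inferred from usage; not copied from the project.
public static class SketchExtensions
{
    // Invokes the action for each item together with its zero-based position.
    public static void ForEachWithIndex<T>(this IEnumerable<T> items, Action<T, int> action)
    {
        int idx = 0;

        foreach (T item in items)
        {
            action(item, idx++);
        }
    }

    // Appends the item only if the list does not already contain it.
    public static void AddIfUnique<T>(this List<T> list, T item)
    {
        if (!list.Contains(item))
        {
            list.Add(item);
        }
    }

    // Keeps the first occurrence of each item, using the supplied equality test.
    public static IEnumerable<T> RemoveDuplicates<T>(this IEnumerable<T> items, Func<T, T, bool> same)
    {
        List<T> kept = new List<T>();

        foreach (T item in items)
        {
            if (!kept.Exists(k => same(k, item)))
            {
                kept.Add(item);
            }
        }

        return kept;
    }
}
```

RemoveDuplicates as sketched is quadratic in the number of items, which is acceptable for the handful of keywords found in a sentence.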
RichTextBox Display
When a keyword is selected, the sentences containing that keyword are displayed with that keyword highlighted.
public void ShowKeywordSelection(string keyword)
{
    textboxEventsEnabled = false;
    ShowSentences(keyword);
    textboxEventsEnabled = true;
    rtbSentences.SelectionStart = 0;
    surface.NewKeyword(keyword);
    UpdateKeywordVisualization();
}
This is accomplished by parsing the sentence for the selected keyword and building the text in the RichTextBox as each keyword occurrence is encountered:
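protected void ShowSentences(string keyword)
{
    rtbSentences.Clear();
    displayedSentenceIndices.Clear();
    string kw = keyword.ToLower();

    pageSentences.ForEachWithIndex((sentence, sidx) =>
    {
        // Search a lower-cased copy, but always display text from the original sentence.
        string s = sentence.ToLower();
        int start = 0;
        int idx = s.IndexOf(kw, start);
        bool found = idx >= 0;

        while (idx >= 0)
        {
            if (!displayedSentenceIndices.Contains(sidx))
            {
                displayedSentenceIndices.Add(sidx);
            }

            // Text between the previous match (or sentence start) and this match.
            rtbSentences.AppendText(sentence.Substring(start, idx - start));
            // The matched keyword, highlighted, in the sentence's own casing.
            rtbSentences.AppendText(sentence.Substring(idx, keyword.Length), Color.Red);
            start = idx + keyword.Length;
            idx = s.IndexOf(kw, start);
        }

        if (found)
        {
            // Append the rest of the original sentence; the earlier version appended
            // the lower-cased working copy, which lost the original capitalization.
            rtbSentences.AppendText(sentence.Substring(start));
            rtbSentences.AppendText("\n\n");
        }
    });
}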
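The matching logic can also be separated from the RichTextBox so that it is easy to test on its own. The sketch below is one possible way to do that, not the article's implementation: it splits a sentence into plain and keyword segments using a case-insensitive ordinal search, and the Segment and KeywordHighlighter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public class Segment
{
    public string Text;
    public bool IsKeyword;
}

public static class KeywordHighlighter
{
    // Splits a sentence into alternating plain-text and keyword segments.
    // Matching ignores case; segment text keeps the sentence's original casing.
    public static List<Segment> Split(string sentence, string keyword)
    {
        List<Segment> segments = new List<Segment>();

        if (string.IsNullOrEmpty(keyword))
        {
            segments.Add(new Segment { Text = sentence, IsKeyword = false });
            return segments;
        }

        int start = 0;
        int idx = sentence.IndexOf(keyword, start, StringComparison.OrdinalIgnoreCase);

        while (idx >= 0)
        {
            if (idx > start)
            {
                segments.Add(new Segment { Text = sentence.Substring(start, idx - start), IsKeyword = false });
            }

            segments.Add(new Segment { Text = sentence.Substring(idx, keyword.Length), IsKeyword = true });
            start = idx + keyword.Length;
            idx = sentence.IndexOf(keyword, start, StringComparison.OrdinalIgnoreCase);
        }

        if (start < sentence.Length)
        {
            segments.Add(new Segment { Text = sentence.Substring(start), IsKeyword = false });
        }

        return segments;
    }
}
```

A caller would then append each segment to the RichTextBox, coloring the ones where IsKeyword is true. An empty result list is never returned; a sentence without the keyword yields a single plain segment, so the caller decides whether to display it.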
Adjacent Sentence Keyword Visualization
The code that generates the adjacent sentence keyword visualization first draws the previous keywords, then the next keywords, and then the current keyword, so that the current keyword appears above the connecting lines:
protected void DrawNeighboringSentenceKeywords(Graphics gr)
{
    try
    {
        Point ctr = new Point(Size.Width / 2, Size.Height / 2);
        keywordLocationMap.Clear();
        DrawPreviousKeywords(gr, ctr);
        DrawNextKeywords(gr, ctr);
        // Drawn last so it sits on top of the connecting lines.
        DrawKeyword(gr, keyword);
    }
    catch (Exception ex)
    {
        System.Diagnostics.Debug.WriteLine(ex.Message);
    }
}
The previous and next keywords are predetermined when the user clicks on a keyword in the keyword list or filters the sentences containing that keyword to a single sentence:
protected void UpdateKeywordVisualization()
{
    List<SentenceInfo> prevKeywords = GetPreviousSentencesKeywords();
    List<SentenceInfo> nextKeywords = GetNextSentencesKeywords();
    surface.PreviousKeywords(prevKeywords);
    surface.NextKeywords(nextKeywords);

    if (directedGraph)
    {
        UpdateDirectedGraph();
    }

    surface.Invalidate(true);
}
Ultimately, given the sentence index, this is a simple lookup that builds a list of SentenceInfo instances:
protected List<SentenceInfo> GetKeywordsInSentence(int idx)
{
    List<SentenceInfo> ret = new List<SentenceInfo>();
    sentenceKeywordMap[idx].ForEach(k => ret.Add(new SentenceInfo()
        { Keyword = k, Index = idx, Relevance = keywordRelevanceMap[k] }));
    return ret;
}
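GetPreviousSentencesKeywords and GetNextSentencesKeywords are not listed in this article. As a rough sketch of the idea, and assuming that "previous" and "next" simply mean the sentences immediately before and after a given index, the lookup could look like the following. The NeighborLookup class is hypothetical, and the real methods may handle several displayed sentences at once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the neighbor lookups.
public static class NeighborLookup
{
    // Returns the keywords recorded for a sentence, or an empty list when the
    // index is out of range (before the first or after the last sentence).
    public static List<string> KeywordsAt(Dictionary<int, List<string>> sentenceKeywordMap, int idx)
    {
        List<string> keywords;
        return sentenceKeywordMap.TryGetValue(idx, out keywords) ? keywords : new List<string>();
    }

    public static List<string> Previous(Dictionary<int, List<string>> sentenceKeywordMap, int idx)
    {
        return KeywordsAt(sentenceKeywordMap, idx - 1);
    }

    public static List<string> Next(Dictionary<int, List<string>> sentenceKeywordMap, int idx)
    {
        return KeywordsAt(sentenceKeywordMap, idx + 1);
    }
}
```

Applied to the Founding Fathers example, the sentence containing "national affairs" would have "well-educated men" as its previous keyword and "American Revolution" and "Continental Army" as its next keywords.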
If the selected keyword does not appear in the current sentence, the visualization will render the center with empty brackets "[ ]":
Keyword Directed Graph
As discussed earlier, this is a recursive search of keywords as determined by their associative occurrences in sentences. The algorithm is limited in depth and breadth by two constants:
const int FDG_DEPTH_LIMIT = 3;
const int FDG_NODE_KEYWORD_LIMIT = 5;
Also, duplicate keywords are omitted during the traversal. The algorithm begins with keywords in the current sentence and recurses, for each keyword, to other sentences containing that keyword. In those sentences, the associated keywords determine the next level of recursion:
protected void UpdateDirectedGraph()
{
    mDiagram.Clear();
    parsedKeywords.Clear();

    // The root node is labeled with the first three words of the current sentence.
    string ctrSentence = FirstThreeWords(pageSentences[displayedSentenceIndices[0]]);
    Node node = new TextNode(surface, ctrSentence);
    ((TextNode)node).Brush = surface.greenBrush;
    mDiagram.AddNode(node);

    // First level: the distinct keywords of the displayed sentence(s).
    List<SentenceInfo> keywords = GetSentencesKeywords();
    keywords = keywords.RemoveDuplicates((si1, si2) => si1.Keyword.ToLower() == si2.Keyword.ToLower()).ToList();
    parsedKeywords.AddRange(keywords.Select(si => si.Keyword.ToLower()));
    AddKeywordsToGraphNode(node, keywords, 0);
    mDiagram.Arrange();
}

protected void AddKeywordsToGraphNode(Node node, List<SentenceInfo> keywords, int depth)
{
    if (depth < FDG_DEPTH_LIMIT)
    {
        int idx = 0;

        foreach (SentenceInfo si in keywords)
        {
            // Breadth limit: at most FDG_NODE_KEYWORD_LIMIT children per node.
            if (idx++ < FDG_NODE_KEYWORD_LIMIT)
            {
                Node child = new TextNode(surface, si.Keyword);
                node.AddChild(child);

                // Collect the not-yet-seen keywords from every sentence containing this keyword.
                List<int> containingSentences = keywordSentenceMap[si.Keyword];
                List<SentenceInfo> relatedKeywords = new List<SentenceInfo>();

                containingSentences.ForEach(cs =>
                {
                    List<SentenceInfo> si3 = GetKeywordsInSentence(cs).Where(sik => !parsedKeywords.Contains(sik.Keyword.ToLower())).ToList();
                    si3 = si3.RemoveDuplicates((si1, si2) => si1.Keyword.ToLower() == si2.Keyword.ToLower()).ToList();
                    relatedKeywords.AddRange(si3);
                    // Mark them as seen so each keyword appears only once in the graph.
                    parsedKeywords.AddRange(si3.Select(sik => sik.Keyword.ToLower()));
                });

                if (relatedKeywords.Count > 0)
                {
                    AddKeywordsToGraphNode(child, relatedKeywords, depth + 1);
                }
            }
            else
            {
                break;
            }
        }
    }
}
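To make the traversal easier to follow outside the UI, here is a self-contained sketch of the same idea that works on plain dictionaries and produces a simple tree rather than diagram nodes. The KeywordNode and KeywordGraphSketch names are hypothetical, and the two maps are assumed to have the same shape as sentenceKeywordMap and keywordSentenceMap above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tree node standing in for the diagram's TextNode.
public class KeywordNode
{
    public string Keyword;
    public List<KeywordNode> Children = new List<KeywordNode>();
}

public static class KeywordGraphSketch
{
    const int DEPTH_LIMIT = 3;         // same value as FDG_DEPTH_LIMIT
    const int NODE_KEYWORD_LIMIT = 5;  // same value as FDG_NODE_KEYWORD_LIMIT

    public static KeywordNode Build(
        int startSentence,
        Dictionary<int, List<string>> sentenceKeywords,
        Dictionary<string, List<int>> keywordSentences)
    {
        KeywordNode root = new KeywordNode { Keyword = "(sentence " + startSentence + ")" };
        // Case-insensitive set of keywords already placed in the graph.
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> first = new List<string>();

        foreach (string k in sentenceKeywords[startSentence])
        {
            if (seen.Add(k)) first.Add(k);
        }

        AddChildren(root, first, 0, sentenceKeywords, keywordSentences, seen);
        return root;
    }

    static void AddChildren(
        KeywordNode parent, List<string> keywords, int depth,
        Dictionary<int, List<string>> sentenceKeywords,
        Dictionary<string, List<int>> keywordSentences,
        HashSet<string> seen)
    {
        if (depth >= DEPTH_LIMIT) return;

        foreach (string keyword in keywords.Take(NODE_KEYWORD_LIMIT))
        {
            KeywordNode child = new KeywordNode { Keyword = keyword };
            parent.Children.Add(child);
            List<string> related = new List<string>();

            // Unseen keywords from every sentence that contains this keyword.
            foreach (int s in keywordSentences[keyword])
            {
                foreach (string k in sentenceKeywords[s])
                {
                    if (seen.Add(k)) related.Add(k);
                }
            }

            if (related.Count > 0)
            {
                AddChildren(child, related, depth + 1, sentenceKeywords, keywordSentences, seen);
            }
        }
    }
}
```

As in the listing above, a keyword is marked as seen when it is first collected, so it is attached to whichever node discovers it first, even when another path to it exists. That keeps the result a tree, at the cost of hiding some associations.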
I refer you to Brad Smith's excellent blog4 on force directed graphs for further reading on the algorithm that generates the graph.
Going Deeper
As a research tool, it is also useful to create relationships between documents. This requires building a database of documents and extracted keywords/concepts so that a program such as the one presented here can correlate keywords/concepts between documents, enabling the user to investigate a concept beyond the scope of a single document. I may add this capability at some point.
Conclusion
In practice, I find that this program is a very effective tool for focusing on specific points in an article or blog. Navigating a document one sentence at a time is useful in and of itself because it helps reduce the clutter of the entire document. The adjacent sentence keyword visualization helps in exploring related keywords within the same "thought", facilitating the quick construction of a primary concept. Using the keyword association directed graph, the primary concept can be expanded to include other peripheral concepts. It is quite enjoyable and instructive to work with a document in this way.
References
1. http://wn.rsarchive.org/Books/GA024/English/AP1985/GA024_c04.html
2. http://en.wikipedia.org/wiki/Meritocracy
3. http://en.wikipedia.org/wiki/Force-directed_graph_drawing
4. http://www.brad-smith.info/blog/archives/129
5. http://en.wikipedia.org/wiki/Founding_Fathers_of_the_United_States