To Run This Application...
- You will need to register with AlchemyAPI to obtain an API key and this key must be placed in the file "alchemyapikey.txt" in the bin\Debug (or bin\Release) folder.
- Download the code from https://github.com/cliftonm/HOPE
- Check out the branch "semantic-feed-reader". Bug fixes related to this article will be made on this branch.
- When you launch HOPE, load the applet called "NewFeedReaderTabbed"
- The various display forms may disappear behind the HOPE application main window -- move/resize the main window to get it out of the way. Also, the forms initially display on top of each other. Arrange them as you wish and then save the applet -- the sizes and positions of the main window and display forms are persisted.
- If you're interested in APIs for other languages -- say, C++, Android, Java, Ruby, etc. -- visit the AlchemyAPI website.
Introduction
To state the obvious, there is a vast amount of information "in the cloud", and it grows every millisecond. Some of it is rather static, like a Wikipedia page, news article, or blog, and some of it is very dynamic, like stock tickers, weather, and tweets. Again stating the obvious, from a usability standpoint there is no integrated means of digesting that information so that only things meaningful to the user are presented -- or where such means exist, they are limited to "here, Google, filter my news items by these categories." But if, for example, I want to be alerted when someone blogs about Visual Studio 14 (or whatever version of VS is in CTP when you read this article), well, good luck with that.
We can look at the Semantic Web:
By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents into a "web of data". (http://en.wikipedia.org/wiki/Semantic_Web)
But adoption of the Semantic Web has been glacially slow, and it probably will not deliver enough semantic information about content to be actually useful.
That leaves us with "Natural Language Processing", or NLP:
"enabling computers to derive meaning from human or natural language input" (http://en.wikipedia.org/wiki/Natural_language_processing).
Using NLP, we can extract the actual semantic meaning of content. What this article explores is integrating one NLP service (AlchemyAPI) with web page scraping (a feature of AlchemyAPI) to extract and persist that semantic meaning. Given this basic set of functionality, many features (such as tracking / reporting on trends) can then be further developed from the semantic meaning once it has been derived from content; these may be explored in future articles. Specifically, what will be presented here is:
- Using the SyndicationFeed class to acquire feed items
- Extracting the semantic meaning using AlchemyAPI's NLP
- Persisting feed item links and each item's associated meaning
- Providing a simple UI presentation for exploring feed items and their associated semantics
RSS is a specific niche tool, and one needs to be able to use other tools for non-feed content. Rather than develop a monolithic application glued to RSS, I'm also going to demonstrate a D-Tier approach (distributed, dimensional, dynamic) to building this application, using the Higher Order Programming Environment (HOPE) as the platform of choice. This can be leveraged to include other means of acquiring content, other NLP processors (such as OpenCalais or Semantria), and components for working with the semantic meaning in other unique and interesting ways. If you are unfamiliar with my previous articles on HOPE, please read the introductory article. Throughout, I will be interweaving discussion of HOPE development with the primary topic of this article.
AlchemyAPI
AlchemyAPI is one of several NLP services. For my particular purposes, it is attractive for the following reasons:
- Fast response -- of the services I've looked at, they have the fastest response times
- A free option -- NLP providers can be expensive! While OpenCalais is free, AlchemyAPI provides a richer analysis and is free for 1000 transactions per day.
- Built-in web page scraper -- I certainly don't want to write one, so this feature is crucial, and AlchemyAPI's web page scraper looks quite good. Some of the other services tie their scraping in with expensive options: OpenCalais is associated with SemanticProxy, but the demo faults (out of memory) and I have not tried the programmatic interface.
- Painless API -- the .NET API provided by AlchemyAPI is painless to use, and the XML format can be directly parsed into a .NET DataSet object. In my review of Semantria and OpenCalais, this was definitely not the case -- I encountered bugs in the .NET OpenCalais API, and the complexity of Semantria's API was frustrating, though Semantria has been very helpful in guiding me through the issues. I will be posting a complete review of all three of these NLP services in a separate article.
Based on the aforementioned criteria, the choice was rather clear.
Why Higher Order Programming Environment?
Why am I writing this in the HOPE framework? Several reasons:
- I want to continue promoting and extending the capabilities of this framework
- I want to avoid a monolithic application. NLP can be applied to many things beyond RSS feeds and I want a platform that allows me to plug and play, and I mean really "play" with different configurations, NLP providers, etc., for extracting semantic meaning. HOPE is designed for precisely this kind of Lego-building.
- Visualizing NLP results is uncharted territory. While I only use boring data table lists here, there is a rich field of visualization to explore with regard to NLP results. Again, HOPE is an excellent framework for plugging in different visualizers and playing with them.
- In my opinion, writing synchronous, single-threaded monolithic applications is a dead end, and HOPE represents a very interesting alternative for creating distributed, dynamic, and dimensional applications that promote non-deterministic UIs and behaviors: it is the user, not the developer, who determines the behavior and visualization of the applet.
- It's fun, and it's easy.
Still interested? Then let's begin with feed readers, move on to visualizers, and then parse feed content with NLP.
The Feed Reader Receptor
(a receptor with feed items ready to be processed.)
In HOPE, behaviors are written as autonomous receptors. We can start with a very simple receptor that acquires the feed items and emits them.
The RSSFeedItem Semantic Structure
We need to define the protocol for a feed item, which is done in XML:
<SemanticTypeStruct DeclType="RSSFeedItem">
<Attributes>
<NativeType Name="FeedName" ImplementingType="string"/>
<NativeType Name="Title" ImplementingType="string"/>
<SemanticElement Name="URL"/>
<NativeType Name="Description" ImplementingType="string"/>
<NativeType Name="Authors" ImplementingType="string"/>
<NativeType Name="Categories" ImplementingType="string"/>
<NativeType Name="PubDate" ImplementingType="DateTime"/>
</Attributes>
</SemanticTypeStruct>
If you're new to HOPE, one of the foundational concepts is that all data is itself semantic. This has pros and cons in this first cut and is always an interesting decision point: should the types always be semantic elements, or can they be native types? I'll leave that question for another discussion.
Receptor Implementation
Three things are of interest to note here:
- There is a configuration UI so the user can specify the feed name and URL.
- Note how user-configurable properties are decorated with the UserConfigurableProperty attribute, so the serializer knows what to persist when the applet is saved / loaded.
- The feed is loaded asynchronously, and when the task completes, the feed items are emitted.
public class FeedReader : BaseReceptor
{
public override string Name { get { return "Feed Reader"; } }
public override bool IsEdgeReceptor { get { return true; } }
public override string ConfigurationUI { get { return "FeedReaderConfig.xml"; } }
[UserConfigurableProperty("Feed URL:")]
public string FeedUrl { get; set; }
[UserConfigurableProperty("Feed Name:")]
public string FeedName {get;set;}
protected SyndicationFeed feed;
public FeedReader(IReceptorSystem rsys)
: base(rsys)
{
AddEmitProtocol("RSSFeedItem");
}
public override void EndSystemInit()
{
base.EndSystemInit();
AcquireFeed();
}
public override void UserConfigurationUpdated()
{
base.UserConfigurationUpdated();
AcquireFeed();
}
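// Fetch the feed asynchronously; when the task completes, one RSSFeedItem carrier is emitted per item.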
protected async void AcquireFeed()
{
if (!String.IsNullOrEmpty(FeedUrl))
{
try
{
feed = await GetFeedAsync(FeedUrl);
EmitFeedItems(feed);
}
catch (Exception ex)
{
EmitException("Feed Reader Receptor", ex);
}
}
}
protected async Task<SyndicationFeed> GetFeedAsync(string feedUrl)
{
SyndicationFeed feed = await Task.Run(() =>
{
XmlReader xr = XmlReader.Create(feedUrl);
SyndicationFeed sfeed = SyndicationFeed.Load(xr);
xr.Close();
return sfeed;
});
return feed;
}
protected void EmitFeedItems(SyndicationFeed feed)
{
feed.Items.ForEach(item =>
{
CreateCarrier("RSSFeedItem", signal =>
{
signal.FeedName = FeedName;
signal.Title = item.Title.Text;
signal.URL.Value = item.Links[0].Uri.ToString();
signal.Description = item.Summary.Text;
signal.Authors = String.Join(", ", item.Authors.Select(a => a.Name).ToArray());
signal.Categories = String.Join(", ", item.Categories.Select(c => c.Name).ToArray());
signal.PubDate = item.PublishDate.LocalDateTime;
});
});
}
}
Feed Reader User Configuration
A very simple UI is used to configure the feed (note that this configuration is persisted when the HOPE applet is saved). Because the UI is defined in XML, it can easily be customized for other appearances -- this customizability is a particular strength of HOPE. The parser used is a derivative of MycroXaml, a parser I wrote about ten years ago.
The salient point here is the explicit binding of control properties to the receptor instance's properties.
="1.0" ="utf-8"
<MycroXaml Name="Form"
xmlns:wf="System.Windows.Forms, System.Windows.Forms, Version=1.0.5000.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
xmlns:r="Clifton.Receptor, Clifton.Receptor"
xmlns:def="def"
xmlns:ref="ref">
<wf:Form Text="Feed Reader Configuration" Size="480, 190" StartPosition="CenterScreen" ShowInTaskbar="false" MinimizeBox="false" MaximizeBox="false">
<wf:Controls>
<wf:Label Text="Feed Name:" Location="20, 23" Size="70, 15"/>
<wf:TextBox def:Name="tbFeedName" Location="92, 20" Size="150, 20"/>
<wf:Label Text="Feed URL:" Location="20, 48" Size="70, 15"/>
<wf:TextBox def:Name="tbFeedUrl" Location="92, 45" Size="250, 20"/>
<wf:CheckBox def:Name="ckEnabled" Text="Enabled?" Location="20, 120" Size="80, 25"/>
<wf:Button Text="Save" Location="360, 10" Size="80, 25" Click="OnReceptorConfigOK"/>
<wf:Button Text="Cancel" Location="360, 40" Size="80, 25" Click="OnReceptorConfigCancel"/>
</wf:Controls>
<r:PropertyControlMap def:Name="ControlMap">
<r:Entries>
<r:PropertyControlEntry PropertyName="FeedUrl" ControlName="tbFeedUrl" ControlPropertyName="Text"/>
<r:PropertyControlEntry PropertyName="FeedName" ControlName="tbFeedName" ControlPropertyName="Text"/>
</r:Entries>
</r:PropertyControlMap>
</wf:Form>
</MycroXaml>
Receptor and Carrier
Once the asynchronous function returns, we note that there are several carriers (one for each item in the feed) waiting to be processed. We can inspect their signals by hovering the mouse over one of the carriers (the yellow triangles), which displays the signal in the property grid:
The Feed Item Viewer
Next, we need a way to view feeds. Rather than write a feed-specific viewer, I'm instead going to implement a general-purpose "carrier viewer" that displays carrier signals in a DataGridView control. As a general-purpose receptor, this will be useful for other applications as well. The only thing we'll need to configure is the protocol (the semantic structure) that the viewer should listen for.
Configuring the Feed Item Viewer
As with the feed reader, we have a small XML file that lets us specify the protocol we want to monitor (a sketch follows below). In our case, it's "RSSFeedItem."
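Though the configuration file isn't shown in the article, it presumably follows the same MycroXaml pattern as FeedReaderConfig.xml above. Here is a minimal sketch; the form layout and control names are my assumptions, but the PropertyControlEntry binding to the receptor's ProtocolName property is the essential part:

<MycroXaml Name="Form"
xmlns:wf="System.Windows.Forms, System.Windows.Forms, Version=1.0.5000.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
xmlns:r="Clifton.Receptor, Clifton.Receptor"
xmlns:def="def"
xmlns:ref="ref">
<wf:Form Text="Carrier List Viewer Configuration" Size="400, 130" StartPosition="CenterScreen" ShowInTaskbar="false" MinimizeBox="false" MaximizeBox="false">
<wf:Controls>
<wf:Label Text="Protocol Name:" Location="20, 23" Size="85, 15"/>
<wf:TextBox def:Name="tbProtocolName" Location="107, 20" Size="150, 20"/>
<wf:Button Text="Save" Location="280, 10" Size="80, 25" Click="OnReceptorConfigOK"/>
<wf:Button Text="Cancel" Location="280, 40" Size="80, 25" Click="OnReceptorConfigCancel"/>
</wf:Controls>
<r:PropertyControlMap def:Name="ControlMap">
<r:Entries>
<r:PropertyControlEntry PropertyName="ProtocolName" ControlName="tbProtocolName" ControlPropertyName="Text"/>
</r:Entries>
</r:PropertyControlMap>
</wf:Form>
</MycroXaml>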
The Code
The code is again quite simple, with the addition of removing the old protocol if the user changes it.
public class CarrierListViewer : BaseReceptor
{
public override string Name { get { return "Carrier List Viewer"; } }
public override bool IsEdgeReceptor { get { return true; } }
public override string ConfigurationUI { get { return "CarrierListViewerConfig.xml"; } }
[UserConfigurableProperty("Protocol Name:")]
public string ProtocolName { get; set; }
protected string oldProtocol;
protected DataView dvSignals;
protected DataGridView dgvSignals;
protected Form form;
public CarrierListViewer(IReceptorSystem rsys)
: base(rsys)
{
}
public override void Initialize()
{
base.Initialize();
InitializeUI();
}
public override void EndSystemInit()
{
base.EndSystemInit();
CreateViewerTable();
ListenForProtocol();
}
protected void InitializeUI()
{
MycroParser mp = new MycroParser();
form = mp.Load<Form>("CarrierListViewer.xml", this);
dgvSignals = (DataGridView)mp.ObjectCollection["dgvRecords"];
form.Show();
}
public override void UserConfigurationUpdated()
{
base.UserConfigurationUpdated();
CreateViewerTable();
ListenForProtocol();
}
protected void CreateViewerTable()
{
if (!String.IsNullOrEmpty(ProtocolName))
{
DataTable dt = new DataTable();
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
st.AllTypes.ForEach(t =>
{
dt.Columns.Add(new DataColumn(t.Name));
});
dvSignals = new DataView(dt);
dgvSignals.DataSource = dvSignals;
}
}
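// Stop listening for the previously configured protocol (if any) before subscribing to the new one.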
protected void ListenForProtocol()
{
if (!String.IsNullOrEmpty(oldProtocol))
{
RemoveReceiveProtocol(oldProtocol);
}
oldProtocol = ProtocolName;
AddReceiveProtocol(ProtocolName, (Action<dynamic>)((signal) => ShowSignal(signal)));
}
protected void ShowSignal(dynamic signal)
{
try
{
DataTable dt = dvSignals.Table;
DataRow row = dt.NewRow();
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
st.AllTypes.ForEach(t =>
{
object val = t.GetValue(rsys.SemanticTypeSystem, signal);
row[t.Name] = val;
});
dt.Rows.Add(row);
}
catch (Exception ex)
{
EmitException("Carrier List Viewer Receptor", ex);
}
}
}
Displaying Feed Items
We can now drop the Carrier List Viewer onto the surface, double-click on it to configure the protocol, and we immediately note that it is now wired up as a receiver of what the Feed Reader receptor emits:
A small XML file declares the UI (again, easily configured to some other presentation or third party control):
<MycroXaml Name="Form"
xmlns:wf="System.Windows.Forms, System.Windows.Forms, Version=1.0.5000.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
xmlns:def="def"
xmlns:ref="ref">
<wf:Form Text="List Viewer" Size="500, 300" StartPosition="CenterScreen" ShowInTaskbar="false" MinimizeBox="false" MaximizeBox="false">
<wf:Controls>
<wf:DataGridView def:Name="dgvRecords" Dock="Fill"
AllowUserToAddRows="false"
AllowUserToDeleteRows="false"
ReadOnly="true"
SelectionMode="FullRowSelect"
RowHeadersVisible="False"/>
</wf:Controls>
</wf:Form>
</MycroXaml>
And here's a result from the Code Project article feed:
Configuring Feed Readers (Introducing Membranes)
Let's pause here for a bit and see what we can do with HOPE now. For example, we can create multiple feed readers, all feeding into one list viewer:
And here's a sample listing:
But let's say you want a list just for Code Project. We can do that with a new feature of HOPE called "membranes." While I'm not going to go into the full details of membranes yet, you can read up on the idea under Membrane Computing. An overview of the idea is this: carriers (the protocols and their signals) are contained within a membrane and can only permeate the membrane (moving in or moving out) if the membrane has been configured to be permeable to that protocol. So, we can use membranes for "islands of computation:"
Resulting in separate feed item lists:
Working With Semantic Types
Another thing we can add to the viewer is the ability to emit semantic types when the user double-clicks on a line. Remember that when we defined the RSSFeedItem semantic type, the URL was itself a semantic type:
<SemanticElement Name="URL"/>
We can look for all semantic type attributes and emit them, letting some other receptor do something with them. We inspect the protocol the viewer listens to for semantic elements and add them to the emitter list:
RemoveEmitProtocols();
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
st.SemanticElements.ForEach(se => AddEmitProtocol(se.Name));
and, when the user double-clicks, the receptor iterates through the semantic elements of the protocol it is displaying and issues carriers whose signals carry the values of those semantic elements:
protected void OnCellContentDoubleClick(object sender, DataGridViewCellEventArgs e)
{
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
st.SemanticElements.ForEach(se =>
{
CreateCarrier(se.Name, signal => se.SetValue(rsys.SemanticTypeSystem, signal, dvSignals[e.RowIndex][se.Name].ToString()));
});
}
In the APOD web scraper article, I had created a simple receptor that listens for the semantic type "URL" and launches the browser with that URL, so we can re-use that receptor here:
Notice how we need only one URL receptor. Each membrane is made permeable to the URL protocol:
This allows the URL protocol to permeate out of the membrane, thus connecting the carrier list viewer (which at runtime configured itself as emitting the URL protocol) to the URL receptor. Now we have two separate feed item lists and a way to go to the feed item in the browser by double-clicking on an item in either list.
Protocol Semantic Sub-Elements
A new feature in HOPE is the ability to create carriers for the semantic elements of a parent carrier. For example, because the protocol RSSFeedItem contains the semantic element "URL", when the "RSSFeedItem" signal is emitted, a second carrier for the semantic element "URL" is created as well. When this behavior of HOPE is enabled, you can immediately see the effects in our current feed reader applet:
Notice the additional pathways from the Feed Reader receptors directly to the URL receptor. This feature is experimental, but it is definitely useful and quite interesting for exploring the behavior of carrier protocol-signals. Indeed, as implemented in the above configuration, it has the interesting effect of opening every feed item's page in the browser. That is not what we want, so instead we'll create a child membrane around just the feed readers to prevent the URL from permeating the membrane and being received by the URL Receptor:
For each membrane around a Feed Reader Receptor, we configure it so that only the RSSFeedItem protocol permeates the membrane.
This gives us the desired behavior -- only the "URL" protocol emitted by the Carrier List Viewer is received by the URL Receptor.
Applying Natural Language Processing to the Feed Items
The feature of creating carriers for semantic elements within a protocol can, however, be taken advantage of by the NLP receptor, for which we definitely do want processing of each feed item's URL. As mentioned earlier, I'm using AlchemyAPI as the NLP service. Notice that I combined the two feed readers on the right into a single child membrane, and how the Alchemy Receptor is now associated with the Feed Reader receptors because the Alchemy Receptor listens for "URL" protocols:
Note that, because we've configured the feed readers into two separate "systems", it is not possible to have only a single Alchemy Receptor -- that would require allowing the URL protocol to permeate the feed reader membrane, which would lead us back to the issue described earlier. However, is this really an issue? Not necessarily, especially if you consider the advantages of a distributed system and of leveraging asynchronous behaviors. Furthermore, if the multiple instances are actually a problem, at some point the HOPE framework may allow you to specify logical receptors, which would be backed by a single instance (or more) in the underlying implementation.
The Alchemy API Receptor Code
AlchemyAPI provides three categories of results from the NLP in its more-or-less default configuration: Entities, Keywords, and Concepts, each having unique attributes, as illustrated in this screenshot from the article comparing three NLP services:
Salient points:
- AlchemyAPI allows us to directly pass in the URL, as it has a built in content scraper. This saves us a lot of effort in either extracting the content ourselves (a daunting task) or using a third party service.
- To acquire the entities, keywords, and concepts, we have to make three separate calls. Note how I'm increasing the limit on the number of entries returned (the default is 50) to the maximum, 250.
- Note that I have a "TEST" conditional compilation symbol, as I don't want to hit AlchemyAPI while testing the applet as a whole, nor do I want to wait the 4 or 5 seconds it takes AlchemyAPI to return the data. The test datasets were previously acquired and serialized.
- AlchemyAPI returns a very nicely formatted XML document that can be read directly into a .NET DataSet. I'm ignoring some of the information in that DataSet, which you may wish to explore.
Here's the complete code for Alchemy Receptor:
public class Alchemy : BaseReceptor
{
public override string Name { get { return "Alchemy"; } }
public override bool IsEdgeReceptor { get { return true; } }
protected AlchemyAPI.AlchemyAPI alchemyObj;
public Alchemy(IReceptorSystem rsys)
: base(rsys)
{
AddEmitProtocol("AlchemyEntity");
AddEmitProtocol("AlchemyKeyword");
AddEmitProtocol("AlchemyConcept");
AddReceiveProtocol("URL",
(Action<dynamic>)(signal => ParseUrl(signal)));
}
public override void Initialize()
{
base.Initialize();
InitializeAlchemy();
}
protected void InitializeAlchemy()
{
alchemyObj = new AlchemyAPI.AlchemyAPI();
alchemyObj.LoadAPIKey("alchemyapikey.txt");
}
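// Await the three Alchemy queries in turn on worker threads, then emit each result table as carriers.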
protected async void ParseUrl(dynamic signal)
{
string url = signal.Value;
DataSet dsEntities = await Task.Run(() => { return GetEntities(url); });
DataSet dsKeywords = await Task.Run(() => { return GetKeywords(url); });
DataSet dsConcepts = await Task.Run(() => { return GetConcepts(url); });
dsEntities.Tables["entity"].IfNotNull(t => Emit("AlchemyEntity", t));
dsKeywords.Tables["keyword"].IfNotNull(t => Emit("AlchemyKeyword", t));
dsConcepts.Tables["concept"].IfNotNull(t => Emit("AlchemyConcept", t));
}
protected void Emit(string protocol, DataTable data)
{
data.ForEach(row =>
{
CreateCarrierIfReceiver(protocol, signal =>
{
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(protocol);
st.AllTypes.ForEach(se =>
{
object val = row[se.Name];
if (val != null && val != DBNull.Value)
{
se.SetValue(rsys.SemanticTypeSystem, signal, val);
}
});
});
});
}
protected DataSet GetEntities(string url)
{
DataSet dsEntities = new DataSet();
#if TEST
dsEntities.ReadXml("alchemyEntityTestResponse.xml");
#else
try
{
AlchemyAPI_EntityParams eparams = new AlchemyAPI_EntityParams();
eparams.setMaxRetrieve(250);
string xml = alchemyObj.URLGetRankedNamedEntities(url, eparams);
TextReader tr = new StringReader(xml);
XmlReader xr = XmlReader.Create(tr);
dsEntities.ReadXml(xr);
xr.Close();
tr.Close();
}
catch(Exception ex)
{
EmitException("Alchemy Receptor", ex);
}
#endif
return dsEntities;
}
protected DataSet GetKeywords(string url)
{
DataSet dsKeywords = new DataSet();
#if TEST
dsKeywords.ReadXml("alchemyKeywordsTestResponse.xml");
#else
try
{
AlchemyAPI_KeywordParams eparams = new AlchemyAPI_KeywordParams();
eparams.setMaxRetrieve(250);
string xml = alchemyObj.URLGetRankedKeywords(url, eparams);
TextReader tr = new StringReader(xml);
XmlReader xr = XmlReader.Create(tr);
dsKeywords.ReadXml(xr);
xr.Close();
tr.Close();
}
catch(Exception ex)
{
EmitException("Alchemy Receptor", ex);
}
#endif
return dsKeywords;
}
protected DataSet GetConcepts(string url)
{
DataSet dsConcepts = new DataSet();
#if TEST
dsConcepts.ReadXml("alchemyConceptsTestResponse.xml");
#else
try
{
AlchemyAPI_ConceptParams eparams = new AlchemyAPI_ConceptParams();
eparams.setMaxRetrieve(250);
string xml = alchemyObj.URLGetRankedConcepts(url, eparams);
TextReader tr = new StringReader(xml);
XmlReader xr = XmlReader.Create(tr);
dsConcepts.ReadXml(xr);
xr.Close();
tr.Close();
}
catch(Exception ex)
{
EmitException("Alchemy Receptor", ex);
}
#endif
return dsConcepts;
}
}
To display the results, we'll drop in Carrier List Viewer Receptors that list the NLP results from all feeds:
To accomplish this, we need to allow the AlchemyEntity, AlchemyKeyword, and AlchemyConcept protocols to permeate the membranes:
When we do this for both membranes surrounding the Alchemy Receptor, the visualizer shows us that the Alchemy receptor is emitting protocols that the Carrier List Viewer receptors are interested in. Each Carrier List Viewer receptor at the bottom of the screenshot has been configured to receive the respective protocol.
Of course, we don't necessarily need to see all three types (entities, keywords, concepts) - this all depends on how you'd like to configure the applet. You'll note above that I'm using three separate list viewers, one for each category of analysis. Later on I'll be using a tabbed list viewer to manage all this information.
AlchemyAPI
This section specifically discusses the AlchemyAPI service. Not everything that AlchemyAPI provides is discussed here -- just the most common features. Specifically, "sentiment" and "relationships" are not covered, but you can read more about those on the AlchemyAPI website.
Given a document or URL, you can extract the semantic meaning into three categories: Entities, Keywords, and Concepts.
Entities
AlchemyAPI returns the following information for each entity (a sketch of the raw XML shape follows this list):
text: this is the entity name (or, more specifically, the noun)
type: AlchemyAPI attempts to determine the entity type, which includes such labels as City, Company, Continent, Country, Crime, Degree, Facility, Field Terminology, Geographic Feature, Holiday, Job Title, Person, Operating System, Organization, PrintMedia, Product, Region, Sport, StateOrCounty, and Technology. The complete list can be found here.
count: This is a count of the occurrences of the entity. This count (common to all NLP services I've reviewed) utilizes a coreference feature called "anaphora resolution": "In the sentence Sally arrived, but nobody saw her, the pronoun her is anaphoric, referring back to Sally." (from Wikipedia)
relevance: A relevance score from 0.0 to 1.0, where 1.0 is the most relevant. According to Steve Herschleb, API Evangelist at AlchemyAPI: "The relevance score for each keyword ranks the general importance of each extracted keyword. How the score is actually calculate[d] involves some pretty complex statistics, but the algorithm includes things like the word's position within the text, the other words around it, how many times it's used, etc." (source: Quora)
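For illustration, a single entity in the raw XML response looks roughly like this. The element layout is a sketch based on the fields listed above (and on the fact that the Alchemy Receptor reads the response into the "entity" table of a DataSet); consult the AlchemyAPI documentation for the authoritative schema:

<entity>
  <type>Person</type>
  <relevance>0.856724</relevance>
  <count>7</count>
  <text>Sally</text>
</entity>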
Keywords
Keywords consist of the keyword text and relevance. "Keywords are the important topics in your content and can be used to index data, generate tag clouds or for searching. AlchemyAPI's keyword extraction API is capable of finding keywords in text and ranking them. The sentiment can then be determined for each extracted keyword." (source) Note that I do not demonstrate sentiment in this applet -- performing sentiment analysis is a separate call that counts as a "transaction."
Concepts
Concepts are an interesting feature of AlchemyAPI: "AlchemyAPI employs sophisticated text analysis techniques to concept tag documents in a manner similar to how humans would identify concepts. The concept tagging API is capable of making high-level abstractions by understanding how concepts relate, and can identify concepts that aren't necessarily directly referenced in the text. For example, if an article mentions CERN and the Higgs boson, it will tag Large Hadron Collider as a concept even if the term is not mentioned explicitly in the page. By using concept tagging you can perform higher level analysis of your content than just basic keyword identification." (source)
One of the interesting things about AlchemyAPI's concepts is its data linking. You can read more about Linked Data here. From the above screenshot, you can see that there are three linked data results from DBpedia, Freebase, and opencyc. Depending on the content, AlchemyAPI will link to several other knowledge bases as well.
AlchemyAPI Exceptions
The exception handling in AlchemyAPI is rather poor -- it does not actually report the error that the server produced, even though that error is definitely part of the resulting XML.
A simple modification provides a much more meaningful result (in AlchemyAPI.cs, starting on line 955):
if (status.InnerText != "OK")
{
string errorMessage = "Error making API call.";
try
{
XmlNode statusInfo = root.SelectSingleNode("/results/statusInfo");
errorMessage = statusInfo.InnerText;
}
catch
{
}
System.ApplicationException ex = new System.ApplicationException(errorMessage);
throw ex;
}
Happily, this fix will soon be incorporated into the API provided by AlchemyAPI.
Caching Content
Ideally, we don't want to repeatedly scrape the same pages, so for the moment (because I don't want to add the whole persistence piece in this article) I've added a simple caching mechanism to avoid exceeding one's daily limit of 1000 transactions:
protected bool Cached(string prefix, string url, ref DataSet ds)
{
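// The cache file name is keyed on the URL's hash code. String.GetHashCode is not
// guaranteed stable across .NET versions, which is acceptable for this temporary cache.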
string urlHash = url.GetHashCode().ToString();
string fn = prefix + "-" + urlHash + ".xml";
bool cached = File.Exists(fn);
if (cached)
{
ds.ReadXml(fn);
}
return cached;
}
protected void Cache(string prefix, string url, DataSet ds)
{
string urlHash = url.GetHashCode().ToString();
string fn = prefix + "-" + urlHash + ".xml";
ds.WriteXml(fn);
}
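The article doesn't show how Cached and Cache are wired into the fetch methods; here is a minimal sketch for GetEntities, assuming the cache simply brackets the existing call (the "entities" prefix is an assumption, and the TEST conditional is omitted for brevity):

protected DataSet GetEntities(string url)
{
    DataSet dsEntities = new DataSet();

    // Reuse the previously serialized response, if any, so repeat URLs
    // don't count against the daily transaction limit.
    if (Cached("entities", url, ref dsEntities))
    {
        return dsEntities;
    }

    try
    {
        AlchemyAPI_EntityParams eparams = new AlchemyAPI_EntityParams();
        eparams.setMaxRetrieve(250);
        string xml = alchemyObj.URLGetRankedNamedEntities(url, eparams);
        TextReader tr = new StringReader(xml);
        XmlReader xr = XmlReader.Create(tr);
        dsEntities.ReadXml(xr);
        xr.Close();
        tr.Close();
        // Persist this response for next time.
        Cache("entities", url, dsEntities);
    }
    catch (Exception ex)
    {
        EmitException("Alchemy Receptor", ex);
    }

    return dsEntities;
}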
This is only a temporary measure; true data persistence to a database will be covered in Part 2.
Content Limit Size
An error that you may also get is "content exceeds size limit". I'll update this article once I know the exact limit.
Retrieve Limits
The default number of entities, keywords, and concepts retrieved by AlchemyAPI is 50. You can increase this limit to a maximum of 250 as I've done in the Alchemy Receptor, for example with entities:
AlchemyAPI_EntityParams eparams = new AlchemyAPI_EntityParams();
eparams.setMaxRetrieve(250);
string xml = alchemyObj.URLGetRankedNamedEntities(url, eparams);
This is an important parameter with which to experiment, as I'm not sure how useful it is to increase this limit. For example, when processing this Wikipedia page on computer science, AlchemyAPI extracts 147 total entities. This compares well with OpenCalais (155 entities), which has no default limit. By contrast, Semantria defaults to 5 entities with a maximum retrieval of 50.
More With Receptors
To achieve my primary goal in this article -- filtering feeds using the NLP results -- we need to add some further behaviors, the first of which is simply a tabbed list viewer receptor that makes managing all these lists easier.
Tabbed List Viewer Receptor
I'm not going to show the code (it's very similar to the Carrier List Viewer Receptor above); instead, I'll just walk through the configuration and usage.
Configuration
After dropping the tabbed list viewer receptor onto the surface, we double-click on it and configure the tabs we want and the protocols that each tab lists. The astute reader may realize that this will not work for RSSFeedItem protocols -- there is nothing to distinguish the feed items of one RSS feed from those of another. This can only be accomplished by qualifying the signal's data, in this case by the feed name. That feature is not currently implemented because it needs to be done in a general-purpose manner.
Wiring it up
Once the protocols are defined, we can see how it is connected:
Results
The NLP results now display in a tabbed list form rather than in discrete list forms:
Associating the URL with NLP Results
The NLP result isn't very useful by itself. We need to associate the URL with each result, which we can do by adding the semantic element to the Alchemy protocols:
<SemanticElement Name="URL"/>
and of course assigning that property to each result record that is emitted by the Alchemy Receptor:
signal.URL.Value = url;
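For clarity, here is a sketch of how that assignment might fit into the Emit helper shown earlier. The extra url parameter and the skipping of the URL element are my assumptions; the article only shows the one-line assignment:

protected void Emit(string protocol, DataTable data, string url)
{
    data.ForEach(row =>
    {
        CreateCarrierIfReceiver(protocol, signal =>
        {
            // Associate each NLP result with the page it was extracted from.
            signal.URL.Value = url;
            ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(protocol);
            st.AllTypes.ForEach(se =>
            {
                // URL is a semantic element of the protocol, not a column in Alchemy's DataTable, so skip it here.
                if (se.Name == "URL") return;
                object val = row[se.Name];
                if (val != null && val != DBNull.Value)
                {
                    se.SetValue(rsys.SemanticTypeSystem, signal, val);
                }
            });
        });
    });
}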
Notice immediately what now happens:
Because the Alchemy protocols now include the semantic element "URL", the list viewer receptor and URL Receptor are now auto-magically wired up (well, it was implemented in a couple of lines of code, as illustrated above for the single list viewer) such that, when the user double-clicks on an entry in the tabbed viewer, it emits all known semantic elements, of which "URL" is one (and the only one, right now). Again, the astute reader will ask, "but what about the URLs that are part of the Linked Data content, such as DBpedia?" That is a very good question, and one not addressed in this article.
As a side note, the beauty of the HOPE architecture is illustrated in the above behavior: the capability of the system is defined equally (if not more, actually) by the semantics of the protocols -- the richer your semantics become, the more interesting the behaviors that can be created to work with those semantics.
A Filter Receptor
We finally get to the crux of the matter -- filtering feeds based on the NLP results. To make this somewhat sophisticated, I'm going to use the NCalc expression evaluator so that we can do interesting things such as filtering entities or concepts not just by keywords but by a relevance threshold as well. We'll do this as generically as possible. First, the filtered protocol is emitted exactly as received; however, it is necessary to use a different semantic protocol to avoid ambiguity between unfiltered and filtered results. To some extent, this can be viewed as a potential flaw in the HOPE architecture, but the problem is common to publisher/subscriber systems, which is one aspect of HOPE. We will look at this issue at some point in the future.
Working with NCalc is very simple. This code snippet from the Filter Receptor demonstrates setting up "variables" in NCalc and creating a custom function, "contains":
protected void FilterSignal(string protocol, dynamic signal, List<string> filters)
{
filters.ForEach(filter =>
{
try
{
Expression exp = new Expression(filter);
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(protocol);
st.AllTypes.ForEach(t =>
{
exp.Parameters[t.Name] = t.GetValue(rsys.SemanticTypeSystem, signal);
});
exp.EvaluateFunction += OnEvaluateFunction;
object result = exp.Evaluate();
if (result is bool)
{
if ((bool)result)
{
CreateCarrier("Filtered" + protocol, outSignal =>
{
st.AllTypes.ForEach(t =>
{
t.SetValue(rsys.SemanticTypeSystem, outSignal, t.GetValue(rsys.SemanticTypeSystem, signal));
});
});
}
}
}
catch (Exception ex)
{
EmitException(ex.Message + " with filter " + filter);
}
});
}
protected void OnEvaluateFunction(string name, FunctionArgs args)
{
if (name.ToLower() == "contains")
{
string v1 = args.Parameters[0].Evaluate().ToString().ToLower();
string v2 = args.Parameters[1].Evaluate().ToString().ToLower();
args.Result = v1.Contains(v2);
}
}
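To make this concrete, here are a couple of hypothetical filter expressions, assuming the Alchemy protocols expose Text and Relevance fields (NCalc string literals use single quotes, and contains is the custom, case-insensitive function defined above):

contains(Text, 'visual studio')
contains(Text, 'c#') && Relevance > 0.75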
Configuration
The above screenshot illustrates a sample configuration of filtering protocols. Certainly, more filters on the same protocols (or other protocols) can be added.
We display the filtered list in a tabbed list view, configured as such:
Wiring it up
Membranes are again used to ensure that protocols are received and emitted in a controlled manner:
We can now view both unfiltered and filtered feed items (note that I changed the filter criteria from the above screenshot):
Conclusion
Natural Language Processing is a unique way of parsing "big data", providing semantic meaning suitable for potentially complex machine processing that results in information specifically tailored for delivery to us humans. However, such a lofty goal can only be achieved with the development of algorithms that process this information into something that actually has "meaning." What I've demonstrated here is a very rudimentary process, little better than keyword filtering, but hopefully it may inspire someone to use these services to develop the ideas further!
While I've entangled the discussion of NLP with the Higher Order Programming Environment framework, I hope this also inspires others to develop processing receptors and visualizations that go beyond simple lists.
There is still more work to be done in this demo, which will be the focus of Part 2: persisting the NLP data to a database, querying it, and improving usability, such as displaying whether a feed item is new or has already been read.