Introduction
In this article, I show some techniques for manipulating XML elements by treating their values as strings.
Background
One day, a friend stopped by and asked me if I knew anything about the dictionary that came on his iPad. He told me that it gave example sentences that didn’t match the headword. I had spent some time working my way through the WordNet database files, so I said, “I know exactly what happened!” And so will you, but first, here's the background.
WordNet is a lexical project run by Princeton University. They have a very large lexical database and it is available for download here. You will need this database if you want to run the project. If you want to know more about WordNet, please check their site. The key to the mystery of the iPad dictionary with the wrong example sentences is in the structure of the WordNet XML database.
Structure: The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. Synonyms--words that denote the same concept and are interchangeable in many contexts--are grouped into unordered sets (synsets). Each of WordNet’s 117 000 synsets is linked to other synsets by means of a small number of “conceptual relations.” Additionally, a synset contains a brief definition (“gloss”) and, in most cases, one or more short sentences illustrating the use of the synset members. Word forms with several distinct meanings are represented in as many distinct synsets. Thus, each form-meaning pair in WordNet is unique.
This project is released under CPOL. However, WordNet has its own licence and the terms of its use should be understood and honored.
Following is an excerpt of just one element in the WordNet database. It includes all the synset data described above under "Structure".
<synset pos="r" ofs="00516492" id="r00516492">
<terms><term>wrongfully</term></terms>
<keys><sk>wrongfully%4:02:00::</sk></keys>
<gloss desc="orig">
<orig>in an unjust or unfair manner; "the employee claimed that she was
wrongfully dismissed"; "people who were wrongfully imprisoned should be
released"</orig>
</gloss>
<gloss desc="text">
<text>in an unjust or unfair manner ; “ the employee claimed that she was
wrongfully dismissed ” ; “ people who were wrongfully imprisoned should be
released ”</text>
</gloss>
<gloss desc="wsd">
<def id="r00516492_d">
<wf pos="IN" id="r00516492_wf1" tag="ignore"
lemma="in">in</wf>
<wf pos="DT" id="r00516492_wf2" tag="ignore"
lemma="an">an</wf>
<wf pos="JJ" id="r00516492_wf3" tag="man"
lemma="unjust%3">
<id id="r00516492_id.6" lemma="unjust" sk="unjust
%3:00:02::"/>
<id id="r00516492_id.5" lemma="unjust" sk="unjust
%3:00:04::"/>
<id id="r00516492_id.4" lemma="unjust" sk="unjust
%3:00:00::"/>unjust</wf>
<wf pos="CC" id="r00516492_wf4" tag="ignore"
lemma="or">or</wf>
<wf pos="JJ" id="r00516492_wf5" tag="man"
lemma="unfair%3">
<id id="r00516492_id.8" lemma="unfair" sk="unfair
%3:00:00::"/>unfair</wf>
<wf pos="NN" id="r00516492_wf6" tag="man"
lemma="manner%1" sep="">
<id id="r00516492_id.7" lemma="manner" sk="manner
%1:07:02::"/>manner</wf>
<wf pos=":" id="r00516492_wf7" tag="ignore"
type="punc">;</wf>
</def><ex id="r00516492_ex1"><qf rend="dq">
<wf id="r00516492_wf8" tag="ignore"
lemma="the">the</wf>
<wf id="r00516492_wf9" tag="un" lemma="employee
%1">employee</wf>
<wf id="r00516492_wf10" tag="un" lemma="claim
%2">claimed</wf>
<wf id="r00516492_wf11" tag="ignore"
lemma="that">that</wf>
<wf id="r00516492_wf12" tag="ignore"
lemma="she">she</wf>
<wf id="r00516492_wf13" tag="un" lemma="be
%2">was</wf>
<wf id="r00516492_wf14" tag="auto" lemma="wrongfully
%4">
<id id="r00516492_id.2" lemma="wrongfully" sk="wrongfully
%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf15" tag="un" lemma="dismiss%2|dismissed
%3" sep="">dismissed</wf>
</qf>
<wf id="r00516492_wf16" tag="ignore"
type="punc">;</wf>
</ex><ex id="r00516492_ex2"><qf rend="dq">
<wf id="r00516492_wf17" tag="un" lemma="people%1|people
%2">people</wf>
<wf id="r00516492_wf18" tag="ignore"
lemma="who">who</wf>
<wf id="r00516492_wf19" tag="un" lemma="be
%2">were</wf>
<wf id="r00516492_wf20" tag="auto" lemma="wrongfully
%4">
<id id="r00516492_id.3" lemma="wrongfully" sk="wrongfully
%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf21" tag="un" lemma="imprison%2|imprisoned
%3">imprisoned</wf>
<wf id="r00516492_wf22" tag="ignore"
lemma="should">should</wf>
<wf id="r00516492_wf23" tag="un" lemma="be
%2">be</wf>
<wf id="r00516492_wf24" tag="un" lemma="release%2"
sep="">released</wf>
</qf>
<wf id="r00516492_wf25" tag="ignore"
type="punc">;</wf>
</ex>
</gloss>
</synset>
But I just want a dictionary, not all these fancy cross-referenced elements. Extracting just what I want from this example element programmatically would produce a dictionary entry something like the following:
wrongfully: in an unjust or unfair manner; "the employee claimed that she was wrongfully dismissed"; "people who were wrongfully imprisoned should be released"
Nothing wrong with that. But wait, what about my friend's iPad? OK, let's look at another synset element extracted from the XML files. As you can see from the structure, the <terms> element has 3 <term> elements in it:
<synset id="v00384055" ofs="00384055" pos="v">
<terms>
<term>metamorphose</term>
<term>transfigure</term>
<term>transmogrify</term>
</terms>
<keys>
<sk>metamorphose%2:30:00::</sk>
<sk>transfigure%2:30:00::</sk>
<sk>transmogrify%2:30:00::</sk>
</keys>
<gloss desc="orig">
<orig>change completely the nature or appearance of; "In
Kafka's story, a person metamorphoses into a bug"; "The treatment and diet
transfigured her into a beautiful young woman"; "Jesus was transfigured after
his resurrection"</orig>
</gloss>
<gloss desc="text">
<text>change completely the nature or appearance of ; “ In
Kafka's story , a person metamorphoses into a bug ” ; “ The treatment and
diet transfigured her into a beautiful young woman ” ; “ Jesus was
transfigured after his resurrection ”</text>
</gloss>
<gloss desc="wsd">
<def id="v00384055_d">
<wf id="v00384055_wf1" lemma="change%1|change%2"
pos="VB" tag="man">
<id id="v00384055_id.5" lemma="change"
sk="change%2:30:01::"/>change</wf>
<wf id="v00384055_wf2" lemma="completely%4"
pos="RB" tag="un">completely</wf>
<wf id="v00384055_wf3" lemma="the"
pos="DT" tag="ignore">the</wf>
<wf id="v00384055_wf4" lemma="nature%1"
pos="NN" tag="un">nature</wf>
<wf id="v00384055_wf5" lemma="or"
pos="CC" tag="ignore">or</wf>
<wf id="v00384055_wf6" lemma="appearance%1"
pos="NN" tag="man">
<id id="v00384055_id.4" lemma="appearance"
sk="appearance%1:07:00::"/>appearance</wf>
<wf id="v00384055_wf7" lemma="of"
pos="IN" sep="" tag="ignore">of</wf>
<wf id="v00384055_wf8" pos=":"
tag="ignore" type="punc">;</wf>
</def>
<ex id="v00384055_ex1">
<qf rend="dq">
<wf id="v00384055_wf9" lemma="in"
tag="ignore">In</wf>
<wf id="v00384055_wf10" lemma="Kafka%1"
tag="un">Kafka's</wf>
<wf id="v00384055_wf11" lemma="story%1"
sep="" tag="un">story</wf>
<wf id="v00384055_wf12" tag="ignore"
type="punc">,</wf>
<wf id="v00384055_wf13" lemma="a"
tag="ignore">a</wf>
<wf id="v00384055_wf14" lemma="person%1"
tag="un">person</wf>
<wf id="v00384055_wf15" lemma="metamorphosis
%1|metamorphose%2" tag="auto">
<id id="v00384055_id.1"
lemma="metamorphose" sk="metamorphose
%2:30:00::"/>metamorphoses</wf>
<wf id="v00384055_wf16" lemma="into"
tag="ignore">into</wf>
<wf id="v00384055_wf17" lemma="a"
tag="ignore">a</wf>
<wf id="v00384055_wf18" lemma="bug%1|bug
%2" sep="" tag="un">bug</wf>
</qf>
<wf id="v00384055_wf19" tag="ignore"
type="punc">;</wf>
</ex>
<ex id="v00384055_ex2">
<qf rend="dq">
<wf id="v00384055_wf20" lemma="the"
tag="ignore">The</wf>
<wf id="v00384055_wf21" lemma="treatment
%1" tag="un">treatment</wf>
<wf id="v00384055_wf22" lemma="and"
tag="ignore">and</wf>
<wf id="v00384055_wf23" lemma="diet%1|diet
%2" tag="un">diet</wf>
<wf id="v00384055_wf24" lemma="transfigure
%2" tag="auto">
<id id="v00384055_id.2"
lemma="transfigure" sk="transfigure
%2:30:00::"/>transfigured</wf>
<wf id="v00384055_wf25" lemma="her"
tag="ignore">her</wf>
<wf id="v00384055_wf26" lemma="into"
tag="ignore">into</wf>
<wf id="v00384055_wf27" lemma="a"
tag="ignore">a</wf>
<wf id="v00384055_wf28" lemma="beautiful
%3" tag="un">beautiful</wf>
<wf id="v00384055_wf29" lemma="young%1|young
%3" tag="un">young</wf>
<wf id="v00384055_wf30" lemma="woman%1"
sep="" tag="un">woman</wf>
</qf>
<wf id="v00384055_wf31" tag="ignore"
type="punc">;</wf>
</ex>
<ex id="v00384055_ex3">
<qf rend="dq">
<wf id="v00384055_wf32" lemma="Jesus%1"
tag="un">Jesus</wf>
<wf id="v00384055_wf33" lemma="be%2"
tag="un">was</wf>
<wf id="v00384055_wf34" lemma="transfigure
%2" tag="auto">
<id id="v00384055_id.3"
lemma="transfigure" sk="transfigure
%2:30:00::"/>transfigured</wf>
<wf id="v00384055_wf35" lemma="after%3|after
%4" tag="un">after</wf>
<wf id="v00384055_wf36" lemma="his"
tag="ignore">his</wf>
<wf id="v00384055_wf37" lemma="resurrection
%1" sep="" tag="un">resurrection</wf>
</qf>
<wf id="v00384055_wf38" tag="ignore"
type="punc">;</wf>
</ex>
</gloss>
</synset>
And if I were going to programmatically extract dictionary entries out of this one, the final text would look something like:
metamorphose: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"
transfigure: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"
transmogrify: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"
The first two almost work, as each has at least one example that matches its headword, but the entry for "transmogrify" has three wrong example sentences with it. It is not even safe to attempt swapping the <term> into the example sentence in place of the matching word. Without visually inspecting each one, you might create example sentences like "Jesus was transmogrified after his resurrection", which might be technically correct, but I'm sure some would take offense at it. And thus, any attempt to run a simple query extracting headword, definition and example sentence will produce errors.
Using the Code
In order to run the code, you will need to download the WordNet database files from the link above, extract the folder "merged" and put that folder in the debug folder of the project. I've left the Console.WriteLine() calls in, so running the code will display the points and examples given in this article, but most of them will zip by so fast you won't even be able to read them. So if you want to have it stop at any point, insert a Console.Readline() at the appropriate place. As released, there is only one at the end.
The code all runs in a console application. Sub Main calls the subs that demonstrate different ways of string manipulation of the XML elements' values and attributes. These subs are described below, but it will be easier to step through the code if you need to see exactly what was done. In the article, I'm only showing the specific points that demonstrate what I'm talking about.
Sub Main()
    wrongfully()
    Rightfully()
    Wordley()
    HTML()
End Sub
WordNet Wrongfully() Transmogrified: shows how to achieve the wrong example that I first described above. It produces an XML file, "wrongfully.xml". Looking through this file, you only get 4 entries in before you hit one that looks goofy:
<entry>
<hw>dorsal</hw>
<orig>facing away from the axis of an organ or organism;
"the abaxial surface of a leaf is the underside or side facing away from the stem"</orig>
<pos>a</pos>
</entry>
But looking on the bright side, the first three come out OK. And the XML for the "transmogrify" example comes out looking just like I predicted, explaining how wrong example sentences can end up on an iPad:
<entry>
<hw>metamorphose</hw>
<orig>change completely the nature or appearance of;
"In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
<entry>
<hw>transfigure</hw>
<orig>change completely the nature or appearance of;
"In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
<entry>
<hw>transmogrify</hw>
<orig>change completely the nature or appearance of;
"In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
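The behaviour described above can be reproduced with a minimal sketch. This is not the source of Wrongfully(); the module name NaiveExtractSketch, the function NaiveEntries and the trimmed-down inline synset are assumptions made for illustration. It simply pairs every <term> with the whole <orig> gloss, which is how examples end up under headwords they do not mention.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module NaiveExtractSketch
    ' Hypothetical helper: pairs each <term> with the complete <orig> gloss.
    ' Every term gets every example, whether or not the example uses that term.
    Public Function NaiveEntries(ByVal synset As XElement) As List(Of XElement)
        Dim entries As New List(Of XElement)
        Dim gloss As String = synset.<gloss>.<orig>.Value
        For Each t As XElement In synset.<terms>.<term>
            entries.Add(<entry>
                            <hw><%= t.Value %></hw>
                            <orig><%= gloss %></orig>
                            <pos><%= synset.@pos %></pos>
                        </entry>)
        Next
        Return entries
    End Function

    Sub Main()
        ' Trimmed-down synset with the same shape as the excerpts above (assumed sample).
        Dim synset As XElement = _
            <synset pos="v">
                <terms>
                    <term>metamorphose</term>
                    <term>transmogrify</term>
                </terms>
                <gloss desc="orig">
                    <orig>change completely; "a person metamorphoses into a bug"</orig>
                </gloss>
            </synset>
        For Each entry As XElement In NaiveEntries(synset)
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

Run against the sample, the "transmogrify" entry carries the "metamorphoses" example, which is the same kind of mismatch as in the iPad dictionary.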
WordNet Rightfully() Transmogrified: What I want is an XML document that contains the WordNet XML database reduced to entries that each contain a headword, give the definition, list the synonyms if any, and include an example only if there is one that matches the headword. Just a simple dictionary in XML format that can then be translated to other programs or formats. Looking closely at the elements in the database, I find that the <orig> element contains the definition that fits all the <term> values, but after that it may or may not include example sentences, and the examples may not fit each term listed. To that, I add that I am not going to decide whether a synonym could replace the matching word in an example sentence. I only want the ones that match.
Rightfully() shows how to use string manipulation of the WordNet XML elements to convert the WordNet database to a dictionary with entries in XML format. For this one, I take the <orig> element from the database files (you can see this element in the code examples above), which includes the definition and the examples that apply to the synset.
I split the data first on the semicolon, which worked for most of the elements. I had to add string replace calls to deal with the handful of entries that were not separated by semicolons. I simply ran it to error and worked out the string replacement that was needed to deal with each error. I use replace to add a semicolon at the right place, so the text is the same as before once the semicolon gets stripped out in the split. Now I have a headword and all available example sentences. I split out the examples and check whether each one contains the headword. If it does, then it gets matched up with that headword.
This produces an entry for the "transmogrify" example where I have achieved the desired result of example sentences that match the headword, etc.
<entry>
<hw>metamorphose</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>transfigure</term>
<term>transmogrify</term>
<q> "In Kafka's story, a person metamorphoses into a bug"</q>
</entry>
<entry>
<hw>transfigure</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>metamorphose</term>
<term>transmogrify</term>
<q> "The treatment and diet transfigured her into a beautiful young woman"</q>
<q> "Jesus was transfigured after his resurrection"</q>
</entry>
<entry>
<hw>transmogrify</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>metamorphose</term>
<term>transfigure</term>
</entry>
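The split-and-match step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Rightfully(): the module SplitMatchSketch and the helpers DefinitionOf and ExamplesFor are hypothetical names, and the sketch assumes the gloss is already cleanly separated by semicolons (the article handles the exceptions with extra replace calls).

```vbnet
Imports System
Imports System.Collections.Generic

Module SplitMatchSketch
    ' Hypothetical helper: the text before the first semicolon is the definition.
    Public Function DefinitionOf(ByVal orig As String) As String
        Return orig.Split(";"c)(0).Trim()
    End Function

    ' Hypothetical helper: keeps only the examples that contain the headword.
    Public Function ExamplesFor(ByVal orig As String, ByVal headword As String) As List(Of String)
        Dim matches As New List(Of String)
        Dim parts() As String = orig.Split(";"c)
        ' parts(0) is the definition, so the examples start at index 1.
        For i As Integer = 1 To parts.Length - 1
            Dim example As String = parts(i).Trim()
            If example.Contains(headword) Then matches.Add(example)
        Next
        Return matches
    End Function

    Sub Main()
        Dim orig As String = "change completely the nature or appearance of; " & _
            """In Kafka's story, a person metamorphoses into a bug""; " & _
            """The treatment and diet transfigured her into a beautiful young woman""; " & _
            """Jesus was transfigured after his resurrection"""
        For Each hw As String In New String() {"metamorphose", "transfigure", "transmogrify"}
            Console.WriteLine(hw & ": " & DefinitionOf(orig))
            For Each ex As String In ExamplesFor(orig, hw)
                Console.WriteLine("  " & ex)
            Next
        Next
    End Sub
End Module
```

With this gloss, "metamorphose" keeps one example, "transfigure" keeps two and "transmogrify" keeps none, which lines up with the entries shown above.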
I did some testing on the resulting XML files and could not find any mistakes introduced by the string replace in rightfully(), but I did find one thing by accident, as I wasn't looking for it here. At least 1 example sentence that should be in there is lost. Some examples had "felt" as the past tense of "feel", which gets lost by using contains("feel"). But the objective of having no wrongly attached example sentences had been achieved. I may have lost some others along the way where the plural or past tense doesn't contain the headword.
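The "feel"/"felt" miss is easy to demonstrate. The sketch below is only an illustration: the example sentence is made up, and ContainsAnyForm is a hypothetical helper showing one possible workaround (passing extra word forms by hand), not something the project does.

```vbnet
Imports System

Module InflectionSketch
    ' Hypothetical helper: True when the example contains any of the supplied forms.
    Public Function ContainsAnyForm(ByVal example As String, ByVal ParamArray forms() As String) As Boolean
        For Each candidate As String In forms
            If example.Contains(candidate) Then Return True
        Next
        Return False
    End Function

    Sub Main()
        ' Made-up example sentence using the irregular past tense.
        Dim example As String = """she felt the cold"""
        Console.WriteLine(example.Contains("feel"))                  ' prints False
        Console.WriteLine(ContainsAnyForm(example, "feel", "felt"))  ' prints True
    End Sub
End Module
```

A plain substring test only catches inflections that still contain the headword, such as "transfigured" for "transfigure"; irregular forms need to be supplied some other way.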
Wordley() is a sub that uses the XML file created in Rightfully() to show further string manipulation of the XML file, converting it to text files in the exact same format as the dictionary Alan Burkhart provides in his CP article: Wordley. The only exception is that this produces the full database converted to a dictionary, while Wordley is a trimmed-back version. I start with StringBuilder and IDictionary objects to build the Wordley files. I take advantage of the fact that you can't add a duplicate key to an IDictionary: I try to add the key in Try, and if it already exists, execution goes to Catch, where I look up the already existing entry and add to it.
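The Try/Catch pattern for duplicate keys can be sketched like this. It is a minimal illustration, not the code of Wordley(): the module DuplicateKeySketch, the function GroupByHeadword, the " | " separator and the placeholder sense strings are all assumptions.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module DuplicateKeySketch
    ' Hypothetical helper: collects every value under its headword.
    Public Function GroupByHeadword(ByVal pairs As IEnumerable(Of KeyValuePair(Of String, String))) _
        As IDictionary(Of String, StringBuilder)
        Dim entries As IDictionary(Of String, StringBuilder) = New Dictionary(Of String, StringBuilder)
        For Each pair As KeyValuePair(Of String, String) In pairs
            Try
                ' Add throws ArgumentException when the key is already present.
                entries.Add(pair.Key, New StringBuilder(pair.Value))
            Catch ex As ArgumentException
                ' The headword exists already, so append to its entry instead.
                entries(pair.Key).Append(" | ").Append(pair.Value)
            End Try
        Next
        Return entries
    End Function

    Sub Main()
        ' Placeholder data: two senses for "close", one for "shut".
        Dim pairs As New List(Of KeyValuePair(Of String, String))
        pairs.Add(New KeyValuePair(Of String, String)("close", "sense A"))
        pairs.Add(New KeyValuePair(Of String, String)("shut", "sense A"))
        pairs.Add(New KeyValuePair(Of String, String)("close", "sense B"))
        For Each entry As KeyValuePair(Of String, StringBuilder) In GroupByHeadword(pairs)
            Console.WriteLine(entry.Key & ": " & entry.Value.ToString())
        Next
    End Sub
End Module
```

Checking with ContainsKey or TryGetValue before adding would give the same result without relying on an exception; the sketch keeps Try/Catch only because that is the approach the article describes.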
HTML() converts the file created in Rightfully() into individual numbered XML files, similar to but not exactly the same as the directory created for the XML dictionary in Christ Kennedy's CP article GCIDE: A Complete English Language Dictionary. In my version, I have it worked out so each file carries all entries of the same headword, rather than one file per definition. In both Wordley() and HTML(), I take advantage of the fact that the IDictionary will not allow duplicate keys by putting the add in a Try - Catch. First, it tries creating a key for the headword of the next element. If the key does not already exist, this succeeds. Otherwise, it goes to Catch, where I make it look up the already existing one and add to it.
If you do not want to see the XML files and stylesheet and how they work together, I recommend commenting out the HTML() sub, as it will make 147,306 XML files using about 600 MB of disk space. If you just want to see a few and how they work, you can stop the project anytime after the HTML() sub starts running because the XSL stylesheet is already in place. Then, if you double-click an XML file, it will open in your web browser, but it will be a random selection as they are numbered files. The stylesheet ("wn.xsl") is created programmatically and saved in the WordnetFiles directory when the directories are being created. OR...
Viewing the XML as HTML: The following code will make a simple Visual Basic browser with an autocomplete textbox for viewing the XML/HTML files. The XML file "WNdicty.xml" is created during the processing of the XML files and saves the dictionary as key-value pairs in the form <p><k></k><v></v></p>. The file doesn't get saved until all the files are saved, so if you want to try this out, you will have to run the whole sub.
- In VS 2010, create a new Windows forms project targeting 3.5 framework in Visual Basic. It might work in other versions, but it is up to you to convert if it doesn't.
- Add a textbox and dock it at the top of the form.
- Add a WebBrowser control and set the Dock property to "Fill" and the ScriptErrorsSuppressed property to "true".
- Stretch the form out to a respectable viewing size.
- Double-click on the form (or F7) to get Form1 showing. Replace the empty Form1 with the following code.
- Copy the folder "WordnetFiles" created in this project into the debug folder of the new project.
This code is given without comments, no explanation, to give a bare bones viewer for looking up the files, or learning about XSL stylesheets (don't ask me - I read XML for Dummies before I found CodeProject) or a base for building a better dictionary, should you care to do so. Otherwise, I recommend Wordley.
Public Class Form1
    Public Shared AutoCompleteList As AutoCompleteStringCollection = New AutoCompleteStringCollection
    Public Shared WNDicty As IDictionary(Of String, String) = New Dictionary(Of String, String)
    Public Shared whereiam As String = My.Computer.FileSystem.CurrentDirectory & "\"

    Private Sub autocompletefill()
        Dim DictySource As XElement = XElement.Load(whereiam & "\WordnetFiles\WNdicty.xml")
        WNDicty.Clear()
        For Each kvp In DictySource.<p>
            Dim searchkey As String = kvp.<k>.Value
            Dim ID As String = kvp.<v>.Value
            WNDicty.Add(searchkey, ID)
            AutoCompleteList.Add(searchkey)
        Next
    End Sub

    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        autocompletefill()
        Me.TextBox1.Select(0, 1)
        TextBox1.AutoCompleteSource = AutoCompleteSource.CustomSource
        TextBox1.AutoCompleteCustomSource = AutoCompleteList
        TextBox1.AutoCompleteMode = AutoCompleteMode.SuggestAppend
        WebBrowser1.Navigate(whereiam & "WordnetFiles\032\032088.xml", False)
    End Sub

    Private Sub TextBox1_KeyDown(sender As Object, e As System.Windows.Forms.KeyEventArgs) Handles TextBox1.KeyDown
        If e.KeyCode = Keys.Enter Then
            Dim path As String = ""
            If WNDicty.TryGetValue(TextBox1.Text, path) Then
                Dim foldername As String = path.Substring(0, 3) & "\"
                Dim makeurl As String = "file://"
                Dim filelocation As String = makeurl & whereiam & _
                    "WordnetFiles\" & foldername & path & ".xml"
                WebBrowser1.Navigate(filelocation, False)
            End If
        End If
    End Sub
End Class
```
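The viewer above relies on two conventions: the <p><k></k><v></v></p> pairs in WNdicty.xml, and file numbers whose first three digits name the subfolder. The sketch below illustrates both under stated assumptions. It is not the code of HTML(): the module IndexFileSketch, the helpers BuildIndex and RelativePath, the root element name "dicty" and the sample headword and file number are all hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module IndexFileSketch
    ' Hypothetical helper: writes headword/file-number pairs as <p><k/><v/></p> elements.
    ' The root element name "dicty" is an assumption; the article does not show it.
    Public Function BuildIndex(ByVal entries As IDictionary(Of String, String)) As XElement
        Dim root As New XElement("dicty")
        For Each pair As KeyValuePair(Of String, String) In entries
            root.Add(<p><k><%= pair.Key %></k><v><%= pair.Value %></v></p>)
        Next
        Return root
    End Function

    ' Mirrors the viewer: the first three digits of the file number name the subfolder.
    Public Function RelativePath(ByVal fileNumber As String) As String
        Return "WordnetFiles\" & fileNumber.Substring(0, 3) & "\" & fileNumber & ".xml"
    End Function

    Sub Main()
        Dim entries As New Dictionary(Of String, String)
        entries.Add("example", "000001")   ' made-up headword and file number
        Dim index As XElement = BuildIndex(entries)
        Console.WriteLine(index)
        For Each p As XElement In index.<p>
            Console.WriteLine(p.<k>.Value & " -> " & RelativePath(p.<v>.Value))
        Next
    End Sub
End Module
```

Splitting the files across subfolders by number prefix keeps any single directory from holding all of the generated files at once.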
On the point of XSL stylesheets: I added extra title (tooltip) attributes to give it the hover explanation, colors, a link to the WordNet site, etc. Yes, because it is in a WebBrowser control, it does look up http addresses if they are in the link. It is intentionally a bit obnoxious, so as to give incentive to learn to edit the XSL or to use Wordley.
Points of Interest
In this article, I have attempted to show that there is a right way and a wrong way to do something and that time invested at the beginning to work out what you are going to do is time well spent.
I give a couple of examples of ways to do string manipulation of XML files. Rightfully() shows the string manipulation that converts the WordNet synset into dictionary entries with correct example sentences. Wordley() shows further string manipulation and a way to convert the XML to .txt files compatible with Wordley. HTML() shows how to convert the XML document into individual XML files, one per word, with an XSLT stylesheet applied that converts it to HTML.
HTML() also shows an example of using XDocument in real life. I did a lot of searching and couldn't find much available on it. It is useful for including the processing instructions for converting the XML documents to HTML with the XSL stylesheet.
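As a rough illustration of the XDocument point, the sketch below builds a document whose xml-stylesheet processing instruction points at an XSL file. It is not the code of HTML(): the module StylesheetSketch, the helper WithStylesheet and the relative href "../wn.xsl" (which assumes the stylesheet sits one folder above the numbered files) are assumptions.

```vbnet
Imports System
Imports System.Xml.Linq

Module StylesheetSketch
    ' Hypothetical helper: wraps an entry in a document that points at an XSL stylesheet.
    Public Function WithStylesheet(ByVal entry As XElement, ByVal xslHref As String) As XDocument
        Dim pi As New XProcessingInstruction("xml-stylesheet", _
            "type=""text/xsl"" href=""" & xslHref & """")
        ' The processing instruction must come before the root element.
        Return New XDocument(New XDeclaration("1.0", "utf-8", Nothing), pi, entry)
    End Function

    Sub Main()
        Dim entry As XElement = _
            <entry>
                <hw>transmogrify</hw>
                <pos>v</pos>
                <def>change completely the nature or appearance of</def>
            </entry>
        ' "../wn.xsl" is an assumed location for the stylesheet.
        Dim doc As XDocument = WithStylesheet(entry, "../wn.xsl")
        Console.WriteLine(doc.Declaration)
        Console.WriteLine(doc)
    End Sub
End Module
```

An XElement on its own cannot carry the processing instruction ahead of the root, which is why XDocument is the natural container here. Calling doc.Save with a file path would write the declaration, the processing instruction and the entry to disk.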
I attempt to show that XML is a versatile way to convert data from one form into another.
The WordNet project is part of the subject of computational lexicology. I am using it as the base for the main project I'm working on, and the HTML() sub is a modified part of my current working model for it. It will probably be very different by the time I am done with it. The more I study it, the more I find there is to learn, but the one thing I have not yet seen is a definition of computational lexicographer. So I would like to propose: someone who applies both computer programming and lexicology in order to build a computer program that can help build a better dictionary. You know, not just the hack who tries to interpret what the lexi-guy wants, but someone actually studying and applying it from both ends. Thanks to Princeton for their WordNet project!
This is my article.single for CodeProject but I hope to make it my article.first. If I get a favorable response, then maybe I can show it to the Personnel Director to support my claim that I would be more valuable in IT than in Maintenance...
History
- 5th November, 2012: Released
- 19th November, 2012: Minor typos & clarifications in article; fixed point in stylesheet that occasionally rendered the wrong part of speech for a word