
Wordnet Rightfully Transmogrified

19 Nov 2012 · CPOL
String manipulation of XML files

Introduction

In this article, I show some techniques for working with XML elements by treating their values as strings.

Background

One day, a friend stopped by and asked me if I knew anything about the dictionary that came on his iPad. He told me that it gave example sentences that didn’t match the headword. I had spent some time working my way through the WordNet database files, so I said, “I know exactly what happened!” And so will you, but first, here's the background.

WordNet is a lexical project run by Princeton University. Its very large lexical database is available for download from the WordNet site. You will need this database if you want to run the project. If you want to know more about WordNet, please check their site. The key to the mystery of the iPad dictionary with the wrong example sentences lies in the structure of the WordNet XML database.

Structure (as described on the WordNet site): "The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. Synonyms--words that denote the same concept and are interchangeable in many contexts--are grouped into unordered sets (synsets). Each of WordNet’s 117 000 synsets is linked to other synsets by means of a small number of “conceptual relations.” Additionally, a synset contains a brief definition (“gloss”) and, in most cases, one or more short sentences illustrating the use of the synset members. Word forms with several distinct meanings are represented in as many distinct synsets. Thus, each form-meaning pair in WordNet is unique."

This project is released under CPOL. However, WordNet has its own licence and the terms of its use should be understood and honored.

Following is an excerpt of just one element in the WordNet database. It includes all synset data described above under "structure".

XML
<synset pos="r" ofs="00516492" id="r00516492">
<terms><term>wrongfully</term></terms>
<keys><sk>wrongfully%4:02:00::</sk></keys>
<gloss desc="orig">
<orig>in an unjust or unfair manner; "the employee claimed that she was wrongfully dismissed"; "people who were wrongfully imprisoned should be released"</orig>
</gloss>
<gloss desc="text">
<text>in an unjust or unfair manner ; “ the employee claimed that she was wrongfully dismissed ” ; “ people who were wrongfully imprisoned should be released ”</text>
</gloss>
<gloss desc="wsd">
<def id="r00516492_d">
<wf pos="IN" id="r00516492_wf1" tag="ignore" lemma="in">in</wf>
<wf pos="DT" id="r00516492_wf2" tag="ignore" lemma="an">an</wf>
<wf pos="JJ" id="r00516492_wf3" tag="man" lemma="unjust%3">
<id id="r00516492_id.6" lemma="unjust" sk="unjust%3:00:02::"/>
<id id="r00516492_id.5" lemma="unjust" sk="unjust%3:00:04::"/>
<id id="r00516492_id.4" lemma="unjust" sk="unjust%3:00:00::"/>unjust</wf>
<wf pos="CC" id="r00516492_wf4" tag="ignore" lemma="or">or</wf>
<wf pos="JJ" id="r00516492_wf5" tag="man" lemma="unfair%3">
<id id="r00516492_id.8" lemma="unfair" sk="unfair%3:00:00::"/>unfair</wf>
<wf pos="NN" id="r00516492_wf6" tag="man" lemma="manner%1" sep="">
<id id="r00516492_id.7" lemma="manner" sk="manner%1:07:02::"/>manner</wf>
<wf pos=":" id="r00516492_wf7" tag="ignore" type="punc">;</wf>
</def><ex id="r00516492_ex1"><qf rend="dq">
<wf id="r00516492_wf8" tag="ignore" lemma="the">the</wf>
<wf id="r00516492_wf9" tag="un" lemma="employee%1">employee</wf>
<wf id="r00516492_wf10" tag="un" lemma="claim%2">claimed</wf>
<wf id="r00516492_wf11" tag="ignore" lemma="that">that</wf>
<wf id="r00516492_wf12" tag="ignore" lemma="she">she</wf>
<wf id="r00516492_wf13" tag="un" lemma="be%2">was</wf>
<wf id="r00516492_wf14" tag="auto" lemma="wrongfully%4">
<id id="r00516492_id.2" lemma="wrongfully" sk="wrongfully%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf15" tag="un" lemma="dismiss%2|dismissed%3" sep="">dismissed</wf>
</qf>
<wf id="r00516492_wf16" tag="ignore" type="punc">;</wf>
</ex><ex id="r00516492_ex2"><qf rend="dq">
<wf id="r00516492_wf17" tag="un" lemma="people%1|people%2">people</wf>
<wf id="r00516492_wf18" tag="ignore" lemma="who">who</wf>
<wf id="r00516492_wf19" tag="un" lemma="be%2">were</wf>
<wf id="r00516492_wf20" tag="auto" lemma="wrongfully%4">
<id id="r00516492_id.3" lemma="wrongfully" sk="wrongfully%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf21" tag="un" lemma="imprison%2|imprisoned%3">imprisoned</wf>
<wf id="r00516492_wf22" tag="ignore" lemma="should">should</wf>
<wf id="r00516492_wf23" tag="un" lemma="be%2">be</wf>
<wf id="r00516492_wf24" tag="un" lemma="release%2" sep="">released</wf>
</qf>
<wf id="r00516492_wf25" tag="ignore" type="punc">;</wf>
</ex>
</gloss>
</synset>

But I just want a dictionary, not all these fancy cross-referenced elements. Extracting just what I want from this example element programmatically would produce a dictionary entry something like the following:

wrongfully: in an unjust or unfair manner; "the employee claimed that she was wrongfully dismissed"; "people who were wrongfully imprisoned should be released"

Nothing wrong with that. But wait, what about my friend's iPad? OK, let's look at another synset element extracted from the XML files. As you can see from the structure, the element <terms> has 3 <term> elements in it:

XML
<synset id="v00384055" ofs="00384055" pos="v">
  <terms>
   <term>metamorphose</term>
   <term>transfigure</term>
   <term>transmogrify</term>
  </terms>
  <keys>
   <sk>metamorphose%2:30:00::</sk>
   <sk>transfigure%2:30:00::</sk>
   <sk>transmogrify%2:30:00::</sk>
  </keys>
  <gloss desc="orig">
   <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
  </gloss>
  <gloss desc="text">
   <text>change completely the nature or appearance of ; “ In Kafka's story , a person metamorphoses into a bug ” ; “ The treatment and diet transfigured her into a beautiful young woman ” ; “ Jesus was transfigured after his resurrection ”</text>
  </gloss>
  <gloss desc="wsd">
   <def id="v00384055_d">
    <wf id="v00384055_wf1" lemma="change%1|change%2" pos="VB" tag="man">
     <id id="v00384055_id.5" lemma="change" sk="change%2:30:01::"/>change</wf>
    <wf id="v00384055_wf2" lemma="completely%4" pos="RB" tag="un">completely</wf>
    <wf id="v00384055_wf3" lemma="the" pos="DT" tag="ignore">the</wf>
    <wf id="v00384055_wf4" lemma="nature%1" pos="NN" tag="un">nature</wf>
    <wf id="v00384055_wf5" lemma="or" pos="CC" tag="ignore">or</wf>
    <wf id="v00384055_wf6" lemma="appearance%1" pos="NN" tag="man">
     <id id="v00384055_id.4" lemma="appearance" sk="appearance%1:07:00::"/>appearance</wf>
    <wf id="v00384055_wf7" lemma="of" pos="IN" sep="" tag="ignore">of</wf>
    <wf id="v00384055_wf8" pos=":" tag="ignore" type="punc">;</wf>
   </def>
   <ex id="v00384055_ex1">
    <qf rend="dq">
     <wf id="v00384055_wf9" lemma="in" tag="ignore">In</wf>
     <wf id="v00384055_wf10" lemma="Kafka%1" tag="un">Kafka's</wf>
     <wf id="v00384055_wf11" lemma="story%1" sep="" tag="un">story</wf>
     <wf id="v00384055_wf12" tag="ignore" type="punc">,</wf>
     <wf id="v00384055_wf13" lemma="a" tag="ignore">a</wf>
     <wf id="v00384055_wf14" lemma="person%1" tag="un">person</wf>
     <wf id="v00384055_wf15" lemma="metamorphosis%1|metamorphose%2" tag="auto">
      <id id="v00384055_id.1" lemma="metamorphose" sk="metamorphose%2:30:00::"/>metamorphoses</wf>
     <wf id="v00384055_wf16" lemma="into" tag="ignore">into</wf>
     <wf id="v00384055_wf17" lemma="a" tag="ignore">a</wf>
     <wf id="v00384055_wf18" lemma="bug%1|bug%2" sep="" tag="un">bug</wf>
    </qf>
    <wf id="v00384055_wf19" tag="ignore" type="punc">;</wf>
   </ex>
   <ex id="v00384055_ex2">
    <qf rend="dq">
     <wf id="v00384055_wf20" lemma="the" tag="ignore">The</wf>
     <wf id="v00384055_wf21" lemma="treatment%1" tag="un">treatment</wf>
     <wf id="v00384055_wf22" lemma="and" tag="ignore">and</wf>
     <wf id="v00384055_wf23" lemma="diet%1|diet%2" tag="un">diet</wf>
     <wf id="v00384055_wf24" lemma="transfigure%2" tag="auto">
      <id id="v00384055_id.2" lemma="transfigure" sk="transfigure%2:30:00::"/>transfigured</wf>
     <wf id="v00384055_wf25" lemma="her" tag="ignore">her</wf>
     <wf id="v00384055_wf26" lemma="into" tag="ignore">into</wf>
     <wf id="v00384055_wf27" lemma="a" tag="ignore">a</wf>
     <wf id="v00384055_wf28" lemma="beautiful%3" tag="un">beautiful</wf>
     <wf id="v00384055_wf29" lemma="young%1|young%3" tag="un">young</wf>
     <wf id="v00384055_wf30" lemma="woman%1" sep="" tag="un">woman</wf>
    </qf>
    <wf id="v00384055_wf31" tag="ignore" type="punc">;</wf>
   </ex>
   <ex id="v00384055_ex3">
    <qf rend="dq">
     <wf id="v00384055_wf32" lemma="Jesus%1" tag="un">Jesus</wf>
     <wf id="v00384055_wf33" lemma="be%2" tag="un">was</wf>
     <wf id="v00384055_wf34" lemma="transfigure%2" tag="auto">
      <id id="v00384055_id.3" lemma="transfigure" sk="transfigure%2:30:00::"/>transfigured</wf>
     <wf id="v00384055_wf35" lemma="after%3|after%4" tag="un">after</wf>
     <wf id="v00384055_wf36" lemma="his" tag="ignore">his</wf>
     <wf id="v00384055_wf37" lemma="resurrection%1" sep="" tag="un">resurrection</wf>
    </qf>
    <wf id="v00384055_wf38" tag="ignore" type="punc">;</wf>
   </ex>
  </gloss>
</synset>

And if I were going to programmatically extract dictionary entries out of this one, the final text would look something like:

metamorphose: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"

transfigure: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"

transmogrify: change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"

The first two almost work, as each has at least one example that matches its headword, but the entry for "transmogrify" has three wrong example sentences. It is not even safe to swap the headword in for the matching <term> in an example sentence. Without visually inspecting each one, you might create example sentences like "Jesus was transmogrified after his resurrection", which might be technically correct, but I'm sure some would take offense at it. Thus, any simple query that extracts headword, definition and example sentences will produce errors.
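The simple query described above can be sketched in a few lines. This is only an illustration, not the code from the download: the module name NaiveExtraction and the function NaiveEntries are made up for this example, and the synset is a trimmed copy of the one shown earlier. It pairs every <term> with the complete <orig> gloss, which is exactly how "transmogrify" ends up with three sentences that never use the word.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module NaiveExtraction
    ' Hypothetical helper: pairs every <term> in a synset with the complete
    ' <orig> gloss (definition plus all example sentences).
    Function NaiveEntries(ByVal synset As XElement) As List(Of String)
        Dim gloss As String = synset...<orig>.Value
        Dim entries As New List(Of String)
        For Each term As XElement In synset.<terms>.<term>
            entries.Add(term.Value & ": " & gloss)
        Next
        Return entries
    End Function

    Sub Main()
        ' Trimmed copy of the synset shown above (the wsd gloss is left out).
        Dim synset As XElement = <synset id="v00384055" pos="v">
                                     <terms>
                                         <term>metamorphose</term>
                                         <term>transfigure</term>
                                         <term>transmogrify</term>
                                     </terms>
                                     <gloss desc="orig">
                                         <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
                                     </gloss>
                                 </synset>

        ' Every headword is printed with the same three example sentences.
        For Each entry As String In NaiveEntries(synset)
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

Running this prints three entries that differ only in the headword, matching the three entries listed above.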

Using the Code

To run the code, download the WordNet database files using the link above, extract the folder "merged", and put that folder in the debug folder of the project. I've left the Console.WriteLine() calls in, so running the code will display the points and examples given in this article, but most of them will zip by so fast you won't be able to read them. If you want it to stop at any point, insert a Console.ReadLine() at the appropriate place. As released, there is only one, at the end.

The code all runs in a console application. Sub Main calls the subs that demonstrate different ways of string manipulation of the XML element's values and attributes. These subs are described below but it will be easier to step through the code if you need to see exactly what was done. I'm only showing specific points in the article that demonstrate what I'm talking about.

VB
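Sub Main()
    Wrongfully()  ' naive extraction: every term gets every example sentence
    Rightfully()  ' keeps an example only if it contains the headword
    Wordley()     ' converts the Rightfully() output to Wordley-format text files
    HTML()        ' writes the numbered XML files and the XSL stylesheet
End Sub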

WordNet Wrongfully() Transmogrified: Wrongfully() shows how to produce the wrong examples that I first described above. It writes an XML file, "wrongfully.xml". Looking through this file, you only get 4 entries in before you hit one that looks goofy, with an example sentence that never mentions its headword:

XML
<entry>
    <hw>dorsal</hw>
    <orig>facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"</orig>
    <pos>a</pos>
</entry>

But looking on the bright side, the first three come out OK. And the XML for the "transmogrify" example comes out looking just like I predicted, explaining how wrong example sentences can end up on an iPad:

XML
<entry>
    <hw>metamorphose</hw>
    <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
    <pos>v</pos>
</entry>
<entry>
    <hw>transfigure</hw>
    <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
    <pos>v</pos>
</entry>
<entry>
    <hw>transmogrify</hw>
    <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
    <pos>v</pos>
</entry>

WordNet Rightfully() Transmogrified: What I want is an XML document that reduces the WordNet XML database to entries that each contain a headword, the definition, the synonyms (if any) and, where an example matches the headword, that example. Just a simple dictionary in XML format that can then be translated to other programs or formats. Looking closely at the elements in the database, I find that the <orig> element contains the definition that fits all the <term> values, but after that, it may or may not include example sentences, and the examples may not fit every term listed. On top of that, I am not going to decide whether a synonym could replace the word used in an example sentence. I only want the examples that match.

Rightfully() shows how to use string manipulation of the WordNet XML elements to get the WordNet database converted to a dictionary with entries in XML format. On this one, I take the <orig> element from the database files (you can see this element in the above code examples) which includes the definition and examples that apply to the synset.

I first split the data on the semicolon, which worked for most of the elements. I had to add string replacements to deal with the handful of entries that were not separated by semicolons: I simply ran it until it hit an error and worked out the replacement needed to deal with that error. Each replacement adds a semicolon at the right place, so once the split strips the semicolons out, the text is the same as before. Now I have a headword and all available example sentences. I check each example to see whether it contains the headword. If it does, it gets matched up with that headword.

This produces an entry for the "transmogrify" example where I have achieved the desired result of example sentences that match the headword, etc.

XML
<!-- Note: the q elements in each entry hold only the example sentences
that apply to that entry. Note also that the term elements contain the
synonyms for the headword. These are used in the HTML() sub and
stripped out in the Wordley() sub. -->

<entry>
    <hw>metamorphose</hw>
    <pos>v</pos>
    <def>change completely the nature or appearance of</def>
    <term>transfigure</term>
    <term>transmogrify</term>
    <q> "In Kafka's story, a person metamorphoses into a bug"</q>
</entry>
<entry>
    <hw>transfigure</hw>
    <pos>v</pos>
    <def>change completely the nature or appearance of</def>
    <term>metamorphose</term>
    <term>transmogrify</term>
    <q> "The treatment and diet transfigured her into a beautiful young woman"</q>
    <q> "Jesus was transfigured after his resurrection"</q>
</entry>
<entry>
    <hw>transmogrify</hw>
    <pos>v</pos>
    <def>change completely the nature or appearance of</def>
    <term>metamorphose</term>
    <term>transfigure</term>
</entry>
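The split-and-match step can be sketched as below. This is a minimal illustration under stated assumptions, not the Rightfully() sub itself: the module name MatchExamples and the function MatchingExamples are invented for this example, and it assumes the definition is the first semicolon-separated piece and contains no semicolons of its own. The special-case string replacements mentioned above are left out.

```vb
Imports System
Imports System.Collections.Generic

Module MatchExamples
    ' Hypothetical helper: splits an <orig> gloss on semicolons, treats the
    ' first piece as the definition, and returns the remaining pieces
    ' (the example sentences) that contain the headword.
    Function MatchingExamples(ByVal orig As String, ByVal headword As String) As List(Of String)
        Dim parts() As String = orig.Split(";"c)
        Dim matches As New List(Of String)
        For i As Integer = 1 To parts.Length - 1
            Dim example As String = parts(i).Trim()
            ' Plain case-sensitive substring test: "transfigured" contains
            ' "transfigure", but "felt" does not contain "feel".
            If example.Contains(headword) Then
                matches.Add(example)
            End If
        Next
        Return matches
    End Function

    Sub Main()
        Dim orig As String = "change completely the nature or appearance of; " & _
            """In Kafka's story, a person metamorphoses into a bug""; " & _
            """The treatment and diet transfigured her into a beautiful young woman""; " & _
            """Jesus was transfigured after his resurrection"""

        For Each hw As String In New String() {"metamorphose", "transfigure", "transmogrify"}
            Dim found As List(Of String) = MatchingExamples(orig, hw)
            Console.WriteLine(hw & ": " & found.Count.ToString() & " matching example(s)")
        Next
    End Sub
End Module
```

For the gloss above, this finds one example for "metamorphose", two for "transfigure" and none for "transmogrify", in line with the entries shown.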

I did some testing on the resulting XML files and could not find any mistakes introduced by the string replacements in Rightfully(), but I did find one thing by accident, as I wasn't looking for it. At least one example sentence that should be in there is lost. Some examples use "felt" as the past tense of "feel", and these get dropped by Contains("feel"). Still, the objective of having no wrongly assigned example sentences has been achieved. I may have lost others along the way where an irregular plural or past tense doesn't contain the headword.

Wordley() is a sub that uses the XML file created in Rightfully() to show further string manipulation, converting it to text files in exactly the same format as the dictionary Alan Burkhart provides in his CP article: Wordley. The only difference is that this produces the full database converted to a dictionary, while Wordley is a trimmed-back version. I start with StringBuilders and IDictionary objects to build the Wordley files. I take advantage of the fact that you can't add a duplicate key to an IDictionary: I try to add the key inside Try, and if it already exists, execution goes to Catch, where I look up the existing entry and add to it.
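The "add, or append to the existing entry" idea can be sketched as follows. Note that this sketch swaps in TryGetValue instead of the Try - Catch used in the project, so the duplicate case is handled without an exception being thrown. The module name GroupByHeadword, the function GroupDefinitions and the sample definitions are invented for this illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module GroupByHeadword
    ' Hypothetical helper: collects every definition for the same headword
    ' into one StringBuilder, keyed by headword.
    Function GroupDefinitions(ByVal pairs As IEnumerable(Of KeyValuePair(Of String, String))) _
            As IDictionary(Of String, StringBuilder)
        Dim entries As IDictionary(Of String, StringBuilder) = New Dictionary(Of String, StringBuilder)
        For Each pair As KeyValuePair(Of String, String) In pairs
            Dim sb As StringBuilder = Nothing
            If Not entries.TryGetValue(pair.Key, sb) Then
                ' First time this headword is seen: create its buffer.
                sb = New StringBuilder()
                entries.Add(pair.Key, sb)
            End If
            ' New or existing, the definition is appended to the same buffer.
            sb.AppendLine(pair.Value)
        Next
        Return entries
    End Function

    Sub Main()
        ' Made-up sample data; "bank" appears twice on purpose.
        Dim pairs As New List(Of KeyValuePair(Of String, String))
        pairs.Add(New KeyValuePair(Of String, String)("bank", "sloping land beside a body of water"))
        pairs.Add(New KeyValuePair(Of String, String)("bank", "a financial institution"))
        pairs.Add(New KeyValuePair(Of String, String)("dorsal", "facing away from the axis of an organ or organism"))

        For Each kvp As KeyValuePair(Of String, StringBuilder) In GroupDefinitions(pairs)
            Console.WriteLine(kvp.Key & ":")
            Console.Write(kvp.Value.ToString())
        Next
    End Sub
End Module
```

Either approach gives the same result. The Try - Catch version pays the cost of an exception for each repeated headword, which may matter when a large share of the keys are duplicates.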

HTML() converts the file created in Rightfully() into individual numbered XML files, similar to but not exactly the same as the directory created for the XML dictionary in Christ Kennedy's CP Article GCIDE: A Complete English Language Dictionary. In my version, each file carries all entries for the same headword, rather than one file per definition. In both Wordley() and HTML(), I take advantage of the fact that the IDictionary will not allow duplicate keys by putting the Add in a Try - Catch. First, it tries to create a key for the headword of the next element. If the key does not already exist, this succeeds. Otherwise, it goes to Catch, where I look up the existing entry and add to it.

Unless you want to see the XML files and stylesheet and how they work together, I recommend commenting out the HTML() sub, as it will make 147,306 XML files using about 600 MB of disk space. If you just want to see a few and how they work, you can stop the project any time after the HTML() sub starts running, because the XSL stylesheet is already in place. Then, if you double-click an XML file, it will open in your web browser, but it will be a random selection, as the files are numbered. The stylesheet ("wn.xsl") is created programmatically and saved in the WordnetFiles directory when the directories are being created. Or...

Viewing the XML as HTML: The following code will make a simple Visual Basic browser with an autocomplete textbox for viewing the XML/HTML files. The XML file "WNdicty.xml" is created during the processing of the XML files and saves the dictionary as key-value pairs in the form <p><k></k><v></v></p>. This file doesn't get saved until all the other files are saved, so if you want to try this out, you will have to run the whole sub.

  1. In VS 2010, create a new Windows Forms project in Visual Basic targeting .NET Framework 3.5. It might work in other versions, but it is up to you to convert it if it doesn't.
  2. Add a TextBox and dock it at the top of the form.
  3. Add a WebBrowser control and set the Dock property to "Fill" and the ScriptErrorsSuppressed property to "True".
  4. Stretch the form out to a respectable viewing size.
  5. Double-click on the form (or press F7) to show the code for Form1. Replace the empty Form1 class with the following code.
  6. Copy the folder "WordnetFiles" created in this project into the debug folder of the new project.

This code is given with only brief comments and no further explanation, as a bare-bones viewer for looking up the files, for learning about XSL stylesheets (don't ask me - I read XML for Dummies before I found CodeProject), or as a base for building a better dictionary, should you care to do so. Otherwise, I recommend Wordley.

VB.NET
Imports System.Xml.Linq

Public Class Form1
    Public Shared AutoCompleteList As AutoCompleteStringCollection = New AutoCompleteStringCollection
    Public Shared WNDicty As IDictionary(Of String, String) = New Dictionary(Of String, String)
    ' Current directory with a trailing backslash; WordnetFiles sits beneath it.
    Public Shared whereiam As String = My.Computer.FileSystem.CurrentDirectory & "\"

    ' Loads the headword / file-number pairs and fills the autocomplete list.
    Private Sub autocompletefill()
        Dim DictySource As XElement = XElement.Load(whereiam & "WordnetFiles\WNdicty.xml")
        WNDicty.Clear()
        For Each kvp In DictySource.<p>
            Dim searchkey As String = kvp.<k>.Value
            Dim ID As String = kvp.<v>.Value
            WNDicty.Add(searchkey, ID)
            AutoCompleteList.Add(searchkey)
        Next
    End Sub

    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        autocompletefill()
        Me.TextBox1.Select(0, 1)
        TextBox1.AutoCompleteSource = AutoCompleteSource.CustomSource
        TextBox1.AutoCompleteCustomSource = AutoCompleteList
        TextBox1.AutoCompleteMode = AutoCompleteMode.SuggestAppend
        WebBrowser1.Navigate(whereiam & "WordnetFiles\032\032088.xml", False)
    End Sub

    ' On Enter, looks up the typed headword and opens its numbered XML file.
    Private Sub TextBox1_KeyDown(sender As Object, e As System.Windows.Forms.KeyEventArgs) Handles TextBox1.KeyDown
        If e.KeyCode = Keys.Enter Then
            Dim path As String = ""
            If WNDicty.TryGetValue(TextBox1.Text, path) Then
                ' The first three digits of the file number name the subfolder.
                Dim foldername As String = path.Substring(0, 3) & "\"
                Dim makeurl As String = "file://"
                Dim filelocation As String = makeurl & whereiam & _
                   "WordnetFiles\" & foldername & path & ".xml"
                WebBrowser1.Navigate(filelocation, False)
            End If
        End If
    End Sub
End Class

On the point of XSL stylesheets: I added extra title (tooltip) attributes to give the hover explanations, plus colors, a link to the WordNet site, etc. Yes, because it is in a WebBrowser control, it does look up http addresses if they are in the link. The result is intentionally a bit obnoxious, to give you an incentive to learn to edit the XSL or to use Wordley.

Points of Interest

In this article, I have attempted to show that there is a right way and a wrong way to do something and that time invested at the beginning to work out what you are going to do is time well spent.

I give a couple of examples of ways to do string manipulation of XML files. Rightfully() shows the string manipulation that converts the WordNet synsets into dictionary entries with correct example sentences. Wordley() shows further string manipulation and a way to convert the XML to .txt files compatible with Wordley. In HTML(), I show how to convert the XML document into individual XML files, one per word, each with an XSLT stylesheet applied that converts it to HTML.

HTML() also shows an example of using XDocument in real life. I did a lot of searching and there isn't much available on it that I could find. This is useful for including the processing instructions for converting the XML documents to HTML with the XSL stylesheet.
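As a rough illustration of that point, the sketch below builds a small XDocument with an xml-stylesheet processing instruction ahead of the root element. It is not the HTML() sub: the module name StylesheetInstruction, the function BuildEntryDocument and the relative href "../wn.xsl" are assumptions made for this example (the real path depends on where the numbered files sit relative to the stylesheet).

```vb
Imports System
Imports System.Xml.Linq

Module StylesheetInstruction
    ' Hypothetical helper: builds a one-entry document whose first node is an
    ' xml-stylesheet processing instruction. The href is an assumed path.
    Function BuildEntryDocument(ByVal headword As String, ByVal partOfSpeech As String, _
                                ByVal definition As String) As XDocument
        Return New XDocument( _
            New XDeclaration("1.0", "utf-8", "yes"), _
            New XProcessingInstruction("xml-stylesheet", "type=""text/xsl"" href=""../wn.xsl"""), _
            New XElement("entry", _
                New XElement("hw", headword), _
                New XElement("pos", partOfSpeech), _
                New XElement("def", definition)))
    End Function

    Sub Main()
        Dim doc As XDocument = BuildEntryDocument("transmogrify", "v", _
            "change completely the nature or appearance of")

        ' ToString() leaves out the XML declaration; doc.Save(path) writes it.
        Console.WriteLine(doc.ToString())
    End Sub
End Module
```

When a browser opens a file saved this way, it reads the processing instruction, loads the stylesheet it names, and renders the transformed result instead of the raw XML.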

I attempt to show that XML is a versatile way to convert data from one form into another.

The WordNet project is part of the subject of computational lexicology. I am using it as the base for the main project I'm working on, and the HTML() sub is a modified part of my current working model for that project. It will probably be very different by the time I am done with it. The more I study it, the more I find there is to learn, but the one thing I have not yet seen is a definition of "computational lexicographer". So I would like to propose one: someone who applies both computer programming and lexicology in order to build a computer program that can help build a better dictionary. You know, not just the hack who tries to interpret what the lexi-guy wants, but someone actually studying and applying it from both ends. Thanks to Princeton for their WordNet project!

This is my article.single for CodeProject but I hope to make it my article.first. If I get a favorable response, then maybe I can show it to the Personnel Director to support my claim that I would be more valuable in IT than in Maintenance...

History

  • 5th November, 2012: Released
  • 19th November, 2012: Minor typos & clarifications in article; fixed point in stylesheet that occasionally rendered the wrong part of speech for a word

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)