Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Dictionary for Google Suggest like Dictionary

0.00/5 (No votes)
5 Jun 2005 2  
To make a dictionary that will work with Google Suggest like Dictionary.

Sample Image

Introduction

When I read the Google Suggest like Dictionary article which was written by Gavi Narra, I found that the dictionary was HTML plain text format that needs to be converted into a database table.

The data comes from a freely available online dictionary (a public domain English word list dictionary, based on the public domain portion of "The Project Gutenberg Etext of Webster's Unabridged Dictionary" which is in turn based on the 1913 US Webster's Unabridged Dictionary. You could download it from here.).

I built a small VB.NET application that will convert these HTML files and the data in them to a meaningful database table.

Using the code

Two steps are required for the conversion to succeed:

  1. Converting an HTML file into character separated values.
  2. Processing the separated values to generate INSERT INTO statements.

I used MS Access database to save the results but I believe that using MS SQL is more appropriate both while conversion and as a dictionary.

Step 1 (Making the substrate)

The substrate is the result of the HTML plain text conversion into separated values (CSV). The HTML file contains plain text dictionary as the follows:

Islam (n.) The religion of the Mohammedans; Mohammedanism; 
Islamism. Their formula of faith is: There is no God but Allah, 
and Mohammed is his prophet.

which must be converted somehow to be readable and to be inserted into database. The best way is to use character separated values as the following:

Islam@(n.)@The religion of the Mohammedans; Mohammedanism; 
Islamism. Their formula of faith is: There is no God but Allah, 
and Mohammed is his prophet.

where @ is the character used to separate between fields. The result after database insertion would be:

Dict
Word Islam
Type (n.)
Mean The religion of the Mohammedans; Mohammedanism; Islamism. Their formula of faith is: There is no God but Allah, and Mohammed is his prophet.

The scheme in which the project will run is HTML ----step1----> CSV ----step2----> Database.

Code conversion from HTML to CSV is placed at the Click event of a button. The HTML file contains some unwanted tags that could be used as a hint for conversion. txtpath2 textbox is the HTML file path while txtpathtarget textbox is the destination file path that would be in any extension. Opening the file name specified in txtpath2 using the IO.StreamReader and replacing some tags with the @ character, then saving the file in another location are the core actions of this step.

Private Sub Button2_Click(ByVal sender As System.Object, _
         ByVal e As System.EventArgs) Handles Button2.Click
  progbar2.Value = 0
  Dim myfile = New IO.StreamReader(txtpath2.Text)
  Try

    '---Proceesing the file that contain delimited contents

    Dim Line As String
    Line = myfile.Readtoend

    '----Remove unwanted tags

    Line = Line.Replace("<P>", "")
    progbar2.Value += 1

    Line = Line.Replace("</P>", "")
    progbar2.Value += 1

    Line = Line.Replace("<B>", "")
    progbar2.Value += 1

    Line = Line.Replace(" (<I></I>) ", "@")
    progbar2.Value += 1

    Line = Line.Replace("</B>", "@")
    progbar2.Value += 1

    Line = Line.Replace(" (<I>", "")
    progbar2.Value += 1

    Line = Line.Replace("</I>) ", "@")
    progbar2.Value += 1

    Dim myfileSave As New IO.StreamWriter(txtpathtarget.Text, True)
    myfileSave.Write(Line)
    myfileSave.Close()
    progbar2.Value += 1

    MessageBox.Show("Substrate completed successfully", _
                    "Dict Maker")


  Catch ex As IO.IOException
    MessageBox.Show("Error: " & ex.Message, "Critical error", _
                    MessageBoxButtons.OK, MessageBoxIcon.Error)
  Catch ex As Exception
    MessageBox.Show("Error: " & ex.Message, "Critical error", _
                    MessageBoxButtons.OK, MessageBoxIcon.Error)
  Finally
    myfile.Close()
  End Try

End Sub

As you can see, a destination file should be created that will contain the CSV. This file will be used in step 2 to generate the INSERT INTO statement.

Step 2 (Binding to enzyme)

Once you finish converting HTML files into CSV files, you can furthermore generate from CSV files the needed INSERT INTO statements. In this step, you'll specify the CSV file which is made by step 1.

This step consists of opening the database connection, generating the SQL commands, processing the CSV line by line using the IO.StreamReader and executing the commands. The txtpath textbox contains the CSV file path. I used the split function which identifies the substrings in a string that are delimited by one or more characters specified in an array (@), then places the substrings into a String array.

The array returned will have an upper bound of 2 (3 elements) 0 - 1 - 2. The array elements are word, type and mean respectively. An SQL statement will be generated corresponding to each value. As I used MS Access database, commands should be executed one by one. If you change the database to MS SQL, you can generate the INSERT INTO statements and then execute them at once as MS SQL server supports batch or multi-query. This will add to the application performance.

The conversion is attached with progress bar, which will take its maximum value from CalculateEstimate function that will loop throughout the file CSV to calculate the lines in it which will be processed.

Note: Don't forget to specify the data source path in the connection string or the application will not work.

Private Sub Button1_Click(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles Button1.Click
  ProgBar.Value = 0
  If txtpath.Text = "" Then Exit Sub
  MsgBox(txtpath.Text)
  '-----Open the connection

  Dim con As New OleDb.OleDbConnection
  Dim com As New OleDb.OleDbCommand
  '------

  con.ConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" & _ 
                         "Data Source=d:\vbprograms\dictmaker\dict.mdb"

  '----calculate estimates and set the progress bar

  ProgBar.Maximum = CalculateEstimate(txtpath.Text)

  '----

  Dim myfile = New IO.StreamReader(txtpath.Text)
  Try

    '--open the database

    con.Open()

    '---Proceesing the file that contain delimited contents

    Dim Line As String
    Dim AfterSep() As String

    '--------------------Ignore the first 9 lines

    For t As Integer = 1 To 9
      myfile.ReadLine()
    Next t
    '--------------------


    While myfile.Peek > -1
      Application.DoEvents()

      Line = myfile.ReadLine()
      If Line = "</BODY>" Then Exit While
      '----If the carriage return, request next line

      If Line = "" Then Line = myfile.ReadLine()

      AfterSep = Line.Split("@"c)

      '----Build the command and executeit

      com.Connection = con

      com.CommandText = _
         "INSERT INTO dict (word,type,mean) VALUES (?,?,?)"
      com.Parameters.Clear()
      com.Parameters.Add("?", AfterSep(0))
      com.Parameters.Add("?", AfterSep(1))
      com.Parameters.Add("?", AfterSep(2))
      com.ExecuteNonQuery()

      '----Set value progress

      ' If ProgBar.Value = 5136 Then Stop

      ProgBar.Value += 1
      lblprocessed.Text = "Line(s) processed : " & ProgBar.Value

    End While

    MessageBox.Show("Conversion completed successfully", "Dict Maker")

  Catch ex As OleDb.OleDbException
    MessageBox.Show("Error: " & ex.Message, _
      "Critical error", MessageBoxButtons.OK, _
      MessageBoxIcon.Error)

  Catch ex As IO.IOException
    MessageBox.Show("Error: " & ex.Message, _
      "Critical error", MessageBoxButtons.OK, MessageBoxIcon.Error)
  Catch ex As Exception
    MessageBox.Show("Error: " & ex.Message, _
      "Critical error", MessageBoxButtons.OK, MessageBoxIcon.Error)
  Finally
    con.Close()
    myfile.Close()
  End Try

End Sub 

Calculates the lines which will be processed:

Function CalculateEstimate(ByVal path As String) As Integer
  Dim estimate As New IO.StreamReader(path)
  Dim NumOfLines As Integer
  While estimate.Peek > -1
    estimate.ReadLine()
    NumOfLines += 1
  End While
  estimate.Close()
  lblTotalLines.Text = "Total Line(s): " & NumOfLines - 11
  Return NumOfLines
End Function

Finally

Although VS2005 ASP 2.0 covered the idea Gavi Narra wrote about (HttpXmlRequest), really his work is great and I hope my work to be the complementary.

Best regards.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here