Introduction
When I read the Google Suggest like Dictionary article which was written by Gavi Narra, I found that the dictionary was HTML plain text format that needs to be converted into a database table.
The data comes from a freely available online dictionary (a public domain English word list dictionary, based on the public domain portion of "The Project Gutenberg Etext of Webster's Unabridged Dictionary" which is in turn based on the 1913 US Webster's Unabridged Dictionary. You could download it from here.).
I built a small VB.NET application that will convert these HTML files and the data in them to a meaningful database table.
Using the code
Two steps are required for the conversion to succeed:
- Converting an HTML file into character separated values.
- Processing the separated values to generate
INSERT INTO
statements.
I used MS Access database to save the results but I believe that using MS SQL is more appropriate both while conversion and as a dictionary.
Step 1 (Making the substrate)
The substrate is the result of the HTML plain text conversion into separated values (CSV). The HTML file contains plain text dictionary as the follows:
Islam (n.) The religion of the Mohammedans; Mohammedanism;
Islamism. Their formula of faith is: There is no God but Allah,
and Mohammed is his prophet.
which must be converted somehow to be readable and to be inserted into database. The best way is to use character separated values as the following:
Islam@(n.)@The religion of the Mohammedans; Mohammedanism;
Islamism. Their formula of faith is: There is no God but Allah,
and Mohammed is his prophet.
where @ is the character used to separate between fields. The result after database insertion would be:
Dict |
Word |
Islam |
Type |
(n.) |
Mean |
The religion of the Mohammedans; Mohammedanism; Islamism. Their formula of faith is: There is no God but Allah, and Mohammed is his prophet. |
The scheme in which the project will run is HTML ----step1----> CSV ----step2----> Database.
Code conversion from HTML to CSV is placed at the Click
event of a button. The HTML file contains some unwanted tags that could be used as a hint for conversion. txtpath2
textbox is the HTML file path while txtpathtarget
textbox is the destination file path that would be in any extension. Opening the file name specified in txtpath2
using the IO.StreamReader
and replacing some tags with the @ character, then saving the file in another location are the core actions of this step.
Private Sub Button2_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles Button2.Click
progbar2.Value = 0
Dim myfile = New IO.StreamReader(txtpath2.Text)
Try
Dim Line As String
Line = myfile.Readtoend
Line = Line.Replace("<P>", "")
progbar2.Value += 1
Line = Line.Replace("</P>", "")
progbar2.Value += 1
Line = Line.Replace("<B>", "")
progbar2.Value += 1
Line = Line.Replace(" (<I></I>) ", "@")
progbar2.Value += 1
Line = Line.Replace("</B>", "@")
progbar2.Value += 1
Line = Line.Replace(" (<I>", "")
progbar2.Value += 1
Line = Line.Replace("</I>) ", "@")
progbar2.Value += 1
Dim myfileSave As New IO.StreamWriter(txtpathtarget.Text, True)
myfileSave.Write(Line)
myfileSave.Close()
progbar2.Value += 1
MessageBox.Show("Substrate completed successfully", _
"Dict Maker")
Catch ex As IO.IOException
MessageBox.Show("Error: " & ex.Message, "Critical error", _
MessageBoxButtons.OK, MessageBoxIcon.Error)
Catch ex As Exception
MessageBox.Show("Error: " & ex.Message, "Critical error", _
MessageBoxButtons.OK, MessageBoxIcon.Error)
Finally
myfile.Close()
End Try
End Sub
As you can see, a destination file should be created that will contain the CSV. This file will be used in step 2 to generate the INSERT INTO
statement.
Step 2 (Binding to enzyme)
Once you finish converting HTML files into CSV files, you can furthermore generate from CSV files the needed INSERT INTO
statements. In this step, you'll specify the CSV file which is made by step 1.
This step consists of opening the database connection, generating the SQL commands, processing the CSV line by line using the IO.StreamReader
and executing the commands. The txtpath
textbox contains the CSV file path. I used the split
function which identifies the substrings in a string that are delimited by one or more characters specified in an array (@), then places the substrings into a String
array.
The array returned will have an upper bound of 2 (3 elements) 0 - 1 - 2. The array elements are word, type and mean respectively. An SQL statement will be generated corresponding to each value. As I used MS Access database, commands should be executed one by one. If you change the database to MS SQL, you can generate the INSERT INTO
statements and then execute them at once as MS SQL server supports batch or multi-query. This will add to the application performance.
The conversion is attached with progress bar, which will take its maximum value from CalculateEstimate
function that will loop throughout the file CSV to calculate the lines in it which will be processed.
Note: Don't forget to specify the data source path in the connection string or the application will not work.
Private Sub Button1_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles Button1.Click
ProgBar.Value = 0
If txtpath.Text = "" Then Exit Sub
MsgBox(txtpath.Text)
Dim con As New OleDb.OleDbConnection
Dim com As New OleDb.OleDbCommand
con.ConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=d:\vbprograms\dictmaker\dict.mdb"
ProgBar.Maximum = CalculateEstimate(txtpath.Text)
Dim myfile = New IO.StreamReader(txtpath.Text)
Try
con.Open()
Dim Line As String
Dim AfterSep() As String
For t As Integer = 1 To 9
myfile.ReadLine()
Next t
While myfile.Peek > -1
Application.DoEvents()
Line = myfile.ReadLine()
If Line = "</BODY>" Then Exit While
If Line = "" Then Line = myfile.ReadLine()
AfterSep = Line.Split("@"c)
com.Connection = con
com.CommandText = _
"INSERT INTO dict (word,type,mean) VALUES (?,?,?)"
com.Parameters.Clear()
com.Parameters.Add("?", AfterSep(0))
com.Parameters.Add("?", AfterSep(1))
com.Parameters.Add("?", AfterSep(2))
com.ExecuteNonQuery()
ProgBar.Value += 1
lblprocessed.Text = "Line(s) processed : " & ProgBar.Value
End While
MessageBox.Show("Conversion completed successfully", "Dict Maker")
Catch ex As OleDb.OleDbException
MessageBox.Show("Error: " & ex.Message, _
"Critical error", MessageBoxButtons.OK, _
MessageBoxIcon.Error)
Catch ex As IO.IOException
MessageBox.Show("Error: " & ex.Message, _
"Critical error", MessageBoxButtons.OK, MessageBoxIcon.Error)
Catch ex As Exception
MessageBox.Show("Error: " & ex.Message, _
"Critical error", MessageBoxButtons.OK, MessageBoxIcon.Error)
Finally
con.Close()
myfile.Close()
End Try
End Sub
Calculates the lines which will be processed:
Function CalculateEstimate(ByVal path As String) As Integer
Dim estimate As New IO.StreamReader(path)
Dim NumOfLines As Integer
While estimate.Peek > -1
estimate.ReadLine()
NumOfLines += 1
End While
estimate.Close()
lblTotalLines.Text = "Total Line(s): " & NumOfLines - 11
Return NumOfLines
End Function
Finally
Although VS2005 ASP 2.0 covered the idea Gavi Narra wrote about (HttpXmlRequest
), really his work is great and I hope my work to be the complementary.
Best regards.