Introduction
In the previous article, we saw a simple VB application that retrieves the HTML page at a particular URL. In this article, we will build a small web crawler that crawls through all the links found at a given URL.
- Setting up the Visual Basic Environment with required Components and Libraries:
- Open Visual Basic and create a new project (use Standard EXE).
- Select Project -> References from the main menu and add the following Microsoft library:
- Microsoft HTML Object Library
- Add Microsoft Windows Common Controls to the toolbox as follows. Select Project -> Components from the main menu to open the Components window. With the Controls tab selected, scroll down and select the check box next to the following component:
- Microsoft Windows Common Controls 6.x
- Set up the UI for the Crawler
- Add a label, a text box for the URL (txtURL), two command buttons (cmdStart and cmdGet), a list box (lstLinks), and a TreeView control (trvLinks) as below:
- Add the code for the Crawler:
- When the Start button is clicked, populate the list box with all the links under the given URL:
Private Sub cmdStart_Click()
    getLinks txtURL.Text, 1
End Sub
- The getLinks procedure populates either the list box or the TreeView, depending on its second parameter. Here, since the parameter is 1, it populates the list box with all the links under the URL:
Private Sub getLinks(strURL As String, iParentChild As Integer, _
                     Optional iParentNo As Integer)
    Dim objLink As HTMLLinkElement
    Dim objMSHTML As New MSHTML.HTMLDocument
    Dim objDoc As New MSHTML.HTMLDocument
    Dim objNode As Node

    ' Load the page passed in strURL, so that inner links are fetched
    ' from the selected link rather than from the URL in the text box
    Set objDoc = objMSHTML.createDocumentFromUrl(strURL, vbNullString)
    MousePointer = vbHourglass

    ' Keep the UI responsive until the document has finished loading
    While objDoc.readyState <> "complete"
        DoEvents
    Wend

    For Each objLink In objDoc.links
        If iParentChild = 1 Then
            ' Mode 1: list the link in the list box
            lstLinks.AddItem objLink
        ElseIf iParentChild = 2 Then
            ' Mode 2: add the link as a child of tree node iParentNo
            Set objNode = trvLinks.Nodes.Add(iParentNo, tvwChild)
            objNode.Text = objLink
        End If
    Next
    MousePointer = vbDefault
End Sub
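The literals 1 and 2 that select the target control are easy to mix up. As a minimal sketch, they could be given names with an Enum. The names LinkTarget, ltListBox, and ltTreeView are assumptions made for illustration and do not appear in the original code:

```vb
' Hypothetical names for the second argument of getLinks.
' The Enum must sit in the declarations section of the form,
' above the first procedure.
Private Enum LinkTarget
    ltListBox = 1    ' add each link to lstLinks
    ltTreeView = 2   ' add each link as a child node in trvLinks
End Enum

' The parameter could then be declared with the Enum type, which
' documents the allowed values at the call site:
'   Private Sub getLinks(strURL As String, iParentChild As LinkTarget, _
'                        Optional iParentNo As Integer)

Private Sub cmdStart_Click()
    ' Same call as before, with the mode spelled out
    getLinks txtURL.Text, ltListBox
End Sub
```

Because the Enum members keep the values 1 and 2, the comparisons inside getLinks would continue to work unchanged.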
- To go a level deeper into some of the links, the user can select them in the list box and press the Get Inner Links button:
Private Sub cmdGet_Click()
    Dim iCount As Integer
    If lstLinks.SelCount = 0 Then
        MsgBox "Please Select a Link"
        Exit Sub
    Else
        iCount = 0
        While iCount <= lstLinks.ListCount - 1
            If lstLinks.Selected(iCount) Then
                ' Add the selected link as a root node, then fill in its children
                trvLinks.Nodes.Add , , , lstLinks.List(iCount)
                getLinks lstLinks.List(iCount), 2, trvLinks.Nodes.Count
                ' Removing the item shifts the next one into position iCount,
                ' so the index is not advanced here
                lstLinks.RemoveItem (iCount)
                lstLinks.Refresh
            Else
                iCount = iCount + 1
            End If
        Wend
    End If
End Sub
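The code above passes trvLinks.Nodes.Count as the parent index, relying on the newly added root node being the last one in the Nodes collection. A slightly more explicit sketch keeps the Node returned by Nodes.Add and passes its Index instead. The helper name AddRootAndCrawl is hypothetical and is introduced only for illustration:

```vb
' Hypothetical helper, not part of the original article.
' Adds strURL as a root node and fills in its links as children.
Private Sub AddRootAndCrawl(strURL As String)
    Dim objRoot As Node

    ' Nodes.Add returns the Node it created, so keep a reference to it
    Set objRoot = trvLinks.Nodes.Add(, , , strURL)

    ' objRoot.Index identifies the new node directly; the original code
    ' passes trvLinks.Nodes.Count, which should hold the same value
    ' immediately after the Add
    getLinks strURL, 2, objRoot.Index
End Sub
```

Inside the loop in cmdGet_Click, the two lines that add the node and call getLinks could then be replaced by a single call such as AddRootAndCrawl lstLinks.List(iCount).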
- All the inner links are added to the TreeView. To drill down further, the user can double-click any of those URLs in the TreeView:
Private Sub trvLinks_DblClick()
    getLinks trvLinks.SelectedItem.Text, 2, trvLinks.SelectedItem.Index
End Sub
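The handler above assumes that a node is selected and that it has not been expanded before. The following is a sketch of a more defensive variant, under the assumption that double-clicking an already populated node should not add the same links a second time. It is one possible approach, not the original author's code:

```vb
Private Sub trvLinks_DblClick()
    Dim objNode As Node

    Set objNode = trvLinks.SelectedItem

    ' Nothing is selected (for example, the tree is still empty)
    If objNode Is Nothing Then Exit Sub

    ' Children returns the number of child nodes; skip nodes that were
    ' already populated so that their links are not added twice
    If objNode.Children > 0 Then Exit Sub

    getLinks objNode.Text, 2, objNode.Index
End Sub
```

A page that genuinely contains no links would be fetched again on every double-click with this check, which is a trade-off of using the child count as the only "already crawled" marker.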
- Finally, the screen will look something like this:
History
- 12th August, 2002: Initial post