Introduction
The Search Engine module searches entire pages, including dynamic pages,
for matching keywords or a phrase. It counts how many times the keywords
or phrase occur on each page and displays the results with the highest
match counts first. The module searches all files whose extensions are listed
in the web.config file where indicated.
Files or folders that you don't want searched can also be listed in the
web.config file where indicated.
You can also choose the encoding used to read the pages.
This updated article contains tips to globalize and enhance the code.
Note: The module is best suited to small sites. You can also modify the code to
crawl pages internally by using regular expressions. For a larger site, you will
need to write the page data to an XML file periodically and then search that XML
file. I have offered tips at the end of the section to do so. I have also
included a demo project that reads and writes XML.
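The write-once, read-many idea in the note can be sketched with the DataSet class, which can save itself to XML and load itself back. This is only a minimal illustration, not the demo project's code: the table layout, the SiteIndex name and the temporary file location are assumptions made for the example.

```vbnet
Imports System
Imports System.Data
Imports System.IO

Module XmlIndexSketch
    Sub Main()
        ' Build a small table shaped loosely like the article's "Pages" table.
        Dim pages As New DataTable("Pages")
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("Path", GetType(String))
        pages.Columns.Add("Contents", GetType(String))
        pages.Rows.Add("Home", "default.aspx", "welcome to the site")

        Dim index As New DataSet("SiteIndex")
        index.Tables.Add(pages)

        ' Hypothetical file name and location, chosen only for this sketch.
        Dim indexPath As String = Path.Combine(Path.GetTempPath(), "SiteIndex.xml")

        ' Written periodically (for example from an admin page)...
        index.WriteXml(indexPath, XmlWriteMode.WriteSchema)

        ' ...and read back for each search instead of crawling the site.
        Dim loaded As New DataSet()
        loaded.ReadXml(indexPath, XmlReadMode.ReadSchema)
        Console.WriteLine(loaded.Tables("Pages").Rows.Count)
    End Sub
End Module
```

Writing the schema along with the data keeps the column types intact when the file is read back.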
Background
A site search engine helps users find the pages that interest them. When I
was working on an ASP.NET project, I had to add a site search module. I had
one in ASP but not in .NET. Hence the birth of this site search engine. My first
version was just a single web form, and I had not exploited the full features of
the object-oriented .NET language. In my spare time, I reworked my code to make
the maximum use of the object-oriented language. For this article, I further
enhanced my design on the basis of experience and the good practices
suggested by different authors.
Mr. Song Tao from Beijing, China approached me with questions about how to
convert the module to Chinese. With his help, I enhanced the code to support
other languages. A few users also encountered errors when SiteSearch.aspx
was placed in the root. I modified the code to rectify the error.
Source Code Overview
The structure of the Site Search Engine is as follows:
Classes
The ability to define a class and create instances of classes is one of the
most important capabilities of any object-oriented language. In the coming
section, we see the classes that we have used in the search module.
Class Name | Description
SiteSearch | Class for the web form where the user can search the site for certain words
Searchs.CleanHtml | Class to clean the HTML content
Searchs.FileContent | Class to get the content from the HTML file
Searchs.Page | Class to store the data of a page
Searchs.PagesDataset | Class to create the dataset and store results in it
Searchs.Site | Class to read the configuration of the site
Searchs.UserSearch | Class to store the search information per user
SiteSearch.aspx
Web Forms are one of the new, exciting features in Microsoft's .NET
initiative. SiteSearch.aspx is a web form which is also the start page
for the search module.
A Web Forms page consists of a page (ASPX file) and a code behind file
(.aspx.cs file or .aspx.vb file). Our web form consists of
SiteSearch.aspx and SiteSearch.aspx.vb. I will be treating them
simultaneously touching on the main elements of the web form.
ASP.NET is an event-driven programming environment. We will see some event
handlers and methods in the coming section.
Page_Load
The server controls are loaded on the Page object and the view state
information is available at this point. The Page_Load event handler checks
whether sSite is Nothing and, if so, assigns the Session("Site") variable
to it.
Private Sub Page_Load(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles MyBase.Load
If IsNothing(sSite) Then
sSite = Session("Site")
End If
End Sub
srchbtn_Click
The search button event is fired when the search button is clicked. Here we
put the code to change control settings or display text on the page. We
check whether the search box contains text and then call the SearchSite
method. DisplayContent() is then called to assign values to the different
controls in the web page.
Private Sub srchbtn_Click(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles srchbtn.Click
    Dim strSearchWords As String
    pnlSearchResults.Visible = False
    strSearchWords = Trim(Request.Params("search"))
    If Not strSearchWords.Equals("") Then
        'Base URL that is used later to request dynamic pages
        Searchs.Site.ApplicationPath = String.Format("http://{0}{1}", _
            Request.ServerVariables("HTTP_HOST"), Request.ApplicationPath)
        sSite = SearchSite(strSearchWords)
        Session("Site") = sSite
        dgrdPages.CurrentPageIndex = 0
        DisplayContent()
    End If
End Sub
DisplayContent
DisplayContent() is called to assign values to the different controls in the
web page. The DataGrid content is set by calling the BindDataGrid method.
ViewState("SortExpression") is used to store the sort expression.
Private Sub DisplayContent()
If Not IsNothing(sSite.PageDataset) Then
pnlSearchResults.Visible = True
lblSearchWords.Text = sSite.SearchWords
If ViewState("SortExpression") Is Nothing Then
ViewState("SortExpression") = "MatchCount Desc"
End If
BindDataGrid(ViewState("SortExpression"))
lblTotalFiles.Text = sSite.TotalFilesSearched
lblFilesFound.Text = sSite.TotalFilesFound
End If
End Sub
SearchSite
The main call to the search takes place in this method. The UserSearch
class, which we will cover shortly, stores the entire search information and
the result of the search. A UserSearch object, srchSite, is created and its
SearchWords and SearchCriteria properties are assigned. The srchSite.Search
method is then called.
Private Function SearchSite(ByVal strSearch _
As String) As Searchs.UserSearch
Dim srchSite As Searchs.UserSearch
srchSite = New Searchs.UserSearch()
srchSite.SearchWords = strSearch
If Phrase.Checked Then
srchSite.SearchCriteria = Searchs.SearchCriteria.Phrase
ElseIf AllWords.Checked Then
srchSite.SearchCriteria = Searchs.SearchCriteria.AllWords
ElseIf AnyWords.Checked Then
srchSite.SearchCriteria = Searchs.SearchCriteria.AnyWords
End If
srchSite.Search(Server.MapPath("./"))
Return srchSite
End Function
DataGrid
The DataGrid control renders a multi-column, fully templated grid and is by
far the most versatile of all data-bound controls. Moreover, the DataGrid
control is the ASP.NET control of choice for data reporting. Hence, I have
used it to display the search results. Since the focus of the article is the
internal search engine, I will just give a brief overview of the DataGrid
features used here.
Databinding
Data binding is the process of retrieving data from a source and dynamically
associating it with a property of a visual element. Because a DataGrid
handles (or at least has in memory) more items simultaneously, you should
associate the DataGrid explicitly with a collection of data, that is, the
data source.
The content of a DataGrid is set by using its DataSource property. The entire
search result is stored in sSite.PageDataset.Tables("Pages"). Hence the
content of the DataGrid is set to dvwPages, i.e.
sSite.PageDataset.Tables("Pages").DefaultView.
The BindDataGrid method is called from DisplayContent each time the results
are displayed.
Private Sub BindDataGrid(ByVal strSortField As String)
Dim dvwPages As DataView
dvwPages = sSite.PageDataset.Tables("Pages").DefaultView
dvwPages.Sort = strSortField
dgrdPages.DataSource = dvwPages
dgrdPages.DataBind()
End Sub
The control has the ability to automatically generate columns that are based
on the structure of the data source. Auto-generation is the default behavior of
DataGrid, but you can manipulate that behavior by using a Boolean property
named AutoGenerateColumns. Set the property to False when you want the control
to display only the columns you explicitly add to the Columns collection. Set
it to True (the default) when you want the control to add as many columns as
are required by the data source. Auto-generation does not let you specify the
header text, nor does it provide text formatting. Hence, here I have set it to
False. You typically bind columns using the <Columns> tag in the body of the
<asp:datagrid> server control.
<Columns>
<asp:TemplateColumn>
<ItemTemplate>
<%# DisplayTitle(Container.DataItem( "Title" ), _
Container.DataItem( "Path" )) %>
<br>
<%# Container.DataItem( "Description" ) %>
<br>
<span class="Path">
<%# String.Format("{0} - {1}kb", DisplayPath( _
Container.DataItem( "Path" )) , _
Container.DataItem( "Size" ))%>
</span>
<br>
<br>
</ItemTemplate>
</asp:TemplateColumn>
</Columns>
The DisplayTitle and DisplayPath methods are used to display customized
information in the columns of the DataGrid.
Protected Function DisplayTitle(ByVal Title _
As String, ByVal Path As String) As String
'Quotes inside a VB string literal must be doubled
Return String.Format("<A href=""{1}"">{0}</A>", Title, Path)
End Function
Protected Function DisplayPath(ByVal Path As String) As String
Return String.Format("{0}{1}/{2}", _
Request.ServerVariables("HTTP_HOST"), _
Request.ApplicationPath, Path)
End Function
Pagination
Unlike the DataList control, the DataGrid control supports data pagination,
that is, the ability to divide the displayed data source rows into pages. The
size of our data source easily exceeds the page real estate. So to preserve
scalability on the server and to provide a more accessible page to the user,
you display only a few rows at a time. To enable pagination of the DataGrid
control, you need to tell the control about it. You do this through the
AllowPaging property.
The pager bar is an interesting and complimentary feature offered by the
DataGrid control to let users easily move from page to page. The pager bar is
a row displayed at the bottom of the DataGrid control that contains links to
the available pages. When you click any of these links, the control
automatically fires the PageIndexChanged event and updates the page index
accordingly. dgrdPages_PageIndexChanged is called when the page index changes.
Protected Sub dgrdPages_PageIndexChanged(ByVal s As Object, _
ByVal e As DataGridPageChangedEventArgs) _
Handles dgrdPages.PageIndexChanged
dgrdPages.CurrentPageIndex = e.NewPageIndex
DisplayContent()
End Sub
You control the pager bar by using the PagerStyle property's Mode attribute.
Values for the Mode attribute come from the PagerMode enumeration. Here we
have chosen a detailed series of numeric buttons, each of which points to a
particular page.
<PagerStyle CssClass="GridPager" Mode="NumericPages"></PagerStyle>
Sorting
The DataGrid control does not actually sort rows, but it provides good
support for sorting as long as the sorting capabilities of the underlying data
source are adequate. The data source is always responsible for returning a
sorted set of records based on the sort expression selected by the user
through the DataGrid control's user interface. The built-in sorting mechanism
is triggered by setting the AllowSorting property to True.
dgrdPages_SortCommand is called to sort the DataGrid. The SortCommand event
handler knows about the sort expression through the SortExpression property,
which is provided by the DataGridSortCommandEventArgs class. In our code, the
sort information is persisted because it is stored in a slot in the page's
ViewState collection.
Note: In my pages, I have disabled the header, but if the header is shown, you
can use it to sort the DataGrid.
Protected Sub dgrdPages_SortCommand(ByVal s As Object, _
ByVal e As DataGridSortCommandEventArgs) _
Handles dgrdPages.SortCommand
ViewState("SortExpression") = e.SortExpression
DisplayContent()
End Sub
Page.vb
The role of the Page object is to store the data related to each page of the
site.
The Page class defines the following properties:
Property | Description
Size | Stores the size of the file in kilobytes
Path | Stores the path of the file
Title | Stores the text in the HTML title tag
Keywords | Stores the text in the HTML meta keywords tag
Description | Stores the text in the HTML meta description tag
Contents | Stores the text of the HTML page
MatchCount | Stores the number of matches found in the HTML page
Public Property Size() As Decimal
Get
Return m_size
End Get
Set(ByVal Value As Decimal)
m_size = Value
End Set
End Property
Public Property Path() As String
Get
Return m_path
End Get
Set(ByVal Value As String)
m_path = Value
End Set
End Property
Public Property Title() As String
Get
Return m_title
End Get
Set(ByVal Value As String)
m_title = Value
End Set
End Property
Public Property Keywords() As String
Get
Return m_keywords
End Get
Set(ByVal Value As String)
m_keywords = Value
End Set
End Property
Public Property Description() As String
Get
Return m_description
End Get
Set(ByVal Value As String)
m_description = Value
End Set
End Property
Public Property Contents() As String
Get
Return m_contents
End Get
Set(ByVal Value As String)
m_contents = Value
End Set
End Property
Public Property MatchCount() As Integer
Get
Return m_matchcount
End Get
Set(ByVal Value As Integer)
m_matchcount = Value
End Set
End Property
The Page class has three private methods (SearchPhrase, SearchWords and
SearchPattern) and two public methods (CheckFileInfo and Search).
It defines the following methods:
CheckFileInfo Method
It is a public method which checks whether the title, description and contents
exist. If the text for the title is empty, it assigns the default value
"No Title". If the text for the description is empty, it assigns either the
first 200 characters of the file contents or the default value "There is no
description available for this page".
Public Sub CheckFileInfo()
    'OrElse short-circuits, so Trim() is never called on Nothing
    If IsNothing(m_title) OrElse m_title.Trim().Equals("") Then
        m_title = "No Title"
    End If
    If IsNothing(m_description) OrElse _
        m_description.Trim().Equals("") Then
        If IsNothing(m_contents) OrElse _
            m_contents.Trim().Equals("") Then
            m_description = _
                "There is no description available for this page"
        Else
            If m_contents.Length > 200 Then
                m_description = m_contents.Substring(0, 200)
            Else
                m_description = m_contents
            End If
        End If
    End If
End Sub
Search method
The Search method is a public method which calls the SearchPhrase or
SearchWords method depending on the search criteria. The SearchPhrase method
searches for phrases while SearchWords searches for all or any words. Both
these methods call the SearchPattern method, which uses regular expressions to
search the files.
Public Sub Search(ByVal strSearchWords As String, _
ByVal SrchCriteria As SearchCriteria)
If SrchCriteria = SearchCriteria.Phrase Then
SearchPhrase(strSearchWords)
Else
SearchWords(strSearchWords, SrchCriteria)
End If
End Sub
Private Sub SearchPhrase(ByVal strSearchWords As String)
Dim mtches As MatchCollection
mtches = SearchPattern(strSearchWords)
If mtches.Count > 0 Then
m_matchcount = mtches.Count
End If
End Sub
Private Sub SearchWords(ByVal strSearchWords As String, _
ByVal SrchCriteria As SearchCriteria)
Dim intSearchLoopCounter As Integer
Dim sarySearchWord As String()
Dim mtches As MatchCollection
sarySearchWord = Split(Trim(strSearchWords), " ")
For intSearchLoopCounter = 0 To UBound(sarySearchWord)
mtches = SearchPattern(sarySearchWord( _
intSearchLoopCounter))
If SrchCriteria = SearchCriteria.AnyWords Then
m_matchcount = m_matchcount + mtches.Count
ElseIf SrchCriteria = SearchCriteria.AllWords Then
If mtches.Count > 0 Then
If m_matchcount = 0 Or (m_matchcount > 0 _
And m_matchcount > mtches.Count) Then
m_matchcount = mtches.Count
End If
Else
m_matchcount = 0
Exit Sub
End If
End If
Next
End Sub
The escaped character \b is a special case. In a regular expression, \b
denotes a word boundary (between \w and \W characters) except within a []
character class, where \b refers to the backspace character. In a replacement
pattern, \b always denotes a backspace.
We might need to drop the word boundary when you are using an encoding other
than UTF-8.
Private Function SearchPattern( _
    ByVal strSearchWord As String) As MatchCollection
    Dim strPattern As String
    'Word boundaries are only applied when the site encoding is UTF-8
    If Searchs.Site.Encoding.Equals("utf-8") Then
        strPattern = "\b{0}\b"
    Else
        strPattern = "{0}"
    End If
    'Regex.Matches is a shared method, so no Regex instance is needed
    Return Regex.Matches(m_contents, String.Format(strPattern, _
        strSearchWord), RegexOptions.IgnoreCase)
End Function
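The three search criteria can be tried outside the web application with a small console sketch. It assumes UTF-8 content, so the word boundary is always applied. The CountWord, CountAnyWords and CountAllWords names are hypothetical, and the Regex.Escape call is an addition that the article's code does not make; it keeps characters such as + or . in a search word from being read as regular expression syntax.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordMatchSketch
    ' Counts whole-word, case-insensitive occurrences of one search word.
    Function CountWord(ByVal contents As String, ByVal word As String) As Integer
        Dim pattern As String = "\b" & Regex.Escape(word) & "\b"
        Return Regex.Matches(contents, pattern, RegexOptions.IgnoreCase).Count
    End Function

    ' AnyWords: the sum of the per-word counts.
    Function CountAnyWords(ByVal contents As String, ByVal words As String()) As Integer
        Dim total As Integer = 0
        For Each word As String In words
            total += CountWord(contents, word)
        Next
        Return total
    End Function

    ' AllWords: zero if any word is missing, otherwise the smallest per-word count.
    Function CountAllWords(ByVal contents As String, ByVal words As String()) As Integer
        If words.Length = 0 Then Return 0
        Dim smallest As Integer = Integer.MaxValue
        For Each word As String In words
            Dim count As Integer = CountWord(contents, word)
            If count = 0 Then Return 0
            smallest = Math.Min(smallest, count)
        Next
        Return smallest
    End Function

    Sub Main()
        Dim sample As String = "Search the site. The search module searches pages."
        Dim words As String() = {"search", "the"}
        Console.WriteLine(CountWord(sample, "search"))   ' 2 ("searches" is not a whole-word match)
        Console.WriteLine(CountAnyWords(sample, words))  ' 4
        Console.WriteLine(CountAllWords(sample, words))  ' 2
    End Sub
End Module
```

Taking the smallest per-word count for AllWords mirrors what the SearchWords method does with m_matchcount.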
UserSearch.vb
It contains the following properties:
Property | Description
SearchCriteria | Stores and returns the user's choice of search criteria
SearchWords | Stores and returns the search words entered by the user
TotalFilesSearched | Returns the total number of files searched
TotalFilesFound | Returns the total number of files in which matches were found
PageDataset | Returns the DataSet that holds the search results
Public Property SearchCriteria() As Searchs.SearchCriteria
Get
Return m_searchCriteria
End Get
Set(ByVal Value As Searchs.SearchCriteria)
m_searchCriteria = Value
End Set
End Property
Public Property SearchWords() As String
Get
Return m_searchWords
End Get
Set(ByVal Value As String)
m_searchWords = Value
End Set
End Property
Public ReadOnly Property TotalFilesSearched() As Integer
Get
Return m_totalFilesSearched
End Get
End Property
Public ReadOnly Property TotalFilesFound() As Integer
Get
Return m_totalFilesFound
End Get
End Property
Public ReadOnly Property PageDataset() As DataSet
Get
Return m_dstPages
End Get
End Property
Search Method
The actual processing of the search begins here. The DataSet that stores the
search results is created here, and the ProcessDirectory method is called.
Public Function Search(ByVal targetDirectory As String) As DataSet
If Searchs.Site.EnglishLanguage = True Then
m_searchWords = m_page.Server.HtmlEncode(m_searchWords)
Else
m_searchWords = Replace(m_searchWords, "<", "&lt;", 1, -1, 1)
m_searchWords = Replace(m_searchWords, ">", "&gt;", 1, -1, 1)
End If
If m_dstPages Is Nothing Then
m_dstPages = Searchs.PagesDataset.Create()
End If
ProcessDirectory(targetDirectory)
Return m_dstPages
End Function
ProcessDirectory Method
The ProcessDirectory method loops through all the files in a directory and
calls the ProcessFile method for each one. It then loops through the
subdirectories and calls itself for every folder that is not barred.
Private Sub ProcessDirectory(ByVal targetDirectory As String)
Dim fileEntries As String()
Dim subdirectoryEntries As String()
Dim filePath As String
Dim subdirectory As String
fileEntries = Directory.GetFiles(targetDirectory)
For Each filePath In fileEntries
m_totalFilesSearched += 1
ProcessFile(filePath)
Next filePath
subdirectoryEntries = Directory.GetDirectories(targetDirectory)
For Each subdirectory In subdirectoryEntries
If Not InStr(1, Searchs.Site.BarredFolders, _
Path.GetFileName(subdirectory), vbTextCompare) > 0 Then
ProcessDirectory(subdirectory)
End If
Next subdirectory
End Sub
ProcessFile Method
The ProcessFile method calls GetInfo, which returns the Searchs.Page object
that contains all the information about the particular file. It then searches
the page, and if the match count is greater than 0 it calls CheckFileInfo to
clean up the information stored in the Page object and stores the file in the
PagesDataset.
Private Sub ProcessFile(ByVal FPath As String)
Dim srchFile As Searchs.Page
srchFile = GetInfo(FPath)
If Not IsNothing(srchFile) Then
srchFile.Search(m_searchWords, m_searchCriteria)
If srchFile.MatchCount > 0 Then
m_totalFilesFound += 1
srchFile.CheckFileInfo()
Searchs.PagesDataset.StoreFile(m_dstPages, srchFile)
End If
End If
End Sub
GetInfo Method
The GetInfo method's main task is to get the data of the file. It calls the
shared method Searchs.FileContent.GetFileInfo, where much of the work is done.
Private Function GetInfo(ByVal FPath As String) As Searchs.Page
Dim fileInform As New FileInfo(FPath)
Dim sr As StreamReader
Dim srchFile As New Searchs.Page()
Dim strBldFile As New StringBuilder()
Dim strFileURL As String
If InStr(1, Searchs.Site.FilesTypesToSearch, _
fileInform.Extension, vbTextCompare) > 0 Then
If Not InStr(1, Searchs.Site.BarredFiles, _
Path.GetFileName(FPath), vbTextCompare) > 0 Then
If Not File.Exists(FPath) Then
Return Nothing
End If
Searchs.FileContent.GetFileInfo(FPath, srchFile)
Return srchFile
End If
End If
Return Nothing
End Function
FileContent.vb
GetFileInfo Method
Here the bulk of the data in the page is retrieved. If the file is static, its
content is read from disk using the GetStaticFileContent method. If the file
is dynamic, its content is retrieved from the server using the
GetDynamicFileContent method. The title is retrieved from the title tags, and
the description and keywords from the meta tags, by calling the GetMetaContent
method. The HTML markup is stripped from the contents of the file by calling
the Searchs.CleanHtml.Clean method.
Public Shared Sub GetFileInfo(ByVal FPath As String, _
    ByVal srchFile As Searchs.Page)
    Dim fileInform As New FileInfo(FPath)
    Dim strBldFile As New StringBuilder()
    'File size in whole kilobytes
    Dim fileSize As Decimal = fileInform.Length \ 1024
    srchFile.Size = fileSize
    GetFilePath(FPath, srchFile)
    If InStr(1, Searchs.Site.DynamicFilesTypesToSearch, _
        fileInform.Extension, vbTextCompare) > 0 Then
        m_page.Trace.Warn("Path", String.Format("{0}/{1}", "", _
            srchFile.Path))
        GetDynamicFileContent(srchFile)
    Else
        GetStaticFileContent(FPath, srchFile)
    End If
    'The contents were already loaded by one of the calls above
    If Not srchFile.Contents.Equals("") Then
        'Read in the title of the file
        srchFile.Title = GetMetaContent(srchFile.Contents, _
            "<title>", "</title>")
        'Read in the description of the file
        srchFile.Description = GetMetaContent(srchFile.Contents, _
            "<meta name=""description"" content=""", """>")
        'Read in the keywords of the file
        srchFile.Keywords = GetMetaContent(srchFile.Contents, _
            "<meta name=""keywords"" content=""", """>")
        srchFile.Contents = _
            Searchs.CleanHtml.Clean(srchFile.Contents)
        srchFile.Contents = _
            strBldFile.AppendFormat("{0} {1} {2} {3}", _
            srchFile.Contents, srchFile.Description, _
            srchFile.Keywords, srchFile.Title).ToString.Trim()
    End If
End Sub
Private Shared Sub GetStaticFileContent( _
ByVal FPath As String, ByVal srchFile As Searchs.Page)
Dim sr As StreamReader
If Searchs.Site.Encoding.Equals("utf-8") Then
sr = File.OpenText(FPath)
Else
sr = New StreamReader(FPath, _
Encoding.GetEncoding(Searchs.Site.Encoding))
End If
Try
srchFile.Contents = sr.ReadToEnd()
sr.Close()
Catch ex As Exception
m_page.Trace.Warn("Error", ex.Message)
srchFile.Contents = ex.Message
End Try
End Sub
GetDynamicFileContent
GetDynamicFileContent branches into one of two methods, namely
GetDynamicFileContentUTF or GetDynamicFileContentOther, depending on the
encoding.
Private Shared Sub GetDynamicFileContent(ByVal srchFile As Searchs.Page)
Dim wcMicrosoft As System.Net.WebClient
If Searchs.Site.Encoding.Equals("utf-8") Then
GetDynamicFileContentUTF(srchFile)
Else
GetDynamicFileContentOther(srchFile)
End If
End Sub
System.Net.WebClient provides common methods for sending data to and receiving
data from a resource identified by a URI. We make use of the DownloadData
method, which downloads data from a resource and returns a byte array.
Applications that target the common language runtime use encoding to map
character representations from the native character scheme (Unicode) to other
schemes. Applications use decoding to map characters from nonnative schemes
(non-Unicode) to the native scheme. The System.Text Namespace provides classes
that allow you to encode and decode characters.
Private Shared Sub GetDynamicFileContentOther( _
    ByVal srchFile As Searchs.Page)
    Dim wcMicrosoft As System.Net.WebClient
    Dim fileEncoding As System.Text.Encoding
    Try
        'Create the client before it is used; without this line
        'DownloadData is called on Nothing and the call fails
        wcMicrosoft = New System.Net.WebClient()
        fileEncoding = System.Text.Encoding.GetEncoding( _
            Searchs.Site.Encoding)
        srchFile.Contents = fileEncoding.GetString( _
            wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
            Searchs.Site.ApplicationPath, srchFile.Path)))
    Catch ex As System.Net.WebException
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    Catch ex As System.Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    End Try
End Sub
The UTF8Encoding class encodes Unicode characters using UCS Transformation
Format, 8-bit form (UTF-8). This encoding supports all Unicode character
values and surrogates.
Private Shared Sub GetDynamicFileContentUTF( _
ByVal srchFile As Searchs.Page)
Dim wcMicrosoft As System.Net.WebClient
Dim objUTF8Encoding As UTF8Encoding
Try
wcMicrosoft = New System.Net.WebClient()
objUTF8Encoding = New UTF8Encoding()
srchFile.Contents = objUTF8Encoding.GetString( _
wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
Searchs.Site.ApplicationPath, srchFile.Path)))
Catch ex As System.Net.WebException
m_page.Trace.Warn("Error", ex.Message)
srchFile.Contents = ex.Message
Catch ex As System.Exception
m_page.Trace.Warn("Error", ex.Message)
srchFile.Contents = ex.Message
End Try
End Sub
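Why the decoder has to match the encoding of the downloaded bytes can be seen in isolation. The sketch below is an illustration only: it uses iso-8859-1 as a stand-in for any non-UTF-8 site encoding, and builds the byte array in memory instead of downloading it with WebClient.

```vbnet
Imports System
Imports System.Text

Module EncodingSketch
    Sub Main()
        ' "caf" followed by e-acute (U+00E9), built without a non-ASCII literal.
        Dim original As String = "caf" & Convert.ToChar(233)

        ' Pretend these bytes came back from DownloadData for a Latin-1 page.
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim downloaded As Byte() = latin1.GetBytes(original)

        ' Decoding with the matching encoding restores the text.
        Dim decodedWithLatin1 As String = latin1.GetString(downloaded)
        ' Decoding the same bytes as UTF-8 does not: the lone byte E9 is invalid UTF-8.
        Dim decodedWithUtf8 As String = New UTF8Encoding().GetString(downloaded)

        Console.WriteLine(decodedWithLatin1 = original)  ' True
        Console.WriteLine(decodedWithUtf8 = original)    ' False
    End Sub
End Module
```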
GetFilePath Method
The GetFilePath method converts the local folder path so that it reflects the
URL of the site.
Private Shared Sub GetFilePath(ByVal strFileURL As String, _
    ByVal srchFile As Searchs.Page)
    'Remove the physical root folder and switch to forward slashes
    strFileURL = Replace(strFileURL, m_page.Server.MapPath("./"), "")
    strFileURL = Replace(strFileURL, "\", "/")
    'Encode the file name and path into the URL code method
    strFileURL = m_page.Server.UrlEncode(strFileURL)
    'Just in case it has encoded any forward slashes
    strFileURL = Replace(strFileURL.Trim(), _
        "%2f", "/", vbTextCompare)
    srchFile.Path = strFileURL
    m_page.Trace.Warn("Url", srchFile.Path)
End Sub
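The same conversion can be sketched without the ASP.NET Server object. This is an assumption-laden stand-in rather than the article's method: ToRelativeUrl is a hypothetical name, the root folder is passed in instead of coming from Server.MapPath, and Uri.EscapeDataString replaces Server.UrlEncode, so a space becomes %20 rather than +. Encoding each path segment separately means the slashes never need to be restored afterwards.

```vbnet
Imports System

Module FilePathSketch
    ' Turns a physical file path into a site-relative URL path.
    Function ToRelativeUrl(ByVal rootFolder As String, ByVal filePath As String) As String
        Dim relative As String = filePath
        If relative.StartsWith(rootFolder, StringComparison.OrdinalIgnoreCase) Then
            relative = relative.Substring(rootFolder.Length)
        End If
        ' Encode each segment on its own so the separators stay intact.
        Dim segments As String() = relative.Replace("\", "/").Split("/"c)
        For i As Integer = 0 To segments.Length - 1
            segments(i) = Uri.EscapeDataString(segments(i))
        Next
        Return String.Join("/", segments)
    End Function

    Sub Main()
        ' Example paths are made up for the sketch.
        Console.WriteLine(ToRelativeUrl("C:\inetpub\wwwroot\", _
            "C:\inetpub\wwwroot\docs\my page.htm"))  ' docs/my%20page.htm
    End Sub
End Module
```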
GetMetaContent Method
The GetMetaContent method uses regular expressions to strip the tags and get
the required information.
Private Shared Function GetMetaContent(ByVal strFile As String, _
ByVal strMetaStart As String, ByVal strMetaEnd As String) As String
Dim regexp As Regex
Dim strMeta As String
Dim strPattern As String
Dim strInPattern As String
If InStr(1, LCase(strFile), strMetaStart, 1) = 0 _
And InStr(strMetaStart, "name=") > 0 Then
strMetaStart = Replace(strMetaStart, "name=", "http-equiv=")
End If
strInPattern = "((.|\n)*?)"
strPattern = String.Format("{0}{1}{2}", _
strMetaStart, strInPattern, strMetaEnd)
regexp = New Regex(strPattern, RegexOptions.IgnoreCase)
strMeta = regexp.Match(strFile).ToString
strInPattern = "(.*?)"
strPattern = String.Format("{0}{1}{2}", _
strMetaStart, strInPattern, strMetaEnd)
strMeta = regexp.Replace(strMeta, strPattern, _
"$1", RegexOptions.IgnoreCase)
Return strMeta
End Function
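The extraction idea can be tried on its own with a single capture group. The sketch below is a simplified variant, not the method above: Between is a hypothetical helper, it escapes both markers so they are matched literally, and it reads the captured group directly instead of running a second replace.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MetaSketch
    ' Returns the text between startTag and endTag, or "" when nothing matches.
    Function Between(ByVal html As String, ByVal startTag As String, _
                     ByVal endTag As String) As String
        Dim pattern As String = Regex.Escape(startTag) & "([\s\S]*?)" & Regex.Escape(endTag)
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value.Trim()
        Return ""
    End Function

    Sub Main()
        ' Sample markup invented for the sketch.
        Dim html As String = "<html><head><title>Site Search</title>" & _
            "<meta name=""description"" content=""Find pages quickly"">" & _
            "</head><body>Hello</body></html>"
        Console.WriteLine(Between(html, "<title>", "</title>"))  ' Site Search
        Console.WriteLine(Between(html, _
            "<meta name=""description"" content=""", """>"))      ' Find pages quickly
    End Sub
End Module
```

A pattern like this only recognizes meta tags written exactly in that attribute order and closed with a quote followed by >, which is also a limitation of the start and end markers used in GetFileInfo.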
PagesDataset.vb
This class is used to create and build the DataSet. It consists of two
methods, Create and StoreFile.
The Create method creates the DataSet that stores the search results, and the
StoreFile method is responsible for adding records to the DataTable in the
DataSet.
Public Shared Function Create() As DataSet
    Dim pgDataSet As New DataSet()
    'One-element array: the primary key is the single PageId column
    Dim keys(0) As DataColumn
    pgDataSet.Tables.Add(New DataTable("Pages"))
    pgDataSet.Tables("Pages").Columns.Add("PageId", _
        System.Type.GetType("System.Int32"))
    pgDataSet.Tables("Pages").Columns.Add("Title", _
        System.Type.GetType("System.String"))
    pgDataSet.Tables("Pages").Columns.Add("Description", _
        System.Type.GetType("System.String"))
    pgDataSet.Tables("Pages").Columns.Add("Path", _
        System.Type.GetType("System.String"))
    pgDataSet.Tables("Pages").Columns.Add("MatchCount", _
        System.Type.GetType("System.Int32"))
    pgDataSet.Tables("Pages").Columns.Add("Size", _
        System.Type.GetType("System.Decimal"))
    'PageId is generated automatically, starting at 1
    pgDataSet.Tables("Pages").Columns("PageId").AutoIncrement = True
    pgDataSet.Tables("Pages").Columns("PageId").AutoIncrementSeed = 1
    keys(0) = pgDataSet.Tables("Pages").Columns("PageId")
    pgDataSet.Tables("Pages").PrimaryKey = keys
    Return pgDataSet
End Function
Public Shared Sub StoreFile(ByVal dstPgs As DataSet, _
ByVal srchPg As Searchs.Page)
Dim pageRow As DataRow
pageRow = dstPgs.Tables("Pages").NewRow()
pageRow("Title") = srchPg.Title
pageRow("Description") = srchPg.Description
pageRow("Path") = srchPg.Path
pageRow("MatchCount") = srchPg.MatchCount
pageRow("Size") = srchPg.Size
dstPgs.Tables("Pages").Rows.Add(pageRow)
End Sub
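How rows stored this way end up ordered by match count can be seen in a console sketch. It assumes a cut-down table with only the PageId, Title and MatchCount columns; the BuildPages and AddPage helpers and the sample rows are invented for the illustration. The sort expression is the same "MatchCount Desc" string that the web form keeps in ViewState.

```vbnet
Imports System
Imports System.Data

Module SortSketch
    ' Creates a reduced "Pages" table with an auto-incrementing key.
    Function BuildPages() As DataTable
        Dim pages As New DataTable("Pages")
        Dim idColumn As DataColumn = pages.Columns.Add("PageId", GetType(Integer))
        idColumn.AutoIncrement = True
        idColumn.AutoIncrementSeed = 1
        pages.PrimaryKey = New DataColumn() {idColumn}
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("MatchCount", GetType(Integer))
        Return pages
    End Function

    ' Adds one result row; PageId is filled in by the table.
    Sub AddPage(ByVal pages As DataTable, ByVal title As String, ByVal matchCount As Integer)
        Dim row As DataRow = pages.NewRow()
        row("Title") = title
        row("MatchCount") = matchCount
        pages.Rows.Add(row)
    End Sub

    Sub Main()
        Dim pages As DataTable = BuildPages()
        AddPage(pages, "About", 2)
        AddPage(pages, "Home", 7)
        AddPage(pages, "Contact", 4)

        ' The view sorts the rows; the table itself keeps insertion order.
        Dim view As DataView = pages.DefaultView
        view.Sort = "MatchCount Desc"
        For Each rowView As DataRowView In view
            Console.WriteLine("{0} {1} {2}", rowView("PageId"), _
                rowView("Title"), rowView("MatchCount"))
        Next
        ' Prints: 2 Home 7, then 3 Contact 4, then 1 About 2
    End Sub
End Module
```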
CleanHtml.vb
The CleanHtml class contains a single public shared method which uses regular
expressions to clean the HTML content.
Public Shared Function Clean(ByVal Contents As String) As String
    'Remove elements whose content is not page text
    Contents = Regex.Replace(Contents, _
        "<(select|option|script|style|title)(.*?)" & _
        ">((.|\n)*?)</(SELECT|OPTION|SCRIPT|STYLE|TITLE)>", _
        " ", RegexOptions.IgnoreCase)
    'Remove common entities
    Contents = Regex.Replace(Contents, "&(nbsp|quot|copy);", "")
    'Remove the remaining tags
    Contents = Regex.Replace(Contents, "<([\s\S])+?>", _
        " ", RegexOptions.IgnoreCase)
    'Replace non-word characters with spaces
    Contents = Regex.Replace(Contents, "\W", " ")
    Return Contents
End Function
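The effect of this kind of cleaning is easy to check on a small input. The sketch below is a simplified variant and not the Clean method itself: StripHtml is a hypothetical name, it handles only script, style and title elements, it replaces entities with a space, and it collapses whitespace instead of replacing every non-word character.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CleanHtmlSketch
    Function StripHtml(ByVal html As String) As String
        Dim result As String = html
        ' Drop elements whose content is never visible page text.
        result = Regex.Replace(result, _
            "<(script|style|title)[^>]*>[\s\S]*?</\1>", " ", RegexOptions.IgnoreCase)
        ' Drop the remaining tags.
        result = Regex.Replace(result, "<[^>]+>", " ")
        ' Replace a few common entities with a space.
        result = Regex.Replace(result, "&(nbsp|quot|copy);", " ")
        ' Collapse runs of whitespace into single spaces.
        Return Regex.Replace(result, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Sample markup invented for the sketch.
        Dim html As String = "<html><head><title>T</title><style>p{}</style></head>" & _
            "<body><p>Hello&nbsp;<b>world</b></p><script>var x=1;</script></body></html>"
        Console.WriteLine(StripHtml(html))  ' Hello world
    End Sub
End Module
```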
Site.vb
The Site class consists of shared properties which store the configuration of
the entire site. Most of these properties get their values from the web.config
file using ConfigurationSettings.AppSettings.
The following are the properties of the Site class:
Property | Description
FilesTypesToSearch | Returns the file types you want to search
DynamicFilesTypesToSearch | Returns the dynamic file types to search
BarredFolders | Returns the barred folders
EnglishLanguage | Returns a Boolean value indicating whether the language is English
Encoding | Returns the encoding for the site
BarredFiles | Returns the barred files
ApplicationPath | Assigns and returns the path of the application
Public Shared ReadOnly Property FilesTypesToSearch() As String
Get
Return ConfigurationSettings.AppSettings( _
"FilesTypesToSearch")
End Get
End Property
Public Shared ReadOnly Property DynamicFilesTypesToSearch() As String
Get
Return ConfigurationSettings.AppSettings( _
"DynamicFilesTypesToSearch")
End Get
End Property
Public Shared ReadOnly Property BarredFolders() As String
Get
Return ConfigurationSettings.AppSettings("BarredFolders")
End Get
End Property
Public Shared ReadOnly Property BarredFiles() As String
Get
Return ConfigurationSettings.AppSettings("BarredFiles")
End Get
End Property
Public Shared ReadOnly Property EnglishLanguage() As Boolean
Get
Return CBool(ConfigurationSettings.AppSettings("EnglishLanguage"))
End Get
End Property
Public Shared ReadOnly Property Encoding() As String
Get
Return ConfigurationSettings.AppSettings("Encoding")
End Get
End Property
Public Shared Property ApplicationPath() As String
Get
Return m_ApplicationPath
End Get
Set(ByVal Value As String)
m_ApplicationPath = Value
End Set
End Property
Web.config
The ASP.NET configuration system features an extensible infrastructure that
enables you to define configuration settings at the time your ASP.NET
applications are first deployed, so that you can add or revise configuration
settings at any time, with minimal impact on operational Web applications and
servers. Multiple configuration files, all named Web.config, can appear
in multiple directories on an ASP.NET Web application server. Each
Web.config file applies configuration settings to its own directory and
all child directories below it. As mentioned earlier, the site configurations
can be assigned in the web.config file.
<appSettings>
  <add key="FilesTypesToSearch" value=".htm,.html,.asp,.shtml,.aspx" />
  <add key="DynamicFilesTypesToSearch" value=".asp,.shtml,.aspx" />
  <add key="BarredFolders"
    value="aspnet_client,_private,_vti_cnf,_vti_log,_vti_pvt,_vti_script,_vti_txt,cgi_bin,_bin,bin,_notes,images,scripts" />
  <add key="BarredFiles"
    value="localstart.asp,iisstart.asp,AssemblyInfo.vb,Global.asax,Global.asax.vb,SiteSearch.aspx" />
  <add key="EnglishLanguage" value="True" />
  <add key="Encoding" value="utf-8" />
</appSettings>
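How a comma-separated setting such as FilesTypesToSearch can be tested against a file is sketched below. This is an alternative illustration, not the module's code: the module checks with InStr, which is a substring test and would therefore also accept a partial extension such as .ht, whereas the hypothetical IsSearchable helper compares whole entries. The list is passed in as a plain string instead of being read from web.config.

```vbnet
Imports System
Imports System.IO

Module ExtensionFilterSketch
    ' Exact, case-insensitive test of a file's extension against a
    ' comma-separated list such as the FilesTypesToSearch value.
    Function IsSearchable(ByVal filePath As String, ByVal extensionList As String) As Boolean
        Dim extension As String = Path.GetExtension(filePath)
        For Each item As String In extensionList.Split(","c)
            If String.Equals(item.Trim(), extension, StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        Dim types As String = ".htm,.html,.asp,.shtml,.aspx"
        Console.WriteLine(IsSearchable("default.ASPX", types))  ' True
        Console.WriteLine(IsSearchable("notes.ht", types))      ' False
        Console.WriteLine(IsSearchable("logo.gif", types))      ' False
    End Sub
End Module
```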
How to integrate
The application has been tested with the web form SiteSearch.aspx in
the root directory, so I suggest that you do the same. Later on, you can
try moving it to any subfolder. I have placed all my classes in the
components folder. You can move them to any folder of your choice.
Note:
- For users who don't have Visual Studio .NET:
  - Download from the link 'Download latest version of demo project (Visual
    studio.net not required)'.
  - Place SearchDotnet.dll in the bin folder in the root.
  - Place SiteSearch.aspx and web.config in the root.
- To use the XML version:
  - Download from the link 'Download demo project which reads and writes to
    XML(VB.net)'.
  - The project contains the following files:
    - AdminSearch.aspx is used to write XML to file.
    - SiteSearch.aspx is used to search files.
  - All my classes are placed in the components folder.
Errors
When the application is placed in the root, you may get one of the following errors.
The remote server returned an error: (401) Unauthorized.
OR
The remote server returned an error: (500) Internal Server Error.
These errors have the following causes:
- If the server returns (401) Unauthorized, the application is unable to read
the file because it lacks the access rights.
- If the server returns (500) Internal Server Error, the page that the
application was trying to read returned an error. Either the page itself has an
error, or it requires parameters and fails without them.
Follow these steps to rectify the errors:
- In the Web.config, ensure that the BarredFolders list is comprehensive:
aspnet_client,_private,_vti_cnf,_vti_log,_vti_pvt,_vti_script,_vti_txt,cgi_bin,_bin,bin,_notes,images,scripts
- Ensure that the BarredFiles list is comprehensive and contains
localstart.asp,iisstart.asp
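To find out which page triggers a 401 or 500, it can help to request the suspect URLs one at a time and look at the status code. The following is only a sketch: the module FetchStatusSketch and the function DescribeFetch are hypothetical names, and it assumes the pages are fetched over HTTP with HttpWebRequest, which the error messages suggest but the article does not state.

```vb
Imports System
Imports System.Net

' Hypothetical diagnostic helper, not part of the article's download.
Public Module FetchStatusSketch

    ' Requests a URL and returns a short description of the outcome.
    Public Function DescribeFetch(ByVal url As String) As String
        Dim request As HttpWebRequest = _
            CType(WebRequest.Create(url), HttpWebRequest)
        ' Assumption: sending the process credentials may avoid some
        ' 401 responses on sites that use Windows authentication.
        request.Credentials = CredentialCache.DefaultCredentials

        Dim response As HttpWebResponse = Nothing
        Try
            response = CType(request.GetResponse(), HttpWebResponse)
            Return url & " -> " & CInt(response.StatusCode).ToString()
        Catch ex As WebException
            ' For protocol errors, the failed response carries the status code.
            If TypeOf ex.Response Is HttpWebResponse Then
                Dim failed As HttpWebResponse = _
                    CType(ex.Response, HttpWebResponse)
                Dim code As HttpStatusCode = failed.StatusCode
                failed.Close()
                If code = HttpStatusCode.Unauthorized Then
                    Return url & " -> 401: check access rights or bar the file"
                ElseIf code = HttpStatusCode.InternalServerError Then
                    Return url & " -> 500: the page fails when requested as is"
                End If
                Return url & " -> " & CInt(code).ToString()
            End If
            Return url & " -> no HTTP response (" & ex.Message & ")"
        Finally
            If Not response Is Nothing Then
                response.Close()
            End If
        End Try
    End Function

End Module
```

Any URL reported with 401 or 500 is a candidate for the BarredFiles or BarredFolders list.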
Globalization
The Search Engine module can easily be globalized. As an example, we will
see how to convert it to the Chinese language.
Web.config
The XML declaration must appear as the first line of the document, with no
other content, including white space, in front of the opening <.
The XML declaration in the document map consists of the following:
<?xml version="1.0" encoding="Your Encoding" ?>. By default, Visual Studio uses
the utf-8 encoding; this needs to be changed to the encoding that you want to use.
Here we will change it to gb2312. Hence the XML declaration needs to be modified as
follows.
English
<?xml version="1.0" encoding="utf-8" ?>
Chinese
<?xml version="1.0" encoding="gb2312" ?>
The requestEncoding and responseEncoding attributes specify the assumed
encoding of each incoming request and outgoing response. The default
encoding is UTF-8, specified in the <globalization> tag included in the
Machine.config file created when the .NET Framework is installed. If encoding is
not specified in a Machine.config or Web.config file, encoding defaults to the
computer's Regional Options locale setting. We will need to change
requestEncoding and responseEncoding to reflect the change in encoding.
English
<globalization requestEncoding="utf-8" responseEncoding="utf-8" />
Chinese
<globalization requestEncoding="gb2312" responseEncoding="gb2312" />
In order to avoid rebuilding the code when the encoding changes, we need to add
the Encoding key to the appSettings.
<add key="Encoding" value="gb2312" />
Also change the EnglishLanguage key to False.
<add key="EnglishLanguage" value="False" />
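The point of keeping the encoding name in web.config is that code can resolve it at run time. A minimal sketch of that idea follows; the module EncodingSketch and the function GetConfiguredEncoding are hypothetical names, and the fallback to UTF-8 is my assumption rather than the documented behaviour of the module.

```vb
Imports System
Imports System.Configuration
Imports System.Text

' Hypothetical helper module, not part of the article's download.
Public Module EncodingSketch

    ' Resolves the "Encoding" appSettings key to an Encoding object.
    ' Assumption: fall back to UTF-8 when the key is missing or unknown.
    Public Function GetConfiguredEncoding() As Encoding
        Dim name As String = ConfigurationSettings.AppSettings("Encoding")
        If name Is Nothing OrElse name.Trim().Length = 0 Then
            Return Encoding.UTF8
        End If
        Try
            ' For example "utf-8" or "gb2312".
            Return Encoding.GetEncoding(name.Trim())
        Catch ex As Exception
            ' The name is not recognised on this machine.
            Return Encoding.UTF8
        End Try
    End Function

End Module
```

The returned object could then be passed to, say, a StreamReader when reading page content, so switching from utf-8 to gb2312 only requires editing web.config.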
SiteSearch.aspx
Last but not least, the codePage attribute has to be added to the Page
directive. Code page 936 corresponds to gb2312.
English
<%@ Page Language="vb" Trace="False" AutoEventWireup="false"
Codebehind="SiteSearch.aspx.vb"
Inherits="SearchDotnet.SiteSearch" debug="false" %>
Chinese
<%@ Page Language="vb" Trace="False" AutoEventWireup="false"
Codebehind="SiteSearch.aspx.vb" Inherits="SearchDotnet.SiteSearch"
debug="false" codePage="936" %>
Enhancing the code
The application is meant for small sites. For bigger sites, the code can be
further enhanced. In fact, you will need to write the page data to a store such
as an XML file periodically and then read from it. Here are a few tips to do so.
(I have now included a demo project that reads and writes to XML.)
(1) In my code I search and filter the data using regular expressions. Instead
of this, you will have to write the entire data (not the filtered data) to an XML
file using the following method.
Private Shared Sub WriteXmlToFile(ByVal thisDataSet As DataSet)
    ' Nothing to write if no dataset was supplied.
    If thisDataSet Is Nothing Then
        Return
    End If
    ' XMLFile holds the path of the XML file.
    thisDataSet.WriteXml(XMLFile)
End Sub
(2) Later you will need to read the XML from the file and save it to the shared
dataset, say Searchs.Site.PageDataset.Tables("Pages").
Private Shared Function ReadXmlFromFile() As DataSet
    Dim newDataSet As New DataSet("New DataSet")
    ' Open the XML file written by WriteXmlToFile.
    Dim fsReadXml As New System.IO.FileStream(XMLFile, _
        System.IO.FileMode.Open)
    Dim myXmlReader As New System.Xml.XmlTextReader(fsReadXml)
    Try
        newDataSet.ReadXml(myXmlReader)
    Finally
        ' Closing the reader also closes the underlying stream.
        myXmlReader.Close()
    End Try
    Return newDataSet
End Function
(3) For each search you will then need to use the Select method of
PageDataset.Tables to filter the data according to the search terms. The method
that filters the dataset might look like this. The FillDataset method contains
the logic to create and add search results (an array of DataRow) to database.
Private Sub FiterPagesDatset()
    Dim strExpr As String
    Dim foundRows As DataRow()
    ' Columns that the filter expression should cover.
    Dim Field() As String = { _
        "Title", "Description", "Keywords", "Contents"}
    ' SomeFunction is a placeholder for code that builds the filter expression.
    strExpr = SomeFunction
    foundRows = Searchs.Site.PageDataset.Tables( _
        "Pages").Select(strExpr)
    FillDataset(foundRows)
End Sub
(4) Store the filtered result in another dataset and use it to display the
results.
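The placeholder SomeFunction in step (3) has to return a filter expression that DataTable.Select understands. One possible way to build it is sketched below; the module FilterSketch and the functions EscapeLikeValue and BuildFilter are hypothetical names, not the demo project's code. It does a plain substring match with LIKE, which is simpler than the regular-expression matching and match counting that the module otherwise uses.

```vb
Imports System
Imports System.Text

' Hypothetical helper module, not part of the article's download.
Public Module FilterSketch

    ' Escapes characters that are special inside a LIKE pattern
    ' of a DataTable filter expression.
    Private Function EscapeLikeValue(ByVal value As String) As String
        Dim sb As New StringBuilder
        Dim c As Char
        For Each c In value
            Select Case c
                Case "["c, "]"c, "*"c, "%"c
                    ' Wildcards and brackets are wrapped in brackets.
                    sb.Append("[").Append(c).Append("]")
                Case "'"c
                    ' Single quotes are doubled.
                    sb.Append("''")
                Case Else
                    sb.Append(c)
            End Select
        Next
        Return sb.ToString()
    End Function

    ' Builds "[Field1] LIKE '%kw%' OR [Field2] LIKE '%kw%' ...".
    Public Function BuildFilter(ByVal fields() As String, _
                                ByVal keyword As String) As String
        Dim pattern As String = EscapeLikeValue(keyword)
        Dim sb As New StringBuilder
        Dim i As Integer
        For i = 0 To fields.Length - 1
            If i > 0 Then
                sb.Append(" OR ")
            End If
            sb.Append("[").Append(fields(i)).Append("] LIKE '%")
            sb.Append(pattern).Append("%'")
        Next
        Return sb.ToString()
    End Function

End Module
```

With the Field array from step (3) and the keyword asp, BuildFilter would return an expression starting with [Title] LIKE '%asp%' OR [Description] LIKE '%asp%', which can be passed to Select as strExpr.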
Points of Interest
When I was working on the project, the first question was how to display the
results. DataGrid was my choice, as we can exploit a lot of its features which
are not present in other list controls. Once that question was solved, the next
was how to pass content to the DataGrid. DataSet was the only alternative. As I
worked further, I found that before storing the information in the DataSet I had
to pass bulk information about the page back and forth. I decided to use a site
object to store the information.
One of the authors suggested the following best practices:
- A class should be small, with approximately half a dozen methods and
properties.
- Methods should be short, again with approximately half a dozen lines.
After carefully analyzing the code and keeping the best practices in mind, I
redesigned the code to what it is now.
History
- Modified the code to read dynamic pages.
- Enhanced the code to support globalization.
- Included a demo project that reads and writes to XML.