Introduction
In this post, I describe a web application for collecting URLs of web pages whose text matches a given set of words. The application can be run from here. (Note: The deployed application allows a maximum of five queries per user per day.)
The application is particularly useful for finding pages that contain many links related to the topic being sought. For example, with the query "English podcast mp3", the application finds pages with many links to MP3 podcasts for learning English.
The application demonstrates some practical concepts, including web crawling, generic collections, and the use of the WebRequest (in System.Net) and WebGrid (in System.Web.Helpers) classes. It also serves as a good example of using Microsoft's WebMatrix (the IDE used for development) to build a Razor-based application from a single ".cshtml" file rather than separate files for the model, view, and controller.
How the Application Works
The application is a single-page application (SPA) developed with ASP.NET Razor. It consists of a single Razor view page that contains both the HTML and the processing logic.
@{
    int StartTimer = 0;
    string ProgressInfo = "";
    WebGrid grid = null;
    int MaxRecords = 20; string searchWords = "English podcast mp3";
    if (Request["hvar"] == "submit")
    {
        MaxRecords = int.Parse(Request["MaxRecords"]);
        if (MaxRecords > 60)
        {   ProgressInfo = "Maximum Records cannot exceed 60.";
            goto finish;
        }
        StartTimer = 1;
        Grabber grabber = new Grabber();
        grabber.Session = Session;
        var urlTable = (HashSet<RowSchema>) Session["urlTable"];
        if (urlTable == null)
        {   urlTable = new HashSet<RowSchema>();
            Session["urlTable"] = urlTable;
        }
        else if (Request["refresh"] == "0")
        {   urlTable.Clear(); }
        searchWords = Request["searchWords"];
        bool status = grabber.Search(searchWords, MaxRecords, urlTable);
        grid = new WebGrid(source: urlTable, rowsPerPage: 100);
        grid.SortDirection = SortDirection.Descending;
        grid.SortColumn = "Count";
        int visitedCount = urlTable.Where(p => p.Visited).Count();
        ProgressInfo = "Visited count = " + visitedCount + "; Page will refresh after 15 seconds ...";
        if (status)
        {   StartTimer = 0; ProgressInfo = "Finished";
        }
    }
    finish: ;
}
// some more stuff here
:
<form action="" method="post" >
<input name="hvar" type="hidden" value="submit" />
<input id="refresh" name="refresh" type="hidden" value="0" />
<label>Maximum Records</label><input type="text" name="MaxRecords" value="@MaxRecords" size="4" />
<label>Search Word(s)</label><input type="text" name="searchWords" value="@searchWords" size="35" />
<input type="submit" value="Search" onclick="submitForm()" />
<input type="button" value="Stop" onclick="DoStop()" />
</form>
<div style="margin-left:10px" >
<p id="status" >@ProgressInfo</p>
<!---->
@if (grid!=null)
{ @grid.GetHtml() }
</div>
The preceding listing shows the server-side code, which runs every time the page is requested, followed by the form's HTML.
In the code, the line Grabber grabber = new Grabber();
creates a "Grabber" object. The call grabber.Search(searchWords, MaxRecords, urlTable);
crawls the web and fills a collection (urlTable parameter) with URLs that have high relevance to the words specified by the searchWords parameter.
The line grid = new WebGrid(source:urlTable, rowsPerPage:100);
sets urlTable as the data source for a WebGrid object. In the HTML for the body of the page, the line { @grid.GetHtml() }
renders the object's data as an HTML table.
The urlTable is a HashSet (a generic collection). Every time the page is refreshed, more URLs are added to urlTable by the call to grabber.Search(). To avoid losing the urlTable object between postbacks (page refreshes), it is saved in the page's Session object. The rows in this table are of type RowSchema (a class defined in the Grabber.cs file in the App_Code folder). To avoid duplicate URLs, we chose a HashSet<RowSchema> collection, which requires overriding the GetHashCode() and Equals() methods of the element type (RowSchema in our case).
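The RowSchema class itself lives in Grabber.cs and is not listed in this post; the following is a minimal sketch of what such an element type might look like, assuming that equality and the hash code are based on the URL field (the case-insensitive comparison is also an assumption):
using System;

// Hypothetical sketch of the RowSchema element type (the real class is in
// App_Code/Grabber.cs). Equality and the hash code are assumed to be based on
// the URL field, so that HashSet<RowSchema> rejects duplicate URLs.
public class RowSchema
{
    public string URL { get; set; }
    public string Title { get; set; }
    public int Count { get; set; }      // number of search-word matches found on the page
    public bool Visited { get; set; }   // true once the page has been fetched and scored

    public override bool Equals(object obj)
    {
        RowSchema other = obj as RowSchema;
        return other != null &&
               string.Equals(URL, other.URL, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        return URL == null ? 0 : URL.ToLowerInvariant().GetHashCode();
    }
}
With these overrides in place, urlTable.Add(row) silently ignores a row whose URL is already in the set, which is what keeps the grid free of duplicates.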
The call to Search() runs for about 10 seconds on every page refresh. The refresh cycle terminates when Search() returns true, which happens once the number of rows reaches MaxRecords and all rows have been visited.
The Grabber Class
The following table lists some key methods defined in the Grabber class.
Method | Description
------ | -----------
public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable) | The main (entry point) method of the Grabber class. It adds rows to, removes rows from, or updates rows in urlTable. Returns true when the number of rows reaches MaxRecords and all rows have been visited.
string FetchURL(string url) | Fetches the HTML for a given url using the .NET WebRequest class (a sketch follows the table).
string GetTitle(string htmlHead, string searchWords) | Returns the page's title, or an empty string if no title is found or if none of the words in searchWords appears in the title.
int CountWords(string htmlData, string searchWords) | Returns the number of matches for words from searchWords in htmlData.
HashSet<string> GrabURLs(string htmlData, string parentURL) | Returns a set of absolute URLs built from the URLs found in htmlData.
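FetchURL() itself is not listed in this post. The following is a minimal sketch of how a WebRequest-based fetch could look, assuming (as the Search() listing suggests) that failures are reported as a string starting with "Error"; the timeout and user-agent values are illustrative assumptions:
using System;
using System.IO;
using System.Net;

// Hypothetical sketch of FetchURL: download a page with WebRequest and return its
// HTML, or a string starting with "Error" on failure (the convention Search() checks).
// The timeout and user-agent values are illustrative assumptions.
string FetchURL(string url)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000;    // give slow servers up to 10 seconds
        request.UserAgent = "Mozilla/5.0 (compatible; UrlGrabber/1.0)";
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        return "Error: " + ex.Message;
    }
}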
The following listing shows the Search() method from the Grabber class.
public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable)
{
    DateTime t1 = DateTime.Now;
    while (true)
    {   if ((DateTime.Now - t1).TotalSeconds > MaxServerTime) break;

        // Default target: a Bing search page at a random result offset.
        string SearchUrl = String.Format("http://www.bing.com/search?q={0}",
            HttpUtility.UrlEncode(searchWords)) + "&first=" + rand.Next(500);
        string parentURL = "";
        RowSchema row1 = null;

        // About half of the time, visit one of the not-yet-visited URLs instead.
        if ((urlTable.Count > 5) && (rand.NextDouble() < 0.5))
        {   var foundRows = urlTable.Where(p => p.Visited == false).ToList<RowSchema>();
            if ((foundRows.Count == 0) && (urlTable.Count == MaxRecords))
                return true;
            if (foundRows.Count > 0)
            {   row1 = foundRows[0];
                SearchUrl = row1.URL;
                row1.Visited = true; parentURL = SearchUrl;
            }
        }

        string searchData = FetchURL(SearchUrl);
        if (searchData.StartsWith("Error"))
        {   if (row1 != null)
            {   urlTable.Remove(row1); }
            continue;
        }
        int i = searchData.IndexOf("<body", StringComparison.InvariantCultureIgnoreCase);
        if (i == -1)
        {   if (row1 != null)
            {   urlTable.Remove(row1); }
            continue;
        }
        string htmlHead = searchData.Substring(0, i - 1);
        string htmlBody = searchData.Substring(i).ToLower();

        // Score a visited row: drop it if the title or body does not match the search words.
        if (row1 != null)
        {   string Title = GetTitle(htmlHead, searchWords);
            if (Title == "")
            {   urlTable.Remove(row1);
                continue;
            }
            int Count = CountWords(htmlBody, searchWords);
            if (Count == 0)
            {   urlTable.Remove(row1);
                continue;
            }
            row1.Title = Title;
            row1.Count = Count;
        }

        // Extract candidate links from the fetched page and queue them for later visits.
        HashSet<string> urlSet = GrabURLs(searchData, parentURL);
        foreach (string s in urlSet)
        {   if (urlTable.Count == MaxRecords) break;
            row1 = new RowSchema();
            row1.URL = s;
            row1.Visited = false;
            urlTable.Add(row1);
        }
    }
    if (logFile != null) logFile.Close();
    return false;
}
The call FetchURL(SearchUrl) fetches content, where SearchUrl is either a Bing search query or an unvisited URL from urlTable. The returned content (searchData) is then processed to extract URLs with the call GrabURLs(searchData, parentURL), which returns a set of URLs (urlSet). Finally, the URLs in urlSet are added to urlTable.
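GrabURLs() is also not shown in this post; a minimal sketch of how it could be written is given below. The regular expression and the use of the Uri class to resolve relative links against parentURL are assumptions:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of GrabURLs: pull href values out of the HTML and convert
// them to absolute http/https URLs, using parentURL as the base for relative links.
HashSet<string> GrabURLs(string htmlData, string parentURL)
{
    var urls = new HashSet<string>();
    foreach (Match m in Regex.Matches(htmlData, @"href\s*=\s*[""']([^""'#]+)[""']",
                                      RegexOptions.IgnoreCase))
    {
        string href = m.Groups[1].Value;
        Uri absolute = null;
        if (parentURL != "")
        {   // Resolve relative links against the page they were found on.
            Uri.TryCreate(new Uri(parentURL), href, out absolute);
        }
        else
        {   Uri.TryCreate(href, UriKind.Absolute, out absolute);
        }
        if (absolute != null &&
            (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
        {
            urls.Add(absolute.AbsoluteUri);
        }
    }
    return urls;
}
Returning a HashSet<string> here mirrors the way urlTable works: duplicate links found on the same page are discarded before they ever reach the table.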
History
- September 11, 2014: Version 1.0