
CodeProject Article Scraper, Revisited

13 Nov 2011
New and improved! Keep an eye on your CodeProject articles and reputation without having to log onto CP.
Download CPAM3.zip - 807KB (last updated 05/11/2011)

Introduction

This article describes an assembly I've developed to help me keep a watchful eye over my articles, tips, and reputation points. Since its primary discovery mechanism involves the act of "scraping" the CodeProject web site for data, its day-to-day viability is subject to the whims of the CodeProject gods, and those whims, they are a-changin'.

Back in 2008, I wrote an article that described a program that performed essentially the same process - scraping the CodeProject web site so that someone could retrieve the current vote status of their articles. That version of the code has been rendered obsolete by CodeProject's continuing evolution, and the code needed to scrape relevant data from the site is now so different that I chose to write another article instead of performing a massive edit on the earlier one. I also wanted to give people the opportunity to compare the code between the two versions if they were so inclined.

Screenshot_01.jpg

CPAMLib - General Architecture

This library of code will allow you to scrape the articles, blogs, and tips, as well as keep track of your reputation scores (support for blogs, tips, and reputation points are the new items in this version of the code). The scraping code itself is multi-threaded and posts progress events (as you'll see in the sample app described in Part 2 of this article series).

In essence, this library maintains a collection of Articles and a collection of reputation objects, as well as tracking changes to the objects in those lists. Once scraped, the articles and reputations are persisted into an XML file. This way, you can track changes on application startup (if you wish to do so).
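Just to make the idea concrete, the save step amounts to something like the following. This is a minimal sketch under my own assumptions - the element names and file layout are illustrative, not necessarily what CPAMLib actually writes.

//--------------------------------------------------------------------------------
// Illustrative sketch only - element names and file layout are assumptions.
// (Requires System.Collections.Generic, System.Linq, and System.Xml.Linq.)
private static void SaveArticles(IEnumerable<Article> articles, string path)
{
    XElement root = new XElement("Articles",
        articles.Select(a => new XElement("Article",
            new XElement("Title",  a.Title),
            new XElement("Url",    a.Url),
            new XElement("Rating", a.Rating),
            new XElement("Views",  a.Views))));

    // persisted between runs so changes can be detected at the next startup
    root.Save(path);
}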

To make things easy on the WPF guys, all of the collections are derived from ObservableCollection. Beyond that, however, there is no direct support for WPF, but that shouldn't prevent you from writing a WPF application that uses this library. In all actuality, I started to do just that, but got more interested in seeing some results, so I kind of abandoned the WPF project. I'll talk more about that in Part 2.

Finally, I'm not going to say any of this stuff is really very clever, but it does work, and most of the time, that's really all that counts.

The Article Class

This class represents one of three items - an article, a tip/trick, or a blog entry. Its task is simply to contain the properties for those items, and to determine whether those property values have changed since the prior scrape.

Initialization

Although it wasn't really necessary, I provided five constructors. One accepts no parameters (and really shouldn't be used, so its access level is private), one accepts an XElement parameter (for use when loading the properties from an XML data file), and the other three are identical except for the last parameter, which allows you to set the group the article object belongs to (article, blog, or tip). This parameter can be specified as the appropriate enum, an integer representing the ordinal value of the enum, or as a string that represents the enum item's name.

//--------------------------------------------------------------------------------
private Article()

//--------------------------------------------------------------------------------
public Article(XElement value)

//--------------------------------------------------------------------------------
public Article(string title, string desc, string url, DateTime posted, DateTime updated, 
                int votes, int views, int bookmarks, decimal rating, decimal popularity, 
                int downloads, ItemGroup group)

//--------------------------------------------------------------------------------
public Article(string title, string desc, string url, DateTime posted, DateTime updated, 
                int votes, int views, int bookmarks, decimal rating, decimal popularity, 
                int downloads, int group)

//--------------------------------------------------------------------------------
public Article(string title, string desc, string url, DateTime posted, DateTime updated, 
                int votes, int views, int bookmarks, decimal rating, decimal popularity, 
                int downloads, string group)

Since all of the overloads do exactly the same thing, they all call the InitObject method, which sets the properties using the specified parameters.

//--------------------------------------------------------------------------------
private void InitObject(string title, string desc, string url, DateTime posted, DateTime updated, 
                        int votes, int views, int bookmarks, decimal rating, decimal popularity, 
                        int downloads, ItemGroup group)
{
    this.RecentChanges = new ChangesDictionary();
    this.Title         = title;
    this.Description   = desc;
    this.Url           = url;
    this.DatePosted    = posted;
    this.LastUpdated   = updated;
    this.Votes         = votes;
    this.Views         = views;
    this.Bookmarks     = bookmarks;
    this.Rating        = rating;
    this.Popularity    = popularity;
    this.Group         = group;
    this.Downloads     = downloads;
    this.TimeUpdated   = new DateTime(0);
}
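
The int and string overloads presumably just convert their last argument to the ItemGroup enum before handing things off to the same initializer. A conversion along these lines would do it - this is my own sketch, and the library's actual conversion code may differ:

//--------------------------------------------------------------------------------
// Hypothetical helpers - the library's actual conversion code may differ.
private static ItemGroup GroupFromOrdinal(int group)
{
    // the ordinal value of the enum item, e.g. 0, 1, 2
    return (ItemGroup)group;
}

private static ItemGroup GroupFromName(string group)
{
    // case-insensitive match on the enum item's name, e.g. "Articles"
    return (ItemGroup)Enum.Parse(typeof(ItemGroup), group, true);
}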

Tracking Property Values

All trackable properties (rating, views, popularity, etc) have their current value as well as their prior value (if any) represented. This allows you to illustrate value changes in your application. Because these values are being tracked, we can also determine which of the articles in a given group have the highest rating or popularity, or for any of the other tracked properties. When the article is scraped, the values are updated with the ApplyChanges method:

public void ApplyChanges(Article incoming, DateTime updateTime)
{
    // since the url is always unique, we can use it as an identifier
    if (this.Url.ToLower() == incoming.Url.ToLower())
    {
        // Update the changes dictionary
        RecentChanges.AddUpdate(DataItem.Bookmarks,   ChangedValue(this.Bookmarks,   incoming.Bookmarks));
        RecentChanges.AddUpdate(DataItem.Description, ChangedValue(this.Description, incoming.Description));
        RecentChanges.AddUpdate(DataItem.Downloads,   ChangedValue(this.Downloads,   incoming.Downloads));
        RecentChanges.AddUpdate(DataItem.LastUpdated, ChangedValue(this.LastUpdated, incoming.LastUpdated));
        RecentChanges.AddUpdate(DataItem.Popularity,  ChangedValue(this.Popularity,  incoming.Popularity));
        RecentChanges.AddUpdate(DataItem.Rating,      ChangedValue(this.Rating,      incoming.Rating));
        RecentChanges.AddUpdate(DataItem.Votes,       ChangedValue(this.Votes,       incoming.Votes));
        RecentChanges.AddUpdate(DataItem.Views,       ChangedValue(this.Views,       incoming.Views, this.ViewChangeThreshold));
        RecentChanges.AddUpdate(DataItem.Title,       ChangedValue(this.Title,       incoming.Title));

        // set the properties to the new values
        this.Downloads   = incoming.Downloads;
        this.Bookmarks   = incoming.Bookmarks;
        this.LastUpdated = incoming.LastUpdated;
        this.Popularity  = incoming.Popularity;
        this.Rating      = incoming.Rating;
        this.Votes       = incoming.Votes;
        this.Views       = incoming.Views;
        this.Title       = incoming.Title;
        this.Description = incoming.Description;

        // log it
        this.TimeUpdated = updateTime;
    }
}

The RecentChanges property represents a Dictionary collection which allows us to maintain any property that we might want to track. I used a Dictionary because, in its simplest form, the data was representable as a key/value pair (the property and its value).

Since I wanted to make the library usable by a WPF application, I used a class I found on the internet called ObservableDictionary. Please see the section below that talks about code I used that I didn't write.

The previous method calls the ChangedValue method for each tracked property to determine whether or not a value has changed, and in which direction. This info is used by the UI to determine what to display and how to display it after an update scrape.

//--------------------------------------------------------------------------------
private ChangeType ChangedValue(int currentValue, int newValue, int threshold)
{
    ChangeType changeType = ChangeType.None;
    int changeAmount = newValue - currentValue;
    if (changeAmount >= threshold)
    {
        changeType = ChangeType.Up;
    }
    else if (changeAmount < 0 && Math.Abs(changeAmount) > threshold)
    {
        changeType = ChangeType.Down;
    }
    return changeType;
}

//--------------------------------------------------------------------------------
private ChangeType ChangedValue(int currentValue, int newValue)
{
    return DetermineChangeType(currentValue.CompareTo(newValue));
}

//--------------------------------------------------------------------------------
private ChangeType ChangedValue(DateTime currentValue, DateTime newValue)
{
    return DetermineChangeType(currentValue.CompareTo(newValue));
}

//--------------------------------------------------------------------------------
private ChangeType ChangedValue(decimal currentValue, decimal newValue)
{
    return DetermineChangeType(currentValue.CompareTo(newValue));
}

//--------------------------------------------------------------------------------
private ChangeType ChangedValue(string currentValue, string newValue)
{
    ChangeType changeType = ChangeType.None;
    if (newValue != currentValue)
    {
        changeType = ChangeType.Changed;
    }
    return changeType;
}

//--------------------------------------------------------------------------------
private ChangeType DetermineChangeType(int compareResult)
{
    ChangeType changeType = ChangeType.None;
    switch (compareResult)
    {
        case -1 : changeType = ChangeType.Up;   break;
        case 0  : changeType = ChangeType.None; break;
        case 1  : changeType = ChangeType.Down; break;
    }
    return changeType;
}

To support the UI, the following method determines if the specified item has a new value.

//--------------------------------------------------------------------------------
public bool GetFieldChanged(DataItem dataItem)
{
    ChangeType changeType = ChangeType.None;
    try
    {
        if (RecentChanges.ContainsKey(dataItem))
        {
            changeType = RecentChanges[dataItem];
        }
        else
        {
            RecentChanges.AddUpdate(dataItem, ChangeType.None);
            changeType = ChangeType.None;
        }
    }
    catch
    {
        throw new Exception (string.Format("DataItem {0} is invalid.", dataItem.ToString()));
    }
    return (changeType != ChangeType.None);
}
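
A hypothetical use from the UI side - highlight a cell only if its value changed during the last scrape (the color and markup here are made up for illustration):

// html is the StringBuilder the view markup is being built into
html.Append(article.GetFieldChanged(DataItem.Rating)
            ? "<td style='background-color:#D6E4F0;'>"   // changed - shade of blue
            : "<td>");                                    // unchanged
html.Append(article.Rating).Append("</td>");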

As you can see, the article object simply contains the data we'll be interested in from the UI point of view.

The ArticleManager Class

This class represents a list of article objects. It is responsible for saving them to, and loading them from, a file on the local disk. It is also responsible for scraping the info from CodeProject.

Initialization is a straightforward affair. First the constructor, which calls the discrete initialization methods:

//--------------------------------------------------------------------------------
public ArticleManager()
{
    // clear out any existing items in the list
    Clear();

    InitReputationList();
    InitAverages();
    InitScraperURLs();

    this.SaveScrapeResults = true;
    this.ScrapePosted = true;

	// for testing
    //this.SaveScrapeResults = false;
}

The initialization methods simply prepare the object for work. It's really not very interesting, but I did leave the method comments in the code block below so you can see what's what.

//--------------------------------------------------------------------------------
// Sets up the list of URLs that we need to actually scrape data from 
// CodeProject.  Putting them in a dictionary mitigates the inevitable typos 
// that tend to creep into string variables.
private void InitScraperURLs()
{
	if (this.m_scraperURLs == null)
	{
		this.m_scraperURLs = new Dictionary<ScraperType, string>();
		this.m_scraperURLs.Add(ScraperType.MyArticles, "http://www.codeproject.com/script/Articles/MemberArticles.aspx?amid={0}");
		this.m_scraperURLs.Add(ScraperType.User,       "http://www.codeproject.com/script/Membership/View.aspx?mid={0}");
	}
}

//--------------------------------------------------------------------------------
// Initializes the list (actually, it's an observable dictionary) of average values.
private void InitAverages()
{
	if (Averages == null)
	{
		Averages = new AveragesDictionary();
		Averages.Add("ArticlesRating",     0M);
		Averages.Add("ArticlesPopularity", 0M);
		Averages.Add("TipsRating",         0M);
		Averages.Add("TipsPopularity",     0M);
		Averages.Add("BlogsRating",        0M);
		Averages.Add("BlogsPopularity",    0M);
		Averages.Add("OverallRating",      0M);
		Averages.Add("OverallPopularity",  0M);
	}
}

//--------------------------------------------------------------------------------
//  Initialize the list of reputation categories. 
private void InitReputationList()
{
	// because of the ReputationList.AddOrUpdate method, we don't have to do 
	// anything else regarding initialization.
	if (this.Reputations == null)
	{
		this.Reputations = new ReputationList();
	}
}

The actual act of scraping CodeProject is run in a thread so it can be interrupted. To that end, I set up a thread method. I was having problems with CodeProject taking too long to respond and failing to retrieve info as a result, and this spurred me into implementing the automatic retries as well as the HtmlAgilityPack code.

//--------------------------------------------------------------------------------
public void ScrapeWebPage(string userID)
{
    // set our maximum connection retries
    this.m_maxConnectTries = (this.AutoRefresh) ? 10 : 3;
    // set the current user ID
    this.UserID = userID;

    // if the thread isn't null
    if (this.ScraperThread != null)
    {
        // if the thread is currently running
        if (this.ScraperThread.ThreadState == System.Threading.ThreadState.Running)
        {
            // abort the thread
            try
            {
                this.ScraperThread.Abort();
            }
            catch (Exception)
            {
                // we don't care about the exceptions here
            }
        }
        // set it to null
        this.ScraperThread = null; 
    }
    // start fresh
    this.ScraperThread = new Thread(new ThreadStart(ScrapeArticles));
    this.ScraperThread.Start();
}

//--------------------------------------------------------------------------------
// The actual thread delegate
private void ScrapeArticles()
{
    bool alreadyHasData = (this.Count > 0);
    this.LastScrapeResult = ScrapeResult.Fail;
    if (!ValidUserID())
    {
        return;
    }
    this.m_validUser = true;

    // get reputation
    GetUserInfo(true);
    GetArticles();

    RaiseEventThreadComplete();
}

//--------------------------------------------------------------------------------
// Allows us to abort the thread from the UI if necessary.
public void Abort()
{
    if (this.ScraperThread != null && this.ScraperThread.ThreadState == System.Threading.ThreadState.Running)
    {
        this.ScraperThread.Abort();
    }
}

Parsing the data is a drawn-out affair. Here's an overview of how it goes:

  • The user's profile page is scraped to make sure the ID is valid. If it isn't, all further scraping is aborted.
  • User info (reputation scores) is retrieved.
  • Articles, blogs and tips are retrieved. Due to the nature of the page layout, all of this info is scraped at one time, but the info is parsed for each group separately.
  • If necessary, the article original post dates are scraped from each individual article page. This is only done if this is the first time the article appears in the list. One item of interest is that Chris has offered to start including the posted dates on the My Articles page with all of the other info. I've modified the code to take advantage of this when/if it actually happens, and until then, the previously described method will continue to work.
  • Article group averages are calculated (for use by the UI if desired).
  • Article high values are calculated (for use by the UI if desired).
  • During the processing, periodic progress events are posted so that the UI can update itself to reflect what's happening.

Due to the breadth of the code necessary to parse scraped info, I decided it would be best to refer you to the code itself.
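
That said, every page goes through the same basic HtmlAgilityPack pattern - load the page, select the nodes of interest with XPath, and pull the text out. The snippet below is illustrative only; the real parsing code is far more involved, and the XPath expression here is made up (CodeProject's actual markup changes, which is precisely the problem):

//--------------------------------------------------------------------------------
// Illustrative only - the XPath below is an assumption, not CP's real markup.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(string.Format(m_scraperURLs[ScraperType.MyArticles], this.UserID));

// select the nodes of interest and pull their text out
HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table[@id='articles']//tr");
if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        HtmlNode link = row.SelectSingleNode(".//a");
        string title  = (link == null) ? string.Empty : link.InnerText.Trim();
        // ...build an Article from the remaining cells and call ApplyChanges...
    }
}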

The ReputationCategory Class

This class represents a reputation category, such as Author, Editor, etc. This item is responsible for determining it's status (platinum, gold, etc), and providing the appropriate display color (for the UI). It does this by maintaining status levels in a dictionary collection, and manipulating that collection:

//--------------------------------------------------------------------------------
// Set the current status for this category - this is done so that we can 
// establish the correct color displayed in the interface component. This is 
// called from the constructor and when the old points are updated.
private void SetCurrentStatus()
{
    if (this.StatusLevels == null)
    {
        BuildStatusLevels();
    }
    this.Status = (this.Points < StatusLevels[ReputationLevel.Platinum]) ? ReputationLevel.Gold   : ReputationLevel.Platinum;
    this.Status = (this.Points < StatusLevels[ReputationLevel.Gold])     ? ReputationLevel.Silver : this.Status;
    this.Status = (this.Points < StatusLevels[ReputationLevel.Silver])   ? ReputationLevel.Bronze : this.Status;
    this.Status = (this.Points < StatusLevels[ReputationLevel.Bronze])   ? ReputationLevel.None   : this.Status;
}

//--------------------------------------------------------------------------------
// Establishes the points levels that indicate what reputation color is 
// appropriate.
private void BuildStatusLevels()
{
    if (this.StatusLevels == null)
    {
        // There are two points spreads currently being used. Author and Authority 
        // use one point spread, and all of the other categories use another. We're 
        // going to create a dictionary that holds the point spread for this 
        // category, so we need to determine which one we're going to use.

        // assume this is a category type 1 (almost all of the categories conform to this)
        int categoryType = 1;

        // if this category is one of the "exceptional" categories, set the type to 2
        switch (this.Category)
        {
            case ReputationCategoryType.Author:
            case ReputationCategoryType.Authority:
                categoryType = 2;
                break;
        }

        // now create the dictionary
        this.StatusLevels = new ReputationStatusLevels();
        this.StatusLevels.Add(ReputationLevel.None,     0);
        this.StatusLevels.Add(ReputationLevel.Bronze,   (categoryType == 1) ? 100  : 50);
        this.StatusLevels.Add(ReputationLevel.Silver,   (categoryType == 1) ? 500  : 1000);
        this.StatusLevels.Add(ReputationLevel.Gold,     (categoryType == 1) ? 1500 : 5000);
        this.StatusLevels.Add(ReputationLevel.Platinum, (categoryType == 1) ? 2500 : 10000);
    }
}

//--------------------------------------------------------------------------------
// Gets the appropriate color for the status item based on the current points 
// value.
public string GetStatusColorForBrowser()
{
    string color;
    switch (this.Status)
    {
        case ReputationLevel.Bronze   : color = "#F4A460"; break;
        case ReputationLevel.Silver   : color = "#D3D3D3"; break;
        case ReputationLevel.Gold     : color = "#FFD700"; break;
        case ReputationLevel.Platinum : color = "#ADD8E6"; break;
        default                       : color = "#FFFFFF"; break;
    }
    return color;
}

The ReputationList Class

This class allows us to add or update the current values for each status item. In all actuality, this class is pretty boring and contains few methods that would bear any real scrutiny in this article, but in the interest of completeness, here's the code.

Here are the obligatory overloaded AddOrUpdate methods which actually add the reputation category to the collection.

//--------------------------------------------------------------------------------
/// Add or update the specified item as scraped from the web page
public void AddOrUpdate(ReputationCategoryType categoryType, int points)
{
    ReputationCategory item = Find(categoryType);
    if (item == null)
    {
        item = new ReputationCategory(categoryType, points);
        this.Add(item);
    }
    else
    {
        item.Points = points;
    }
}

//--------------------------------------------------------------------------------
/// Add or update the item specified by the XElement object (as loaded from the XML data file)
public void AddOrUpdate(XElement element)
{
    ReputationCategoryType categoryType = Globals.StringToEnum(element.GetValue("Name", 
	                                                           ReputationCategoryType.Unknown.ToString()), 
                                                               ReputationCategoryType.Unknown);
    int points = Convert.ToInt32(element.GetValue("Points", "0"));
    AddOrUpdate(categoryType, points);
}
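
Loading the list back from the saved data file is then just a loop over the saved elements. The "Reputation" element name below is an assumption about the data file, shown only for illustration:

// dataRoot is the XElement loaded from the saved data file (hypothetical layout)
foreach (XElement repElement in dataRoot.Elements("Reputation"))
{
    this.Reputations.AddOrUpdate(repElement);
}
this.Reputations.SetTotalPoints();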

This method lets us find (and return) a category in the list.

//--------------------------------------------------------------------------------
public ReputationCategory Find(ReputationCategoryType categoryType)
{
    foreach (ReputationCategory item in this)
    {
        if (item.Category == categoryType)
        {
            return item;
        }
    }
    return null;
}

And this method allows us to total the points for all categories.

//--------------------------------------------------------------------------------
public void SetTotalPoints()
{
    int total = 0;
    foreach(ReputationCategory category in this)
    {
        total += category.Points;
    }
    this.TotalPoints = total;
}

The ChangesDictionary Class

I added a method to this class that allows me to add new objects to, *or* update existing objects in the collection. Here's the method:

//--------------------------------------------------------------------------------
public void AddUpdate(DataItem dataItem, ChangeType changeType)
{
    if (this.ContainsKey(dataItem))
    {
        this[dataItem] = changeType;
    }
    else
    {
        this.Add(dataItem, changeType);
    }
}

Other Points of Interest

I hate typing, especially pointy brackets. For that reason, I almost always create a class derived from the appropriate collection type. This serves a double purpose of giving me a place to extend the collection to support the specific object type contained in the collection. This keeps me from having to cast my ass off (a pretty inefficient operation in .Net). Many times, there is no need to add any functionality, but that's okay - I think it's much easier to read the code when you're referring to something like this:

ChangesDictionary m_changesDictionary;

rather than this:

Dictionary<DataItem, ChangeType> m_changesDictionary;
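
Deriving the named type is a one-liner, and it gives the class a natural home for helpers like the AddUpdate method shown earlier. A sketch of the idea (CPAMLib's real ChangesDictionary derives from the ObservableDictionary class described later, but the principle is the same):

public class ChangesDictionary : Dictionary<DataItem, ChangeType>
{
    // type-specific helpers such as AddUpdate() live here
}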

The ExtensionMethods class

I like extension methods - they let you extend any class that comes with the .NET Framework. In my case, I extended the XElement and ObservableCollection classes. First, I wanted to make it simpler to get values from an XElement without cluttering up the code that's closer to the programmer. Since it's kind of self-defeating to simply process an exception when something doesn't happen as expected, I came up with this extension method for getting the value of an element and returning the specified default if the child element value doesn't exist. I have additional code (from another project) that overloads this method for various types other than a string object, but I didn't need them in this project, so they aren't included. If you want them, just ask.

//--------------------------------------------------------------------------------
public static string GetValue(this XElement root, string name, string defaultValue)
{
    return (string)root.Elements(name).FirstOrDefault() ?? defaultValue;
}
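
Typical usage when reading an article back from the data file might look like this (the element name is just an example):

// articleElement is an XElement for one saved article
string title = articleElement.GetValue("Title", "(untitled)");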

As far as ObservableCollection is concerned, I needed to be able to sort it, but it has no functionality for sorting, so I had to improvise. A quick web search turned up the following code on the MSDN forums. I tried to find it again, but couldn't, so you'll just have to trust me that I didn't come up with the code myself - I found it elsewhere. One caveat is that the items in the collection must implement IComparable.

//--------------------------------------------------------------------------------
//--------------------------------------------------------------------------------
public static void Sort<T>(this ObservableCollection<T> collection) where T : IComparable
{
    List<T> sorted = collection.OrderBy(x => x).ToList();
    for (int i = 0; i < sorted.Count; i++)
    {
        collection.Move(collection.IndexOf(sorted[i]), i);
    }
}

//--------------------------------------------------------------------------------
public static void Sort<T>(this ObservableCollection<T> collection, GenericComparer<T> comparer) where T : IComparable
{
    List<T> sorted = collection.ToList();
    sorted.Sort(comparer);
    for (int i = 0; i < sorted.Count; i++)
    {
        collection.Move(collection.IndexOf(sorted[i]), i);
    }
}
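
Usage is then a one-liner, provided the contained type implements IComparable (or you hand the second overload a GenericComparer):

ObservableCollection<Article> articles = new ObservableCollection<Article>();
// ...populate the collection...
articles.Sort();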

Other Code Not of My Own Design

I'm lazy. I don't want to have to write any more code than necessary, so I often scour the web for code that might already have been implemented. If it works for me, I use it. This project is no different, and here are the pieces I found somewhere else:

The HtmlAgilityPack Library

I earlier discussed the HtmlAgilityPack assembly and the ObservableDictionary class. In the case of the HtmlAgilityPack, I only included the compiled DLL. Its file date is Feb 2010, so you might want to see if there's an update to it. The version I supply with my code works with my code, and I have no personal desire to make sure I have the latest and greatest.

As of 12/2010, you could find the latest code here: HtmlAgilityPack on CodePlex

The GenericComparer Class

I honestly don't recall where I found this, but when I did a google search for "GenericComparer", I got back over 6500 results. If you want to find the nearest version to satisfy your curiosity, be my guest. I'm only including this text so that nobody can accuse me of claiming something as mine that is not, in fact, my own invention.

The ObservableDictionary Classes

In a desire to be as compatible with WPF as I could possibly be, I searched for and found the ObservableDictionary class. I *think* it came from Dr. WPF's blog. Here's the URL to the applicable Blog entry: Dr. WPF - Can I bind my ItemsControl to a dictionary?

My project contains a folder with this code in it, as well as a text file containing some observations and notes about the class. Keep in mind that this class generates warnings in VS2010 that I haven't bothered to address (once again, I remind you that I'm lazy, and since nothing bad appears to be happening as a result of the warnings, I've chosen to ignore them).

Final Comments About CPAMLib

This library is currently up-to-date (as of 10 December 2010) with all of the latest format and content changes made to the user profile page, the articles/tips/blogs list page, and the individual article pages themselves. I don't personally have any blog entries, so I admit to a gratuitous amount of laziness where testing that aspect of the scraping is concerned. In theory, it *should* work, but I haven't extensively tested it in that area.

I also put in support for a change that I *think* the CP dudes are going to implement regarding the original posted date of articles, tips, and blogs. Even if they don't do it, the code will still work fine (and can still scrape the article posted date, albeit much slower).

The Sample Application

To keep things simple, I decided to go with a WinForms application. This allowed me to borrow heavily from the original application in terms of layout, controls, and content. After all, there's really no point in completely re-inventing the wheel, and besides, it's REALLY hard to improve on perfection (grin). In other words, the old version of this app was essentially a pig beggin' for some lipstick and rouge.

The program grabs data scraped from the CodeProject web site (via the CPAMLib assembly) and displays it in a WebBrowser control. It also keeps track of changes, allows you to sort the displayed data on the various properties, and can auto-scrape the web site every hour. I don't personally let it run long enough to auto-scrape, but someone else may want that feature, so there it is.

Visual Presentation

If you're familiar with the original CPAM application, this one should look familiar.

Screenshot_02.jpg

There are three primary areas of the window:

0) The Settings panel - this panel allows the user to select sorting options, choose what is/isn't displayed, and control the actual scraping of the data.

1) The Averages panel - this panel shows the current average ratings of articles, tips, and blogs, as well as the average popularity of those groups of items.

2) The Main panel - this panel shows the list of articles, tips, and/or blogs, along with the appropriate status and statistic information for each individual item.

By default, the application is configured to show articles, blogs and tips, arranged in their individual groups, showing all of those items for the selected userID, and sorting them in descending order by rating. It is also initially configured to show the user's reputation scores.

Since all the truly dirty stuff is hidden inside the CPAM3Lib assembly, all we have to do is start the scrape process, wait for it to finish, and then display the results.
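
For instance, the completion handler doesn't need to do much more than marshal back onto the UI thread and trigger a redisplay. This is a simplified sketch of the idea, not the sample app's exact handler (which does a bit more housekeeping):

//--------------------------------------------------------------------------------
// Simplified sketch - the sample app's actual handler does more housekeeping.
void articleManager_ScrapeComplete(object sender, ScrapeEventArgs e)
{
    // the scrape runs on its own thread, so marshal back to the UI thread
    // before touching any controls
    DelegateUpdateForm method = new DelegateUpdateForm(UpdateFormControls);
    Invoke(method, ScrapeEvents.Complete, e);
}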

Data Members

The following data members are needed. As you can see, there isn't a lot to manage:

#region Data Members
private BackgroundWorker m_refreshWorker         = new BackgroundWorker();
private bool             m_hasNavigateMsgHandler = false;
#endregion Data Members

#region Custom Events and Delegates
private enum          ScrapeEvents { Progress, Fail, Complete, Start};
private event         TimeToGoEventHandler TimeToGo = delegate {};
private delegate void DelegateUpdateForm(ScrapeEvents scrapeEvent, ScrapeEventArgs e);
private delegate void DelegateUpdateStatusStripResult();
private delegate void DelegateUpdateStatusStripProgress(ScrapeEventArgs e);
private delegate void DelegateUpdateStatusStripTimeToGo(TimeToGoArgs e);
#endregion Custom Events and Delegates

Initialization

The constructor performs some necessary duties, such as initializing the article manager object (declared as a static data member of the globals class), reading historic data from the data file, hooking into the article manager's event pump, and initializing the controls on the form.
public Form1()
{
    InitializeComponent();

    // set up our data folder (where we store the last-scraped data)
    Globals.CreateAppDataFolder("CPAM3");
    Globals.Manager.AppDataPath = Globals.AppDataFolder;
    // Load the data (if we have any)
    Globals.Manager.LoadData();

    this.comboBoxSortCategory.SelectedIndex = this.comboBoxSortCategory.FindStringExact("Rating");
    InitListView();
    InitRefreshWorker();
    Globals.Manager.ScrapeComplete += new ScraperEventHandler(articleManager_ScrapeComplete);
    Globals.Manager.ScrapeProgress += new ScraperEventHandler(articleManager_ScrapeProgress);
    Globals.Manager.ScrapeFail     += new ScraperEventHandler(articleManager_ScrapeFail);
    this.TimeToGo                  += new TimeToGoEventHandler(Form1_TimeToGo);

    InitFormControls();
}

private void InitFormControls()
{
    this.textboxUserID.Text                 = CPAM3Browser.Settings.Default.UserID;
    this.checkBoxShowArticles.Checked       = CPAM3Browser.Settings.Default.ShowArticles;
    this.checkBoxShowTips.Checked           = CPAM3Browser.Settings.Default.ShowTips;
    this.checkBoxShowBlogs.Checked          = CPAM3Browser.Settings.Default.ShowBlogs;
    this.checkBoxShowInGroups.Checked       = CPAM3Browser.Settings.Default.ShowInGroups;
    this.checkNewInfo.Checked               = CPAM3Browser.Settings.Default.ShowChangesOnly;
    this.checkShowIcons.Checked             = CPAM3Browser.Settings.Default.ShowIcons;
    this.checkShowIconLegend.Checked        = CPAM3Browser.Settings.Default.ShowIconLegend;
    this.checkShowReputation.Checked        = CPAM3Browser.Settings.Default.ShowReputation;
    this.checkboxSortDescending.Checked     = CPAM3Browser.Settings.Default.SortDescending;
    this.checkAutoRefresh.Checked           = (string.IsNullOrEmpty(this.textboxUserID.Text)) 
	                                          ? false 
                                              : CPAM3Browser.Settings.Default.AutoRefresh;
    this.comboBoxSortCategory.SelectedIndex = CPAM3Browser.Settings.Default.SortColumn;
}

Automatic Refresh

You have the option of having the application automatically refresh the results every 60 minutes. If this is turned on, a background worker object sits and spins, kicking off a refresh at the top of every hour. I've enabled all of the events for the background worker, but the app currently only handles the progress event.

//--------------------------------------------------------------------------------
private void InitRefreshWorker()
{
    this.m_refreshWorker.WorkerReportsProgress      = true;
    this.m_refreshWorker.WorkerSupportsCancellation = true;
    this.m_refreshWorker.RunWorkerCompleted        += new RunWorkerCompletedEventHandler(refreshWorker_RunWorkerCompleted);
    this.m_refreshWorker.ProgressChanged           += new ProgressChangedEventHandler(refreshWorker_ProgressChanged);
    this.m_refreshWorker.DoWork                    += new DoWorkEventHandler(refreshWorker_DoWork);
}

//--------------------------------------------------------------------------------
// Fired when the user checks the auto-refresh checkbox. It sits and spins 
// waiting for the next scrape time (every hour).
void refreshWorker_DoWork(object sender, DoWorkEventArgs e)
{
    BackgroundWorker worker = sender as BackgroundWorker;
    //Globals.Manager.SupressWarnings = true;
    DateTime now      = DateTime.Now;
    DateTime nextTime = new DateTime(0);
    int      interval = 1000;
    int      updateMinutes = 60;

    do
    {
        if (!worker.CancellationPending && now >= nextTime)
        {
            DelegateUpdateForm method = new DelegateUpdateForm(UpdateFormControls);
            Invoke(method, ScrapeEvents.Start, new ScrapeEventArgs(""));
            Globals.Manager.ScrapeWebPage(this.textboxUserID.Text);

            // We do NOT support the rescan for tips posted dates here because I 
            // didn't feel like dealing with the invoke stuff.

            if (nextTime.Ticks == 0)
            {
                TimeSpan span = new TimeSpan(0, updateMinutes - now.Minute, 0);
                // for testing
                if (updateMinutes > 5)
                {
                    nextTime = now.AddMinutes((span.Minutes < 5) 
                               ? updateMinutes + span.Minutes 
                               : span.Minutes);
                }
                else
                {
                    nextTime = now.AddMinutes(span.Minutes);
                }
            }
            else
            {
                nextTime = now.AddMinutes(updateMinutes);
            }
        }
        if (!worker.CancellationPending)
        {
            RaiseEventTimeToGo(nextTime);
            Thread.Sleep(interval);
            now = DateTime.Now;
        }
    } while (!worker.CancellationPending);
}

//--------------------------------------------------------------------------------
void refreshWorker_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
}

//--------------------------------------------------------------------------------
void refreshWorker_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
}

The WebBrowser Control

The meat and potatoes of the application is the WebBrowser control. I figured HTML was a simple way to show the data without too much fuss on my part. Boy, was I wrong. First, we have to initialize the control. We use the about:blank URL to give the control someplace to browse to, and then we update the control.

//--------------------------------------------------------------------------------
private void InitListView()
{
    this.webBrowser1.Navigate("about:blank");
    HtmlDocument doc = this.webBrowser1.Document;
    doc.Write(string.Empty);
    if (Globals.Manager.Count > 0)
    {
        UpdateListView(true);
    }
}

//--------------------------------------------------------------------------------
/// Set the web browser anchor property
private void DisplayWebBrowser_Load(object sender, EventArgs e)
{
    // We have to do this because there's no way to set the anchor property 
    // using the designer
    this.Anchor = AnchorStyles.Bottom | 
    AnchorStyles.Left   | 
    AnchorStyles.Right  | 
    AnchorStyles.Top;
}

//--------------------------------------------------------------------------------
// Builds the html that is displayed in the web browser
public void UpdateListView(bool startingWithData)
{
    if (!startingWithData)
    {
        this.Text = string.Format("(last update - {0})", 
                                  Globals.Manager.UpdateTime.ToString("yyyy/MM/dd at hh:mm"));
    }

    // remove the message handler temporarily while we build and navigate to our 
    // web browser control
    if (m_hasNavigateMsgHandler)
    {
        this.webBrowser1.Navigating -= new WebBrowserNavigatingEventHandler(webBrowser1_Navigating);
    }

	// Determine what we need to do based on the current state of the 
	// applicable checkboxes
    ShowArticles    = this.checkBoxShowArticles.Checked;
    ShowBlogs       = this.checkBoxShowBlogs.Checked;
    ShowTips        = this.checkBoxShowTips.Checked;
    ShowChanges     = this.checkNewInfo.Checked;
    ShowIcons       = this.checkShowIcons.Checked;
    ShowIconLegends = this.checkShowIconLegend.Checked;
    ShowReputation  = this.checkShowReputation.Checked;

    // construct the html for our web browser control
    bool          foundGroup = false;
    string        html       = "";
    StringBuilder htmlAll    = new StringBuilder("");

    // first, check to see if we have anything to do
    if (Globals.Manager.Count <= 0)
    {
        htmlAll.Append("<html><body style='font-family:arial;'>");
        htmlAll.Append("No articles found. CodeProject might be temporarily unavailable.");
        htmlAll.Append("</body></html>");
    }
    else
    {
		// build the html we're going to display
        htmlAll.Append(BuildHtmlHeader());
        htmlAll.Append(BuildReputationHtml());

        // we have articles, blogs, and/or tips, so get to work
        // if the user wants to show articles
        if (ShowArticles)
        {
            // build the appropriate table
            html = BuildArticleHtml(ItemGroup.Articles, ref foundGroup);
            // and if the group was found, add it to the stringbuilder
            if (foundGroup)
            {
                htmlAll.Append(html);
            }
        }

        // if the user wants to show blogs
        if (this.ShowBlogs)
        {
            // build the appropriate table
            html = BuildArticleHtml(ItemGroup.Blogs, ref foundGroup);
            // and if the group was found, add it to the StringBuilder
            if (foundGroup)
            {
                htmlAll.Append(html);
            }
        }
        // if the user wants to show tips/tricks
        if (this.ShowTips)
        {
            // build the appropriate html
            html = BuildArticleHtml(ItemGroup.Tips, ref foundGroup);
            // and if the group was found, add it to the StringBuilder
            if (foundGroup)
            {
                htmlAll.Append(html);
            }
        }
        htmlAll.Append(BuildHtmlFooter());
    }

    // set our document text
    this.webBrowser1.DocumentText = htmlAll.ToString();

    // re-add the message handler so we can respond to link clicks on the 
    // article items that are displayed
    this.webBrowser1.Navigating += new WebBrowserNavigatingEventHandler(webBrowser1_Navigating);
}

In the interest of brevity, I'm not going to show the code that actually builds the HTML (you can see the builder methods being used in the code snippet above). The app is simply building output with the data contained in the article manager. I will mention, though, that to keep memory use down to a dull roar, I used StringBuilder objects to hold the HTML as it's being constructed.
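
To give you the flavor of it, the per-group building reduces to the classic StringBuilder loop - one row per article. This is illustrative only (hypothetical helper name, and the real BuildArticleHtml emits considerably more markup: icons, change highlighting, group totals, and so on):

//--------------------------------------------------------------------------------
// Illustrative only - the real BuildArticleHtml does considerably more.
private static string BuildSimpleTable(IEnumerable<Article> articles)
{
    StringBuilder html = new StringBuilder("<table>");
    foreach (Article article in articles)
    {
        html.Append("<tr><td><a href='").Append(article.Url).Append("'>")
            .Append(article.Title).Append("</a></td>")
            .Append("<td>").Append(article.Rating).Append("</td>")
            .Append("<td>").Append(article.Views).Append("</td></tr>");
    }
    html.Append("</table>");
    return html.ToString();
}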

What The Icons Mean

In the legend below, "article" should be taken to mean either an article, tip, or blog entry. All of the groups are individually analyzed, and therefore, each group has its own icons.

new_32.png - Indicates a new article. All articles will display as new when you initially start the application.

bestrating_32.png - Indicates the article with the best rating.

lowrate_32.png - Indicates the article with the worst rating.

votes_32.png - Indicates the article with the most votes.

viewed_32.png - Indicates the article with the most page views.

popular_32.png - Indicates the most popular article.

bookmark_32.png - Indicates the article with the most bookmarks.

download_32.png - Indicates the article with the most downloads (not provided by CP yet)

up_32.png - Indicates that the associated field increased in value.

down_32.png - Indicates that the associated field decreased in value.

Features of the Sample Application

  • Scrape status, time to next scrape event, and current scrape progress are all reported in the status bar at the bottom of the window.
  • To keep the reported changes to something that I consider sane, the view count has to increase by at least 10 before it's reported as "changed" in the statistics. You can increase or decrease this value via the private Article.ViewChangeThreshold data member.
  • Initially, the program displays articles sorted by rating in descending order. You can change this at any time by selecting a different property in the combo box at the top of the form.
  • Articles that report changes are displayed with backgrounds that are shades of blue, as opposed to unchanged articles that are white/gray. Just so you can see the differences side-by-side, here's a screen shot:
Screenshot_03.jpg

History

  • 11/13/2011 - Another site change that a) renamed some element IDs, and b) exposed a coding error involving cleaning the data before parsing it.
  • 05/11/2011 - I fixed the titlebar (it was showing weird info), and fixed an issue where alternate tips were being counted in the tip/trick rating and popularity averages showing at the top of the form. Since they don't acquire votes, they're not supposed to be counted in the average values.
  • 05/05/2011 - It seems that the averages shown in the statistics box at the top right corner of the form were displaying questionable values. I modified the way the data source was being initialized, and jerked the averages back into their proper alignment.
  • 05/04/2011 - Added support for sorting by number of downloads. I also added a new line to each group table that shows the total votes, views, bookmarks, and downloads for each group (articles, blogs, and tips).
  • 03/27/2011 - Added support for number of downloads. It *should* adapt with no problems, but if things look strange, just delete C:\Program Data\CPAM3\*.xml, and rerun the program. It will create a new file, and you'll be good to go.
  • 03/22/2011 - The fix I made yesterday was thwarted by a simple misspelling of the word "Organiser" (CP spells it wrong - it should be "Organizer" - but I have to conform to their usage if I want my code to work). So if you've already run the old new code once, you have to run this version twice before it fixes the display problem. The problem is that a new reputation points category called "Unknown" is displayed in the first table, and the Organizer points aren't updated, leading to an incorrect total points value. In any case, this fixes those problems, but remember, you have to run it twice before you see it completely fixed.
  • 03/22/2011 - The user profile page was changed a little, and that broke this application. I had to completely change ArticleManager.ParseReputationScores to effect the necessary modifications. If you have any problems, let me know in the forum below.
  • 12/29/2010 - I discovered that alternate tip/tricks do NOT accumulate a view count (yet), so I had to add a method that normalized empty strings to contain at least a valid numeric value of zero ("0"). This same fix also addressed the issue that was causing the posted dates on tips/tricks to always come up as 01/01/1990.
  • 12/13/2010 - Solved a problem that was causing the application to completely fail after a site update was made. Also fixed the bug that caused problems for people that weren't in the US (thanks Petr!). Many thanks to Chris Maunder for helping out with the format stuff as well. I was really surprised that he'd gotten around to it so quickly.
  • 12/12/2010 - The article list page format was changed, and I'm working out the issues with CP admins. In the meantime, the download for this article has been disabled until the problems are resolved.
  • 12/11/2010 - Original submission.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here