Reputationator - Part 1 of 4 [^]
Reputationator - Part 2 of 4 (this article)
Reputationator - Part 3 of 4 [^]
Reputationator - Part 4 of 4 [^]
Introduction
This is a continuation of the previous article in the series, and discusses, in detail, how the reputation scraper object works in the Reputationator software suite, as well as incidental objects of interest. If you're not really interested in how it works, you can skip to Part 3, which describes the Windows Forms application. I promise you that I won't be upset if you skip this part of the series, and I will understand completely.
The RepScraper Class
The RepScraper class performs the scraping process on your CodeProject profile page. It's located in the ReputationLib assembly because both the Windows service (described in Part 1) and the Windows Forms application (described in Part 3) require access to it. In a nutshell, this object performs the following functions:
- Find and scrape the user's CodeProject profile page for reputation info
- Parse the returned HTML in order to locate the reputation points
- Add or update the points in the RepItemCollection object
- Send a custom event indicating the status of the scraping process
Scraping CodeProject
The first thing we have to do is scrape the user's CodeProject profile page. Where you see the rendered profile page, the scraping code sees thousands of lines of returned HTML, and we're interested only in the part that contains the reputation table.
In order to substantially ease the pain here, I make use of the HtmlAgilityPack. It handles all the nastiness of connecting to CodeProject and returning the response. If you need to do the same thing, I can't recommend this library highly enough. The first thing we do is set up our agility pack object. To keep the UI flowing smoothly, the act of scraping the web page is performed in a separate thread. Here's the method that constructs and starts the thread:
public void ScrapeWebPage()
{
if (this.ScraperThread != null)
{
if (this.ScraperThread.ThreadState == System.Threading.ThreadState.Running)
{
try
{
this.ScraperThread.Abort();
}
catch (Exception)
{
}
}
this.ScraperThread = null;
}
this.ScraperThread = new Thread(new ThreadStart(ScrapeUser));
this.ScraperThread.Start();
}
And here's the actual thread method:
private void ScrapeUser()
{
FindUserName();
RaiseEventThreadComplete();
}
Which calls the method that does the actual work. The URL is constructed using your CodeProject UserID, and the HtmlAgilityPack browses to the specified URL. When it returns, we take the resulting HtmlDocument, and pass it to the overload of this method for partial parsing.
NOTE: This code was borrowed from my CPAM3 code, and I retained the original cascading method calls because I was too lazy to make it more of a purpose-built solution.
private void FindUserName()
{
string url = string.Format("http://www.codeproject.com/script/Membership/View.aspx?mid={0}", Globals.UserID);
HtmlAgilityPack.HtmlDocument hdoc = BrowseToUrl(url);
if (hdoc != null)
{
FindUserName(hdoc);
}
}
private void FindUserName(HtmlAgilityPack.HtmlDocument hDoc)
{
string data = hDoc.DocumentNode.OuterHtml;
string userName = "";
if (!string.IsNullOrEmpty(data))
{
int pos = data.IndexOf("<title>");
if (pos >= 0)
{
int start = data.IndexOf(">", pos) + 1;
int stop = data.IndexOf(" - Professional Profile", start);
if (stop >= 0)
{
userName = data.Substring(start, stop-start);
pos = data.IndexOf("\"ctl00_MC_Prof_TotalRep\"", stop);
if (pos >= 0)
{
start = data.IndexOf(">", pos) + 1;
stop = data.IndexOf("</span>", start);
if (stop >= 0)
{
this.m_locatorPoints = data.Substring(start, stop-start);
}
}
}
}
}
this.m_mainUser = userName;
Int32.TryParse(this.m_locatorPoints.Replace(",",""), out this.m_userPoints);
if (!string.IsNullOrEmpty(this.m_mainUser) && !string.IsNullOrEmpty(this.m_locatorPoints))
{
GetReputation(hDoc);
}
else
{
throw new Exception("Could not determine user name.");
}
}
Parsing the Data
Now that we have our data for parsing, we can extract the information we're interested in. Once again, this code is borrowed heavily from my CPAM3 article, with only the most minimal effort invested into customizing it for this application. As you can see, there's not much at all to
this code (thank you, HtmlAgilityPack).
private void GetReputation(HtmlAgilityPack.HtmlDocument hDoc)
{
HtmlAgilityPack.HtmlNode node = hDoc.GetElementbyId("About");
ParseReputationScores(node);
}
private void ParseReputationScores(HtmlAgilityPack.HtmlNode node)
{
if (node == null)
{
return;
}
int start = node.InnerHtml.IndexOf("<table class=\"member-rep-list\">");
int stop = -1;
int total = 0;
if (start >= 0)
{
stop = node.InnerHtml.IndexOf("</table>", start);
string source = node.InnerHtml.Substring(start, (stop + 8) - start); // length runs through the end of "</table>" (8 characters)
if (GetPointsForCategory("Author", ref source, ref total) < 0 ||
GetPointsForCategory("Authority", ref source, ref total) < 0 ||
GetPointsForCategory("Debator", ref source, ref total) < 0 ||
GetPointsForCategory("Editor", ref source, ref total) < 0 ||
GetPointsForCategory("Enquirer", ref source, ref total) < 0 ||
GetPointsForCategory("Organiser", ref source, ref total) < 0 ||
GetPointsForCategory("Participant", ref source, ref total) < 0)
{
total = -1;
}
}
this.Reputations.AddOrUpdate(RepCategory.Total, DateTime.Now.Date, Globals.UserID, total);
this.Reputations.UpdateDatabase();
}
The GetPointsForCategory method parses the specified node looking for the specified reputation category. I'm making a LOT of assumptions here. First, I'm assuming that if we find the category in the table, we must have points for that category, so after finding it, I assume that the first instance of a "<b>" is the beginning of the points string. If the points value can't be parsed, we return -1 to indicate a problem.
private int GetPointsForCategory(string category, ref string source, ref int total)
{
string points = "0";
int pos = source.IndexOf("#"+category);
int start = -1;
int stop = -1;
if (pos >= 0)
{
// check the IndexOf result before adding the tag length; otherwise a failed
// search (-1) would slip through as start == 2
start = source.IndexOf("<b>", pos);
if (start >= 0)
{
start += 3;
stop = source.IndexOf("</b>", start);
if (stop >= 0)
{
points = source.Substring(start, stop-start).Replace(",", "");
}
}
}
RepCategory rcType = Globals.StringToEnum(category, RepCategory.Unknown);
int pts;
if (!Int32.TryParse(points, out pts))
{
pts = -1;
}
this.Reputations.AddOrUpdate(rcType, DateTime.Now.Date, Globals.UserID, pts);
if (pts >= 0)
{
total += pts;
}
return pts;
}
Updating the Reputation Item Collection
The ParseReputationScores method described above contains the code that actually updates the collection and the database.
Sending the Event
The following methods are responsible for raising the various scrape status events.
// guard against the case where nobody has subscribed to a given event
private void RaiseEventThreadComplete()
{
if (ScrapeComplete != null)
{
ScrapeComplete(this, new ScrapeEventArgs("Complete"));
}
}
private void RaiseEventThreadFail()
{
if (ScrapeFail != null)
{
ScrapeFail(this, new ScrapeEventArgs("Failed"));
}
}
private void RaiseEventThreadProgress(string message)
{
if (ScrapeProgress != null)
{
ScrapeProgress(this, new ScrapeEventArgs(message));
}
}
Other Classes
AppEventLog
The AppEventLog class is another one of those "long-in-the-tooth" static classes I wrote many moons ago, and it provides functionality to write to the application logs in Windows. I generally only use it for Windows services, but it can be used for desktop applications as well. Since this class isn't the focus of the article, I'm not going to describe it or mention it any further, but rest assured that it has comments in the file, and you can see an example of its use in the Windows service source files.
Extension Method Classes
ExtendDateTime Class
This class contains two methods.
bool Between(this DateTime date, DateTime from, DateTime to, bool inclusive)
This method allows me to determine if a given date is between two specified dates.
public static bool Between(this DateTime date, DateTime from, DateTime to, bool inclusive)
{
if (!inclusive)
{
from = from.AddDays(1);
to = to.AddDays(-1);
}
return (date.Date >= from.Date && date.Date <= to.Date);
}
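Here's a quick self-contained sketch showing the effect of the inclusive flag (the extension method is reproduced so the snippet compiles on its own):

```csharp
using System;

static class ExtendDateTime
{
    // Reproduced from above so this sketch stands alone
    public static bool Between(this DateTime date, DateTime from, DateTime to, bool inclusive)
    {
        if (!inclusive)
        {
            from = from.AddDays(1);
            to = to.AddDays(-1);
        }
        return (date.Date >= from.Date && date.Date <= to.Date);
    }
}

class BetweenDemo
{
    static void Main()
    {
        DateTime aug1 = new DateTime(2011, 8, 1);
        DateTime aug6 = new DateTime(2011, 8, 6);

        // a boundary date only counts when inclusive is true
        Console.WriteLine(aug1.Between(aug1, aug6, true));  // True
        Console.WriteLine(aug1.Between(aug1, aug6, false)); // False
    }
}
```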
DateTime WeekStartDate(this DateTime date)
This method returns the date of the start of the current week. For instance, if you had a DateTime object that was currently set to 6 Aug 2011, this method would return 31 July 2011 (if your beginning of week is set to Sunday).
public static DateTime WeekStartDate(this DateTime date)
{
DateTime startDate = new DateTime(0);
CultureInfo culture = Thread.CurrentThread.CurrentCulture;
DayOfWeek firstDay = culture.DateTimeFormat.FirstDayOfWeek;
DayOfWeek today = date.DayOfWeek;
startDate = date.AddDays(firstDay - today).Date;
return startDate;
}
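Because the method consults the current thread's culture, the result depends on that culture's FirstDayOfWeek. Here's a self-contained sketch that pins the culture so the result is predictable:

```csharp
using System;
using System.Globalization;
using System.Threading;

static class ExtendDateTime
{
    // Reproduced from above so this sketch stands alone
    public static DateTime WeekStartDate(this DateTime date)
    {
        CultureInfo culture = Thread.CurrentThread.CurrentCulture;
        DayOfWeek firstDay = culture.DateTimeFormat.FirstDayOfWeek;
        DayOfWeek today = date.DayOfWeek;
        return date.AddDays(firstDay - today).Date;
    }
}

class WeekStartDemo
{
    static void Main()
    {
        // en-US weeks start on Sunday; 6 Aug 2011 is a Saturday
        Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
        DateTime start = new DateTime(2011, 8, 6).WeekStartDate();
        Console.WriteLine(start.ToString("yyyy-MM-dd")); // 2011-07-31
    }
}
```

One thing to be aware of: in a Monday-first culture, a date that falls on a Sunday yields the *following* Monday rather than the preceding one, because DayOfWeek.Sunday is numerically zero.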
Globals
Globals is a static class that contains a number of utility methods that are useful from pretty much everywhere in the various assemblies, but that are NOT extension methods.
public static string AppPath()
This method returns the path to the application (or more accurately, the executing application).
public static string AppPath()
{
return System.IO.Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
}
public static string CreateAppDataFolder(string folderName)
Create (if necessary) the specified application data folder. This method only creates the root folder, and will throw an exception if more than one folder is specified. For instance, "\MyApp" is valid, but "\MyApp\MySubFolder" is not. It returns the fully qualified path that was created (or that already existed). This code is kinda ancient, but I include it in pretty much all of my applications.
public static string CreateAppDataFolder(string folderName)
{
string appDataPath = "";
string dataFilePath = "";
folderName = folderName.Trim();
if (folderName != "")
{
try
{
appDataPath = System.Environment.GetFolderPath(m_specialFolder);
}
catch (Exception)
{
// rethrow without resetting the stack trace
throw;
}
if (folderName.Contains("\\"))
{
string[] path = folderName.Split('\\');
int folderCount = 0;
int folderIndex = -1;
for (int i = 0; i < path.Length; i++)
{
string folder = path[i];
if (folder != "")
{
if (folderIndex == -1)
{
folderIndex = i;
}
folderCount++;
}
}
if (folderCount != 1)
{
throw new Exception("Invalid folder name specified (this function only creates the root app data folder for the application).");
}
folderName = path[folderIndex];
}
}
if (folderName == "")
{
throw new Exception("Processed folder name resulted in an empty string.");
}
try
{
dataFilePath = System.IO.Path.Combine(appDataPath, folderName);
if (!Directory.Exists(dataFilePath))
{
Directory.CreateDirectory(dataFilePath);
}
}
catch (Exception)
{
// rethrow without resetting the stack trace
throw;
}
AppDataFolder = dataFilePath;
return dataFilePath;
}
ToEnum Methods
The purpose of these methods is to allow the programmer to initialize a data member of a specified enumerator type from a value contained in the ordinal list. The problem they address is that if the programmer retrieves an enum ordinal value as an int type, and wants to initialize an enum data member, he really has no programmatic idea if the value represents a valid ordinal. He simply tries to set it, and hopes for the best (handling an exception if the assignment goes sideways on him). For a more complete discussion of these methods, go look at my tip/trick regarding them:
Setting Enumerators From Questionable Data Sources (for C# and VB) [^]
public static T IntToEnum<T>(int value, T defaultValue)
{
T enumValue = (Enum.IsDefined(typeof(T), value)) ? (T)(object)value : defaultValue;
return enumValue;
}
public static T StringToEnum<T>(string value, T defaultValue)
{
T enumValue = (Enum.IsDefined(typeof(T), value)) ? (T)Enum.Parse(typeof(T), value) : defaultValue;
return enumValue;
}
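Usage looks something like this. Note that the RepCategory enum here is just a stand-in I built from the category names used elsewhere in this article - the real one lives in ReputationLib and its ordinals may differ:

```csharp
using System;

// Stand-in enum for illustration only; the real RepCategory is in ReputationLib
enum RepCategory { Unknown, Author, Authority, Debator, Editor, Enquirer, Organiser, Participant, Total }

static class Globals
{
    // Reproduced from above so this sketch stands alone
    public static T IntToEnum<T>(int value, T defaultValue)
    {
        return (Enum.IsDefined(typeof(T), value)) ? (T)(object)value : defaultValue;
    }

    public static T StringToEnum<T>(string value, T defaultValue)
    {
        return (Enum.IsDefined(typeof(T), value)) ? (T)Enum.Parse(typeof(T), value) : defaultValue;
    }
}

class ToEnumDemo
{
    static void Main()
    {
        // valid inputs map straight through...
        Console.WriteLine(Globals.StringToEnum("Author", RepCategory.Unknown)); // Author
        Console.WriteLine(Globals.IntToEnum(3, RepCategory.Unknown));           // Debator

        // ...while garbage from a questionable data source falls back to the default
        Console.WriteLine(Globals.StringToEnum("Bogus", RepCategory.Unknown));  // Unknown
        Console.WriteLine(Globals.IntToEnum(999, RepCategory.Unknown));         // Unknown
    }
}
```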
Regression Methods For Calculating Trend Lines
A year or so ago, I was in search of an algorithm that would help me display a trend line on a line/column chart (it was for a Silverlight application), and I happened to stumble on this code. It did exactly what I wanted, but according to the boss, it was not generating a correct trend line. His definition of "correct" was averaging all of the data point values, starting at zero, and creating new points for the trend line by simply multiplying the average value by the data point position. No amount of discussion could change his mind, but I kept this code in case I ever needed it (or had the opportunity to do it right). Well, Mr. Opportunity is knocking on the door, and I finally get to use this code. I've forgotten where I found it, but I wanted to let you know this is not code that I wrote. You should also assume that I have no idea what this code is doing and I don't really care.
The first overload is the original code that I found, and includes a lot of stuff that I don't need. It's merely presented as my lame attempt at preserving history. The returned RegressionProcessInfo object exists in the code, but I'm not going to bother describing it here, if for no other reason than it simply isn't applicable since I don't actually use this overload.
public static RegressionProcessInfo Regress(double[] yAxis)
{
double sigmax = 0.0;
double sigmay = 0.0;
double sigmaxx = 0.0;
double sigmayy = 0.0;
double sigmaxy = 0.0;
double x;
double y;
double n = 0;
RegressionProcessInfo ret = new RegressionProcessInfo();
for (int i = 0; i < yAxis.Length; i++)
{
x = i;
y = yAxis[i];
if (x > ret.XRangeH) ret.XRangeH = x;
if (x < ret.XRangeL) ret.XRangeL = x;
if (y > ret.YRangeH) ret.YRangeH = y;
if (y < ret.YRangeL) ret.YRangeL = y;
sigmax += x;
sigmaxx += x * x;
sigmay += y;
sigmayy += y * y;
sigmaxy += x * y;
n++;
}
ret.b = sigmaxy / sigmaxx;
ret.a = (sigmay - ret.b * sigmax) / n;
ret.SampleSize = (int) n;
for (int i = 0; i < yAxis.Length; i++)
{
x = i;
y = yAxis[i];
double yprime = ret.a + ret.b * x;
double Residual = y - yprime;
ret.SigmaError += Residual * Residual;
}
ret.XMean = sigmax / n;
ret.YMean = sigmay / n;
ret.XStdDev = Math.Sqrt(((double)n * sigmaxx - sigmax * sigmax) /((double)n * (double)n - 1.0));
ret.YStdDev = Math.Sqrt(((double)n * sigmayy - sigmay * sigmay) /((double)n * (double)n - 1.0));
ret.StandardError = Math.Sqrt(ret.SigmaError / ret.SampleSize);
double ssx = sigmaxx-((sigmax*sigmax)/n);
double ssy = sigmayy-((sigmay*sigmay)/n);
double ssxy = sigmaxy-((sigmax*sigmay)/n);
ret.PearsonsR = ssxy / Math.Sqrt( ssx * ssy);
ret.t = ret.PearsonsR / Math.Sqrt( (1-(ret.PearsonsR * ret.PearsonsR))/(n-2));
return ret;
}
Here is the version I use in this application. This method has the added benefit of being a bit faster, since it's not performing all the calculations that the previous overload is performing.
public static void Regress(double[] yAxis, ref double a, ref double b)
{
double sigmax = 0.0;
double sigmay = 0.0;
double sigmaxx = 0.0;
double sigmayy = 0.0;
double sigmaxy = 0.0;
double n = 0;
double x;
double y;
for (int i = 0; i < yAxis.Length; i++)
{
x = i;
y = yAxis[i];
sigmax += x;
sigmaxx += x * x;
sigmay += y;
sigmayy += y * y;
sigmaxy += x * y;
n++;
}
b = sigmaxy / sigmaxx;
a = (sigmay - b * sigmax) / n;
}
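As a quick sanity check of the overload above, here it is run against points that lie exactly on y = 3x (the unused sigmayy accumulator is dropped, otherwise the math is verbatim). One thing worth noting: the slope here is computed as sigmaxy / sigmaxx rather than the textbook least-squares slope (which subtracts the means out first), so the two only agree when the fitted line passes through the origin - which is exactly the case for this check:

```csharp
using System;

class RegressDemo
{
    // The trimmed-down Regress overload from above (minus the unused sigmayy)
    public static void Regress(double[] yAxis, ref double a, ref double b)
    {
        double sigmax = 0.0, sigmay = 0.0, sigmaxx = 0.0, sigmaxy = 0.0;
        double n = 0;
        for (int i = 0; i < yAxis.Length; i++)
        {
            double x = i;
            double y = yAxis[i];
            sigmax += x;
            sigmaxx += x * x;
            sigmay += y;
            sigmaxy += x * y;
            n++;
        }
        b = sigmaxy / sigmaxx;
        a = (sigmay - b * sigmax) / n;
    }

    static void Main()
    {
        // points on y = 3x at x = 0..4
        double[] y = { 0, 3, 6, 9, 12 };
        double a = 0, b = 0;
        Regress(y, ref a, ref b);
        Console.WriteLine("a = {0}, b = {1}", a, b); // a = 0, b = 3
    }
}
```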
ExternalProcess Class
The WinForms app needs to be able to run a process to manage the Windows service installation, so I built this class to abstract the nitty-gritty out of the UI code. It's a static class whose entire purpose in life is to act as a wait-for-exit process launcher. I wrote this code a couple of years ago, and it's admittedly not written specifically for this application, but with a few tweaks, it fits the bill nicely. There are only a few methods in the class.
public static bool Run(string filePath, string arguments, bool asAdmin)
Because we have to run INSTALLUTIL as admin in order for it to work, this class has to be ready to use ShellExecute, as well as make appropriate settings changes before actually spawning the new process.
public static bool Run(string filePath, string arguments, bool asAdmin = false)
{
bool success = false;
Initialize();
Process.StartInfo.FileName = filePath;
Process.StartInfo.Arguments = arguments;
Process.StartInfo.RedirectStandardError = CaptureStdErr;
Process.StartInfo.RedirectStandardOutput = CaptureStdOut;
if (asAdmin)
{
Process.StartInfo.Verb = "runas";
Process.StartInfo.UseShellExecute = true;
Process.StartInfo.RedirectStandardError = false;
Process.StartInfo.RedirectStandardOutput = false;
}
if (Process.StartInfo.RedirectStandardOutput)
{
Process.OutputDataReceived += new DataReceivedEventHandler(StdOutputHandler);
}
if (Process.StartInfo.RedirectStandardError)
{
Process.ErrorDataReceived += new DataReceivedEventHandler(StdErrorHandler);
}
if (SetWorkingDir)
{
Process.StartInfo.WorkingDirectory = WorkingDir;
}
StartTime = DateTime.Now;
try
{
Process.Start();
if (Process.StartInfo.RedirectStandardError)
{
Process.BeginErrorReadLine();
}
if (Process.StartInfo.RedirectStandardOutput)
{
Process.BeginOutputReadLine();
}
success = true;
}
catch (Exception e)
{
m_lastError = e.Message;
}
finally
{
m_process.Close();
}
if (m_startTime != null && m_stopTime != null)
{
m_executionTime = m_stopTime - m_startTime;
}
return success;
}
public static void Initialize()
This method initializes our timekeepers and stdXXX properties to their default values, and is called from the constructor and the Run method.
public static void Initialize()
{
StdOut.Clear();
StdErr.Clear();
ExecutionTime = ExecutionTime.Subtract(ExecutionTime);
StartTime = StartTime.AddTicks(-(StartTime.Ticks));
StopTime = StopTime.AddTicks(-(StopTime.Ticks));
}
Remaining Methods
The remaining methods are simply handlers for stdOut and stdErr events, and another that allows you to kill the running process.
private static void StdOutputHandler(object process, DataReceivedEventArgs received)
{
if (!String.IsNullOrEmpty(received.Data))
{
m_stdOutput.Append(received.Data);
}
}
private static void StdErrorHandler(object process, DataReceivedEventArgs received)
{
if (!String.IsNullOrEmpty(received.Data))
{
m_stdError.Append(received.Data);
}
}
public static void KillProcess()
{
if (m_process != null)
{
m_process.Kill();
}
}
Excluded Classes
I added some other classes to the common library just in case I (or you) needed them, but since I haven't actually had a requirement to use them, I excluded them from the (compiled) project. I figured I'd leave 'em in the project in case anyone else can make use of them. Feel free to browse the files to check them out:
- ExtendedListCollection
- ExtendedObservableCollection
- ExtendedXElement
- WinAPI
Storing the Scraped Data
I chose SQL Server primarily because it's relatively performant, and pretty much every developer on CodeProject probably has SQL Server Express installed on their primary (and/or development) machine. If you don't, you're on your own regarding the installation of that software, or the modification of this software to support your database of choice. If you do add an interface to your chosen database, feel free to write an article about it, and you can even mention it here. When time permits, I'll modify this article to provide a link to your CodeProject article (and ONLY a CodeProject article).
The download ZIP file contains a SQL script you can run that adds this app's database to your chosen server instance. I'm assuming you know how to do this, or at least have sufficient google-foo to find out how to do this.
Generally speaking, the data is stored by reputation category, so each category for each scrape generates one row in the database. Successive scrapes on the same day simply update the previously stored data for that day. For example, let's say the Windows service has already scraped the data for today (6 Aug 2011). If the service is scheduled to scrape more than once per day, the next scrape will UPDATE the existing records for today. This guarantees that only one set of data will exist in the database for each day.
Stored Procedures
Since no joining needs to be performed (hell - there's only one table), the stored procedures are very simple in their functionality. The one that has the most going on is the AddOrUpdate procedure, so we'll cover that a little here.
Since we only keep one record per day, the very first thing we have to do is see if any records exist for the current date, userID, and category. The reason category is in that list is because a category won't be represented in the database until there's a value retrieved from your profile web page, and since the scraping happens every X hours per day, it's conceivable that you could gain points in a category that was not found earlier in the day.
CREATE PROCEDURE [dbo].[AddOrUpdate]
(
@user int,
@date datetime,
@category int,
@points int
)
AS
BEGIN
SET NOCOUNT ON;
declare @exists int;
set @exists = (SELECT count(*)
FROM Reputations
WHERE ScrapeDate = @date
AND Category = @category
AND CPUserID = @user);
Now that we know how many records already exist, we can perform the appropriate transaction. If @exists is 0, that means we have to do an insert. For those of you not already wise in the ways of "best practice" regarding SQL Server queries, here's a little tip for you - when you insert, ALWAYS name your columns like you see below (the items in the parentheses immediately following the words INSERT INTO Reputations). Not doing so will still work, until you add another column to the table and forget that you have to come back to this stored procedure and modify it accordingly. In any case, if @exists is greater than 0 (it should in fact be either 0 or 1), we want to update.
IF (@exists = 0)
BEGIN
INSERT INTO Reputations (Category, PointValue, ScrapeDate, CPUserID)
VALUES(@category, @points, @date, @user);
END
ELSE
BEGIN
UPDATE Reputations
SET PointValue = @points
WHERE Category = @category
AND ScrapeDate = @date
AND CPUserID = @user;
END;
END
End of Part 2
Now that you're all caught up on the supporting drudgery that eventually latches onto pretty much every piece of useful software you've ever seen or used, it's time to move on to the Windows Form app.
Reputationator - Part 1 of 4 [^]
Reputationator - Part 2 of 4 (this article)
Reputationator - Part 3 of 4 [^]
Reputationator - Part 4 of 4 [^]
Major Revision 1
CodeProject Caching
A couple of days ago, I noticed that the total category displayed on the daily change (column) chart seemed a bit - well - out of whack. It was almost twice what it should have been. Upon investigation, I found that it was in fact pulling the data on the page and saving it, and calculating the changed amount correctly, but it was simply working from the wrong value. I had been bitten by the CodeProject caching issue, where the numbers don't necessarily add up (even manually doing the math on the numbers scraped from the page showed a variation between the mathematical total and the one displayed on the web page). What to do?
My solution was to change the application to accept an optional command line parameter that causes the program to adjust existing Total RepItem objects and, if adjusted, update the database.
The changes were as follows:
- Reputationator.Program.cs - add an args parameter to the Main method, as well as a switch statement to handle the possible arguments.
- ReputationLib.Globals.cs - add a variable that's set by Main when the appropriate parameter has been specified.
- ReputationLib.RepCollections.cs - add code that looks for the Globals variable when loading data from the database and, if true, executes a new method that adjusts the existing total values to be mathematically correct versus what the web page reports, and then updates the database with the new values. Additionally, the method that updates the database was modified to head this problem off in future updates.
Retrieving Data Manually
The form has a button that allows you to scrape your data at any time. When performed, the chart should have been updated to reflect the new data. A Y-Axis position was in fact being added, but the data wasn't showing up. The workaround was to shut the app down, and start it back up again, allowing it to load the data from the database and thus properly represent the data in the chart. With this change, the workaround should no longer be necessary.
User-Reported Errors
The following problems were reported by Simon Bang Terkildsen, and have been addressed:
- When you enter another user ID, the user ID stored in the config file was still being used.
- Scraping the web page was confusing Author with Authority because of the similar spelling.
- Countries where the comma is not the thousands separator were causing parsing issues in the stated goal points field.
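That last fix boils down to letting the framework deal with the group separator instead of hard-coding a comma (as the Replace(",", "") calls shown earlier do). Here's a sketch of the culture-aware approach - the method name is mine, and the actual fix in the app may be written differently:

```csharp
using System;
using System.Globalization;

class ParsePointsDemo
{
    // Parses a points string using the given culture's group separator, so
    // "1.234" works in de-DE just as "1,234" does in en-US
    public static bool TryParsePoints(string text, CultureInfo culture, out int points)
    {
        return Int32.TryParse(text, NumberStyles.Integer | NumberStyles.AllowThousands,
                              culture, out points);
    }

    static void Main()
    {
        int pts;

        TryParsePoints("1,234", new CultureInfo("en-US"), out pts);
        Console.WriteLine(pts); // 1234

        TryParsePoints("1.234", new CultureInfo("de-DE"), out pts);
        Console.WriteLine(pts); // 1234
    }
}
```

Passing CultureInfo.CurrentCulture (or letting TryParse default to it) makes the same code work regardless of where the user lives.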
New Feature!
I added a new panel to the top section on the right side, labeled You Vs The Leader. It shows the approximate date at which you will overtake the current points leader. The longer you use Reputationator, the more accurate this date will become. For the first few days, you may notice fairly erratic differences, but as the average points earned by you and the leader evens out (which will happen over longer periods), this date will become more stable.
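I haven't shown the panel's code here (the UI side is covered in Part 3), but the projection is presumably simple linear extrapolation - find where your trend line crosses the leader's. Here's a hedged sketch of that idea; the names and exact math are mine, not necessarily what the app does:

```csharp
using System;

class OvertakeDemo
{
    // Projects the date you'd pass the leader, assuming both of you keep
    // earning at your current average daily rates. Returns null when you
    // aren't actually gaining on the leader (or are already ahead).
    public static DateTime? ProjectOvertakeDate(DateTime today,
                                                double yourPoints, double yourRate,
                                                double leaderPoints, double leaderRate)
    {
        double gainPerDay = yourRate - leaderRate;
        if (gainPerDay <= 0 || yourPoints >= leaderPoints)
        {
            return null;
        }
        double days = (leaderPoints - yourPoints) / gainPerDay;
        return today.AddDays(Math.Ceiling(days));
    }

    static void Main()
    {
        // 1,000 points behind, closing the gap at 40 points/day => 25 days out
        DateTime? when = ProjectOvertakeDate(new DateTime(2011, 8, 20),
                                             1000, 50, 2000, 10);
        Console.WriteLine(when.Value.ToString("yyyy-MM-dd")); // 2011-09-14
    }
}
```

This also illustrates why the panel is erratic early on: with only a few days of data, the average daily rates jump around, and the crossing date jumps with them.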
WPF!
I added a mostly-completed WPF version of the Reputationator app, along with an associated Part 4 of this article series.
History
- 20 Aug 2011 - Revision 1 (see section above)
- 14 Aug 2011 - Original article