Figure 1 : CP Vanity Lite - the main dialog
Figure 2 : The exported CSV file in Excel
Introduction
This is a light version of Luc Pattyn's ever-popular CP Vanity
application. Luc grew tired of frequently having to modify the parsing code after
unavoidable site changes, and he has not updated his application since the last
major site update (a few weeks ago). That's what led me to write this app, which
is a miniature version of his. Unlike CP Vanity, CP Vanity
Lite does not really go into the details; instead, it only fetches the
total reputation for members who are most likely to have high reputation
scores. I use the same approach Luc took, namely to get the top members based on
post count and the top members based on article count, and then fetch the
reputation score for each of those members. I fetch the top 125 in each category,
but because some members appear in both groups, you'll find that the total
number of members for whom rep scores are fetched is less than 250. In addition
to the total score, the app also shows a daily score average, calculated from
the total score and the number of days since the member signed up.
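To make the daily-average calculation concrete, here's a minimal sketch (the member values are made up; the real code simply divides the fetched total score by the days elapsed since the sign-up date and rounds to two decimals):
using System;

class DailyAverageExample
{
    static void Main()
    {
        // Hypothetical member: 50,000 total reputation, signed up on 1 Jan 2008.
        double totalScore = 50000;
        DateTime memberSince = new DateTime(2008, 1, 1);

        // Daily average = total score / days since signing up, rounded to 2 decimals.
        double dailyAverage = Math.Round(
            totalScore / (DateTime.Now - memberSince).TotalDays, 2);

        Console.WriteLine(dailyAverage);
    }
}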
A word of warning: the application fetches
and parses HTML from hardcoded URLs. If the URLs or the HTML content change, the
parsing code will break. Until CP offers a web service that returns this
information, the premise on which this application is written will remain shaky.
The same is true of similar applications written by me and by folks like Luc Pattyn.
Using CP Vanity Lite
There's not much to using the app: you just run it and wait for the UI to
populate. The title bar will show fetching while the HTTP fetches are performed,
and once all the data is available, it'll say completed. The UI is a simple
ListView that's sortable on the Reputation Score column. Every time you click
the header, the sort toggles between ascending and descending, which is quite
handy if you get tired of seeing CG's name right there on top! The data is
populated asynchronously via a worker thread, so the app stays responsive
throughout the fetch period. You can sort the data even before all of it has
been fetched, but keep in mind that newly added rows will not be sorted,
although you can re-sort as many times as you want. Once all the data is
fetched, the Export as CSV button is enabled, and you can save a CSV file that
you can then open in an app like Excel to do all kinds of fancy data
manipulation and charts. Take a look at Figure 2 and you'll see how far ahead
of the competition CG is; he's got more than John and Pete combined! If you run
the app periodically and save the CSV files, you set yourself up nicely for
some date-based analysis to determine changes in scores and how fast someone's
catching up (if you are the statistically curious type).
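The exported format is plain comma-separated text; in case you want to roll something similar yourself, here is a rough sketch of how such an export could be written (the WriteCsv helper and its column names are illustrative only, not the app's actual export code):
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Illustrative sketch only - not CP Vanity Lite's actual export code.
static class CsvExportSketch
{
    public static void WriteCsv(string path,
        IEnumerable<Tuple<string, int, double>> rows)
    {
        var sb = new StringBuilder();
        sb.AppendLine("DisplayName,ReputationScore,DailyAverage");

        foreach (var row in rows)
        {
            // Quote the display name in case it contains a comma,
            // and escape any embedded quotes.
            sb.AppendLine(String.Format("\"{0}\",{1},{2}",
                row.Item1.Replace("\"", "\"\""), row.Item2, row.Item3));
        }

        File.WriteAllText(path, sb.ToString());
    }
}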
The code
The HTML scraping is done in a single class called RepScoreScraper.
The code will block, so it's up to the calling code to use it from a worker thread
or via some async/task pattern that won't block the main thread. The IDs of
members likely to have top scores are fetched by querying the Who's Who pages
for top message posters and for top article authors, and the information is
extracted using regular expressions. (I've wrapped the code blocks to prevent
horizontal scrolling; the actual source code does not wrap this crazily.)
// Captures the member's ID from text like "Member No. 123456".
private Regex memberNumberRegex = new Regex("Member No. (\\d*)");
// Captures the total reputation score from the TotalRep span.
private Regex repScoreRegex = new Regex(
    "<span id=\"ctl00_.*?_TotalRep\" class=\"large-text\">([\\s\\S]*?)</span>");
// Captures the member's display name from the P_Name heading.
private Regex displayNameRegex = new Regex(
    "<h2 id=\"ctl00_.*?_P_Name\">([\\s\\S]*?)</h2>");
// Captures the sign-up date from text like "Member since ...".
private Regex memberSinceRegex = new Regex("Member since (.+)\n");
Note that I have snipped the URLs in the following code snippet to prevent
horizontal scrolling.
public void StartScraping()
{
    // The two "Who's Who" orderings we query: top article authors and
    // top message posters. The HashSet de-duplicates members that show
    // up in both lists.
    string[] ml_obs = new[] { "ArticleCount", "MessageCount" };
    HashSet<int> ids = new HashSet<int>();

    for (int j = 0; j < ml_obs.Length; j++)
    {
        // Five pages per category covers the top 125 members in each.
        for (int page = 1; page <= 5; page++)
        {
            string url = String.Format(
                "**SNIPPED**?ml_ob={0}&mgtid=-1&mgm=False&pgnum={1}",
                ml_obs[j], page);
            string html = GetHttpPage(url, timeOut);

            var memberNumberMatches = memberNumberRegex.Matches(html);
            var repScoreMatches = repScoreRegex.Matches(html);
            var displayNameMatches = displayNameRegex.Matches(html);
            var memberSinceMatches = memberSinceRegex.Matches(html);

            // Only proceed if every regex found the same number of entries
            // on this page; otherwise the page layout has likely changed.
            if (memberNumberMatches.Count == repScoreMatches.Count
                && memberNumberMatches.Count == displayNameMatches.Count
                && memberNumberMatches.Count == memberSinceMatches.Count)
            {
                for (int i = 0; i < memberNumberMatches.Count; i++)
                {
                    int id = -1;
                    double score = -1.0;
                    double scorePerDay = 0.0;
                    DateTime memberSince = ParseDateTime(memberSinceMatches[i].Value);

                    if (memberNumberMatches[i].Groups.Count == 2 &&
                        Int32.TryParse(memberNumberMatches[i].Groups[1].Value, out id) &&
                        repScoreMatches[i].Groups.Count == 2 &&
                        Double.TryParse(repScoreMatches[i].Groups[1].Value, out score) &&
                        displayNameMatches[i].Groups.Count == 2)
                    {
                        if (!ids.Contains(id))
                        {
                            ids.Add(id);

                            // Daily average = total score / days since sign-up.
                            if (memberSince != DateTime.MinValue)
                            {
                                scorePerDay = Math.Round(
                                    score / (DateTime.Now - memberSince).TotalDays, 2);
                            }

                            // Notify listeners about each newly scraped member.
                            var handler = MemberInfoScraped;
                            if (handler != null)
                            {
                                handler(this, new RepScoreScraperEventArgs()
                                {
                                    Id = id,
                                    DisplayName = StripOffHtml(
                                        displayNameMatches[i].Groups[1].Value.Trim()),
                                    ReputationScore = (int)score,
                                    DailyAverage = scorePerDay
                                });
                            }
                        }
                    }
                }
            }
        }
    }

    // All pages processed - signal completion.
    var finishedHandler = ScrapeFinished;
    if (finishedHandler != null)
    {
        finishedHandler(this, EventArgs.Empty);
    }
}
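For completeness, here is a minimal sketch of how calling code might drive the scraper from a worker thread. This is not the app's actual UI code: the event signatures are assumed to follow the usual EventHandler pattern (which is how they are raised above), and AddRowToListView is just a hypothetical stand-in for whatever the UI does with each result.
// A minimal, illustrative sketch - not the app's actual UI code.
private void BeginScrape()
{
    var scraper = new RepScoreScraper();

    scraper.MemberInfoScraped += (sender, e) =>
    {
        // In a WinForms UI this would marshal back to the UI thread
        // (e.g. via Invoke) before touching the ListView.
        // AddRowToListView is a hypothetical helper.
        AddRowToListView(e.Id, e.DisplayName, e.ReputationScore, e.DailyAverage);
    };

    scraper.ScrapeFinished += (sender, e) =>
    {
        // Update the title bar and enable the Export as CSV button here.
    };

    // StartScraping blocks, so push it onto a thread-pool thread
    // to keep the UI responsive.
    System.Threading.ThreadPool.QueueUserWorkItem(state => scraper.StartScraping());
}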
That's it. If you run into any problems, please report them in the forum below
and I'll do my best to fix them. Thank you.
Acknowledgements
- RaviRanjankr - For beta testing the application and providing useful
feedback that led me to improve and simplify the fetch code.
History
- November 17, 2010: First published
- November 19, 2010:
  - Added rank and daily average columns.
  - You can now refresh the scores.