Introduction
In this tip, I'll present my library, WWW RobotRules (https://robotrules.codeplex.com/). It is a simple library for parsing robots.txt files and the robots meta tag. The library follows RFC 1808 (Relative Uniform Resource Locators) and RFC 1945 (HTTP/1.0).
Using the Code
Configuration
- RobotRulesUseCache: Boolean, activates or deactivates cache support
- RobotRulesCacheLibrary: type definition string, optional if RobotRulesUseCache is False
- RobotRulesCacheTimeout: cache timeout
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <appSettings>
    <add key="RobotRulesUseCache" value="False"/>
    <add key="RobotRulesCacheLibrary" value="RobotRules.Cache.MemoryCache, RobotRules"/>
    <add key="RobotRulesCacheTimeout" value="00:01:00" />
  </appSettings>
</configuration>
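The source does not show how the library reads these keys. As a rough illustration of how such values are typically interpreted, here is a minimal sketch; RobotRulesSettings and its methods are hypothetical helpers, not part of the library, and the fallback behaviour is an assumption.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical helper, not part of WWW RobotRules: it only illustrates how
// the appSettings values shown above could be interpreted.
public static class RobotRulesSettings
{
    // Returns true only when the key is present and parses as "True".
    public static bool UseCache(NameValueCollection settings)
    {
        bool useCache;
        return bool.TryParse(settings["RobotRulesUseCache"], out useCache) && useCache;
    }

    // Parses an hh:mm:ss value such as "00:01:00"; returns the fallback
    // (an assumption of this sketch) when the key is missing or invalid.
    public static TimeSpan CacheTimeout(NameValueCollection settings, TimeSpan fallback)
    {
        TimeSpan timeout;
        return TimeSpan.TryParse(settings["RobotRulesCacheTimeout"], out timeout)
            ? timeout
            : fallback;
    }
}
```

In an application, the collection would usually be ConfigurationManager.AppSettings from System.Configuration, for example RobotRulesSettings.UseCache(ConfigurationManager.AppSettings).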
Use the Library
First, define a new parser with your robot user agent:
using RobotRules;

// The user agent string identifies your robot to the parser.
private RobotsFileParser RobotRules = new RobotsFileParser()
{
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};
Then, use it like this:
// Parse the robots.txt rules for the site.
RobotRules.Parse(new Uri("http://blablabla.com"));
if (RobotRules.IsAllowed("GoogleBot", new Uri("http://blablabla.com")))
{
    // The URL may be crawled.
}
This works for robots.txt, but what if the robot control rules are embedded in the HTML itself?
Sample
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>Test</title>
<meta name="robots" content="nofollow"/>
</head>
<body>
</body>
</html>
Don't worry, just use the library like this:
RobotsFileParser RobotRules = new RobotsFileParser()
{
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};
// Replace "HTML CONTENT" with the markup of the page to check.
RobotControlStrategy strategy = RobotRules.CheckRobotControlStrategy("Googlebot", "HTML CONTENT");
if (strategy.CanFollow)
{
    // The links on the page may be followed.
}
if (strategy.CanIndex)
{
    // The page may be indexed.
}
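To make the meaning of CanFollow and CanIndex concrete, here is a minimal, standalone sketch of how a robots meta tag can be interpreted. It is not the library's implementation: MetaRobotsSketch and Evaluate are hypothetical names, the regular expression assumes the name attribute comes before content, and only the generic "robots" name is handled (not agent-specific names).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch, not the WWW RobotRules implementation.
public static class MetaRobotsSketch
{
    // Matches <meta name="robots" content="..."> with name before content
    // (a simplifying assumption; real pages may order attributes differently).
    private static readonly Regex MetaRobots = new Regex(
        @"<meta\s+name\s*=\s*[""']robots[""']\s+content\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Both flags default to true and are cleared by the standard directives
    // "noindex", "nofollow", and "none" (which means both).
    public static void Evaluate(string html, out bool canIndex, out bool canFollow)
    {
        canIndex = true;
        canFollow = true;
        foreach (Match match in MetaRobots.Matches(html))
        {
            // The content attribute holds a comma-separated directive list.
            foreach (string raw in match.Groups[1].Value.Split(','))
            {
                string directive = raw.Trim().ToLowerInvariant();
                if (directive == "noindex" || directive == "none") canIndex = false;
                if (directive == "nofollow" || directive == "none") canFollow = false;
            }
        }
    }
}
```

For the sample page above, whose tag is content="nofollow", this sketch leaves canIndex set to true and sets canFollow to false.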
Points of Interest
- Use MEF to load the cache plugin instead of reflection
History
- V1 : 03/06/2014
- V1.5.2.4
  - ICache now inherits from IDisposable
  - Fix cache initialization
  - RobotsFileParser is disposable
  - RobotsFileParser exposes the method ClearCache()
  - Add new configuration key RobotRulesCacheTimeout to specify cache timeout