
A Simple and Powerful Library to Deal with Web Robots Control Strategy

6 Mar 2014, MIT License
How to parse robots.txt and the robots meta tag

Introduction

In this tip, I'll present my library, WWW RobotRules (https://robotrules.codeplex.com/). It is a simple library that parses robots.txt files and the robots meta tag. The library fully respects RFC 1808 and RFC 1945.

Using the Code

Configuration

  • RobotRulesUseCache: Boolean, activates or deactivates cache support
  • RobotRulesCacheLibrary: type definition string of the cache implementation, optional if RobotRulesUseCache is False
  • RobotRulesCacheTimeout: cache timeout (for example, 00:01:00)
XML
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="RobotRulesUseCache" value="False"/>
    <add key="RobotRulesCacheLibrary" value="RobotRules.Cache.MemoryCache, RobotRules"/>
    <add key="RobotRulesCacheTimeout" value="00:01:00" />
  </appSettings>
</configuration>  
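
To make the meaning of the three keys concrete, here is a minimal sketch of how such settings could be read. It is not the library's own code: the `RobotRulesSettings` class and its `From` method are hypothetical names, and it assumes the timeout is written in the `hh:mm:ss` form shown above.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical holder for the three settings; not a type from the library.
public sealed class RobotRulesSettings
{
    public bool UseCache { get; private set; }
    public string CacheLibrary { get; private set; }
    public TimeSpan CacheTimeout { get; private set; }

    // Reads the keys from any NameValueCollection (appSettings is one).
    public static RobotRulesSettings From(NameValueCollection appSettings)
    {
        // A missing or malformed value leaves the cache switched off.
        bool useCache;
        bool.TryParse(appSettings["RobotRulesUseCache"], out useCache);

        // Assumption: the timeout uses the hh:mm:ss TimeSpan format.
        TimeSpan timeout;
        if (!TimeSpan.TryParse(appSettings["RobotRulesCacheTimeout"], out timeout))
        {
            timeout = TimeSpan.Zero;
        }

        return new RobotRulesSettings
        {
            UseCache = useCache,
            // The cache type only matters when the cache is enabled.
            CacheLibrary = useCache ? appSettings["RobotRulesCacheLibrary"] : null,
            CacheTimeout = timeout
        };
    }
}
```

In an application that references System.Configuration, the collection would typically be `ConfigurationManager.AppSettings`.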

Use the Library

First, define a new parser with your robot user agent:

C#
using RobotRules; 
 
private RobotsFileParser RobotRules = new RobotsFileParser()
{
    // User-agent string of your robot
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

Then, use it like this:

C#
RobotRules.Parse(new Uri("http://blablabla.com"));
if (RobotRules.IsAllowed("GoogleBot", new Uri ("http://blablabla.com"))) {
   // your code ...
}
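
For readers unfamiliar with what an "is allowed" check involves, the following is a simplified sketch of the classic robots.txt rules: pick the record whose User-agent matches the robot name (case-insensitive substring match), fall back to the `*` record otherwise, and reject any URL whose path starts with a Disallow prefix. It is an illustration only, not the library's implementation; `RobotsTxtSketch` and its methods are hypothetical names, and Allow lines and wildcards are ignored.

```csharp
using System;
using System.Collections.Generic;

public static class RobotsTxtSketch
{
    // Returns the Disallow prefixes that apply to robotName, using the
    // "*" record only when no record names the robot explicitly.
    public static List<string> DisallowedPrefixes(string robotsTxt, string robotName)
    {
        var specific = new List<string>();
        var wildcard = new List<string>();
        bool matchedSpecific = false;
        List<string> current = null;  // list the current record writes to
        bool inAgentLines = false;    // true while reading User-agent lines

        foreach (var raw in robotsTxt.Split('\n'))
        {
            string line = raw;
            int hash = line.IndexOf('#');            // strip comments
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();
            int colon = line.IndexOf(':');
            if (colon < 0) continue;                 // blank or malformed line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                if (!inAgentLines) current = null;   // a new record starts
                inAgentLines = true;
                if (value == "*")
                {
                    if (current == null) current = wildcard;
                }
                else if (value.Length > 0 &&
                         robotName.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    current = specific;
                    matchedSpecific = true;
                }
            }
            else
            {
                inAgentLines = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field == "disallow" && current != null && value.Length > 0)
                    current.Add(value);
            }
        }
        return matchedSpecific ? specific : wildcard;
    }

    public static bool IsAllowed(string robotsTxt, string robotName, Uri url)
    {
        foreach (var prefix in DisallowedPrefixes(robotsTxt, robotName))
        {
            if (url.AbsolutePath.StartsWith(prefix, StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

Because the name match is case-insensitive in this sketch, "GoogleBot" and "Googlebot" select the same record.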

This works for robots.txt, but what if the robot control rules are embedded in the HTML itself, as a robots meta tag?

Sample

HTML
<!DOCTYPE html>
 
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Test</title>
    <meta name="robots" content="nofollow"/>
</head>
<body>
 
</body>
</html>

The library handles this case as well. Pass the page's HTML content to the parser:

C#
RobotsFileParser RobotRules = new RobotsFileParser()
{
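    // Keep the verbatim string on one line so no line break ends up in the user agent
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

// Second argument: the HTML source of the page to inspect
RobotControlStrategy strategy = RobotRules.CheckRobotControlStrategy("Googlebot", "HTML CONTENT");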

if (strategy.CanFollow)
{
    // your code
}
if (strategy.CanIndex)
{
    // your code
}
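
As an illustration of what the two flags mean, here is a small sketch that maps the `content` value of a robots meta tag to index/follow permissions using the commonly documented directives (`noindex`, `nofollow`, `none`). It is not the library's `RobotControlStrategy`; `MetaRobotsFlags` is a hypothetical stand-in, and extracting the attribute from the HTML is left out.

```csharp
using System;

// Hypothetical stand-in for the two flags; not a type from the library.
public sealed class MetaRobotsFlags
{
    // With no directive present, both indexing and following are permitted.
    public bool CanIndex = true;
    public bool CanFollow = true;

    // content is the value of the meta tag's content attribute,
    // e.g. "nofollow" or "noindex, nofollow".
    public static MetaRobotsFlags Parse(string content)
    {
        var flags = new MetaRobotsFlags();
        foreach (var token in (content ?? "").Split(','))
        {
            switch (token.Trim().ToLowerInvariant())
            {
                case "noindex":
                    flags.CanIndex = false;
                    break;
                case "nofollow":
                    flags.CanFollow = false;
                    break;
                case "none": // shorthand for "noindex, nofollow"
                    flags.CanIndex = false;
                    flags.CanFollow = false;
                    break;
                // "index", "follow", "all" and unknown tokens change nothing
            }
        }
        return flags;
    }
}
```

For the sample page above, `MetaRobotsFlags.Parse("nofollow")` leaves `CanIndex` true and sets `CanFollow` to false.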

Points of Interest

  • Use MEF to load the cache plugin instead of reflection

History

  • V1 : 03/06/2014
  • V1.5.2.4
    • ICache now inherits from IDisposable
    • Fix cache initialization
    • RobotsFileParser is disposable
    • RobotsFileParser exposes the method ClearCache()
    • Add new configuration key RobotRulesCacheTimeout to specify cache timeout

License

This article, along with any associated source code and files, is licensed under The MIT License