Introduction
When you designate a page to handle HTTP 404 (Not Found) errors on a website hosted by Microsoft Internet Information Services (IIS), that page doesn't have to return an HTTP 404 error at all. Instead, it can return dynamic content with an HTTP 200 (OK) result. This is helpful when, for example, you want to build a sitemap.xml file to improve how search engines index your website. In this article, I'll show you how I did this for my own blog.
Backgrounder
There are two ways that a 404 error handler page can be invoked when using IIS with ASP.NET. For the page types registered for ASP.NET (e.g. ASPX, ASMX), the <customErrors> element in the <system.web> section of your web.config file determines which page is invoked when different kinds of errors occur. For 404 errors, ASP.NET switches to the handler page by using an HTTP 302 (Found) redirect. This is unacceptable when you want a clean, transparent transfer to the handler page without any knowledge on the part of the client. However, when IIS handles a 404 error instead of ASP.NET, it does something akin to a Server.Transfer() call under the hood, meaning that the client is not redirected. This is exactly what we need to implement our dynamically generated sitemap.xml file. Since XML files are not handled by the ASP.NET engine, IIS will transfer to an ASP.NET page of our choice, where we can do whatever we like.
Google's Use of Site Maps
Site maps used by Google and other search engines depend on a simple XML schema published at sitemaps.org. If you're like me, the best way to understand such a simple schema is to look at a real, live site map. Load the live sitemap.xml file for my own blog into a new web browser window to see an example. It's very easy to understand, don't you think? Site maps are a good complement to the robots.txt file on your site because they let you specify what should be indexed by the search engine, rather than what should not be. Use the Google Webmaster Tools to register your site map when it's ready.
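For readers who can't load a live example, here is a minimal sketch of the document shape that the sitemaps.org 0.9 schema expects, built with XmlDocument. The class name SitemapSample, its method names, and the example.com URL are placeholders chosen for illustration; they are not taken from my site. Only <loc> is required for each <url> entry; the other three child elements are optional.

```csharp
using System;
using System.Xml;

public static class SitemapSample
{
    // Namespace URI that the sitemaps.org 0.9 protocol requires on <urlset>.
    private const string SitemapNs = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Builds a site map with a single <url> entry and returns it as a string.
    public static string Build()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement urlset = doc.CreateElement("urlset", SitemapNs);
        doc.AppendChild(urlset);

        XmlElement url = doc.CreateElement("url", SitemapNs);
        urlset.AppendChild(url);

        // Placeholder values; a real site map lists your own pages.
        AppendTextElement(doc, url, "loc", "http://www.example.com/MyBlog.aspx");
        AppendTextElement(doc, url, "lastmod", "2007-11-24");
        AppendTextElement(doc, url, "changefreq", "daily");
        AppendTextElement(doc, url, "priority", "1.0");

        return doc.OuterXml;
    }

    // Creates <name>text</name> in the sitemap namespace under the given parent.
    private static void AppendTextElement(XmlDocument doc, XmlElement parent,
        string name, string text)
    {
        XmlElement child = doc.CreateElement(name, SitemapNs);
        child.InnerText = text;
        parent.AppendChild(child);
    }
}
```

Calling SitemapSample.Build() returns one <urlset> root that wraps a single <url> element with its four children, which is the same nesting that the handler page in this article produces for each page.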
The 404 Handler Page Code
Of course, the key to being able to dynamically generate a sitemap.xml file using a 404 error handler page is that the sitemap.xml file must not exist, physically, on your site. Start by creating a new ASPX page that will do the work instead. Remember that this page is probably going to do double duty by generating your sitemap.xml file and by handling real "not found" problems. So, it should be styled in a way that matches your site design.
At my gotnet.biz website, I store a reference to the pages that I want Google to index in a database. To build a dynamic site map, all I need to do is add each of those pages as a <url> node, according to the sitemap.xml specification. Below is a helper method called AddUrlNodeToUrlSet, which does just that. It is shown here as part one of a two-part partial class:
using System;
using System.Globalization;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    // Namespace of the sitemaps.org 0.9 schema; every element must belong to it.
    private static readonly string xmlns =
        "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Values for <changefreq>; the member names are written out verbatim.
    private enum freq { hourly, daily, weekly, monthly, yearly, never }

    // Convenience overload that omits <priority>.
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq )
    {
        AddUrlNodeToUrlSet( reqUrl, doc, loc, lastmod, changefreq, null );
    }

    // Appends one <url> node, with its optional children, to the root <urlset>.
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq, float? priority )
    {
        if (reqUrl == null || doc == null || loc == null)
            return;

        XmlNode urlNode = doc.CreateElement( "url", xmlns );

        // Build an absolute URL from the request's scheme and host plus the
        // application-relative page name. No manual entity escaping is needed:
        // XmlDocument escapes &, < and > in InnerText when it serializes.
        string url = String.Format( "{0}://{1}{2}", reqUrl.Scheme,
            reqUrl.Authority, VirtualPathUtility.ToAbsolute(
            String.Format( "~/{0}", loc ) ) );

        XmlNode newNode = doc.CreateElement( "loc", xmlns );
        newNode.InnerText = url;
        urlNode.AppendChild( newNode );

        if (lastmod != null)
        {
            newNode = doc.CreateElement( "lastmod", xmlns );
            newNode.InnerText = lastmod.Value.ToString( "yyyy-MM-dd" );
            urlNode.AppendChild( newNode );
        }

        if (changefreq != null)
        {
            newNode = doc.CreateElement( "changefreq", xmlns );
            newNode.InnerText = changefreq.Value.ToString();
            urlNode.AppendChild( newNode );
        }

        if (priority != null)
        {
            // Out-of-range values fall back to 0.5. The invariant culture keeps
            // the decimal separator a period on every server locale.
            newNode = doc.CreateElement( "priority", xmlns );
            newNode.InnerText =
                (priority.Value < 0.0f || priority.Value > 1.0f)
                ? "0.5"
                : priority.Value.ToString( "0.0", CultureInfo.InvariantCulture );
            urlNode.AppendChild( newNode );
        }

        doc.DocumentElement.AppendChild( urlNode );
    }
}
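One detail worth checking when building <loc> values: XmlDocument escapes markup characters in InnerText itself when the document is serialized, so a URL that was already entity-escaped by hand ends up escaped twice. The sketch below shows this in isolation. The class name EscapingDemo and the example.com URLs are made up for illustration.

```csharp
using System;
using System.Xml;

public static class EscapingDemo
{
    // Returns the serialized form of a <loc> element holding the given text.
    public static string SerializeLoc(string rawUrl)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement loc = doc.CreateElement("loc");

        // InnerText takes raw text; escaping happens when OuterXml is read.
        loc.InnerText = rawUrl;
        return loc.OuterXml;
    }
}
```

Passing "http://www.example.com/Page.aspx?a=1&b=2" yields a <loc> element whose text contains a=1&amp;b=2, which is correct. Passing a string that already contains &amp; yields &amp;amp; in the output, which a search engine would read as a different URL.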
The AddUrlNodeToUrlSet method defined above is called during the Page_Load event to construct the sitemap.xml file. It simply adds one <url> node for each page on my site that I want to reference in the site map file. Please keep in mind that for my blog, I generate my site map from a list of page names stored in a database table. So, in this next section of code where I open a database and parse the results, the code that finds the searchable pages on your site might be very different. Now let's look at the Page_Load method in part two of this page:
using System;
using System.Data.OleDb;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    protected void Page_Load( object sender, EventArgs e )
    {
        // IIS appends the URL of the missing file to the query string.
        string QS = Request.ServerVariables["QUERY_STRING"];
        if (QS != null && QS.EndsWith( "sitemap.xml" ))
        {
            // Start with an empty <urlset> in the sitemap namespace.
            XmlDocument doc = new XmlDocument();
            doc.LoadXml( String.Format( "<?xml version=\"1.0\" encoding" +
                "=\"UTF-8\"?><urlset xmlns=\"{0}\"></urlset>", xmlns ) );

            // The main blog page is listed first, with top priority.
            AddUrlNodeToUrlSet( Request.Url, doc, "MyBlog.aspx", null,
                freq.daily, 1.0f );

            string connectionString =
                "NOTE: set this to suit the needs of your content database";
            string query = "SELECT PAGE_NAME, POSTING_DATE FROM BLOGDB " +
                "ORDER BY POSTING_DATE";

            // Dispose the connection, command, and reader even if the query fails.
            using (OleDbConnection conn = new OleDbConnection( connectionString ))
            using (OleDbCommand cmd = new OleDbCommand( query, conn ))
            {
                conn.Open();
                using (OleDbDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        object page_name = rdr[0];
                        object posting_date = rdr[1];
                        if (page_name != null && !(page_name is DBNull))
                        {
                            // "as" yields null for a DBNull date instead of throwing.
                            AddUrlNodeToUrlSet( Request.Url, doc, String.Format(
                                "{0}.ashx", page_name.ToString().Trim() ),
                                posting_date as DateTime?, freq.monthly );
                        }
                    }
                }
            }

            Page.TraceEnabled = false;
            Response.Clear();
            Response.Status = "200 OK";
            Response.ContentType = "text/xml";
            Response.Write( doc.OuterXml );

            // Response.End() stops page execution, so the 404 status below
            // is never applied to the site map response.
            Response.End();
        }

        // Any other missing file is a genuine 404.
        Response.Status = "404 Not Found";
    }
}
When Page_Load starts, it checks QUERY_STRING to see if the sitemap.xml file was the missing one that caused the transfer to happen. This is possible because the transfer agent in IIS simply appends the name of the missing file to QUERY_STRING. If the name is sitemap.xml, my code starts a new XML document and adds the virtual <url> nodes using the AddUrlNodeToUrlSet method shown above. Which page names you include in your site map depends entirely on your site's content, so you'll have to make most of your adjustments to my sample in that area. At the end of Page_Load is some interesting code I want to highlight. There are five key things that happen at this point, in order:
- You must disable page tracing if it's turned on. If you don't, ASP.NET appends a <div> element to the end of the document, making your XML appear to have two root nodes, which invalidates it.
- You must clear the Response object in case some other code has already buffered content to be sent back to the browser. You want just the XML of the site map in the output, nothing else.
- You need to set the HTTP status code to 200 so that the client sees the result of its request as successful. Google bots don't like anything but success.
- You must set the MIME type of Response to text/xml because that's what the search engine bots expect for the document type you are returning.
- Finally, grab the OuterXml property of the XML document and Write() it back to the browser before ending the Response.
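Returning to the QUERY_STRING check at the top of Page_Load: a plain EndsWith test also matches names such as old-sitemap.xml. Below is a minimal sketch of a stricter check. It assumes that IIS passes the error page a query string of the form 404;http://host:80/path/file.ext, which is the form I would expect for URL-type custom errors, but you should verify it on your own server. The class name MissingFileParser and both method names are my own inventions.

```csharp
using System;

public static class MissingFileParser
{
    // Extracts the file name of the originally requested URL from the query
    // string passed to a custom error page. Assumes the form
    // "404;http://host:80/path/file.ext" and returns null for anything else.
    public static string GetMissingFileName(string queryString)
    {
        if (String.IsNullOrEmpty(queryString))
            return null;

        int separator = queryString.IndexOf(';');
        if (separator < 0 || separator == queryString.Length - 1)
            return null;

        Uri original;
        if (!Uri.TryCreate(queryString.Substring(separator + 1),
                UriKind.Absolute, out original))
            return null;

        // AbsolutePath excludes any query string on the original request.
        string path = original.AbsolutePath;
        return path.Substring(path.LastIndexOf('/') + 1);
    }

    // True only when the missing file is named exactly sitemap.xml
    // (compared without regard to case).
    public static bool IsSitemapRequest(string queryString)
    {
        return String.Equals(GetMissingFileName(queryString), "sitemap.xml",
            StringComparison.OrdinalIgnoreCase);
    }
}
```

In Page_Load, the condition could then read MissingFileParser.IsSitemapRequest( QS ) instead of the EndsWith call. Whether the extra strictness is worth it depends on which file names can actually go missing on your site.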
Configuring IIS to Transfer to the Error Handler Page
To get the page defined above to handle HTTP 404 errors, it first has to be registered with IIS. You can register the same ASPX page to handle errors for both ASP.NET page types and non-ASP.NET file types. For file types handled by ASP.NET, you register it in web.config. Since the XML type is not handled by the ASP.NET engine, the <customErrors> setting in web.config does not cover it, and you need to tell IIS about the page directly by using the IIS Management Console. The Microsoft TechNet website has very good instructions on this topic. On my test site, I used the IIS Management Console to register the ErrorNotFound404.aspx page as the handler for 404 errors.
Conclusion
You can also register the same page as an error handler with ASP.NET via the web.config file, as discussed above. Just be aware that when the ASP.NET engine handles an error, it redirects the browser to the page you specify. So, if you're depending on a clean transfer to the error handler, you probably aren't going to get exactly what you want. For the sitemap.xml file, though, the approach shown above is very clean because of the way IIS handles missing files. Once you're done, use the Fiddler2 Web Debugging Proxy to request your sitemap.xml file and use the session inspector to see exactly what's happening on the wire. You'll see just how clean this code makes the would-be 404 error for that missing file appear to the search engine bots.
One Parting Thought
If you can generate a dynamic sitemap.xml file using this technique, you could probably use it to generate almost any kind of virtual file: robots.txt, RSS feeds, etc. This means that even more of your site could be dynamically generated from database content. Think about that. Enjoy!
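As a rough sketch of how that generalization might look, the lookup below maps a missing file name to a generator that returns a content type and a body. Everything here is hypothetical: the VirtualFiles class, the VirtualFile type, and the robots.txt content are placeholders for illustration, not code from my site.

```csharp
using System;
using System.Collections.Generic;

public static class VirtualFiles
{
    // Content type and body for one dynamically generated file.
    public sealed class VirtualFile
    {
        public readonly string ContentType;
        public readonly string Body;

        public VirtualFile(string contentType, string body)
        {
            ContentType = contentType;
            Body = body;
        }
    }

    // Maps a missing file name (case-insensitive) to the code that builds it.
    private static readonly Dictionary<string, Func<VirtualFile>> Generators =
        new Dictionary<string, Func<VirtualFile>>(StringComparer.OrdinalIgnoreCase)
        {
            { "robots.txt", BuildRobotsTxt }
        };

    // Returns true and the generated file when a generator is registered for
    // the name; otherwise returns false so the caller can send a real 404.
    public static bool TryGenerate(string fileName, out VirtualFile file)
    {
        file = null;
        Func<VirtualFile> generator;
        if (fileName == null || !Generators.TryGetValue(fileName, out generator))
            return false;

        file = generator();
        return true;
    }

    // Placeholder robots.txt that allows every crawler to index everything.
    private static VirtualFile BuildRobotsTxt()
    {
        return new VirtualFile("text/plain", "User-agent: *\nDisallow:\n");
    }
}
```

The error handler page would call TryGenerate with the missing file's name, write the body with a 200 status when it returns true, and fall through to the normal 404 response otherwise.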
History of This Article
- 24 Nov 2007 - Initial publication
- 28 Nov 2007 - Article edited and moved to the main CodeProject.com article base