
Generate a Google Site Map Using the HTTP 404 Handler

24 Nov 2007
Site maps make your websites search engine friendly. Learn how to generate them dynamically using your site's HTTP 404 error handler page.

Introduction

When you designate a page to handle HTTP 404 (Not Found) errors on a website hosted by Microsoft Internet Information Services (IIS), that page doesn't have to return an HTTP 404 error at all. Instead, it can return dynamic content with an HTTP 200 (OK) result. This is helpful when, for example, you want to build a sitemap.xml file to improve how search engines index your website. In this article, I'll show you how I did this for my own blog.

Backgrounder

There are two ways that a 404 error handler page can be invoked when using IIS with ASP.NET. For the page types registered for ASP.NET (e.g. ASPX, ASMX), the <customErrors> element in the <system.web> section of your web.config file determines which page is invoked when different kinds of errors occur. For 404 errors, ASP.NET switches to the handler page by issuing an HTTP 302 (Found) redirect. This is unacceptable when you want a clean, transparent transfer to the handler page without any knowledge on the part of the client. However, when IIS handles a 404 error instead of ASP.NET, it does something akin to a Server.Transfer() call under the hood, meaning that the client is not redirected. This is exactly what we need to implement our dynamically generated sitemap.xml file. Since XML files are not handled by the ASP.NET engine, IIS will transfer to an ASP.NET page of our choice, where we can do whatever we like.

Google's Use of Site Maps

Site maps used by Google and other search engines depend on a simple XML schema that is defined at http://www.sitemaps.org/protocol.php. If you're like me, the best way to understand such a simple schema is to look at a real, live site map. Load the live sitemap.xml file for my own blog into a new web browser window to see an example. It's very easy to understand, don't you think? Site maps are a good complement to the robots.txt file on your site because they let you specify what should be indexed by the search engine, whereas robots.txt specifies what should not be indexed. Use the Google Webmaster Tools to register your site map when it's ready.
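To make the shape of the schema concrete, here is a minimal sketch that builds a one-entry site map with XmlDocument and returns it as a string. The class name SitemapSketch, its method names, and any example.com URL passed to it are illustrative assumptions and are not part of the code presented later in this article.

```csharp
using System;
using System.Globalization;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class SitemapSketch
{
    // Namespace defined by the sitemaps.org protocol.
    private const string Xmlns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Builds a site map with a single <url> entry and returns its XML.
    public static string BuildSingleEntrySitemap(
        string loc, DateTime lastmod, string changefreq)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement urlset = doc.CreateElement("urlset", Xmlns);
        doc.AppendChild(urlset);

        XmlElement url = doc.CreateElement("url", Xmlns);
        urlset.AppendChild(url);

        AppendTextChild(doc, url, "loc", loc);
        AppendTextChild(doc, url, "lastmod",
            lastmod.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
        AppendTextChild(doc, url, "changefreq", changefreq);

        return doc.OuterXml;
    }

    // Adds <name>text</name> to the parent element.
    private static void AppendTextChild(
        XmlDocument doc, XmlNode parent, string name, string text)
    {
        XmlElement child = doc.CreateElement(name, Xmlns);
        child.InnerText = text; // &, < and > are escaped on serialization
        parent.AppendChild(child);
    }
}
```

Calling BuildSingleEntrySitemap with a URL, a date, and "daily" yields a <urlset> root holding one <url> element with <loc>, <lastmod> and <changefreq> children, which is the whole structure a search engine needs for one page.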

The 404 Handler Page Code

Of course, the key to being able to dynamically generate a sitemap.xml file using a 404 error handler page is that the sitemap.xml file must not exist, physically, on your site. Start by creating a new ASPX page that will do the work instead. Remember that this page is probably going to do double duty by generating your sitemap.xml file and by handling real "not found" problems. So, it should be styled in a way that matches your site design.

At my gotnet.biz website, I store a reference to the pages that I want Google to index in a database. To build a dynamic site map, all I need to do is add each of those pages as a <url> node, according to the sitemap.xml specification. Below is a helper method called AddUrlNodeToUrlSet, which will do just that. It is shown here in part one of a two-part partial class:

using System;
using System.Globalization;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    // the standard schema namespace and change frequencies
    // for site maps defined at http://www.sitemaps.org/protocol.php
    private static readonly string xmlns =
        "http://www.sitemaps.org/schemas/sitemap/0.9";
    private enum freq { always, hourly, daily, weekly, monthly, yearly, never }

    // add a url node to the specified XML document with standard
    // priority to the urlset at the document root
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq )
    {
        // call the overload with standard priority; it performs
        // the sanity checks on the arguments
        AddUrlNodeToUrlSet( reqUrl, doc, loc, lastmod, changefreq, null );
    }

    // add a url node to the specified XML document with variable
    // priority to the urlset at the document root
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq, float? priority )
    {
        // sanity checks
        if (reqUrl == null || doc == null || loc == null)
            return;

        // create the child url element
        XmlNode urlNode = doc.CreateElement( "url", xmlns );

        // format the absolute URL based on the site settings:
        // SCHEME + AUTHORITY + VIRTUAL PATH + FILENAME
        // no manual escaping is done here because assigning InnerText
        // below already escapes &, < and > when the XML is written out;
        // escaping by hand as well would encode the URL twice
        string url = String.Format( "{0}://{1}{2}", reqUrl.Scheme,
            reqUrl.Authority, VirtualPathUtility.ToAbsolute(
            String.Format( "~/{0}", loc ) ) );

        // set up the loc node containing the URL and add it
        XmlNode newNode = doc.CreateElement( "loc", xmlns );
        newNode.InnerText = url;
        urlNode.AppendChild( newNode );

        // set up the lastmod node (if it should exist) and add it
        if (lastmod != null)
        {
            newNode = doc.CreateElement( "lastmod", xmlns );
            newNode.InnerText = lastmod.Value.ToString( "yyyy-MM-dd",
                CultureInfo.InvariantCulture );
            urlNode.AppendChild( newNode );
        }

        // set up the changefreq node (if it should exist) and add it
        if (changefreq != null)
        {
            newNode = doc.CreateElement( "changefreq", xmlns );
            newNode.InnerText = changefreq.Value.ToString();
            urlNode.AppendChild( newNode );
        }

        // set up the priority node (if it should exist) and add it;
        // out-of-range values fall back to the default of 0.5 and the
        // invariant culture guarantees a "." decimal separator
        if (priority != null)
        {
            newNode = doc.CreateElement( "priority", xmlns );
            newNode.InnerText =
                (priority.Value < 0.0f || priority.Value > 1.0f)
                ? "0.5" : priority.Value.ToString( "0.0",
                    CultureInfo.InvariantCulture );
            urlNode.AppendChild( newNode );
        }

        // add the new url node to the urlset node
        doc.DocumentElement.AppendChild( urlNode );
    }
}

The AddUrlNodeToUrlSet method defined above will be called during the Page_Load event to construct the sitemap.xml file. It simply adds one <url> node for each page on my site that I want to reference in the site map file. Please keep in mind that for my blog, I generate my site map from a list of page names stored in a database table. The next section of code opens that database and reads the results, so the code that finds the searchable pages on your own site might be very different. Now let's look at the Page_Load method in part two of this page:

using System;
using System.Data.OleDb;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    protected void Page_Load( object sender, EventArgs e )
    {
        string QS = Request.ServerVariables["QUERY_STRING"];

        // was it the sitemap.xml file that was not found?
        if (QS != null && QS.EndsWith( "sitemap.xml",
            StringComparison.OrdinalIgnoreCase ))
        {
            // build the sitemap.xml file dynamically by adding all of the
            // articles from the database, set the MIME type to text/xml
            // and stream the file back to the search engine bot
            XmlDocument doc = new XmlDocument();
            doc.LoadXml( String.Format( "<?xml version=\"1.0\" encoding" +
                "=\"UTF-8\"?><urlset xmlns=\"{0}\"></urlset>", xmlns ) );

            // add the fixed blog URL for this site with top priority
            AddUrlNodeToUrlSet( Request.Url, doc, "MyBlog.aspx", null,
                freq.daily, 1.0f );

            // NOTE: add more fixed urls as necessary for your site
            // this could be done programmatically or better still by
            // dependency injection

            // now query the database and add the virtual URLs for this site
            string connectionString = String.Format(
               "NOTE: set this to suit the needs of your content database" );
            string query = "SELECT PAGE_NAME, POSTING_DATE FROM BLOGDB " +
                "ORDER BY POSTING_DATE";

            // the using blocks close the connection, command and reader
            // even if an exception is thrown while reading
            using (OleDbConnection conn = new OleDbConnection( connectionString ))
            using (OleDbCommand cmd = new OleDbCommand( query, conn ))
            {
                conn.Open();
                using (OleDbDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        object page_name = rdr[0];
                        object posting_date = rdr[1];
                        if (page_name != null && !(page_name is DBNull))
                        {
                            // posting_date may be DBNull; the "as" cast then
                            // yields null and no lastmod node is written
                            AddUrlNodeToUrlSet( Request.Url, doc, String.Format(
                                "{0}.ashx", page_name.ToString().Trim() ),
                                posting_date as DateTime?, freq.monthly );
                        }
                    }
                }
            }

            // IMPORTANT - trace has to be disabled or the XML returned will
            // not be valid because the div tag inserted by the tracing code
            // will look like a second root XML node which is invalid
            Page.TraceEnabled = false;

            // IMPORTANT - you must clear the response in case handlers
            // upstream inserted anything into the buffered output already
            Response.Clear();

            // IMPORTANT - set the status to 200 OK, not the 404 Not Found
            // that this page would normally return
            Response.Status = "200 OK";

            // IMPORTANT - set the MIME type to XML
            Response.ContentType = "text/xml";

            // write the whole XML document and end the request;
            // Response.End() stops page execution, so the 404 status
            // below is never applied to the site map response
            Response.Write( doc.OuterXml );
            Response.End();
        }

        // not the sitemap.xml file so set the standard 404 error code
        Response.Status = "404 Not Found";
    }
}

When Page_Load starts, it checks QUERY_STRING to see if the sitemap.xml file was the missing one that caused the transfer to happen. This is possible because the transfer agent in IIS that handles the name of the missing file simply appends it to QUERY_STRING. If the name is sitemap.xml, my code starts a new XML document and adds the virtual <url> nodes using the AddUrlNodeToUrlSet method shown above. Which page names you will include in your site map is totally dependent on your site's content, so you'll have to make most of your adjustments to my sample in that area. At the end of Page_Load is some interesting code I want to highlight. There are five key things that happen at this point, in order:

  • You must disable page tracing if it's turned on. If you don't, ASP.NET appends a <div> element to the end of the document making your XML appear as though it has two root nodes, which will invalidate it.
  • You must clear the Response object in case some other code has already buffered some content to be sent back to the browser. You want just the XML of the site map in the output, nothing else.
  • You need to set the HTTP status code to 200 to make sure that the client sees the result of its request as successful. Google bots don't like anything but success.
  • You must set the MIME type of Response to text/xml because that's what the search engine bots expect for the document type you are returning.
  • Finally, grab the OuterXml property of the XML document and Write() it back to the browser before ending the Response.
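The QUERY_STRING check at the start of Page_Load can be made a little stricter than a plain EndsWith test, which would also match a missing file such as notsitemap.xml. The sketch below assumes that IIS passes the failed request in the form "404;http://host:80/path/file.ext"; that format is an assumption you should verify against your own server, and the NotFoundQuery class and its method names are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class NotFoundQuery
{
    // Returns the file name of the missing resource, or null when the
    // query string does not contain an absolute URL with a file name.
    // ASSUMPTION: the query string looks like
    // "404;http://www.example.com:80/sitemap.xml".
    public static string GetMissingFileName(string queryString)
    {
        if (String.IsNullOrEmpty(queryString))
            return null;

        // Drop the "404;" status prefix if it is present.
        int separator = queryString.IndexOf(';');
        string target = separator >= 0
            ? queryString.Substring(separator + 1)
            : queryString;

        Uri uri;
        if (!Uri.TryCreate(target, UriKind.Absolute, out uri))
            return null;

        // Keep only the last path segment, ignoring any query part.
        string path = uri.AbsolutePath;
        string name = path.Substring(path.LastIndexOf('/') + 1);
        return name.Length == 0 ? null : name;
    }

    // True only when the missing file is exactly sitemap.xml.
    public static bool IsSitemapRequest(string queryString)
    {
        return String.Equals(GetMissingFileName(queryString),
            "sitemap.xml", StringComparison.OrdinalIgnoreCase);
    }
}
```

In Page_Load, the condition could then read NotFoundQuery.IsSitemapRequest(QS) in place of the EndsWith call, with the rest of the method unchanged.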

Configuring IIS to Transfer to the Error Handler Page

To get the page defined above to handle HTTP 404 errors, it first has to be registered with IIS to handle them. Remember, you can register the same ASPX page to handle errors for both ASP.NET type pages and non-ASP.NET type pages. However, for file types handled by ASP.NET, you use web.config to register them. Since the XML type is not handled by the ASP.NET engine, you need to tell IIS about this new page, which cannot be done through the web.config file. Instead, you must use the IIS Management Console to register the error handler page. The Microsoft TechNet website has very good instructions on this topic. On my test site using the IIS Management Console, the registration of the ErrorNotFound404.aspx page looks like this:

Screenshot - ConfiguringIISErrors.gif

Conclusion

You can also register the same page as an error handler with ASP.NET via the web.config file as discussed above. Just be aware that when the ASP.NET engine handles an error, it will redirect the browser to the page you specify. So, if you're depending on a clean transfer to the error handler, you probably aren't going to get exactly what you want. For the sitemap.xml file, though, the approach shown above is very clean because of the way IIS handles missing files. Once you're done, use the Fiddler2 Web Debugging Proxy to request your sitemap.xml file and use the session inspector to see exactly what's happening on the wire. You'll see just how clean this code makes the would-be 404 error for that missing file appear to the search engine bots.

One Parting Thought

If you can generate a dynamic sitemap.xml file using this technique, you could probably use it to generate almost any kind of virtual file: robots.txt, RSS feeds, etc. This means that even more of your site could be dynamically generated from database content. Think about that. Enjoy!
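As a rough sketch of how that might look, the lookup table below maps a missing file name to a function that produces its content. Everything here is hypothetical: the VirtualFile and VirtualFiles names, the robots.txt body, and the example.com address are placeholders, not code from my site.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for a generated response body and its MIME type.
public sealed class VirtualFile
{
    public readonly string ContentType;
    public readonly string Body;

    public VirtualFile(string contentType, string body)
    {
        ContentType = contentType;
        Body = body;
    }
}

// Hypothetical registry of files that exist only as generated content.
public static class VirtualFiles
{
    // File names are matched case-insensitively.
    private static readonly Dictionary<string, Func<VirtualFile>> Generators =
        new Dictionary<string, Func<VirtualFile>>(StringComparer.OrdinalIgnoreCase)
        {
            { "robots.txt", BuildRobotsTxt }
        };

    // Returns true and the generated file when fileName is registered.
    public static bool TryGenerate(string fileName, out VirtualFile file)
    {
        file = null;
        Func<VirtualFile> generator;
        if (fileName == null || !Generators.TryGetValue(fileName, out generator))
            return false;

        file = generator();
        return true;
    }

    // Placeholder content: allow all crawlers and point at the site map.
    private static VirtualFile BuildRobotsTxt()
    {
        string body =
            "User-agent: *\n" +
            "Disallow:\n" +
            "Sitemap: http://www.example.com/sitemap.xml\n";
        return new VirtualFile("text/plain", body);
    }
}
```

The 404 handler page would call TryGenerate with the missing file name and, on success, write Body with ContentType and a 200 status in the same way the site map is returned; a real "not found" falls through to the normal 404 response.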

History of This Article

  • 24 Nov 2007 - Initial publication
  • 28 Nov 2007 - Article edited and moved to the main CodeProject.com article base

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
