Exact Phrase Website Live Search Script .NET 3.5

18 Aug 2010
This script can be used in an ASP.NET 3.5 website. It searches all HTML pages and text files on the website live for exact phrases and keywords.

Introduction

This project is an exact-phrase website live search script. It is compatible with ASP.NET 3.5 and can be used on any website or web server, including localhost and hosted domains such as domain.com. The script is written primarily in C#. It searches the website's .html, .htm, .txt, .rtf, and .nfo files for an exact keyword or phrase. The search requires no database, temporary files, or temporary disk space: it runs live on the server that hosts the website. On a Xeon server, a single search parses at least 500 web pages and indexes the matching files into the results in about two seconds.

Extended Information

Put this ASP.NET 3.5 script into yourdomain.com/<folder> and search the entire website for an exact word or phrase. As long as .NET 3.5 is installed on your server or localhost, the script should run without problems. The script is freeware for personal and educational use. You can deploy it in universities, schools, and public libraries to search all supported files on the website. The script is upgradable.

Because the script searches files live, it needs no database, index file, or temporary disk space. What it does need is memory. The search covers the entire website in about two seconds, and at least 500 web pages and text files can be parsed.

The phrase or keyword is searched exactly as typed in the search form, except that matching is done in lower case, so the user does not need to know how a word is capitalized in the files: all matching words or phrases are found regardless of case. The author guarantees a 99.9% exact search in any valid parsed file (.html, .htm, .txt, .nfo, .rtf), which makes the script well suited to public libraries for searching books, titles, names, etc.
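The lower-case matching described above can be illustrated with a minimal sketch. It is not the script's own code: the class name PhraseMatchSketch and the method CountPhraseMatches are hypothetical, and the sketch assumes a plain substring match with no word-boundary check.

```csharp
using System;

public static class PhraseMatchSketch
{
    // Counts non-overlapping occurrences of phrase in text, ignoring case.
    // Both strings are lowercased first, mirroring the behaviour described above.
    public static int CountPhraseMatches(string text, string phrase)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(phrase))
            return 0;

        string haystack = text.ToLowerInvariant();
        string needle = phrase.ToLowerInvariant();

        int count = 0;
        int pos = haystack.IndexOf(needle, StringComparison.Ordinal);
        while (pos >= 0)
        {
            count++;
            // Continue searching just past the end of the previous match.
            pos = haystack.IndexOf(needle, pos + needle.Length, StringComparison.Ordinal);
        }
        return count;
    }
}
```

Because this is a substring match, a keyword such as "cat" would also be found inside "concatenate". Whether the actual script restricts matches to whole words is not stated in the article.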

Background

This is the author's newest search engine module, written in ASP.NET and C#. It aims to provide what the webmaster of a demanding, educational, or free website requires. The script is upgradable because it is organized entirely into methods/functions rather than unstructured inline code.

Using the Code

Download the code and extract the files into the <localhost>/search or <domain.com>/search folder, set up as a virtual directory in your website. With ASP.NET installed, the script runs from the main display file, phrasesearch.aspx, which shows the search form. The phrasesearch.aspx.cs file is the primary C# code file; at the top of it are a few constants and global variables that can be configured manually. There are only four to six of them, and they are easy to understand for programmers who know web development and ASP.NET.

In phrasesearch.aspx.cs, these global variables are optional. The script configures itself automatically, so you only need to set them if an error comes up.

//*****************************************************************
//
//  MANUAL CONFIGURATION SECTION
//
//*****************************************************************

// Configure your website address here:
// example: http://www.mydomain.com:80
// example: http://www.mydomain2.com:8080
// example: http://www.mydomain3.com:2200
// include the port number of your HTTP server in the website URL
// please do not put a slash mark "/" after the fully qualified website address

public const String mywebsiteaddr = "";

// Configure your domain here:
// example: mydomain.com
// no slashes, no www., no http://, only domain name with .com or another .extension
public const String mydomain = "";

// Configure your domain/website's root directory here:
// example: "public_html" or "wwwroot" or "httpdocs"

// please do not put slash mark "/" after the directory's name
public const String mydomainrootaddr = "";

// Configure your starting search directory, which will be searched entirely
// example: "~/" (root of your domain)
// or "~/folder1/" or "~/folder2/"
// please put a forward slash mark "/" after the directory's name
public const String mysearchdir = "~/";

// Enable the flag that you have filled the variables above:
// example: isSetupDone = true;
public const Boolean isSetupDone = false;

//*****************************************************************/

// Configure the maximum number of files to be searched in the entire website.
// example: myfileslimit = 10000 -> will search only ten thousand files
// example: myfileslimit = 1000 -> will search only one thousand files
// example: myfileslimit = 50 -> will search only fifty files
// please use digits only, and no negative numbers
// This variable is not under the isSetupDone flag and so is 
// always enabled in the code.
public static int myfileslimit = 1000;

// Recommended configuration
// Configure your domain/website's physical path here, 
// if you actually and exactly know it!:
// example: myphysicalpath = "C:\\Inetpub\\wwwroot"

// please do not put a backslash "\" after the directory's name
// please put double backslash between all the directory names 
// in the path: C:\\dir1\\dir2\\dir3
// This setting does not depend on isSetupDone flag.
// Example2: myphysicalpath = "C:\\Inetpub\\vhosts\\lawsofbrahman.com\\httpdocs"
public const String myphysicalpath = ""; 

Some Functions in the Code

public static string PageName()

Returns the filename of the running script (<filename.aspx>). This function is optional.

public static Boolean initializeSiteSearching()

This function is called only when a new search is about to start. It resets all the required variables to null or 0. It also obtains the global website values once, up front (the website URL, the domain name, and the port the website runs on), so that no ASP.NET function has to be called over and over while parsing a large number of files. Finally, it sets the configurable global limit on the number of files searched in one go.

public static String GenerateInternalLink(String strPath)

This function is called whenever a valid file has been parsed and is about to be indexed live on the search results page. It removes the physical path of the file, converts it to a path relative to the website, and returns that relative URL, which becomes the hyperlink for the file in the indexed result.
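The conversion from a physical path to a site-relative URL can be sketched with standard string operations. This is only an illustration under stated assumptions, not the article's GenerateInternalLink: the names LinkSketch and ToRelativeUrl are hypothetical, and the sketch assumes the file lives under the site root and that no URL encoding is needed.

```csharp
using System;

public static class LinkSketch
{
    // Converts a physical file path under rootDir into a site-relative URL path.
    // Returns an empty string when filePath is not located under rootDir.
    public static string ToRelativeUrl(string rootDir, string filePath)
    {
        if (string.IsNullOrEmpty(rootDir) || string.IsNullOrEmpty(filePath))
            return "";

        // Normalize separators and drop any trailing slash on the root.
        string root = rootDir.Replace('\\', '/').TrimEnd('/');
        string path = filePath.Replace('\\', '/');

        // Windows paths are case-insensitive, so compare without regard to case.
        if (!path.StartsWith(root + "/", StringComparison.OrdinalIgnoreCase))
            return "";

        // Keep the leading "/" so the result is rooted at the site.
        return path.Substring(root.Length);
    }
}
```

For example, a root of C:\Inetpub\wwwroot and a file at C:\Inetpub\wwwroot\docs\page.html would give /docs/page.html. Keeping the leading slash is a choice made for this sketch; the article's own function may return the path without it.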

public static String GetRelavitePathOfFile(String filepath)

This function is also called whenever a valid file has been parsed and is about to be indexed. It parses the physical path relative to the localhost and is the primary function for converting a physical path to a website-relative path/URL. If it returns an error instead of a website-relative URL, GenerateInternalLink is called as the fallback to obtain one. Both methods are upgradable and are called from the FindMatchingCurrentFile() method, which performs the automated tasks: reading the entire file, stripping tags, etc., and then matching the user's phrase against the text in the file buffer. If the text matches at least once, the file is indexed, its title is taken from the HTML file, and a link is produced so that the user can open the file that contains the phrase or keyword.

public static Boolean FindMatchingCurrentFile(string path)

This function performs the file-related tasks: it searches the file for the user-specified phrase and, when a match is found, indexes the file into the results section. It is the primary file-handling and indexing function.
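The body of FindMatchingCurrentFile is not shown in the article. The tag-stripping and matching step it is described as performing might look roughly like the following sketch. The names FileMatchSketch, StripTags, and ContainsPhrase are hypothetical, and the regular expressions are a simplification that does not decode HTML entities or handle malformed markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileMatchSketch
{
    // Removes script/style blocks and remaining tags, then collapses whitespace.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return "";

        // Drop whole <script>...</script> and <style>...</style> blocks first,
        // so their contents are never treated as page text.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Replace every remaining tag with a space.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Collapse runs of whitespace so phrases split by tags still match.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns true when the lowercased phrase occurs in the tag-free, lowercased text.
    public static bool ContainsPhrase(string html, string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
            return false;

        string text = StripTags(html).ToLowerInvariant();
        return text.IndexOf(phrase.ToLowerInvariant(), StringComparison.Ordinal) >= 0;
    }
}
```

Stripping tags before matching means that text inside attributes (for example a class name) is not searched, and that a phrase such as "hello world" is still found when the markup reads Hello <b>World</b>.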

Some Sample Code Snippets

public static Boolean initializeSiteSearching()
{
    Boolean returnval = false;
    
    // Reset all static variables before beginning a new search...
    strSearchPhrase = "";
    inumResultsLimit = 0;
    strDomainRootDir = "";
    strDomainUrl = "";
    boolSearchingDone = false;
    arrayPhraseMatches = null;
    iPhraseMatchCount = 0;
    arrayPhrasePositions = null;
    numMatchingFilesFound = 0;
    iCountSearchedFiles = 0;
    iFilesLimitReached = false;
    
    // Find the domain's root directory
    strDomainRootDir = HttpContext.Current.Server.MapPath("~/");
    
    // Find domain's url
    strDomainUrl = HttpContext.Current.Request.Url.Host;
    
    // Get the server/domain.com's port
    strDomainPort = HttpContext.Current.Request.ServerVariables["SERVER_PORT"];
    
    // If the configured files limit is zero or negative, 
    // fall back to a limit of 10000 files
    if (myfileslimit <= 0)
        myfileslimit = 10000;
        
    // Finalize if we got the domain root directory and the domain url
    if (strDomainRootDir.Length > 0 && strDomainUrl.Length > 0)
    {
        returnval = true;
    }
    
    return returnval;
}

public static String GenerateInternalLink(String strPath)
{
    String strfinalurl = "";
    String strPathstipped = "";
    String strDomainRootStripped = "";
    // Strip the drive letter and the :\ that follows it
    strPathstipped = Replace(strPath, @"^(([a-zA-Z0-9-]+:+\\))", "", 
        -1, 0, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    strDomainRootStripped = Replace(strDomainRootDir, @"^(([a-zA-Z0-9-]+:+\\))", "", 
        -1, 0, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Convert backslashes to forward slashes in both paths
    strPathstipped = Replace(strPathstipped, @"\\", "/", 
        -1, 0, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    strDomainRootStripped = Replace(strDomainRootStripped, @"\\", "/", 
        -1, 0, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove the domain root directory, leaving the website relative path
    strfinalurl = Replace(strPathstipped, strDomainRootStripped, "", 
        -1, 0, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    return strfinalurl;
}

// This function shall search all the files and display the files, 
// and their extracts.
public static Boolean SearchForPhraseInFiles(String strPhrase, int ireslimit)
{
    // start the timer to count elapsed time until the search completes.
    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    // Response.Write is used to write the results live, as they are found.
    Boolean returnval = false;

    // Check if the search was performed right from the 
    // "phrasesearch.aspx" search engine file.
    // If another file was used, abort and return error.
    if (!isFormPageLoaded)
    {
        String strstyle = "<div style=\"background-color:silver;" + 
            "border:thin; border-style:solid; border-color:olive;\">";
        strstyle += "<b><em>Please load the phrasesearch.aspx file. " + 
            "That is the main display file... Aborted" + 
            "</em></b><br /><br /></div>";
        System.Web.HttpContext.Current.Response.Write(strstyle);
        HttpContext.Current.Response.End();
        return false;
    }

    // Initialize all the params, reset the variables and obtain 
    // important global values.
    // This function is called once, before each new search begins.
    if (initializeSiteSearching() == false)
    {
        // Error, abort
        System.Web.HttpContext.Current.Response.Write
            ("<p><b>Critical error finding path URLs " +
            "while initializing search! Aborting...</b></p>");
        return false;
    }

    // Set variables of search
    strSearchPhrase = strPhrase;
    inumResultsLimit = ireslimit;

    if (isSetupDone)
        System.Web.HttpContext.Current.Response.Write
            ("<b>Searching the domain: <em>" + mywebsiteaddr + 
            "</em></b><br /><br />");
    else
        System.Web.HttpContext.Current.Response.Write
            ("<b>Searching the domain: <em>" + strDomainUrl + 
            "</em></b><br /><br />");

    // Finally, now begin searching for valid files in all the directories and 
    // find the exact keywords and phrases in them...
    if (!isSetupDone)
    {
        if (ProcessDirectory(strDomainRootDir) == true)
        {
            // Found file(s) and processed...
            returnval = true;
        }
    }
    else
    {
        if (ProcessDirectory(mysearchdir) == true)
        {
            // Found file(s) and processed...
            returnval = true;
        }
    }

    // Finished the searching and indexing of files.
    sw.Stop();
    TimeSpan tsobj = sw.Elapsed;
    HttpContext.Current.Response.Write("Search took: " +
        "(<b style=\"color:purple\"><em>" + 
        tsobj.TotalSeconds + "</em></b>) seconds.");

    return returnval;
}


// This function extracts the title from the .htm and .html file
// (if the title exists)
public static String getTitleTagValue(String contents)
{
    Regex pattern;
    Match match;
    pattern = new Regex(@"<title>([\w\s,.:'-]+)</title>", 
            RegexOptions.Compiled | RegexOptions.ECMAScript | 
            RegexOptions.Multiline | RegexOptions.IgnoreCase);
    match = pattern.Match(contents);
    if (match.Success)
        return match.Groups[1].Value;
    else
        return "";
}
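SearchForPhraseInFiles above calls ProcessDirectory, which the article does not show. A recursive directory walk that keeps only the supported extensions and stops at a file limit could be sketched as follows. The names DirectoryWalkSketch and CollectFiles are hypothetical, and collecting paths into a list (rather than searching each file as it is found) is an assumption made to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DirectoryWalkSketch
{
    // Extensions the article lists as searchable.
    private static readonly string[] SearchableExtensions =
        { ".html", ".htm", ".txt", ".rtf", ".nfo" };

    // Recursively collects searchable files under dir until limit is reached.
    public static void CollectFiles(string dir, int limit, List<string> found)
    {
        if (found.Count >= limit)
            return;

        string[] files;
        string[] subdirs;
        try
        {
            files = Directory.GetFiles(dir);
            subdirs = Directory.GetDirectories(dir);
        }
        catch (UnauthorizedAccessException)
        {
            return; // skip directories the worker process cannot read
        }

        foreach (string file in files)
        {
            if (found.Count >= limit)
                return;

            // Compare extensions in lower case so ".HTML" and ".html" both qualify.
            string ext = Path.GetExtension(file).ToLowerInvariant();
            if (Array.IndexOf(SearchableExtensions, ext) >= 0)
                found.Add(file);
        }

        foreach (string sub in subdirs)
            CollectFiles(sub, limit, found);
    }
}
```

The limit parameter plays the role of myfileslimit: once that many files have been collected, the walk stops descending into further directories.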

Code/User Interface Language

The language used is English.

Conclusion

This is the second and latest release. It runs on IIS 5+ and .NET 3.5, is fully automated, and has its known bugs fixed. Contact me at aroratushar@gmail.com if any problem arises.

History

  • 27 August 2009: First release.
  • 18 August 2010: Updated article and download files.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
