StringParser

Ravi Bhavnani

15 Jan 2006

An object that makes it easy to extract information from strings, especially HTML content.

Introduction

StringParser is an object that helps you extract information from a string. It is best suited to parsing HTML pages downloaded from the web (see my WebResourceProvider class, which helps you download them). To use StringParser, construct it with some content (a string) and call its navigational and extraction methods to pull substrings out of that content. StringParser also provides some static methods designed specifically for parsing HTML.

API

Here are some of the methods provided by StringParser. Please see the accompanying documentation for an exhaustive list.

Navigational API
resetPosition()
skipToEndOf()
skipToEndOfNoCase()
skipToStartOf()
skipToStartOfNoCase()

Extraction API
extractTo()
extractToNoCase()
extractUntil()
extractUntilNoCase()
extractToEnd()

Position query API
at()
atNoCase()

HTML parsing API
getLinks()
removeComments()
removeEnclosingAnchorTag()
removeEnclosingQuotes()
removeHtml()
removeScripts()
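
To make the cursor model behind the navigational and extraction methods concrete, here is a minimal sketch of a position-tracking parser. It is not the StringParser source: the class name MiniParser and its PascalCase members are hypothetical, and the sketch assumes that a skip-to-end call leaves the position just past the match and that an extract call returns the text up to, but not including, the delimiter.

```csharp
using System;

// Hypothetical stand-in for illustration only; not the actual StringParser code.
public class MiniParser
{
    private readonly string content;
    private int position;   // index of the next unread character

    public MiniParser(string content)
    {
        this.content = content ?? "";
        position = 0;
    }

    // Moves the position to the first character of the next match.
    public bool SkipToStartOf(string text)
    {
        int index = content.IndexOf(text, position, StringComparison.Ordinal);
        if (index < 0) return false;
        position = index;
        return true;
    }

    // Moves the position just past the next match.
    public bool SkipToEndOf(string text)
    {
        if (!SkipToStartOf(text)) return false;
        position += text.Length;
        return true;
    }

    // Copies the text between the position and the next delimiter,
    // then moves the position past the delimiter.
    public bool ExtractTo(string delimiter, ref string extract)
    {
        int index = content.IndexOf(delimiter, position, StringComparison.Ordinal);
        if (index < 0) return false;
        extract = content.Substring(position, index - position);
        position = index + delimiter.Length;
        return true;
    }
}
```

With this sketch, constructing a MiniParser over "Hello Sally, how are you?", calling SkipToEndOf(",") and then ExtractTo("?", ref s) leaves s holding " how are you" (with the leading space). A failed search returns false and leaves the position unchanged.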

Example 1 - Extracting delimited text

This example shows how to extract text contained between two delimiters.

  // Extract text between the comma and question mark.
  // skipToEndOf() is used so that the comma itself is skipped
  // (assuming it leaves the position just past the match).
  string strExtract = "";
  string str = "Hello Sally, how are you?";
  StringParser p = new StringParser (str);
  if (p.skipToEndOf (",") && p.extractTo ("?", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");

Example 2 - Extracting the nth occurrence of a delimited string

This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string. The example assumes the string contains valid HTML.

  // Get href attribute of 3rd <a> tag.
  // Each skipToEndOfNoCase() call is assumed to move past its match,
  // so repeated calls advance to successive <a> tags.
  string strExtract = "";
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  if (p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("href=\"") &&
      p.extractTo ("\"", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");
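
The idea of reaching the nth occurrence can also be sketched as a loop over plain framework calls. The helper below is hypothetical (NthSketch and IndexOfNth are not part of StringParser); it is shown only to illustrate the technique of restarting each search just after the previous match.

```csharp
using System;

public static class NthSketch
{
    // Returns the index of the n-th (1-based) case-insensitive occurrence
    // of 'text' in 'content', or -1 if there are fewer than n occurrences.
    public static int IndexOfNth(string content, string text, int n)
    {
        if (n < 1 || string.IsNullOrEmpty(content) || string.IsNullOrEmpty(text))
            return -1;

        int index = -1;
        for (int i = 0; i < n; i++)
        {
            // Restart the search one character past the previous match.
            index = content.IndexOf(text, index + 1, StringComparison.OrdinalIgnoreCase);
            if (index < 0) return -1;
        }
        return index;
    }
}
```

For the content "<a x><A y><a z>", IndexOfNth(content, "<a", 3) returns 10, the start of the third anchor tag, and asking for a fourth occurrence returns -1.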

Example 3 - Global case-insensitive replacement

This example shows how to replace every occurrence of a string in the parser's content, ignoring case.

  // Replace every occurrence of <td> with <td class="foo">
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  p.replaceEvery ("<td>", "<td class=\"foo\">");
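
For comparison, the same kind of case-insensitive replacement can be sketched with the .NET Framework's own Regex class. This is an alternative technique shown for illustration, not a description of how replaceEvery works internally; the names ReplaceSketch and ReplaceEveryNoCase are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceSketch
{
    // Replaces every occurrence of 'find' with 'replacement', ignoring case.
    // Regex.Escape makes 'find' match literally; '$' in the replacement is
    // doubled so it is not interpreted as a substitution pattern.
    public static string ReplaceEveryNoCase(string content, string find, string replacement)
    {
        return Regex.Replace(content,
                             Regex.Escape(find),
                             replacement.Replace("$", "$$"),
                             RegexOptions.IgnoreCase);
    }
}
```

Calling ReplaceSketch.ReplaceEveryNoCase(html, "<td>", "<td class=\"foo\">") rewrites both <td> and <TD> while leaving closing </td> tags untouched.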

Example 4 - Poor man's web scraping

This example shows how to obtain a stock's quote from the content downloaded from Yahoo Finance (MSFT). The example makes assumptions about the format of the web page.

  // Scrape http://finance.yahoo.com/q?s=msft
  string strQuote = "";
  string str = "..."; // HTML downloaded from http://finance.yahoo.com/q?s=msft
  StringParser p = new StringParser (str);
  if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
      p.extractTo ("</b>", ref strQuote))
     Console.WriteLine ("MSFT (delayed) = {0}", strQuote);

Example 5 - Get list of hyperlinked phrases

This example shows how to obtain the list of hyperlinked phrases in HTML content.

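  // Requires: using System.Collections;
  ArrayList phrases = new ArrayList();
  string str = "..."; // HTML content
  StringParser p = new StringParser (str);
  // Stop as soon as an opening tag cannot be skipped, so the loop always advances
  while (p.skipToStartOfNoCase ("<a") && p.skipToEndOf (">")) {
    string strPhrase = "";
    // The phrase is the text between the opening tag and the closing </a>
    if (p.extractTo ("</a>", ref strPhrase))
       phrases.Add (strPhrase);
  }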

Demo applications

C# applications (with full source code) that use StringParser can be found here:

DomainWalker - a web topology analyzer
GoogleTranslator - an object that uses Google to translate natural language
SimpleRSS - an RSS channel reader

Revision History

15 Jan 2006
Initial version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
