Introduction
StringParser is an object that helps you extract information from a string. The class is perhaps best suited to parsing HTML pages downloaded from the web (see my WebResourceProvider class, which helps you do this). You use StringParser by constructing it with some content (i.e. a string) and then using its navigational and extraction methods to extract substrings from the content. StringParser also provides some static methods designed specifically for parsing HTML.
API
Here are some of the methods provided by StringParser. Please see the accompanying documentation for an exhaustive list.
Navigational API: resetPosition(), skipToEndOf(), skipToEndOfNoCase(), skipToStartOf(), skipToStartOfNoCase()

Extraction API: extractTo(), extractToNoCase(), extractUntil(), extractUntilNoCase(), extractToEnd()

Position query API: at(), atNoCase()

HTML parsing API: getLinks(), removeComments(), removeEnclosingAnchorTag(), removeEnclosingQuotes(), removeHtml(), removeScripts()
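To make the "position within the content" idea behind the navigational and extraction methods more concrete, here is a minimal sketch. It is not the StringParser source: the class MiniParser is hypothetical, it borrows three method names from the list above only for readability, and the semantics shown (skipToStartOf stops at the match, skipToEndOf stops just past it, extractTo returns the text before the match and moves past it) are assumptions inferred from the names.

```csharp
using System;

// Hypothetical stand-in for illustration only; not the real StringParser.
public class MiniParser
{
    private readonly string content;
    private int pos;   // current read position within content

    public MiniParser(string content)
    {
        this.content = content ?? "";
        pos = 0;
    }

    public void resetPosition() { pos = 0; }

    // Assumed semantics: move to the first character of the next match.
    public bool skipToStartOf(string s)
    {
        int i = content.IndexOf(s, pos, StringComparison.Ordinal);
        if (i < 0) return false;
        pos = i;
        return true;
    }

    // Assumed semantics: move just past the next match.
    public bool skipToEndOf(string s)
    {
        int i = content.IndexOf(s, pos, StringComparison.Ordinal);
        if (i < 0) return false;
        pos = i + s.Length;
        return true;
    }

    // Assumed semantics: return the text before the next match,
    // then move the position past the match.
    public bool extractTo(string s, ref string extracted)
    {
        int i = content.IndexOf(s, pos, StringComparison.Ordinal);
        if (i < 0) return false;
        extracted = content.Substring(pos, i - pos);
        pos = i + s.Length;
        return true;
    }
}

public static class MiniParserDemo
{
    public static void Main()
    {
        string extracted = "";
        MiniParser p = new MiniParser("Hello Sally, how are you?");
        if (p.skipToEndOf(",") && p.extractTo("?", ref extracted))
            Console.WriteLine("Extracted text = {0}", extracted);
        else
            Console.WriteLine("No text extracted.");
    }
}
```

Each method returns false and leaves the position untouched when the search string is not found, which is what allows calls to be chained with && as in the examples that follow.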
Example 1 - Extracting delimited text
This example shows how to extract text contained between two delimiters.
string strExtract = "";
string str = "Hello Sally, how are you?";
StringParser p = new StringParser (str);
if (p.skipToEndOf (",") && p.extractTo ("?", ref strExtract))
    Console.WriteLine ("Extracted text = {0}", strExtract);
else
    Console.WriteLine ("No text extracted.");
Example 2 - Extracting the nth occurrence of a delimited string
This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string. The example assumes the string contains valid HTML.
string strExtract = "";
string str = "...";
StringParser p = new StringParser (str);
if (p.skipToEndOfNoCase ("<a") &&
    p.skipToEndOfNoCase ("<a") &&
    p.skipToEndOfNoCase ("<a") &&
    p.skipToEndOfNoCase ("href=\"") &&
    p.extractTo ("\"", ref strExtract))
    Console.WriteLine ("Extracted text = {0}", strExtract);
else
    Console.WriteLine ("No text extracted.");
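For comparison, the same "skip n occurrences, then extract" idea can be sketched with nothing but String.IndexOf. This is only an illustration under stated assumptions: the class NthOccurrenceDemo and its methods IndexAfterNth and HrefOfNthAnchor are hypothetical names, not part of StringParser, and the code shares the example's assumption that the HTML is valid and that href values are double-quoted.

```csharp
using System;

public static class NthOccurrenceDemo
{
    // Returns the index just past the nth (1-based) case-insensitive
    // occurrence of token in content, or -1 if there are fewer than n.
    public static int IndexAfterNth(string content, string token, int n)
    {
        int pos = 0;
        for (int found = 0; found < n; found++)
        {
            int i = content.IndexOf(token, pos, StringComparison.OrdinalIgnoreCase);
            if (i < 0) return -1;
            pos = i + token.Length;
        }
        return pos;
    }

    // Returns the first double-quoted href value that follows the start
    // of the nth anchor tag, or null if it cannot be found.
    public static string HrefOfNthAnchor(string html, int n)
    {
        int pos = IndexAfterNth(html, "<a", n);
        if (pos < 0) return null;

        const string attr = "href=\"";
        int start = html.IndexOf(attr, pos, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += attr.Length;

        int end = html.IndexOf('"', start);
        return end < 0 ? null : html.Substring(start, end - start);
    }

    public static void Main()
    {
        string html = "<a href=\"one\">1</a><A HREF=\"two\">2</a><a href=\"three\">3</a>";
        Console.WriteLine("Third href = {0}", HrefOfNthAnchor(html, 3));
    }
}
```

Like the example above, this is deliberately naive: "<a" also matches tags such as <abbr>, and the href that is found may belong to a later tag if the nth anchor has none.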
Example 3 - Global case-insensitive replacement
This example shows how to case-insensitively replace a string in the parser's content.
string str = "...";
StringParser p = new StringParser (str);
p.replaceEvery ("<td>", "<td class=\"foo\">");
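If you want the same effect on a plain string without the parser, one possible approach is Regex.Replace from the .NET base class library with RegexOptions.IgnoreCase. The sketch below is an assumption about how such a replacement could be written, not a description of how replaceEvery works internally; the class ReplaceDemo and the helper ReplaceIgnoreCase are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical helper: replaces every occurrence of 'find' in 'content',
    // ignoring case. Regex.Escape makes 'find' match literally, and "$" is
    // doubled so the replacement text is not read as a group reference.
    public static string ReplaceIgnoreCase(string content, string find, string replacement)
    {
        return Regex.Replace(
            content,
            Regex.Escape(find),
            replacement.Replace("$", "$$"),
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<TD>a</td><td>b</td>";
        Console.WriteLine(ReplaceIgnoreCase(html, "<td>", "<td class=\"foo\">"));
    }
}
```

Escaping the search text matters for HTML fragments, which often contain characters such as "(", "." or "?" that would otherwise be treated as regular-expression syntax.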
Example 4 - Poor man's web scraping
This example shows how to obtain a stock's quote from content downloaded from Yahoo Finance (MSFT). The example makes assumptions about the format of the web page.
string strQuote = "";
string str = "...";
StringParser p = new StringParser (str);
if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
    p.extractTo ("</b>", ref strQuote))
    Console.WriteLine ("MSFT (delayed) = {0}", strQuote);
Example 5 - Get list of hyperlinked phrases
This example shows how to obtain the list of hyperlinked phrases in HTML content.
ArrayList phrases = new ArrayList();
string str = "...";
StringParser p = new StringParser (str);
while (p.skipToStartOfNoCase ("<a")) {
    string strPhrase = "";
    // Skip the rest of the opening tag, then take the text up to the closing tag.
    if (p.skipToEndOf (">") && p.extractTo ("</a>", ref strPhrase))
        phrases.Add (strPhrase);
}
Demo applications
C# applications (with full source code) that use StringParser can be found here:
Revision History
- 15 Jan 2006: Initial version.