How can I search for a particular tag in an HTML file? For example, if my HTML page has an h1 tag somewhere in the middle, how can I verify programmatically that the page contains that tag?
Updated 31-May-10 0:25am

Here is some code that I use to find all image tags within a web page.
I hope this helps.

C#
public List<string> FetchImages(string Url)
        {
            List<string> imageList = new List<string>();

            //Append http:// if necessary
            if (!Url.StartsWith("http://") && !Url.StartsWith("https://"))
                Url = "http://" + Url;

            string responseUrl = string.Empty;
            string htmlData = ASCIIEncoding.ASCII.GetString(DownloadData(Url, out responseUrl));

            if (responseUrl != string.Empty)
                Url = responseUrl;

            if (htmlData != string.Empty)
            {
                string imageHtmlCode = "<img";
                string imageSrcCode = @"src=""";

                int index = htmlData.IndexOf(imageHtmlCode);
                while (index != -1)
                {
                    //Remove previous data
                    htmlData = htmlData.Substring(index);

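                    //Find the two quotes that delimit the src value, and the end of this img tag
                    int bracketEnd = htmlData.IndexOf('>'); //the src value must lie inside this img tag
                    int srcIndex = htmlData.IndexOf(imageSrcCode);
                    int start = srcIndex + imageSrcCode.Length;
                    int end = srcIndex == -1 ? -1 : htmlData.IndexOf('"', start); //closing quote; an empty src gives end == start

                    //Extract the URL only if a non-empty src sits entirely inside this tag
                    if (srcIndex != -1 && end > start && end < bracketEnd)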
                    {
                        string loc = htmlData.Substring(start, end - start);

                        //Store line
                        imageList.Add(loc);
                    }

                    //Move index to next image location
                    if (imageHtmlCode.Length < htmlData.Length)
                        index = htmlData.IndexOf(imageHtmlCode, imageHtmlCode.Length);
                    else
                        index = -1;
                }

                //Format the image URLs
                for (int i = 0; i < imageList.Count; i++)
                {
                    string img = imageList[i];

                    string baseUrl = GetBaseURL(Url);

                    if ((!img.StartsWith("http://") && !img.StartsWith("https://"))
                        && baseUrl != string.Empty)
                        img = baseUrl + "/" + img.TrimStart('/');

                    imageList[i] = img;
                }
            }

            return imageList;
        }
        // The two-argument DownloadData overload and GetBaseURL are not shown here.
        // The original C# code is available at the link below, although it needs a lot of work.
        // It is not my code; my finished version is derived from it.
        // http://www.vcskicks.com/download_file_http.html
        private byte[] DownloadData(string Url)
        {
            string empty = string.Empty;
            return DownloadData(Url, out empty);
        }
 
Comments
vinodkalanji87 2-Jun-10 5:12am    
Hi all, Thanks a lot for the answers...
IndexOf should work as described. As an alternative, you could use the string.Contains() method.
C#
string str = "<h1>some_unnecessary_string</h1>";

Example of the Contains method:
bool result = str.Contains("<h1>");

Or use a regular expression to match the complete tag:
C#
bool result = check(str);
bool check(string source)
{
    return new Regex("<h1>.*?</h1>").IsMatch(source);
}
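
The pattern above only matches a lowercase tag with no attributes whose content sits on one line. A slightly more tolerant variant might look like the sketch below. The class name TagRegexCheck and the method ContainsElement are hypothetical names chosen for illustration, and the pattern is still an approximation: it will be confused by a '>' inside an attribute value or by tags that appear inside comments or scripts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexCheck
{
    // Hypothetical helper: true if source contains <tagName ...>...</tagName>.
    public static bool ContainsElement(string source, string tagName)
    {
        string name = Regex.Escape(tagName);
        // \b stops "h1" from matching "h10"; [^>]* allows attributes.
        string pattern = "<" + name + @"\b[^>]*>.*?</" + name + @"\s*>";
        // IgnoreCase accepts <H1>; Singleline lets '.' match newlines.
        return Regex.IsMatch(source, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<body>\n<H1 class=\"title\">Hello\nworld</H1>\n</body>";
        Console.WriteLine(ContainsElement(html, "h1")); // True
        Console.WriteLine(ContainsElement(html, "h2")); // False
    }
}
```

Escaping the tag name keeps the helper safe if a caller passes characters that have a special meaning in a regular expression.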
 
Once you have your document loaded, just call string.IndexOf("<mytag>"). If the method returns a value less than 0, the string you were looking for doesn't exist. It's faster than using the DOM.
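
An exact search for "<mytag>" misses tags written in a different case or with attributes, so a plain IndexOf check may need a little extra work. The following is a minimal sketch of one way to do that. TagIndexCheck and ContainsOpeningTag are hypothetical names, and the check is purely textual, so it would also report a tag that appears inside a comment or a script block.

```csharp
using System;

public static class TagIndexCheck
{
    // Hypothetical helper: looks for "<tagName" followed by '>', '/' or whitespace.
    public static bool ContainsOpeningTag(string html, string tagName)
    {
        string needle = "<" + tagName;
        int index = html.IndexOf(needle, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            int next = index + needle.Length;
            if (next < html.Length)
            {
                char c = html[next];
                // Reject longer names such as <h10> when searching for <h1>.
                if (c == '>' || c == '/' || char.IsWhiteSpace(c))
                    return true;
            }
            // Continue searching after the current hit.
            index = html.IndexOf(needle, index + 1, StringComparison.OrdinalIgnoreCase);
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsOpeningTag("<p>x</p><H1 id=\"t\">Title</H1>", "h1")); // True
        Console.WriteLine(ContainsOpeningTag("<h10>not a heading</h10>", "h1"));        // False
    }
}
```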
 
Comments
Dalek Dave 31-May-10 7:29am    
Seems the best way, and works on several search criteria (by nexting). 5!
Hi, try using these DOM methods:


x.getElementById(id) - get the element with a specified id
x.getElementsByTagName(name) - get all elements with a specified tag name

x is a node object (HTML element)
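
The methods listed above are the browser (JavaScript) DOM API. In C#, a comparable call exists on System.Xml.XmlDocument, which can be used when the page happens to be well-formed XML (XHTML-style). The sketch below rests on that assumption: typical real-world HTML is often not well-formed and would need a dedicated HTML parser instead. DomTagCheck and CountElements are hypothetical names, and XML tag names are case-sensitive, so "h1" will not match "H1" here.

```csharp
using System;
using System.Xml;

public static class DomTagCheck
{
    // Hypothetical helper; assumes the markup is well-formed XML.
    public static int CountElements(string xhtml, string tagName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed markup
        return doc.GetElementsByTagName(tagName).Count;
    }

    public static void Main()
    {
        string page = "<html><body><p>intro</p><h1>Title</h1><p>text</p></body></html>";
        Console.WriteLine(CountElements(page, "h1") > 0); // True
        Console.WriteLine(CountElements(page, "p"));      // 2
    }
}
```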
 
Comments
Dalek Dave 31-May-10 7:27am    
Good Answer!
vinodkalanji87 1-Jun-10 2:59am    
Thanks a lot for the answers

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


