Build a Tool to Filter, Export, and Download Web Links

18 Oct 2013 · CPOL · 8 min read
A web site link extractor and link file manager that filters and manages links automatically, so you don't waste time checking them by hand.

Image 1

Introduction

With today's growing web sites and services, users who want to find and filter the links they need have to check each link for specific words, strings, or structures that follow some rule for selecting or rejecting it. For gallery sites or multi-level sites with many similar links, these steps take a long time and waste the user's time.

If the user can filter a web site's links with a few rules, the desired links can be extracted with high accuracy.

For example, suppose a web page contains the following links:

1-20x20.jpg

1-100x100.jpg

1.jpg.gif

1.jpg

2-20x20.jpg

2-100x100.jpg

2.gif

2.jpg

.

.

.

1000-20x20.jpg

1000.jpg.gif

1000-100x100.jpg

1000.jpg

If the desired links are just 1.jpg, 2.jpg, …, 999.jpg, 1000.jpg, we need to filter the page's links by deleting every link that contains an unwanted string and keeping the rest.

Rules:

Delete links that contain at least one of the strings "20", "100", or ".gif", and keep links that contain the string ".jpg" (without the quotes).

This can be written as a string delimited by / (the slash character): 20/100/.gif is the undesired string list, and links containing any of these strings are excluded from the final list; .jpg is the desired string list, which forces the final links to contain .jpg. Both lists can be extended with more strings. Note that the match is a plain substring test, so "20" and "100" would also remove 20.jpg, 100.jpg, and 1000.jpg; more specific strings such as x20 and x100 avoid that.
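
The rule above can be sketched in a few lines of C#. This is only a minimal illustration under stated assumptions, not the tool's actual code: the class name LinkFilter and the methods ParseList and Keep are hypothetical names introduced for this example.

```csharp
using System;

public static class LinkFilter
{
    // Split a slash-delimited rule such as "20/100/.gif" into its parts,
    // dropping empty entries.
    public static string[] ParseList(string rule)
    {
        return rule.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Keep a link when it contains none of the undesired strings and,
    // if the desired list is non-empty, at least one desired string.
    public static bool Keep(string link, string[] undesired, string[] desired)
    {
        foreach (string u in undesired)
            if (link.Contains(u))
                return false;
        if (desired.Length == 0)
            return true;
        foreach (string d in desired)
            if (link.Contains(d))
                return true;
        return false;
    }

    public static void Main()
    {
        string[] undesired = ParseList("20/100/.gif");
        string[] desired = ParseList(".jpg");
        string[] links = { "1-20x20.jpg", "1-100x100.jpg", "1.jpg.gif",
                           "1.jpg", "2.gif", "2.jpg" };

        foreach (string link in links)
            if (Keep(link, undesired, desired))
                Console.WriteLine(link); // prints 1.jpg and 2.jpg
    }
}
```

string.Contains is an ordinal, case-sensitive test, which matches the case-sensitive behavior the article describes for the string lists.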

Web sites grow through links, and the folder tree of those links defines the levels of a site. In most cases a web user is interested in one subtree of the site map and stays in that section while browsing. The following settings control this:

  1. Maximum search level from the start URL
  2. Whether to stay on the same domain
  3. Maximum link count, to avoid looping through a very large number of links for hours on a complex site and to guarantee that the list of found links gets exported
  4. A stop button to stop the search

The search result can be exported to link files. Managing a huge number of links in a link file can also be time consuming, so the tool adds the following options for managing link files:

  1. Filter the links in a link file with the filter options (desired and undesired string lists).
  2. Get the list of file names from a folder (top level only or including subfolders), remove from a link file every link whose file name already exists there, and export just the new links.

Background

Web download tools and some web converters offer some of the same options, but I developed this tool to present it as open source, ready to change if necessary, easy to use, and able to filter with as many strings as you wish (case sensitive, desired strings and undesired strings).

The idea of building this tool had always been on my mind, and once a web search showed me the capabilities of WebClient, especially in .NET Framework 3.5, I started the implementation.

Using the code

WebManagement Class:

To use web resources, we first need a web manager that implements the client operations: downloading a web page's source, extracting links, downloading a link target, using a proxy server, and handling the connection workflow.

The main class of the project is WebManagement. Its properties and methods are static for simplicity and ease of use.

Properties:

I assume that at most one download runs at a time, so one control shows the download progress visually: progressBar. One label on the user's form shows the WebManagement status: lblMessage. A file path for temporary work: TempFile.

A Boolean stop variable controls the search process: isStop.

A Boolean variable controls whether the connection uses a proxy: UsingFromProxy.

Proxy server IP address: ProxyServer

Proxy server port number: ProxyPort

C#
public static ProgressBar progressBar = null;
public static Label lblMessage = null;
public static string TempFile = null;
public static bool isStop = false;
public static bool UsingFromProxy;
public static string ProxyServer;
public static string ProxyPort; 

Note: if the connection must be established through a proxy server, the Proxy property of the WebClient is set to a new WebProxy instance created from the proxy server and port:

C#
if (UsingFromProxy == true)
    webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
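
int.Parse throws a FormatException when the port text is not a number. As a hedged sketch, the proxy could be created only after the input has been validated. The ProxyHelper class and its TryCreate method are hypothetical names for this example and are not part of the article's project.

```csharp
using System;
using System.Net;

public static class ProxyHelper
{
    // Returns a WebProxy for the given host and port text, or null when the
    // host is empty or the port is not a number between 1 and 65535.
    public static WebProxy TryCreate(string server, string portText)
    {
        int port;
        if (string.IsNullOrEmpty(server))
            return null;
        if (!int.TryParse(portText, out port) || port < 1 || port > 65535)
            return null;
        return new WebProxy(server, port);
    }

    public static void Main()
    {
        WebProxy ok = TryCreate("127.0.0.1", "8080");
        WebProxy bad = TryCreate("127.0.0.1", "abc");
        Console.WriteLine(ok != null);  // True
        Console.WriteLine(bad == null); // True
    }
}
```

A caller could then fall back to a direct connection, or show a message, when TryCreate returns null.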

Methods

Method GetFileSource(string url):

Downloads the source code of a web page (the HTML source) with webClient.DownloadString and returns it as a string. It returns an empty string if the search has been stopped or the download fails.

C#
public static string GetFileSource(string url)
{
    try
    {
        if (isStop == true)
            return "";
        // WebClient is disposable; the using block releases it after the download.
        using (WebClient webClient = new WebClient())
        {
            if (UsingFromProxy == true)
                webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
            return webClient.DownloadString(new Uri(url));
        }
    }
    catch
    {
        return "";
    }
}

Method DownloadWebFile(string url, string tarPath):

Downloads the file at url to tarPath. It attaches handlers to the DownloadFileCompleted and DownloadProgressChanged events and uses the DownloadFileAsync method of the WebClient class, so the method returns as soon as the download has started.

C#
public static bool DownloadWebFile(string url, string tarPath)
{
    try
    {

        WebClient webClient = new WebClient();
        if (UsingFromProxy == true)
            webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
        webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
        webClient.DownloadProgressChanged += new DownloadProgressChangedEventHandler(ProgressChanged);
        // The download continues in the background after this call returns,
        // so the client is disposed in Completed instead of here.
        webClient.DownloadFileAsync(new Uri(url), tarPath);
        return true;
    }
    catch
    {
        return false;
    }
}

private static void Completed(object sender, AsyncCompletedEventArgs e)
{
    // Report failure or cancellation instead of always claiming success.
    if (lblMessage != null)
        lblMessage.Text = (e.Error == null && !e.Cancelled)
            ? "Download completed!" : "Download failed!";

    // Release the WebClient that raised the event.
    WebClient webClient = sender as WebClient;
    if (webClient != null)
        webClient.Dispose();
}

private static void ProgressChanged(object sender, DownloadProgressChangedEventArgs e)
{
    if (progressBar != null)
        progressBar.Value = e.ProgressPercentage;
    if (lblMessage != null)
        lblMessage.Text = e.ProgressPercentage.ToString() + " % is completed";
}

Method List<string> FindLinks(string FileText):

This method extracts the links from a web page's source (the HTML source text) and returns them as a List<string>.

The structure of a link tag in HTML source:

<a href="URL">text to display</a>

We must collect the URLs in the source text. The Regex (regular expression) class is a practical choice for this, and extracting the URLs takes the following three steps:

  1. Find the links by matching @"(<a.*?>.*?</a>)"
  2. Extract the URL from each link by matching @"href=""(.*?)"""
  3. Remove the inner tags from the link text with Regex.Replace(value, @"\s*<.*?>\s*", "", RegexOptions.Singleline)

C#
public static List<string> FindLinks(string FileText)
{
    Application.DoEvents();
    List<string> list = new List<string>();

    // 1. Find every <a ...>...</a> element in the page source.
    MatchCollection m1 = Regex.Matches(FileText, @"(<a.*?>.*?</a>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // 2. Loop over each match.
    foreach (Match m in m1)
    {
        string value = m.Groups[1].Value;
        LinkItem i = new LinkItem();

        // 3. Get the href attribute (double-quoted values only).
        Match m2 = Regex.Match(value, @"href=""(.*?)""",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        if (!m2.Success)
            continue; // Skip anchors without an href so no null URL is returned.
        i.Href = m2.Groups[1].Value;

        // 4. Remove inner tags to get the display text.
        i.Text = Regex.Replace(value, @"\s*<.*?>\s*", "",
            RegexOptions.Singleline);

        list.Add(i.Href);
    }
    return list;
}
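
To try the extraction outside the Windows Forms project, here is a small console sketch of the same two-step idea. The class LinkExtractorDemo and the method ExtractHrefs are hypothetical names. As a variation of my own, the pattern adds \b after <a so that tags such as <abbr> are not matched. Like the original, it only handles double-quoted href values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractorDemo
{
    // Match each anchor element, then pull the double-quoted href value
    // out of it. Anchors without an href are skipped.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> hrefs = new List<string>();
        foreach (Match anchor in Regex.Matches(html, @"<a\b.*?>.*?</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            Match href = Regex.Match(anchor.Value, @"href=""(.*?)""",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            if (href.Success)
                hrefs.Add(href.Groups[1].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        string html =
            "<p><a href=\"1.jpg\">one</a> <A HREF=\"img/2.jpg\"><b>two</b></A>" +
            " <a name=\"top\">no href</a></p>";
        foreach (string h in ExtractHrefs(html))
            Console.WriteLine(h); // prints 1.jpg, then img/2.jpg
    }
}
```

Regular expressions are adequate for simple pages such as directory listings; for arbitrary HTML a real parser would be more robust.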

Method

C#
bool AddLinks(DataGridView dg, string url, int level, string file,
  string[] elseStr, string[] withStr, bool lastLevel, ArrayList PreLinks, bool justCurrentSite)

This method takes the URL url at level level together with its HTML source file, the array of undesired strings elseStr, the array of desired strings withStr, the Boolean lastLevel that tells whether this is the last level, an ArrayList PreLinks used to avoid duplicate link URLs, and the Boolean justCurrentSite that forces the search to stay on the same domain.

Steps for adding the links of a URL to the DataGridView:

  1. Extract the links with List<string> lst = FindLinks(file);
  2. Get the AbsoluteUri (absolute URL address) of url and add it to PreLinks.
  3. For each URL in lst:
    1. If the link's host equals the host of the input url and its absolute URL is longer than the parent URL, it is an active new link.
    2. If the link's host is a different host, the justCurrentSite option is false (same domain is not checked), and the link is not in the PreLinks ArrayList, it is an active new link.
    3. If the current link is an active new link, its absolute URL is added to the PreLinks ArrayList and the following conditions are checked:
      1. It does not contain any string from the undesired string list.
      2. It contains at least one string from the desired string list, if that list isn't empty.

If both checks pass, the new link's information is added to the DataGridView rows.

C#
public static bool AddLinks(DataGridView dg, string url, 
        int level, string file, string[] elseStr, string[] withStr, 
        bool lastLevel, ArrayList PreLinks, bool justCurrentSite)
{
    try
    {
        if (isStop == true)
            return false;
        bool ret = false;
        List<string> lst = FindLinks(file);
        Uri ba = new Uri(url);
        PreLinks.Add(ba.AbsoluteUri);
        for (int i = 0; i < lst.Count; i++)
        {
            Application.DoEvents();
            if (isStop == true)
                return false;
            Uri ur = new Uri(ba, lst[i]);
            bool ActiveLink = false;
            if (ur.AbsoluteUri.Length > url.Length && (ur.Host == ba.Host))
                ActiveLink = true;
            if (justCurrentSite == false)
                // PreLinks is not kept sorted, so use Contains rather than BinarySearch.
                if ((ur.Host != ba.Host) && !PreLinks.Contains(ur.AbsoluteUri))
                    ActiveLink = true;
            if (ActiveLink == true)
            {
                PreLinks.Add(ur.AbsoluteUri);
                bool flag = true;
                string ft = "p";
                string[] sec = lst[i].Split('/');
                if (sec.Length > 0)
                    if (sec[sec.Length - 1] != null)
                        if (sec[sec.Length - 1].Contains(".") == true)
                        {
                            ft = "f"; // last path segment has an extension: treat it as a file link
                        }
                if (ft == "f" || lastLevel == true)
                {
                    for (int k = 0; k < elseStr.Length; k++)
                        if (lst[i].Contains(elseStr[k]) == true && elseStr[k] != "")
                        {
                            flag = false;
                            break;
                        }
                    if (flag == true && withStr.Length > 0)
                    {
                        flag = false;
                        for (int m = 0; m < withStr.Length; m++)
                            if (lst[i].Contains(withStr[m]) == true)
                            {
                                flag = true;
                                break;
                            }
                    }

                }
                if (flag == true)
                {
                    Application.DoEvents();
                    dg.Rows.Add();
                    int j = dg.Rows.Count - 1;
                    dg.Rows[j].Cells[5].Value = ft;
                    if (ft == "p")
                        ret = true;
                    dg.Rows[j].Cells[0].Value = i + 1;
                    dg.Rows[j].Cells["Link"].Value = ur.AbsoluteUri;
                    dg.Rows[j].Cells[4].Value = level;
                }
            }
        }
        return ret;
    }
    catch (Exception exp)
    {
        MessageBox.Show(exp.Message);
        return true;
    }
}
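
AddLinks relies on the Uri(Uri, string) constructor to turn each href into an absolute URL before comparing hosts. The following sketch shows that step in isolation. UriDemo and IsSameHost are hypothetical names, and the example.com addresses are placeholders.

```csharp
using System;

public static class UriDemo
{
    // Resolve href against the page it was found on and report whether the
    // resulting absolute URL stays on the same host as the page.
    public static bool IsSameHost(string pageUrl, string href, out string absolute)
    {
        Uri page = new Uri(pageUrl);
        Uri link = new Uri(page, href);
        absolute = link.AbsoluteUri;
        return link.Host == page.Host;
    }

    public static void Main()
    {
        string abs;
        bool same = IsSameHost("http://example.com/gallery/", "img/1.jpg", out abs);
        Console.WriteLine(abs + " " + same); // http://example.com/gallery/img/1.jpg True

        same = IsSameHost("http://example.com/gallery/",
            "http://other.example.org/a.jpg", out abs);
        Console.WriteLine(abs + " " + same); // http://other.example.org/a.jpg False
    }
}
```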

Method

void GetLinks(DataGridView dg, string url, string fileTypes, string[] elseStr, string[] withStr, int level, bool justCurrentSite, long Maxcnt):

Starting from the URL url, this method finds links using fileTypes (reserved for a future implementation), the array of undesired strings elseStr, the array of desired strings withStr, and a maximum depth of level levels. The Boolean justCurrentSite forces the search to stay on the same domain, and Maxcnt limits the total link count (unlimited if it is zero).

This is the manager method that collects the links of web pages under all of these conditions.

Process steps:

  1. An ArrayList, PreLinks, is created to hold the distinct links found so far.
  2. A Boolean variable flag controls the main loop: it records whether the inner loop produced a new page link, so the loop can end before the max level is reached.
  3. Loop over the levels.
  4. If the DataGridView Rows list is empty, start from the input url with the following function call:

C#
    flag=AddLinks(dg, url, f + 1, GetFileSource(url), elseStr, 
      withStr, (f == level - 1),PreLinks,justCurrentSite);

     Otherwise (if the DataGridView Rows list is not empty), cnt is set to Rows.Count.

  5. Loop over the rows from row 0 to cnt-1.
  6. If the current row is a folder and not a final file link (the LinkisFile column of the DataGridView has the value p), call AddLinks for the current row.
  7. Remove the current row, because its links have been added.
C#
public static void GetLinks(DataGridView dg, string url, 
     string fileTypes, string[] elseStr, string[] withStr, int level, bool justCurrentSite, long Maxcnt)
{
    if (isStop == true)
        return;
    Application.DoEvents();
    ArrayList PreLinks = new ArrayList((int)(Maxcnt == 0 ? 1000 : Maxcnt) + 1);
    bool flag = true;
    for (int f = 0; f < level && flag; f++)
    {
        if (isStop == true)
            return;

        flag = false;
        int cnt = dg.Rows.Count;
        if (f == 0 && dg.Rows.Count == 0)
        {
            Application.DoEvents();
            flag = AddLinks(dg, url, f + 1, GetFileSource(url), elseStr, 
              withStr, (f == level - 1), PreLinks, justCurrentSite);
        }
        else
        {
            int r = 0;
            while (r < cnt)
            {
                object type = dg.Rows[r].Cells[5].Value;
                if (type != null && type.ToString() == "p")
                {
                    flag = true;
                    Application.DoEvents();
                    AddLinks(dg, dg.Rows[r].Cells[1].Value.ToString(), f + 1, 
                      GetFileSource(dg.Rows[r].Cells[1].Value.ToString()), 
                      elseStr, withStr, (f == level - 1), PreLinks, justCurrentSite);
                    dg.Rows.RemoveAt(r);
                    cnt--;
                }
                else
                    r++; // also skips rows whose type cell is empty, so the loop always advances
            }
            if (Maxcnt > 0 && r >= Maxcnt)
            {
                flag = false;
                break;
            }
        }
    }
}
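
The level loop in GetLinks can be viewed as a breadth-first traversal with a depth limit and a link cap. The sketch below shows that idea without a DataGridView or network access. It is a simplification under stated assumptions, not the article's implementation: CrawlSketch and Crawl are hypothetical names, the pages dictionary stands in for downloading a page and extracting its links, and both folder and file links are collected.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Visit pages level by level, starting from start, for at most maxLevel
    // levels and maxCount links in total (0 means no limit).
    public static List<string> Crawl(Dictionary<string, string[]> pages,
        string start, int maxLevel, int maxCount)
    {
        List<string> found = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        seen.Add(start);
        List<string> current = new List<string>();
        current.Add(start);

        for (int level = 0; level < maxLevel && current.Count > 0; level++)
        {
            List<string> next = new List<string>();
            foreach (string page in current)
            {
                string[] links;
                if (!pages.TryGetValue(page, out links))
                    continue; // a file link, or a page with no known links
                foreach (string link in links)
                {
                    if (!seen.Add(link))
                        continue; // already collected
                    found.Add(link);
                    next.Add(link);
                    if (maxCount > 0 && found.Count >= maxCount)
                        return found;
                }
            }
            current = next;
        }
        return found;
    }

    public static void Main()
    {
        Dictionary<string, string[]> pages = new Dictionary<string, string[]>();
        pages["/"] = new[] { "/a/", "/b/" };
        pages["/a/"] = new[] { "/a/1.jpg", "/" };
        pages["/b/"] = new[] { "/b/c/" };
        pages["/b/c/"] = new[] { "/b/c/2.jpg" };

        // Two levels, no link cap: prints /a/, /b/, /a/1.jpg, /b/c/
        Console.WriteLine(string.Join(", ", Crawl(pages, "/", 2, 0).ToArray()));
    }
}
```

The seen set plays the role of PreLinks: the link back to "/" from "/a/" is ignored, which is what prevents loops.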

The LinkItem data type holds the data of one link:

C#
public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

Important form methods:

Method btnDelFileLinks_Click(object sender, EventArgs e):

To filter the links of a link file, this method gets the source link file name and the target link file name, applies the undesired and desired string lists to the source link file, and saves the links that pass the filter to the target.

To use this capability:

  • Write the source file name in the URL text box, or leave it empty to choose the file with an open dialog
  • Write the undesired strings (and, optionally, the desired strings), delimited by the / character
  • Click the "Remove Links From File" button
C#
private void btnDelFileLinks_Click(object sender, EventArgs e)
{
    try
    {
        if (txtElse.Text == "" && txtWith.Text == "")
        {
            MessageBox.Show("No Condition !");
            return;
        }
        string[] elseStr = txtElse.Text.Split('/');
        string[] withStr = txtWith.Text.Split('/');

        // Ask for the source link file if the URL text box is empty.
        if (txtURL.Text == "")
        {
            OpenFileDialog op = new OpenFileDialog();
            if (op.ShowDialog() == DialogResult.OK)
                txtURL.Text = op.FileName;
        }
        if (txtURL.Text == "")
            return;

        SaveFileDialog sv = new SaveFileDialog();
        if (sv.ShowDialog() != DialogResult.OK)
            return;

        // The using blocks close both files even if an exception is thrown.
        using (StreamReader sr = new StreamReader(txtURL.Text))
        using (StreamWriter sw = new StreamWriter(sv.FileName))
        {
            while (sr.EndOfStream == false)
            {
                string li = sr.ReadLine();
                bool flag = true;

                // Reject the line if it contains any undesired string.
                for (int k = 0; k < elseStr.Length; k++)
                    if (elseStr[k] != "" && li.Contains(elseStr[k]))
                    {
                        flag = false;
                        break;
                    }

                // Otherwise keep it only if it contains a desired string.
                // An empty desired entry matches every line.
                if (flag == true && withStr.Length > 0)
                {
                    flag = false;
                    for (int m = 0; m < withStr.Length; m++)
                        if (li.Contains(withStr[m]))
                        {
                            flag = true;
                            break;
                        }
                }
                if (flag == true)
                    sw.WriteLine(li);
            }
        }
        MessageBox.Show("Exported to " + sv.FileName);
    }
    catch
    {
        MessageBox.Show("Error");
    }
}

Method btnDiffLinks_Click(object sender, EventArgs e):

This method exports the new links from a source link file after removing the links whose file names already exist in the selected folder (top level only when "just top level" is checked, otherwise including subfolders).

C#
private void btnDiffLinks_Click(object sender, EventArgs e)
{
    OpenFileDialog opfd1 = new OpenFileDialog();
    if (opfd1.ShowDialog()==DialogResult.OK)
    {
        StreamReader sr = new StreamReader(opfd1.FileName);
        if (sr != null)
        {
            FolderBrowserDialog fbd=new FolderBrowserDialog();
            if (fbd.ShowDialog() == DialogResult.OK)
            {
                string inFolder = fbd.SelectedPath;
                SaveFileDialog sfd = new SaveFileDialog();
                if (sfd.ShowDialog() == DialogResult.OK)
                {
                    StreamWriter sw = new StreamWriter(sfd.FileName);
                    if (sw != null)
                    {
                        sw.AutoFlush = true;
                        string[] files = sr.ReadToEnd().Split(
                          new string[] { "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
                        SortedList<string, string> SLfnames = new SortedList<string, string>();
                        string[] ExistFiles;
                        if (chIsJustInTop.Checked == true)
                            ExistFiles = Directory.GetFiles(inFolder, "*", SearchOption.TopDirectoryOnly);
                        else
                            ExistFiles = Directory.GetFiles(inFolder, "*", SearchOption.AllDirectories);
                        for(int i=0;i<ExistFiles.Length;i++)
                        {
                            ExistFiles[i] = Path.GetFileName(ExistFiles[i]);
                            // SortedList.Add throws on a duplicate key, so skip file names
                            // that were already added (for example from another subfolder).
                            if (!string.IsNullOrEmpty(ExistFiles[i]) && !SLfnames.ContainsKey(ExistFiles[i].ToLower()))
                              SLfnames.Add(ExistFiles[i].ToLower(),ExistFiles[i]);
                        }
                        for (int i = 0; i < files.Length; i++)
                        {
                            string fname = Path.GetFileName(files[i]);
                            if (fname != "" && fname != null)
                            {
                                if (SLfnames.IndexOfKey(fname.ToLower())<0)
                                        sw.WriteLine(files[i]);
                            }
                        }
                        sw.Close();
                    }
                }
            }
            sr.Close();
            MessageBox.Show("Successfully Done!");
        }
    }
}
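
The comparison at the heart of this method can also be written as a small pure function, which makes it easy to try without dialogs or files. This is a sketch under stated assumptions: LinkDiff and NewLinks are hypothetical names, and a HashSet with a case-insensitive comparer is swapped in for the lower-cased SortedList keys used above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LinkDiff
{
    // Return the links whose file name (compared case-insensitively) is not
    // among the existing file names.
    public static List<string> NewLinks(IEnumerable<string> links,
        IEnumerable<string> existingFileNames)
    {
        HashSet<string> existing = new HashSet<string>(
            existingFileNames, StringComparer.OrdinalIgnoreCase);
        List<string> result = new List<string>();
        foreach (string link in links)
        {
            string name = Path.GetFileName(link);
            if (!string.IsNullOrEmpty(name) && !existing.Contains(name))
                result.Add(link);
        }
        return result;
    }

    public static void Main()
    {
        string[] links =
        {
            "http://example.com/img/1.jpg",
            "http://example.com/img/2.JPG",
            "http://example.com/img/3.jpg"
        };
        // Duplicate names, as found when scanning subfolders, are harmless here.
        string[] onDisk = { "1.jpg", "2.jpg", "2.jpg" };

        foreach (string link in NewLinks(links, onDisk))
            Console.WriteLine(link); // prints http://example.com/img/3.jpg
    }
}
```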

Connection Settings:

  Image 2

The connection to the target URL may go through a proxy server. To use this ability, check the "Using From Proxy Server" check box, then enter the IP address in the first text box and the port number in the second:

 Image 3

For a direct connection to the target URL, uncheck this option.

Notes:

  • When the DataGridView rows are not empty and you click the "Get links" button, the URL text box value is only a reference URL, not the starting point.
  • Extracting a web site's links usually takes a long time, so the max level and max count values are very important.
  • The without-string and with-string text boxes (undesired and desired strings) can be left empty or contain any characters, including spaces, but the parts must be delimited by the / (slash) character. Example string list: 120/hello/direct game/book/@
  • The string lists typed in the without-string and with-string text boxes are case sensitive.
  • This tool is well suited to sites that have directory listing enabled, especially for web site builders.

Points of Interest

I created this tool to avoid wasting my time, and I reached my goal; I hope it is useful for you too.

My interest is in combining a sophisticated download tool with an intelligent web link filter that saves users' time and gets the best results.

History

Link Manager version 1.0 was implemented in 2012.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)