Introduction
As web sites and their services grow, users who want to find and filter particular links have to check every link for specific words, strings, or patterns that decide whether the link is selected or rejected. On gallery sites or multi-level sites with many similar links, this takes a long time and wastes the user's effort.
If the user can filter a site's links with a few rules, the desired links can be extracted with high accuracy.
For example, suppose a web page contains the following links:
1-20x20.jpg
1-100x100.jpg
1.jpg.gif
1.jpg
2-20x20.jpg
2-100x100.jpg
2.gif
2.jpg
.
.
.
1000-20x20.jpg
1000.jpg.gif
1000-100x100.jpg
1000.jpg
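If the desired links are just 1.jpg, 2.jpg, … , 999.jpg, 1000.jpg, we need to filter the page's links: delete the thumbnail and .gif links and keep the remaining list.
Rules:
Delete every link that contains at least one of the strings "20x20", "100x100" or ".gif", and keep only links that contain the string ".jpg" (without quotes). The shorter strings "20" and "100" would not work here, because they also match wanted links such as 20.jpg, 100.jpg and 1000.jpg.
These rules are written as string lists delimited by / (the slash character): 20x20/100x100/.gif is the undesired string list, so links that contain any of these strings are excluded from the final list, and .jpg is the desired string list, so every link in the final list must contain .jpg. Both the undesired and the desired list can hold as many strings as needed.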
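The rules above can be sketched as a small, self-contained filter. This is only an illustration under stated assumptions: the names LinkFilterSketch, ParseList, Passes and Filter are hypothetical and are not part of the tool. Unlike the tool's own code, the sketch drops empty list entries when it splits the rule string. It uses the strings 20x20 and 100x100 because a plain 100 would also match 1000.jpg.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilterSketch
{
    // Splits a slash-delimited rule such as "20x20/100x100/.gif" into its parts.
    // Empty entries are dropped, so an empty text box yields an empty list.
    public static string[] ParseList(string text)
    {
        return (text ?? "").Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // A link passes when it contains none of the undesired strings and,
    // if the desired list is not empty, at least one desired string.
    // string.Contains is case sensitive, like the filter described in the article.
    public static bool Passes(string link, string[] undesired, string[] desired)
    {
        foreach (string s in undesired)
            if (link.Contains(s)) return false;
        if (desired.Length == 0) return true;
        foreach (string s in desired)
            if (link.Contains(s)) return true;
        return false;
    }

    // Applies both rule strings to a list of links and returns the links that pass.
    public static List<string> Filter(IEnumerable<string> links, string undesiredText, string desiredText)
    {
        string[] undesired = ParseList(undesiredText);
        string[] desired = ParseList(desiredText);
        var kept = new List<string>();
        foreach (string link in links)
            if (Passes(link, undesired, desired)) kept.Add(link);
        return kept;
    }
}
```

Calling LinkFilterSketch.Filter(links, "20x20/100x100/.gif", ".jpg") on the example list keeps only the plain N.jpg links.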
Web sites grow through links, and the folder tree of those links defines the site's levels. In most cases a user is interested in one subtree of the site map and wants to stay inside it while browsing. The following settings control this:
- Max search level from the start URL
- Stay on the same domain or not
- Max link count, which avoids endless loops on large, complex sites, stops the search from collecting a huge number of links after hours of searching, and guarantees that the links found so far can be exported
- Stop searching function (button) to stop the search
The search result can be exported to link files. Managing a huge number of links in a link file can also be time consuming, so this tool adds the following options for managing link files:
- Filter the links in a link file with the filter options (desired and undesired string lists).
- Get the list of file names in a folder (top level only or including subfolders), remove from a link file every link whose file name already exists in that folder, and export just the new links.
Background
Web download tools and some web converters have similar options, but I developed this tool to publish it as open source, so that it can be changed when necessary, is easy to use, and filters with as many strings as you wish (case sensitive, desired strings and undesired strings).
The idea of building this tool had been on my mind for a long time. After searching the web I found out about the capabilities of WebClient, especially in .NET Framework 3.5, and started the implementation.
Using the code
WebManagement Class:
To use web resources, we first need a web manager that implements client operations such as downloading a page's source, extracting links, downloading a link's target, using a proxy server, and handling the connection workflow.
The project's main class is WebManagement. Its properties and methods are implemented as static for simplicity and ease of use.
Properties:
I assume that at most one download runs at a time, so one control shows the download progress visually: progressBar.
A label on the user's form shows the WebManagement status: lblMessage.
A file path for temporary work: TempFile.
A Boolean stop variable controls the search process: isStop.
A Boolean variable controls whether the connection uses a proxy: UsingFromProxy.
Proxy server IP address: ProxyServer
Proxy server port number: ProxyPort
public static ProgressBar progressBar = null;
public static Label lblMessage = null;
public static string TempFile = null;
public static bool isStop = false;
public static bool UsingFromProxy;
public static string ProxyServer;
public static string ProxyPort;
Note: if the connection must go through a proxy server, the Proxy property of the WebClient is set to a new WebProxy instance built from the proxy server and port:
if (UsingFromProxy == true)
    webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
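Because int.Parse throws when the port text is not a number, it can help to validate the proxy settings before building the WebProxy. The following is a minimal sketch of that idea, not the tool's own code; ProxySketch and TryCreateProxy are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;

public static class ProxySketch
{
    // Returns a WebProxy for the given host and port text, or null when the
    // proxy is disabled, the host is empty, or the port is not a valid TCP port.
    public static WebProxy TryCreateProxy(bool useProxy, string server, string portText)
    {
        if (!useProxy || string.IsNullOrEmpty(server))
            return null;
        int port;
        if (!int.TryParse(portText, out port) || port < 1 || port > 65535)
            return null;
        return new WebProxy(server, port);
    }
}
```

A caller could assign the result to webClient.Proxy only when it is not null and show a message otherwise.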
Methods
Method GetFileSource(string url):
Downloads a web page's source code (HTML source) with webClient.DownloadString and returns it as a string. It returns an empty string when the search has been stopped or the download fails.
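public static string GetFileSource(string url)
{
    try
    {
        if (isStop == true)
            return "";
        // Dispose the client as soon as the synchronous download finishes.
        using (WebClient webClient = new WebClient())
        {
            if (UsingFromProxy == true)
                webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
            return webClient.DownloadString(new Uri(url));
        }
    }
    catch
    {
        // Any failure (bad URL, network error, invalid proxy port) yields an empty page.
        return "";
    }
}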
Method DownloadWebFile(string url, string tarPath):
Downloads the file at url to tarPath with the DownloadFileAsync method of the WebClient class. Two events are connected to handlers: DownloadFileCompleted and DownloadProgressChanged.
public static bool DownloadWebFile(string url, string tarPath)
{
    try
    {
        WebClient webClient = new WebClient();
        if (UsingFromProxy == true)
            webClient.Proxy = new WebProxy(ProxyServer, int.Parse(ProxyPort));
        webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
        webClient.DownloadProgressChanged += new DownloadProgressChangedEventHandler(ProgressChanged);
        // The download continues after this call returns, so the client should
        // not be disposed here; Completed disposes it when the download ends.
        webClient.DownloadFileAsync(new Uri(url), tarPath);
        return true;
    }
    catch
    {
        return false;
    }
}
private static void Completed(object sender, AsyncCompletedEventArgs e)
{
    // The sender of DownloadFileCompleted is the WebClient that ran the download.
    WebClient webClient = sender as WebClient;
    if (webClient != null)
        webClient.Dispose();
    if (lblMessage != null)
        lblMessage.Text = (e.Error == null && !e.Cancelled) ? "Download completed!" : "Download failed!";
}
private static void ProgressChanged(object sender, DownloadProgressChangedEventArgs e)
{
    // Check each control separately; either one may be left unset.
    if (progressBar != null)
        progressBar.Value = e.ProgressPercentage;
    if (lblMessage != null)
        lblMessage.Text = e.ProgressPercentage.ToString() + " % is completed";
}
Method List<string> FindLinks(string FileText):
This method extracts the links from a web page's source (HTML source text) and returns them as a List<string>.
The structure of a link tag in HTML source:
<a href=URL > text to display </a>
We must collect the URLs in the source text. The Regex (regular expression) class is the best choice for that, and the following three steps extract the URLs from the page source:
- Find the links by matching with @"(<a.*?>*?</a>)"
- Extract the URL from each link by matching with @"href=""(.*?)"""
- Remove the inner tags from the link's display text with the function:
Regex.Replace(value, @"\s*<.*?>\s*", "",RegexOptions.Singleline);
public static List<string> FindLinks(string FileText)
{
    Application.DoEvents();
    List<string> list = new List<string>();
    // 1.
    // Find all anchor elements in the page source.
    MatchCollection m1 = Regex.Matches(FileText, @"(<a.*?>*?</a>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // 2.
    // Loop over each match.
    foreach (Match m in m1)
    {
        string value = m.Groups[1].Value;
        LinkItem i = new LinkItem();
        // 3.
        // Get href attribute.
        Match m2 = Regex.Match(value, @"href=""(.*?)""",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        if (m2.Success)
        {
            i.Href = m2.Groups[1].Value;
        }
        // 4.
        // Remove inner tags from text.
        string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
            RegexOptions.Singleline);
        i.Text = t;
        // Skip anchors without a quoted href so the list never holds null.
        if (m2.Success)
            list.Add(i.Href);
    }
    return list;
}
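To see the extraction steps on a concrete input, here is a small sketch that runs without Windows Forms. It is a variant, not the method above: HrefSketch and ExtractHrefs are hypothetical names, and the anchor pattern uses \b so that tags such as <abbr> are not treated as links.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefSketch
{
    // Returns the value of every double-quoted href found inside an <a>...</a> element.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        // Step 1: find each anchor element, ignoring case and spanning line breaks.
        MatchCollection anchors = Regex.Matches(html, @"<a\b[^>]*>.*?</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match a in anchors)
        {
            // Step 2: pull the href attribute out of the anchor.
            Match href = Regex.Match(a.Value, @"href\s*=\s*""([^""]*)""",
                RegexOptions.IgnoreCase);
            // Anchors without an href (for example named anchors) are skipped.
            if (href.Success)
                result.Add(href.Groups[1].Value);
        }
        return result;
    }
}
```

Regular expressions are a pragmatic choice for simple pages such as directory listings; for irregular HTML (unquoted or single-quoted attributes, nested markup) an HTML parser would be more reliable.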
Method
bool AddLinks(DataGridView dg,string url,int level,string file,
string [] elseStr,string [] withStr,bool lastLevel,ArrayList PreLinks ,bool justCurrentSite)
This method takes the URL url at level level together with its HTML source file, the array of without strings elseStr, the array of with strings withStr, the Boolean lastLevel that tells whether this is the last level, the ArrayList PreLinks that prevents duplicate link URLs, and the Boolean justCurrentSite that forces the search to stay on the same domain or not.
Steps for adding the links of a URL to the DataGridView:
- Extract the links with List<string> lst = FindLinks(file);
- Get the AbsoluteUri of url (its absolute URL address) and add it to PreLinks
- For each URL in lst
- If the link's absolute URL is longer than its parent URL and its host equals the host of the input url, it is an active new link.
- If the link's host is a different host, the justCurrentSite option is false (the same domain box is not checked) and the link is not in the PreLinks ArrayList, it is an active new link.
- If the current link is an active new link, its absolute URL is added to the PreLinks ArrayList and the following conditions are checked:
- Check that it contains none of the strings in the without string list.
- Check that it contains at least one of the strings in the with string list, if that list isn't empty.
If both checks pass, the new link's information is added to the DataGridView Rows.
public static bool AddLinks(DataGridView dg, string url,
    int level, string file, string[] elseStr, string[] withStr,
    bool lastLevel, ArrayList PreLinks, bool justCurrentSite)
{
    try
    {
        if (isStop == true)
            return false;
        bool ret = false;
        List<string> lst = FindLinks(file);
        Uri ba = new Uri(url);
        PreLinks.Add(ba.AbsoluteUri);
        for (int i = 0; i < lst.Count; i++)
        {
            Application.DoEvents();
            if (isStop == true)
                return false;
            // Resolve relative links against the page URL.
            Uri ur = new Uri(ba, lst[i]);
            bool ActiveLink = false;
            if (ur.AbsoluteUri.Length > url.Length && (ur.Host == ba.Host))
                ActiveLink = true;
            // PreLinks is not kept sorted, so use Contains rather than BinarySearch.
            if (justCurrentSite == false)
                if ((ur.Host != ba.Host) && !PreLinks.Contains(ur.AbsoluteUri))
                    ActiveLink = true;
            if (ActiveLink == true)
            {
                PreLinks.Add(ur.AbsoluteUri);
                bool flag = true;
                // "p" marks a page (folder) to search further, "f" marks a file.
                string ft = "p";
                string[] sec = lst[i].Split('/');
                if (sec.Length > 0)
                    if (sec[sec.Length - 1] != null)
                        if (sec[sec.Length - 1].Contains(".") == true)
                        {
                            // Was "", which the "f" test below could never match.
                            ft = "f";
                        }
                if (ft == "f" || lastLevel == true)
                {
                    // Reject the link if it contains any without string.
                    for (int k = 0; k < elseStr.Length; k++)
                        if (lst[i].Contains(elseStr[k]) == true && elseStr[k] != "")
                        {
                            flag = false;
                            break;
                        }
                    // Require at least one with string.
                    if (flag == true && withStr.Length > 0)
                    {
                        flag = false;
                        for (int m = 0; m < withStr.Length; m++)
                            if (lst[i].Contains(withStr[m]) == true)
                            {
                                flag = true;
                                break;
                            }
                    }
                }
                if (flag == true)
                {
                    Application.DoEvents();
                    dg.Rows.Add();
                    int j = dg.Rows.Count - 1;
                    dg.Rows[j].Cells[5].Value = ft;
                    if (ft == "p")
                        ret = true;
                    dg.Rows[j].Cells[0].Value = i + 1;
                    dg.Rows[j].Cells["Link"].Value = ur.AbsoluteUri;
                    dg.Rows[j].Cells[4].Value = level;
                }
            }
        }
        return ret;
    }
    catch (Exception exp)
    {
        MessageBox.Show(exp.Message);
        return true;
    }
}
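The host comparison in AddLinks relies on the Uri(Uri, string) constructor, which resolves a relative link against the page it was found on. The sketch below shows that behaviour in isolation. LinkScopeSketch, Resolve and IsSameHost are hypothetical helper names, and example.com and other.org are placeholder hosts.

```csharp
using System;

public static class LinkScopeSketch
{
    // Resolves a link found on a page against the page URL.
    // Absolute links are returned unchanged; relative ones are combined with the base.
    public static Uri Resolve(string pageUrl, string link)
    {
        return new Uri(new Uri(pageUrl), link);
    }

    // True when the resolved link is on the same host as the page,
    // which is the test behind the "stay on same domain" option.
    public static bool IsSameHost(string pageUrl, string link)
    {
        return Resolve(pageUrl, link).Host == new Uri(pageUrl).Host;
    }
}
```

For example, on the page http://example.com/gallery/ the link img/1.jpg resolves to http://example.com/gallery/img/1.jpg and stays on the same host, while http://other.org/x.jpg does not.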
Method
void GetLinks(DataGridView dg, string url, string fileTypes, string[] elseStr, string[] withStr, int level, bool justCurrentSite, long Maxcnt):
This method starts from the URL url and finds links. Its inputs are fileTypes (reserved for a future implementation), the array of without strings elseStr, the array of with strings withStr, the maximum level level, the Boolean justCurrentSite that forces the search to stay on the same domain or not, and Maxcnt, the limit on the total link count (unlimited if it is zero).
This is the manager method that gets the links of web pages under all of these conditions.
Process steps:
- The ArrayList PreLinks is created to hold the distinct links found so far.
- A Boolean variable flag tells the main loop whether the inner loop operations produced a new link (so the loop can end before the max level is reached).
- Loop over the level count
- If the DataGridView Rows list is empty, start from the input url with the following function call:
flag=AddLinks(dg, url, f + 1, GetFileSource(url), elseStr,
withStr, (f == level - 1),PreLinks,justCurrentSite);
Otherwise (if the DataGridView Rows list is not empty), set cnt to Rows.Count.
- Loop from row number zero to cnt-1
- If the current row is a folder and not a final file link (check the LinkisFile column of the DataGridView for the value p), call AddLinks for the current row.
- Remove the current row, because its links have been added.
public static void GetLinks(DataGridView dg, string url,
    string fileTypes, string[] elseStr, string[] withStr, int level, bool justCurrentSite, long Maxcnt)
{
    if (isStop == true)
        return;
    Application.DoEvents();
    ArrayList PreLinks = new ArrayList((int)(Maxcnt == 0 ? 1000 : Maxcnt) + 1);
    bool flag = true;
    for (int f = 0; f < level && flag; f++)
    {
        if (isStop == true)
            return;
        flag = false;
        int cnt = dg.Rows.Count;
        if (f == 0 && dg.Rows.Count == 0)
        {
            // Empty grid: start from the input URL.
            Application.DoEvents();
            flag = AddLinks(dg, url, f + 1, GetFileSource(url), elseStr,
                withStr, (f == level - 1), PreLinks, justCurrentSite);
        }
        else
        {
            int r = 0;
            while (r < cnt)
            {
                object type = dg.Rows[r].Cells[5].Value;
                if (type != null && type.ToString() == "p")
                {
                    // Expand a page row into its links, then remove the row.
                    flag = true;
                    Application.DoEvents();
                    AddLinks(dg, dg.Rows[r].Cells[1].Value.ToString(), f + 1,
                        GetFileSource(dg.Rows[r].Cells[1].Value.ToString()),
                        elseStr, withStr, (f == level - 1), PreLinks, justCurrentSite);
                    dg.Rows.RemoveAt(r);
                    cnt--;
                }
                else
                    // Also advance past rows whose type cell is empty;
                    // otherwise the loop would never end.
                    r++;
            }
            if (Maxcnt > 0 && r >= Maxcnt)
            {
                flag = false;
                break;
            }
        }
    }
}
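The level-by-level search can also be sketched without the DataGridView, which makes the roles of the max level, the max count, and the list of already seen links easier to follow. This is a simplified illustration and not the tool's implementation: CrawlSketch and Crawl are hypothetical names, getLinks is an assumed callback that returns the absolute URLs found on a page, and every link is treated as a page (the tool distinguishes pages from files).

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Collects links level by level, starting from startUrl.
    // maxLevel limits the depth; maxCount limits the total (0 means unlimited).
    public static List<string> Crawl(string startUrl, int maxLevel, int maxCount,
        Func<string, IEnumerable<string>> getLinks)
    {
        var seen = new HashSet<string> { startUrl };   // plays the role of PreLinks
        var found = new List<string>();
        var current = new List<string> { startUrl };   // pages of the current level
        for (int level = 0; level < maxLevel && current.Count > 0; level++)
        {
            var next = new List<string>();
            foreach (string page in current)
            {
                foreach (string link in getLinks(page))
                {
                    if (!seen.Add(link))
                        continue;                      // already collected, skip it
                    found.Add(link);
                    next.Add(link);
                    if (maxCount > 0 && found.Count >= maxCount)
                        return found;                  // link limit reached
                }
            }
            current = next;                            // go one level deeper
        }
        return found;
    }
}
```

Passing the page fetch as a callback keeps the sketch testable with an in-memory link map, so no network access is needed to try it.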
Data Type LinkItem for link data structure:
public struct LinkItem
{
    public string Href;
    public string Text;
    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}
Important form methods:
Method btnDelFileLinks_Click(object sender, EventArgs e):
To filter the links of a link file, this method gets the source link file name and the target link file name, applies the without string list and the with string list to the source link file, and saves the filtered links to the target.
To use this capability:
- write the source file name in the URL textbox, or leave it empty to choose the file path in an open dialog
- write the without strings and with strings, delimited by the / character
- click the "Remove Links From File" button
private void btnDelFileLinks_Click(object sender, EventArgs e)
{
    try
    {
        if (txtElse.Text != "" || txtWith.Text != "")
        {
            string[] elseStr = txtElse.Text.Split('/');
            string[] withStr = txtWith.Text.Split('/');
            if (txtURL.Text == "")
            {
                OpenFileDialog op = new OpenFileDialog();
                if (op.ShowDialog() == DialogResult.OK)
                {
                    txtURL.Text = op.FileName;
                }
            }
            if (txtURL.Text != "")
            {
                SaveFileDialog sv = new SaveFileDialog();
                if (sv.ShowDialog() == DialogResult.OK)
                {
                    // using closes both files even if an exception is thrown.
                    using (StreamReader sr = new StreamReader(txtURL.Text))
                    using (StreamWriter sw = new StreamWriter(sv.FileName))
                    {
                        while (sr.EndOfStream == false)
                        {
                            string li = sr.ReadLine();
                            bool flag = true;
                            // Reject the line if it contains any without string.
                            for (int k = 0; k < elseStr.Length; k++)
                                if (li.Contains(elseStr[k]) == true && elseStr[k] != "")
                                {
                                    flag = false;
                                    break;
                                }
                            // Require at least one with string.
                            if (flag == true && withStr.Length > 0)
                            {
                                flag = false;
                                for (int m = 0; m < withStr.Length; m++)
                                    if (li.Contains(withStr[m]) == true)
                                    {
                                        flag = true;
                                        break;
                                    }
                            }
                            if (flag == true)
                            {
                                sw.WriteLine(li);
                            }
                        }
                    }
                    MessageBox.Show("Exported to " + sv.FileName);
                }
            }
        }
        else
            MessageBox.Show("No Condition !");
    }
    catch
    {
        MessageBox.Show("Error");
    }
}
Method
btnDiffLinks_Click(object sender, EventArgs e):
This method exports the new links of a source link file after removing the links whose file names already exist in the selected folder (top level only when "just top level" is checked, or including subfolders when it is not checked).
private void btnDiffLinks_Click(object sender, EventArgs e)
{
    OpenFileDialog opfd1 = new OpenFileDialog();
    if (opfd1.ShowDialog() == DialogResult.OK)
    {
        StreamReader sr = new StreamReader(opfd1.FileName);
        if (sr != null)
        {
            FolderBrowserDialog fbd = new FolderBrowserDialog();
            if (fbd.ShowDialog() == DialogResult.OK)
            {
                string inFolder = fbd.SelectedPath;
                SaveFileDialog sfd = new SaveFileDialog();
                if (sfd.ShowDialog() == DialogResult.OK)
                {
                    StreamWriter sw = new StreamWriter(sfd.FileName);
                    if (sw != null)
                    {
                        sw.AutoFlush = true;
                        // One link per line in the source link file.
                        string[] files = sr.ReadToEnd().Split(
                            new string[] { "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
                        SortedList<string, string> SLfnames = new SortedList<string, string>();
                        string[] ExistFiles;
                        if (chIsJustInTop.Checked == true)
                            ExistFiles = Directory.GetFiles(inFolder, "*", SearchOption.TopDirectoryOnly);
                        else
                            ExistFiles = Directory.GetFiles(inFolder, "*", SearchOption.AllDirectories);
                        for (int i = 0; i < ExistFiles.Length; i++)
                        {
                            ExistFiles[i] = Path.GetFileName(ExistFiles[i]);
                            // The same file name can occur in several subfolders;
                            // SortedList.Add throws on a duplicate key, so add each name once.
                            if (ExistFiles[i] != "" && ExistFiles[i] != null
                                && !SLfnames.ContainsKey(ExistFiles[i].ToLower()))
                                SLfnames.Add(ExistFiles[i].ToLower(), ExistFiles[i]);
                        }
                        // Export only links whose file name is not in the folder yet.
                        for (int i = 0; i < files.Length; i++)
                        {
                            string fname = Path.GetFileName(files[i]);
                            if (fname != "" && fname != null)
                            {
                                if (SLfnames.IndexOfKey(fname.ToLower()) < 0)
                                    sw.WriteLine(files[i]);
                            }
                        }
                        sw.Close();
                    }
                }
            }
            sr.Close();
            MessageBox.Show("Successfully Done!");
        }
    }
}
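The core of this comparison can be sketched without any dialogs or file access. The version below is an illustration under stated assumptions, not the form's code: DiffLinksSketch, FileNameOf and NewLinks are hypothetical names, the file name is taken as the text after the last '/' rather than through Path.GetFileName, and a case-insensitive HashSet replaces the lower-cased SortedList keys.

```csharp
using System;
using System.Collections.Generic;

public static class DiffLinksSketch
{
    // Returns the text after the last '/' of a link,
    // for example "1.jpg" for "http://example.com/a/1.jpg".
    public static string FileNameOf(string link)
    {
        int slash = link.LastIndexOf('/');
        return slash < 0 ? link : link.Substring(slash + 1);
    }

    // Keeps only the links whose file name is not among the existing file names.
    public static List<string> NewLinks(IEnumerable<string> links, IEnumerable<string> existingFileNames)
    {
        // Duplicate names are harmless here, and the comparer ignores case.
        var existing = new HashSet<string>(existingFileNames, StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (string link in links)
        {
            string name = FileNameOf(link);
            if (name.Length > 0 && !existing.Contains(name))
                result.Add(link);
        }
        return result;
    }
}
```

The existing names could come from Directory.GetFiles combined with Path.GetFileName, as in the method above.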
Connection Settings:
The connection to the target URL may go through a proxy server. To enable this, check the "Using From Proxy Server" check box, then enter the IP address in the first text box and the port number in the second.
For a direct connection to the target URL, uncheck this option.
Notes:
- When the DataGridView Rows list is not empty and you click the "Get links" button, the URL textbox value is only a reference URL and not the starting point.
- Extracting a web site's links usually takes a long time, so the max level and max count values are very important.
- The without string and with string text boxes can be left empty or can hold strings with any characters, including spaces, but the parts must be delimited by the / (slash) character. Example string list: 120/hello/direct game/book/@
- The string lists typed in the without and with text boxes are case sensitive.
- This tool is very suitable for sites that have directory listing enabled, especially for web site builders.
Points of Interest
I created this tool to avoid wasting my time, and I reached my goal. I hope it is useful for you too.
My interest is to combine a sophisticated download tool with a highly intelligent web link filter tool that saves users' time and gets the best result.
History
Link manager version 1.0 was implemented in 2012.
References