|
I have done some "data scraping" myself in the past, extracting data from web pages. The problem you will run up against is that the position of the data you're interested in moves, especially when the person who owns the site changes their page. These days many pages are built up dynamically by scripts, which means that to extract your information with as little maintenance as possible you will need to parse the page and present the data in a format which your own code can understand.
My own code, which was developed in VBScript, parsed the page looking for various tags such as hrefs, titles, meta data etc. The function (shown below) worked reasonably well, so you should be able to convert it into your VB application quite easily.
When scraping a page of unknown content, you have to be aware of pitfalls such as the tag you are looking for being embedded inside JavaScript, as the author may well write something like document.write("<tr><td class='myClass'>Saturn</td>.....");
which really exacerbates the problem. Trying to write a parser which covers all eventualities is extremely hard. HTML 1.0 is one thing, but now with XHTML and XML the problem just grows. This is one reason why browsers are continuously being updated.
Having said that, the function below might spark some thoughts. It was called from a loop which keeps calling GetTag, passing the updated pointer into the page back in each time. Note also that GetTag can be recursive if a certain condition arises. You would probably need to do a similar thing when looking for a table tag, as tables can be embedded inside tables and of course a table has many rows and columns.
This is not an easy answer to your problem. There is most probably an easier way: simply locate the table tag, count the columns you have located, and extract the data inside the column your program needs. But for maintenance-free code which caters for the page authors making changes, the parser method will have much more longevity and can always be re-used in other projects.
Remember that this code is a little dirty and was used as a proof of concept, but I hope it helps you with your problem. (Trying to get the formatting right in this post is a real pain when it's a mixture of tabs and spaces!)
<pre>
call GetHTMLPage(strTemp) ' Grab the page into strHTML
iPos = 1
do while (iPos > 0)
iPos = GetTag ("href", strHTML, iPos, nNode, true)
loop
</pre>
<pre>
Function GetTag(strTag, str, iStartPos, thisNode, bStandardTag)
Dim i, pos, iStartScriptPos, iEndScriptPos
if objProps.DebugInfo then Response.Write "GetTag : " & strTag & "," & iStartPos & "<br>"
' Look for the tag in the stream
pos = Instr(iStartPos,str, strTag, vbTextCompare)
if bStandardTag then
pos = Instr(iStartPos,str, "<" & strTag & ">", vbTextCompare)
else
pos = Instr(iStartPos,str, strTag, vbTextCompare)
end if
' Check to see if one has been found
if pos >= iStartPos then
' Walk backwards to discover if we are inside a <script> tag
for i = pos to 1 step - 8 ' 8 is the length of '<script>'
iStartScriptPos = Instr(i, str, "<script" , vbTextCompare)
' Check to see if we were inside a script
if iStartScriptPos > 0 and iStartScriptPos < pos then
if objProps.DebugInfo then Response.Write "<span style=""color:green""><script>" & iStartScriptPos & "</span><br>"
exit for ' If we were then we need to examine closer so exit now
end if ' Otherwise
next ' keep looking backwards
' Check to see if we did actually find a '<script>' and if so look for the corresponding '</script>'
' This assumes that the page actually has a terminating tag
if iStartScriptPos > 0 then
iEndScriptPos = Instr(iStartScriptPos, str,"</script>", vbTextCompare)
end if
' Now that we have the positions of both the '<script>' and '</script>'
' we can check to see if our tag was actually inside them
' If so then we can set the bInScript Flag
if ( iEndScriptPos > iStartScriptPos) and _
( iStartScriptPos > 0 ) and _
( pos < iEndScriptPos) and _
( pos >= iStartScriptPos) then
' We need to ignore anything in JS Comment Lines
if objProps.DebugInfo then
Response.Write "<span style=""color:blue"">" & _
objUtils.HTMLEnc(mid( str, iStartScriptPos, (iEndScriptPos-iStartScriptPos + len("</script>")) )) & _
"</span><br>"
end if
Dim arrScript, iComment, idx, strC, iLen, iResult, iTag
' We are inside <script> ... </script> and hrefs could be inside comments such as // so
' we need to check. Best way so far is to split the section into an array of lines
' and keep looking
arrScript = split( mid(str,iStartScriptPos,iEndScriptPos - iStartScriptPos), vbNewLine )
for idx = 0 to UBound(arrScript)
strC = arrScript(idx) ' Place line onto temp buffer
iLen = len(strC) ' cache the length
iComment = Instr(strC,"//") ' Start of comment
if iComment <= 0 then iComment = iLen
iTag = Instr(strC, strTag, vbTextCompare) ' Start of tag on this line
iResult = 1
'Response.Write "Line(" & idx & ") " & strTag & "-" & iTag & " " & objUtils.HTMLEnc(strC) & "<br>"
' Recurse along this line looking for tags
do while (iTag > 0 and iTag < iComment and iResult >0)
iResult = GetTag(strTag, strC, iResult, thisNode, bStandardTag)
loop
next ' So onto the next line
pos = iEndScriptPos
else
' We have located a tag so start getting some stuff
' walk backwards and locate the start of tag e.g. '<'
' but as in bbc.co.uk there is 'news_console_link.href= ... so check for a period also
if not bStandardTag then
Dim iSanityCheck : iSanityCheck = pos
do while mid(str,pos,1) <> "<" and pos > 1
'Response.Write "Scanning backwards: ( " & pos & ") " & mid(str,pos,1) & "<br>"
pos = pos -1
loop
pos = pos + 1 ' We dont want to include the '<' though
' build up a string with the tag details
Dim strTemp : strTemp = ""
do while (mid(str,pos,1) <> ">" and pos < len(str))
'Response.Write "Scanning: forwards: ( " & pos & ") " & mid(str,pos,1) & "<br>"
strTemp = strTemp & mid(str, pos, 1)
pos = pos + 1
loop
pos = pos + 1
if pos < iSanityCheck then
pos = iSanityCheck + len(strTag)
if objProps.DebugInfo then Response.Write "<span style=""color:red"">Exit GetTag SanityCheck : </span> " & pos & "<br>"
GetTag = pos
Exit Function
end if
else
' Simply look for the closing tag, as everything else inside is all that's called for
strTemp = ""
pos = pos + len(strTag) + 2
do while (mid(str,pos,1) <> "<" and pos <= len(str))
strTemp = strTemp & mid(str, pos, 1)
pos = pos + 1
loop
do while (mid(str,pos,1) <> ">" and pos <= len(str))
pos = pos + 1
loop
pos = pos + 1
end if
' We now have a temporary string with the required tag data
' So we now need to either add this tag into the child list if its an href
' or simply stuff the information into this nodes information
call ApplyInfo(strTag, strTemp, thisNode)
'pos = pos + len(strTemp)
end if
end if
if objProps.DebugInfo then Response.Write "<span style=""color:green"">Exit GetTag returns: </span> " & pos & "<br>"
GetTag = pos
End Function
</pre>
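For comparison, the "easier way" mentioned above (locate the table, then pick out one column) might look something like the following in VB.NET. This is only a rough sketch: the module name `TableScrape` and function `GetColumn` are made up for illustration, it only looks at the first table it finds, and it does not cope with nested tables or with markup written out by script, which is exactly where the parser approach earns its keep.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TableScrape
    ' Hypothetical helper: returns the inner text of one zero-based column
    ' from every row of the FIRST <table> in the page.
    ' Assumption: simple, non-nested tables with plain <tr>/<td> markup.
    Public Function GetColumn(html As String, columnIndex As Integer) As List(Of String)
        Dim result As New List(Of String)()
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Locate the first table
        Dim table As Match = Regex.Match(html, "<table\b[^>]*>(.*?)</table>", opts)
        If Not table.Success Then Return result

        ' Walk each row, then pick the requested cell if the row has one
        For Each row As Match In Regex.Matches(table.Groups(1).Value, "<tr\b[^>]*>(.*?)</tr>", opts)
            Dim cells As MatchCollection = Regex.Matches(row.Groups(1).Value, "<td\b[^>]*>(.*?)</td>", opts)
            If columnIndex < cells.Count Then
                result.Add(cells(columnIndex).Groups(1).Value.Trim())
            End If
        Next
        Return result
    End Function
End Module
```

If the page author adds a column or wraps the table in another table, a helper like this will quietly return the wrong data, so it is best kept for pages you control or check regularly.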
|
|
|
|
|
Hi all,
I have a SQL script that I execute from a class, so I have to add the SQL file to the class library. I don't want anyone else to see this file, so I set its copy option to "Do not copy". But my code reads from the file, so when I use the compiled DLL in another solution, it can't find the file.
What exactly should I do?
Your answers will be appreciated.
Thank you
|
|
|
|
|
I think your SQL file is inside the project, so the DLL finds it when you use it from that project, but when you try to use the same DLL from another project the file is not at the path it expects. So there are two solutions:
1. Encrypt the file and let it be copied to the system, then decrypt it when required.
2. Write a function in the class that holds the queries in a string array and either returns them to you when you call it, or runs them directly when you pass the required parameters.
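A minimal sketch of option 2 could look like this. The class name `SqlScripts` and the two statements are placeholders invented for illustration; the real contents of the script would go in their place. Bear in mind that strings compiled into a DLL are not truly hidden, since anyone with a disassembler can still read them.

```vbnet
Imports System

' Hypothetical holder for the script; the statements below are placeholders.
Public NotInheritable Class SqlScripts
    Private Sub New()
    End Sub

    Private Shared ReadOnly Statements As String() = {
        "CREATE TABLE Demo (Id INT NOT NULL)",
        "INSERT INTO Demo (Id) VALUES (1)"
    }

    ' Returns a copy so callers cannot alter the stored statements.
    Public Shared Function GetStatements() As String()
        Return DirectCast(Statements.Clone(), String())
    End Function
End Class
```

The calling code would then loop over `SqlScripts.GetStatements()` and execute each statement with whatever command object it already uses, so no separate file has to ship with the DLL.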
Rizwan Bashir
|
|
|
|
|
Thank you Rizwan,
Can you guide me on how to do the encryption, so that the file can't be opened when it exists on the customer's machine?
|
|
|
|
|
|
Hi all
suppose I have these lines:
AddHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl2.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl3.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler2
AddHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler3
The questions are:
- how can I find, at runtime, the procedures that are attached to a given event?
- how can I find, at runtime, the events a given procedure is subscribed to?
In other words: which procedures react to MyControl1.Parent.Paint, and which events does PaintParentHandler1 respond to?
Thanks a lot
-- modified at 3:19 Sunday 7th May, 2006
|
|
|
|
|
Hi,
1. If the event is within the same class, then a call to GetInvocationList will give you an array of Delegates. I think the compiler will not accept this when calling it on an event defined outside the current class. You might nevertheless reach it via Reflection.
2. Not possible. An event is simply a collection of references to methods and the referenced methods don't know anything about it.
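To illustrate point 1, here is a small sketch of the same-class case. `Publisher`, `Changed` and `GetChangedHandlerNames` are names made up for the example. Inside the declaring class, VB.NET exposes the event's backing delegate as a hidden field named after the event with an `Event` suffix (here `ChangedEvent`), and that field is what GetInvocationList is called on.

```vbnet
Imports System

Public Class Publisher
    Public Event Changed As EventHandler

    ' Lists the names of the methods currently attached to Changed.
    ' ChangedEvent is the compiler-generated backing field; it is only
    ' visible inside this class, and is Nothing when nothing is attached.
    Public Function GetChangedHandlerNames() As String()
        If ChangedEvent Is Nothing Then Return New String() {}
        Dim handlers As [Delegate]() = ChangedEvent.GetInvocationList()
        Dim names(handlers.Length - 1) As String
        For i As Integer = 0 To handlers.Length - 1
            names(i) = handlers(i).Method.Name
        Next
        Return names
    End Function
End Class
```

This does not help with an event such as Paint that is declared in a class you do not own, which is the harder case being asked about.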
|
|
|
|
|
Hi Robert,
1. If the event is within the same class, then a call to
GetInvocationList will give you an array of Delegates.
I think the compiler will not accept this when calling it on an
event defined outside the current class. You might nevertheless
reach it via Reflection.
Absolutely right! I've found a lot of samples that use MyEventEvent.GetInvocationList(). Unfortunately for me, they all inspect events INSIDE the same class where they are declared. In my case, I need to find the list of event handlers for an event of a given object used elsewhere. The MyControl1.Parent.Paint sample was not chosen at random...
Do you have an idea of how to solve it using Reflection?
2. Not possible. An event is simply a collection of references
to methods and the referenced methods don't know anything about
it.
I agree. But my thinking is that the compiler knows exactly which procedures are attached to an event.
However, for me it is much more important to solve #1.
As you can see in the first three lines of my sample, lines 2 and 3 are useless. In fact, if MyControl1, MyControl2 and MyControl3 are inside the SAME parent, PaintParentHandler1 is called three times, while only the first call is needed.
Thank you for any suggestion...
carlo
-- modified at 4:07 Sunday 7th May, 2006
|
|
|
|
|
Hi,
I've played around a bit but just couldn't get the compiler to accept any call (even via reflection) on an event. I know it must be possible somehow (because some tools I've seen are able to do it) but I don't know how they managed it.
The only other workaround for your problem I can imagine would be keeping a separate list of the references that are already bound, which you could check before binding the next event.
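A rough sketch of that workaround follows. `SubscriptionTracker` and `TryRegister` are hypothetical names, and the tracker compares by reference so that two controls sharing the same parent count as one publisher.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical helper: remembers which publishers already have the handler bound.
Public Class SubscriptionTracker
    Private ReadOnly seen As New List(Of Object)()

    ' Returns True the first time a publisher is passed in, False afterwards.
    Public Function TryRegister(publisher As Object) As Boolean
        For Each item As Object In seen
            If item Is publisher Then Return False
        Next
        seen.Add(publisher)
        Return True
    End Function
End Class

' Intended use (not run here):
'   If tracker.TryRegister(MyControl1.Parent) Then
'       AddHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler1
'   End If
```

One tracker per handler keeps things simple; the list also holds references to the parents, so entries should be dropped when a parent is disposed.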
|
|
|
|
|
Hello Robert
You are right. Maintaining an embedded list of event handler calls is a good idea.
However, yesterday I implemented and tested this:
RemoveHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl1.Parent.Paint, AddressOf PaintParentHandler1

RemoveHandler MyControl2.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl2.Parent.Paint, AddressOf PaintParentHandler1

RemoveHandler MyControl3.Parent.Paint, AddressOf PaintParentHandler1
AddHandler MyControl3.Parent.Paint, AddressOf PaintParentHandler1
This leaves just a single PaintParentHandler1 call per parent, without any useless stacking of handlers... and it seems to work!
Thank you for your help.
Carlo
-- modified at 3:15 Monday 8th May, 2006
|
|
|
|
|
Hi all,
I have a Winforms 2.0 app in which I have a DataGridView which is populated from my own data objects (ie: from my own classes, not from a database or strongly typed dataset). The DGV is readonly so no editing or adding new rows can be done. Since some of the columns need to have a custom sort I set the SortMode property on those columns to "Programmatic" and then added the following event handler to the code:
Private Sub dgvMeetings_SortCompare(ByVal sender As System.Object, ByVal e As DataGridViewSortCompareEventArgs) Handles dgvMeetings.SortCompare
'Try to sort based on the cells in the current column.
e.SortResult = System.String.Compare(e.CellValue1.ToString(), e.CellValue2.ToString())
'If the cells are equal, sort based on the race start time column
If e.SortResult = 0 Then
e.SortResult = System.String.Compare(dgvMeetings.Rows(e.RowIndex1).Cells("RaceDate").Value.ToString(), _
dgvMeetings.Rows(e.RowIndex2).Cells("RaceDate").Value.ToString())
End If
e.Handled = True
End Sub
However, this event is simply not firing at all... I can click on the column headers until I'm blue in the face and the event never fires. The columns in the DGV that are set to sort automatically work fine so would anyone know why SortCompare is not working? I also tried adding an AddHandler which pointed to the same method (without the "Handles ..." of course) and it still ignores it. I actually got the code above from the MS document about the DGV (at http://download.microsoft.com/download/5/6/4/5646742C-3EB7-48F7-BFB3-CC295D618CF9/DataGridView%20FAQ.doc) in which it says this is the way to custom sort unbound data - but it doesn't seem to work for me... anyone know what I might be missing here?
TIA for any help...
Mike
-- modified at 0:13 Sunday 7th May, 2006
|
|
|
|
|
Worked out what it was - I actually needed to set the sort mode of each column to "Automatic" rather than "Programmatic". It seems the latter mode is really only for managing the sort glyphs etc. Now the SortCompare event fires, but I have a new problem, which is to get numeric columns such as "Race Number" to sort by numeric value (not the string value) and then by the race start date (which is a DateTime column).
I was going to delete this post but thought I'd leave it there for anyone else that runs into the same problem.
Mike
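For the follow-up problem above, one possible approach is to compare the typed values instead of their strings. The sketch below is only the comparison itself; `RaceSort` and `CompareRaces` are invented names, and it assumes the cells really hold Integer and DateTime values.

```vbnet
Imports System

Module RaceSort
    ' Orders by race number first, then by start time when the numbers tie.
    ' Returns a negative, zero or positive value, as SortResult expects.
    Public Function CompareRaces(number1 As Integer, start1 As DateTime,
                                 number2 As Integer, start2 As DateTime) As Integer
        Dim result As Integer = number1.CompareTo(number2)
        If result = 0 Then result = DateTime.Compare(start1, start2)
        Return result
    End Function
End Module
```

Inside the SortCompare handler the result could be assigned to e.SortResult, passing CInt(e.CellValue1) and CInt(e.CellValue2) for the numbers and the "RaceDate" cell values of rows e.RowIndex1 and e.RowIndex2 converted with CDate for the dates.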
|
|
|
|
|
Hi all,
I want to insert a value into a field of type "timestamp" in SQL Server 2000 from a VB.NET application. The field is NOT NULL.
When I tried to insert a new record into the table leaving the timestamp field empty, it inserted the value as <binary>. How do I get a timestamp and insert it into SQL Server?
Tasleem Arif
|
|
|
|
|
My dear, I have a table Product in SQL Server with two fields, Prod_id and Prod_Name. I want to populate a combo box with Prod_Name, and when I select a Prod_Name from the list I need the Prod_id. How can I do this easily with the shortest code? Help me please.
|
|
|
|
|
In VB6, format(date,"dd/mm/yyyy") works,
but in VB.NET, format(today,"dd/mm/yyyy") comes out as 06/00/2006.
The "mm" part always gives "00". What can I do?
|
|
|
|
|
"mm" is for minutes. You need to use "MM" for months.
I also recommend that you use the Date.ToString() method instead of Format() which is only there for VB6 compatibility reasons.
Dim myDate As Date = Date.Now
Dim myString As String
myString = myDate.ToString("dd/MM/yyyy")
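To make the difference concrete, here is a small sketch that formats the same value both ways. `DateFormatDemo` and its two functions are names invented for the example. It also passes CultureInfo.InvariantCulture, because "/" in a custom format string stands for the current culture's date separator and may otherwise be replaced by another character on some machines.

```vbnet
Imports System
Imports System.Globalization

Module DateFormatDemo
    ' "MM" = month, so this gives day/month/year.
    Public Function FormatDay(value As DateTime) As String
        Return value.ToString("dd/MM/yyyy", CultureInfo.InvariantCulture)
    End Function

    ' "mm" = minutes, which reproduces the 06/00/2006 symptom
    ' whenever the time part has zero minutes (as with Today).
    Public Function FormatWithMinutes(value As DateTime) As String
        Return value.ToString("dd/mm/yyyy", CultureInfo.InvariantCulture)
    End Function
End Module
```

For a value of 6 May 2006 at midnight, FormatDay gives 06/05/2006 while FormatWithMinutes gives 06/00/2006.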
|
|
|
|
|
I have entered data into a database table, "PersonData". It has 5 columns:
PersonNo Sex Age Married Education.
Now I want to write a validation program in VB which opens "PersonData" and then checks that if a person is married, the spouse is of the opposite sex, and that if the person is married, the age is more than 15. There are many inconsistencies that I would like my code to check, but if I can be helped with these ones I can handle the others. I want to run this for about a thousand records.
Then I write the validation messages to a text file.
This is a sample of the code I have written:
Option Explicit
Public objConn As ADODB.Connection
Public objRst As ADODB.Recordset
Public objPwd As ADODB.Recordset
Public objCmd As ADODB.Command
Public Sub main()
Set objConn = New ADODB.Connection
Set objRst = New ADODB.Recordset
Set objCmd = New ADODB.Command
objConn.Provider = "MSDASQL.1"
objConn.ConnectionString = "Password=pwdsa;Persist Security Info=True;User ID=idsa;Data Source=TESTING;Initial Catalog=TESTING"
objConn.Open
If objConn.State = adStateOpen Then
MsgBox "Connexion with the server established", vbInformation, "prjTesting"
Else
MsgBox "Connexion Failed !", vbInformation, "prjTesting"
End If
End Sub
Private Sub cmdValidate_Click()
    Open "C:\Testing\testing.txt" For Output As #1
    With objRst
        Do While Not .EOF
            ' Flag married people who are too young
            If !married = 1 And !age < 15 Then Write #1, "PersonID: " & !personid & " Person must be older than 15 years in order to be married"
            .MoveNext
        Loop
    End With
    Close #1 ' Flush and release the output file
End Sub
Private Sub Form_Load()
main
Set objRst = New ADODB.Recordset
objRst.Open "tblPersondata", objConn, adOpenDynamic, adLockOptimistic
End Sub
-----------------------------
The problem with my code is that it only checks the first column; the other columns are not checked.
What am I doing wrong?
Quick help is appreciated.
Thank you
phokojoe
|
|
|
|
|
I don't see in your code where married or age are getting set to values...
Mike Lasseter
|
|
|
|
|
Actually, what I am saying is that the data is already in the database. The code I am writing will look for the inconsistencies and then write a message, with the ID of the person concerned, to the .txt file that I have specified.
In the code below:
Private Sub cmdValidate_Click()
    Open "C:\test\persondata.txt" For Output As #1
    With objRst
        Do While Not .EOF
            ' !age reads the Age field of the current record
            If !married = 1 And !age < 15 Then Write #1, "Person ID: " & !personid & " Too young to get married"
            .MoveNext
        Loop
    End With
    Close #1
End Sub
the code for married is 1, and I am checking that if a particular person is married then his/her age must be greater than 15; otherwise the personid is written to the persondata.txt file so that it can be corrected. Since I will be checking many persons at the same time, I have enclosed the statement inside the DO WHILE NOT EOF loop.
Please help
phokojoe
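One way to keep checks like this manageable as more are added is to put the rules in a function that takes the field values and returns the messages, leaving the recordset loop to do nothing but read and write. The sketch below is written in VB.NET rather than VB6, the names `PersonValidation` and `Validate` are made up, and it implements only the single rule stated above (married is coded as 1, and a married person's age must be greater than 15).

```vbnet
Imports System
Imports System.Collections.Generic

Module PersonValidation
    ' Returns one message per rule the record breaks; an empty list means it passed.
    ' Assumption: married = 1 means married, as described in the post.
    Public Function Validate(personNo As Integer, married As Integer, age As Integer) As List(Of String)
        Dim messages As New List(Of String)()
        If married = 1 AndAlso age <= 15 Then
            messages.Add("Person ID: " & personNo & " Too young to be married")
        End If
        Return messages
    End Function
End Module
```

The spouse check cannot be written the same way with the five columns listed, because nothing in the table says who a person's spouse is; that rule would need an extra column or table linking the two records.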
|
|
|
|
|
Please help me with the above thread.
Help is urgently needed. Actually, I am writing code to validate my survey results, for which I have used MS SQL Server 2000 as the back end and VB as the front end. Alternatively, can anybody tell me which software is appropriate for validating the dataset?
phokojoe
|
|
|
|
|
Dear..!
I want to read data from a Notepad file (it contains phone numbers)
and put all the data in a ListView, so I need to:
1- browse for and select a file in the C: directory
2- fill the ListView items with all the data
Thanks a lot.
kilany
|
|
|
|
|
You do realise that a "notepad file" is just a plain text file?
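Since it is only a text file, reading it comes down to reading lines. A minimal sketch, assuming one entry per line (the actual file layout was not described) and using the invented names `PhoneFile` and `ReadEntries`:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module PhoneFile
    ' Reads a plain text file and returns its non-empty lines, trimmed.
    ' Assumption: each line of the file is one entry.
    Public Function ReadEntries(fileName As String) As List(Of String)
        Dim entries As New List(Of String)()
        For Each line As String In File.ReadAllLines(fileName)
            Dim trimmed As String = line.Trim()
            If trimmed.Length > 0 Then entries.Add(trimmed)
        Next
        Return entries
    End Function
End Module
```

On a form, the file name would typically come from an OpenFileDialog (its FileName property after ShowDialog returns OK), and each returned entry could be added with ListView1.Items.Add(entry).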
"On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question."
--Charles Babbage (1791-1871)
|
|
|
|
|
Is your problem solved?
If not, send me a message.
I can solve it.
rtytryt
|
|
|
|
|
Hello to all,
I am generating a Crystal Report that has duplicate rows which I would like to remove. Let's say I have
ID - NAME - ADDRESS
1 SSS SSSSS
1 EEE EEDEE
1 TER WEWEE
I would like to remove two rows and keep one.
How can I do this in Crystal Reports?
Thanks
|
|
|
|
|
Hi, I need to perform shallow and deep copies in my VB.NET application. I've seen several code-intensive methods to do this. Does anyone know why Microsoft has not just provided a basic copy-object command? Am I missing something?
In my application I have several User Controls, and the user needs to cut and paste the controls to build up a screen of data that is then saved. The user often needs to copy an existing instance of a user control, then modify it and save it. The amount of code I need to write manually seems intense.
Are there third-party extensions to VB.NET that offer help in this area? As I said before, am I missing something?
I want to be able to do something like...
dim newObject as ucMyControl
newObject = new list.selecteditem
I know the keyword NEW is for new instances of classes, but it would be handy if it created a new copy of a class as instantiated.
Cheers,
Kadi
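For reference, the difference between the two kinds of copy can be sketched with a plain data class. `ScreenItem` and its members are invented for the example. The shallow copy relies on Object.MemberwiseClone, which copies field values and therefore shares any referenced objects; the deep copy then replaces the shared list with its own. This is meant for simple data classes only: a UserControl carries a window handle and parent links, so for controls it may be more appropriate to create a new instance and copy the relevant properties across.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical data class used only to show the two kinds of copy.
Public Class ScreenItem
    Public Title As String
    Public Tags As New List(Of String)()

    ' Shallow: fields are copied, so Tags still points at the same list.
    Public Function ShallowCopy() As ScreenItem
        Return DirectCast(Me.MemberwiseClone(), ScreenItem)
    End Function

    ' Deep: start from a shallow copy, then give the copy its own list.
    Public Function DeepCopy() As ScreenItem
        Dim copy As ScreenItem = ShallowCopy()
        copy.Tags = New List(Of String)(Tags)
        Return copy
    End Function
End Class
```

Adding a tag to the original after copying shows up in the shallow copy but not in the deep copy, which is the behaviour that usually decides which one is needed.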
|
|
|
|
|