Download company URLs & company names across multiple pages from the web (web scraping)



syed_iqbal
03-15-2017, 01:10 PM
Hi,


I want to download all company names from the web. I wrote some code, but one line is not working. Please go through the code and correct it.


Note: My intention is to scrape specific URLs from the web page. If a URL contains "http://money.rediff.com/companies/", I want to download that URL. I used the InStr function for this, but it is not working. Please help me.

Sub downloadallcompanynames()

    Dim IE As New SHDocVw.InternetExplorer
    Dim num As Integer
    Dim lrow As Long
    Dim str As String
    Dim alllinks As mshtml.IHTMLElementCollection
    ThisWorkbook.Sheets("Sheet1").Activate
    ActiveSheet.Cells(2, 1).Select
    Set IE = New SHDocVw.InternetExplorer

    IE.Visible = True
    IE.navigate "http://money.rediff.com/companies"
    Do While IE.ReadyState <> READYSTATE_COMPLETE
    Loop
    Set htmldoc = IE.Document
    Set HTMLAs = htmldoc.getElementsByTagName("a")

    For Each HTMLA In HTMLAs
        num = InStr(HTMLA.innerText, "http://money.rediff.com/companies") 'this line of code is not working
        If num > 0 Then
            Debug.Print HTMLA.getAttribute("href")
            ActiveCell.Value = HTMLA.getAttribute("href")
            ActiveCell.Offset(1, 0).Select
        End If
    Next HTMLA

End Sub


Please also add code to download the URLs from multiple pages of the above website. I have not added that part.

Thank you in advance.

offthelip
03-15-2017, 02:15 PM
You are using the InStr function on HTMLA.innerText, which holds the link's visible text rather than its URL, so the results aren't what you want.
I just put the href into a string and searched that instead, and then it works.


Sub downloadallcompanynames()

    Dim IE As SHDocVw.InternetExplorer
    Dim htmldoc As Object
    Dim HTMLAs As Object
    Dim HTMLA As Object
    Dim num As Long
    Dim hrefstr As String

    ThisWorkbook.Sheets("Sheet1").Activate
    ActiveSheet.Cells(2, 1).Select
    Set IE = New SHDocVw.InternetExplorer

    IE.Visible = True
    IE.navigate "http://money.rediff.com/companies"
    ' Wait for the page to load; DoEvents keeps Excel responsive meanwhile
    Do While IE.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    Set htmldoc = IE.Document
    Set HTMLAs = htmldoc.getElementsByTagName("a")

    For Each HTMLA In HTMLAs
        ' Search the link's URL (href), not its visible text (innerText)
        hrefstr = HTMLA.href
        num = InStr(hrefstr, "http://money.rediff.com/companies")
        If num > 0 Then
            Debug.Print HTMLA.getAttribute("href")
            ActiveCell.Value = HTMLA.getAttribute("href")
            ActiveCell.Offset(1, 0).Select
        End If
    Next HTMLA

End Sub
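
To see why searching innerText never matched while searching href does, here is a minimal sketch that runs the same InStr test on both kinds of string without opening a browser. The URL is the one quoted elsewhere in this thread; the link text "20 Microns Ltd." and the helper name IsCompanyLink are assumptions made up for illustration.

```vba
Option Explicit

' Hypothetical helper: True when the string contains the companies prefix.
Function IsCompanyLink(ByVal candidate As String) As Boolean
    Const PREFIX As String = "http://money.rediff.com/companies"
    IsCompanyLink = (InStr(candidate, PREFIX) > 0)
End Function

Sub DemoInnerTextVersusHref()
    Dim linkText As String   ' assumed visible text of an anchor
    Dim linkHref As String   ' URL quoted in this thread

    linkText = "20 Microns Ltd."
    linkHref = "http://money.rediff.com/companies/20-Microns-Ltd/15110088"

    Debug.Print IsCompanyLink(linkText)   ' prints False: no URL in the visible text
    Debug.Print IsCompanyLink(linkHref)   ' prints True: the prefix is in the href
End Sub
```

Keeping the test in a small function also makes it easy to check from the Immediate window, for example with ?IsCompanyLink("some text").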

syed_iqbal
03-15-2017, 03:02 PM
Hi,


The code below works perfectly except for one small problem. There are 200 URLs on the first page, but I got only 199 of them with this code; the first URL is missing. Why was the first URL not added?

(first URL - http://money.rediff.com/companies/20-Microns-Ltd/15110088)

Sub downloadallcompanynames()

    Dim IE As New SHDocVw.InternetExplorer
    Dim num As Integer
    Dim hrefstr As String

    ThisWorkbook.Sheets("Sheet1").Activate
    ActiveSheet.Cells(2, 1).Select
    Set IE = New SHDocVw.InternetExplorer
    IE.Visible = True
    IE.Navigate "http://money.rediff.com/companies/"

    Do While IE.ReadyState <> READYSTATE_COMPLETE
    Loop
    Set HTMLDoc = IE.Document
    Set HTMLAs = HTMLDoc.getElementsByTagName("a")
    For Each HTMLA In HTMLAs
        hrefstr = HTMLA.href
        num = InStr(hrefstr, "http://money.rediff.com/companies/")
        If num > 0 And IsNumeric(Right(hrefstr, 8)) Then
            Debug.Print HTMLA.getAttribute("href")
            ActiveCell.Value = HTMLA.getAttribute("href")
            ActiveCell.Offset(1, 0).Select
        End If
    Next HTMLA

End Sub
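
The IsNumeric(Right(hrefstr, 8)) test above keeps only links whose last 8 characters look like a number, which is how the company pages are told apart from other links under /companies/. IsNumeric is fairly permissive, though (it also accepts signed values), so a stricter variant can be sketched with the Like operator. This assumes, as the code above does, that company URLs end in "/" followed by exactly 8 digits; the function name EndsWithCompanyId and the example.com URL are hypothetical.

```vba
Option Explicit

' Hypothetical helper: True when the URL ends in "/" plus exactly 8 digits.
Function EndsWithCompanyId(ByVal href As String) As Boolean
    ' In a Like pattern, "*" matches any run of characters
    ' and each "#" matches exactly one digit.
    EndsWithCompanyId = (href Like "*/########")
End Function

Sub DemoCompanyIdCheck()
    Dim oddUrl As String
    oddUrl = "http://example.com/x-1234567"   ' made-up URL, not from the site

    ' URL quoted in this thread: passes
    Debug.Print EndsWithCompanyId("http://money.rediff.com/companies/20-Microns-Ltd/15110088")   ' True
    ' Listing page itself: rejected
    Debug.Print EndsWithCompanyId("http://money.rediff.com/companies/")                           ' False
    ' IsNumeric treats "-1234567" as a number, Like does not
    Debug.Print IsNumeric(Right(oddUrl, 8))      ' True
    Debug.Print EndsWithCompanyId(oddUrl)        ' False
End Sub
```

Either test can be combined with the InStr prefix check in the If statement; the Like version is just a little harder to fool.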

offthelip
03-15-2017, 03:44 PM
When I ran your modified code I got 200 URLs in A2 to A201, and the one you asked about arrived in cell A2, so I can't see any problem.

syed_iqbal
03-15-2017, 03:57 PM
When I run the code, I also get the 200 specific URLs in Sheet1 (A2 to A201). But I could not see the first URL in the Immediate window, which is why I mentioned it. Why can't I see the first URL in the Immediate window?

offthelip
03-15-2017, 04:08 PM
I put a breakpoint straight after the Debug.Print statement and the URL comes up fine. If you are getting what you want on the spreadsheet, why worry about what is happening in the Immediate window? I find it can be difficult to see what is happening there, so I tend to use breakpoints and the Watch window to debug things.

syed_iqbal
03-15-2017, 04:21 PM
Thank you so much for your input. I have just understood one thing: the Immediate window can show only 199 rows plus 1 blank row when we run the Debug.Print statement. Maybe that is why I could not see the first URL.
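
Since the Immediate window keeps only a limited number of recent lines (about 200, going by the behaviour seen above), one way to keep every line is to write the output to a text file instead of using Debug.Print. The following is only a sketch: the helper name LogLine, the file name company_urls.log and the use of the TEMP folder are all assumptions, and it presumes a Windows machine where the TEMP environment variable is set.

```vba
Option Explicit

' Hypothetical helper: append one line to a text log instead of Debug.Print.
Sub LogLine(ByVal logPath As String, ByVal message As String)
    Dim fileNum As Integer
    fileNum = FreeFile                    ' next unused file number
    Open logPath For Append As #fileNum   ' creates the file if it is missing
    Print #fileNum, message
    Close #fileNum
End Sub

Sub DemoLogManyLines()
    Dim logPath As String
    Dim i As Long

    logPath = Environ$("TEMP") & "\company_urls.log"   ' assumed location
    If Dir(logPath) <> "" Then Kill logPath            ' start with an empty log

    ' Write more lines than the Immediate window would keep
    For i = 1 To 250
        LogLine logPath, "line " & i
    Next i

    Debug.Print "Wrote 250 lines to " & logPath
End Sub
```

In the scraping loop, Debug.Print HTMLA.getAttribute("href") could then be replaced by a call such as LogLine logPath, HTMLA.getAttribute("href"), and the full list checked in any text editor.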