Scrape Website Data into Excel using VBA
I'll be showing you an example on how to Scrape Data from a Website into Excel Worksheet using VBA. We'll be scraping data from www(dot)renewableuk(dot)com. Please also read the privacy policy of the website before mining data.
Goal:
Get all data under all column headings which can be found on this website i.e.
Wind Project, Region, ..., Type of Project
Requirements:
Usage:
VBA CODE:
Goal:
Get all data under all column headings which can be found on this website i.e.
Wind Project, Region, ..., Type of Project
Requirements:
You need to add a reference, Microsoft HTML Object Library on your VBA project.
Usage:
You can call the ProcessWeb() sub directly by pressing F5 on the Microsoft Visual Basic Window.
Or you can add a button on your excel worksheet then assign ProcessWeb() as the macro.
VBA CODE:
Function ScrapeWebPage(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim tmpDoc As New HTMLDocument
Dim i As Integer, row As Integer
Dim WS As Worksheet
Set WS = Sheets("DATA")
'create new XMLHTTP Object
Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
XMLHttpRequest.Open "GET", URL, False
XMLHttpRequest.send
While XMLHttpRequest.readyState <> 4
DoEvents
Wend
With HTMLDoc.body
'Set HTML Document
.innerHTML = XMLHttpRequest.responseText
'Get only Order List Tag of HTML Document
Set orderedlists = .getElementsByTagName("ol")
'Reset the Document to the HTML of the second ordered list element
'where we only need to extract the data
.innerHTML = orderedlists(1).innerHTML
'Now, we'll get the list items
Set ListItems = .getElementsByTagName("li")
'Declare data variable for p values
Dim iData As Integer
row = lastRow + 1
'Let's process each data of the list items
For Each li In ListItems
'Start's at 1st column
iData = 1
With tmpDoc.body
'Set the temp doc
.innerHTML = li.innerHTML
'There are about 10 columns, so there are 10 p's
Set ps = .getElementsByTagName("p")
For Each p In ps
'Put the value of p to each cells define below
WS.Cells(row, iData).Value = p.innerText
'increment it by 1 which starts at column 1
iData = iData + 1
Next
End With
row = row + 1
Next
End With
End Function
'Get the total number pages we need to scrape
Function totalPage() As Integer
Dim HTMLDoc As New HTMLDocument
Dim tmpDoc As New HTMLDocument
Dim html As String
Dim mask As String
Dim URL As String
URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm"
Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
XMLHttpRequest.Open "GET", URL, False
XMLHttpRequest.send
html = XMLHttpRequest.responseText
With HTMLDoc.body
.innerHTML = Mid(html, InStr(1, html, ""), 300)
mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "") - 2, 2)
End With
totalPage = mask
End Function
Function lastRow() As Long
lastRow = Range("A65536").End(xlUp).row
End Function
Sub ProcessWeb()
Dim URL As String
Dim i As Integer
Range("2:2", Selection.End(xlDown)).ClearContents
Range("A2").Select
Application.ScreenUpdating = False
Application.Cursor = xlWait
URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm/page/"
For i = 1 To totalPage
ScrapeWebPage URL & i
Application.StatusBar = "Please wait while processing page " & i & " of " & totalPage & "..."
Next i
Application.ScreenUpdating = True
Application.Cursor = xlDefault
Application.StatusBar = ""
MsgBox "Data Extraction is Done!"
End Sub
why are you using external log file? isn't it hard to parse the data directly and put into the excel sheets.
ReplyDeleteI think that would be possible. You can use the following code instead.
ReplyDeleteDim iData as Integer
For Each li In ListItems
iData = 1
With tmpDoc.body
'Set the temp doc
.innerHTML = li.innerHTML
'There are about 10 columns, so there are 10 p's
Set ps = .getElementsByTagName("p")
For Each p In ps
'Print only the text, excluding the tags
WS.Cells(row, iData).Value = p.innerText
iData = iData + 1
Next
End With
Next
Thanks. ScrapeWebpage() function is now simplified.
ReplyDeleteWhat References are you adding I am getting an error Invalid Procedure call or Argument error in this line :
ReplyDeletemask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "") - 2, 2)
Thanks