Scrape Website Data into Excel using VBA

I'll be showing you an example on how to Scrape Data from a Website into Excel Worksheet using VBA. We'll be scraping data from www(dot)renewableuk(dot)com. Please also read the privacy policy of the website before mining data.


Goal:
Get all data under all column headings which can be found on this website i.e.
Wind Project, Region, ..., Type of Project

Requirements:
You need to add a reference, Microsoft HTML Object Library on your VBA project.

Usage:
You can call the ProcessWeb() sub directly by pressing F5 on the Microsoft Visual Basic Window.
Or you can add a button on your excel worksheet then assign ProcessWeb() as the macro.

VBA CODE:

Function ScrapeWebPage(ByVal URL As String)
    Dim HTMLDoc As New HTMLDocument
    Dim tmpDoc As New HTMLDocument
    
    Dim i As Integer, row As Integer
    Dim WS As Worksheet
    Set WS = Sheets("DATA")
    
    'create new XMLHTTP Object
    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
    XMLHttpRequest.Open "GET", URL, False
    XMLHttpRequest.send
    
    While XMLHttpRequest.readyState <> 4
        DoEvents
    Wend

    With HTMLDoc.body
        'Set HTML Document
        .innerHTML = XMLHttpRequest.responseText
        
        'Get only Order List Tag of HTML Document
        Set orderedlists = .getElementsByTagName("ol")
        
        'Reset the Document to the HTML of the second ordered list element
        'where we only need to extract the data
        .innerHTML = orderedlists(1).innerHTML
    
        'Now, we'll get the list items
        Set ListItems = .getElementsByTagName("li")
    
        'Declare data variable for p values
        Dim iData As Integer
        row = lastRow + 1
        
        'Let's process each data of the list items
        For Each li In ListItems
            'Start's at 1st column
            iData = 1
            With tmpDoc.body
                'Set the temp doc
                .innerHTML = li.innerHTML
                
                'There are about 10 columns, so there are 10 p's
                Set ps = .getElementsByTagName("p")
                
                For Each p In ps
                    'Put the value of p to each cells define below
                    WS.Cells(row, iData).Value = p.innerText
                    
                    'increment it by 1 which starts at column 1
                    iData = iData + 1
                Next
            End With
            row = row + 1
        Next  
    End With
End Function

'Get the total number pages we need to scrape
Function totalPage() As Integer
    Dim HTMLDoc As New HTMLDocument
    Dim tmpDoc As New HTMLDocument
    Dim html As String
    Dim mask As String
    Dim URL As String
    
    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm"
        
    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
    XMLHttpRequest.Open "GET", URL, False
    XMLHttpRequest.send
    
    html = XMLHttpRequest.responseText
    
    With HTMLDoc.body
        .innerHTML = Mid(html, InStr(1, html, ""), 300)
        mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "") - 2, 2)
    End With
    
    totalPage = mask
    
End Function

Function lastRow() As Long
    lastRow = Range("A65536").End(xlUp).row
End Function

Sub ProcessWeb()
    Dim URL As String
    Dim i As Integer
    
    Range("2:2", Selection.End(xlDown)).ClearContents
    Range("A2").Select
    
    Application.ScreenUpdating = False
    Application.Cursor = xlWait
    
    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm/page/"
        
    For i = 1 To totalPage
        ScrapeWebPage URL & i
        Application.StatusBar = "Please wait while processing page " & i & " of " & totalPage & "..."
    Next i
    
    Application.ScreenUpdating = True
    Application.Cursor = xlDefault
    Application.StatusBar = ""
    
    MsgBox "Data Extraction is Done!"
    
End Sub

Comments

  1. why are you using external log file? isn't it hard to parse the data directly and put into the excel sheets.

    ReplyDelete
  2. I think that would be possible. You can use the following code instead.
    Dim iData as Integer
    For Each li In ListItems
             iData = 1
            With tmpDoc.body
                'Set the temp doc
                .innerHTML = li.innerHTML
                 
                'There are about 10 columns, so there are 10 p's
                Set ps = .getElementsByTagName("p")
                 
                For Each p In ps
                    'Print only the text, excluding the tags
    WS.Cells(row, iData).Value = p.innerText
    iData = iData + 1
                Next
                 
            End With
     Next

    ReplyDelete
  3. Thanks. ScrapeWebpage() function is now simplified.

    ReplyDelete
  4. What References are you adding I am getting an error Invalid Procedure call or Argument error in this line :
    mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "") - 2, 2)
    Thanks

    ReplyDelete

Post a Comment

Popular posts from this blog

How to Create a Configuration.INI Files in VB6

How to Set Windows Form Always on Top of Other Applications in VB6

Send Email with Excel VBA via CDO through GMail