PDA

View Full Version : Save web page source as text file



mdmackillop
06-05-2009, 08:19 AM
With IE7 I can go Page/View Source to get the raw data from which I can extract desired data. Can some kind person create some code which would allow me to automate this for the active web page, saving to a specified location? Overwriting previous file doesn't matter.
Any differences for IE8?
Regards
MD

Oorang
06-05-2009, 01:05 PM
Doing it for the active web page is a little tricky if you didn't create the instance of IE yourself. It that something that can be worked around?
If so then either of these work with IE8. If not post back and we'll work on it:)

Option Explicit
Private Declare Function URLDownloadToFileA Lib "urlmon" (ByVal pCaller As Long, _
ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, _
ByVal lpfnCB As Long) As Long

Public Sub Example1()
DownloadFile "http://www.google.com", "c:\Test\goog.txt"
Shell "notepad.exe ""c:\Test\goog.txt""", vbMaximizedFocus
End Sub

Public Sub Example2()
'Requires reference to Microsoft Internet Controls
Dim ie As SHDocVw.InternetExplorer
'Microsoft HTML Object Library
Dim doc As MSHTML.HTMLDocument
Set ie = New SHDocVw.InternetExplorer
ie.Navigate "www.google.com"
Do Until ie.ReadyState = READYSTATE_COMPLETE
DoEvents
Loop
Do While doc Is Nothing
Set doc = ie.Document
Loop
MsgBox doc.body.innerHTML
ie.Quit
End Sub

Private Function DownloadFile(URL As String, LocalFilename As String) As Boolean
'Thanks Mentalis:)
Dim lngRetVal As Long
lngRetVal = URLDownloadToFileA(0, URL, LocalFilename, 0, 0)
If lngRetVal = 0 Then DownloadFile = True
End Function

mdmackillop
06-05-2009, 01:13 PM
Hi Aaron,
I'll need to wait till I'm at work to test this.
A bit of background. Our surveyors log in to a web site and fill in an on-line form. Some of this data, we wish to store in our own database for analysis purposes. I now have the rest in place to extract the info from a text file and export it, so this automation is the icing.
Regards
Malcolm

Oorang
06-05-2009, 01:17 PM
Is it a site your area has control over, or are we making due without altering the site?

mdmackillop
06-05-2009, 01:27 PM
I'm not making any changes to the site. Data entry will be complete before this is run. Probably we will have someone reopening previously submitted data and reading the text file.
I've attached the Excel part for info.

mdmackillop
06-09-2009, 06:01 AM
Hi Aaron,
I'm having limited success here. Although I previously managed to save the text version manually, I now seem only to see a few lines. It appears as though the page is in two frames, and I need the second one.
I've been using Example 1 in my attempts.
Regards
Malcolm

Oorang
06-09-2009, 10:01 PM
:) This ought to do it :)

I'm sure there is room for refinement, but I tried to code in the spirit of "just get something that works".

Edit: I just couldn't let it rest. Reworked it to use the MSHTML library. Should resolve most of the string parsing issues. It's worth noting that depending on how much of the HTML you actually want, it could be beneficial to just just parse what it in the doc object and stick it in Excel rather than downloading the file. You can use the CreateDocumentFromURL on the web URL, but I wanted to work within the framework of your scenario.

Option Explicit

Private Declare Function URLDownloadToFileA Lib "urlmon" (ByVal pCaller As Long, _
ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, _
ByVal lpfnCB As Long) As Long

Public Sub ImportSourceCode()
'Code written by Aaron Bush 06/10/2009 - Free for public use.
'Credit: It's nice but whatever;)
'Microsoft HTML Object Library for this to work.
'You will want to set a reference to C:\Windows\system32\mshtml.tlb
Const strURL_c As String = _
"http://www.w3schools.com/TAGS/tryit.asp?filename=tryhtml_iframe"
Const strSaveFolder_c As String = "C:\Test\HTMLDump\"
Dim strLocalPath As String
Dim strURLs() As String
Dim lngIndx As Long
Dim lngUprBnd As Long
ClearFolder strSaveFolder_c
ReDim strURLs(lngIndx) As String
strURLs(lngIndx) = strURL_c
Do
strLocalPath = DownloadFile(strURLs(lngIndx), strSaveFolder_c)
lngUprBnd = AppendArray(strURLs, GetIFrameURLS("file:///" & _
strLocalPath, strURLs(lngIndx)))
lngIndx = lngIndx + 1&
Loop Until lngIndx > lngUprBnd
ImportAll strSaveFolder_c, Sheet1
ClearFolder strSaveFolder_c
End Sub

Private Sub ClearFolder(ByVal folder As String)
Dim strPath As String
folder = FixFolderName(folder)
strPath = Dir(folder)
Do While LenB(strPath)
Kill folder & strPath
strPath = Dir
Loop
End Sub

Private Sub ImportAll(ByVal folder As String, ByVal ws As Excel.Worksheet)
Dim strPath As String
ws.UsedRange.Delete
folder = FixFolderName(folder)
strPath = Dir(folder)
Do While LenB(strPath)
With ws.QueryTables.Add("TEXT;" & folder & strPath, _
ws.UsedRange.SpecialCells(xlCellTypeLastCell).Offset(1, 0))
.TextFileParseType = xlFixedWidth
.TextFileColumnDataTypes = Array(1)
.TextFileFixedColumnWidths = Array(1024)
.Refresh False
End With
strPath = Dir
Loop
End Sub

Private Function DownloadFile(ByVal URL As String, localFolder As String) As _
String
'Thanks Mentalis:)
Dim lngRetVal As Long
Dim strLocalFile As String
strLocalFile = FixFolderName(localFolder) & GetGUID & ".html"
lngRetVal = URLDownloadToFileA(0, URL, strLocalFile, 0, 0)
If lngRetVal Then
Err.Raise Err.LastDllError, , "Download failed."
End If
DownloadFile = strLocalFile
End Function

Private Function FixFolderName(ByVal folder As String) As String
Dim strRtnVal As String
If AscW(Right$(folder, 1&)) = 92& Then
strRtnVal = folder
Else
strRtnVal = folder & "\"
End If
FixFolderName = strRtnVal
End Function

Private Function GetGUID() As String
GetGUID = Mid$(CreateObject("Scriptlet.TypeLib").GUID, 2&, 36&)
End Function

Private Function AppendArray(ByRef appendTo() As String, ByRef appendFrom() As _
String) As Long
Const lngEmpty_c As Long = -1&
Dim lngUprBndTo As Long
Dim lngUprBndFrom As Long
Dim lngUprBndNew As Long
Dim lngIndxTo As Long
Dim lngIndxFrom As Long
lngUprBndTo = SafeUBound(appendTo)
lngUprBndFrom = SafeUBound(appendFrom)
If lngUprBndFrom <> lngEmpty_c Then
lngUprBndNew = lngUprBndTo + lngUprBndFrom + 1&
ReDim Preserve appendTo(lngUprBndNew)
For lngIndxTo = lngUprBndTo + 1 To lngUprBndNew
appendTo(lngIndxTo) = appendFrom(lngIndxFrom)
lngIndxFrom = lngIndxFrom + 1&
Next
Else
lngUprBndNew = lngUprBndTo
End If
AppendArray = lngUprBndNew
End Function

Private Function SafeUBound(ByRef value() As String) As Long
Dim lngRtnVal As Long
On Error Resume Next
lngRtnVal = UBound(value)
If Err Then
lngRtnVal = -1
End If
SafeUBound = lngRtnVal
End Function



Private Function GetIFrameURLS(ByVal fileURL As String, ByVal remoteURL As _
String) As String()
'Note: Assumes fileURL is prefixed with "file:///".
Dim docNew As New MSHTML.HTMLDocument '<-Must be dimmed as new to work.
Dim doc As MSHTML.HTMLDocument
Dim iFrame As MSHTML.HTMLIFrame
Dim objElmnt As Object
Dim lngIndx As Long
Dim strRtnVal() As String
Dim strBaseURL As String
Dim strDomain As String
If Left$(fileURL, 8&) <> "file:///" Then
Err.Raise vbObjectError, , _
"fileURL parameter must be prefixed with ""file:///""."
End If
strDomain = GetDomain(remoteURL)
strBaseURL = Left$(remoteURL, InStrRev(remoteURL, "/"))
Set doc = docNew.createDocumentFromUrl(fileURL, vbNullString)
Do Until doc.readyState = "complete"
DoEvents
Loop
For Each objElmnt In doc.all
'Don't try to use the typeof operator here, it will yield false positives.
If TypeName(objElmnt) = "HTMLIFrame" Then
Set iFrame = objElmnt
If LenB(iFrame.src) Then
ReDim Preserve strRtnVal(lngIndx) As String
If InStrB(iFrame.src, "//") = 0& Then
If AscW(iFrame.src) = 47& Then
strRtnVal(lngIndx) = strDomain & iFrame.src
Else
strRtnVal(lngIndx) = strBaseURL & iFrame.src
End If
Else
strRtnVal(lngIndx) = iFrame.src
End If
lngIndx = lngIndx + 1&
End If
End If
Next
GetIFrameURLS = strRtnVal
End Function

Private Function GetDomain(ByVal URL As String) As String
Dim lngDPos As Long
Dim strRtnVal As String
lngDPos = InStr(URL, ":")
If lngDPos Then
Select Case LCase$(Left$(URL, lngDPos))
Case "http:"
lngDPos = InStr(8&, URL, "/")
Case "https", "file"
lngDPos = InStr(9&, URL, "/")
Case Else
Err.Raise vbObjectError, "Unrecognized URL scheme"
End Select
If lngDPos Then
strRtnVal = Left$(URL, lngDPos - 1&)
Else
strRtnVal = URL
End If
Else
Err.Raise vbObjectError, "Unrecognized URL scheme"
End If
GetDomain = strRtnVal
End Function

mdmackillop
06-10-2009, 10:33 AM
Hi Aaron,
For whatever reason it is not getting to the Data level. This is a fairly secure site, and possibly the data is not exposed via a visible page nasme. Does that make sense? I could email you the mht file if that would help?
I do get all the data in the MHT. Would that be an easier way to go? I've had a look for SaveAs MHT, but no success as yet! Whilst line-breaks mess up the data a bit, I'm sure I can extract what I need from that format.
Regards
Malcolm

BTW, a minor bug in your code


Private Function GetDomain(ByVal URL As String) As String
Dim lngDPos As Long
Dim strRtnVal As String

'My target site is 'https://link etc.
lngDPos = InStr(URL, ":")
If lngDPos Then
Select Case LCase$(Left$(URL, lngDPos))
Case "http:"
lngDPos = InStr(8&, URL, "/")
'Should these contain the colon?
Case "https", "file"
lngDPos = InStr(9&, URL, "/")
Case Else
Err.Raise vbObjectError, "Unrecognized URL scheme"
End Select

stanl
11-27-2009, 01:12 PM
If you can find a way to navigate to the data level, I find Logparser the easiest way to stream to a file, especially because you can specify interesting WHERE clauses. For example, here is this thread. Stan


cURL="http://www.vbaexpress.com/forum/showthread.php?t=27050"
cFile="c:\temp\url.txt"
cSQL = "SELECT text INTO " & cFile & " from " & cURL"
oLog = CreateObject("MSUtil.LogQuery")
oInput = CreateObject("MSUtil.LogQuery.TextLineInputFormat")
oOut = CreateObject("MSUtil.LogQuery.NativeOutputFormat.1")
oOut.Direct = 1
oLog.ExecuteBatch(cSQL, oInput, oOut)
oInput=Nothing
oOut=Nothing
oLog=Nothing

mdmackillop
11-28-2009, 04:34 AM
Hi Stan,
I'll have a look at this at work to test it on the secure site I'm working with
Regards
Malcolm

stanl
11-28-2009, 09:30 AM
Hi Stan,
I'll have a look at this at work to test it on the secure site I'm working with
Regards
Malcolm

I wasn't reallly sure from your post if you wanted to just get the source, or wanted to obtain what the users filled in forms as. Also, in IE8 (and probably IE7) 'Tabs' are a different issue and best programmed through InternetExplorer.Application.1

Stan

Shred Dude
05-02-2010, 09:31 PM
Just reading through this old post and was wondeirng if a simpler method would suffice? If you just want to capture the HTML Source of a known URL into a Text file, would something like this work for you?


Public Sub saveSource()
Dim myFile As String
Dim strURL As String
Dim myText As String

strURL = "http://www.google.com"

myFile = Application.DefaultFilePath & "\" & _
Replace(strURL, "http://", "") & " - SOURCE.txt"

With CreateObject("Microsoft.XMLHTTP")
.Open "GET", strURL, False
.send
myText = .responsetext
End With

With CreateObject("Scripting.FileSystemObject").CreateTextFile(myFile, True)
.Write (myText)
.Close
End With

End Sub

It sounded like you'd know how to navigate to the correct webpage to retrieve the previously inputted form data. Wasn't sure if that source would reveal the Frame contents for you or not.