PDA

View Full Version : Extract Hyperlinks from Word into Excel



Bullracer2
12-09-2013, 07:01 PM
Hi I've spent hours trawling these forums and thought it time to ask for help!

Essentially I'm wanting to extract all hyperlinks from a few hundred word documents (of various sizes) into an excel file. The information i'm after is the document name, hyperlink text, and hyperlink address. I've setup 3 columns accordingly for this.

I think the closest i've got is from using some code macropod wrote however it does nothing in my situation. I believe the issue is around identifying the hyperlink and perhaps some other formatting i've used the .style = "hyperlink" to find. Also not sure how the extract and display the 3 pieces of information i require...


Sub GetHyperlinks()
Application.ScreenUpdating = False
Dim StrFolder As String, StrFile As String, StrFnd As String, StrTxt As String
Dim wdApp As Object, wdDoc As Object, bStrt As Boolean
Dim WkSht As Worksheet, LRow As Long, LCol As Long, i As Long
Const wdFindContinue As Long = 1, wdFindStop As Long = 0
Const wdReplaceOne As Long = 0, wdReplaceAll As Long = 2
StrFolder = GetFolder
If StrFolder = "" Then Exit Sub
Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.Cells.SpecialCells(xlCellTypeLastCell).Row
LCol = WkSht.Cells.SpecialCells(xlCellTypeLastCell).Column
' Test whether Word is already running.
On Error Resume Next
bStrt = False ' Flag to record if we start Word, so we can close it later.
Set wdApp = GetObject(, "Word.Application")
'Start Excel if it isn't running
If wdApp Is Nothing Then
Set wdApp = CreateObject("Word.Application")
If wdApp Is Nothing Then
MsgBox "Can't start Word.", vbExclamation
Exit Sub
End If
' Record that we've started Word.
bStrt = True
End If
On Error GoTo 0
StrFile = Dir(StrFolder & "\*.doc", vbNormal)
While StrFile <> ""
LRow = LRow + 1
Set wdDoc = wdApp.Documents.Open(Filename:=StrFolder & "\" & StrFile, AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
'Do some pre-processing cleanup
With wdDoc
'Get the data for each defined Excel column
For i = 1 To LCol
StrFnd = WkSht.Cells(1, i).Value
With .Range
With .Find
.ClearFormatting
.Style = "Hyperlink"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchWildcards = True
.Execute
End With
If .Find.Found = True Then
'Parse the data
StrTxt = .Duplicate.Text
'Update Excel
WkSht.Cells(LRow, i).Value = StrTxt
End If
End With
Next
.Close SaveChanges:=False
End With
StrFile = Dir()
Wend
If bStrt = True Then wdApp.Quit
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing
Application.ScreenUpdating = True
End Sub

Thanks

macropod
03-02-2014, 06:17 PM
It seems this thread was unanswered (I guess that might be the result of posting a query that concerns Word automation in an Excel forum). You'd have done better to post in http://www.vbaexpress.com/forum/forumdisplay.php?21-Integration-Automation-of-Office-Applications-Help. I believe you could use:

Sub GetHyperlinks()
Application.ScreenUpdating = False
Dim StrFolder As String, StrFile As String
Dim wdApp As Object, wdDoc As Object
Dim WkSht As Worksheet, LRow As Long, LCol As Long, i As Long
Const wdFindContinue As Long = 1, wdFindStop As Long = 0
StrFolder = GetFolder
If StrFolder = "" Then Exit Sub
Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.Cells.SpecialCells(xlCellTypeLastCell).Row
LCol = WkSht.Cells.SpecialCells(xlCellTypeLastCell).Column
Set wdApp = CreateObject("Word.Application")
If wdApp Is Nothing Then
MsgBox "Can't start Word.", vbExclamation
Exit Sub
End If
StrFile = Dir(StrFolder & "\*.doc", vbNormal)
While StrFile <> ""
LRow = LRow + 1
Set wdDoc = wdApp.Documents.Open(Filename:=StrFolder & "\" & StrFile, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
With wdDoc
'Get the data for each defined Excel column
For i = 1 To LCol
With .Range
With .Find
.ClearFormatting
.Text = WkSht.Cells(1, i).Value
.Style = "Hyperlink"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute
End With
If .Find.Found = True Then
'Update Excel
WkSht.Cells(LRow, i).Value = .Text
End If
End With
Next
.Close SaveChanges:=False
End With
StrFile = Dir()
Wend
wdApp.Quit
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing
Application.ScreenUpdating = True
End Sub
Note: in the code you posted, you had:
.Style = "Hyperlink"
and:
.Format = False
The latter cancels the former.

Bullracer2
03-02-2014, 07:03 PM
Thanks Paul, apologies for posting in the incorrect area. I had today coincidentally opened a new thread about extracting the contents into word.
Haven't managed to get either to work with the code above i can't get anything to fire at all (i.e. not even to select the directory with no error messages either) and can't see whats wrong with the code.

macropod
03-02-2014, 07:14 PM
Do you have the GetFolder function that the macro uses available in the code module? It would have been part of the code you originally sourced the sub from.

Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0,
"Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function

Bullracer2
03-02-2014, 08:35 PM
Thanks, have it mostly working now with the exception that it is only returning the first hyperlink in each document.

I thought changing the .find.found to a loop would work, but it just causes word to hand - presume it is stuck in a loop.


While StrFile <> "" LRow = LRow + 1
Set wdDoc = wdApp.Documents.Open(Filename:=StrFolder & "\" & StrFile, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
With wdDoc
'Get the data for each defined Excel column
For i = 1 To LCol
With .Range
With .Find
.ClearFormatting
.Text = WkSht.Cells(1, i).Value
.Style = "Hyperlink"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute
End With
Do While .Find.Found = True
'Update Excel
WkSht.Cells(LRow, i).Value = .Text
Loop
End With
Next
.Close SaveChanges:=False
End With
StrFile = Dir()
Wend

macropod
03-02-2014, 09:09 PM
OK, so you have:
For i = 1 To LCol
What is the value of LCol? What are the entries on row 1 of Sheet1?

Adding a loop does nothing for the code, since all you'd be doing is overwriting WkSht.Cells(LRow, i) with the same value on each iteration. It would also continue ad-infinitum, since there's nothing that would cause the code to exit the loop; there isn't even anything that might encourage the macro to continue on to the next instance.

Bullracer2
03-02-2014, 09:28 PM
Ok not as far along as i thought. The initial running of the code on test data i just had the row 1 blank and it returned the first hyperlink of each document.

Once i populated headings in row 1, being Document Name, Hyperlink Text, Hyperlink Address it causes no data to be entered. Ideally that's what i'd like but realise i'm in over my head and would be satisfied with just returning all the hyperlinks as a list (either word or excel) from a folder full of documents. I realise i haven't defined what needs to go into those three columns but have no idea where to start :(

macropod
03-02-2014, 09:43 PM
Those headings have nothing to do with what your code is doing! You code is trying to find and output hyperlinks whose text is the same as whatever you have on Row 1 (i.e. your headings).

In this context, your use of LCol is completely irrelevant. So, you can delete all reference to it. You can also delete the 'Const' line.

And, instead of the current 'While StrFile <> "" ... Wend' loop, you can use:

While StrFile <> ""
Set wdDoc = wdApp.Documents.Open(FileName:=StrFolder & "\" & StrFile, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=True)
With wdDoc
'Get the data for each defined Excel column
For i = 1 To .Hyperlinks.Count
LRow = LRow + 1
WkSht.Cells(LRow, 1).Value = .FullName
WkSht.Cells(LRow, 2).Value = .Hyperlinks(i).TextToDisplay
WkSht.Cells(LRow, 3).Value = .Hyperlinks(i).Address
Next
.Close SaveChanges:=False
End With
StrFile = Dir()
Wend

Bullracer2
03-02-2014, 11:07 PM
Makes sense when its explained - not so when deciphering the code. Anyway the updates you provided me have the macro working as i intended, so i really appreciate your assistance! :)