PDA

View Full Version : Solved: Search document for words in keyword list



sassora
07-05-2012, 01:02 AM
Hi,

I have a list of keywords that i want to search for in docuuments that are on my excel list. The paths are in the same sheet and the documents are in msoffice or web pages. I would like to return the list words that are found for each document and deposit them on the same row.

Any ideas?

Kenneth Hobs
07-05-2012, 05:35 AM
Break your project into smaller bits.

When you say MSOffice, do you mean MSWord or PowerPoint or Excel or Outlook or all or just what? If MSWord, will it be DOC, DOCX, DOCM, file types or what?

For the web searching, one can easily and quickly get source text and search in that. It would probably be better if you were looking for some specific content in a control or such.

sassora
07-05-2012, 08:55 AM
Break your project into smaller bits.

When you say MSOffice, do you mean MSWord or PowerPoint or Excel or Outlook or all or just what? If MSWord, will it be DOC, DOCX, DOCM, file types or what?

For the web searching, one can easily and quickly get source text and search in that. It would probably be better if you were looking for some specific content in a control or such.

Of the 563 documents I currently have the majority are in doc format (see table). This is probably the best place to start. Thanks for your comment about breaking the project down.

doc 494
docx 6
pdf 6
ppt 21
web 31
xls 5

Grand Total 563

If I have a list of words to find then how is it best to identify them in a word document and record the keywords that are contained in the document in Excel?

Thanks for your help

CatDaddy
07-05-2012, 08:57 AM
the InStr() function is good for finding words in a string

Kenneth Hobs
07-05-2012, 09:17 AM
Searching in the body of the DOC file is easily done by opening in the foreground and then doing a search. Searching in other structures like headers and footers takes more work.

Are you passing key words or key phrases? Is the list in an array? Did you want to know which words were found and if found, how many? Show an example of the output that you expect.

PDF files can not always be searched since they may just be scans. Even if they are not, that part will probably be the most difficult. If you have the Adobe Acrobat and not just the reader, that task might be a bit easier.

sassora
07-05-2012, 10:44 AM
I think that searching in the foreground of the DOC should be enough. There will be some phrases included in the list. To make the list editable to a basic excel user, it would be good to locate it in the workbook.

The number of times a phrase is found (if it is found) would be useful. If the number was very low then it might be that the phrase wasn't a lot to do with the document.

I am pretty much certain that none of the PDFs will be scans so that's convenient. I only have the reader and not the full version.

I'm travelling at the moment - I will upload an example of what I am looking for soon. Thanks for your help

sassora
07-05-2012, 12:43 PM
Hi

I have attached an example workbook.

Kenneth Hobs
07-05-2012, 06:17 PM
Where will the first 4 column's data come from? I could search for *.doc and *.docx and create the link for column B if needed. The first question must be how do you want to set the filenames to search?

Kenneth Hobs
07-05-2012, 07:30 PM
I won't be able to visit until late afternoon tomorrow. Until then, see if this helps any. Replace the value of fn with your full filename. Add the reference as commented or comment the Dim for the object and uncomment the Dim line where the objects were defined as objects to use late binding instead.

Sub Test_CountWordsPhrasesInDocAndTextBoxes()
Dim a() As Variant, r As Range, i As Integer
Dim fn As String

fn = "x:\MSWord\ken.docx"
With Worksheets("List of key phrases to search")
Set r = .Range("A3:A" & .Range("A" & Rows.count).End(xlUp).Row)
End With

a() = CountWordsPhrasesInDocAndTextBoxes(fn, r)
'a() = CountWordsPhrasesInDocAndTextBoxes(fn, r, tfDocVisible:=False)

For i = 1 To UBound(a)
MsgBox "Filename: " & vbLf & fn & vbLf & vbLf & _
"Word/Phrase: " & r(i, 1) & vbLf & _
"Count: " & a(i), vbInformation, _
"Word/Phrase Count"
Next i
End Sub


Function CountWordsPhrasesInDocAndTextBoxes(doc As String, rWP As Range, _
Optional tfMatchWholeWord As Boolean = True, _
Optional tfMatchCase As Boolean = True, _
Optional inTextBoxes As Boolean = True, _
Optional tfDocVisible As Boolean = True, _
Optional tfCloseDoc As Boolean = True) As Variant

' Tools > References > Microsoft Word 14.0 Object Libray
Dim wdApp As Word.Application, WD As Word.Document 'Early Binding requires reference.
'Dim wdApp As Object, WD As Object 'Late Binding
Dim y As Integer, i As Integer, cI() As Variant

ReDim cI(1 To rWP.Cells.count)
Const wdMainTextStory As Integer = 1
Const wdReplaceAll As Integer = 2 'Not needed.
Const wdFindContinue = 1

On Error Resume Next
Set wdApp = GetObject(, "Word.Application")
If Err.Number <> 0 Then Set wdApp = CreateObject("Word.Application")
On Error GoTo 0

If Dir(doc) = "" Then Exit Function
Set WD = wdApp.Documents.Open(doc)
wdApp.Visible = tfDocVisible
On Error GoTo CloseNow

' Search for and count occurrences of the text typed.
For i = 1 To rWP.Cells.count
y = 0
With WD.Content.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
' Display message in Word's Status Bar.
y = y + 1
Loop
End With

'Now search all other stories using Ranges
Dim mystoryrange As Object
'wdApp.Activate
For Each mystoryrange In WD.StoryRanges
If mystoryrange.StoryType <> wdMainTextStory Then
With mystoryrange.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
y = y + 1
Loop
End With
Do While Not (mystoryrange.NextStoryRange Is Nothing)
Set mystoryrange = mystoryrange.NextStoryRange
With mystoryrange.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
y = y + 1
Loop
End With
Loop
End If
Next mystoryrange
cI(i) = y 'Poke total word/phrase count into array for return.
Next i

CloseNow:
If tfCloseDoc Then
WD.Close False
Set WD = Nothing
wdApp.Quit
Set wdApp = Nothing
End If

CountWordsPhrasesInDocAndTextBoxes = cI()
End Function

sassora
07-05-2012, 10:48 PM
This is great.

The first four columns relate to information already collected. Column B would be where fn is be defined from (if it is a DOC or DOCX). I think it would be useful for the sub to return only the 10 most common key phrases on the key phrase list, perhaps in order of frequency. If there are no matches they wouldn't have to be returned at all.

Thanks

sassora
07-06-2012, 11:21 PM
Do you have any advice on for searching web links from excel?

snb
07-07-2012, 08:09 AM
if the document's fullname is in column A in sheet "documents"
if the key words/phrases are in column A in sheet "list"


sub snb()
sn=sheets("documents").columns(1).specialcells(2).resize(2)
sp=sheets("list").columns(1).specialcells(2)

for j=1 to ubound(sn)
c01=getobject(sn(j,1)).content
for jj=1 to ubound(sp)
sn(j,2)=sn(j,2) & " " & sp(j,1) & "_" & ubound(split(c01,sp(j,1)))+1
next
next

sheets("documents").columns(1).specialcells(2).resize(2)=sn
end sub

sassora
07-07-2012, 09:00 AM
should sp also have a resize property?

snb
07-07-2012, 09:13 AM
No, the array will be used only to read it's information (key words/phrases)
sn is being resized because the result of the 'searching' process will be written into the second 'column' of array sn and at last be written into columns A & B in sheet 'documents'.

But without resizing sn this would produce the same rsult:


Sub snb()
sn=sheets("documents").columns(1).specialcells(2)
sp=sheets("list").columns(1).specialcells(2)

For j=1 To UBound(sn)
c01=getobject(sn(j,1)).content
sn(j,1)=""
For jj=1 To UBound(sp)
sn(j,1)=sn(j,1) & " " & sp(j,1) & "_" & UBound(split(c01,sp(j,1)))+1
Next
Next

sheets("documents").columns(1).specialcells(2).offset(,1)=sn
End Sub

sassora
07-07-2012, 09:38 AM
I get a type mismatch for For j = 1 To UBound(sn) - with the resize this doesn't happen (similar to what was happening with sb)

Also the jj variable isn't listed in the body of the second for loop, is this to repeat the same thing UBound(sp) number of times?

sassora
07-07-2012, 11:07 AM
Would you be able to upload a workbook with this functioning? I can't seem to get it to work.

snb
07-07-2012, 12:53 PM
Here you go....

the amended code:
Sub snb()
sn=sheets("documents").columns(1).specialcells(2)
sp=sheets("list").columns(1).specialcells(2)

For j=1 To UBound(sn)
c01=getobject(sn(j,1)).content
sn(j,1)=""
For jj=1 To UBound(sp)
sn(j,1)=sn(j,1) & " " & sp(jj,1) & "_" & UBound(split(c01,sp(jj,1)))+1
Next
Next

sheets("documents").columns(1).specialcells(2).offset(,4)=sn
End Sub

sassora
07-07-2012, 01:43 PM
Thanks that's really helpful - I added a set statement to c01 and found that it works well for multiple word documents.

I notice that it opens them files in the background - if I am opening 500-1000 files these may build up in the memory, is there a fairly simple way to close them again? I think Ken's code above takes care of this, are there pro and cons of using either?

When I include links to other filetypes (other than Word), the "object doesn't support this method/property" dialog box appears. Is there any scope to adjust the code for single web pages?

It's great to have your insight

Kenneth Hobs
07-07-2012, 02:43 PM
I have not had time to figure out how to make snb's workbook code work for me. I am not sure if snb's code handles textbox text or the other story sections. Obviously, if it works for you, go for it. Any routine that works in the background will be faster than foreground methods providing it meets your needs.

To sort, I added a worksheet named Scratch and hide it. The Redim Preserve keeps the first 10.

Later, I will post code on how to get source code. As I said earlier, if you just wanted parts of a web page such as links, that would be a different solution. Once you have source code placed into a text file, the file could be saved to a docx file in a temp folder and then use one of the methods that you prefer.
Sub Test_CountWordsPhrasesInDocAndTextBoxes()
Dim a() As Variant, r As Range, i As Integer
Dim fn As String

fn = "x:\MSWord\ken.docx"
With Worksheets("List of key phrases to search")
Set r = .Range("A3:A" & .Range("A" & Rows.count).End(xlUp).Row)
End With

'a() = CountWordsPhrasesInDocAndTextBoxes(fn, r)
a() = CountWordsPhrasesInDocAndTextBoxes(fn, r, tfDocVisible:=False)

For i = 1 To UBound(a, 2)
MsgBox "Filename: " & vbLf & fn & vbLf & vbLf & _
"Word/Phrase: " & a(1, i) & vbLf & _
"Count: " & a(2, i), vbInformation, _
"Word/Phrase Count"
Next i

a() = WorksheetFunction.Transpose(a)
ScratchSort a(), Worksheets("Scratch")
a() = WorksheetFunction.Transpose(a)

ReDim Preserve a(1 To 2, 1 To 10)
For i = 1 To UBound(a, 2)
MsgBox "Filename: " & vbLf & fn & vbLf & vbLf & _
"Word/Phrase: " & a(1, i) & vbLf & _
"Count: " & a(2, i), vbInformation, _
"Word/Phrase Count"
Next i
End Sub

Sub ScratchSort(ByRef Array2d() As Variant, ws As Worksheet)
Dim rCount As Long
With ws
rCount = .UsedRange.Rows.count
.UsedRange.Clear
.Range("A1").Resize(UBound(Array2d, 1), UBound(Array2d, 2)).Value = Array2d
.Sort.SortFields.Add Key:=Range("B1:B" & rCount) _
, SortOn:=xlSortOnValues, Order:=xlDescending, DataOption:=xlSortNormal
.Sort.SortFields.Add Key:=Range("A1:A" & rCount) _
, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
With .Sort
.SetRange Range("A1:B" & rCount)
.Header = xlGuess
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
.Sort.SortFields.Clear
Array2d() = .UsedRange.Value
End With
End Sub

Function CountWordsPhrasesInDocAndTextBoxes(doc As String, rWP As Range, _
Optional tfMatchWholeWord As Boolean = True, _
Optional tfMatchCase As Boolean = True, _
Optional inTextBoxes As Boolean = True, _
Optional tfDocVisible As Boolean = True, _
Optional tfCloseDoc As Boolean = True) As Variant

Dim cCount As Integer
' Tools > References > Microsoft Word 14.0 Object Libray
Dim wdApp As Word.Application, WD As Word.Document 'Early Binding requires reference.
'Dim wdApp As Object, WD As Object 'Late Binding
Dim y As Integer, i As Integer, cI() As Variant

cCount = rWP.Cells.count
ReDim cI(1 To 2, 1 To cCount)
Const wdMainTextStory As Integer = 1
Const wdReplaceAll As Integer = 2 'Not needed.
Const wdFindContinue = 1

On Error Resume Next
Set wdApp = GetObject(, "Word.Application")
If Err.Number <> 0 Then Set wdApp = CreateObject("Word.Application")
On Error GoTo 0

If Dir(doc) = "" Then Exit Function
Set WD = wdApp.Documents.Open(doc)
wdApp.Visible = tfDocVisible
On Error GoTo CloseNow

' Search for and count occurrences of the text typed.
For i = 1 To cCount
y = 0
With WD.Content.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
' Display message in Word's Status Bar.
y = y + 1
Loop
End With

'Now search all other stories using Ranges
Dim mystoryrange As Object
'wdApp.Activate
For Each mystoryrange In WD.StoryRanges
If mystoryrange.StoryType <> wdMainTextStory Then
With mystoryrange.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
y = y + 1
Loop
End With
Do While Not (mystoryrange.NextStoryRange Is Nothing)
Set mystoryrange = mystoryrange.NextStoryRange
With mystoryrange.Find
Do While .Execute(FindText:=rWP(i, 1), MatchCase:=tfMatchCase, _
Forward:=True, Format:=True, _
MatchWholeWord:=tfMatchWholeWord) = True
y = y + 1
Loop
End With
Loop
End If
Next mystoryrange
cI(1, i) = rWP(i, 1)
cI(2, i) = y 'Poke total word/phrase count into array for return.
Next i

CloseNow:
If tfCloseDoc Then
WD.Close False
Set WD = Nothing
wdApp.Quit
Set wdApp = Nothing
End If

CountWordsPhrasesInDocAndTextBoxes = cI()
End Function

Kenneth Hobs
07-07-2012, 02:50 PM
Here is an example using winhttp to get source text. http://www.vbaexpress.com/forum/showthread.php?p=238904

Aussiebear
07-08-2012, 12:46 AM
Here you go....

the amended code:
Sub snb()
sn(j,1)=sn(j,1) & " " & sp(jj,1) & "_" & UBound(split(c01,sp(jj,1)))+1


What does this line do

sassora
07-08-2012, 01:19 AM
sn(j,1)=sn(j,1) & " " & sp(jj,1) & "_" & UBound(split(c01,sp(jj,1)))+1

For document, j, it produces an element in the sn array which has the keyphrases with the number of occurences next to them.

For example:
sn(1,1) =
Arithmetic_0 Data_0 Education_8 French_0 Gap_0 Gender_0 Languages_0 Policy_0 Reading_0 Speaking_0 Writiing_0 and_113 the_97

It's a neat way of extracting the information.

sassora
07-08-2012, 01:49 AM
Thanks all for help, I'll see where I get with moving this forward. If you have any addtional thoughts - feel free to post them.

I will mark this thread as solved.

snb
07-08-2012, 02:47 AM
To check for extensions you can use exempli gratia (e.g.)
Sub snb()
' activate the reference to Microsoft WinHttp services, version 5.1

sn = Sheets("documents").Columns(1).SpecialCells(2)
sp = Sheets("List").Columns(1).SpecialCells(2)

For j = 1 To UBound(sn)
Select Case CreateObject("scripting.filesystemobject").getextensionname(sn(j, 1))
Case "doc", "dot", "docx", "docm", "dotm"
c01 = GetObject(sn(j, 1)).Content
Case "csv", "txt"
c01 = CreateObject("scripting.filesystemobject").opentextfile(sn(j, 1)).readall
Case "htm", "html"
With New WinHttpRequest
.Open "Get", sn(j, 1)
.Send
c01 = .ResponseText
End With
End Select

sn(j, 1) = ""
For jj = 1 To UBound(sp)
sn(j, 1) = sn(j, 1) & " " & sp(jj, 1) & "_" & UBound(Split(c01, sp(jj, 1)))
Next
Next

Sheets("documents").Columns(1).SpecialCells(2).Offset(, 4) = sn
End Sub

sassora
07-08-2012, 03:44 AM
To check for extensions you can use exempli gratia (e.g.)
Sub snb()
' activate the reference to Microsoft WinHttp services, version 5.1

sn = Sheets("documents").Columns(1).SpecialCells(2)
sp = Sheets("List").Columns(1).SpecialCells(2)

For j = 1 To UBound(sn)
Select Case CreateObject("scripting.filesystemobject").getextensionname(sn(j, 1))
Case "doc", "dot", "docx", "docm", "dotm"
c01 = GetObject(sn(j, 1)).Content
Case "csv", "txt"
c01 = CreateObject("scripting.filesystemobject").opentextfile(sn(j, 1)).readall
Case "htm", "html"
With New WinHttpRequest
.Open "Get", sn(j, 1)
.Send
c01 = .ResponseText
End With
End Select

sn(j, 1) = ""
For jj = 1 To UBound(sp)
sn(j, 1) = sn(j, 1) & " " & sp(jj, 1) & "_" & UBound(Split(c01, sp(jj, 1)))
Next
Next

Sheets("documents").Columns(1).SpecialCells(2).Offset(, 4) = sn
End Sub

Well that simplifies things!

Would you know how to go from this, to a sorted list of key phrases (say the top 10). I think you get the top ten list by fixing an array to having 10 elements?

Ideally, I am looking to have have the output for a given document as
phrase1 / count1 / phrase2 / count2 in adjacent cells on the same row.

snb
07-08-2012, 04:51 AM
Would you know how to go from this, to a sorted list of key phrases (say the top 10).

Yes.
But with the code provided you should be able to do so yourself.



I think you get the top ten list by fixing an array to having 10 elements?


I don't think so. You will get the first 10 items in the list in sheet 'list'.
But that isn't related to any frequency.

sassora
07-08-2012, 05:35 AM
I was thinking that the array should be sorted according to the frequency and then just focus on the top 10 items for that document.

I'm going on a vba course next month, so hopefully I will pick up some tips there.

Are my first steps to create an array with more dimensions? (so I can define the phrase and count separately for each document) Three dimensions seems what I would need, although I would want to phrase name and count to always have the same index after sorting. The sorting order of keywords for each document would depend on the number of phrases in that document.

snb
07-08-2012, 05:49 AM
I'd say this would suffice:


sp = Sheets("List").Columns(1).SpecialCells(2).resize(,2)

sassora
07-08-2012, 10:46 AM
When I try to run code in post #24

I get either
"Runtime error 432 - File name or class name not found during automated operation"

or object not found.

Not sure what to do with the last post, I've been fiddling with this for most of the day

Any ideas?

snb
07-08-2012, 12:14 PM
did you read:

' activate the reference to Microsoft WinHttp services, version 5.1




And.... use F8 while testing the code to step through it line by line.
It would be nice to give feedback as detailed as possible: when, wher, value of variables, etc.

sassora
07-08-2012, 10:21 PM
Yes I activated the reference.



The code doesn't start as such so whether I press f5 or f8, it's the same. It's not the type of alert that starts the degugging mode.

The cursor is at For j = 1 To UBound(sn) if that helps.

snb
07-08-2012, 11:48 PM
There should be some data in column A in sheet 'documents'.

sassora
07-09-2012, 12:28 AM
There is data in the documents and list tabs as needed.

snb
07-09-2012, 03:42 AM
Remove 'Option Explicit'

sassora
07-09-2012, 02:26 PM
Unfortunately not