PDA

View Full Version : search a column of keywords in a folder of .docs and match file names and string



sibjac
12-09-2017, 10:09 AM
I have a bunch of *.doc documents (around 1000 plus) with keyword (4 digits) almost at the end of the document. All these keywords start with "SD/# and four digits and ends with /". I want to find dcouments matching the 4 digits of keyword from A column of excel and fill the column B with file name and column C and D with the folder names (documents can be in 2 subfolders inside the folder(subfolders names are to identify the type of documents) with the matching keyword because there can be many documents with the same keyword. There are almost 300 keywords of (4 digits, eg: 0001, 8684, 0456, 0022, etc.) in column A of the worksheet. Will this be possible with VBscript. Your valuable help appreciated.

PC information: 64 bit i3 with win 7, office 2007

Kenneth Hobs
12-09-2017, 12:07 PM
So you want just the 1st DOC file even if there are more than one with a match? I don't know what TYPE means for columns C and D. Maybe if you attached a simple example file it would show what you want more clearly.

sibjac
12-09-2017, 06:55 PM
So you want just the 1st DOC file even if there are more than one with a match? I don't know what TYPE means for columns C and D. Maybe if you attached a simple example file it would show what you want more clearly.

Thank you for showing my interest and prompt reply. Want all the documents to listed one by one in column B with matching string in C and the foldernames which these documents are located in the next two colums. Extremely sorry for not being clear and misguiding...

Kenneth Hobs
12-09-2017, 07:17 PM
That does not match with post #1. So, if there is a parent folder like c:\Excel, with subfolders \Sub1 and \Sub2, then for each row, the file might come from one of 3 locations.

To attach a file(s), click the Go Advanced button in lower right of a reply, and then the Manage Attachments button below the reply box.

I will make up an example using what I think was meant by post #1.

sibjac
12-09-2017, 07:27 PM
I will post an example which I have manually done. Plese look into it.

sibjac
12-09-2017, 07:38 PM
That does not match with post #1. So, if there is a parent folder like c:\Excel, with subfolders \Sub1 and \Sub2, then for each row, the file might come from one of 3 locations.

exactly. thank you for your patience and interest[/QUOTE]


To attach a file(s), click the Go Advanced button in lower right of a reply, and then the Manage Attachments button below the reply box.

I have attached one in above reply.


I will make up an example using what I think was meant by post #1.

sorry for misleading, it would be great if you could make up an example matching the sample

Kenneth Hobs
12-09-2017, 08:04 PM
You said this in the file.



Please note there are one main folder and total 4 subfolders, VR, VR-EX, NVR, NVR-EX. Files are grouped into these foldes under the main folder to easily identify. Can there be a serial number for the list of be also added in another colum (column structure can be changed)




I don't know why column C has a 4 digit number. I thought your goal was to MATCH the 4 digit number in column A? If not, what is the purpose of column A.

Without clear goals, full solutions will be nearly impossible. Think about what could happen. e.g. All files contain SD/#7301/ with 7301 in column A: f1.doc, f2.doc, \VR\f1.doc, \VR\f3.doc, \NVR\f1.doc, \NVR\Ken.doc. In that scenario, f1.doc with be in column B, nothing would be in Column C as it came from the parent folder. What would go into column D is a mystery. I guess if one kept the same filename then VR would go into column D.

Since the goals are not clear enough to handle all scenarios, the best I can do for you is to show you how to solve some scenarios.

sibjac
12-09-2017, 08:24 PM
hi, the problem is the there are multple files containing the same 4 digit number in it. when file list is popluated I think we would not be able to identify the which file belong to which number. The document file names cannot be saved with the neither the folder details nor the 4 digit number to identify it. so if the number again is written in the column next to it we can identify the file easily with the 4 digit matching number. Hope, I am clear

sibjac
12-09-2017, 08:54 PM
the numbers in column C is actually the search result showning match with the file number. I have done it manually to show you how the end result has to be. I think by doing that I have created confusion.

Kenneth Hobs
12-09-2017, 08:57 PM
That was what I was saying. So, column A serves no purpose?

If you want all files with at least one match from a list, I would break that list out separately. e.g. Sheet2 column A or a named range somewhere else so there was no confusion.

Rather than multiple columns, I would list the found match in Sheet1 with found match in column A, base filename, in column B, and then the subfolder path in Column C if not in the parent folder.

I would probably make it more simple and include the full filename in a hyperlink with the base filename as the hypertext.

If you were to match in a file by a list, could a file have more than one match?

sibjac
12-09-2017, 09:19 PM
Column A is the keywords to be searched. The output is in Column B which will be having the file number alone and there are multiple files with the same number. So when the result populates how will we identify which all the files have a particular keyword as the keyword is inside the file not a part of the file number.

sibjac
12-09-2017, 09:24 PM
That was what I was saying. So, column A serves no purpose?

If you want all files with at least one match from a list, I would break that list out separately. e.g. Sheet2 column A or a named range somewhere else so there was no confusion.



Rather than multiple columns, I would list the found match in Sheet1 with found match in column A, base filename, in column B, and then the subfolder path in Column C if not in the parent folder.

if it can happen in the same sheet instead of breaking it, would be great, if no other go, then it is okay with multiple sheets.


I would probably make it more simple and include the full filename in a hyperlink with the base filename as the hypertext.

hypertext is okay but not necessary.


If you were to match in a file by a list, could a file have more than one match?

no there is only one match inside the file..(there are no other keywords than the ones in Column A)..

sibjac
12-09-2017, 09:38 PM
below is an example of my working sheet with explanations, hope that will help you to make out the senario involved

Kenneth Hobs
12-09-2017, 11:14 PM
Don't expect this to be fast as you wanted a many to many search. There might be something faster than Instr().


Sub Main()
Dim p$, fn$, i As Long, j As Long, r As Long, c As Integer
Dim a, b, e, rr As Range, cc As Range
Dim ws As Worksheet, o As Object, s$
Dim fso As Object 'New Scripting.FileSystemObject

'******************* INPUTS **********************************
p = ThisWorkbook.Path & "\" 'Parent folder
Set ws = Worksheets(1)
'******************* END INPUTS ******************************

'List of 4 digit numbers. 'e.g. SD/#7301/, SD/#0231/
Set rr = ws.Range("A2", ws.Cells(Rows.Count, "A").End(xlUp))

On Error GoTo EndSub
Application.DisplayAlerts = False

a = aFFs(p & "*.doc", , True)
If Not IsArray(a) Then Exit Sub

Set fso = CreateObject("Scripting.FileSystemObject")
ReDim b(1 To Rows.Count, 1 To 4)

For Each e In a
Set o = GetObject(e)
s = o.Content
For Each cc In rr
i = InStr(s, "SD/#" & cc.Text & "/")
If i > 0 Then
j = j + 1
b(j, 1) = fso.GetFile(CStr(e)).Name
b(j, 2) = cc.Text
fn = fso.GetParentFolderName(CStr(e))
If Len(fn) > Len(p) Then b(j, 3) = Right(fn, Len(fn) - Len(p))
b(j, 4) = WorksheetFunction.Round(Len(s) / 65, 0)
End If
Next cc
o.Close False
Next e

Set fso = Nothing
If j = 0 Then Exit Sub

b = Application.Index(b, Evaluate("row(1:" & j & ")"), Application.Transpose([row(1:4)]))
ws.[B2].Resize(j, 4).Value = b
ws.UsedRange.Columns.AutoFit

EndSub:
Set fso = Nothing
Application.DisplayAlerts = True
End Sub


'MyDir should end in a "\" character unless searching by wildcards, e.g. "x:\test\t*
'Command line switches for the shell's Dir, http://ss64.com/nt/dir.html
Function aFFs(myDir As String, Optional extraSwitches = "", _
Optional tfSubFolders As Boolean = False) As Variant

Dim s As String, a() As String, v As Variant
Dim b() As Variant, i As Long

If tfSubFolders Then
s = CreateObject("Wscript.Shell").Exec("cmd /c dir " & _
"""" & myDir & """" & " /b /s " & extraSwitches).StdOut.readall
Else
s = CreateObject("Wscript.Shell").Exec("cmd /c dir " & _
"""" & myDir & """" & " /b " & extraSwitches).StdOut.readall
End If

a() = Split(s, vbCrLf)
If UBound(a) = -1 Then
Debug.Print myDir & " not found.", vbCritical, "Macro Ending"
Exit Function
End If
ReDim Preserve a(0 To UBound(a) - 1) As String 'Trim trailing vblfcr

For i = 0 To UBound(a)
If Not tfSubFolders Then
s = Left$(myDir, InStrRev(myDir, "\"))
'add the folder name
a(i) = s & a(i)
End If
Next i
aFFs = sA1dtovA1d(a)
End Function


Function sA1dtovA1d(strArray() As String) As Variant
Dim varArray() As Variant, i As Long
ReDim varArray(LBound(strArray) To UBound(strArray))
For i = LBound(strArray) To UBound(strArray)
varArray(i) = CVar(strArray(i))
Next i
sA1dtovA1d = varArray()
End Function

sibjac
12-10-2017, 02:19 AM
I placed the excel sheet with macro in the main folder and tried to run it. When I tries to run it, it hangs the

pc and excel. It took some time to renistate the pC back into the previous condition. On third attempt I could run

the macro but it gave a message and stopped after reading through all the documents "the document could not be

registered. It will not be possible create links from other documents to this doctument". It did not print

anything on the active sheet and did not create a new sheet, just plain excel sheet with the keywords in column A.

The word doc has a macro in it which utilizes the activx to take data from webpage and creates the document name. Can you please remove the hyperlinking section from the code.

The macro is installed in C Drive not in office folder and cannot be removed from the system as it is vital to the document. Can this be the reason? if so can the macro be killed during the running of the excel macro?.

snb
12-10-2017, 04:58 AM
The word doc has a macro in it which utilizes the activx to take data from webpage and creates the document name :spidereek


Please post a sample Word document.

sibjac
12-10-2017, 06:10 AM
Here is the sample folder with documents

snb
12-10-2017, 08:15 AM
I put the files in G:\test_snb\mainfolder, containing subfolders G:\test_snb\mainfolder\NVR, G:\test_snb\mainfolder\NVR-EX,G:\test_snb\mainfolder\VR and G:\test_snb\mainfolder\VR-EX

Then run this code:


Sub M_snb()
sn = ThisWorkbook.Sheets(1).Columns(1).SpecialCells(2)

with CreateObject("wscript.shell")
For j = 2 To UBound(sn)
MsgBox .Exec("cmd /c findstr /m/s SD/#" & Format(sn(j, 1), "0000") & "/ G:\test_snb\mainfolder\*.doc").StdOut.ReadAll
Next
end with
End Sub

Kenneth Hobs
12-10-2017, 09:22 AM
snb's example was to show how findstr can help. Change MsgBox.Exec to MsgBox .Exec in post #21.

I can not duplicate your system restore issue. When a macro "hangs" and ESC key presses or Break key does not abort it, the 3 finger solute (Alt+Ctrl+Del, Task Manager, Kill) is the usual method to Kill the Excel instance. It is always best to run a macro like mine by itself with no other Excel files open. That is, at least until you know what to expect.

I can not duplicate your timing problem for those files. I did say it took a "long time" to run. My run with your files took about 44 seconds. Anything over 5 seconds is a long time to me. I like to see under one second but you wanted to get the character count / 65 for each match.

snb or I could show you how to modify his code to do what mine does, less the character count / 65.

I added an ESC key option. When you abort a macro like mine, it can leave an instance of Word that needs to be Killed via 3 finger solute before another run or you could get an OLE error.


Sub Main()
Dim p$, fn$, i As Long, j As Long, r As Long, c As Integer
Dim a, b, e, rr As Range, cc As Range
Dim ws As Worksheet, o As Object, s$
Dim fso As Object 'New Scripting.FileSystemObject
Dim d#

d = Timer

'******************* INPUTS **********************************
p = ThisWorkbook.Path & "\" 'Parent folder
Set ws = Worksheets(1)
'******************* END INPUTS ******************************

'List of 4 digit numbers. 'e.g. SD/#7301/, SD/#0231/
Set rr = ws.Range("A2", ws.Cells(Rows.Count, "A").End(xlUp))

On Error GoTo EndSub
Application.EnableCancelKey = xlErrorHandler
Application.DisplayAlerts = False

a = aFFs(p & "*.doc", , True)
If Not IsArray(a) Then Exit Sub

Set fso = CreateObject("Scripting.FileSystemObject")
ReDim b(1 To Rows.Count, 1 To 4)

For Each e In a
Set o = GetObject(e)
s = o.Content
For Each cc In rr
i = InStr(s, "SD/#" & cc.Text & "/")
If i > 0 Then
j = j + 1
b(j, 1) = fso.GetFile(CStr(e)).Name
b(j, 2) = cc.Text
fn = fso.GetParentFolderName(CStr(e))
If Len(fn) > Len(p) Then b(j, 3) = Right(fn, Len(fn) - Len(p))
b(j, 4) = WorksheetFunction.Round(Len(s) / 65, 0)
End If
Next cc
o.Close False
Next e

Set fso = Nothing
If j = 0 Then Exit Sub

b = Application.Index(b, Evaluate("row(1:" & j & ")"), Application.Transpose([row(1:4)]))
ws.[B2].Resize(j, 4).Value = b
ws.UsedRange.Columns.AutoFit

EndSub:
Set fso = Nothing
Application.DisplayAlerts = True
Application.EnableCancelKey = xlInterrupt
Debug.Print Timer - d
End Sub

snb
12-10-2017, 09:33 AM
@KH


Sub M_snb()
sn = ThisWorkbook.Sheets(1).Columns(1).SpecialCells(2)

With CreateObject("wscript.shell")
For j = 2 To UBound(sn)
c00=c00 & vbcrlf & .Exec("cmd /c findstr /m/s SD/#" & Format(sn(j, 1), "0000") & "/ G:\test_snb\mainfolder\*.doc").StdOut.ReadAll
Next
End With
with createobject("scripting.filesystemobject")
st=split(c00,vbcrlf)
for j=1 to ubound(st)

next
end with
End Sub

sibjac
12-10-2017, 09:50 AM
snb's example was to show how findstr can help. Change MsgBox.Exec to MsgBox .Exec in post #21.

I can not duplicate your system restore issue. When a macro "hangs" and ESC key presses or Break key does not abort it, the 3 finger solute (Alt+Ctrl+Del, Task Manager, Kill) is the usual method to Kill the Excel instance. It is always best to run a macro like mine by itself with no other Excel files open. That is, at least until you know what to expect.

I can not duplicate your timing problem for those files. I did say it took a "long time" to run. My run with your files took about 44 seconds. Anything over 5 seconds is a long time to me. I like to see under one second but you wanted to get the character count / 65 for each match.

snb or I could show you how to modify his code to do what mine does, less the character count / 65.

I added an ESC key option. When you abort a macro like mine, it can leave an instance of Word that needs to be Killed via 3 finger solute before another run or you could get an OLE error.


Thank you very much for your time, effort, and commitment.

[CODE] a = aFFs(p & "*.doc", , True) [CODE]

I am getting a compile error for the above portion prompting "sub or function not defined" for "aFFs"

sibjac
12-10-2017, 09:58 AM
@KH


Sub M_snb()
sn = ThisWorkbook.Sheets(1).Columns(1).SpecialCells(2)

With CreateObject("wscript.shell")
For j = 2 To UBound(sn)
c00=c00 & vbcrlf & .Exec("cmd /c findstr /m/s SD/#" & Format(sn(j, 1), "0000") & "/ G:\test_snb\mainfolder\*.doc").StdOut.ReadAll
Next
End With
with createobject("scripting.filesystemobject")
st=split(c00,vbcrlf)
for j=1 to ubound(st)

next
end with
End Sub


thank you, I have tried but nothing happens with it. My excel version is office 2007

Kenneth Hobs
12-10-2017, 10:44 AM
aFFs() and the following Sub in it was already shown in previous posts.

sibjac
12-10-2017, 11:10 AM
aFFs() and the following Sub in it was already shown in previous posts.

Thank you... I will just check my pc and see if office can be reinstalled and then will try it...

Kenneth Hobs
12-10-2017, 11:26 AM
Combining what I did with snb's findstr method, less the characters/65 it ran in 1.4 seconds. We could use GetDetailsOf() to get the word count quickly if that would help any.

snb does not use Option Explicit so Dim is not needed. Obviously, change the value of ws and p to suit.

You can delete the two lines with Timer if you don't want that. Debug.Print puts the result into the Immediate Window (Ctrl+G) after a run.

Sub snbkh()
d = Timer
Set ws = ThisWorkbook.Sheets(1)
p = "C:\Users\lenovo1\Dropbox\Excel\Word\MainFolder\"

On Error GoTo EndSub
Application.EnableCancelKey = xlErrorHandler

Set fso = CreateObject("Scripting.FileSystemObject")
ReDim b(1 To Rows.Count, 1 To 3)

sn = ws.Columns(1).SpecialCells(2)
With CreateObject("WScript.Shell")
For Z = 2 To UBound(sn)
s = .Exec("cmd /c findstr /m/s SD/#" & Format(sn(Z, 1), "0000") & "/ " & _
"""" & p & "*.doc" & """").StdOut.ReadAll
If Len(s) <> 0 Then
s = Left(s, Len(s) - 2) 'Trim trailing vbCrLF
a = Split(s, vbCrLf)
For i = 0 To UBound(a)
j = j + 1
b(j, 1) = fso.GetFile(a(i)).Name
b(j, 2) = sn(Z, 1)
fn = fso.GetParentFolderName(a(i))
If Len(fn) > Len(p) Then b(j, 3) = Right(fn, Len(fn) - Len(p))
Next i
End If
Next Z
End With

If j = 0 Then GoTo EndSub

b = Application.Index(b, Evaluate("row(1:" & j & ")"), Application.Transpose([row(1:3)]))
ws.[B2].Resize(j, 3).Value = b
ws.UsedRange.Columns.AutoFit

EndSub:
Set fso = Nothing
Application.DisplayAlerts = True
Application.EnableCancelKey = xlInterrupt
Debug.Print Timer - d
End Sub

snb
12-10-2017, 02:58 PM
thank you, I have tried but nothing happens with it. My excel version is office 2007

Did you ever program something in VBA ?

sibjac
12-11-2017, 07:38 AM
[/QUOTE]

sorry for being late to reply. I have tried the the macro in folder MainFolder, but the macro did not print anything, just a blank excel.

sibjac
12-13-2017, 07:54 AM
Hats off to Hobs and snb, it worked when I reinstalled my office, solved a big headache. Thanks a lot.