PDA

View Full Version : [SOLVED:] Import Words and Phrases from MS Word to Excel



yarixen
05-12-2020, 06:38 AM
Hi All,

I am someone who can only copy and modify a VBA but can not build it from scratch. I have more than 3000 MSWord docs stored different sub-folders. All of them are in same format.

I have been trying to look for a VBA that would extract data from MsWord to Excel using the keywords before and after the target data - Something to what this youtube video is showing https://www.youtube.com/watch?v=1x-Vk4Qmpz0&t= only that I don't have it in a Word table.

I realize that there may be a lot of similar threads like this but nothing seems to work for my case since my documents are not in a table format.:crying:

I was hoping someone might be able to walk me through this one. I have attached a sample doc and the target excel file. The data that needed to be exported are highlighted in yellow.

Thank you and stay safe everyone!

2664026641

macropod
05-12-2020, 03:39 PM
Your sample document appears to have had some content deleted - it even refers to an 'attached two pages' - and has also had numerous strings of nothing more than spaces inserted. The document cannot be processed without knowing what other content it has that might affect the data extraction.

yarixen
05-13-2020, 12:51 AM
Thanks Paul, for looking into this.:help I apologize that I removed most part of the document in my poor attempt to keep from being public. There are 3 (or sometimes 4 pages) in each document. The string of spaces are consistently present in the documents - Please see if it can be ignored. I was wondering if there was a way text can be extracted based on words prior and subsequent to it. Here I attached a new xls and doc.

macropod
05-13-2020, 02:33 AM
Try the following Excel macro. Provided your documents all have a layout consistent with the one you attached, the data should be extracted from all documents in the folder you select.

Sub GetPermitData()
'Note: this code requires a reference to the Word object model. See under the VBE's Tools|References.
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strData As String
Dim WkSht As Worksheet, r As Long, c As Long
strFolder = GetFolder: If strFolder = "" Then Exit Sub
Dim wdApp As New Word.Application, wdDoc As Word.Document
Set WkSht = ActiveSheet
r = WkSht.Cells(WkSht.Rows.Count, 1).End(xlUp).Row
'Disable any auto macros in the documents being processed
wdApp.WordBasic.DisableAutoMacros
strFile = Dir(strFolder & "\*.doc", vbNormal)
While strFile <> ""
r = r + 1
Set wdDoc = wdApp.Documents.Open(Filename:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
On Error GoTo ErrExit
With wdDoc
Do While .InlineShapes.Count > 0
.InlineShapes(1).Delete
Loop
Do While .Shapes.Count > 0
.Shapes(1).Delete
Loop
With .Range
.Paragraphs.First.Range.Text = vbCr
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Forward = True
.Format = False
.Wrap = wdFindContinue
.MatchWildcards = True
.Text = "^13*Permit No.:(*)[ ]@Date Issued:(*)^13*issued to[^13]{1,}(*)^13*Located at (*)^13*:[ ^13]{1,}"
.Replacement.Text = "^t\1^t\2^t\3^t\4^t"
.Execute Replace:=wdReplaceAll
.Text = "[^13 ]@[!^13]@valid until[ ]@<(*).^13*\(PCO\)(*) shall be*Regional Director*[ ^13]{1,}"
.Replacement.Text = "^t\1^t\2"
.Execute Replace:=wdReplaceAll
.Text = "^13[^13 ]{1,}"
.Replacement.Text = Chr(182)
.Execute Replace:=wdReplaceAll
.Text = "([^t" & Chr(182) & "])[ ]{1,}"
.Replacement.Text = "\1"
.Execute Replace:=wdReplaceAll
.Text = "[ ]{1,}([^t" & Chr(182) & "])"
.Replacement.Text = "\1"
.Execute Replace:=wdReplaceAll
.Text = "(^t*)(^t*)(^t*)(^t*)(^t*)(^t*)(^t*)"
.Replacement.Text = "\3\4\5\1\2\6\7"
.Execute Replace:=wdReplaceAll
End With
End With
strData = Split(.Range.Text, vbCr)(0)
.Close SaveChanges:=False
End With
For c = 1 To UBound(Split(strData, vbTab))
WkSht.Cells(r, c).Value = Split(strData, vbTab)(c)
Next
strFile = Dir()
Wend
GoTo NoErr
ErrExit:

MsgBox "Cannot process:" & vbcr & strFile & vbcr & "Exiting", vbCritical
NoErr:
wdApp.Quit
WkSht.UsedRange.Replace What:=Chr(182), Replacement:=Chr(10), LookAt:=xlPart, SearchOrder:=xlByRows
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing
Application.ScreenUpdating = True
End Sub
'
Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function

yarixen
05-13-2020, 11:43 AM
Thank you Sir. I will get to check it on actual docs when I get back to office. Please let me get back to you if I need help again. I really appreciate it. Stay safe.

yarixen
05-14-2020, 09:50 AM
Hi Paul, can you help me check this error:
26677

pressing Debug goes here:


.Paragraphs.First.Range.Text = vbCr

macropod
05-14-2020, 03:35 PM
Don't know why you'd be getting that error; there's nothing wrong with the code line. Try restarting Windows.

yarixen
05-15-2020, 06:29 AM
Thanks for the reply. Did the restart but still encountering it in a lot of folders. I checked what seems to be different with these folders - My suspicion is, some of the folders contains older docs that have this textbox/autoshape in their header like this :

26684

When it encounters one of these old files (with textbox ) it stops checking the rest of the other docs and gives the error earlier posted.

Do you think this is stopping the run? Is there a way to ignore these textboxes? I am attaching another doc sample with this text box. Kindly see.

Thank you for all the help.

macropod
05-15-2020, 03:45 PM
I have revised the code in post #4 to eliminate such objects.

yarixen
05-16-2020, 12:59 AM
Hi Paul, thank you. Your revision works. But I still get errors and it stops running - my documents are the problem. I think there are other files with variations that is hard for me to check one by one as of the moment. Do you think it is better doing extraction from a .txt notepad? I can batch convert all of them to .txt to keep the text only.

macropod
05-16-2020, 01:13 AM
I very much doubt changing the file format would make any difference. The crucial thing is the file content, not its format. If you have non-conforming documents - as seems to be the case - you can either:
• move those documents out of the folder being processed; or
• correct those documents.
I have further edited the code in post #4 to report the file that causes the macro to stop.

yarixen
05-24-2020, 11:53 PM
Hi Paul, I just want to say thanks. Some had to be done manually, but its done now.

Thanks to your code it saved me a lot of time. I just wish I knew how to edit those things (wildcards, special characters, ascii?, I don't know what they are called:rotlaugh:) . If you can point me to a link where I can learn more about "these", I'd appreciate it.

As always, stay safe.:thumb

macropod
05-25-2020, 12:02 AM
They're wildcard expressions. See: https://wordmvp.com/FAQs/General/UsingWildcards.htm

yarixen
05-25-2020, 03:34 AM
I really appreciate this!