View Full Version : [SOLVED:] VBA Automated extraction of tables from word file and table processing
thomas_w
04-16-2022, 11:29 AM
Hey everyone,
I'm a beginner with VBC and a bit stuck on coding the following complicated problem. Here is what I would like to do:
- I have a list of .rtf files in a folder (see example file: 29647). These files contain multiple tables with patient medication. Each table has a header which is one paragraph/line above the table. The headers are not formatted as such, but as plain test.
- I'd like to extract all tables from all documents in my folder and place them in an excel sheet.
- I'd then like to add two new columns to the left of each table (one column labelled with name_document, one column labelled with header_table).
- Into each cell of the name_document column, I'd like to add the name of the document I extracted the table from.
- Into each cell of the header_table column, I'd like to add the name of the table, as defined by the header one paragraph/line above the table.
I found the following code to extract tables from a list of documents.
Sub ExtractTablesFromMultiDocs()
Dim objTable As Table
Dim objDoc As Document, objNewDoc As Document
Dim objRange As Range
Dim strFile As String, strFolder As String
' Initialization
strFolder = InputBox("Enter folder address here: ")
strFile = Dir(strFolder & "" & "*.rtf", vbNormal)
Set objNewDoc = Documents.Add
' Process each file in the folder.
While strFile <> ""
Set objDoc = Documents.Open(FileName:=strFolder & "" & strFile)
Set objDoc = ActiveDocument
For Each objTable In objDoc.Tables
objTable.Range.Select
Selection.Copy
Set objRange = objNewDoc.Range
objRange.Collapse Direction:=wdCollapseEnd
objRange.PasteSpecial DataType:=wdPasteRTF
objRange.Collapse Direction:=wdCollapseEnd
objRange.Text = vbCr
Next objTable
objDoc.Save
objDoc.Close
strFile = Dir()
Wend
End Sub
Any suggestions how to amend it for my purpose? Help is much appreciated!!!
Best,
Thomas
macropod
04-16-2022, 01:25 PM
The following Excel macro should do as you want:
Sub GetTableData()
'Note: this code requires a reference to the Word object model.
'See under the VBE's Tools|References.
Application.ScreenUpdating = False
Dim wdApp As New Word.Application, wdDoc As Word.Document, wdTbl As Word.Table
Dim strFolder As String, strFile As String, WkBk As Workbook, WkSht As Worksheet
Dim StrTbl As String, r As Long, i As Long, j As Long
strFolder = GetFolder: If strFolder = "" Then GoTo ErrExit
Set WkBk = ActiveWorkbook
'Disable any auto macros in the documents being processed
wdApp.WordBasic.DisableAutoMacros
strFile = Dir(strFolder & "\*.doc", vbNormal)
While strFile <> ""
Set wdDoc = wdApp.Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
Set WkSht = WkBk.Sheets.Add: r = 1
WkSht.Name = Split(strFile, ".doc")(0)
With wdDoc
For Each wdTbl In .Tables
With wdTbl.Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "[^13^l]"
.Replacement.Text = "¶"
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With
If r > 1 Then r = r + 2: i = r
wdTbl.Range.Copy
WkSht.Paste Destination:=WkSht.Range("C" & r)
StrTbl = wdTbl.Range.Paragraphs.First.Previous.Range.Text
r = WkSht.Cells(WkSht.Rows.Count, 1).End(xlUp).Row
For j = i To r
WkSht.Range("A" & r).Value = Split(strFile, ".")(0)
WkSht.Range("B" & r).Value = StrTbl
Next
Next
WkSht.UsedRange.Replace What:="¶", Replacement:=Chr(10), LookAt:=xlPart, SearchOrder:=xlByRows
.Close SaveChanges:=False
End With
strFile = Dir()
Wend
ErrExit:
wdApp.Quit
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set WkBk = Nothing
Application.ScreenUpdating = True
End Sub
Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function
Simply select the folder to process. The macro creates a new worksheet for each document and outputs all tables from that document one below the other, with an empty row in between. Except for text wrapping, table formatting is preserved as much as possible.
If you don't want a new worksheet for each document, delete:
Set WkSht = WkBk.Sheets.Add: r = 1
WkSht.Name = Split(strFile, ".doc")(0)
change:
Set WkBk = ActiveWorkbook
to:
Set WkBk = ActiveWorkbook: Set WkSht = WkBk.ActiveSheet
and change:
Dim StrTbl As String, r As Long, i As Long, j As Long
to:
Dim StrTbl As String, r As Long, i As Long, j As Long: r = 1
DaveM
04-17-2022, 10:57 AM
Paul, nice use of the character class ([^13^l]) in the .Find function to replace all LineFeeds and FormFeeds with a paragraph mark (pilcrow -
¶).... which you later replace in the Excel range with a Carriage Return (char(10)). Do LineFeeds and FormFeeds both mark the end of a table?
I also like your use of the command concatenation operator ( : ) to string together similar commands on one line as in:
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set WkBk = Nothing
And lastly, the GetFolder() function is the cats-meow of efficiency.
I do have a question about your use of the BinaryCompare argument to the Split() function. Why not the TextCompare argument?
Nicely done, as is all the VBA code of yours that I have viewed in my short time on VBA Express. Thanks for contributing so generously
to those of us toiling in the VBA arena. :clap:
Dave
thomas_w
04-17-2022, 12:49 PM
Hi Dave,
that's great! Thanks a lot for your help on this!
Cheers,
Thomas
macropod
04-17-2022, 02:11 PM
Paul, nice use of the character class ([^13^l]) in the .Find function to replace all LineFeeds and FormFeeds with a paragraph mark (pilcrow -
¶).... which you later replace in the Excel range with a Carriage Return (char(10)). Do LineFeeds and FormFeeds both mark the end of a table?
This process concerns Word table cells that contain paragraph breaks and/or line breaks. If those are not removed before copying the table to Excel, the output goes to separate rows in Excel. The subsequent replacement of the pilcrow in Excel is to restore those formats as line breaks within the cells concerned.
Word table cell and row ends are designated by the Chr(13) and Chr(7) pair, which appear as '¤'symbols.
I do have a question about your use of the BinaryCompare argument to the Split() function. Why not the TextCompare argument?
Since Word files are ordinarily saved with lower-case extensions (.doc, .docx, .docm), I saw no need to add the the TextCompare argument.
DaveM
04-17-2022, 03:00 PM
Word table cell and row ends are designated by the Chr(13) and Chr(7) pair, which appear as '¤'symbols.
Da-ding! Your code indicated LineFeed and Formfeed chars, but your reply indicated a Linefeed and Bell character. Is that right.... that a Chr(7) is part of the end-of-row character pair? (I hope I'm not muddying the water here).
macropod
04-17-2022, 03:12 PM
Da-ding! Your code indicated LineFeed and Formfeed chars, but your reply indicated a Linefeed and Bell character.
As I said,
This process concerns Word table cells that contain paragraph breaks and/or line breaks. If those are not removed before copying the table to Excel, the output goes to separate rows in Excel.
The F/R removes paragraph breaks and manual line breaks that might occur within a cell - it has nothing to do with Word's end-of-cell or end-of-row markers.
but your reply indicated a Linefeed and Bell character. Is that right.... that a Chr(7) is part of the end-of-row character pair?
That is correct.
DaveM
04-17-2022, 03:44 PM
OK, Paul, so I think I'm square with your code:
..... use of the character class ([^13^l]) in the .Find function to replace all LineFeeds (^13) and FormFeeds/Page Breaks (^L) with a paragraph mark (pilcrow -
¶)
Separate from that F/R, you indicate that the end-of-row in a table is a pair of characters (Chr(13)+Chr(7)), represented by '¤'. (Finally know what this symbol is made of...... and I have been using Word since version 2.0...... go figure :doh:..... This will not look good on my résumé)).
No need to reply if that is correct. Thanks for taking the time to get me straightened out.
Dave
macropod
04-17-2022, 04:12 PM
OK, Paul, so I think I'm square with your code:
..... use of the character class ([^13^l]) in the .Find function to replace all LineFeeds (^13) and FormFeeds/Page Breaks (^L) with a paragraph mark (pilcrow -
¶)
No, page breaks are not replaced (and it's ^l not ^L). Moreover, it's ^13 that finds paragraph breaks and ^l that finds manual line breaks (aka formfeeds).
And the pilcrow is just a pilcrow - not a paragraph break (though Word uses pilcrow symbols to display paragraph breaks - just as it uses '¤'symbols to display end-of-cell and end-of-row markers).
the end-of-row in a table is a pair of characters (Chr(13)+Chr(7)), represented by '¤'.
The same '¤'symbol is used to display both end-of-cell markers and end-of-row markers. End-of-row markers are typically hidden.
DaveM
04-18-2022, 06:34 AM
Sorry to have muddied the water again; should not have mentioned Page Breaks. Outside of the Microsoft realm ^l represents both FormFeeds and PageBreaks. Within Microsoft ^l is the line break that one achieves by pressing Shift-<Enter>. And ^13 is what one gets when pressing <Enter>.
By referencing ^L, I meant to remove any ambiguity as to what character ^l was referring to (l not 1 or I). I realize that Microsoft does not consider ^L the same as ^l.
I also see that your use of the pilcrow symbol in the replacement text of your F/R command is merely a placeholder for later replacement (and not the actual Microsoft paragraph mark).
With that, I guess I will recede into the lurkdom.
Dave
Powered by vBulletin® Version 4.2.5 Copyright © 2025 vBulletin Solutions Inc. All rights reserved.