Consulting

Results 1 to 10 of 10

Thread: VBA Automated extraction of tables from word file and table processing

  1. #1

    VBA Automated extraction of tables from word file and table processing

    Hey everyone,

    I'm a beginner with VBC and a bit stuck on coding the following complicated problem. Here is what I would like to do:
    - I have a list of .rtf files in a folder (see example file: test.docx). These files contain multiple tables with patient medication. Each table has a header which is one paragraph/line above the table. The headers are not formatted as such, but as plain test.
    - I'd like to extract all tables from all documents in my folder and place them in an excel sheet.
    - I'd then like to add two new columns to the left of each table (one column labelled with name_document, one column labelled with header_table).
    - Into each cell of the name_document column, I'd like to add the name of the document I extracted the table from.
    - Into each cell of the header_table column, I'd like to add the name of the table, as defined by the header one paragraph/line above the table.

    I found the following code to extract tables from a list of documents.

    Sub ExtractTablesFromMultiDocs()
      Dim objTable As Table
      Dim objDoc As Document, objNewDoc As Document
      Dim objRange As Range
      Dim strFile As String, strFolder As String
    '  Initialization
      strFolder = InputBox("Enter folder address here: ")
      strFile = Dir(strFolder & "" & "*.rtf", vbNormal)
    Set objNewDoc = Documents.Add
    '  Process each file in the folder.
      While strFile <> ""
        Set objDoc = Documents.Open(FileName:=strFolder & "" & strFile)
        Set objDoc = ActiveDocument
    For Each objTable In objDoc.Tables
          objTable.Range.Select
          Selection.Copy
    Set objRange = objNewDoc.Range
          objRange.Collapse Direction:=wdCollapseEnd
          objRange.PasteSpecial DataType:=wdPasteRTF
          objRange.Collapse Direction:=wdCollapseEnd
          objRange.Text = vbCr
        Next objTable
    objDoc.Save
        objDoc.Close
        strFile = Dir()
      Wend
    End Sub
    Any suggestions how to amend it for my purpose? Help is much appreciated!!!

    Best,
    Thomas
    Last edited by Aussiebear; 04-16-2022 at 07:44 PM. Reason: Added code tags to supplied code

  2. #2
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    The following Excel macro should do as you want:
    Sub GetTableData()
    'Note: this code requires a reference to the Word object model.
    'See under the VBE's Tools|References.
    Application.ScreenUpdating = False
    Dim wdApp As New Word.Application, wdDoc As Word.Document, wdTbl As Word.Table
    Dim strFolder As String, strFile As String, WkBk As Workbook, WkSht As Worksheet
    Dim StrTbl As String, r As Long, i As Long, j As Long
    strFolder = GetFolder: If strFolder = "" Then GoTo ErrExit
    Set WkBk = ActiveWorkbook
    'Disable any auto macros in the documents being processed
    wdApp.WordBasic.DisableAutoMacros
    strFile = Dir(strFolder & "\*.doc", vbNormal)
    While strFile <> ""
      Set wdDoc = wdApp.Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
      Set WkSht = WkBk.Sheets.Add: r = 1
      WkSht.Name = Split(strFile, ".doc")(0)
      With wdDoc
        For Each wdTbl In .Tables
          With wdTbl.Range.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = "[^13^l]"
            .Replacement.Text = "¶"
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchWildcards = True
            .Execute Replace:=wdReplaceAll
          End With
          If r > 1 Then r = r + 2: i = r
          wdTbl.Range.Copy
          WkSht.Paste Destination:=WkSht.Range("C" & r)
          StrTbl = wdTbl.Range.Paragraphs.First.Previous.Range.Text
          r = WkSht.Cells(WkSht.Rows.Count, 1).End(xlUp).Row
          For j = i To r
            WkSht.Range("A" & r).Value = Split(strFile, ".")(0)
            WkSht.Range("B" & r).Value = StrTbl
          Next
        Next
        WkSht.UsedRange.Replace What:="¶", Replacement:=Chr(10), LookAt:=xlPart, SearchOrder:=xlByRows
        .Close SaveChanges:=False
      End With
      strFile = Dir()
    Wend
    ErrExit:
    wdApp.Quit
    Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set WkBk = Nothing
    Application.ScreenUpdating = True
    End Sub
     
    Function GetFolder() As String
        Dim oFolder As Object
        GetFolder = ""
        Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
        If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
        Set oFolder = Nothing
    End Function
    Simply select the folder to process. The macro creates a new worksheet for each document and outputs all tables from that document one below the other, with an empty row in between. Except for text wrapping, table formatting is preserved as much as possible.

    If you don't want a new worksheet for each document, delete:
      Set WkSht = WkBk.Sheets.Add: r = 1
      WkSht.Name = Split(strFile, ".doc")(0)

    change:
    Set WkBk = ActiveWorkbook

    to:
    Set WkBk = ActiveWorkbook: Set WkSht = WkBk.ActiveSheet

    and
    change:
    Dim StrTbl As String, r As Long, i As Long, j As Long

    to:
    Dim StrTbl As String, r As Long, i As Long, j As Long: r = 1

    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  3. #3
    VBAX Regular DaveM's Avatar
    Joined
    Mar 2022
    Posts
    10
    Location
    Paul, nice use of the character class ([^13^l]) in the .Find function to replace all LineFeeds and FormFeeds with a paragraph mark (pilcrow -
    ¶).... which you later replace in the Excel range with a Carriage Return (char(10)). Do LineFeeds and FormFeeds both mark the end of a table?

    I also like your use of the command concatenation operator ( : ) to string together similar commands on one line as in:

    Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set WkBk = Nothing

    And lastly, the GetFolder() function is the cats-meow of efficiency.

    I do have a question about your use of the BinaryCompare argument to the Split() function. Why not the TextCompare argument?

    Nicely done, as is all the VBA code of yours that I have viewed in my short time on VBA Express. Thanks for contributing so generously
    to those of us toiling in the VBA arena.

    Dave

  4. #4
    Hi Dave,
    that's great! Thanks a lot for your help on this!

    Cheers,
    Thomas

  5. #5
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by DaveM View Post
    Paul, nice use of the character class ([^13^l]) in the .Find function to replace all LineFeeds and FormFeeds with a paragraph mark (pilcrow -
    ¶).... which you later replace in the Excel range with a Carriage Return (char(10)). Do LineFeeds and FormFeeds both mark the end of a table?
    This process concerns Word table cells that contain paragraph breaks and/or line breaks. If those are not removed before copying the table to Excel, the output goes to separate rows in Excel. The subsequent replacement of the pilcrow in Excel is to restore those formats as line breaks within the cells concerned.

    Word table cell and row ends are designated by the Chr(13) and Chr(7) pair, which appear as '¤'symbols.
    Quote Originally Posted by DaveM View Post
    I do have a question about your use of the BinaryCompare argument to the Split() function. Why not the TextCompare argument?
    Since Word files are ordinarily saved with lower-case extensions (.doc, .docx, .docm), I saw no need to add the the TextCompare argument.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  6. #6
    VBAX Regular DaveM's Avatar
    Joined
    Mar 2022
    Posts
    10
    Location
    Quote Originally Posted by macropod View Post
    Word table cell and row ends are designated by the Chr(13) and Chr(7) pair, which appear as '¤'symbols.
    Da-ding! Your code indicated LineFeed and Formfeed chars, but your reply indicated a Linefeed and Bell character. Is that right.... that a Chr(7) is part of the end-of-row character pair? (I hope I'm not muddying the water here).

  7. #7
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by DaveM View Post
    Da-ding! Your code indicated LineFeed and Formfeed chars, but your reply indicated a Linefeed and Bell character.
    As I said,
    This process concerns Word table cells that contain paragraph breaks and/or line breaks. If those are not removed before copying the table to Excel, the output goes to separate rows in Excel.

    The F/R removes paragraph breaks and manual line breaks that might occur within a cell - it has nothing to do with Word's end-of-cell or end-of-row markers.
    Quote Originally Posted by DaveM View Post
    but your reply indicated a Linefeed and Bell character. Is that right.... that a Chr(7) is part of the end-of-row character pair?
    That is correct.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  8. #8
    VBAX Regular DaveM's Avatar
    Joined
    Mar 2022
    Posts
    10
    Location
    OK, Paul, so I think I'm square with your code:

    ..... use of the character class ([^13^l]) in the .Find function to replace all LineFeeds (^13) and FormFeeds/Page Breaks (^L) with a paragraph mark (pilcrow -
    ¶)

    Separate from that F/R, you indicate that the end-of-row in a table is a pair of characters (Chr(13)+Chr(7)), represented by '¤'. (Finally know what this symbol is made of...... and I have been using Word since version 2.0...... go figure ..... This will not look good on my résumé)).

    No need to reply if that is correct. Thanks for taking the time to get me straightened out.

    Dave

  9. #9
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by DaveM View Post
    OK, Paul, so I think I'm square with your code:

    ..... use of the character class ([^13^l]) in the .Find function to replace all LineFeeds (^13) and FormFeeds/Page Breaks (^L) with a paragraph mark (pilcrow -
    ¶)
    No, page breaks are not replaced (and it's ^l not ^L). Moreover, it's ^13 that finds paragraph breaks and ^l that finds manual line breaks (aka formfeeds).

    And the pilcrow is just a pilcrow - not a paragraph break (though Word uses pilcrow symbols to display paragraph breaks - just as it uses '¤'symbols to display end-of-cell and end-of-row markers).
    Quote Originally Posted by DaveM View Post
    the end-of-row in a table is a pair of characters (Chr(13)+Chr(7)), represented by '¤'.
    The same '¤'symbol is used to display both end-of-cell markers and end-of-row markers. End-of-row markers are typically hidden.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  10. #10
    VBAX Regular DaveM's Avatar
    Joined
    Mar 2022
    Posts
    10
    Location
    Sorry to have muddied the water again; should not have mentioned Page Breaks. Outside of the Microsoft realm ^l represents both FormFeeds and PageBreaks. Within Microsoft ^l is the line break that one achieves by pressing Shift-<Enter>. And ^13 is what one gets when pressing <Enter>.

    By referencing ^L, I meant to remove any ambiguity as to what character ^l was referring to (l not 1 or I). I realize that Microsoft does not consider ^L the same as ^l.

    I also see that your use of the pilcrow symbol in the replacement text of your F/R command is merely a placeholder for later replacement (and not the actual Microsoft paragraph mark).

    With that, I guess I will recede into the lurkdom.

    Dave

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •