Consulting

Page 1 of 2 1 2 LastLast
Results 1 to 20 of 30

Thread: Complex macro has very poor performance

  1. #1
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location

    Complex macro has very poor performance

    What I have:
    I have a macro in a word document that looks for statements that say "shall" or "will". When it finds it, it writes the statement and the section header to an excel file.

    The problem:
    These are crappy word files exported from PDF files, so they don't have "real" section headers. As a result, I'm forced to identify headers using fairly complex logic. The end result is that my macro runs slowly--far too slowly. On long proposals (100+ pages) it can take over an hour.

    I need help with:
    I could use some help identifying bottlenecks in my code. The attached file includes part of a proposal I'm using to gauge performance. I'm getting about 2 minutes right now, but ideally this file should process in about 30 seconds.

    The Code:
    Here's the code in all its twisted glory.

    [VBA]Dim sKWArray() As String

    Function FileThere(FileName As String) As Boolean
    FileThere = (Dir(FileName) > "")
    End Function

    'Queries the user for a file name where we'll save our information
    Function GetXLSFileName() As String
    'open excel document
    Dim fd As FileDialog
    Set fd = Application.FileDialog(msoFileDialogFilePicker)
    Dim vSelFile As Variant
    With fd
    .AllowMultiSelect = False
    .Filters.Add "Excel 2003", "*.xls", 1
    .Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
    .Title = "Choose your destination spreadsheet file."

    'Show the dialog box
    If .Show = -1 Then
    'Set the file string to the selected file
    GetXLSFileName = .SelectedItems(1)
    'If the user presses Cancel...
    Else
    MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
    GetXLSFileName = "C:\ShallStatements.xls"
    End If
    End With
    End Function

    'Interface function that gathers keywords to search for in the document.
    Function SetUpKeywords() As Boolean
    ReDim sKWArray(1 To 2) As String
    sKWArray(1) = "shall"
    sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
    SetUpKeywords = True
    End Function

    'Finds the first searched keyword in a line and returns the position
    Function HasKeywords(line As String) As Integer
    HasKeywords = 0
    Dim x As Integer
    For x = 1 To UBound(sKWArray)
    HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)

    'Exit the loop if we find something
    If HasKeywords <> 0 Then Exit For
    Next x
    End Function

    'See if we have a heading-style number. Returns the heading, or blank if not a heading.
    Private Function HeaderOutlineTitle(line As String) As String
    'Set up variables
    Dim CharCounter As Integer
    Dim NumberCounter As Integer
    Dim tCh As String

    CharCounter = 0
    NumberCounter = 0

    'For the 4 character headings, this skips all the unncessary logic
    If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
    HeaderOutlineTitle = Left(line, 4)
    GoTo SkipToEnd
    End If

    'Start looping through the line character by character
    For y = 1 To Len(line)
    tCh = Mid(line, y, 1)

    'This allows the letter "H" to be the first character
    If y = 1 And tCh = "H" Then GoTo SkipNumberCheck

    'If it's blank or a number or a period, exit
    If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then Exit For

    'If this is a number, keep track, otherwise reset the counter
    NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)

    'If we've had more than 2 numbers in a row, we're probably not seeing a heading
    If NumberCounter > 2 Then
    CharCounter = 0
    Exit For
    End If

    SkipNumberCheck: 'We can also skip all the other logic since it's the first character
    CharCounter = CharCounter + 1
    Next y

    'If this was a outline heading, list that information in column 4
    '(AND As long as the last character before the space was a number)
    If CharCounter <> 0 Then
    If IsNumeric(Mid(line, CharCounter, 1)) Then
    HeaderOutlineTitle = Left(line, CharCounter)
    Else
    HeaderOutlineTitle = Left(line, CharCounter - 1)
    End If
    Else
    HeaderOutlineTitle = ""
    End If

    SkipToEnd:
    End Function

    Function IsDigit(char As String) As Boolean
    'Uses [Asc] function to get an ASCII code of a character.
    Dim iAsciiCode As Integer
    iAsciiCode = Asc(char)

    'The ASCII codes for the digits range from 48 to 57. If an ASCII
    'code of the character is within this range, function returns true
    IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
    End Function

    Function IsDigits(text As String) As Boolean
    'If the text is 0 length, it's not a digit-only string
    If Len(text) = 0 Then
    IsDigits = False
    Else
    Dim x As Integer
    'the default is true. We'll set to false if...
    IsDigits = True
    For x = 1 To Len(text)
    '...we find something that's not a digit
    If Not IsDigit(Mid(text, x, 1)) Then
    IsDigits = False
    Exit For
    End If
    Next x
    End If
    End Function

    'Main function that extracts requirements from the document
    Sub ExtractRequirements()
    Dim x As Integer, y As Integer, tCh As Integer
    Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
    intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.

    'Create arrays to hold data
    Dim sHeaders() As String
    Dim sStatements() As String
    ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
    ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.

    'Record Start Time
    sHeaders(1) = "Start Time"
    sStatements(1) = DateTime.Now

    'Write headers
    sHeaders(intRowCount - 1) = "Header"
    sStatements(intRowCount - 1) = "Requirement"

    'Figure out how many lines are in the document
    Dim NumOfLines As Integer
    NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)

    'Record time for setup to be completed
    sHeaders(2) = "Lines Computed"
    sStatements(2) = DateTime.Now

    'Set up requirement words (exit sub on failure)
    If Not SetUpKeywords Then Exit Sub

    'Turn off screen updating to improve performance
    If MsgBox("Would you like to turn off screen updating to improve performance? " & _
    "Please note that this may cause Word to appear unresponsive for quite " & _
    "some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
    Application.ScreenUpdating = False
    End If

    'Record time for information to be collected
    sHeaders(3) = "Params Collected"
    sStatements(3) = DateTime.Now

    'Loop through every line and look for headers
    Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
    header = "None" 'This is the header value used before we've found a legitimate header
    For x = 1 To NumOfLines
    'Go to the correct line
    Selection.GoTo wdGoToLine, wdGoToAbsolute, x

    'Expand to select the entire line, but skip the logic if it's blank
    Selection.Expand wdLine
    LineText = Trim(Selection.text)
    If LineText <> "" Then
    'see if it's a header, and if so, set the header
    tmpHeaderRslt = HeaderOutlineTitle(LineText)
    If tmpHeaderRslt <> "" Then header = tmpHeaderRslt

    'See if the line has the word "shall" or "must" in it
    ShallPos = HasKeywords(LineText)

    'If so, write it to the spreadsheet
    If ShallPos > 0 Then
    'Move to the correct character and expand to the full sentence
    Selection.Collapse wdCollapseStart
    Selection.Move wdCharacter, ShallPos
    Selection.Expand wdSentence
    SentenceText = Selection.text

    'Write the header to the spreadsheet
    sHeaders(intRowCount) = header
    sStatements(intRowCount) = SentenceText

    'Move the selection forward to the proper line (avoids replication)
    Selection.Collapse wdCollapseEnd
    If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1

    'Incremement the row counter
    intRowCount = intRowCount + 1
    End If
    End If
    Next x

    'Record time for statements to be collected
    sHeaders(4) = "Statements Collected"
    sStatements(4) = DateTime.Now

    'Open excel file and set up for receiving data
    Dim xlsFileString As String
    xlsFileString = GetXLSFileName
    Dim appExcel As Object
    Dim objSheet As Object
    Set appExcel = CreateObject("Excel.Application")
    Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")

    'Write data to the spreadsheet
    sHeaders(5) = "Excel write begin"
    sStatements(5) = DateTime.Now
    For x = 1 To intRowCount
    objSheet.Cells(x, 1) = sHeaders(x)
    objSheet.Cells(x, 2) = sStatements(x)
    Next x

    'Write end time
    objSheet.Cells(6, 1) = "Excel write ends"
    objSheet.Cells(6, 2) = DateTime.Now

    'Clean up and open for viewing.
    appExcel.workbooks(1).Close True
    appExcel.Quit
    Set objSheet = Nothing
    Set appExcel = Nothing

    'Open the new spreadsheet for viewing
    If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
    Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
    ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
    Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
    End If
    End Sub
    [/VBA]

    I appreciate any help you can give. Thank you!

    Dan
    Attached Files Attached Files

  2. #2
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Yep, you've got a bad set up. You need to stop using string functions (like InStr) entirely, and substitute by using the Find object.

    You could probably get some increase in performance by compiling this code rather than leaving it in an unmanaged addin (i.e., make it into a COM addin, a DLL, etc vs. leaving the code in a .dotm)... but the main problem is use of string functions the way you're doing it.

    My cursory glance at the structure of code isn't that bad (except for your use of the selection object-- you could probably get some value out of using a range object, although ranges and lines become tough).

    But the main problem is that you haven't attempted to use the Find object. If I can sum up with two more concise bits of code...

    Your strategy
    [vba]
    'This routine displays how many instances of a string exist in the document
    'from the beginning of the selection to the end of the document
    Sub DemoStringAnalysisBad()
    Dim rngWhere As Range
    Dim rngSearch As Range
    Dim oPara As Paragraph
    Dim sStringToFind As String
    Dim iHowManyFound As Integer
    Dim iInStrResult As Integer

    sStringToFind = "Some Text"
    Set rngWhere = Selection.Range
    rngWhere.End = ActiveDocument.Content.End
    For Each oPara In rngWhere.Paragraphs
    Set rngSearch = oPara.Range
    iInStrResult = InStr(rngSearch.Text, sStringToFind)
    Do Until iInStrResult = 0
    iHowManyFound = iHowManyFound + 1
    rngSearch.Start = rngSearch.Start + iInStrResult
    iInStrResult = InStr(rngSearch.Text, sStringToFind)
    Loop
    Next
    MsgBox "Found " & iHowManyFound & " results"
    End Sub
    [/vba]
    A better strategy
    [vba]
    'This routine displays how many instances of a string exist in the document
    'from the beginning of the selection to the end of the document
    Sub DemoStringAnalysisGood()
    Dim rngWhere As Range
    Dim rngSearch As Range
    Dim sStringToFind As String
    Dim iHowManyFound As Integer

    sStringToFind = "Some Text"
    Set rngWhere = Selection.Range
    Set rngSearch = rngWhere.Duplicate
    'set up the find object
    With rngSearch.Find
    .Text = sStringToFind
    Do Until .Execute = False
    iHowManyFound = iHowManyFound + 1
    Loop
    End With

    MsgBox "Found " & iHowManyFound & " results"
    End Sub
    [/vba]
    Now, if you simply create a 100 page document with just paragraphs, just text... and the keywords sprinkled judiciously, the string function will seem to work faster.
    But if you create a 100 page document with paragraphs, tables, etc... the find function will work faster.
    Of course, my for each loop is a heck of a lot faster than your process, which is to move the selection line by line, select it, grab the text of that line, and then loop through character by character of the text property.

    This is proof of concept... does this give you a direction to go? Yes, it requires that you re-write everything. But unfortunately, you have painted yourself into a very bad corner. There could be marginal improvements if you simply did a For Each oPara in ActiveDocument.Paragraphs ... and then grabbed the .Text property of the oPara.Range, but you may very well end up in the same corner at some point.

    The Find object has been optimized to search for stuff. You've duplicated that functionality, but in a much slower way.

  3. #3
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Wow... I just actually checked out your macro and sample document.

    Although I didn't turn screenupdating off, because I wanted to see what it did... it took me 10 minutes (and at times, word hung, so I needed to break the action, then hit continue to get the macro to keep going) to get through the whole thing.

    And the final output is an excel spreadsheet with only 146 rows of data?

    You're off in saying it should take 30 seconds. It should take less than that. A couple additional points:

    1. Your use of the Sentences collection will fail you at some times. Try having a sentence which includes a period in it (the contractor shall furnish subject matter expertise for the X.Y.Z. program yada yada).
    2. As will your use of lines. The first two sample outputs proves this-- one is a sentence fragment because a paragraph mark is there, and the other is an entire sentences.
    3. You're clearly running into memory buffer issues, as the macro demonstrably slows down towards the end.

    At this point, I can help you... but as you have a single post, you need to reply here and include any links (even if they are only text) to any cross-posts you've made.

    It's a little bit of an intriguing problem, but I'm not wasting time on it if you are getting help (or have already had it answered) elsewhere (we seem to have a problem with a lot of 1 post count cross posters here).

  4. #4
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    As a quick proof of concept... depending on your familiarity with collections and custom classes, this would be one approach.

    Create a class module, call it "myInfo"
    Inside it you could have something like this (unfortunately, the vba parser doesn't make this very pretty):
    [vba]
    Option Explicit
    'A private variable for the main info
    Private m_rngInfo As Range
    'the heading string of the found range
    Public sHeading As String
    'this can be used to re-order in the logical order
    Public lKeyWordStart As Long
    Public sKeyWordFound As String

    Public Property Get InfoRange() As Range
    Set InfoRange = m_rngInfo
    End Property

    Public Property Set InfoRange(rngWhere As Range)
    Set m_rngInfo = rngWhere
    End Property

    Public Property Get InfoText() As String
    InfoText = m_rngInfo.text
    End Property
    [/vba]
    And then you could use the following routine in a regular module, to create a collection of that custom class, using the search terms and the find object.
    [vba]
    Sub Demo()
    Dim oInfo As myInfo
    Dim rngSearch As Range
    Dim sKeyWords As String
    Dim sKeyWord As String
    Dim i As Integer
    Dim colFoundInfo As Collection

    'get our keywords
    sKeyWords = "shall|will"
    'set up our collection of info
    Set colFoundInfo = New Collection

    'search for each of our keywords
    For i = 0 To UBound(Split(sKeyWords, "|"))
    'reset the range for each search term
    Set rngSearch = ActiveDocument.Content
    sKeyWord = Split(sKeyWords, "|")(i)
    'set up the find, not matching case
    With rngSearch.Find
    .MatchCase = False
    'get the keyword to search
    .text = sKeyWord
    'while we find our search term
    Do Until .Execute = False
    'create a new object
    Set oInfo = New myInfo
    'put the relevant data into it
    oInfo.lKeyWordStart = rngSearch.Start
    Set oInfo.InfoRange = rngSearch.Sentences(1)
    oInfo.sKeyWordFound = sKeyWord

    'here we would add in functionality to grab the header info for this object
    '(or could put into the actual class-- depends on your preference)
    'and then add the objectto the collection
    colFoundInfo.Add oInfo
    Loop
    End With
    Next

    Dim sDemoMessage As String
    For i = 1 To 5
    On Error Resume Next
    sDemoMessage = sDemoMessage & colFoundInfo(i).InfoText
    Next
    MsgBox "The first 5 found terms of " & colFoundInfo.Count & " terms are these sentences:" & vbCr & vbCr & _
    sDemoMessage

    'check out the locals window in View > Locals Window, and expand the colFoundInfo object
    Stop
    End Sub
    [/vba]
    I left the stop code in there on purpose, so you can stop the code and see what's going on in the Locals Window.
    What this does not do is identify the header for whatever sentence you're dealing with. It also doesn't solve your problem of sentences broken up by paragraph marks (which is rampant in the sample document).

    However, the next steps would these:
    1. Identify the header info for the found ranges, using the .InfoRange property of the class (you could do this in the class or outside and then set it
    2. Properly sort the information (there are a lot of ways to sort information, once you have it) based on the .lKeywordStart property of the class. You dump all the info from the collection into an array and sort, or you could put all the info into excel, and then sort on the .lKeyWordStart value, and then delete that column.

    The collection of found ranges will be naturally sorted by everything which contained "shall" in the order it appeared in the document, and then "will" in the order that term appeared in the document.

    But the main purpose of this demo is to show you how fast you can get the main info from your document using this methodology, rather than your existing methodology.

    I was curious, so I did a little more investigation. But now I'll wait to see if you respond.

    As an fyi... you should really try to remove any sensitive data from documents you post on the internet.

  5. #5
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    Thanks for the responses, Frosty. First of all, this is the only posting of this issue; I understand how frustrating it can be to put effort into a problem that turns out to already be solved.

    I actually started this project with a macro someone else had developed. It used the Find function and was very fast; the problem was that it didn't pull the headers.

    The headers can be in various formats, can be located in the same paragraph or a different paragraph, and generally are a pain in the neck to find. All the performance issues spring from the concessions I've made in order to (somewhat) reliably find headers.

    At this point, I'm using this paradigm:
    (1)Go through each line, determining if its a header. If it is, make it the "current" header.
    (2)As I go through each line, check for keywords. If it has a keyword, record it with the current header.

    I can think of at least two other paradigms. One would be:
    (1)Find each occurance of a keyword in a document.
    (2)Search backward for the last header before the keyword and record the two.
    (3)Repeat for each keyword

    And another would be:
    (1)Check every line for a header and store it with its line number in a multi-dimensional array.
    (2)Use the "find" function to look for keywords and record each one with the smallest header line number that isn't greater than the line number of the keyword.

    Either new paradigm would require a complete rewrite and I'm reluctant to do so unless I know there will be a performance gain. That said, if you have some insight into the header problem I'm certainly happy to put the work in.

    No sensitive information here; it's all publicly accessible except for the instructions at the top and the macro itself. Thanks for looking out though.

    And lastly--thanks again; its clear you spent a lot of time on this and I appreciate that a great deal.

    Dan

  6. #6
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    There will be a significant performance increase if you go with the find object. Orders of magnitude. Finding the headers can be a little tricky, but mostly because it needs to be done in a way you will feel comfortable modifying. This will involve a little learning on your part on how to work with ranges.

    If you're interested in that, then we can proceed. Otherwise, I would recommend getting rid of the line by line, and readjusting your function to use two for each loops... The paragraphs collection and the sentences collection.

    Either way, you're looking at a re-write. In both re-writes you will get a performance increase, but without the find object, I think it may be marginal. And the sentences collection has issues, so a rewrite based on the sentences collection paints you into a different corner.

    Have you tried my code and understood it? Forget multi-dimensional arrays and going through the document in order. If you can work to understand what's going on in the code with the collection, finding the headers will be relatively easy. Promise.

  7. #7
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    This doesn't totally solve the problem, as we need to figure out a better way of getting the header info than your current process, but adding the following line of code
    oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
    just before you add the oInfo item to the collection, combined with the following function will help you on the way... this at least gets all the info in a faster way, but is still far from optimized (since it still uses your HeadingOutlineTitle function)
    [VBA]
    Function fReturnHeadingBefore(rngWhere As Range) As String
    Dim oPara As Paragraph
    Dim rngSearch As Range
    Dim i As Integer
    Dim sRet As String

    Set rngSearch = rngWhere.Duplicate
    rngSearch.Start = rngSearch.Parent.Content.Start
    For i = rngSearch.Paragraphs.Count To 1 Step -1
    Set oPara = rngSearch.Paragraphs(i)
    sRet = HeaderOutlineTitle(oPara.Range.text)
    If sRet <> "" Then
    fReturnHeadingBefore = sRet
    Exit For
    End If
    Next
    End Function
    [/VBA]
    Is this making sense, thus far?

  8. #8
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    Frosty,

    Thanks again for all your help. I'm in the final stages of rewriting the macro based on your suggestions. Once I'm done I'll test it and post the results as well as the new code.

    Dan

  9. #9
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Sure thing... one last thing to consider, although I'm curious to see your development process.

    You are essentially trying to do a pattern matching function, and limit yourself to acceptable patterns (in terms of finding the headings previous to a found range containing "shall" and "will").

    You should investigate using a wildcard search, rather than string pattern analysis. In the same way that you can build a collection of "found" sentences, you can also build a collection of successful wildcard searches, and then compare the one that is closest.

    In fact, it may be better to simply build that collection as well once for the document, and then compare the start values of that with the start values of your sentences. I don't have any demo code for you at the moment, but wildcard searches can be *very* powerful, and remove the need for any actual string comparisons (which can be memory intensive, especially when you get to a large document and are going through it in textual chunks-- whether line by line, or sentence by sentence).

    This link may help.... I'm toying around with a little demo code at the moment...
    http://word.mvps.org/faqs/General/UsingWildcards.htm

  10. #10
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Well, I couldn't resist. This seems to work much better on your demo document. The general concept is...
    1. Build a collection of found sentences, based on your search terms (currently "shall" and "will")
    2. Build a collection of found Headings, based on your existing string functionality analyzing the text of each paragraph in the document which has a text length of longer than 1 character (i.e., don't bother with empty paragraph marks, section breaks, etc)
    3. cycle through the collection of found sentences, adding the appropriate heading text, by comparing the start ranges
    4. Spit the results into a Word table (I'll leave getting it into Excel to you).

    I left out the idea of wildcard searching, because a) I always have to wrap my mind around the wildcard searching process-- there are others here who are much faster and familiar with that and b) it seemed pretty fast already.

    Running this on your demo document completes the operation in about 10 seconds. It will be a bit slower when you have to dump the data into Excel (since it takes time to open the excel app, generate a new worksheet, etc), but I think you can probably work out how to deal with the collection based on the creation of the table.

    So, in summary... this requires the following:
    1. A class called myInfo, which contains the following
    [vba]
    Option Explicit
    'for the actual found range
    Public InfoRange As Range
    'the heading string of the found range
    Public sHeading As String
    'this can be used to re-order in the logical order
    Public lKeyWordStart As Long
    Public sKeyWordFound As String

    'return the text, but remove any empty paragraph marks (including multiple paragraph marks)
    Public Property Get InfoText() As String
    If Not InfoRange Is Nothing Then
    InfoText = Replace(InfoRange.text, vbCr, "")
    End If
    End Property
    [/vba] 2. A second class called myHeading, which contains the following
    [vba]
    Option Explicit

    Public sHeadingText As String
    Public rngHeadingPara As Range
    Public lStart As Long
    [/vba] 3. The following code in a regular module (I've included your original code, as well as a couple of functions I had strictly for testing from the immediate window, although I think you only need the HeaderOutlineTitle function.
    [vba]
    Option Explicit
    Dim sKWArray() As String
    Sub Demo()
    Dim oInfo As myInfo
    Dim rngSearch As Range
    Dim sKeyWords As String
    Dim sKeyWord As String
    Dim i As Integer
    Dim colFoundInfo As Collection
    Dim colHeaders As Collection

    'get our keywords
    sKeyWords = "shall|will"
    'set up our collection of info
    Set colFoundInfo = New Collection

    'search for each of our keywords
    For i = 0 To UBound(Split(sKeyWords, "|"))
    'reset the range for each search term
    Set rngSearch = ActiveDocument.Content
    sKeyWord = Split(sKeyWords, "|")(i)
    'set up the find, not matching case
    With rngSearch.Find
    .MatchCase = False
    'get the keyword to search
    .text = sKeyWord
    'while we find our search term
    Do Until .Execute = False
    'create a new object
    Set oInfo = New myInfo
    'put the relevant data into it
    oInfo.lKeyWordStart = rngSearch.Start
    Set oInfo.InfoRange = rngSearch.Sentences(1)
    oInfo.sKeyWordFound = sKeyWord

    'oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
    'here we would add in functionality to grab the header info for this object
    '(or could put into the actual class-- depends on your preference)
    'and then add the objectto the collection
    colFoundInfo.Add oInfo
    Application.StatusBar = "Added heading: " & oInfo.sHeading
    Loop
    End With
    Next
    'now, get the heading texts collection
    Set colHeaders = fGetHeadingCollection
    For Each oInfo In colFoundInfo
    oInfo.sHeading = fGetClosestHeadingText(oInfo.lKeyWordStart, colHeaders)
    Next

    'output to a word table, just for the purposes of this demo
    DemoOutput colFoundInfo
    End Sub
    Public Function fGetClosestHeadingText(lBefore As Long, colHeadings As Collection) As String
    Dim oHeading As myHeading
    'Dim colHeadings As Collection
    Dim i As Integer
    Dim sRet As String

    'Set colHeadings = fGetHeadingCollection
    'if we're before any headings, then return an empty string before we loop through
    If lBefore < colHeadings(1).lStart Then
    Exit Function
    End If
    'cycle through all the headings, finding the one with the closest item
    'start at 2, since we always refer to the previous one
    For i = 2 To colHeadings.Count
    Set oHeading = colHeadings(i)
    If lBefore < oHeading.rngHeadingPara.Start Then
    sRet = colHeadings(i - 1).sHeadingText
    Exit For
    End If
    Next
    'it's the last heading
    If sRet = "" Then
    sRet = colHeadings(colHeadings.Count).sHeadingText
    End If
    fGetClosestHeadingText = sRet
    End Function
    'build the collection of headings
    Public Function fGetHeadingCollection() As Collection
    Dim oPara As Paragraph
    Dim oHeading As myHeading
    Dim sRet As String
    Dim colRet As Collection

    Set colRet = New Collection
    For Each oPara In ActiveDocument.Paragraphs
    'don't bother with blank paragraphs or heading rows
    If Len(oPara.Range.text) > 1 Then
    sRet = HeaderOutlineTitle(oPara.Range.text)
    If Len(sRet) > 0 Then
    Set oHeading = New myHeading
    oHeading.sHeadingText = sRet
    oHeading.lStart = oPara.Range.Start
    Set oHeading.rngHeadingPara = oPara.Range.Duplicate
    colRet.Add oHeading
    End If
    End If
    Next
    Set fGetHeadingCollection = colRet
    End Function
    'just dump it into a table at the moment
    Sub DemoOutput(colInfo As Collection)
    Dim oTable As Table
    Dim oDoc As Document
    Dim i As Integer
    Dim oInfo As myInfo

    Set oDoc = Documents.Add
    Set oTable = oDoc.Tables.Add(oDoc.Content, colInfo.Count + 1, 3)
    With oTable
    .Cell(1, 1).Range.text = "Position"
    .Cell(1, 2).Range.text = "Header"
    .Cell(1, 3).Range.text = "Sentence"
    For i = 1 To colInfo.Count
    Set oInfo = colInfo(i)
    .Cell(i + 1, 1).Range.text = oInfo.lKeyWordStart
    .Cell(i + 1, 2).Range.text = oInfo.sHeading
    .Cell(i + 1, 3).Range.text = oInfo.InfoText
    Next

    'quick and dirty sort, delete and autofit
    .Sort excludeHeader:=True, FieldNumber:="Column 1", SortFieldType:=wdSortFieldNumeric, _
    SortOrder:=wdSortOrderAscending
    .Columns(1).Delete
    .AutoFitBehavior (wdAutoFitContent)
    End With

    End Sub
    Sub SelectTheNextShallSentence()
    Dim rngSearch As Range
    Dim oInfo As myInfo

    Set rngSearch = ActiveDocument.Content
    rngSearch.Start = Selection.Start
    With rngSearch.Find
    .MatchCase = False
    .text = "shall"
    If .Execute Then
    Set oInfo = New myInfo
    oInfo.lKeyWordStart = rngSearch.Start
    Set oInfo.InfoRange = rngSearch.Sentences(1)
    oInfo.sKeyWordFound = "shall"
    oInfo.InfoRange.Select
    End If
    End With
    End Sub
    Sub DisplayTheHeadingOfThisSentence()
    MsgBox fReturnHeadingBefore(Selection.Range)
    End Sub
    Function fReturnHeadingBefore(rngWhere As Range) As String
    Dim oPara As Paragraph
    Dim rngSearch As Range
    Dim i As Integer
    Dim sRet As String

    Set rngSearch = rngWhere.Duplicate
    rngSearch.Start = rngSearch.Parent.Content.Start
    For i = rngSearch.Paragraphs.Count To 1 Step -1
    Set oPara = rngSearch.Paragraphs(i)
    sRet = HeaderOutlineTitle(oPara.Range.text)
    If sRet <> "" Then
    fReturnHeadingBefore = sRet
    Exit For
    End If
    Next
    End Function
    'See if we have a heading-style number. Returns the heading, or blank if not a heading.
    Public Function HeaderOutlineTitle(line As String) As String
    'Set up variables
    Dim CharCounter As Integer
    Dim NumberCounter As Integer
    Dim tCh As String
    Dim y As Integer

    CharCounter = 0
    NumberCounter = 0

    'For the 4 character headings, this skips all the unncessary logic
    If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
    HeaderOutlineTitle = Left(line, 4)
    GoTo SkipToEnd
    End If

    'Start looping through the line character by character
    For y = 1 To Len(line)
    tCh = Mid(line, y, 1)

    'This allows the letter "H" to be the first character
    If y = 1 And tCh = "H" Then
    GoTo SkipNumberCheck
    End If

    'If it's blank or a number or a period, exit
    If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then
    Exit For
    End If

    'If this is a number, keep track, otherwise reset the counter
    NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)

    'If we've had more than 2 numbers in a row, we're probably not seeing a heading
    If NumberCounter > 2 Then
    CharCounter = 0
    Exit For
    End If

    SkipNumberCheck: 'We can also skip all the other logic since it's the first character
    CharCounter = CharCounter + 1
    Next y

    'If this was a outline heading, list that information in column 4
    '(AND As long as the last character before the space was a number)
    If CharCounter <> 0 Then
    If IsNumeric(Mid(line, CharCounter, 1)) Then
    HeaderOutlineTitle = Left(line, CharCounter)
    Else
    HeaderOutlineTitle = Left(line, CharCounter - 1)
    End If
    Else
    HeaderOutlineTitle = ""
    End If

    SkipToEnd:
    End Function
    Function IsDigit(char As String) As Boolean
    'Uses [Asc] function to get an ASCII code of a character.
    Dim iAsciiCode As Integer
    iAsciiCode = Asc(char)

    'The ASCII codes for the digits range from 48 to 57. If an ASCII
    'code of the character is within this range, function returns true
    IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
    End Function

    Function IsDigits(text As String) As Boolean
    'If the text is 0 length, it's not a digit-only string
    If Len(text) = 0 Then
    IsDigits = False
    Else
    Dim x As Integer
    'the default is true. We'll set to false if...
    IsDigits = True
    For x = 1 To Len(text)
    '...we find something that's not a digit
    If Not IsDigit(Mid(text, x, 1)) Then
    IsDigits = False
    Exit For
    End If
    Next x
    End If
    End Function
    Function FileThere(FileName As String) As Boolean
    FileThere = (Dir(FileName) > "")
    End Function

    'Queries the user for a file name where we'll save our information
    Function GetXLSFileName() As String
    'open excel document
    Dim fd As FileDialog
    Set fd = Application.FileDialog(msoFileDialogFilePicker)
    Dim vSelFile As Variant
    With fd
    .AllowMultiSelect = False
    .Filters.Add "Excel 2003", "*.xls", 1
    .Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
    .Title = "Choose your destination spreadsheet file."

    'Show the dialog box
    If .Show = -1 Then
    'Set the file string to the selected file
    GetXLSFileName = .SelectedItems(1)
    'If the user presses Cancel...
    Else
    MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
    GetXLSFileName = "C:\ShallStatements.xls"
    End If
    End With
    End Function

    'Interface function that gathers keywords to search for in the document.
    Function SetUpKeywords() As Boolean
    ReDim sKWArray(1 To 2) As String
    sKWArray(1) = "shall"
    sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
    SetUpKeywords = True
    End Function

    'Finds the first searched keyword in a line and returns the position
    Function HasKeywords(line As String) As Integer
    HasKeywords = 0
    Dim x As Integer
    For x = 1 To UBound(sKWArray)
    HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)

    'Exit the loop if we find something
    If HasKeywords <> 0 Then Exit For
    Next x
    End Function




    'Main function that extracts requirements from the document
    Sub ExtractRequirements()
    Dim x As Integer, y As Integer, tCh As Integer
    Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
    intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.

    'Create arrays to hold data
    Dim sHeaders() As String
    Dim sStatements() As String
    ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
    ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.

    'Record Start Time
    sHeaders(1) = "Start Time"
    sStatements(1) = DateTime.Now

    'Write headers
    sHeaders(intRowCount - 1) = "Header"
    sStatements(intRowCount - 1) = "Requirement"

    'Figure out how many lines are in the document
    Dim NumOfLines As Integer
    NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)

    'Record time for setup to be completed
    sHeaders(2) = "Lines Computed"
    sStatements(2) = DateTime.Now

    'Set up requirement words (exit sub on failure)
    If Not SetUpKeywords Then Exit Sub

    'Turn off screen updating to improve performance
    If MsgBox("Would you like to turn off screen updating to improve performance? " & _
    "Please note that this may cause Word to appear unresponsive for quite " & _
    "some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
    Application.ScreenUpdating = False
    End If

    'Record time for information to be collected
    sHeaders(3) = "Params Collected"
    sStatements(3) = DateTime.Now

    'Loop through every line and look for headers
    Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
    header = "None" 'This is the header value used before we've found a legitimate header
    For x = 1 To NumOfLines
    'Go to the correct line
    Selection.GoTo wdGoToLine, wdGoToAbsolute, x

    'Expand to select the entire line, but skip the logic if it's blank
    Selection.Expand wdLine
    LineText = Trim(Selection.text)
    If LineText <> "" Then
    'see if it's a header, and if so, set the header
    tmpHeaderRslt = HeaderOutlineTitle(LineText)
    If tmpHeaderRslt <> "" Then header = tmpHeaderRslt

    'See if the line has the word "shall" or "must" in it
    ShallPos = HasKeywords(LineText)

    'If so, write it to the spreadsheet
    If ShallPos > 0 Then
    'Move to the correct character and expand to the full sentence
    Selection.Collapse wdCollapseStart
    Selection.Move wdCharacter, ShallPos
    Selection.Expand wdSentence
    SentenceText = Selection.text

    'Write the header to the spreadsheet
    sHeaders(intRowCount) = header
    sStatements(intRowCount) = SentenceText

    'Move the selection forward to the proper line (avoids replication)
    Selection.Collapse wdCollapseEnd
    If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1

    'Incremement the row counter
    intRowCount = intRowCount + 1
    End If
    End If
    Next x

    'Record time for statements to be collected
    sHeaders(4) = "Statements Collected"
    sStatements(4) = DateTime.Now

    'Open excel file and set up for receiving data
    Dim xlsFileString As String
    xlsFileString = GetXLSFileName
    Dim appExcel As Object
    Dim objSheet As Object
    Set appExcel = CreateObject("Excel.Application")
    Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")

    'Write data to the spreadsheet
    sHeaders(5) = "Excel write begin"
    sStatements(5) = DateTime.Now
    For x = 1 To intRowCount
    objSheet.Cells(x, 1) = sHeaders(x)
    objSheet.Cells(x, 2) = sStatements(x)
    Next x

    'Write end time
    objSheet.Cells(6, 1) = "Excel write ends"
    objSheet.Cells(6, 2) = DateTime.Now

    'Clean up and open for viewing.
    appExcel.workbooks(1).Close True
    appExcel.Quit
    Set objSheet = Nothing
    Set appExcel = Nothing

    'Open the new spreadsheet for viewing
    If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
    Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
    ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
    Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
    End If
    End Sub
    [/vba] As a note: Your use of the IsDigit or IsDigits functions could be substituted by simply using the built in IsNumeric function (which accomplishes the same thing those two functions do) within the HeaderOutlineTitle function. But I left that alone, as that is your primary heading pattern analysis which you are the most comfortable with. And it also doesn't appear to adversely impact the speed that much.

    Since this works in about 10 seconds on your 30 page document, I would imagine you shouldn't have this run much longer than a minute on any longer document.

    As a general note-- *always* use Option Explicit at the top of your modules. It saves you from having VBA try to determine the type of a variable on the fly. You didn't have a Dim y as Integer in your HeaderOutlineTitle function, which means it will be a Variant type-- that can be costly to the memory process.

    And lastly-- please note that since I went away from your "line by line" approach, and instead use the Sentences collection, there will be two major problems that you will need to handle.
    1) Sentences collection is really tricky. It truncates at times you wouldn't expect, and doesn't truncate at other times. Dr. Rev. Martin Luther King Jr. died in the 20th century. That "sentence" looks to Word like it is 4 sentences.
    2) The sentences collection also continues on using any white space. So in some cases, the text returned will include a lot of empty paragraphs. So I've simply placed in the class a quick and dirty method of removing any paragraph marks from the myInfo class. There are numerous ways to approach this, but this is one way.

    Hope this helps. Please post the code you got to, if you're interested in any suggestions for your approach. The main takeaway from this approach is that the use of collections is far superior to your inclination to use arrays. Not only can you put custom objects (the classes) into collections, being able to use the .Range.Start property as your main algorithm for document location speeds things up considerably, because you don't have to a) go through the document line by line and b) Longs are much less memory intensive than a string. When you magnify that by string arrays with 1000 elements, you can definitely slow yourself down.

    Cheers,
    - Jason aka Frosty
    Last edited by Frosty; 06-07-2012 at 03:14 PM.

  11. #11
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Returned your sample document, since you said the info is public. Contains my code, obviously. Open the document and run the Demo subroutine to test.
    Attached Files Attached Files

  12. #12
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    All right, thanks for your patience. I took off Friday and the weekend but I'm back and ready to roll.

    I've uploaded my new sample (creatively: "Newsample.docm"). I've incorporated your ideas except for the wildcard matching, which I'll give a try after this. It's running pretty quickly right now in the sample document, and in longer documents I'm not seeing the same slowdown at the end. Still, it takes a couple hours to process really long documents (100+) page, which is better than before but still could use some work.

    I'll take a look at the new suggestions and see what I can do. Thanks again!

    Dan
    Attached Files Attached Files

  13. #13
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    Looked in more detail at your new code. I'm sorry I didn't get back to you faster! I think my new code implemented the main performance gains, and it runs in 9 seconds (I implemented some performance testing functionality as well) so I think we're about on par where performance is concerned. Let me know if you think creating the two collections first and then running the comparisons offers a substantial benefit to what I've implemented in my most recent document.

    I've generally simplified the code pretty significantly; the main complexity at this point is in the header location routine, so I'll look into wildcards and see what I come up with. Hopefully it's something like regular expressions.

    Wish me luck!

    Dan

  14. #14
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Haven't downloaded your sample yet, but how does code which runs in 10 seconds on 30page sample doc turn into a couple of hours on a 100+ page doc? Are you sure this is a representative sample?

  15. #15
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    A valid question. The problem document is 172 pages. It's too big for the forum (just under 4MB) but it can be downloaded here:

    https://docs.google.com/file/d/0B06d...S0U/edit?pli=1

    Just go to the link and go to file->download. It doesn't seem qualitatively different than the sample document, but maybe I'm missing something.

    In other news: I'm working on an alternate means of identifying headings right now... I just finished with the first implementation, but it's actually slower than the original, believe it or not. I'll keep on workin'.

    Dan

  16. #16
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    Did you try my code on a couple of different chunks? Did the two collection approach seem to operate the same as yours not just on the sample, but on a real doc? I won't be able to look at this for a few hours.

    For the purposes of testing, putting the info into excel is meaningless, so you should be able to run the code I had on your stuff to test as well.

    Bottlenecks can show up in the strangest places. One of the big problems I had with your original code wasn't what I thought it would be-- but rather you using the move method on the selection. When it encountered section breaks, it would hang. I think you may be dealing with document corruption to some degree (not all that strange when the source is scanned PDFs)

    Will look over your stuff in a couple of hours. In the meantime, see if my code performs any differently for you on a larger sample size

  17. #17
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    Holy...

    Well, I tried your code and dear God is it fast. The biggest problem is that it's not matching the capital-letter-style headings... I'll see if I can patch that up.

    I'll start working on adapting my code to use this method. More soon.

    Dan

  18. #18
    VBAX Regular
    Joined
    May 2012
    Posts
    30
    Location
    (The capital-letter-style headings issue wasn't your fault, you're just using the old version of my header-finding code... I think once I drop in the updated code it will solve the issue.)

  19. #19
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    That's why I left the header identification algorithm alone, since I figured that was the thing most likely for you to change. The only thing I can add is that you really should get rid of your IsDigit/IsDigits functionality. It's redudant to IsNumeric and may very well be a bottleneck. If nothing else, it's code to wade through which you don't need to wade through.

    At this point, the biggest approach difference I see between your NewSample code is that you were using the paradigm of finding the heading before, rather than the additional collection of headings and then comparing the nearest.

    I think the 2nd collection of headings is going to be better, since it means you won't be back-tracking so much. Why go through the same paragraphs multiple times, only to find out that they aren't useful? Especially on a longer document. If you're going to need to loop through each paragraph in order to analyze different heading patterns, you might as well only do it once. Also, give yourself some short-circuits (i.e., if the Len of the paragraph text is <=2 or something, don't do anything). The most likely spot for corruption in a document is section breaks... might as well skip thos.

    I think building the collection of headings will still be the bottleneck (since the primary collection of sentences containing search terms uses the Find object) -- but you can further refine if it seems slow. But at the very least you won't have to change any of the methodology of your main routine-- you simply have to try to build the headings collection faster.

    Apart from that, you can look at various techniques to optimizing string functions (the statement If Len(myString) = 0 is technically faster than If myString = "" -- although I prefer the later statement for readability).

    VBA contains a LIKE operator, although it is limited... and the wild card searching can be very very powerful. But this structure (two collections, one with the sentences, one with the headings) is very scalable. If you then later need to add to it, it's not that difficult (search terms, criteria for the headers).

  20. #20
    VBAX Master
    Joined
    Feb 2011
    Posts
    1,480
    Location
    So now I'm trying to examine your HeaderOutlineTitle function... and there are a few bits I don't get, and a few bits I think you can improve the logic of.

    So a couple of comments:
    1. Don't use single line If...Then statements. There is zero optimization to this, and it essentially means you can't effectively break point it.
    2. IIF constructs are not appreciably faster (to my knowledge), and they are much more confusing to read and troubleshoot.
    3. Complex And/Or operations are likewise difficult to troubleshoot... for example.
    [vba]
    'If it's blank or a number or a period, exit
    If tCH= " " Or (Not IsNumeric(tCh) And tCH <> ".") Then
    Exit For
    End If
    [/vba] That logic test doesn't make a lick of sense to me. Especially with the inaccurate comment. What you are literally testing there is
    A. If it is a blank space OR
    B. It is NOT a number AND it is NOT a period.

    So, basically... you have some valid characters to allow you to keep processing, right? This is an excellent time to make use of Select Case, for multiple reasons. It's easier to modify, and it's easier to understand. For example...
    [vba]
    Select Case tCH
    Case 0 to 9, "."
    'acceptable, keep processing
    Case Else
    Exit For
    End Select
    [/vba] You don't have to use the 0 to 9 construct, you could do "0", "1", "2", etc... but Select Case is often a good way to do an acceptable list. It's certainly easier to read and modify than a complicated If statement utilizing AND/OR operators.

    4. IIF... blech. Easier to read this way...
    [VBA]
    'If this is a number, keep track, otherwise reset the counter
    If IsNumeric(tCH) Then
    NumberCounter = NumberCounter + 1
    Else
    NumberCounter = 0
    End If
    [/VBA]
    I would be surprised if the above simplifications changed the over all processing time.

    As a benchmark, running my code on your 4 meg document took around 2 minutes, with the majority of that time being putting stuff into the word table (which may very well be faster for you, since excel doesn't have to re-paginate periodically when large amounts of data are put into a spreadsheet.

    I think there are ways to simplify what you want out of the HeaderOutlineTitle functionality even without going into Wildcard searching.

    But if you want to investigate that, it's probably worth of its own post -- something along the lines of "How can I use wildcard searching to match..."
    And then list out all the patterns that you want to find, coupled with the kinds of patterns you don't want to match, and perhaps one of our resident expert wildcard searchers will weigh in.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •