What I have:
I have a macro in a word document that looks for statements that say "shall" or "will". When it finds it, it writes the statement and the section header to an excel file.
The problem:
These are crappy word files exported from PDF files, so they don't have "real" section headers. As a result, I'm forced to identify headers using fairly complex logic. The end result is that my macro runs slowly--far too slowly. On long proposals (100+ pages) it can take over an hour.
I need help with:
I could use some help identifying bottlenecks in my code. The attached file includes part of a proposal I'm using to gauge performance. I'm getting about 2 minutes right now, but ideally this file should process in about 30 seconds.
The Code:
Here's the code in all its twisted glory.
[VBA]Dim sKWArray() As String
Function FileThere(FileName As String) As Boolean
FileThere = (Dir(FileName) > "")
End Function
'Queries the user for a file name where we'll save our information
Function GetXLSFileName() As String
'open excel document
Dim fd As FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
Dim vSelFile As Variant
With fd
.AllowMultiSelect = False
.Filters.Add "Excel 2003", "*.xls", 1
.Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
.Title = "Choose your destination spreadsheet file."
'Show the dialog box
If .Show = -1 Then
'Set the file string to the selected file
GetXLSFileName = .SelectedItems(1)
'If the user presses Cancel...
Else
MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
GetXLSFileName = "C:\ShallStatements.xls"
End If
End With
End Function
'Interface function that gathers keywords to search for in the document.
Function SetUpKeywords() As Boolean
ReDim sKWArray(1 To 2) As String
sKWArray(1) = "shall"
sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
SetUpKeywords = True
End Function
'Finds the first searched keyword in a line and returns the position
Function HasKeywords(line As String) As Integer
HasKeywords = 0
Dim x As Integer
For x = 1 To UBound(sKWArray)
HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)
'Exit the loop if we find something
If HasKeywords <> 0 Then Exit For
Next x
End Function
'See if we have a heading-style number. Returns the heading, or blank if not a heading.
Private Function HeaderOutlineTitle(line As String) As String
'Set up variables
Dim CharCounter As Integer
Dim NumberCounter As Integer
Dim tCh As String
CharCounter = 0
NumberCounter = 0
'For the 4 character headings, this skips all the unncessary logic
If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
HeaderOutlineTitle = Left(line, 4)
GoTo SkipToEnd
End If
'Start looping through the line character by character
For y = 1 To Len(line)
tCh = Mid(line, y, 1)
'This allows the letter "H" to be the first character
If y = 1 And tCh = "H" Then GoTo SkipNumberCheck
'If it's blank or a number or a period, exit
If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then Exit For
'If this is a number, keep track, otherwise reset the counter
NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)
'If we've had more than 2 numbers in a row, we're probably not seeing a heading
If NumberCounter > 2 Then
CharCounter = 0
Exit For
End If
SkipNumberCheck: 'We can also skip all the other logic since it's the first character
CharCounter = CharCounter + 1
Next y
'If this was a outline heading, list that information in column 4
'(AND As long as the last character before the space was a number)
If CharCounter <> 0 Then
If IsNumeric(Mid(line, CharCounter, 1)) Then
HeaderOutlineTitle = Left(line, CharCounter)
Else
HeaderOutlineTitle = Left(line, CharCounter - 1)
End If
Else
HeaderOutlineTitle = ""
End If
SkipToEnd:
End Function
Function IsDigit(char As String) As Boolean
'Uses [Asc] function to get an ASCII code of a character.
Dim iAsciiCode As Integer
iAsciiCode = Asc(char)
'The ASCII codes for the digits range from 48 to 57. If an ASCII
'code of the character is within this range, function returns true
IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
End Function
Function IsDigits(text As String) As Boolean
'If the text is 0 length, it's not a digit-only string
If Len(text) = 0 Then
IsDigits = False
Else
Dim x As Integer
'the default is true. We'll set to false if...
IsDigits = True
For x = 1 To Len(text)
'...we find something that's not a digit
If Not IsDigit(Mid(text, x, 1)) Then
IsDigits = False
Exit For
End If
Next x
End If
End Function
'Main function that extracts requirements from the document
Sub ExtractRequirements()
Dim x As Integer, y As Integer, tCh As Integer
Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.
'Create arrays to hold data
Dim sHeaders() As String
Dim sStatements() As String
ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
'Record Start Time
sHeaders(1) = "Start Time"
sStatements(1) = DateTime.Now
'Figure out how many lines are in the document
Dim NumOfLines As Integer
NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)
'Record time for setup to be completed
sHeaders(2) = "Lines Computed"
sStatements(2) = DateTime.Now
'Set up requirement words (exit sub on failure)
If Not SetUpKeywords Then Exit Sub
'Turn off screen updating to improve performance
If MsgBox("Would you like to turn off screen updating to improve performance? " & _
"Please note that this may cause Word to appear unresponsive for quite " & _
"some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
Application.ScreenUpdating = False
End If
'Record time for information to be collected
sHeaders(3) = "Params Collected"
sStatements(3) = DateTime.Now
'Loop through every line and look for headers
Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
header = "None" 'This is the header value used before we've found a legitimate header
For x = 1 To NumOfLines
'Go to the correct line
Selection.GoTo wdGoToLine, wdGoToAbsolute, x
'Expand to select the entire line, but skip the logic if it's blank
Selection.Expand wdLine
LineText = Trim(Selection.text)
If LineText <> "" Then
'see if it's a header, and if so, set the header
tmpHeaderRslt = HeaderOutlineTitle(LineText)
If tmpHeaderRslt <> "" Then header = tmpHeaderRslt
'See if the line has the word "shall" or "must" in it
ShallPos = HasKeywords(LineText)
'If so, write it to the spreadsheet
If ShallPos > 0 Then
'Move to the correct character and expand to the full sentence
Selection.Collapse wdCollapseStart
Selection.Move wdCharacter, ShallPos
Selection.Expand wdSentence
SentenceText = Selection.text
'Write the header to the spreadsheet
sHeaders(intRowCount) = header
sStatements(intRowCount) = SentenceText
'Move the selection forward to the proper line (avoids replication)
Selection.Collapse wdCollapseEnd
If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1
'Incremement the row counter
intRowCount = intRowCount + 1
End If
End If
Next x
'Record time for statements to be collected
sHeaders(4) = "Statements Collected"
sStatements(4) = DateTime.Now
'Open excel file and set up for receiving data
Dim xlsFileString As String
xlsFileString = GetXLSFileName
Dim appExcel As Object
Dim objSheet As Object
Set appExcel = CreateObject("Excel.Application")
Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")
'Write data to the spreadsheet
sHeaders(5) = "Excel write begin"
sStatements(5) = DateTime.Now
For x = 1 To intRowCount
objSheet.Cells(x, 1) = sHeaders(x)
objSheet.Cells(x, 2) = sStatements(x)
Next x
'Write end time
objSheet.Cells(6, 1) = "Excel write ends"
objSheet.Cells(6, 2) = DateTime.Now
'Clean up and open for viewing.
appExcel.workbooks(1).Close True
appExcel.Quit
Set objSheet = Nothing
Set appExcel = Nothing
'Open the new spreadsheet for viewing
If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
End If
End Sub
[/VBA]
Yep, you've got a bad set up. You need to stop using string functions (like InStr) entirely, and substitute by using the Find object.
You could probably get some increase in performance by compiling this code rather than leaving it in an unmanaged addin (i.e., make it into a COM addin, a DLL, etc vs. leaving the code in a .dotm)... but the main problem is use of string functions the way you're doing it.
My cursory glance at the structure of code isn't that bad (except for your use of the selection object-- you could probably get some value out of using a range object, although ranges and lines become tough).
But the main problem is that you haven't attempted to use the Find object. If I can sum up with two more concise bits of code...
Your strategy
[vba]
'This routine displays how many instances of a string exist in the document
'from the beginning of the selection to the end of the document
Sub DemoStringAnalysisBad()
Dim rngWhere As Range
Dim rngSearch As Range
Dim oPara As Paragraph
Dim sStringToFind As String
Dim iHowManyFound As Integer
Dim iInStrResult As Integer
sStringToFind = "Some Text"
Set rngWhere = Selection.Range
rngWhere.End = ActiveDocument.Content.End
For Each oPara In rngWhere.Paragraphs
Set rngSearch = oPara.Range
iInStrResult = InStr(rngSearch.Text, sStringToFind)
Do Until iInStrResult = 0
iHowManyFound = iHowManyFound + 1
rngSearch.Start = rngSearch.Start + iInStrResult
iInStrResult = InStr(rngSearch.Text, sStringToFind)
Loop
Next
MsgBox "Found " & iHowManyFound & " results"
End Sub
[/vba]
A better strategy
[vba]
'This routine displays how many instances of a string exist in the document
'from the beginning of the selection to the end of the document
Sub DemoStringAnalysisGood()
Dim rngWhere As Range
Dim rngSearch As Range
Dim sStringToFind As String
Dim iHowManyFound As Integer
sStringToFind = "Some Text"
Set rngWhere = Selection.Range
Set rngSearch = rngWhere.Duplicate
'set up the find object
With rngSearch.Find
.Text = sStringToFind
Do Until .Execute = False
iHowManyFound = iHowManyFound + 1
Loop
End With
MsgBox "Found " & iHowManyFound & " results"
End Sub
[/vba]
Now, if you simply create a 100 page document with just paragraphs, just text... and the keywords sprinkled judiciously, the string function will seem to work faster.
But if you create a 100 page document with paragraphs, tables, etc... the find function will work faster.
Of course, my for each loop is a heck of a lot faster than your process, which is to move the selection line by line, select it, grab the text of that line, and then loop through character by character of the text property.
This is proof of concept... does this give you a direction to go? Yes, it requires that you re-write everything. But unfortunately, you have painted yourself into a very bad corner. There could be marginal improvements if you simply did a For Each oPara in ActiveDocument.Paragraphs ... and then grabbed the .Text property of the oPara.Range, but you may very well end up in the same corner at some point.
The Find object has been optimized to search for stuff. You've duplicated that functionality, but in a much slower way.
Wow... I just actually checked out your macro and sample document.
Although I didn't turn screenupdating off, because I wanted to see what it did... it took me 10 minutes (and at times, word hung, so I needed to break the action, then hit continue to get the macro to keep going) to get through the whole thing.
And the final output is an excel spreadsheet with only 146 rows of data?
You're off in saying it should take 30 seconds. It should take less than that. A couple additional points:
1. Your use of the Sentences collection will fail you at some times. Try having a sentence which includes a period in it (the contractor shall furnish subject matter expertise for the X.Y.Z. program yada yada).
2. As will your use of lines. The first two sample outputs proves this-- one is a sentence fragment because a paragraph mark is there, and the other is an entire sentences.
3. You're clearly running into memory buffer issues, as the macro demonstrably slows down towards the end.
At this point, I can help you... but as you have a single post, you need to reply here and include any links (even if they are only text) to any cross-posts you've made.
It's a little bit of an intriguing problem, but I'm not wasting time on it if you are getting help (or have already had it answered) elsewhere (we seem to have a problem with a lot of 1 post count cross posters here).
As a quick proof of concept... depending on your familiarity with collections and custom classes, this would be one approach.
Create a class module, call it "myInfo"
Inside it you could have something like this (unfortunately, the vba parser doesn't make this very pretty):
[vba]
Option Explicit
'A private variable for the main info
Private m_rngInfo As Range
'the heading string of the found range
Public sHeading As String
'this can be used to re-order in the logical order
Public lKeyWordStart As Long
Public sKeyWordFound As String
Public Property Get InfoRange() As Range
Set InfoRange = m_rngInfo
End Property
Public Property Set InfoRange(rngWhere As Range)
Set m_rngInfo = rngWhere
End Property
Public Property Get InfoText() As String
InfoText = m_rngInfo.text
End Property
[/vba]
And then you could use the following routine in a regular module, to create a collection of that custom class, using the search terms and the find object.
[vba]
Sub Demo()
Dim oInfo As myInfo
Dim rngSearch As Range
Dim sKeyWords As String
Dim sKeyWord As String
Dim i As Integer
Dim colFoundInfo As Collection
'get our keywords
sKeyWords = "shall|will"
'set up our collection of info
Set colFoundInfo = New Collection
'search for each of our keywords
For i = 0 To UBound(Split(sKeyWords, "|"))
'reset the range for each search term
Set rngSearch = ActiveDocument.Content
sKeyWord = Split(sKeyWords, "|")(i)
'set up the find, not matching case
With rngSearch.Find
.MatchCase = False
'get the keyword to search
.text = sKeyWord
'while we find our search term
Do Until .Execute = False
'create a new object
Set oInfo = New myInfo
'put the relevant data into it
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = sKeyWord
'here we would add in functionality to grab the header info for this object
'(or could put into the actual class-- depends on your preference)
'and then add the objectto the collection
colFoundInfo.Add oInfo
Loop
End With
Next
Dim sDemoMessage As String
For i = 1 To 5
On Error Resume Next
sDemoMessage = sDemoMessage & colFoundInfo(i).InfoText
Next
MsgBox "The first 5 found terms of " & colFoundInfo.Count & " terms are these sentences:" & vbCr & vbCr & _
sDemoMessage
'check out the locals window in View > Locals Window, and expand the colFoundInfo object
Stop
End Sub
[/vba]
I left the stop code in there on purpose, so you can stop the code and see what's going on in the Locals Window.
What this does not do is identify the header for whatever sentence you're dealing with. It also doesn't solve your problem of sentences broken up by paragraph marks (which is rampant in the sample document).
However, the next steps would these:
1. Identify the header info for the found ranges, using the .InfoRange property of the class (you could do this in the class or outside and then set it
2. Properly sort the information (there are a lot of ways to sort information, once you have it) based on the .lKeywordStart property of the class. You dump all the info from the collection into an array and sort, or you could put all the info into excel, and then sort on the .lKeyWordStart value, and then delete that column.
The collection of found ranges will be naturally sorted by everything which contained "shall" in the order it appeared in the document, and then "will" in the order that term appeared in the document.
But the main purpose of this demo is to show you how fast you can get the main info from your document using this methodology, rather than your existing methodology.
I was curious, so I did a little more investigation. But now I'll wait to see if you respond.
As an fyi... you should really try to remove any sensitive data from documents you post on the internet.
Thanks for the responses, Frosty. First of all, this is the only posting of this issue; I understand how frustrating it can be to put effort into a problem that turns out to already be solved.
I actually started this project with a macro someone else had developed. It used the Find function and was very fast; the problem was that it didn't pull the headers.
The headers can be in various formats, can be located in the same paragraph or a different paragraph, and generally are a pain in the neck to find. All the performance issues spring from the concessions I've made in order to (somewhat) reliably find headers.
At this point, I'm using this paradigm:
(1)Go through each line, determining if its a header. If it is, make it the "current" header.
(2)As I go through each line, check for keywords. If it has a keyword, record it with the current header.
I can think of at least two other paradigms. One would be:
(1)Find each occurance of a keyword in a document.
(2)Search backward for the last header before the keyword and record the two.
(3)Repeat for each keyword
And another would be:
(1)Check every line for a header and store it with its line number in a multi-dimensional array.
(2)Use the "find" function to look for keywords and record each one with the smallest header line number that isn't greater than the line number of the keyword.
Either new paradigm would require a complete rewrite and I'm reluctant to do so unless I know there will be a performance gain. That said, if you have some insight into the header problem I'm certainly happy to put the work in.
No sensitive information here; it's all publicly accessible except for the instructions at the top and the macro itself. Thanks for looking out though.
And lastly--thanks again; its clear you spent a lot of time on this and I appreciate that a great deal.
There will be a significant performance increase if you go with the find object. Orders of magnitude. Finding the headers can be a little tricky, but mostly because it needs to be done in a way you will feel comfortable modifying. This will involve a little learning on your part on how to work with ranges.
If you're interested in that, then we can proceed. Otherwise, I would recommend getting rid of the line by line, and readjusting your function to use two for each loops... The paragraphs collection and the sentences collection.
Either way, you're looking at a re-write. In both re-writes you will get a performance increase, but without the find object, I think it may be marginal. And the sentences collection has issues, so a rewrite based on the sentences collection paints you into a different corner.
Have you tried my code and understood it? Forget multi-dimensional arrays and going through the document in order. If you can work to understand what's going on in the code with the collection, finding the headers will be relatively easy. Promise.
This doesn't totally solve the problem, as we need to figure out a better way of getting the header info than your current process, but adding the following line of code
oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
just before you add the oInfo item to the collection, combined with the following function will help you on the way... this at least gets all the info in a faster way, but is still far from optimized (since it still uses your HeadingOutlineTitle function)
[VBA]
Function fReturnHeadingBefore(rngWhere As Range) As String
Dim oPara As Paragraph
Dim rngSearch As Range
Dim i As Integer
Dim sRet As String
Set rngSearch = rngWhere.Duplicate
rngSearch.Start = rngSearch.Parent.Content.Start
For i = rngSearch.Paragraphs.Count To 1 Step -1
Set oPara = rngSearch.Paragraphs(i)
sRet = HeaderOutlineTitle(oPara.Range.text)
If sRet <> "" Then
fReturnHeadingBefore = sRet
Exit For
End If
Next
End Function
[/VBA]
Is this making sense, thus far?
Thanks again for all your help. I'm in the final stages of rewriting the macro based on your suggestions. Once I'm done I'll test it and post the results as well as the new code.
Sure thing... one last thing to consider, although I'm curious to see your development process.
You are essentially trying to do a pattern matching function, and limit yourself to acceptable patterns (in terms of finding the headings previous to a found range containing "shall" and "will").
You should investigate using a wildcard search, rather than string pattern analysis. In the same way that you can build a collection of "found" sentences, you can also build a collection of successful wildcard searches, and then compare the one that is closest.
In fact, it may be better to simply build that collection as well once for the document, and then compare the start values of that with the start values of your sentences. I don't have any demo code for you at the moment, but wildcard searches can be *very* powerful, and remove the need for any actual string comparisons (which can be memory intensive, especially when you get to a large document and are going through it in textual chunks-- whether line by line, or sentence by sentence).
Well, I couldn't resist. This seems to work much better on your demo document. The general concept is...
1. Build a collection of found sentences, based on your search terms (currently "shall" and "will")
2. Build a collection of found Headings, based on your existing string functionality analyzing the text of each paragraph in the document which has a text length of longer than 1 character (i.e., don't bother with empty paragraph marks, section breaks, etc)
3. cycle through the collection of found sentences, adding the appropriate heading text, by comparing the start ranges
4. Spit the results into a Word table (I'll leave getting it into Excel to you).
I left out the idea of wildcard searching, because a) I always have to wrap my mind around the wildcard searching process-- there are others here who are much faster and familiar with that and b) it seemed pretty fast already.
Running this on your demo document completes the operation in about 10 seconds. It will be a bit slower when you have to dump the data into Excel (since it takes time to open the excel app, generate a new worksheet, etc), but I think you can probably work out how to deal with the collection based on the creation of the table.
So, in summary... this requires the following:
1. A class called myInfo, which contains the following
[vba]
Option Explicit
'for the actual found range
Public InfoRange As Range
'the heading string of the found range
Public sHeading As String
'this can be used to re-order in the logical order
Public lKeyWordStart As Long
Public sKeyWordFound As String
'return the text, but remove any empty paragraph marks (including multiple paragraph marks)
Public Property Get InfoText() As String
If Not InfoRange Is Nothing Then
InfoText = Replace(InfoRange.text, vbCr, "")
End If
End Property
[/vba] 2. A second class called myHeading, which contains the following
[vba]
Option Explicit
Public sHeadingText As String
Public rngHeadingPara As Range
Public lStart As Long
[/vba] 3. The following code in a regular module (I've included your original code, as well as a couple of functions I had strictly for testing from the immediate window, although I think you only need the HeaderOutlineTitle function.
[vba]
Option Explicit
Dim sKWArray() As String
Sub Demo()
Dim oInfo As myInfo
Dim rngSearch As Range
Dim sKeyWords As String
Dim sKeyWord As String
Dim i As Integer
Dim colFoundInfo As Collection
Dim colHeaders As Collection
'get our keywords
sKeyWords = "shall|will"
'set up our collection of info
Set colFoundInfo = New Collection
'search for each of our keywords
For i = 0 To UBound(Split(sKeyWords, "|"))
'reset the range for each search term
Set rngSearch = ActiveDocument.Content
sKeyWord = Split(sKeyWords, "|")(i)
'set up the find, not matching case
With rngSearch.Find
.MatchCase = False
'get the keyword to search
.text = sKeyWord
'while we find our search term
Do Until .Execute = False
'create a new object
Set oInfo = New myInfo
'put the relevant data into it
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = sKeyWord
'oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
'here we would add in functionality to grab the header info for this object
'(or could put into the actual class-- depends on your preference)
'and then add the objectto the collection
colFoundInfo.Add oInfo
Application.StatusBar = "Added heading: " & oInfo.sHeading
Loop
End With
Next
'now, get the heading texts collection
Set colHeaders = fGetHeadingCollection
For Each oInfo In colFoundInfo
oInfo.sHeading = fGetClosestHeadingText(oInfo.lKeyWordStart, colHeaders)
Next
'output to a word table, just for the purposes of this demo
DemoOutput colFoundInfo
End Sub
Public Function fGetClosestHeadingText(lBefore As Long, colHeadings As Collection) As String
Dim oHeading As myHeading
'Dim colHeadings As Collection
Dim i As Integer
Dim sRet As String
'Set colHeadings = fGetHeadingCollection
'if we're before any headings, then return an empty string before we loop through
If lBefore < colHeadings(1).lStart Then
Exit Function
End If
'cycle through all the headings, finding the one with the closest item
'start at 2, since we always refer to the previous one
For i = 2 To colHeadings.Count
Set oHeading = colHeadings(i)
If lBefore < oHeading.rngHeadingPara.Start Then
sRet = colHeadings(i - 1).sHeadingText
Exit For
End If
Next
'it's the last heading
If sRet = "" Then
sRet = colHeadings(colHeadings.Count).sHeadingText
End If
fGetClosestHeadingText = sRet
End Function
'build the collection of headings
Public Function fGetHeadingCollection() As Collection
Dim oPara As Paragraph
Dim oHeading As myHeading
Dim sRet As String
Dim colRet As Collection
Set colRet = New Collection
For Each oPara In ActiveDocument.Paragraphs
'don't bother with blank paragraphs or heading rows
If Len(oPara.Range.text) > 1 Then
sRet = HeaderOutlineTitle(oPara.Range.text)
If Len(sRet) > 0 Then
Set oHeading = New myHeading
oHeading.sHeadingText = sRet
oHeading.lStart = oPara.Range.Start
Set oHeading.rngHeadingPara = oPara.Range.Duplicate
colRet.Add oHeading
End If
End If
Next
Set fGetHeadingCollection = colRet
End Function
'just dump it into a table at the moment
Sub DemoOutput(colInfo As Collection)
Dim oTable As Table
Dim oDoc As Document
Dim i As Integer
Dim oInfo As myInfo
Set oDoc = Documents.Add
Set oTable = oDoc.Tables.Add(oDoc.Content, colInfo.Count + 1, 3)
With oTable
.Cell(1, 1).Range.text = "Position"
.Cell(1, 2).Range.text = "Header"
.Cell(1, 3).Range.text = "Sentence"
For i = 1 To colInfo.Count
Set oInfo = colInfo(i)
.Cell(i + 1, 1).Range.text = oInfo.lKeyWordStart
.Cell(i + 1, 2).Range.text = oInfo.sHeading
.Cell(i + 1, 3).Range.text = oInfo.InfoText
Next
'quick and dirty sort, delete and autofit
.Sort excludeHeader:=True, FieldNumber:="Column 1", SortFieldType:=wdSortFieldNumeric, _
SortOrder:=wdSortOrderAscending
.Columns(1).Delete
.AutoFitBehavior (wdAutoFitContent)
End With
End Sub
Sub SelectTheNextShallSentence()
Dim rngSearch As Range
Dim oInfo As myInfo
Set rngSearch = ActiveDocument.Content
rngSearch.Start = Selection.Start
With rngSearch.Find
.MatchCase = False
.text = "shall"
If .Execute Then
Set oInfo = New myInfo
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = "shall"
oInfo.InfoRange.Select
End If
End With
End Sub
Sub DisplayTheHeadingOfThisSentence()
MsgBox fReturnHeadingBefore(Selection.Range)
End Sub
Function fReturnHeadingBefore(rngWhere As Range) As String
Dim oPara As Paragraph
Dim rngSearch As Range
Dim i As Integer
Dim sRet As String
Set rngSearch = rngWhere.Duplicate
rngSearch.Start = rngSearch.Parent.Content.Start
For i = rngSearch.Paragraphs.Count To 1 Step -1
Set oPara = rngSearch.Paragraphs(i)
sRet = HeaderOutlineTitle(oPara.Range.text)
If sRet <> "" Then
fReturnHeadingBefore = sRet
Exit For
End If
Next
End Function
'See if we have a heading-style number. Returns the heading, or blank if not a heading.
Public Function HeaderOutlineTitle(line As String) As String
'Set up variables
Dim CharCounter As Integer
Dim NumberCounter As Integer
Dim tCh As String
Dim y As Integer
CharCounter = 0
NumberCounter = 0
'For the 4 character headings, this skips all the unncessary logic
If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
HeaderOutlineTitle = Left(line, 4)
GoTo SkipToEnd
End If
'Start looping through the line character by character
For y = 1 To Len(line)
tCh = Mid(line, y, 1)
'This allows the letter "H" to be the first character
If y = 1 And tCh = "H" Then
GoTo SkipNumberCheck
End If
'If it's blank or a number or a period, exit
If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then
Exit For
End If
'If this is a number, keep track, otherwise reset the counter
NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)
'If we've had more than 2 numbers in a row, we're probably not seeing a heading
If NumberCounter > 2 Then
CharCounter = 0
Exit For
End If
SkipNumberCheck: 'We can also skip all the other logic since it's the first character
CharCounter = CharCounter + 1
Next y
'If this was a outline heading, list that information in column 4
'(AND As long as the last character before the space was a number)
If CharCounter <> 0 Then
If IsNumeric(Mid(line, CharCounter, 1)) Then
HeaderOutlineTitle = Left(line, CharCounter)
Else
HeaderOutlineTitle = Left(line, CharCounter - 1)
End If
Else
HeaderOutlineTitle = ""
End If
SkipToEnd:
End Function
Function IsDigit(char As String) As Boolean
'Uses [Asc] function to get an ASCII code of a character.
Dim iAsciiCode As Integer
iAsciiCode = Asc(char)
'The ASCII codes for the digits range from 48 to 57. If an ASCII
'code of the character is within this range, function returns true
IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
End Function
Function IsDigits(text As String) As Boolean
'If the text is 0 length, it's not a digit-only string
If Len(text) = 0 Then
IsDigits = False
Else
Dim x As Integer
'the default is true. We'll set to false if...
IsDigits = True
For x = 1 To Len(text)
'...we find something that's not a digit
If Not IsDigit(Mid(text, x, 1)) Then
IsDigits = False
Exit For
End If
Next x
End If
End Function
Function FileThere(FileName As String) As Boolean
FileThere = (Dir(FileName) > "")
End Function
'Queries the user for a file name where we'll save our information
Function GetXLSFileName() As String
'open excel document
Dim fd As FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
Dim vSelFile As Variant
With fd
.AllowMultiSelect = False
.Filters.Add "Excel 2003", "*.xls", 1
.Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
.Title = "Choose your destination spreadsheet file."
'Show the dialog box
If .Show = -1 Then
'Set the file string to the selected file
GetXLSFileName = .SelectedItems(1)
'If the user presses Cancel...
Else
MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
GetXLSFileName = "C:\ShallStatements.xls"
End If
End With
End Function
'Interface function that gathers keywords to search for in the document.
Function SetUpKeywords() As Boolean
ReDim sKWArray(1 To 2) As String
sKWArray(1) = "shall"
sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
SetUpKeywords = True
End Function
'Finds the first searched keyword in a line and returns the position
Function HasKeywords(line As String) As Integer
HasKeywords = 0
Dim x As Integer
For x = 1 To UBound(sKWArray)
HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)
'Exit the loop if we find something
If HasKeywords <> 0 Then Exit For
Next x
End Function
'Main function that extracts requirements from the document
Sub ExtractRequirements()
Dim x As Integer, y As Integer, tCh As Integer
Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.
'Create arrays to hold data
Dim sHeaders() As String
Dim sStatements() As String
ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
'Record Start Time
sHeaders(1) = "Start Time"
sStatements(1) = DateTime.Now
'Figure out how many lines are in the document
Dim NumOfLines As Integer
NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)
'Record time for setup to be completed
sHeaders(2) = "Lines Computed"
sStatements(2) = DateTime.Now
'Set up requirement words (exit sub on failure)
If Not SetUpKeywords Then Exit Sub
'Turn off screen updating to improve performance
If MsgBox("Would you like to turn off screen updating to improve performance? " & _
"Please note that this may cause Word to appear unresponsive for quite " & _
"some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
Application.ScreenUpdating = False
End If
'Record time for information to be collected
sHeaders(3) = "Params Collected"
sStatements(3) = DateTime.Now
'Loop through every line and look for headers
Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
header = "None" 'This is the header value used before we've found a legitimate header
For x = 1 To NumOfLines
'Go to the correct line
Selection.GoTo wdGoToLine, wdGoToAbsolute, x
'Expand to select the entire line, but skip the logic if it's blank
Selection.Expand wdLine
LineText = Trim(Selection.text)
If LineText <> "" Then
'see if it's a header, and if so, set the header
tmpHeaderRslt = HeaderOutlineTitle(LineText)
If tmpHeaderRslt <> "" Then header = tmpHeaderRslt
'See if the line has the word "shall" or "must" in it
ShallPos = HasKeywords(LineText)
'If so, write it to the spreadsheet
If ShallPos > 0 Then
'Move to the correct character and expand to the full sentence
Selection.Collapse wdCollapseStart
Selection.Move wdCharacter, ShallPos
Selection.Expand wdSentence
SentenceText = Selection.text
'Write the header to the spreadsheet
sHeaders(intRowCount) = header
sStatements(intRowCount) = SentenceText
'Move the selection forward to the proper line (avoids replication)
Selection.Collapse wdCollapseEnd
If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1
'Incremement the row counter
intRowCount = intRowCount + 1
End If
End If
Next x
'Record time for statements to be collected
sHeaders(4) = "Statements Collected"
sStatements(4) = DateTime.Now
'Open excel file and set up for receiving data
Dim xlsFileString As String
xlsFileString = GetXLSFileName
Dim appExcel As Object
Dim objSheet As Object
Set appExcel = CreateObject("Excel.Application")
Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")
'Write data to the spreadsheet
sHeaders(5) = "Excel write begin"
sStatements(5) = DateTime.Now
For x = 1 To intRowCount
objSheet.Cells(x, 1) = sHeaders(x)
objSheet.Cells(x, 2) = sStatements(x)
Next x
'Write end time
objSheet.Cells(6, 1) = "Excel write ends"
objSheet.Cells(6, 2) = DateTime.Now
'Clean up and open for viewing.
appExcel.workbooks(1).Close True
appExcel.Quit
Set objSheet = Nothing
Set appExcel = Nothing
'Open the new spreadsheet for viewing
If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
End If
End Sub
[/vba] As a note: Your use of the IsDigit or IsDigits functions could be substituted by simply using the built in IsNumeric function (which accomplishes the same thing those two functions do) within the HeaderOutlineTitle function. But I left that alone, as that is your primary heading pattern analysis which you are the most comfortable with. And it also doesn't appear to adversely impact the speed that much.
Since this works in about 10 seconds on your 30 page document, I would imagine you shouldn't have this run much longer than a minute on any longer document.
As a general note-- *always* use Option Explicit at the top of your modules. It saves you from having VBA try to determine the type of a variable on the fly. You didn't have a Dim y as Integer in your HeaderOutlineTitle function, which means it will be a Variant type-- that can be costly to the memory process.
And lastly-- please note that since I went away from your "line by line" approach, and instead use the Sentences collection, there will be two major problems that you will need to handle.
1) Sentences collection is really tricky. It truncates at times you wouldn't expect, and doesn't truncate at other times. Dr. Rev. Martin Luther King Jr. died in the 20th century. That "sentence" looks to Word like it is 4 sentences.
2) The sentences collection also continues on using any white space. So in some cases, the text returned will include a lot of empty paragraphs. So I've simply placed in the class a quick and dirty method of removing any paragraph marks from the myInfo class. There are numerous ways to approach this, but this is one way.
Hope this helps. Please post the code you got to, if you're interested in any suggestions for your approach. The main takeaway from this approach is that the use of collections is far superior to your inclination to use arrays. Not only can you put custom objects (the classes) into collections, being able to use the .Range.Start property as your main algorithm for document location speeds things up considerably, because you don't have to a) go through the document line by line and b) Longs are much less memory intensive than a string. When you magnify that by string arrays with 1000 elements, you can definitely slow yourself down.
All right, thanks for your patience. I took off Friday and the weekend but I'm back and ready to roll.
I've uploaded my new sample (creatively: "Newsample.docm"). I've incorporated your ideas except for the wildcard matching, which I'll give a try after this. It's running pretty quickly right now in the sample document, and in longer documents I'm not seeing the same slowdown at the end. Still, it takes a couple hours to process really long documents (100+) page, which is better than before but still could use some work.
I'll take a look at the new suggestions and see what I can do. Thanks again!
Looked in more detail at your new code. I'm sorry I didn't get back to you faster! I think my new code implemented the main performance gains, and it runs in 9 seconds (I implemented some performance testing functionality as well) so I think we're about on par where performance is concerned. Let me know if you think creating the two collections first and then running the comparisons offers a substantial benefit to what I've implemented in my most recent document.
I've generally simplified the code pretty significantly; the main complexity at this point is in the header location routine, so I'll look into wildcards and see what I come up with. Hopefully it's something like regular expressions.
Haven't downloaded your sample yet, but how does code which runs in 10 seconds on 30page sample doc turn into a couple of hours on a 100+ page doc? Are you sure this is a representative sample?
Just go to the link and go to file->download. It doesn't seem qualitatively different than the sample document, but maybe I'm missing something.
In other news: I'm working on an alternate means of identifying headings right now... I just finished with the first implementation, but it's actually slower than the original, believe it or not. I'll keep on workin'.
Did you try my code on a couple of different chunks? Did the two collection approach seem to operate the same as yours not just on the sample, but on a real doc? I won't be able to look at this for a few hours.
For the purposes of testing, putting the info into excel is meaningless, so you should be able to run the code I had on your stuff to test as well.
Bottlenecks can show up in the strangest places. One of the big problems I had with your original code wasn't what I thought it would be-- but rather you using the move method on the selection. When it encountered section breaks, it would hang. I think you may be dealing with document corruption to some degree (not all that strange when the source is scanned PDFs)
Will look over your stuff in a couple of hours. In the meantime, see if my code performs any differently for you on a larger sample size
Well, I tried your code and dear God is it fast. The biggest problem is that it's not matching the capital-letter-style headings... I'll see if I can patch that up.
I'll start working on adapting my code to use this method. More soon.
(The capital-letter-style headings issue wasn't your fault, you're just using the old version of my header-finding code... I think once I drop in the updated code it will solve the issue.)
That's why I left the header identification algorithm alone, since I figured that was the thing most likely for you to change. The only thing I can add is that you really should get rid of your IsDigit/IsDigits functionality. It's redudant to IsNumeric and may very well be a bottleneck. If nothing else, it's code to wade through which you don't need to wade through.
At this point, the biggest approach difference I see between your NewSample code is that you were using the paradigm of finding the heading before, rather than the additional collection of headings and then comparing the nearest.
I think the 2nd collection of headings is going to be better, since it means you won't be back-tracking so much. Why go through the same paragraphs multiple times, only to find out that they aren't useful? Especially on a longer document. If you're going to need to loop through each paragraph in order to analyze different heading patterns, you might as well only do it once. Also, give yourself some short-circuits (i.e., if the Len of the paragraph text is <=2 or something, don't do anything). The most likely spot for corruption in a document is section breaks... might as well skip thos.
I think building the collection of headings will still be the bottleneck (since the primary collection of sentences containing search terms uses the Find object) -- but you can further refine if it seems slow. But at the very least you won't have to change any of the methodology of your main routine-- you simply have to try to build the headings collection faster.
Apart from that, you can look at various techniques to optimizing string functions (the statement If Len(myString) = 0 is technically faster than If myString = "" -- although I prefer the later statement for readability).
VBA contains a LIKE operator, although it is limited... and the wild card searching can be very very powerful. But this structure (two collections, one with the sentences, one with the headings) is very scalable. If you then later need to add to it, it's not that difficult (search terms, criteria for the headers).
So now I'm trying to examine your HeaderOutlineTitle function... and there are a few bits I don't get, and a few bits I think you can improve the logic of.
So a couple of comments:
1. Don't use single line If...Then statements. There is zero optimization to this, and it essentially means you can't effectively break point it.
2. IIF constructs are not appreciably faster (to my knowledge), and they are much more confusing to read and troubleshoot.
3. Complex And/Or operations are likewise difficult to troubleshoot... for example.
[vba]
'If it's blank or a number or a period, exit
If tCH= " " Or (Not IsNumeric(tCh) And tCH <> ".") Then
Exit For
End If
[/vba] That logic test doesn't make a lick of sense to me. Especially with the inaccurate comment. What you are literally testing there is
A. If it is a blank space OR
B. It is NOT a number AND it is NOT a period.
So, basically... you have some valid characters to allow you to keep processing, right? This is an excellent time to make use of Select Case, for multiple reasons. It's easier to modify, and it's easier to understand. For example...
[vba]
Select Case tCH
Case 0 to 9, "."
'acceptable, keep processing
Case Else
Exit For
End Select
[/vba] You don't have to use the 0 to 9 construct, you could do "0", "1", "2", etc... but Select Case is often a good way to do an acceptable list. It's certainly easier to read and modify than a complicated If statement utilizing AND/OR operators.
4. IIF... blech. Easier to read this way...
[VBA]
'If this is a number, keep track, otherwise reset the counter
If IsNumeric(tCH) Then
NumberCounter = NumberCounter + 1
Else
NumberCounter = 0
End If
[/VBA]
I would be surprised if the above simplifications changed the over all processing time.
As a benchmark, running my code on your 4 meg document took around 2 minutes, with the majority of that time being putting stuff into the word table (which may very well be faster for you, since excel doesn't have to re-paginate periodically when large amounts of data are put into a spreadsheet.
I think there are ways to simplify what you want out of the HeaderOutlineTitle functionality even without going into Wildcard searching.
But if you want to investigate that, it's probably worth of its own post -- something along the lines of "How can I use wildcard searching to match..."
And then list out all the patterns that you want to find, coupled with the kinds of patterns you don't want to match, and perhaps one of our resident expert wildcard searchers will weigh in.