PDA

View Full Version : Complex macro has very poor performance



dipique
06-06-2012, 11:48 AM
What I have:
I have a macro in a word document that looks for statements that say "shall" or "will". When it finds it, it writes the statement and the section header to an excel file.

The problem:
These are crappy word files exported from PDF files, so they don't have "real" section headers. As a result, I'm forced to identify headers using fairly complex logic. The end result is that my macro runs slowly--far too slowly. On long proposals (100+ pages) it can take over an hour.

I need help with:
I could use some help identifying bottlenecks in my code. The attached file includes part of a proposal I'm using to gauge performance. I'm getting about 2 minutes right now, but ideally this file should process in about 30 seconds.

The Code:
Here's the code in all its twisted glory.

Dim sKWArray() As String

Function FileThere(FileName As String) As Boolean
FileThere = (Dir(FileName) > "")
End Function

'Queries the user for a file name where we'll save our information
Function GetXLSFileName() As String
'open excel document
Dim fd As FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
Dim vSelFile As Variant
With fd
.AllowMultiSelect = False
.Filters.Add "Excel 2003", "*.xls", 1
.Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
.Title = "Choose your destination spreadsheet file."

'Show the dialog box
If .Show = -1 Then
'Set the file string to the selected file
GetXLSFileName = .SelectedItems(1)
'If the user presses Cancel...
Else
MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
GetXLSFileName = "C:\ShallStatements.xls"
End If
End With
End Function

'Interface function that gathers keywords to search for in the document.
Function SetUpKeywords() As Boolean
ReDim sKWArray(1 To 2) As String
sKWArray(1) = "shall"
sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
SetUpKeywords = True
End Function

'Finds the first searched keyword in a line and returns the position
Function HasKeywords(line As String) As Integer
HasKeywords = 0
Dim x As Integer
For x = 1 To UBound(sKWArray)
HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)

'Exit the loop if we find something
If HasKeywords <> 0 Then Exit For
Next x
End Function

'See if we have a heading-style number. Returns the heading, or blank if not a heading.
Private Function HeaderOutlineTitle(line As String) As String
'Set up variables
Dim CharCounter As Integer
Dim NumberCounter As Integer
Dim tCh As String

CharCounter = 0
NumberCounter = 0

'For the 4 character headings, this skips all the unncessary logic
If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
HeaderOutlineTitle = Left(line, 4)
GoTo SkipToEnd
End If

'Start looping through the line character by character
For y = 1 To Len(line)
tCh = Mid(line, y, 1)

'This allows the letter "H" to be the first character
If y = 1 And tCh = "H" Then GoTo SkipNumberCheck

'If it's blank or a number or a period, exit
If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then Exit For

'If this is a number, keep track, otherwise reset the counter
NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)

'If we've had more than 2 numbers in a row, we're probably not seeing a heading
If NumberCounter > 2 Then
CharCounter = 0
Exit For
End If

SkipNumberCheck: 'We can also skip all the other logic since it's the first character
CharCounter = CharCounter + 1
Next y

'If this was a outline heading, list that information in column 4
'(AND As long as the last character before the space was a number)
If CharCounter <> 0 Then
If IsNumeric(Mid(line, CharCounter, 1)) Then
HeaderOutlineTitle = Left(line, CharCounter)
Else
HeaderOutlineTitle = Left(line, CharCounter - 1)
End If
Else
HeaderOutlineTitle = ""
End If

SkipToEnd:
End Function

Function IsDigit(char As String) As Boolean
'Uses [Asc] function to get an ASCII code of a character.
Dim iAsciiCode As Integer
iAsciiCode = Asc(char)

'The ASCII codes for the digits range from 48 to 57. If an ASCII
'code of the character is within this range, function returns true
IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
End Function

Function IsDigits(text As String) As Boolean
'If the text is 0 length, it's not a digit-only string
If Len(text) = 0 Then
IsDigits = False
Else
Dim x As Integer
'the default is true. We'll set to false if...
IsDigits = True
For x = 1 To Len(text)
'...we find something that's not a digit
If Not IsDigit(Mid(text, x, 1)) Then
IsDigits = False
Exit For
End If
Next x
End If
End Function

'Main function that extracts requirements from the document
Sub ExtractRequirements()
Dim x As Integer, y As Integer, tCh As Integer
Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.

'Create arrays to hold data
Dim sHeaders() As String
Dim sStatements() As String
ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.

'Record Start Time
sHeaders(1) = "Start Time"
sStatements(1) = DateTime.Now

'Write headers
sHeaders(intRowCount - 1) = "Header"
sStatements(intRowCount - 1) = "Requirement"

'Figure out how many lines are in the document
Dim NumOfLines As Integer
NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)

'Record time for setup to be completed
sHeaders(2) = "Lines Computed"
sStatements(2) = DateTime.Now

'Set up requirement words (exit sub on failure)
If Not SetUpKeywords Then Exit Sub

'Turn off screen updating to improve performance
If MsgBox("Would you like to turn off screen updating to improve performance? " & _
"Please note that this may cause Word to appear unresponsive for quite " & _
"some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
Application.ScreenUpdating = False
End If

'Record time for information to be collected
sHeaders(3) = "Params Collected"
sStatements(3) = DateTime.Now

'Loop through every line and look for headers
Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
header = "None" 'This is the header value used before we've found a legitimate header
For x = 1 To NumOfLines
'Go to the correct line
Selection.GoTo wdGoToLine, wdGoToAbsolute, x

'Expand to select the entire line, but skip the logic if it's blank
Selection.Expand wdLine
LineText = Trim(Selection.text)
If LineText <> "" Then
'see if it's a header, and if so, set the header
tmpHeaderRslt = HeaderOutlineTitle(LineText)
If tmpHeaderRslt <> "" Then header = tmpHeaderRslt

'See if the line has the word "shall" or "must" in it
ShallPos = HasKeywords(LineText)

'If so, write it to the spreadsheet
If ShallPos > 0 Then
'Move to the correct character and expand to the full sentence
Selection.Collapse wdCollapseStart
Selection.Move wdCharacter, ShallPos
Selection.Expand wdSentence
SentenceText = Selection.text

'Write the header to the spreadsheet
sHeaders(intRowCount) = header
sStatements(intRowCount) = SentenceText

'Move the selection forward to the proper line (avoids replication)
Selection.Collapse wdCollapseEnd
If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1

'Incremement the row counter
intRowCount = intRowCount + 1
End If
End If
Next x

'Record time for statements to be collected
sHeaders(4) = "Statements Collected"
sStatements(4) = DateTime.Now

'Open excel file and set up for receiving data
Dim xlsFileString As String
xlsFileString = GetXLSFileName
Dim appExcel As Object
Dim objSheet As Object
Set appExcel = CreateObject("Excel.Application")
Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")

'Write data to the spreadsheet
sHeaders(5) = "Excel write begin"
sStatements(5) = DateTime.Now
For x = 1 To intRowCount
objSheet.Cells(x, 1) = sHeaders(x)
objSheet.Cells(x, 2) = sStatements(x)
Next x

'Write end time
objSheet.Cells(6, 1) = "Excel write ends"
objSheet.Cells(6, 2) = DateTime.Now

'Clean up and open for viewing.
appExcel.workbooks(1).Close True
appExcel.Quit
Set objSheet = Nothing
Set appExcel = Nothing

'Open the new spreadsheet for viewing
If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
End If
End Sub


I appreciate any help you can give. Thank you!

Dan

Frosty
06-06-2012, 01:28 PM
Yep, you've got a bad set up. You need to stop using string functions (like InStr) entirely, and substitute by using the Find object.

You could probably get some increase in performance by compiling this code rather than leaving it in an unmanaged addin (i.e., make it into a COM addin, a DLL, etc vs. leaving the code in a .dotm)... but the main problem is use of string functions the way you're doing it.

My cursory glance at the structure of code isn't that bad (except for your use of the selection object-- you could probably get some value out of using a range object, although ranges and lines become tough).

But the main problem is that you haven't attempted to use the Find object. If I can sum up with two more concise bits of code...

Your strategy

'This routine displays how many instances of a string exist in the document
'from the beginning of the selection to the end of the document
Sub DemoStringAnalysisBad()
Dim rngWhere As Range
Dim rngSearch As Range
Dim oPara As Paragraph
Dim sStringToFind As String
Dim iHowManyFound As Integer
Dim iInStrResult As Integer

sStringToFind = "Some Text"
Set rngWhere = Selection.Range
rngWhere.End = ActiveDocument.Content.End
For Each oPara In rngWhere.Paragraphs
Set rngSearch = oPara.Range
iInStrResult = InStr(rngSearch.Text, sStringToFind)
Do Until iInStrResult = 0
iHowManyFound = iHowManyFound + 1
rngSearch.Start = rngSearch.Start + iInStrResult
iInStrResult = InStr(rngSearch.Text, sStringToFind)
Loop
Next
MsgBox "Found " & iHowManyFound & " results"
End Sub

A better strategy

'This routine displays how many instances of a string exist in the document
'from the beginning of the selection to the end of the document
Sub DemoStringAnalysisGood()
Dim rngWhere As Range
Dim rngSearch As Range
Dim sStringToFind As String
Dim iHowManyFound As Integer

sStringToFind = "Some Text"
Set rngWhere = Selection.Range
Set rngSearch = rngWhere.Duplicate
'set up the find object
With rngSearch.Find
.Text = sStringToFind
Do Until .Execute = False
iHowManyFound = iHowManyFound + 1
Loop
End With

MsgBox "Found " & iHowManyFound & " results"
End Sub

Now, if you simply create a 100 page document with just paragraphs, just text... and the keywords sprinkled judiciously, the string function will seem to work faster.
But if you create a 100 page document with paragraphs, tables, etc... the find function will work faster.
Of course, my for each loop is a heck of a lot faster than your process, which is to move the selection line by line, select it, grab the text of that line, and then loop through character by character of the text property.

This is proof of concept... does this give you a direction to go? Yes, it requires that you re-write everything. But unfortunately, you have painted yourself into a very bad corner. There could be marginal improvements if you simply did a For Each oPara in ActiveDocument.Paragraphs ... and then grabbed the .Text property of the oPara.Range, but you may very well end up in the same corner at some point.

The Find object has been optimized to search for stuff. You've duplicated that functionality, but in a much slower way.

Frosty
06-06-2012, 01:46 PM
Wow... I just actually checked out your macro and sample document.

Although I didn't turn screenupdating off, because I wanted to see what it did... it took me 10 minutes (and at times, word hung, so I needed to break the action, then hit continue to get the macro to keep going) to get through the whole thing.

And the final output is an excel spreadsheet with only 146 rows of data?

You're off in saying it should take 30 seconds. It should take less than that. A couple additional points:

1. Your use of the Sentences collection will fail you at some times. Try having a sentence which includes a period in it (the contractor shall furnish subject matter expertise for the X.Y.Z. program yada yada).
2. As will your use of lines. The first two sample outputs proves this-- one is a sentence fragment because a paragraph mark is there, and the other is an entire sentences.
3. You're clearly running into memory buffer issues, as the macro demonstrably slows down towards the end.

At this point, I can help you... but as you have a single post, you need to reply here and include any links (even if they are only text) to any cross-posts you've made.

It's a little bit of an intriguing problem, but I'm not wasting time on it if you are getting help (or have already had it answered) elsewhere (we seem to have a problem with a lot of 1 post count cross posters here).

Frosty
06-06-2012, 03:22 PM
As a quick proof of concept... depending on your familiarity with collections and custom classes, this would be one approach.

Create a class module, call it "myInfo"
Inside it you could have something like this (unfortunately, the vba parser doesn't make this very pretty):

Option Explicit
'A private variable for the main info
Private m_rngInfo As Range
'the heading string of the found range
Public sHeading As String
'this can be used to re-order in the logical order
Public lKeyWordStart As Long
Public sKeyWordFound As String

Public Property Get InfoRange() As Range
Set InfoRange = m_rngInfo
End Property

Public Property Set InfoRange(rngWhere As Range)
Set m_rngInfo = rngWhere
End Property

Public Property Get InfoText() As String
InfoText = m_rngInfo.text
End Property

And then you could use the following routine in a regular module, to create a collection of that custom class, using the search terms and the find object.

Sub Demo()
Dim oInfo As myInfo
Dim rngSearch As Range
Dim sKeyWords As String
Dim sKeyWord As String
Dim i As Integer
Dim colFoundInfo As Collection

'get our keywords
sKeyWords = "shall|will"
'set up our collection of info
Set colFoundInfo = New Collection

'search for each of our keywords
For i = 0 To UBound(Split(sKeyWords, "|"))
'reset the range for each search term
Set rngSearch = ActiveDocument.Content
sKeyWord = Split(sKeyWords, "|")(i)
'set up the find, not matching case
With rngSearch.Find
.MatchCase = False
'get the keyword to search
.text = sKeyWord
'while we find our search term
Do Until .Execute = False
'create a new object
Set oInfo = New myInfo
'put the relevant data into it
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = sKeyWord

'here we would add in functionality to grab the header info for this object
'(or could put into the actual class-- depends on your preference)
'and then add the objectto the collection
colFoundInfo.Add oInfo
Loop
End With
Next

Dim sDemoMessage As String
For i = 1 To 5
On Error Resume Next
sDemoMessage = sDemoMessage & colFoundInfo(i).InfoText
Next
MsgBox "The first 5 found terms of " & colFoundInfo.Count & " terms are these sentences:" & vbCr & vbCr & _
sDemoMessage

'check out the locals window in View > Locals Window, and expand the colFoundInfo object
Stop
End Sub

I left the stop code in there on purpose, so you can stop the code and see what's going on in the Locals Window.
What this does not do is identify the header for whatever sentence you're dealing with. It also doesn't solve your problem of sentences broken up by paragraph marks (which is rampant in the sample document).

However, the next steps would these:
1. Identify the header info for the found ranges, using the .InfoRange property of the class (you could do this in the class or outside and then set it
2. Properly sort the information (there are a lot of ways to sort information, once you have it) based on the .lKeywordStart property of the class. You dump all the info from the collection into an array and sort, or you could put all the info into excel, and then sort on the .lKeyWordStart value, and then delete that column.

The collection of found ranges will be naturally sorted by everything which contained "shall" in the order it appeared in the document, and then "will" in the order that term appeared in the document.

But the main purpose of this demo is to show you how fast you can get the main info from your document using this methodology, rather than your existing methodology.

I was curious, so I did a little more investigation. But now I'll wait to see if you respond.

As an fyi... you should really try to remove any sensitive data from documents you post on the internet.

dipique
06-07-2012, 04:41 AM
Thanks for the responses, Frosty. First of all, this is the only posting of this issue; I understand how frustrating it can be to put effort into a problem that turns out to already be solved.

I actually started this project with a macro someone else had developed. It used the Find function and was very fast; the problem was that it didn't pull the headers.

The headers can be in various formats, can be located in the same paragraph or a different paragraph, and generally are a pain in the neck to find. All the performance issues spring from the concessions I've made in order to (somewhat) reliably find headers.

At this point, I'm using this paradigm:
(1)Go through each line, determining if its a header. If it is, make it the "current" header.
(2)As I go through each line, check for keywords. If it has a keyword, record it with the current header.

I can think of at least two other paradigms. One would be:
(1)Find each occurance of a keyword in a document.
(2)Search backward for the last header before the keyword and record the two.
(3)Repeat for each keyword

And another would be:
(1)Check every line for a header and store it with its line number in a multi-dimensional array.
(2)Use the "find" function to look for keywords and record each one with the smallest header line number that isn't greater than the line number of the keyword.

Either new paradigm would require a complete rewrite and I'm reluctant to do so unless I know there will be a performance gain. That said, if you have some insight into the header problem I'm certainly happy to put the work in.

No sensitive information here; it's all publicly accessible except for the instructions at the top and the macro itself. Thanks for looking out though.

And lastly--thanks again; its clear you spent a lot of time on this and I appreciate that a great deal.

Dan

Frosty
06-07-2012, 05:51 AM
There will be a significant performance increase if you go with the find object. Orders of magnitude. Finding the headers can be a little tricky, but mostly because it needs to be done in a way you will feel comfortable modifying. This will involve a little learning on your part on how to work with ranges.

If you're interested in that, then we can proceed. Otherwise, I would recommend getting rid of the line by line, and readjusting your function to use two for each loops... The paragraphs collection and the sentences collection.

Either way, you're looking at a re-write. In both re-writes you will get a performance increase, but without the find object, I think it may be marginal. And the sentences collection has issues, so a rewrite based on the sentences collection paints you into a different corner.

Have you tried my code and understood it? Forget multi-dimensional arrays and going through the document in order. If you can work to understand what's going on in the code with the collection, finding the headers will be relatively easy. Promise.

Frosty
06-07-2012, 10:03 AM
This doesn't totally solve the problem, as we need to figure out a better way of getting the header info than your current process, but adding the following line of code
oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
just before you add the oInfo item to the collection, combined with the following function will help you on the way... this at least gets all the info in a faster way, but is still far from optimized (since it still uses your HeadingOutlineTitle function)

Function fReturnHeadingBefore(rngWhere As Range) As String
Dim oPara As Paragraph
Dim rngSearch As Range
Dim i As Integer
Dim sRet As String

Set rngSearch = rngWhere.Duplicate
rngSearch.Start = rngSearch.Parent.Content.Start
For i = rngSearch.Paragraphs.Count To 1 Step -1
Set oPara = rngSearch.Paragraphs(i)
sRet = HeaderOutlineTitle(oPara.Range.text)
If sRet <> "" Then
fReturnHeadingBefore = sRet
Exit For
End If
Next
End Function

Is this making sense, thus far?

dipique
06-07-2012, 11:59 AM
Frosty,

Thanks again for all your help. I'm in the final stages of rewriting the macro based on your suggestions. Once I'm done I'll test it and post the results as well as the new code.

Dan

Frosty
06-07-2012, 01:57 PM
Sure thing... one last thing to consider, although I'm curious to see your development process.

You are essentially trying to do a pattern matching function, and limit yourself to acceptable patterns (in terms of finding the headings previous to a found range containing "shall" and "will").

You should investigate using a wildcard search, rather than string pattern analysis. In the same way that you can build a collection of "found" sentences, you can also build a collection of successful wildcard searches, and then compare the one that is closest.

In fact, it may be better to simply build that collection as well once for the document, and then compare the start values of that with the start values of your sentences. I don't have any demo code for you at the moment, but wildcard searches can be *very* powerful, and remove the need for any actual string comparisons (which can be memory intensive, especially when you get to a large document and are going through it in textual chunks-- whether line by line, or sentence by sentence).

This link may help.... I'm toying around with a little demo code at the moment...
http://word.mvps.org/faqs/General/UsingWildcards.htm

Frosty
06-07-2012, 02:56 PM
Well, I couldn't resist. This seems to work much better on your demo document. The general concept is...
1. Build a collection of found sentences, based on your search terms (currently "shall" and "will")
2. Build a collection of found Headings, based on your existing string functionality analyzing the text of each paragraph in the document which has a text length of longer than 1 character (i.e., don't bother with empty paragraph marks, section breaks, etc)
3. cycle through the collection of found sentences, adding the appropriate heading text, by comparing the start ranges
4. Spit the results into a Word table (I'll leave getting it into Excel to you).

I left out the idea of wildcard searching, because a) I always have to wrap my mind around the wildcard searching process-- there are others here who are much faster and familiar with that and b) it seemed pretty fast already.

Running this on your demo document completes the operation in about 10 seconds. It will be a bit slower when you have to dump the data into Excel (since it takes time to open the excel app, generate a new worksheet, etc), but I think you can probably work out how to deal with the collection based on the creation of the table.

So, in summary... this requires the following:
1. A class called myInfo, which contains the following

Option Explicit
'for the actual found range
Public InfoRange As Range
'the heading string of the found range
Public sHeading As String
'this can be used to re-order in the logical order
Public lKeyWordStart As Long
Public sKeyWordFound As String

'return the text, but remove any empty paragraph marks (including multiple paragraph marks)
Public Property Get InfoText() As String
If Not InfoRange Is Nothing Then
InfoText = Replace(InfoRange.text, vbCr, "")
End If
End Property
2. A second class called myHeading, which contains the following

Option Explicit

Public sHeadingText As String
Public rngHeadingPara As Range
Public lStart As Long
3. The following code in a regular module (I've included your original code, as well as a couple of functions I had strictly for testing from the immediate window, although I think you only need the HeaderOutlineTitle function.

Option Explicit
Dim sKWArray() As String
Sub Demo()
Dim oInfo As myInfo
Dim rngSearch As Range
Dim sKeyWords As String
Dim sKeyWord As String
Dim i As Integer
Dim colFoundInfo As Collection
Dim colHeaders As Collection

'get our keywords
sKeyWords = "shall|will"
'set up our collection of info
Set colFoundInfo = New Collection

'search for each of our keywords
For i = 0 To UBound(Split(sKeyWords, "|"))
'reset the range for each search term
Set rngSearch = ActiveDocument.Content
sKeyWord = Split(sKeyWords, "|")(i)
'set up the find, not matching case
With rngSearch.Find
.MatchCase = False
'get the keyword to search
.text = sKeyWord
'while we find our search term
Do Until .Execute = False
'create a new object
Set oInfo = New myInfo
'put the relevant data into it
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = sKeyWord

'oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
'here we would add in functionality to grab the header info for this object
'(or could put into the actual class-- depends on your preference)
'and then add the objectto the collection
colFoundInfo.Add oInfo
Application.StatusBar = "Added heading: " & oInfo.sHeading
Loop
End With
Next
'now, get the heading texts collection
Set colHeaders = fGetHeadingCollection
For Each oInfo In colFoundInfo
oInfo.sHeading = fGetClosestHeadingText(oInfo.lKeyWordStart, colHeaders)
Next

'output to a word table, just for the purposes of this demo
DemoOutput colFoundInfo
End Sub
Public Function fGetClosestHeadingText(lBefore As Long, colHeadings As Collection) As String
Dim oHeading As myHeading
'Dim colHeadings As Collection
Dim i As Integer
Dim sRet As String

'Set colHeadings = fGetHeadingCollection
'if we're before any headings, then return an empty string before we loop through
If lBefore < colHeadings(1).lStart Then
Exit Function
End If
'cycle through all the headings, finding the one with the closest item
'start at 2, since we always refer to the previous one
For i = 2 To colHeadings.Count
Set oHeading = colHeadings(i)
If lBefore < oHeading.rngHeadingPara.Start Then
sRet = colHeadings(i - 1).sHeadingText
Exit For
End If
Next
'it's the last heading
If sRet = "" Then
sRet = colHeadings(colHeadings.Count).sHeadingText
End If
fGetClosestHeadingText = sRet
End Function
'build the collection of headings
Public Function fGetHeadingCollection() As Collection
Dim oPara As Paragraph
Dim oHeading As myHeading
Dim sRet As String
Dim colRet As Collection

Set colRet = New Collection
For Each oPara In ActiveDocument.Paragraphs
'don't bother with blank paragraphs or heading rows
If Len(oPara.Range.text) > 1 Then
sRet = HeaderOutlineTitle(oPara.Range.text)
If Len(sRet) > 0 Then
Set oHeading = New myHeading
oHeading.sHeadingText = sRet
oHeading.lStart = oPara.Range.Start
Set oHeading.rngHeadingPara = oPara.Range.Duplicate
colRet.Add oHeading
End If
End If
Next
Set fGetHeadingCollection = colRet
End Function
'just dump it into a table at the moment
Sub DemoOutput(colInfo As Collection)
Dim oTable As Table
Dim oDoc As Document
Dim i As Integer
Dim oInfo As myInfo

Set oDoc = Documents.Add
Set oTable = oDoc.Tables.Add(oDoc.Content, colInfo.Count + 1, 3)
With oTable
.Cell(1, 1).Range.text = "Position"
.Cell(1, 2).Range.text = "Header"
.Cell(1, 3).Range.text = "Sentence"
For i = 1 To colInfo.Count
Set oInfo = colInfo(i)
.Cell(i + 1, 1).Range.text = oInfo.lKeyWordStart
.Cell(i + 1, 2).Range.text = oInfo.sHeading
.Cell(i + 1, 3).Range.text = oInfo.InfoText
Next

'quick and dirty sort, delete and autofit
.Sort excludeHeader:=True, FieldNumber:="Column 1", SortFieldType:=wdSortFieldNumeric, _
SortOrder:=wdSortOrderAscending
.Columns(1).Delete
.AutoFitBehavior (wdAutoFitContent)
End With

End Sub
Sub SelectTheNextShallSentence()
Dim rngSearch As Range
Dim oInfo As myInfo

Set rngSearch = ActiveDocument.Content
rngSearch.Start = Selection.Start
With rngSearch.Find
.MatchCase = False
.text = "shall"
If .Execute Then
Set oInfo = New myInfo
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = "shall"
oInfo.InfoRange.Select
End If
End With
End Sub
Sub DisplayTheHeadingOfThisSentence()
MsgBox fReturnHeadingBefore(Selection.Range)
End Sub
Function fReturnHeadingBefore(rngWhere As Range) As String
Dim oPara As Paragraph
Dim rngSearch As Range
Dim i As Integer
Dim sRet As String

Set rngSearch = rngWhere.Duplicate
rngSearch.Start = rngSearch.Parent.Content.Start
For i = rngSearch.Paragraphs.Count To 1 Step -1
Set oPara = rngSearch.Paragraphs(i)
sRet = HeaderOutlineTitle(oPara.Range.text)
If sRet <> "" Then
fReturnHeadingBefore = sRet
Exit For
End If
Next
End Function
'See if we have a heading-style number. Returns the heading, or blank if not a heading.
Public Function HeaderOutlineTitle(line As String) As String
'Set up variables
Dim CharCounter As Integer
Dim NumberCounter As Integer
Dim tCh As String
Dim y As Integer

CharCounter = 0
NumberCounter = 0

'For the 4 character headings, this skips all the unncessary logic
If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
HeaderOutlineTitle = Left(line, 4)
GoTo SkipToEnd
End If

'Start looping through the line character by character
For y = 1 To Len(line)
tCh = Mid(line, y, 1)

'This allows the letter "H" to be the first character
If y = 1 And tCh = "H" Then
GoTo SkipNumberCheck
End If

'If it's blank or a number or a period, exit
If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then
Exit For
End If

'If this is a number, keep track, otherwise reset the counter
NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)

'If we've had more than 2 numbers in a row, we're probably not seeing a heading
If NumberCounter > 2 Then
CharCounter = 0
Exit For
End If

SkipNumberCheck: 'We can also skip all the other logic since it's the first character
CharCounter = CharCounter + 1
Next y

'If this was a outline heading, list that information in column 4
'(AND As long as the last character before the space was a number)
If CharCounter <> 0 Then
If IsNumeric(Mid(line, CharCounter, 1)) Then
HeaderOutlineTitle = Left(line, CharCounter)
Else
HeaderOutlineTitle = Left(line, CharCounter - 1)
End If
Else
HeaderOutlineTitle = ""
End If

SkipToEnd:
End Function
Function IsDigit(char As String) As Boolean
'Uses [Asc] function to get an ASCII code of a character.
Dim iAsciiCode As Integer
iAsciiCode = Asc(char)

'The ASCII codes for the digits range from 48 to 57. If an ASCII
'code of the character is within this range, function returns true
IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
End Function

Function IsDigits(text As String) As Boolean
'If the text is 0 length, it's not a digit-only string
If Len(text) = 0 Then
IsDigits = False
Else
Dim x As Integer
'the default is true. We'll set to false if...
IsDigits = True
For x = 1 To Len(text)
'...we find something that's not a digit
If Not IsDigit(Mid(text, x, 1)) Then
IsDigits = False
Exit For
End If
Next x
End If
End Function
Function FileThere(FileName As String) As Boolean
FileThere = (Dir(FileName) > "")
End Function

'Queries the user for a file name where we'll save our information
Function GetXLSFileName() As String
'open excel document
Dim fd As FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
Dim vSelFile As Variant
With fd
.AllowMultiSelect = False
.Filters.Add "Excel 2003", "*.xls", 1
.Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
.Title = "Choose your destination spreadsheet file."

'Show the dialog box
If .Show = -1 Then
'Set the file string to the selected file
GetXLSFileName = .SelectedItems(1)
'If the user presses Cancel...
Else
MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
GetXLSFileName = "C:\ShallStatements.xls"
End If
End With
End Function

'Interface function that gathers keywords to search for in the document.
Function SetUpKeywords() As Boolean
ReDim sKWArray(1 To 2) As String
sKWArray(1) = "shall"
sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
SetUpKeywords = True
End Function

'Finds the first searched keyword in a line and returns the position
Function HasKeywords(line As String) As Integer
HasKeywords = 0
Dim x As Integer
For x = 1 To UBound(sKWArray)
HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)

'Exit the loop if we find something
If HasKeywords <> 0 Then Exit For
Next x
End Function




'Main function that extracts requirements from the document
Sub ExtractRequirements()
Dim x As Integer, y As Integer, tCh As Integer
Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.

'Create arrays to hold data
Dim sHeaders() As String
Dim sStatements() As String
ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.

'Record Start Time
sHeaders(1) = "Start Time"
sStatements(1) = DateTime.Now

'Write headers
sHeaders(intRowCount - 1) = "Header"
sStatements(intRowCount - 1) = "Requirement"

'Figure out how many lines are in the document
Dim NumOfLines As Integer
NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)

'Record time for setup to be completed
sHeaders(2) = "Lines Computed"
sStatements(2) = DateTime.Now

'Set up requirement words (exit sub on failure)
If Not SetUpKeywords Then Exit Sub

'Turn off screen updating to improve performance
If MsgBox("Would you like to turn off screen updating to improve performance? " & _
"Please note that this may cause Word to appear unresponsive for quite " & _
"some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
Application.ScreenUpdating = False
End If

'Record time for information to be collected
sHeaders(3) = "Params Collected"
sStatements(3) = DateTime.Now

'Loop through every line and look for headers
Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
header = "None" 'This is the header value used before we've found a legitimate header
For x = 1 To NumOfLines
'Go to the correct line
Selection.GoTo wdGoToLine, wdGoToAbsolute, x

'Expand to select the entire line, but skip the logic if it's blank
Selection.Expand wdLine
LineText = Trim(Selection.text)
If LineText <> "" Then
'see if it's a header, and if so, set the header
tmpHeaderRslt = HeaderOutlineTitle(LineText)
If tmpHeaderRslt <> "" Then header = tmpHeaderRslt

'See if the line has the word "shall" or "must" in it
ShallPos = HasKeywords(LineText)

'If so, write it to the spreadsheet
If ShallPos > 0 Then
'Move to the correct character and expand to the full sentence
Selection.Collapse wdCollapseStart
Selection.Move wdCharacter, ShallPos
Selection.Expand wdSentence
SentenceText = Selection.text

'Write the header to the spreadsheet
sHeaders(intRowCount) = header
sStatements(intRowCount) = SentenceText

'Move the selection forward to the proper line (avoids replication)
Selection.Collapse wdCollapseEnd
If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1

'Incremement the row counter
intRowCount = intRowCount + 1
End If
End If
Next x

'Record time for statements to be collected
sHeaders(4) = "Statements Collected"
sStatements(4) = DateTime.Now

'Open excel file and set up for receiving data
Dim xlsFileString As String
xlsFileString = GetXLSFileName
Dim appExcel As Object
Dim objSheet As Object
Set appExcel = CreateObject("Excel.Application")
Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")

'Write data to the spreadsheet
sHeaders(5) = "Excel write begin"
sStatements(5) = DateTime.Now
For x = 1 To intRowCount
objSheet.Cells(x, 1) = sHeaders(x)
objSheet.Cells(x, 2) = sStatements(x)
Next x

'Write end time
objSheet.Cells(6, 1) = "Excel write ends"
objSheet.Cells(6, 2) = DateTime.Now

'Clean up and open for viewing.
appExcel.workbooks(1).Close True
appExcel.Quit
Set objSheet = Nothing
Set appExcel = Nothing

'Open the new spreadsheet for viewing
If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
End If
End Sub
As a note: Your use of the IsDigit or IsDigits functions could be substituted by simply using the built in IsNumeric function (which accomplishes the same thing those two functions do) within the HeaderOutlineTitle function. But I left that alone, as that is your primary heading pattern analysis which you are the most comfortable with. And it also doesn't appear to adversely impact the speed that much.

Since this works in about 10 seconds on your 30 page document, I would imagine you shouldn't have this run much longer than a minute on any longer document.

As a general note-- *always* use Option Explicit at the top of your modules. It saves you from having VBA try to determine the type of a variable on the fly. You didn't have a Dim y as Integer in your HeaderOutlineTitle function, which means it will be a Variant type-- that can be costly to the memory process.

And lastly-- please note that since I went away from your "line by line" approach, and instead use the Sentences collection, there will be two major problems that you will need to handle.
1) Sentences collection is really tricky. It truncates at times you wouldn't expect, and doesn't truncate at other times. Dr. Rev. Martin Luther King Jr. died in the 20th century. That "sentence" looks to Word like it is 4 sentences.
2) The sentences collection also continues on using any white space. So in some cases, the text returned will include a lot of empty paragraphs. So I've simply placed in the class a quick and dirty method of removing any paragraph marks from the myInfo class. There are numerous ways to approach this, but this is one way.

Hope this helps. Please post the code you got to, if you're interested in any suggestions for your approach. The main takeaway from this approach is that the use of collections is far superior to your inclination to use arrays. Not only can you put custom objects (the classes) into collections, being able to use the .Range.Start property as your main algorithm for document location speeds things up considerably, because you don't have to a) go through the document line by line and b) Longs are much less memory intensive than a string. When you magnify that by string arrays with 1000 elements, you can definitely slow yourself down.

Cheers,
- Jason aka Frosty

Frosty
06-07-2012, 03:22 PM
Returned your sample document, since you said the info is public. Contains my code, obviously. Open the document and run the Demo subroutine to test.

dipique
06-11-2012, 04:59 AM
All right, thanks for your patience. I took off Friday and the weekend but I'm back and ready to roll.

I've uploaded my new sample (creatively: "Newsample.docm"). I've incorporated your ideas except for the wildcard matching, which I'll give a try after this. It's running pretty quickly right now in the sample document, and in longer documents I'm not seeing the same slowdown at the end. Still, it takes a couple hours to process really long documents (100+) page, which is better than before but still could use some work.

I'll take a look at the new suggestions and see what I can do. Thanks again!

Dan

dipique
06-11-2012, 05:06 AM
Looked in more detail at your new code. I'm sorry I didn't get back to you faster! I think my new code implemented the main performance gains, and it runs in 9 seconds (I implemented some performance testing functionality as well) so I think we're about on par where performance is concerned. Let me know if you think creating the two collections first and then running the comparisons offers a substantial benefit to what I've implemented in my most recent document.

I've generally simplified the code pretty significantly; the main complexity at this point is in the header location routine, so I'll look into wildcards and see what I come up with. Hopefully it's something like regular expressions.

Wish me luck!

Dan

Frosty
06-11-2012, 05:43 AM
Haven't downloaded your sample yet, but how does code which runs in 10 seconds on 30page sample doc turn into a couple of hours on a 100+ page doc? Are you sure this is a representative sample?

dipique
06-11-2012, 06:33 AM
A valid question. The problem document is 172 pages. It's too big for the forum (just under 4MB) but it can be downloaded here:

https://docs.google.com/file/d/0B06duhJ9c0sqS0NXcmNDa0NKS0U/edit?pli=1

Just go to the link and go to file->download. It doesn't seem qualitatively different than the sample document, but maybe I'm missing something.

In other news: I'm working on an alternate means of identifying headings right now... I just finished with the first implementation, but it's actually slower than the original, believe it or not. I'll keep on workin'.

Dan

Frosty
06-11-2012, 06:41 AM
Did you try my code on a couple of different chunks? Did the two collection approach seem to operate the same as yours not just on the sample, but on a real doc? I won't be able to look at this for a few hours.

For the purposes of testing, putting the info into excel is meaningless, so you should be able to run the code I had on your stuff to test as well.

Bottlenecks can show up in the strangest places. One of the big problems I had with your original code wasn't what I thought it would be-- but rather you using the move method on the selection. When it encountered section breaks, it would hang. I think you may be dealing with document corruption to some degree (not all that strange when the source is scanned PDFs)

Will look over your stuff in a couple of hours. In the meantime, see if my code performs any differently for you on a larger sample size

dipique
06-11-2012, 06:53 AM
Holy...

Well, I tried your code and dear God is it fast. The biggest problem is that it's not matching the capital-letter-style headings... I'll see if I can patch that up.

I'll start working on adapting my code to use this method. More soon.

Dan

dipique
06-11-2012, 06:57 AM
(The capital-letter-style headings issue wasn't your fault, you're just using the old version of my header-finding code... I think once I drop in the updated code it will solve the issue.)

Frosty
06-11-2012, 09:07 AM
That's why I left the header identification algorithm alone, since I figured that was the thing most likely for you to change. The only thing I can add is that you really should get rid of your IsDigit/IsDigits functionality. It's redudant to IsNumeric and may very well be a bottleneck. If nothing else, it's code to wade through which you don't need to wade through.

At this point, the biggest approach difference I see between your NewSample code is that you were using the paradigm of finding the heading before, rather than the additional collection of headings and then comparing the nearest.

I think the 2nd collection of headings is going to be better, since it means you won't be back-tracking so much. Why go through the same paragraphs multiple times, only to find out that they aren't useful? Especially on a longer document. If you're going to need to loop through each paragraph in order to analyze different heading patterns, you might as well only do it once. Also, give yourself some short-circuits (i.e., if the Len of the paragraph text is <=2 or something, don't do anything). The most likely spot for corruption in a document is section breaks... might as well skip thos.

I think building the collection of headings will still be the bottleneck (since the primary collection of sentences containing search terms uses the Find object) -- but you can further refine if it seems slow. But at the very least you won't have to change any of the methodology of your main routine-- you simply have to try to build the headings collection faster.

Apart from that, you can look at various techniques to optimizing string functions (the statement If Len(myString) = 0 is technically faster than If myString = "" -- although I prefer the later statement for readability).

VBA contains a LIKE operator, although it is limited... and the wild card searching can be very very powerful. But this structure (two collections, one with the sentences, one with the headings) is very scalable. If you then later need to add to it, it's not that difficult (search terms, criteria for the headers).

Frosty
06-11-2012, 09:52 AM
So now I'm trying to examine your HeaderOutlineTitle function... and there are a few bits I don't get, and a few bits I think you can improve the logic of.

So a couple of comments:
1. Don't use single line If...Then statements. There is zero optimization to this, and it essentially means you can't effectively break point it.
2. IIF constructs are not appreciably faster (to my knowledge), and they are much more confusing to read and troubleshoot.
3. Complex And/Or operations are likewise difficult to troubleshoot... for example.

'If it's blank or a number or a period, exit
If tCH= " " Or (Not IsNumeric(tCh) And tCH <> ".") Then
Exit For
End If
That logic test doesn't make a lick of sense to me. Especially with the inaccurate comment. What you are literally testing there is
A. If it is a blank space OR
B. It is NOT a number AND it is NOT a period.

So, basically... you have some valid characters to allow you to keep processing, right? This is an excellent time to make use of Select Case, for multiple reasons. It's easier to modify, and it's easier to understand. For example...

Select Case tCH
Case 0 to 9, "."
'acceptable, keep processing
Case Else
Exit For
End Select
You don't have to use the 0 to 9 construct, you could do "0", "1", "2", etc... but Select Case is often a good way to do an acceptable list. It's certainly easier to read and modify than a complicated If statement utilizing AND/OR operators.

4. IIF... blech. Easier to read this way...

'If this is a number, keep track, otherwise reset the counter
If IsNumeric(tCH) Then
NumberCounter = NumberCounter + 1
Else
NumberCounter = 0
End If

I would be surprised if the above simplifications changed the over all processing time.

As a benchmark, running my code on your 4 meg document took around 2 minutes, with the majority of that time being putting stuff into the word table (which may very well be faster for you, since excel doesn't have to re-paginate periodically when large amounts of data are put into a spreadsheet.

I think there are ways to simplify what you want out of the HeaderOutlineTitle functionality even without going into Wildcard searching.

But if you want to investigate that, it's probably worth of its own post -- something along the lines of "How can I use wildcard searching to match..."
And then list out all the patterns that you want to find, coupled with the kinds of patterns you don't want to match, and perhaps one of our resident expert wildcard searchers will weigh in.

dipique
06-11-2012, 11:44 AM
Attached is the updated version of the macro. It's now running the 200 page document in under 30 seconds, so I'm happy performance-wise. I've also added some additional features and planned several more.

With regard to the code clean-up... I understand that some of my code is considered poor practice. Certainly I could stand to use better variable/function names. But some of my practices are because it's more readable for me--and I'm usually the person who needs to read my code, since I'm not a developer. I'm much better off getting basic logic functions on one line so that I can read comments instead of code (I did correct the comment on that one obscure line so it wasn't inaccurate).

That said, I understand that for people other than me it can be hard to decipher, and I'll try to submit code that's more easily readable from now. I appreciate the tips.

Lastly: The IsDigit/s function(s) are subtly different from IsNumeric: it doesn't count parenthesis. This eliminates hundreds of false positives in the header routine.

Thank you (so much!) for all your work and analysis and advice. I can't tell you how much I appreciate it, and I've learned a great deal in the bargain.

Thanks again,

Dan

dipique
06-11-2012, 11:46 AM
PS. I love the use of Select Case you suggested. I'm used to the C# version which is somewhat less... flexible. I'll be sure to use it in the future.

Frosty
06-11-2012, 01:40 PM
Ahh, I see the value of your IsDigits/IsDigit functions then. I would suggest commenting that specific functionality, and how it differs... and I think you can simplify(?) to a single function, and still retain the short-circuiting you used in IsDigits. In fact, you can probably gain marginal improvement by further filtering with IsNumeric (no need to 10 characters and find the 11th is a letter if the whole string doesn't first pass the IsNumeric test). I realize you would probably change the structure to start with a single statement a la
If Len(sText) > 0 OR IsNumeric(sText) = True Then
But I leave my flavor because it's easier (for me) to add any additional branches or troubleshoot any flaws in my conceit. For your purposes, there is the possibility of a "bad" return if something was font-formatted poorly but appeared as a digit (which can happen in scanned documents-- the ASC could be one character, but the font formatting could be in Symbol font or wingdings-- and appear as a digit-- theoretically this could happen, anyway).

'returns true ONLY if all characters passed by the string are actual digits
'short-circuits logic test if 0-length string or fails initial IsNumeric test
Public Function IsOnlyDigits(sText As String) As Boolean
Dim i As Integer
Dim bRet As Boolean

'if an empty string, return false
If Len(sText) = 0 Then
bRet = False
'if it isn't numeric, then false
ElseIf IsNumeric(sText) = False Then
bRet = False

'if we've gotten here, make sure no IsNumeric(sText) = True statements
'contain any special characters, such as $123.45, or (123.45)
Else
For i = 1 To Len(sText)
Select Case Mid(sText, i, 1)
Case 0 To 9
'continue on
bRet = True
Case Else
bRet = False
GoTo l_exit
End Select
Next
End If
l_exit:
IsOnlyDigits = bRet
End Function

I understand your points on the single line structure, and don't disagree. Every coder has different styles (even if those differences are subtle ones). My comment was made in case you had simply copied and pasted those lines of code without fully understanding them... but if your main purpose is to make the code more readable for you-- then by all means, stick to that. Readability is very important. But I would humbly suggest that as you get used to actual If...End If blocks of code, rather than single line If statements, it will end up being as easy to read as your single line statements, with the addition of being easier to debug when the inevitable happens: something breaks ;)

You're clearly an experienced coder, based on the way you organize and structure your code. I would simply comment the functionality more accurately, so that others will understand what you're doing.

A couple of other notes...
1. I don't think you need to turn off screen updating, at this point. And in any event, you didn't actually present a choice-- just an OK button.

2. You should note that the sentences collection can break in unexpected ways when dealing with the first sentence of a range (which doesn't necessarily mean the first sentence of a paragraph).
The first link sums up a little bit of the problem with some simple scenarios.
http://www.vbaexpress.com/forum/archive/index.php/t-34386.html
The gmaxey link is something Greg and I worked on, and is probably far more than you need at the moment (there are some simple solutions in the first link)
http://gregmaxey.mvps.org/word_tip_pages/deduced_sentences.html
In summary, you may not want to use
Set riInfo.RangeInfo = rngSearch.Sentences(1)
but rather
dim rngTemp As Range
Set rngTemp = rngSearch.Sentences(1)
rngTemp.Start = rngSearch.Start

or something along those lines. Basically, the first sentence may end up as not the first sentence if there are special characters involved near the "sentence ender" -- which can easily be the case if you're dealing with scanned documents.

3. I would break your ExtractRequirements routine apart at least a bit. You've got a "read data" part and a "write data" part-- that could be separated. In general, if you're using Dim statements somewhere other than the top of a procedure, that can be a guide for modularizing further. But this is just shifting deck-chairs somewhat. You've commented very well. It's just a long routine with a lot of different things going on.

4. This could be an alternate way to getting your search terms... although I think ultimately you might want to have a userform for this. MsgBox and InputBox are ultimately somewhat limiting... but at least this allows the user to see duplicates rather than guess and have to do your "no duplicates" methodology? You could also utilize the KeyField parameter in a collection (rather than an array), and trap for the error. This is just more collection tricks, but, assuming myCol was a collection, you will raise an error on the second line of the following two lines
myCol.Add "Hello", "Hello"
myCol.Add "Hello", "Hello"
whereas
myCol.Add "Hello"
myCol.Add "Hello"
would not raise an error.

Adding keyfields to collections allows you to reference by the index field or the keyword... so...
myCol(1) is the only way to return "Hello" text for the second collection, but myCol("Hello") would work for a collection you also used the keyfield. I'm not sure where this would help your current code, but collections are a very powerful part of the VBA object model. I'm constantly finding new ways to use them.

Public Function fGetSearchTerms() As Variant
Dim sRet As String
Dim sDefault As String
Dim aryTerms() As String

'start with an extra pipe
sDefault = "shall|must|"
Do
sRet = InputBox("The following terms will be searched." & vbCr & _
"To add another, press the right arrow key and type" & vbCr & vbCr & _
"Search terms are *not* case-sensitive" & vbCr & _
"Press ESC or Cancel to stop adding terms", _
"Add Search Keywords?", _
sDefault)
If sRet = "" Then
Exit Do
ElseIf Right(sRet, 2) = "||" Then
Exit Do
ElseIf Right(sRet, 1) = "|" Then
'allow them to continue, since it might have been a mis-click to hit okay
Else
sDefault = sRet & "|"
End If
Loop Until sRet = ""
'remove any trailing pipe characters
Do Until Right(sDefault, 1) <> "|"
sDefault = Left(sDefault, Len(sDefault) - 1)
Loop
'get the array of search terms, using split
aryTerms = Split(sDefault, "|")

'and return it
fGetSearchTerms = aryTerms
End Function
5. You don't need to set default values at the top of a routine when you've defined the types. So...

dim CharCounter As Integer
CharCounter = 0

is redundant. It's dimensioned as an integer... in the absence of any changes, it will be 0.

6. And one last point-- VBA defaults arrays to being 0-based. However, you can use Option Base 1 at the top of your modules to default to 1 based arrays (so you don't have to Dim sKWArray(1 to 2) As String... so with Option Base 1 on, you only need Dim sKWArray(2) As String. However, the Split function (extremely useful with simple arrays-- displayed above), remains 0-based, even with Option Base 1 on. And don't forget to use Option Explicit. You don't currently have it on in your SharedFunctions module.

Good luck, Dan. Nice working with you!

- Jason aka Frosty

dipique
06-12-2012, 04:58 AM
Thanks again, and thank you for the parting comments as well. My thoughts:

1. What happened to my vbYesNo parameter! I added it back in, then thought better of it and removed the line. You're right, I don't really need it for performance anyway, and it's just one more cryptic prompt to click through for the user.

2. I just downloaded the Deduced Sentences template you worked on... that is wild. I'm having some trouble getting it to work in 64-bit Word, but I'll fiddle. Besides, I think the problems I'm having are with the ribbon part, which I don't need anyway. Thanks for pointing me in that direction; more accurate sentence-reading is definitely the next step.

4. I've been mulling over whether I should just bite the bullet and re-write this as a real application in C#. But now that the performance is in a good place, I probably won't, and I'll probably leave the UI elements alone. Honestly, allowing addition/subtraction of search terms is only implemented because I hate the idea of hard-coding, but if I removed it I don't think the users would miss it. I'm curious though... why do you have that function declared as a variant rather than a string array?

5. You're seeing my ignorance/inexperience with VBA. In C# EVERYTHING needs to be declared and set, and I don't know what all the default values are (i.e. null, empty, "", 0, etc.) so sometimes I set the value just to be safe. I'll do some research and remove the useless declarations.

6. This is another C# takeaway, where everything is 0 based. I declared LB and UB partly so I remember what the lower bound is :p. I've heard of the Option Base 0/1 but I'd be worried that I'd forget about it. Thanks for noting that about Split, though... that's a little crazy. And I've added Option Explicit now to all my modules. I agree, it's a must-have, especially for someone used to strict typing.

Thanks again for all your help. It was a real privilege working on this with you.

Dan

Frosty
06-12-2012, 09:36 AM
1. Probably went away when you put your message into a constant ;)
2. Yes, the UI isn't something you'd want... but the class itself may be useful. There are samples of how to use the class (forgetting the ribbon stuff) within other modules. But it wasn't necessarily constructed with the kind of use you might put to it, so I'm not sure if it would be a help or a hindrance at this point.
3. What happened to three? ;)
4. In VBA, you can't have arrays as returns of functions. You have to use a Variant. Doesn't make sense to me either. You can pass arrays as parameters (and since you can pass them, you can return them into a parameter). Or you can have a public/module variable that a function populates. But you can't actually have a function return the array. I think I knew why at one point, but I no longer remember.

Sub HelloWorld()
Dim i As Integer
Dim sMsg As String
Dim sRet As String
Dim aryMy() As String

aryMy = GetMyArray
For i = 0 To UBound(aryMy)
sRet = aryMy(i)
If sRet = "" Then
Exit For
Else
sMsg = sMsg & " " & sRet
End If
Next
MsgBox sMsg
End Sub
'returned from the function
Public Function GetMyArray() As Variant
Dim aryMy(10) As String
'populate the parameter
GetMyArray2 aryMy
'return it
GetMyArray = aryMy
End Function
'returned from the parameter
Public Function GetMyArray2(aryRet() As String) As Boolean
aryRet(0) = "Hello"
aryRet(1) = "World"
GetMyArray2 = True
End Function
5. It won't be much research... all data types start with a value. And they are basically what you'd expect:
boolean is false
int/long/single/dbl is 0
string is ""
Date is 0, but really #12:00:00 AM#
Variant is Empty
Object is Nothing

Probably the stupidest thing they did in VBA was reverse the convention of what the boolean value is. I believe in other programming languages that 0 is true, and all other values are false, right? In VBA it is the opposite.

dipique
06-12-2012, 11:41 AM
3. I didn't want to fall into the trip of answering something when I didn't have anything in particular to say just to make a complete numbered list. That way lies long boring conversations. :)
4. Huh. That's annoying.
5. Heh, that's pretty much all the research I needed. Thanks! I swear, dealing with uninstantiated variables and all the myriad forms of null that exist is the most annoying part of programming, particularly when moving between languages. I'm doing a project in Access (my first) right now and I had the most aggravating experience testing text boxes for being either null or blank. I mean, seriously? I understand why, but it's things like that that make me hate VBA. It goes halfway toward trading performance for ease/rapid development, but you're never sure which half! If I could script office in C#, I would.

Your last comment reminded me: using IIF is actually a C# habit as well. I use ternary operators all the time--they make so much more sense to me than 5 lines of If/Then code--and I was looking for the VBA equivalent.

More to the point: in C# (I'm a mere hobbyist in C#, but I certainly feel more comfortable with it than VB, which is new to me within the last few months), Booleans cannot be converted to integers. If I wanted a conversion, I would have to write something like:

bool b = false;
int i = b ? 1 : 0;
\\i now equals 0

But there is no implicit conversion. There is an explicit conversion using the Convert.ToInt32() function, and in that case false=0.

I have a feeling you have a good reason for your distaste of VBA's choice in that regard, but personally I prefer it. For example, I can multiple two arrays (one boolean and one integer) and sum the resulting array to get the sum of, for example, all the items chosen from check boxes. Or all the items that meet some criteria.

The only reason I'd want 0 to be true is if variables weren't type-defined, in which case I would want to be able to explicitly test for true rather than false.

But that's just me.

Dan

Frosty
06-12-2012, 12:17 PM
Hmm, this is one of those times when I get to revisit something I always accepted as truth, but doesn't appear to be true. One of my mentors years ago told me the boolean type in VBA was reversed from previous programming languages. So I had always thought that was true. But it may not be... so disregard that comment.

My main programming experience has been in VBA, so I never had that assumption challenged until now. So thanks!

As for implicit/explicit conversions... VBA does a lot implicitly, although there are a lot of explicit conversion functions in VBA... CBool, CInt, etc...
CBool(0) = False
CBool(-1) = True
CBool(1231245) = True
but...
CInt(True) = -1
Cint(False) = 0

But the conversion functions aren't blackboxed, so
CBool(myObject)
Will return a type mismatch error (which is why I have a bunch of comparable functions which return the default value on any error, i.e.,)

Function fCInt(val) As Integer
On Error Goto l_err
fCInt = CInt(val)
l_exit:
Exit Function
l_err:
'blackbox to default value of the type
resume l_exit
End Function
This allows
fCInt(ActiveDocument) to return 0... where CInt(ActiveDocument) will return a Type Mismatch error. Of course, that application doesn't make a lot of sense... but can be very useful if you write the fCStr analog, and pass it a Null value in your Access application. Then you get your empty string, without having to code for the Null state on each test.

VBA will coerce a lot of values of different data types, and do implicit conversions for you a lot of times, but fail to do so when just when you start relying on it. Of course, with experience, you start to avoid the pitfalls instinctively, but I think it's pretty rough when you first start learning.

myIntVar = True
is an acceptable line of code and won't cause an error... which doesn't seem like a big deal, but when applied to something like a Range object (which you can set to a string variable, and get the text of the range implicitly converted to a string -- but only if the range isn't nothing), it can cause mysteries to the beginning vba programmer.

I believe in VBA that booleans are technically integers... the help on Data Type Summary indicates it uses the same amount of memory (2 Bytes).

Couple additional things that may help, if you don't yet know about them: Locals Window/Watch Window/Immediate Pane. I find those three items extremely useful in programming. And it will also allow you to check out existing values of items while you program. You may find these a little more useful than the standard way of learning VBA: recording macros and seeing what gets spit out.

In any event, glad to help push the ball forward for you a bit in learning VBA.

Frosty
06-12-2012, 05:10 PM
Oh, and this is my version of Greg's coding of the sentences class... slightly simpler interface, but (I believe) the core functionality is the same (he and I worked on the class together, after that the majority of it was his baby-- but if you're having issues with the interface, this doesn't have any of that).

Maybe this will help...

dipique
06-19-2012, 10:08 AM
Thanks Frosty, both for the notes and for the class with the simplified UI. It works without causing the problems I was experiencing with the ribbon UI.

My favorite sentence: "Jason is great, but he's not in a list."

Dan

Frosty
06-19-2012, 10:16 AM
Haha. When I used to teach Word classes, the first thing I had students type was "Jason is great" -- it was always good for a laugh and a bit of an icebreaker... then we would start underlining/bolding "Jason" and "great" as an entry into the concepts of font formatting.

So that has stuck with me when I need short sample text in the same way that "Hello World" is the standard first text anyone every programs to display to an end-user. ;)

I haven't looked at the sentence class in 6 months, and have yet to apply it in a real-world application, so there may be some warts to discover. Please chime in if you discover some.

Cheers!

Jason aka Frosty