-
Well, I couldn't resist. This seems to work much better on your demo document. The general concept is...
1. Build a collection of found sentences, based on your search terms (currently "shall" and "will")
2. Build a collection of found Headings, based on your existing string functionality analyzing the text of each paragraph in the document which has a text length of longer than 1 character (i.e., don't bother with empty paragraph marks, section breaks, etc)
3. cycle through the collection of found sentences, adding the appropriate heading text, by comparing the start ranges
4. Spit the results into a Word table (I'll leave getting it into Excel to you).
I left out the idea of wildcard searching, because a) I always have to wrap my mind around the wildcard searching process-- there are others here who are much faster and familiar with that and b) it seemed pretty fast already.
Running this on your demo document completes the operation in about 10 seconds. It will be a bit slower when you have to dump the data into Excel (since it takes time to open the excel app, generate a new worksheet, etc), but I think you can probably work out how to deal with the collection based on the creation of the table.
So, in summary... this requires the following:
1. A class called myInfo, which contains the following
[vba]
Option Explicit
'for the actual found range
Public InfoRange As Range
'the heading string of the found range
Public sHeading As String
'this can be used to re-order in the logical order
Public lKeyWordStart As Long
Public sKeyWordFound As String
'return the text, but remove any empty paragraph marks (including multiple paragraph marks)
Public Property Get InfoText() As String
If Not InfoRange Is Nothing Then
InfoText = Replace(InfoRange.text, vbCr, "")
End If
End Property
[/vba] 2. A second class called myHeading, which contains the following
[vba]
Option Explicit
Public sHeadingText As String
Public rngHeadingPara As Range
Public lStart As Long
[/vba] 3. The following code in a regular module (I've included your original code, as well as a couple of functions I had strictly for testing from the immediate window, although I think you only need the HeaderOutlineTitle function.
[vba]
Option Explicit
Dim sKWArray() As String
Sub Demo()
Dim oInfo As myInfo
Dim rngSearch As Range
Dim sKeyWords As String
Dim sKeyWord As String
Dim i As Integer
Dim colFoundInfo As Collection
Dim colHeaders As Collection
'get our keywords
sKeyWords = "shall|will"
'set up our collection of info
Set colFoundInfo = New Collection
'search for each of our keywords
For i = 0 To UBound(Split(sKeyWords, "|"))
'reset the range for each search term
Set rngSearch = ActiveDocument.Content
sKeyWord = Split(sKeyWords, "|")(i)
'set up the find, not matching case
With rngSearch.Find
.MatchCase = False
'get the keyword to search
.text = sKeyWord
'while we find our search term
Do Until .Execute = False
'create a new object
Set oInfo = New myInfo
'put the relevant data into it
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = sKeyWord
'oInfo.sHeading = fReturnHeadingBefore(oInfo.InfoRange)
'here we would add in functionality to grab the header info for this object
'(or could put into the actual class-- depends on your preference)
'and then add the objectto the collection
colFoundInfo.Add oInfo
Application.StatusBar = "Added heading: " & oInfo.sHeading
Loop
End With
Next
'now, get the heading texts collection
Set colHeaders = fGetHeadingCollection
For Each oInfo In colFoundInfo
oInfo.sHeading = fGetClosestHeadingText(oInfo.lKeyWordStart, colHeaders)
Next
'output to a word table, just for the purposes of this demo
DemoOutput colFoundInfo
End Sub
Public Function fGetClosestHeadingText(lBefore As Long, colHeadings As Collection) As String
Dim oHeading As myHeading
'Dim colHeadings As Collection
Dim i As Integer
Dim sRet As String
'Set colHeadings = fGetHeadingCollection
'if we're before any headings, then return an empty string before we loop through
If lBefore < colHeadings(1).lStart Then
Exit Function
End If
'cycle through all the headings, finding the one with the closest item
'start at 2, since we always refer to the previous one
For i = 2 To colHeadings.Count
Set oHeading = colHeadings(i)
If lBefore < oHeading.rngHeadingPara.Start Then
sRet = colHeadings(i - 1).sHeadingText
Exit For
End If
Next
'it's the last heading
If sRet = "" Then
sRet = colHeadings(colHeadings.Count).sHeadingText
End If
fGetClosestHeadingText = sRet
End Function
'build the collection of headings
Public Function fGetHeadingCollection() As Collection
Dim oPara As Paragraph
Dim oHeading As myHeading
Dim sRet As String
Dim colRet As Collection
Set colRet = New Collection
For Each oPara In ActiveDocument.Paragraphs
'don't bother with blank paragraphs or heading rows
If Len(oPara.Range.text) > 1 Then
sRet = HeaderOutlineTitle(oPara.Range.text)
If Len(sRet) > 0 Then
Set oHeading = New myHeading
oHeading.sHeadingText = sRet
oHeading.lStart = oPara.Range.Start
Set oHeading.rngHeadingPara = oPara.Range.Duplicate
colRet.Add oHeading
End If
End If
Next
Set fGetHeadingCollection = colRet
End Function
'just dump it into a table at the moment
Sub DemoOutput(colInfo As Collection)
Dim oTable As Table
Dim oDoc As Document
Dim i As Integer
Dim oInfo As myInfo
Set oDoc = Documents.Add
Set oTable = oDoc.Tables.Add(oDoc.Content, colInfo.Count + 1, 3)
With oTable
.Cell(1, 1).Range.text = "Position"
.Cell(1, 2).Range.text = "Header"
.Cell(1, 3).Range.text = "Sentence"
For i = 1 To colInfo.Count
Set oInfo = colInfo(i)
.Cell(i + 1, 1).Range.text = oInfo.lKeyWordStart
.Cell(i + 1, 2).Range.text = oInfo.sHeading
.Cell(i + 1, 3).Range.text = oInfo.InfoText
Next
'quick and dirty sort, delete and autofit
.Sort excludeHeader:=True, FieldNumber:="Column 1", SortFieldType:=wdSortFieldNumeric, _
SortOrder:=wdSortOrderAscending
.Columns(1).Delete
.AutoFitBehavior (wdAutoFitContent)
End With
End Sub
Sub SelectTheNextShallSentence()
Dim rngSearch As Range
Dim oInfo As myInfo
Set rngSearch = ActiveDocument.Content
rngSearch.Start = Selection.Start
With rngSearch.Find
.MatchCase = False
.text = "shall"
If .Execute Then
Set oInfo = New myInfo
oInfo.lKeyWordStart = rngSearch.Start
Set oInfo.InfoRange = rngSearch.Sentences(1)
oInfo.sKeyWordFound = "shall"
oInfo.InfoRange.Select
End If
End With
End Sub
Sub DisplayTheHeadingOfThisSentence()
MsgBox fReturnHeadingBefore(Selection.Range)
End Sub
Function fReturnHeadingBefore(rngWhere As Range) As String
Dim oPara As Paragraph
Dim rngSearch As Range
Dim i As Integer
Dim sRet As String
Set rngSearch = rngWhere.Duplicate
rngSearch.Start = rngSearch.Parent.Content.Start
For i = rngSearch.Paragraphs.Count To 1 Step -1
Set oPara = rngSearch.Paragraphs(i)
sRet = HeaderOutlineTitle(oPara.Range.text)
If sRet <> "" Then
fReturnHeadingBefore = sRet
Exit For
End If
Next
End Function
'See if we have a heading-style number. Returns the heading, or blank if not a heading.
Public Function HeaderOutlineTitle(line As String) As String
'Set up variables
Dim CharCounter As Integer
Dim NumberCounter As Integer
Dim tCh As String
Dim y As Integer
CharCounter = 0
NumberCounter = 0
'For the 4 character headings, this skips all the unncessary logic
If Len(line) >= 4 And IsDigits(Left(line, 4)) Then
HeaderOutlineTitle = Left(line, 4)
GoTo SkipToEnd
End If
'Start looping through the line character by character
For y = 1 To Len(line)
tCh = Mid(line, y, 1)
'This allows the letter "H" to be the first character
If y = 1 And tCh = "H" Then
GoTo SkipNumberCheck
End If
'If it's blank or a number or a period, exit
If tCh = " " Or (Not IsDigit(tCh) And tCh <> ".") Then
Exit For
End If
'If this is a number, keep track, otherwise reset the counter
NumberCounter = IIf(IsNumeric(tCh), NumberCounter + 1, 0)
'If we've had more than 2 numbers in a row, we're probably not seeing a heading
If NumberCounter > 2 Then
CharCounter = 0
Exit For
End If
SkipNumberCheck: 'We can also skip all the other logic since it's the first character
CharCounter = CharCounter + 1
Next y
'If this was a outline heading, list that information in column 4
'(AND As long as the last character before the space was a number)
If CharCounter <> 0 Then
If IsNumeric(Mid(line, CharCounter, 1)) Then
HeaderOutlineTitle = Left(line, CharCounter)
Else
HeaderOutlineTitle = Left(line, CharCounter - 1)
End If
Else
HeaderOutlineTitle = ""
End If
SkipToEnd:
End Function
Function IsDigit(char As String) As Boolean
'Uses [Asc] function to get an ASCII code of a character.
Dim iAsciiCode As Integer
iAsciiCode = Asc(char)
'The ASCII codes for the digits range from 48 to 57. If an ASCII
'code of the character is within this range, function returns true
IsDigit = iAsciiCode >= 48 And iAsciiCode <= 57
End Function
Function IsDigits(text As String) As Boolean
'If the text is 0 length, it's not a digit-only string
If Len(text) = 0 Then
IsDigits = False
Else
Dim x As Integer
'the default is true. We'll set to false if...
IsDigits = True
For x = 1 To Len(text)
'...we find something that's not a digit
If Not IsDigit(Mid(text, x, 1)) Then
IsDigits = False
Exit For
End If
Next x
End If
End Function
Function FileThere(FileName As String) As Boolean
FileThere = (Dir(FileName) > "")
End Function
'Queries the user for a file name where we'll save our information
Function GetXLSFileName() As String
'open excel document
Dim fd As FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
Dim vSelFile As Variant
With fd
.AllowMultiSelect = False
.Filters.Add "Excel 2003", "*.xls", 1
.Filters.Add "Excel 2007 & 2010", "*.xlsx", 1
.Title = "Choose your destination spreadsheet file."
'Show the dialog box
If .Show = -1 Then
'Set the file string to the selected file
GetXLSFileName = .SelectedItems(1)
'If the user presses Cancel...
Else
MsgBox "No file selected. Macro will use 'C:\ShallStatements.xls' as default."
GetXLSFileName = "C:\ShallStatements.xls"
End If
End With
End Function
'Interface function that gathers keywords to search for in the document.
Function SetUpKeywords() As Boolean
ReDim sKWArray(1 To 2) As String
sKWArray(1) = "shall"
sKWArray(2) = "must" '<< Section simplified in this example as it was not causing a performance issue
SetUpKeywords = True
End Function
'Finds the first searched keyword in a line and returns the position
Function HasKeywords(line As String) As Integer
HasKeywords = 0
Dim x As Integer
For x = 1 To UBound(sKWArray)
HasKeywords = IIf(Not IsNull(InStr(1, line, sKWArray(x), 1)), InStr(1, line, sKWArray(x), 1), 0)
'Exit the loop if we find something
If HasKeywords <> 0 Then Exit For
Next x
End Function
'Main function that extracts requirements from the document
Sub ExtractRequirements()
Dim x As Integer, y As Integer, tCh As Integer
Dim intRowCount As Integer 'The row excel will import data into. Used as counter & start value.
intRowCount = 10 'Data starts on row 10. The first 9 rows are reserved for performance metrics and the headers.
'Create arrays to hold data
Dim sHeaders() As String
Dim sStatements() As String
ReDim sHeaders(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
ReDim sStatements(1 To 1000) As String 'Supports maximum of 1000 requirements per document.
'Record Start Time
sHeaders(1) = "Start Time"
sStatements(1) = DateTime.Now
'Write headers
sHeaders(intRowCount - 1) = "Header"
sStatements(intRowCount - 1) = "Requirement"
'Figure out how many lines are in the document
Dim NumOfLines As Integer
NumOfLines = ActiveDocument.ComputeStatistics(wdStatisticLines, False)
'Record time for setup to be completed
sHeaders(2) = "Lines Computed"
sStatements(2) = DateTime.Now
'Set up requirement words (exit sub on failure)
If Not SetUpKeywords Then Exit Sub
'Turn off screen updating to improve performance
If MsgBox("Would you like to turn off screen updating to improve performance? " & _
"Please note that this may cause Word to appear unresponsive for quite " & _
"some time on large files.", vbYesNo, "Improve Performance") = vbYes Then
Application.ScreenUpdating = False
End If
'Record time for information to be collected
sHeaders(3) = "Params Collected"
sStatements(3) = DateTime.Now
'Loop through every line and look for headers
Dim tmpHeaderRslt As String, LineText As String, SentenceText As String, header As String, ShallPos As Integer
header = "None" 'This is the header value used before we've found a legitimate header
For x = 1 To NumOfLines
'Go to the correct line
Selection.GoTo wdGoToLine, wdGoToAbsolute, x
'Expand to select the entire line, but skip the logic if it's blank
Selection.Expand wdLine
LineText = Trim(Selection.text)
If LineText <> "" Then
'see if it's a header, and if so, set the header
tmpHeaderRslt = HeaderOutlineTitle(LineText)
If tmpHeaderRslt <> "" Then header = tmpHeaderRslt
'See if the line has the word "shall" or "must" in it
ShallPos = HasKeywords(LineText)
'If so, write it to the spreadsheet
If ShallPos > 0 Then
'Move to the correct character and expand to the full sentence
Selection.Collapse wdCollapseStart
Selection.Move wdCharacter, ShallPos
Selection.Expand wdSentence
SentenceText = Selection.text
'Write the header to the spreadsheet
sHeaders(intRowCount) = header
sStatements(intRowCount) = SentenceText
'Move the selection forward to the proper line (avoids replication)
Selection.Collapse wdCollapseEnd
If (x + 1) < Selection.Information(wdFirstCharacterLineNumber) Then x = Selection.Information(wdFirstCharacterLineNumber) - 1
'Incremement the row counter
intRowCount = intRowCount + 1
End If
End If
Next x
'Record time for statements to be collected
sHeaders(4) = "Statements Collected"
sStatements(4) = DateTime.Now
'Open excel file and set up for receiving data
Dim xlsFileString As String
xlsFileString = GetXLSFileName
Dim appExcel As Object
Dim objSheet As Object
Set appExcel = CreateObject("Excel.Application")
Set objSheet = appExcel.workbooks.Open(xlsFileString).Sheets("Sheet1")
'Write data to the spreadsheet
sHeaders(5) = "Excel write begin"
sStatements(5) = DateTime.Now
For x = 1 To intRowCount
objSheet.Cells(x, 1) = sHeaders(x)
objSheet.Cells(x, 2) = sStatements(x)
Next x
'Write end time
objSheet.Cells(6, 1) = "Excel write ends"
objSheet.Cells(6, 2) = DateTime.Now
'Clean up and open for viewing.
appExcel.workbooks(1).Close True
appExcel.Quit
Set objSheet = Nothing
Set appExcel = Nothing
'Open the new spreadsheet for viewing
If FileThere("C:\Program Files\Microsoft Office\Office12\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office12\excel.exe """ & xlsFileString & """")
ElseIf FileThere("C:\Program Files\Microsoft Office\Office14\excel.exe ") Then
Shell ("C:\Program Files\Microsoft Office\Office14\excel.exe """ & xlsFileString & """")
End If
End Sub
[/vba] As a note: Your use of the IsDigit or IsDigits functions could be substituted by simply using the built in IsNumeric function (which accomplishes the same thing those two functions do) within the HeaderOutlineTitle function. But I left that alone, as that is your primary heading pattern analysis which you are the most comfortable with. And it also doesn't appear to adversely impact the speed that much.
Since this works in about 10 seconds on your 30 page document, I would imagine you shouldn't have this run much longer than a minute on any longer document.
As a general note-- *always* use Option Explicit at the top of your modules. It saves you from having VBA try to determine the type of a variable on the fly. You didn't have a Dim y as Integer in your HeaderOutlineTitle function, which means it will be a Variant type-- that can be costly to the memory process.
And lastly-- please note that since I went away from your "line by line" approach, and instead use the Sentences collection, there will be two major problems that you will need to handle.
1) Sentences collection is really tricky. It truncates at times you wouldn't expect, and doesn't truncate at other times. Dr. Rev. Martin Luther King Jr. died in the 20th century. That "sentence" looks to Word like it is 4 sentences.
2) The sentences collection also continues on using any white space. So in some cases, the text returned will include a lot of empty paragraphs. So I've simply placed in the class a quick and dirty method of removing any paragraph marks from the myInfo class. There are numerous ways to approach this, but this is one way.
Hope this helps. Please post the code you got to, if you're interested in any suggestions for your approach. The main takeaway from this approach is that the use of collections is far superior to your inclination to use arrays. Not only can you put custom objects (the classes) into collections, being able to use the .Range.Start property as your main algorithm for document location speeds things up considerably, because you don't have to a) go through the document line by line and b) Longs are much less memory intensive than a string. When you magnify that by string arrays with 1000 elements, you can definitely slow yourself down.
Cheers,
- Jason aka Frosty
Last edited by Frosty; 06-07-2012 at 03:14 PM.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules