
Scanning a Word document for specific text



aeroboy86
12-03-2006, 10:12 PM
Hello All,
I am looking to write a macro that searches through a Word document, finds all of the words that begin and end with a capital letter, and then populates a table with the words it finds. The idea is to scan a Word document to find possible acronyms contained within it.

eg
"This is some random SS text"
whereby SS = System Specification.

Obviously I need to keep track of the possible acronyms found, as they are used many times within a document.

Any help/sample code on this idea would be greatly appreciated.

Thank you for your time.
James Galea

fumei
12-04-2006, 06:35 AM
Hi aeroboy86. Welcome to VBAX!

We are going to start at the beginning. What have YOU done so far to start this rolling?

Have you tried searching the threads here? Have you tried setting up some search routines? What have you done?

aeroboy86
12-04-2006, 03:02 PM
Hi Fumei

I have been searching through this forum's threads for a couple of days now and don't seem to be finding the information I require. I am very new to VBA but not to programming. I understand that with my previous post I am probably trying to climb a mountain before I can walk, but I just want to know if what I want to do is possible and, if so, get a bit of help getting started. If you can recommend a textbook or a good website as well, that would be greatly appreciated.

Thanks
James

fumei
12-04-2006, 10:31 PM
1. What is it exactly that you require, that you say you have not found?

2. Again, what have you tried so far? There are good web sites out there - including this one - but far more important is what YOU actually try.

Post the code that you have tried so far, as we can probably suggest something.

Using regular expressions would likely help.

aeroboy86
12-04-2006, 10:54 PM
I don't have any code as I am not sure where to start. Look, I understand a lot of posts seem trivial to you, as you are obviously very competent with VBA and reply to most posts, but I have no experience in VBA and just want to learn. What I need to do is:

1. Scan each word in a Word doc one by one and see if it begins and ends with a capital letter.

eg.

Here is SomE sample TexT

In the above example, the code I wish to write would scan the words one by one and find the text "SomE" and "TexT", as they begin and end with a capital letter.

Regards
James

fumei
12-05-2006, 02:08 AM
I understand perfectly what you want to do, and what you are asking.

But I have no intention of just handing you a code solution if you are not even going to try to do something for yourself.

"I don't have any code as I am not sure where to start."

You start by starting. You DO know where to start - I am not sure why you say you don't.

You know that to start you need to look at every word...now don't you? Don't you? Well...why are you saying you don't know where to start? Start there.

Have you tried doing some code to even look at each word? That may be a good place to start. YOU have to start somewhere. Neither I nor anyone else is going to just hand it to you. It is much better for you (or anyone else for that matter) to have a good handle on what is going on. And the only way to do that is to start trying things yourself.

If you are not even going to try anything at all...well...good luck. Maybe someone will just give it to you. This is not a Help desk, and we are not here to hand over solutions to people who ask for them.

Again, try something. Post some code that you have tried. Tell us what is not working, and we will suggest things that will help to make it work.

I even gave you a hint to perhaps start. Possibly try using regular expressions. However, you can do it within Word functions.

Again, I understand perfectly what you are asking for.

"but I have no experience in VBA and just want to learn."

EVERYONE here, and I do mean everyone, has primarily learned by actually trying to code stuff. Sorry, but you say you have programming experience (but not VBA)...well then...start programming.

Take your test sentence - "Here is SomE sample TexT" and work on it. Post what you have as a start.

"In the above example, the code I wish to write"

May I repeat YOUR words? ...."I wish to write"


"I wish to write." I believe that means...you. Not me. So...write some. We will be glad to help you out when and if (and likely if) you have problems. I personally will be glad to help. And yes, I DO in fact post code solutions (and some very detailed sophisticated ones) for people. But I never do so for anyone who has not indicated that they are actually working on it.

As I stated, what you are asking is not particularly easy. It is NOT trivial. Finding words is easy, but you have very specific requirements. I am just about done with a solution, but it is a bit messy. Show us that you are trying something.

aeroboy86
12-05-2006, 10:32 PM
Hi Gerry

OK, I started trying to figure it out today and I have thought of a possibly easier way to solve my problem. The idea, as you already know, is to find possible acronyms contained within a document (sorry for repeating myself). What I thought I could do is scan each word in a Word document and look it up in a list in an Excel worksheet.

Can you let me know how I can make a reference to the Excel library (I hope that is the right terminology)? I want to be able to write something like:


dim xlApp as Excel.Application
dim xlBook as Excel.Workbook
dim xlSheet as Excel.Worksheet

xlApp = new Excel.Application
xlBook = xlApp.workbook.open("the path to the xls sheet")
xlSheet = xlBook.worksheet(1)



I'm not sure if the above syntax is correct or not. Can you please advise? I'm sorry if I ****ed you off before; I had no intention of doing so, and I understand that you can't do the work for me.

Also, I was trying to figure out how to work with each word in a document. I wrote the following:


dim w as object

For Each w In ActiveDocument.Words
msgbox w
Next


This went through the sample document and a message box popped up for each word in turn, showing the word as its prompt. But when I tried to use an If statement to find a word:

If w.Text = "test" Then
msgbox w
End If

The above doesn't report any word "test", although there is one. Obviously this means that I'm not handling the text correctly. If you have any ideas, that would be very helpful.

Thanks again
James

fumei
12-06-2006, 07:33 PM
What exactly is your programming experience? You mention it, but it may help to know what kind of programming your background is in.

You did not **** me off. I was simply stating what the situation is. It does not matter to me if you make this work for you, or not. The point is that it is you who needs to do the work.

You make a reference to Excel by making a Reference to Excel. That is, if you are going to use early binding, versus late binding. Judging by the code you posted, you seem to be going for early binding. Generally speaking (and no doubt there may be some here who would disagree), if possible, I think early binding is better.

In any case, you make a Reference by (while in the VBE) using Tools > References. Find and add the Excel reference. I suggest you do some looking up on references.
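For comparison, here is a minimal late-binding sketch, which needs no Tools > References entry. The procedure name and the workbook path are placeholders of mine, not anything from this thread. With early binding and the Excel reference added, the three variables could instead be declared As Excel.Application, Excel.Workbook and Excel.Worksheet. Either way, object variables are assigned with Set, and the collections are called Workbooks and Worksheets.

```vba
Sub OpenAcronymListLateBound()
    ' Late binding: everything is declared As Object,
    ' so no Excel reference is required.
    Dim xlApp As Object
    Dim xlBook As Object
    Dim xlSheet As Object

    ' Hypothetical path; replace with the real location of the list.
    Const LIST_PATH As String = "C:\Temp\Acronyms.xls"

    Set xlApp = CreateObject("Excel.Application")
    Set xlBook = xlApp.Workbooks.Open(LIST_PATH)
    Set xlSheet = xlBook.Worksheets(1)

    ' Show the first cell just to prove the sheet was reached.
    MsgBox "A1 = " & xlSheet.Cells(1, 1).Value

    ' Tidy up so no hidden Excel instance is left running.
    xlBook.Close False
    xlApp.Quit
    Set xlSheet = Nothing
    Set xlBook = Nothing
    Set xlApp = Nothing
End Sub
```

This sketch has no error handling, so if the workbook cannot be opened the hidden Excel instance stays running until it is closed by hand.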

Regarding your code for each word. Let's look at it. Here is your first one:

Dim w As Object
For Each w In ActiveDocument.Words
    MsgBox w
Next

As you state, each word is displayed. I have something else to say on this, but we'll hold off for a sec.

Now here is your second (I'll add the declaration as well):

Dim w As Object
For Each w In ActiveDocument.Words
    If w.Text = "test" Then
        MsgBox w
    End If
Next

You say it does not work. And of course, you are correct...it doesn't.

1. Try w.Text = "test ". Note the trailing space.

OR

2. Try Trim(w.text) = "test"

Each w will include the trailing space. BTW: you do not need to use w.Text; just w will do. Each item in ActiveDocument.Words is a Range, and .Text is its default property, so w and w.Text give the same string.

You probably noticed in your messages that you got more than you thought you would.

"This is some text." FOUR words, right? Say you had your code that included a counter, like this:

Sub EachWord()
    Dim aWord
    Dim i As Integer
    i = 1
    For Each aWord In ActiveDocument.Words
        MsgBox aWord & " Count= " & i
        i = i + 1
    Next
End Sub

Running it would display SIX messages.

This Count= 1
is Count= 2
some Count= 3
text Count= 4
. Count= 5
Count= 6

What is going on???? Well, the period is considered a word, distinct from "text" - the period is NOT considered a trailing space because it IS not a trailing space. (BTW: this is why it is better to use Trim(w.text), rather than w.text = "text ", because w may (or may not) have a trailing space.)

The period ( . ) is considered a separate object.

PLUS, the paragraph mark is also considered an object - which it is.

Which is why I stated that what you are asking is not really all that trivial. You need to be very careful on how you deal with it.
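One way to be careful about it, sketched here under my own assumptions rather than as the only approach: filter out anything that is not made up purely of letters before doing any further checks. The names IsLetterWord and CountLetterWords are made up for illustration, and the Like pattern assumes the module uses the default Option Compare Binary. Note that this also rejects items containing digits or apostrophes, which may or may not be what you want.

```vba
' Returns True only for items made up entirely of the letters A-Z / a-z.
' Assumes the default Option Compare Binary.
Function IsLetterWord(ByVal rawText As String) As Boolean
    Dim cleaned As String
    cleaned = Trim$(rawText)        ' drop the trailing space Word includes
    If Len(cleaned) = 0 Then Exit Function
    ' "*[!A-Za-z]*" matches any string that contains a non-letter,
    ' which covers the period and the paragraph mark.
    IsLetterWord = Not (cleaned Like "*[!A-Za-z]*")
End Function

Sub CountLetterWords()
    Dim w As Range
    Dim n As Long
    For Each w In ActiveDocument.Words
        If IsLetterWord(w.Text) Then n = n + 1
    Next w
    MsgBox "Letter-only words: " & n
End Sub
```

With a filter like this in front, the period and the paragraph mark never reach the acronym test.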

If I understand correctly, you are thinking of:

1. loading an Excel file with a listing of accepted acronyms
2. running through each word in the document and taking that word over to Excel and running it through each word in that list.

Can be done, but this is a huge use of resources. It is a LOT of checking with a LOT of switching back and forth. Think about it. I have no real idea, but would you say that acronyms make up 10% of the document? Less? I am guessing less. Let's say, 2% of all the words in the document are acronyms . Even that may be high.

In any case...that means 98% of all the work - picking up every word (separately) in Word, switching over to Excel, and checking that word against every single word in the list - is wasted work. 98% of the work is pointless.

Does that make sense?

What do you think you could do about that? Do you really need to send EVERY word over to be checked as an acronym? Maybe you have already figured that out. It is a logic issue.

BTW: have you considered using a custom dictionary?
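To make the cost argument concrete, here is one possible arrangement, offered as a rough sketch and not as the intended answer: read the acronym list into memory once, then test each document word locally, so there is no switching back and forth per word. BuildLookup and ReportKnownAcronyms are hypothetical names, and the Array(...) list is a stand-in for values that would really be read once from the worksheet.

```vba
' Builds a case-sensitive lookup from a list of acronyms.
Function BuildLookup(ByVal acronyms As Variant) As Object
    Dim dict As Object
    Dim item As Variant
    Set dict = CreateObject("Scripting.Dictionary")
    For Each item In acronyms
        If Not dict.Exists(CStr(item)) Then dict.Add CStr(item), True
    Next item
    Set BuildLookup = dict
End Function

Sub ReportKnownAcronyms()
    Dim known As Object
    Dim w As Range
    Dim candidate As String
    Dim found As String

    ' Stand-in list (assumption); in practice, read it once from Excel.
    Set known = BuildLookup(Array("SS", "VBA", "VHDL"))

    For Each w In ActiveDocument.Words
        candidate = Trim$(w.Text)
        ' A dictionary lookup replaces a loop over the whole list.
        If known.Exists(candidate) Then found = found & vbCrLf & candidate
    Next w

    If Len(found) = 0 Then found = vbCrLf & "(none)"
    MsgBox "Known acronyms found:" & found
End Sub
```

The dictionary compares keys case-sensitively by default, so "ss" would not match "SS".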

aeroboy86
12-06-2006, 09:52 PM
Hi Gerry,

I agree with you about the waste of resources in checking every word in a document, and I'm trying to figure out a better way. I am not familiar with custom dictionaries as yet. I will go and have a look at the Visual Basic help; I agree with you that it is a great resource.

Just so you know, I am a third-year avionics student, and my main background in programming is C. I do a bit of Visual Basic Express programming, C++, Assembly and VHDL, but I haven't really done much work in VBA. I understand that Visual Basic 2005 Express is very similar; I have just never worked with MS Word using VBA.

The main reason behind the idea of using Excel is that the acronym list will need to be continually updated, and I originally thought it would be a possible solution.

Just a thought: do you think it would be any better to open the Excel sheet, grab an acronym, then search the document for occurrences of that acronym? You would still have to check for every acronym in the list, but I suppose it is better than sending every word over and then searching the Excel sheet. What do you think?

Anyway, thanks for your help. I'll keep at it and hopefully figure something out.
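For reference, the search-per-acronym idea could look something like the sketch below, which uses Word's Find object to count whole-word, case-sensitive matches. CountOccurrences and DemoCount are hypothetical names, and "SS" is just the example acronym from the first post.

```vba
' Counts whole-word, case-sensitive occurrences of term
' in the active document.
Function CountOccurrences(ByVal term As String) As Long
    Dim rng As Range
    Dim hits As Long

    Set rng = ActiveDocument.Content
    With rng.Find
        .ClearFormatting
        .Text = term
        .MatchCase = True
        .MatchWholeWord = True
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop          ' stop at the end, do not wrap around
        Do While .Execute
            hits = hits + 1
            ' Move past this match so the next search starts after it.
            rng.Collapse wdCollapseEnd
        Loop
    End With

    CountOccurrences = hits
End Function

Sub DemoCount()
    MsgBox "SS appears " & CountOccurrences("SS") & " time(s)."
End Sub
```

This runs one search of the document per acronym in the list, so whether it beats checking each word depends on how long the list is compared with the document.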

Cheers
James

aeroboy86
12-07-2006, 07:21 PM
Hi Gerry,

I have been trying to figure out how I can check the first and the last characters in a word.

Question:
1. Is there a better way to scan text than:

dim aWord as Object
For Each aWord in ActiveDocument.Words
'Do something
Next aWord


This may be a stupid question, but judging from the above VB code, are all words considered objects in Word?

Cheers
James

fumei
12-07-2006, 09:52 PM
It is not a stupid question at all. What do you think?

All words are not considered objects in Word. In fact, there IS no "word" object in Word. However, what the code does is MAKE an object for each of the defined Ranges - which is what Word considers to be a "word".

So, if I understand your question correctly, yes, every single "word" is considered as an object.

For EACH aWord in ActiveDocument.Words
' do something

For every single Range that I (Word) think of as a "word", yes, I will execute the following instructions....

And if I come across a paragraph mark, then yes, that is a "word" to me (Word) and I will execute those instructions. If you use a couple of Enter key strokes to put "space" between paragraphs...each one of those IS a paragraph mark, and each one will have those instructions executed.

"I have been trying to figure out how I can check the first and the last characters in a word."

Ahhhhhh, finally. You have come upon why I stated that this is NOT trivial.

Finding and checking a "word" is one thing, and is a very common task.

Checking the structure of the word can be fairly easy, and certainly, you CAN do a check to see if the first and last character is capitalized.

I know it may seem like I am giving you a hard time, but what I am really trying to do is nudge you into really thinking about it.

So, OK, yes you are checking each and every word - what choice do you have if you want to...uh, check every word?

And you want to check if the first and last letter is capitalized.

WHAT would be the best thing to do first? I mean, once you have the word you are going to check.

From a logic perspective (and this effort IS a logic operation), there are a number of possible operations that could be performed, but ONE of them is the best starting operation.

WHAT is that?

Dave
12-10-2006, 06:52 AM
James, if each word found is in string format, this should help but it is untested. Dave

Public Function CapLet(Strtest As String)
If (Asc(Right(Strtest, 1)) >= 65) And _
(Asc(Left(Strtest, 1)) <= 90) Then
MsgBox "Found one: " & Strtest
End If
End Function


Something like this to use...

Dim w As Object, Str As String
For Each w In ActiveDocument.Words
Str = w.Text
CapLet (Str)
Next w


As far as placing these in a table that comes next...

fumei
12-10-2006, 10:17 AM
Some comments.

1. (Asc(Right(Strtest, 1)) >= 65) will return FALSE for almost every word, because Right(Strtest, 1) is normally the trailing space, Chr(32). The exception is a word followed directly by punctuation, such as the final word in a sentence.

"This is SomE text with a CouplE with Caps." will NOT find SomE, or CouplE.

Again, it would work if you use Trim, as I suggested.
The value of w As Object includes the trailing space.

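2. Even corrected using Trim, the ASCII logic is flawed: it checks the last character only against the lower bound (65) and the first character only against the upper bound (90), so lowercase letters pass the first test. Example: "This is SomE text."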

The word "This" will come back as "Found one."

(Asc(Right(Strtest, 1)) >= 65) is TRUE, as Asc("s") is greater than 65

Asc("s") = 115

(Asc(Left(Strtest, 1)) <= 90) is TRUE, as Asc("T") is less than 90

Asc("T") = 84

So "This" will be "Found one." Which of course is incorrect.
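A check that avoids both problems would test each character against both bounds. The function below is only a sketch of that idea; the name IsFirstAndLastCapped is mine, and the decision to reject single letters such as "I" or "A" is an assumption about what counts as an acronym.

```vba
' True when both the first and the last character are A-Z (ASCII 65 to 90).
Function IsFirstAndLastCapped(ByVal rawText As String) As Boolean
    Dim cleaned As String
    Dim firstCode As Integer
    Dim lastCode As Integer

    cleaned = Trim$(rawText)        ' remove the trailing space
    ' Asc raises an error on an empty string, and a single capital
    ' letter is treated here as not being an acronym (assumption).
    If Len(cleaned) < 2 Then Exit Function

    firstCode = Asc(Left$(cleaned, 1))
    lastCode = Asc(Right$(cleaned, 1))

    ' Each character must satisfy BOTH the lower and the upper bound.
    IsFirstAndLastCapped = (firstCode >= 65 And firstCode <= 90) And _
                           (lastCode >= 65 And lastCode <= 90)
End Function
```

With this version "SomE " passes and "This " does not, because "s" (115) fails the upper bound.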

3. There is no need whatsoever for the string variable Str, as in:

Dim w As Object, Str As String
For Each w In ActiveDocument.Words
    Str = w.Text
    CapLet (Str)
Next w

This works just fine:

Dim w As Object
For Each w In ActiveDocument.Words
    CapLet (w)
Next w

Again, w is a Range object whose default property is .Text, so it does not need .Text either. There is no error for using .Text though.

MsgBox w
MsgBox w.Text

will give exactly the same values.

fumei
12-10-2006, 11:12 AM
Here is an alternative. And I have to say, there are a number of other ways to do this.

If I understand this correctly, you have an Excel file with your list of capped words. Yes?

In Excel, put the following in a public module. (To run everything in Word, as described at the end of this post, put these declarations at the top of the Word module instead so that the Subs can see them.)

Public CappedWords() As String
Public CappedWordsCounter As Integer

This sets up a string array of all the capped words, and a counter of them.

In Word, put in the following:

Sub TestCaps()
    Dim w As Object
    ' explicitly set counter = 0 so this
    ' can be repeatedly done for testing
    CappedWordsCounter = 0
    For Each w In ActiveDocument.Words
        ' check the LAST letter first
        ' logically it is only going to be a possibility if
        ' at least the last letter is capped
        ' there are no words with only the last
        ' letter capped
        If Asc(Right(Trim(w), 1)) >= 65 And _
                Asc(Right(Trim(w), 1)) <= 90 Then
            ' so IF the last letter is capped, THEN
            ' check the first letter. If the last letter is
            ' NOT capped, forget it, don't bother checking
            ' the first letter. Go to next word.
            Call CapWords(Trim(w))
        End If
    Next w
    ' this next calls a Sub to display the
    ' array of capped words, if there are any
    ' Note that even if only ONE word is capped
    ' the counter is incremented by 1, so even
    ' though the array may index at 0 (one word)
    ' the COUNTER would be 1
    If CappedWordsCounter > 0 Then
        Call ListCappedWords
    Else
        MsgBox "There are no first and last capped words."
    End If
End Sub

Sub CapWords(Strtest As String)
    ' this is only called if the LAST letter is capitalized
    ' so now check FIRST letter
    Dim bolFirst As Boolean
    ' if first letter is capped, boolean is TRUE
    ' note that the parameter Strtest is passed
    ' TRIM'd, so Left(Strtest,1) is the first letter
    Select Case Asc(Left(Strtest, 1))
        Case 65 To 90
            bolFirst = True
    End Select
    ' if it is TRUE, then you know BOTH first
    ' and last letters are capped....so
    ' redim the array and add word
    If bolFirst Then
        ReDim Preserve CappedWords(CappedWordsCounter)
        CappedWords(CappedWordsCounter) = Strtest
        CappedWordsCounter = CappedWordsCounter + 1
    End If
End Sub

' now you have an array of all words that are capped
' with first and last letters
' and you can DO stuff with that array

Sub ListCappedWords()
    ' this simply displays a message listing all
    ' the capped words...HOWEVER
    ' if the array was in Excel (not Word)
    ' you can change this to run through logic
    ' in the Excel file
    Dim msg As String
    Dim var
    For var = 0 To UBound(CappedWords)
        ' run through the array of capped words
        ' in this case, build message of all capped words
        ' OR you could do other logic processing on
        ' each item of array list
        msg = msg & vbCrLf & CappedWords(var)
    Next
    MsgBox msg
End Sub

If you put all of the above code in Word, you will get a display of all capped words.

The point being, is you can build an array of all the capped words, then work from that.
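As one example of working from the collected words, the sketch below tallies each distinct capped word and writes the results into a table, which is what the original question asked for. It is my own variation, not the code above: it uses a Scripting.Dictionary instead of the array so that repeats are counted, a Like pattern (assuming the default Option Compare Binary) instead of the Asc checks, and it writes to a new document so the scanned one is left untouched. TabulateCappedWords is a hypothetical name.

```vba
Sub TabulateCappedWords()
    Dim source As Document
    Dim report As Document
    Dim tally As Object
    Dim w As Range
    Dim candidate As String
    Dim key As Variant
    Dim tbl As Table
    Dim r As Long

    Set source = ActiveDocument
    Set tally = CreateObject("Scripting.Dictionary")

    ' Count each distinct word that starts and ends with A-Z.
    ' The pattern needs at least two characters, so a lone
    ' "I" or "A" is skipped.
    For Each w In source.Words
        candidate = Trim$(w.Text)
        If candidate Like "[A-Z]*[A-Z]" Then
            ' Reading a missing key creates it as Empty, so + 1 gives 1.
            tally(candidate) = tally(candidate) + 1
        End If
    Next w

    If tally.Count = 0 Then
        MsgBox "There are no first and last capped words."
        Exit Sub
    End If

    ' One header row plus one row per distinct word, two columns.
    Set report = Documents.Add
    Set tbl = report.Tables.Add(report.Range(0, 0), tally.Count + 1, 2)
    tbl.Cell(1, 1).Range.Text = "Word"
    tbl.Cell(1, 2).Range.Text = "Count"

    r = 2
    For Each key In tally.Keys
        tbl.Cell(r, 1).Range.Text = key
        tbl.Cell(r, 2).Range.Text = CStr(tally(key))
        r = r + 1
    Next key
End Sub
```

The resulting list is only a set of candidates; each one still has to be checked by a person or against the acronym list.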

fumei
12-10-2006, 11:33 AM
I know that looks like a lot of code, but it isn't really. It is heavily commented for you. Here is what it looks like without the comments.

Public CappedWords() As String
Public CappedWordsCounter As Integer


Public Sub CapWords(Strtest As String)
    Dim bolFirst As Boolean
    Select Case Asc(Left(Strtest, 1))
        Case 65 To 90
            bolFirst = True
    End Select
    If bolFirst Then
        ReDim Preserve CappedWords(CappedWordsCounter)
        CappedWords(CappedWordsCounter) = Strtest
        CappedWordsCounter = CappedWordsCounter + 1
    End If
End Sub

Sub TestCaps()
    Dim w As Object
    CappedWordsCounter = 0
    For Each w In ActiveDocument.Words
        If Asc(Right(Trim(w), 1)) >= 65 And _
                Asc(Right(Trim(w), 1)) <= 90 Then
            Call CapWords(Trim(w))
        End If
    Next w
    If CappedWordsCounter > 0 Then
        Call ListCappedWords
    End If
End Sub

Sub ListCappedWords()
    Dim var
    For var = 0 To UBound(CappedWords)
        ' do stuff with each item in array
    Next
End Sub