PDA

View Full Version : Extracting Text from word docs



Jdrewrd
08-31-2009, 07:49 AM
Hi,

I am in need of a Word macro which will pull text from multiple files inside a directory. These files contain text strings of the format:


N 35˚ 04’ 45.6” W 085˚ 08’ 24.1” (global coordinates)

There are several instances of these coordinates within each word file.

The N 35˚ 04’ and W 085˚ 08’ is always the same, but the last 4 digits (24.1”) will change. What I need is a comma delimited or csv, xls, etc (whatever is easiest) that contains this:

N 35˚ 04’ 45.6”, W 085˚ 08’ 24.1”, <filename>

for every instance of a coordinate that is found (file names will be duplicated). These could actually exclude the N 35˚ 04’ & W 085˚ 08’ since they never change. So it could look like this (actually preferable):

45.6,24.1,<filename>

Any help will be greatly appreciated! Thanks!

BTW, these text strings are contained within a table that looks like it has been pasted from excel. I don't know if this matters. I'm attaching a sample.

fumei
09-01-2009, 12:31 PM
1. It would make it easier if the tables were bookmarked so you could extract data from them by name.

2. Regarding multiple files from a folder, use FileSearch - that is if you are NOT using 2007. Alternatively, if the files are in the SAME folder, you can use the DIR function.

3. While technically this kind of thing would be better to use in Excel - it does after all deal with numbers - it can be done with Word.

4. You do not state what you actually want to do with the final data. But it SURE makes a difference!

5. Looking at the demo file, I notice only ONE table has the pertinent data of N 35˚ 04’ 45.6. All the other tables have things like N 35˚ 04’ 48.2; N 35˚ 04’ 47.4. Do you want all of them? You do not say. I am working on the assumption you DO want them all.

6. The data is NOT totally consistent, which could be (but could be fixed) a problem. For example:

N 35˚ 04’ 49.6” W 085˚ 08’ 22.0” Elev. 733 +/- 1’
N 35˚ 04’ 45.9” W 085˚ 08’ 24.3” Elev. 740 +/- 1’

They look similar, but.......there are TWO spaces between
the 04' and 45.9", and there is ONE space between 04' 49.6"

ASIDE: another reason to have Show/Hide = ON!

You can see this easily in the demo attached. This demo - click Show Demo on the top toolbar - simply displays the results (as stated) in a message box. This could be easily extracted into another fresh document if you like. Here is some (not particularly efficient) demo code. For one thing, because of the number of items, the limit of messagebox display is exceeded. This is not the way I would do it. See #1 above.
Option Explicit

Sub GetTableData()
Dim aTable As Table
Dim oCell As Cell
Dim r1 As Range
Dim r2 As Range
Dim DataString As String
Dim oBM As Bookmark
Dim DocBM As Bookmarks
Dim msg As String

Set DocBM = ActiveDocument.Bookmarks
For Each oBM In DocBM
If oBM.Range.Tables.Count = 1 Then
Set aTable = oBM.Range.Tables(1)
For Each oCell In aTable.Range.Cells
DataString = oCell.Range.Text
If InStr(DataString, "N 35") = 1 Then
Set r1 = oCell.Range
r1.MoveStart Unit:=wdCharacter, Count:=10
r1.End = r1.Start + 4
Set r2 = oCell.Range
r2.MoveStart Unit:=wdCharacter, Count:=26
r2.End = r2.Start + 5
msg = msg & r1.Text & "," & r2.Text & _
"< " & ActiveDocument.Name & " >" _
& vbCrLf
End If
Next
End If
Next
MsgBox msg
End Sub


As for multiple files in a folder, this is quite straightforward and can be done simply with the DIR function. if your processing code is in a global template (better), or in Normal.dot (not so good), then you can simply Call it for each file in a folder. Assuming the files are all in the C:\Yadda folder (but of course could be ANY folder). Kind of like this:

Public msg As String

Sub ProcessStuff()
Const path As String = "C:\ZZZ\Locations\"
Dim r As Range

Dim file
' this makes a NEW document to hold the results
Dim ResultsDoc As Document
Set ResultsDoc = Documents.Add
Set r = ResultsDoc.Range
file = Dir(path & "*Location*.doc")
Do While file <> ""
' open each document
Documents.Open FileName:=path & file
' call the getdata procedure
' dumps the data from current document
' into NEW results document
' then RESET range of results doc and Collapse!
' VERY IMPORTANT!
Call GetTableData(ResultsDoc)
Set r = ResultsDoc.Range
r.Collapse 0
ActiveDocument.Close wdDoNotSaveChanges
' go to the next document
file = Dir()
Loop
With ResultsDoc
With .Range
.Collapse 0
.InsertAfter Text:=" useless processes = " & _
numUselessProcesses
End With
.SaveAs FileName:="c:\zzz\locations\Results.doc"
End With
End Sub


Result (truncated):

45.6, 24.1< DemoLocations_A.doc >
45.7, 25.1< DemoLocations_A.doc >
45.9, 21.6< DemoLocations_A.doc >
46.9, 22.7< DemoLocations_A.doc >
46.4, 25.1< DemoLocations_A.doc >
.......
45.,’ 24.< DemoLocations_A.doc > (notice result of the EXTRA space!)
......
47.5, 23.2< DemoLocations_B.doc >
48.2, 23.4< DemoLocations_B.doc >
48.5, 22.1< DemoLocations_B.doc >
47.9, 24.2< DemoLocations_B.doc >
46.7, 22.0< DemoLocations_B.doc >
45.6, 24.1< DemoLocations_C.doc >
45.7, 25.1< DemoLocations_C.doc >
45.9, 21.6< DemoLocations_C.doc >
46.9, 22.7< DemoLocations_C.doc >
useless processes = 3936

Which is sort of what you are asking about.

NOTES!:

1. Notice the Public variable "msg" used to carry the data over to the new document.

2. So...what is with the " useless processes = 3936" tacked on to the end?

Remember at the start of this?

"1. It would make it easier if the tables were bookmarked so you could extract data from them by name."

In my first code I used a bookmarked table.
Set DocBM = ActiveDocument.Bookmarks
For Each oBM In DocBM
If oBM.Range.Tables.Count = 1 Then
' so ONLY if a bookmark HAS a table do the next processing


You most likely are NOT using bookmarked tables, so......to process, you have to process through ALL the tables, including the blank one at the start of the doc; including the one with "TECHNICIAN" and "CONTRACTOR"...right? How does VBA know which table to look for the "N 35˚ 04’ " , unless that table is bookmark? It can not. So for my first test document, I bookmarked.

So out of curiousity, I did another, that I did NOT bookmark (and thus has to test through ALL of the tables, and ALL of cells, looking for the "N 35˚ 04’ "). Here is the code, with the counter "nunUselessProcesses".

Public numUselessProcesses As Long


Sub GetTableData(DocIn As Document)
Dim aTable As Table
Dim oCell As Cell
Dim DocInRange As Range
Dim r1 As Range
Dim r2 As Range
Dim DataString As String
Dim msg As String

Set DocInRange = DocIn.Range
' this process against the CURRENT doc re: DIR function
For Each aTable In ActiveDocument.Tables
For Each oCell In aTable.Range.Cells
' counting every time it process a cell INCLUDING
' EVERY SINGLE blank cell!
numUselessProcesses = numUselessProcesses + 1
DataString = oCell.Range.Text
If InStr(DataString, "N 35") = 1 Then
' make ranges for two separate text parts
Set r1 = oCell.Range
r1.MoveStart Unit:=wdCharacter, Count:=10
r1.End = r1.Start + 4
Set r2 = oCell.Range
r2.MoveStart Unit:=wdCharacter, Count:=26
r2.End = r2.Start + 5
' make one string of those
msg = r1.Text & "," & r2.Text & _
"< " & ActiveDocument.Name & " >" _
& vbCrLf
' dump them in results doc
With DocInRange
.InsertAfter _
Text:=msg
.Collapse 0
End With
End If
Next
Next
End Sub



Thus....for three test (only three) documents, there were 3936 useless, pointless, and extremely inefficient and extraneous procedures executed!

Admittedly, as I have a rather fast machine it did not make a huge huge difference, but there was noticeably some. From a coding POV though...yuck.

It would be even better to write a wee chunk of code to look through each table for the search string ("N 35") , and bookmark the table with a sequential number THEN process the table using that bookmark.

Tinbendr
09-21-2009, 01:12 PM
It's been a little while since the OP posted, but here are my efforts.

This is the main code. See attachment for the rest.


Private Sub GetCoordinates(FName$)
Dim bDoc As Document
Dim SrcDoc As Document
Dim RngResult As Range
Dim Rng As Range
Set bDoc = ActiveDocument
Set SrcDoc = Documents.Open(FName$)
Set RngResult = SrcDoc.Range
Do
With RngResult.Find
.ClearFormatting
.Text = "N 35"
.Forward = True
.Wrap = wdFindStop
.Execute
End With

If Not RngResult.Find.Found Then Exit Do

'Since the text is located in a table,
'and if you manipulate the range Found start/end point,
'Find starts over from the top, resulting in an endless loop.
'So duplicate the Found range and manipulate it instead.
Set Rng = RngResult.Duplicate
'parse string
'N 35° 04' 45.6" W 085° 08' 24.1" Elev. 739 +/- 1'
'Result should look like this.
'45.6,24.1,<filename>
'Move the start over to the Single Quote
Rng.MoveStartUntil Chr(146)
'Move the Start over one to skip the quote
Rng.MoveStart wdCharacter
'Move the End over to the Double Quote symbol
Rng.MoveEndUntil Chr(148)

'Insert the first coordinate
bDoc.Range.InsertAfter Trim(Rng) & ","

'Do the same process to capture the second coordinate.
Rng.MoveStartUntil Chr(146)
Rng.MoveStart wdCharacter
Rng.MoveEndUntil Chr(148)

'Insert the second coordinate
bDoc.Range.InsertAfter Trim(Rng) & "," & SrcDoc.Name
'Insert a carriage return
bDoc.Range.InsertAfter vbCr

Loop Until Not RngResult.Find.Found
SrcDoc.Close wdDoNotSaveChanges
End Sub