PDA

View Full Version : Solved: Cleaning up a Word document



johnske
10-31-2004, 05:25 AM
Hi,

Every 3 months I receive 2 Word docs containing records, each is about 35 pages long.

The layout is atrocious and really needs to be cleaned up to make them readable.

They are set out in columns that are delimited by a combination of either 1, 2, or 3 tab spacings and/or 2 or more ordinary spacings on each line.

This combination renders my attempts to use Words' text-to-table useless so I need a procedure that - for a start - would (say) find and replace 2 or more spaces (or tabs) with a single tab (keeping the lines intact) then either convert this text to a table OR export it to an excel workbook instead.

How would I go about doing this? (can zip part of one if that helps) :bink:

Kelly
10-31-2004, 12:13 PM
Is each record one single word? (even if the "word," for example, is something like "555-1212")

What I'm trying to say is: WITHIN a record, are there ever any spaces?

Because if NONE of the records contain spaces (or tabs), then I'm thinking you could do this:

(possibly using regular expressions, or not)

find a "word" (a string of ANY quantity of consecutive NON-space/NON-tab characters), then find the following "word," and then delete ANYTHING in between then (be it a space, two spaces, a tab, two tabs, etc) and put only a single tab between them.

Would that work? If so, I could elaborate further. But it any of the records contain spaces, then I guess it kills my idea.

Bilby
10-31-2004, 03:48 PM
the spaces aren't a big problem, if you use regular expressions. But Regular Expressions are best avoided unless you have no choice. they're rather confusing.

johnske
10-31-2004, 04:32 PM
Is each record one single word? (even if the "word," for example, is something like "555-1212")

What I'm trying to say is: WITHIN a record, are there ever any spaces?

Because if NONE of the records contain spaces (or tabs), then I'm thinking you could do this:.......(etc)

Would that work? If so, I could elaborate further. But it any of the records contain spaces, then I guess it kills my idea.
Unfortunately, yes , some of them contain spaces....

Have attached an abridged version for you to look at.

This is only the start of the clean-up, because the real problem is that it's riddled with mistakes such as names being spelt incorrectly and in various ways such as i) J Smoth ii) J.Smath iii) JSmith when it is one person (J. Smith) being referred to and also, dates such as 8-9th Oct 02 being written as 8/9/10/02.....:bink:

Kelly
10-31-2004, 04:32 PM
I only meant that the spaces would be a problem if, say "Bob Smith" were a record, and elsewhere "January 2003" was actually TWO records, separated only by a single space. You see? how would a two-word single record be distinguishable from two separate one-word records separated only by a space.

BUT - now that I have reread the original situation, it appears that two records always have AT LEAST one tab between them, so I guess my imagination ran away with me when I was envisioning the scenario.

Kelly
10-31-2004, 04:37 PM
Thanks for posting a sample, Johnske!

I'll start playing with it. (after I go grocery shopping :) )

Kelly
10-31-2004, 05:05 PM
Okay, I couldn't resist.... (groceries can wait a little longer)

just to give you an idea of what I have in mind, try this:

open the find and replace dialog in Word, and make sure the option "use wildcards" is selected (checked, ON, etc)

search for:
([^t]{1,3})

that will find any and all of the following:
1. a (all) single tab character
2. TWO CONSECUTIVE tab characters
3. THREE CONSECUTIVE tab characters

if you continue clicking "find next" over and over, you should see how this goes down the document finding almost all of the divisions between recordsets that we need to find.

Again, this is only the beginning. Try a search on Google for "Word wildcards" or something similar to learn more.

just to be clear, I'm attaching a picture below of what the Find and Replace dialog should look like:

(if you can't see the attached picture, you can click here (http://members.dslextreme.com/users/kkj/vbax_special/findandreplace.html))

Kelly
10-31-2004, 07:30 PM
Hey, I'm back already.

Well, here we go.

This isn't perfect, but it's a darn good start.

Open this doc (it's really just the same as the first document posted in this thread, only now it has macros)

hit alt+f8 for the list of macros, and run the macro called "MakeSomeTablesHere"

I'm testing on Word 2000, and I didn't get any errors.
To everyone: let me know if you get any errors

Thanks!!

johnske
10-31-2004, 07:52 PM
Kelly, that is just so cool, it's par excellence!! :eek:

Unfortunately, there is a few errors, the first one I found is on page 3 with the first column entry being "Sydney" - after that the entries are displaced 2 columns to the right.

This is probably because in the original, there are a very large number of ordinary spaces, then a tab space between "Sydney" and "H.Bochow". I'm not sure what you've done in your procedure (have done very little in Word) so I don't know how this would be corrected... :confused:

John :bink:

Kelly
10-31-2004, 08:29 PM
Hi johnske!

So glad you liked it.

Unfortunately, I have to say goodnight for tonight, but I will return tomorrow to further clarify what I did.

How's this for a cliff-hanger: I actually rewrote my macro just now and have now managed to get it to work with 100% accuracy (eliminating the errors that you described, such as with Sydney...H.Bochow).

However, I don't want to "clog up" this thread by posting document after document. So what I will do is this:

tomorrow (or Tuesday at the latest, I hope) I will go through my NEW macro and clean up the indentation and all that. Then I will add explanatory comments to the code. THEN I will post another zipped doc with a nice, clean, effective version of the macro.

I think when you see the "clean" version with comments in the code, I think that you won't have the slightest difficulty understanding what I did, johnske. It's just that the first version is written in really sloppy code. I rushed through because I wasn't sure when someone might demand of me that I get off the computer!! So i wanted to post at least a sloppy sample tonight, because I KNEW that what you were asking was possible.

So... see you tomorrow!

johnske
10-31-2004, 08:43 PM
Ok Kelly, sleep tight.... :snooze

It's actually quite good as it is, it gives much better results than the in-built text-to-tables, but if you can do better - who am I to complain? :yes

Thanx so much,
John :bink:

mdmackillop
11-01-2004, 10:58 AM
Hi Kelly,
I've been "tidying up" documents for years, and never come across the ([^t]{1,3}) expression. It will make life a lot simpler!
I found a useful site here http://word.mvps.org/FAQs/General/UsingWildcards.htm
MD,
PS. Please give your penguin and extra fish on me. :)

Kelly
11-01-2004, 01:49 PM
Hello mdmackillop! (glad to have introduced you to ([^t]{1,3}))

and hello johnske!

Okay, here is the corrected, cleaned-up version.

Besides making the code a little "prettier," the only changes I made were:

1. search and replace the spaces-followed-by-tabs BEFORE searching for excess spaces

2. "up the ante" so to speak, and search for up to 99 consecutive spaces, rather than the prior 15 or whatever quantity I had used

Johnske,
I would encourage you to "step through" the macro to learn more about it. (just open the VB editor window, place your cursor inside the macro, and then hit F8 over and over - while simultaneously checking the "action" ocurring in the document window. Just make sure not to click anywhere in the DOCUMENT while the macro is still being stepped through)

I have added comments to the macro code to make its purpose clear.

If the only confusing part is the "([* ]{1,55})([^t]{1,6})" type stuff, then that really isn't a VBA problem or a VBA challenge at all. That is just a wildcards problem. those same search strings can be used in the normal Word interface without using macros at all.

Johnske,
you never mentioned whether you tried this macro on any other "scorecard" documents beside the sample doc that you posted. Will it work in ALL of your scorecard documents? If it doesn't, then I could most likely tell you why.

Here's why: I wrote my macro based on certain assumptions about the content of your document. Below is a list of my assumptions. Even if only ONE of the conditions is not met, then my macro will not be as effective:

1. the "data lines" in the document (the lines such as "D/IAU 600 25cm R.Crump TCA 2/3/96 1086 60+") MUST BE WRITTEN IN SIZE 10 FONT.

2. All other lines must NOT be written in size 10 font

3. each set of data must be located just below the heading line that ends in the word "SCORE"

4. there must be at least 3 hard returns between the end of a data set and the beginning of whatever follows it

5. within any one data set, there must NEVER be more than one consecutive blank line between data lines

johnske
11-01-2004, 06:48 PM
MD, I think the penguin deserves a cuppla fish now.... :thumb

Hi Kelly, Super!! :vv

There are a few errors, but not so many that they cant be handled manually now.

Here are the stats of one of the complete docs before cleaning up
Pages 36
Words 4,519
Characters(no spaces) 22,860
Characters(spaces) 32,167 {24,649 after}
Paragraphs 801
Lines 2,021
Because of the size of the doc, it takes about 5 min to run the macro, afterwards there were only 6 of the "displaced to the right type" errors (can handle that manually) however, for some reason, there were 80 lines (in "clumps" of 15 to 20, or in twos and threes) that'd had the spaces removed but weren't put into tables - this is not a prob as, because all the spaces have been removed it's quite simple now to use the built-in text-to-table (to 'manually' convert them)

All in all, this gives far better results than the in-built procedure

Thanx ever so much! :)
John :bink:

johnske
11-02-2004, 05:46 PM
Hi Kelly,

The second doc was even worse as the data also included several font sizes.

Reading your constraints, before using it again, I removed the many hundreds of superfluous (blank) lines "selected all" and set the font size to 10, then increased the font size of the headings and ran the macro.

This only gave 11 'displaced to the right' type errors (n.b. not the 1st, but the 2nd, 3rd, or 4th col displaced one col to the right) and only 7 lines that weren't converted to tables.

Given the size of the document and the horrible mess it was in originally, this is nothing short of amazing - 5 stars!! (+ 3 fish for the penguin)

John :bink:

mdmackillop
11-03-2004, 01:56 AM
Hi Kelly,
Just a thought, but if you could test MyRange for a count of tabs >1, instead of the Font 10, that might also work, and also put the Headres into the table. I haven't worked out how to do that yet though (no more time now)
MD

Kelly
11-03-2004, 07:59 AM
Hi Kelly,
Just a thought, but if you could test MyRange for a count of tabs >1, instead of the Font 10, that might also work, and also put the Headres into the table. I haven't worked out how to do that yet though (no more time now)
MD
That's an excellent idea about checking each paragraph for a "tab count" rather than for font size.

As far as putting the headers in the table, you are probably right about that, too. I DID actually consider putting them in, but I wasn't sure if we were later going to compile all of the individual data tables into one BIG table that would then only need one header to cover everything.

Actually, I've been toying with thoughts of how to speed the whole thing up, but I've been working hard this week (for a change ;)) so I haven't had a lot of free brain space to devote.

Kelly
11-03-2004, 08:02 AM
Hi Kelly,

Given the size of the document and the horrible mess it was in originally, this is nothing short of amazing - 5 stars!! (+ 3 fish for the penguin)

John :bink:
By the way... In responding to mdmackillop, how could I ever forget to respond to your message, Johnske!

Thank you from the bottom of my heart for expressing so much gratitude and admiration for what I did. I sincerely appreciate your kind words.

mdmackillop
11-03-2004, 10:54 AM
Hi Kelly,
Regarding the Headers, I meant that they would go into a table by default if the tab count method was used. This may or may not be desirable.
MD

johnske
11-12-2004, 03:56 AM
Hi Kelly,

The (rather large) attachment is a "before" and "after" look at the doc you didn't see, if you're interested in checking it out....

I thought that instead of cleaning docs up every 3 mths it'd be better to go to the source and clean up the originals, so I cleaned one up, put it into Excel form (as in the attachment), sent it to the Regional recorder and asked would he like the originals done like this? - The response was an enthusiastic YES!...Please! (the original is something that's been put together by various voluntary workers over the years). :*)

Another (non-obvious) problem that I stumbled on when cleaning it up was that some lines hadn't an "enter" (linefeed/newline) - if the word wrap took it round to the start of a new line, that's how they'd left it. :yes After fixing this, there were only 5 (consecutive) lines that hadn't been converted to tables - but I could've missed something at that point.

n.b. As these are sporting records they are in the public domain, and, as they're only regional - not that spectacular anyway

Anyway, this makes a lotta ppl happy cos all the Clubs have to go thru these every week to check whether any of their members have got a record claim or not, so, on their behalf, thanx again :vv

John :bink:

Kelly
11-12-2004, 08:07 AM
You're more than welcome!


If I ever found some time, I was planning to mess with this some more to speed it up, but... you convinced them to use Excel! Good for you (and for them)!

so I guess our work on this particular job is done, eh?

Let us know about the next "monster doc" that gets handed to you!

bye for now :hi: