Consulting

Page 1 of 2 1 2 LastLast
Results 1 to 20 of 21

Thread: Solved: Cleaning up a Word document

  1. #1
    Administrator
    Chat VP
    VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location

    Solved: Cleaning up a Word document

    Hi,

    Every 3 months I receive 2 Word docs containing records, each is about 35 pages long.

    The layout is atrocious and really needs to be cleaned up to make them readable.

    They are set out in columns that are delimited by a combination of either 1, 2, or 3 tab spacings and/or 2 or more ordinary spacings on each line.

    This combination renders my attempts to use Words' text-to-table useless so I need a procedure that - for a start - would (say) find and replace 2 or more spaces (or tabs) with a single tab (keeping the lines intact) then either convert this text to a table OR export it to an excel workbook instead.

    How would I go about doing this? (can zip part of one if that helps)

  2. #2
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Is each record one single word? (even if the "word," for example, is something like "555-1212")

    What I'm trying to say is: WITHIN a record, are there ever any spaces?

    Because if NONE of the records contain spaces (or tabs), then I'm thinking you could do this:

    (possibly using regular expressions, or not)

    find a "word" (a string of ANY quantity of consecutive NON-space/NON-tab characters), then find the following "word," and then delete ANYTHING in between then (be it a space, two spaces, a tab, two tabs, etc) and put only a single tab between them.

    Would that work? If so, I could elaborate further. But it any of the records contain spaces, then I guess it kills my idea.

  3. #3
    VBAX Regular
    Joined
    Sep 2004
    Posts
    65
    Location
    the spaces aren't a big problem, if you use regular expressions. But Regular Expressions are best avoided unless you have no choice. they're rather confusing.

  4. #4
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    Quote Originally Posted by Kelly
    Is each record one single word? (even if the "word," for example, is something like "555-1212")

    What I'm trying to say is: WITHIN a record, are there ever any spaces?

    Because if NONE of the records contain spaces (or tabs), then I'm thinking you could do this:.......(etc)

    Would that work? If so, I could elaborate further. But it any of the records contain spaces, then I guess it kills my idea.
    Unfortunately, yes , some of them contain spaces....

    Have attached an abridged version for you to look at.

    This is only the start of the clean-up, because the real problem is that it's riddled with mistakes such as names being spelt incorrectly and in various ways such as i) J Smoth ii) J.Smath iii) JSmith when it is one person (J. Smith) being referred to and also, dates such as 8-9th Oct 02 being written as 8/9/10/02.....

  5. #5
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    I only meant that the spaces would be a problem if, say "Bob Smith" were a record, and elsewhere "January 2003" was actually TWO records, separated only by a single space. You see? how would a two-word single record be distinguishable from two separate one-word records separated only by a space.

    BUT - now that I have reread the original situation, it appears that two records always have AT LEAST one tab between them, so I guess my imagination ran away with me when I was envisioning the scenario.

  6. #6
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Thanks for posting a sample, Johnske!

    I'll start playing with it. (after I go grocery shopping )

  7. #7
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Okay, I couldn't resist.... (groceries can wait a little longer)

    just to give you an idea of what I have in mind, try this:

    open the find and replace dialog in Word, and make sure the option "use wildcards" is selected (checked, ON, etc)

    search for:
    ([^t]{1,3})

    that will find any and all of the following:
    1. a (all) single tab character
    2. TWO CONSECUTIVE tab characters
    3. THREE CONSECUTIVE tab characters

    if you continue clicking "find next" over and over, you should see how this goes down the document finding almost all of the divisions between recordsets that we need to find.

    Again, this is only the beginning. Try a search on Google for "Word wildcards" or something similar to learn more.

    just to be clear, I'm attaching a picture below of what the Find and Replace dialog should look like:

    (if you can't see the attached picture, you can click here)

  8. #8
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Hey, I'm back already.

    Well, here we go.

    This isn't perfect, but it's a darn good start.

    Open this doc (it's really just the same as the first document posted in this thread, only now it has macros)

    hit alt+f8 for the list of macros, and run the macro called "MakeSomeTablesHere"

    I'm testing on Word 2000, and I didn't get any errors.
    To everyone: let me know if you get any errors

    Thanks!!

  9. #9
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    Kelly, that is just so cool, it's par excellence!!

    Unfortunately, there is a few errors, the first one I found is on page 3 with the first column entry being "Sydney" - after that the entries are displaced 2 columns to the right.

    This is probably because in the original, there are a very large number of ordinary spaces, then a tab space between "Sydney" and "H.Bochow". I'm not sure what you've done in your procedure (have done very little in Word) so I don't know how this would be corrected...

    John

  10. #10
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Hi johnske!

    So glad you liked it.

    Unfortunately, I have to say goodnight for tonight, but I will return tomorrow to further clarify what I did.

    How's this for a cliff-hanger: I actually rewrote my macro just now and have now managed to get it to work with 100% accuracy (eliminating the errors that you described, such as with Sydney...H.Bochow).

    However, I don't want to "clog up" this thread by posting document after document. So what I will do is this:

    tomorrow (or Tuesday at the latest, I hope) I will go through my NEW macro and clean up the indentation and all that. Then I will add explanatory comments to the code. THEN I will post another zipped doc with a nice, clean, effective version of the macro.

    I think when you see the "clean" version with comments in the code, I think that you won't have the slightest difficulty understanding what I did, johnske. It's just that the first version is written in really sloppy code. I rushed through because I wasn't sure when someone might demand of me that I get off the computer!! So i wanted to post at least a sloppy sample tonight, because I KNEW that what you were asking was possible.

    So... see you tomorrow!

  11. #11
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    Ok Kelly, sleep tight....

    It's actually quite good as it is, it gives much better results than the in-built text-to-tables, but if you can do better - who am I to complain?

    Thanx so much,
    John

  12. #12
    Administrator
    VP-Knowledge Base VBAX Grand Master mdmackillop's Avatar
    Joined
    May 2004
    Location
    Scotland
    Posts
    14,489
    Location
    Hi Kelly,
    I've been "tidying up" documents for years, and never come across the ([^t]{1,3}) expression. It will make life a lot simpler!
    I found a useful site here http://word.mvps.org/FAQs/General/UsingWildcards.htm
    MD,
    PS. Please give your penguin and extra fish on me.

  13. #13
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Hello mdmackillop! (glad to have introduced you to ([^t]{1,3}))

    and hello johnske!

    Okay, here is the corrected, cleaned-up version.

    Besides making the code a little "prettier," the only changes I made were:

    1. search and replace the spaces-followed-by-tabs BEFORE searching for excess spaces

    2. "up the ante" so to speak, and search for up to 99 consecutive spaces, rather than the prior 15 or whatever quantity I had used

    Johnske,
    I would encourage you to "step through" the macro to learn more about it. (just open the VB editor window, place your cursor inside the macro, and then hit F8 over and over - while simultaneously checking the "action" ocurring in the document window. Just make sure not to click anywhere in the DOCUMENT while the macro is still being stepped through)

    I have added comments to the macro code to make its purpose clear.

    If the only confusing part is the "([* ]{1,55})([^t]{1,6})" type stuff, then that really isn't a VBA problem or a VBA challenge at all. That is just a wildcards problem. those same search strings can be used in the normal Word interface without using macros at all.

    Johnske,
    you never mentioned whether you tried this macro on any other "scorecard" documents beside the sample doc that you posted. Will it work in ALL of your scorecard documents? If it doesn't, then I could most likely tell you why.

    Here's why: I wrote my macro based on certain assumptions about the content of your document. Below is a list of my assumptions. Even if only ONE of the conditions is not met, then my macro will not be as effective:

    1. the "data lines" in the document (the lines such as "D/IAU 600 25cm R.Crump TCA 2/3/96 1086 60+") MUST BE WRITTEN IN SIZE 10 FONT.

    2. All other lines must NOT be written in size 10 font

    3. each set of data must be located just below the heading line that ends in the word "SCORE"

    4. there must be at least 3 hard returns between the end of a data set and the beginning of whatever follows it

    5. within any one data set, there must NEVER be more than one consecutive blank line between data lines

  14. #14
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    MD, I think the penguin deserves a cuppla fish now....

    Hi Kelly, Super!!

    There are a few errors, but not so many that they cant be handled manually now.

    Here are the stats of one of the complete docs before cleaning up
    Pages 36
    Words 4,519
    Characters(no spaces) 22,860
    Characters(spaces) 32,167 {24,649 after}
    Paragraphs 801
    Lines 2,021
    Because of the size of the doc, it takes about 5 min to run the macro, afterwards there were only 6 of the "displaced to the right type" errors (can handle that manually) however, for some reason, there were 80 lines (in "clumps" of 15 to 20, or in twos and threes) that'd had the spaces removed but weren't put into tables - this is not a prob as, because all the spaces have been removed it's quite simple now to use the built-in text-to-table (to 'manually' convert them)

    All in all, this gives far better results than the in-built procedure

    Thanx ever so much!
    John

  15. #15
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    Hi Kelly,

    The second doc was even worse as the data also included several font sizes.

    Reading your constraints, before using it again, I removed the many hundreds of superfluous (blank) lines "selected all" and set the font size to 10, then increased the font size of the headings and ran the macro.

    This only gave 11 'displaced to the right' type errors (n.b. not the 1st, but the 2nd, 3rd, or 4th col displaced one col to the right) and only 7 lines that weren't converted to tables.

    Given the size of the document and the horrible mess it was in originally, this is nothing short of amazing - 5 stars!! (+ 3 fish for the penguin)

    John

  16. #16
    Administrator
    VP-Knowledge Base VBAX Grand Master mdmackillop's Avatar
    Joined
    May 2004
    Location
    Scotland
    Posts
    14,489
    Location
    Hi Kelly,
    Just a thought, but if you could test MyRange for a count of tabs >1, instead of the Font 10, that might also work, and also put the Headres into the table. I haven't worked out how to do that yet though (no more time now)
    MD

  17. #17
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Quote Originally Posted by mdmackillop

    Hi Kelly,
    Just a thought, but if you could test MyRange for a count of tabs >1, instead of the Font 10, that might also work, and also put the Headres into the table. I haven't worked out how to do that yet though (no more time now)
    MD
    That's an excellent idea about checking each paragraph for a "tab count" rather than for font size.

    As far as putting the headers in the table, you are probably right about that, too. I DID actually consider putting them in, but I wasn't sure if we were later going to compile all of the individual data tables into one BIG table that would then only need one header to cover everything.

    Actually, I've been toying with thoughts of how to speed the whole thing up, but I've been working hard this week (for a change ) so I haven't had a lot of free brain space to devote.

  18. #18
    VBAX Regular Kelly's Avatar
    Joined
    Jun 2004
    Location
    Los Angeles, California
    Posts
    84
    Quote Originally Posted by johnske
    Hi Kelly,

    Given the size of the document and the horrible mess it was in originally, this is nothing short of amazing - 5 stars!! (+ 3 fish for the penguin)

    John
    By the way... In responding to mdmackillop, how could I ever forget to respond to your message, Johnske!

    Thank you from the bottom of my heart for expressing so much gratitude and admiration for what I did. I sincerely appreciate your kind words.

  19. #19
    Administrator
    VP-Knowledge Base VBAX Grand Master mdmackillop's Avatar
    Joined
    May 2004
    Location
    Scotland
    Posts
    14,489
    Location
    Hi Kelly,
    Regarding the Headers, I meant that they would go into a table by default if the tab count method was used. This may or may not be desirable.
    MD

  20. #20
    Administrator
    Chat VP VBAX Guru johnske's Avatar
    Joined
    Jul 2004
    Location
    Townsville, Australia
    Posts
    2,872
    Location
    Hi Kelly,

    The (rather large) attachment is a "before" and "after" look at the doc you didn't see, if you're interested in checking it out....

    I thought that instead of cleaning docs up every 3 mths it'd be better to go to the source and clean up the originals, so I cleaned one up, put it into Excel form (as in the attachment), sent it to the Regional recorder and asked would he like the originals done like this? - The response was an enthusiastic YES!...Please! (the original is something that's been put together by various voluntary workers over the years).

    Another (non-obvious) problem that I stumbled on when cleaning it up was that some lines hadn't an "enter" (linefeed/newline) - if the word wrap took it round to the start of a new line, that's how they'd left it. After fixing this, there were only 5 (consecutive) lines that hadn't been converted to tables - but I could've missed something at that point.

    n.b. As these are sporting records they are in the public domain, and, as they're only regional - not that spectacular anyway

    Anyway, this makes a lotta ppl happy cos all the Clubs have to go thru these every week to check whether any of their members have got a record claim or not, so, on their behalf, thanx again

    John

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •