Newbie questions: advanced table processing



calvink
07-16-2007, 05:18 AM
Hello, please pardon me if what I'm asking sounds foolish, ignorant or stupid. I'm actually more into web development, but I've been handed a top-priority project for my company that isn't part of my repertoire, namely VBA programming.

Here is the background. We have a lot of scanned documents in PDF format (hundreds of thousands) that need to be OCR'ed so that we can do data extraction and manipulation. These scanned documents come in different batches over time, hence there are at least a few different formats that they come in. In general, though, we're looking at a table consisting of a few columns, and possibly about 50 rows of data per page of the PDF. Apart from the main data in the table, there is also other information above and below the table (header and footer) that needs to be captured.

Using the Abbyy FineReader OCR package, I can have the OCR'ed results exported into either Excel or Word. Now let's say I do it in Word. So I have (for example) 10 Word files (generated from the 10 PDF sources), each of which has a big table filling the entire page, with all the data in it.

So, using VBA, can anyone please confirm that a program/macro/application can be written to parse each of the 10 Word files and perform a multitude of data-extraction tasks on the data in the table?

Here are some examples:
1) In the table headers, look for the column with 'Amount' text in it, and once found, process all the cells below this 'Amount' cell, in this column only. Then return the rows whose amount is more than 1000 (ie that particular amount plus all the other columns' values for this row).
2) Have all the results from 1) (ie, those records which have > 1000 in the 'Amount' column) saved into a file/db for further analysis (so basically we're just filtering out all the big amounts into a separate place).

etc...

So can this be easily/conveniently done via VBA, bearing in mind that I have to process more than 100,000 files eventually (of course not concurrently but that's another issue)?

If this is feasible then I will seriously consider posting it up in the Jobs Posting section as I really need a solution for this pronto. But please be aware that the whole process is actually a lot more complicated than it sounds here (OCR uncertainty characters, inconsistent table formats, the large number of files, etc are just some of the problems).

Thanks in advance for any comments and suggestions :)

fumei
07-16-2007, 10:34 AM
Aaiieeeeeee!

"can this be easily/conveniently done via VBA" ....ummm, define easily.

Can it be done? Yes. However...."OCR uncertainty characters, inconsistent table formats" will be the biggest problem. THAT could be seriously non-trivial.

Hard to say. I just did some rough testing of what you are asking, and sure, it can be done...but I was testing on a simple, cleanly created document.

Methinks your situation would get ugly pretty fast.

mdmackillop
07-16-2007, 10:45 AM
Hi Calvink,
Welcome to VBAX
Having tried OCR a few times, I've never found it reliable enough to trust. You could be prompted to correct cells containing non-numeric values, which might help.
Can you zip and post some sample tables so we can see how "corrupt" your data is?
Regards
MD

fumei
07-16-2007, 10:52 AM
Oh and please do NOT think your request is anywhere near foolish, ignorant or stupid. Under the circumstances you mention, your request seems reasonable and sane.

calvink
07-16-2007, 11:44 AM
Thanks for the replies, I share some of your concerns as well regarding this whole thing.

I will post some sample files from the office in a few hours. It's 2.40am here and I'm gonna go get some shuteye now! :)

Cheers.

fumei
07-16-2007, 12:47 PM
Here is a sample start. It can be done differently. You can test with the attached document by clicking "Test1" on the top toolbar.

You can test by changing the numbers in the "Amount" column.

You can also move the Amount column. The code runs on finding any column with "Amount" in a cell on the first row. So if you moved it to be the third column, or the first column, it does not matter.

It sets a Column object for the column "Amount" is found in, then tests each cell of that column as IsNumeric, adding up those that are numeric.

This is VERY simplistic and undoubtedly would need to be greatly modified to work with the situation you describe.

Option Explicit

Sub Test1()
    Dim aTable As Table
    Dim aCell As Cell
    Dim aCellNum As Cell
    Dim aRow As Row
    Dim aCol As Column
    Dim j As Long

    Set aTable = ActiveDocument.Tables(1)
    ' Look along the first row for the "Amount" header
    Set aRow = aTable.Range.Rows(1)
    For Each aCell In aRow.Cells
        If CellText(aCell) = "Amount" Then
            ' Set a Column object for the column "Amount" was found in
            Set aCol = aTable.Columns(aCell.ColumnIndex)
            ' Add up every cell in that column that holds a number
            For Each aCellNum In aCol.Cells
                If IsNumeric(CellText(aCellNum)) Then
                    j = j + CLng(CellText(aCellNum))
                End If
            Next
            Exit For
        End If
    Next
    If j > 1000 Then
        MsgBox "Amount column Total " & j & " is > 1000."
    Else
        MsgBox "Amount column Total is less than 1000."
    End If
End Sub

' Strip the end-of-cell marker (Chr(13) & Chr(7)) from a cell's text
Function CellText(aCell As Cell) As String
    CellText = Left(aCell.Range.Text, Len(aCell.Range.Text) - 2)
End Function

calvink
07-16-2007, 10:28 PM
Many thanks for your test script fumei. I've tried it out - you're actually adding all the Amount values together and comparing the total with 1000, right?

Well actually my goal is somewhat different. Let me try to explain it in a different way. Basically I need to parse each row/record in the table, and return the rows that have an Amount column value > 1000.

Name    ID      Amount
--------------------------
John    1234    800.00
Jane    3345    60.00
Sam     5432    1010.66

So here I will only get Sam's data (Sam, 5432, 1010.66). Let's say I get 10 of these rows which are > 1000. Then I would like to save them say in an Excel file so I can do further manipulation.
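
To put it in rough sketch form (I can't really write proper VBA yet, so the names here are made up; it assumes the Amount cells are clean numbers, and it borrows fumei's CellText function from above):

Sub ExportBigAmounts()
    Dim aTable As Table
    Dim aRow As Row
    Dim xlApp As Object
    Dim xlSheet As Object
    Dim r As Long
    Dim c As Long
    ' Drive Excel from Word via automation
    Set xlApp = CreateObject("Excel.Application")
    xlApp.Workbooks.Add
    Set xlSheet = xlApp.ActiveWorkbook.Sheets(1)
    r = 1
    Set aTable = ActiveDocument.Tables(1)
    For Each aRow In aTable.Rows
        ' Amount is column 3 in the example above
        If IsNumeric(CellText(aRow.Cells(3))) Then
            If CDbl(CellText(aRow.Cells(3))) > 1000 Then
                ' Copy the whole matching row into the next Excel row
                For c = 1 To aRow.Cells.Count
                    xlSheet.Cells(r, c).Value = CellText(aRow.Cells(c))
                Next
                r = r + 1
            End If
        End If
    Next
    xlApp.Visible = True   ' hand the filtered rows over for further work
End Sub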

The reason is that, of all the thousands of PDFs we have, only a very small % are 'high value cases' that we're interested in, and these are the ones > 1000 (or any arbitrary value that we can set).

I hope that makes it clearer. Sounds pretty simple huh.

OK then, do have a look at the attached sample doc. This is an actual output from the OCR package. The first page has a different format and needs to be processed separately. The rest are all tables, as I explained in my first post. Notice all the 'errors'/uncertain characters highlighted in blue - this is largely due to the low-res nature of the original scanned PDF (which I'll need to somehow fix to reduce the errors) and also 'noise'/marks on the paper that have been picked up by the scanning.

So a few notes:
- The documents are in Malay language
- Columns (from left) are Number, Name, Address, ID number, Amount (RM), Reference Number and Date (month/year).
- The 'Amount' cell is normally corrupted (eg we get 'AAu<a' etc), and it's usually not on the very first row.
- The ultimate aim here is to capture all records with Amount > 1000 and output these results into Excel. Uncertain characters, if possible, need to be captured so that we can go through each row and refer back to the original PDF to double-check the data if the level of uncertainty is high. This also brings up the issue of capturing the original document's page & serial number for the cross-referencing, but that's another post altogether.

One question: how hard (if at all) would it be to code this in Excel VBA? Would the macro still be similar to the Word one provided by fumei? I ask because I can choose to output the OCR to Excel instead of Word - Word is chosen for now just because it can highlight the uncertain characters.

So, what do you guys think of my nightmare? :grinhalo::stars::cry:

mdmackillop
07-17-2007, 02:16 AM
Hi Calvink,
Post this in the Excel forum with a link to here. No harm in exploring both options and getting two points of view.

fumei
07-17-2007, 08:52 AM
Actually, the original test doc I was attaching had code for when the total for the column added up to > 1000.

However, I changed it when I re-read your post, which stated:

Then return results that are more than 1000 (ie that particular amount plus all the other columns' values for this row)

It is easy enough to change it so each cell value is tested to be > 1000.
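
For example, the testing loop could become something like this - just a sketch, assuming clean numeric cells, and reusing the CellText function from my earlier post:

Sub Test2()
    Dim aTable As Table
    Dim aCell As Cell
    Dim aCol As Column
    Dim msg As String
    Set aTable = ActiveDocument.Tables(1)
    ' Find the "Amount" column, as in Test1
    For Each aCell In aTable.Range.Rows(1).Cells
        If CellText(aCell) = "Amount" Then
            Set aCol = aTable.Columns(aCell.ColumnIndex)
            Exit For
        End If
    Next
    If aCol Is Nothing Then Exit Sub
    ' Flag each individual cell over 1000, instead of summing
    For Each aCell In aCol.Cells
        If IsNumeric(CellText(aCell)) Then
            If CDbl(CellText(aCell)) > 1000 Then
                msg = msg & "Row " & aCell.RowIndex & ": " & _
                    CellText(aCell) & vbCrLf
            End If
        End If
    Next
    MsgBox "Cells > 1000:" & vbCrLf & msg
End Sub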

You are in big trouble.

You mention "errors'/uncertain characters highlighted in blue". What makes the "8" in 80 15 an error?

What makes the "." (dot) an error in 138.92???? It looks like a legitimate number to me. Especially in that it seems in line with the other values. What makes the "." (dot) in 35.91 an error????

Let's look at another partial listing:
9844
32905
274.93
7242
280 61
13429
652S

Interestingly enough the "S" in 652S is NOT highlighted as an error (in blue).

Looking at the data (as it stands), you COULD write logic that takes the cell content, and:

1. check if it is blank
2. if it is NOT, then check if it IsNumeric
3. if it is NOT, strip off any leading and trailing alpha characters, as well as spaces
4. if it has alpha characters still (ie. within the string), and it is NOT a dot (.), put it aside for manual checking
5. take the content (ie. ones with no inside alpha characters) and process it numerically.
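
As a very rough sketch, that logic could look like the function below. CleanAmount is a made-up name, not something from my attachment: it returns the cleaned numeric string, or an empty string for cells that need manual checking.

' Made-up helper: clean a cell string per steps 1-5 above
Function CleanAmount(ByVal s As String) As String
    Dim i As Long
    s = Trim(s)
    If Len(s) = 0 Then Exit Function              ' 1. blank - nothing to do
    If IsNumeric(s) Then                          ' 2. already numeric
        CleanAmount = s
        Exit Function
    End If
    ' 3. strip leading and trailing non-digit characters (and spaces)
    Do While Len(s) > 0 And InStr("0123456789", Left(s, 1)) = 0
        s = Mid(s, 2)
    Loop
    Do While Len(s) > 0 And InStr("0123456789", Right(s, 1)) = 0
        s = Left(s, Len(s) - 1)
    Loop
    ' 4. anything left inside that is not a digit or a dot means
    '    the cell must be put aside for manual checking
    For i = 1 To Len(s)
        If InStr("0123456789.", Mid(s, i, 1)) = 0 Then Exit Function
    Next
    CleanAmount = s                               ' 5. safe to process numerically
End Function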

EXCEPT....you may have a problem. Look at another sample:
5 22
10.57
5.10
306 11
10 54
15 75
15.24
97 23
503

Now clearly you can infer that:

5 22 should be 5.22
306 11 should be 306.11
10 54 should be 10.54

But what do you do with 503? Considering how many errors are possible, is it 5.03, or 50.30? It IS possible to write logic that would make a pretty good guess. From checking, it does not seem that the OCR dropped digits - so my guess is that it is 5.03, not 50.30.
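
That guess could be sketched like this. GuessDecimal is a made-up name, and the "treat an inner space as the lost dot, and give bare integers two decimal places" rule is exactly that - an assumption about this data:

' Made-up helper: guess the intended value of an OCR'd amount
Function GuessDecimal(ByVal s As String) As Double
    s = Trim(s)
    If InStr(s, " ") > 0 Then
        s = Replace(s, " ", ".")                      ' "306 11" -> "306.11"
    ElseIf InStr(s, ".") = 0 And Len(s) > 2 Then
        s = Left(s, Len(s) - 2) & "." & Right(s, 2)   ' "503" -> "5.03"
    End If
    If IsNumeric(s) Then GuessDecimal = CDbl(s)       ' 0 means "no good guess"
End Function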

What do you do with "55.6?"? Clearly the "?" is a digit...but which digit?

My opinion? Yes...it can be done. There is some very messy string manipulation/validation involved, but it is do-able.

If the column required were always #5, that would make it straightforward to identify which column to use. No need to search for the column with "Amount". You could ignore the weirdness of AAuwhateverwhatever. Just use column 5. Except.....it is not always #5. The first table has the Amount column as #5, but the next table has it as #6.

Oh-oh.

fumei
07-17-2007, 09:03 AM
Oh, as for the Excel vs Word issue: since the logic required to get information out of the files is the same, AND you seem to be dealing with numbers...it is a no-brainer really.

Excel is for numbers.
Word is for text.

BTW: I did succeed in getting quite good values out of my test on the first table in your sample. It was annoying that the first cell value was:

100.94 138.92

That is, TWO (I think) values NOT on separate rows. It is the only one like that, but it made it a pain. Plus, of course, I had no idea what to do with 2C.S0. However, other than that 2C.S0, I got every value out.

So again, I believe it can be done. If you have posted it over in the Excel forum I am sure you will get excellent help.

Good luck.

fumei
07-17-2007, 09:08 AM
Oh, and regarding the column with the amount? It would be (and is) easy to have the code process the table the Selection is in. Put the cursor in a table, put up an InputBox asking which column, then process with the column number entered.

I just hate using Selection if I can avoid it. Also, considering the volume, you would certainly want to automate this as much as possible.

calvink
07-18-2007, 04:11 AM
Well fumei, thanks for all your input - your points are certainly valid and need to be considered thoroughly. What do you think if we just look at the big picture: ignore the uncertainties, and just capture all records which could be > 1000? This involves checking for the presence of the decimal point, its position, and what to do if it's found or not found. Naturally this involves some complex logic to cover all conceivable possibilities.

Once these possible >1000 rows have been captured and stored in another file, we can further process this filtered-down list, which hopefully won't be as big as the original document. Then at least if we have to do manual checking, it won't be as much as the original.

For the Amount column we can always make use of surrounding 'anchors' from adjacent columns/rows to guess the location of the column.

Gets uglier :)

fumei
07-18-2007, 10:24 AM
If the Amount column varies a lot, I would first run a small procedure to get the columns to be processed. It would move through the document, from table to table, asking which column is the one to use.

The input numbers would be put into an array. You can then use the array for processing the correct column for each table. Like this:

Option Explicit

Public ColDo() As Long

Sub GetColumns()
    Dim j As Long
    Dim ColNum As Long
    Dim var As Long
    Selection.HomeKey Unit:=wdStory
    ' If the document does not start inside a table, move to the first one
    If Selection.Information(wdWithInTable) = False Then
        With Selection
            .GoTo what:=wdGoToTable, which:=wdGoToNext
            .Collapse Direction:=wdCollapseStart
        End With
    End If
    For var = 1 To ActiveDocument.Tables.Count
        ColNum = InputBox("Which column???? " & _
            "If NOT applicable, input 0.")
        ' Store the answer for this table in the array
        ReDim Preserve ColDo(j)
        ColDo(j) = ColNum
        j = j + 1
        ' Move the Selection on to the next table
        With Selection
            .GoTo what:=wdGoToTable, which:=wdGoToNext
            .Collapse Direction:=wdCollapseStart
        End With
    Next
End Sub


Sub TestTableCol()
    Dim var As Long
    Dim msg As String
    Call GetColumns
    ' Walk the array: one entry per table, 0 = skip
    For var = 0 To UBound(ColDo)
        Select Case ColDo(var)
            Case 0
                msg = msg & "Table " & var + 1 & _
                    " has no Amount column. No action." & vbCrLf
            Case Else
                msg = msg & "Table " & var + 1 & _
                    " will process column " & ColDo(var) & "." & vbCrLf
        End Select
    Next
    MsgBox msg
End Sub

For demo purposes the procedure displays a message box stating which column will be processed for each table. For the demo file you posted, the result is:

Table 1 has no Amount column. No action.
Table 2 will process column 5.
Table 3 will process column 6.
Table 4 will process column 5.
Table 5 will process column 6.
Table 6 will process column 5.

OK? Except........damn it......

I changed the procedure to production-type instructions, like this:

Sub TestTableCol()
    Dim var As Long
    Call GetColumns
    For var = 0 To UBound(ColDo)
        Select Case ColDo(var)
            Case 0
                ' do nothing, go to next table
            Case Else
                Call ActionTable(var + 1, ColDo(var))
        End Select
    Next
End Sub

Sub ActionTable(k As Long, j As Long)
    Dim aTable As Table
    Dim aCol As Column
    Set aTable = ActiveDocument.Tables(k)
    Set aCol = aTable.Columns(j)
    ' test with a message
    MsgBox "Actioning Table # " & k & " with column #" & j
End Sub
See? The array of which column to process is built, then each item in the array is matched to each table.

If item = 0 (not applicable - eg. the first table), do nothing.
If item <> 0, then set a Table object = doc Tables(array index), and set a Column object = Table object Column (array value).

Then....process the cells in that column.

Except.....damn it, damn it....some tables are messed up by having mixed cell widths. A Column object can not, repeat not, be set for a column in a table with mixed cell widths.

This is a VERY serious problem. Pardon my language....
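
If you want to trap the failure rather than crash on it, a sketch like this works (I believe the error Word raises for mixed cell widths is 5991, but check the number on your version):

Sub TryColumn(aTable As Table, j As Long)
    ' With mixed cell widths the Set line fails and aCol stays Nothing
    Dim aCol As Column
    On Error Resume Next
    Set aCol = aTable.Columns(j)
    On Error GoTo 0
    If aCol Is Nothing Then
        MsgBox "Mixed cell widths - this table needs fixing first."
    End If
End Sub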

Tommy
07-18-2007, 02:36 PM
Could you post a couple of the PDF docs? I have version 6 of Abbyy FineReader and a couple of others; I may be able to get these OCR'd and maybe they would process better.

calvink
07-19-2007, 08:36 PM
Damn, I thought I posted some replies but apparently they're either gone or I did not hit the submit button! :banghead:

Anyway fumei, about the problem with the mixed cell widths: can't we just set all cells to a particular width? As long as the data within is not limited by the cell width (just like in regular Excel) we should be able to read the data correctly? Not sure whether it's possible to do it programmatically like this...

And how do I run the code that you posted? Do I go into Tools > Macros > Macros > Create, or Tools > Macros > Visual Basic Editor? I tried pasting the code into the VB editor, but it couldn't run - something about a compile error. I've done some VB programming ages ago but it eludes me now :think:

calvink
07-19-2007, 08:44 PM
Could you post a couple PDF docs? I have version 6 of the abbyfinereader and a couple of others, I may be able to get this OCR'd and maybe it would process better.
Tommy, here are the source PDFs for the Word docs I posted earlier. I was hoping someone could have a crack at OCR'ing these buggers and produce better, cleaner results than me... :) Please rename the file extension from .pdf to .zip since this forum doesn't allow .zip files bigger than 200k!

Also if anyone's interested in the Excel output here is the link:

http://www.vbaexpress.com/forum/attachment.php?attachmentid=6246&d=1184752558

Thanks for your help :)

Tommy
07-20-2007, 07:07 AM
Hi calvink, :hi:

From what I can see from the pdf's you posted, you will need to get them rescanned.

The OCR can't do a really good job due to the size of the text in the printed documents. I believe that the PDFs were created by scanning; otherwise we could possibly export the text (via automation) - I tried, to no avail.

I also tried printing to an 11x17 page, cutting the page down, and scanning again for OCR. This worked much better, but there is still a lot of work to do before the information could be processed and trusted. The increase in size from 8.5x11 to 11x17 was only about 25%. Even then, some of the text was still faded out, so the OCR software didn't have much of a chance to pick up the right character/number. :banghead:

LOL, one of the OCR results I got was "ISO" for "180", and that was on the good scan :rofl:

fumei
07-20-2007, 01:10 PM
ISO for 180? That seems pretty darn close! Not...ummmm, good enough, it is true...but hey, pretty close.

calvink, the reality is that I have never seen OCR results that are 100%. Ever. Well, not if the document has more than four words in it...and even then....

Numbers are even worse. So: you may be able to get better results than you have, but never 100%. Actually, I would say it is running now at about 80%...ummm, OK, maybe less. And that is the reality.

Regarding the merged cells: the problem is the columns the OCR created. In table 3 of your sample, there are eight columns...except about 1/4 of the way down, where there are nine. There is no data in the extra cells, and I wish they could be ignored, but you can NOT make a Column object in a table with mixed cells.

You could, I suppose, write a procedure that, on getting the Column object creation error, runs through the offending table counting the columns for each row, and if a row has a "deviant" column count...deletes the row. Then you could make the Column object.

Except....there are two different mixed cell problems.

1. a cell that merges vertical - ie. cell 1 (and ONLY cell 1) covers two rows

2. a row that has crazy "extra" columns.

#2 is the worst. However, with no error trapping to make sure there is no desired data(!!), you can "fix" the tables with this:

Sub RemoveMixedCrap(aTable As Table)
    Dim aRow As Row
    ' The messed-up tables all have an extra first column - delete it
    aTable.Columns(1).Delete
    ' Then delete any row that still has a deviant column count
    For Each aRow In aTable.Rows
        If aRow.Range.Columns.Count > 8 Then aRow.Delete
    Next
End Sub

The messed up tables all have an extra column, the first one. So I delete it. Then I search for any rows with a deviant number of columns - there should be 8 - and delete those rows.

Now, except for the first (title) table, all data tables have the amount in column 5. So ignore Table 1, and process all the other tables.

Check each cell in Col5 (set as a column object):

1. if blank, go to next cell
2. if not blank, use IsNumeric
3. if numeric, make sure the number is correctly decimalized - ie. if value is 7438, this becomes 74.38
4. if not blank, and NOT numeric, start parsing the string

eg.

"' 4234" needs to become 42.34

etc. etc. etc.
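
Pulling those steps together, a rough sketch - this reuses my CellText function, plus the made-up CleanAmount and GuessDecimal helpers from earlier posts, so treat it as a starting point, NOT a finished solution:

Sub ProcessAmountColumn(aTable As Table)
    Dim aCol As Column
    Dim aCell As Cell
    Dim s As String
    Set aCol = aTable.Columns(5)        ' Amount is column 5 once tables are fixed
    For Each aCell In aCol.Cells
        s = Trim(CellText(aCell))
        If Len(s) = 0 Then
            ' 1. blank - go to next cell
        ElseIf IsNumeric(s) Then
            Debug.Print GuessDecimal(s)             ' 2/3. numeric - decimalize
        Else
            s = CleanAmount(s)                      ' 4. parse the string
            If Len(s) > 0 Then
                Debug.Print GuessDecimal(s)
            Else
                Debug.Print "Manual check: " & CellText(aCell)
            End If
        End If
    Next
End Sub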