PDA

View Full Version : Solved: Extract Data From PDF Files to Excel



Carpiem
10-02-2005, 02:09 PM
Hello To All,

I was wondering given all the amazing things people achieve with VBA, is it possible to extract data / strings from PDF files using VBA?

If the answer is yes.... would someone please give me some insight as to how this is done.

Thank you,

Carpiem

BlueCactus
10-02-2005, 10:13 PM
Welcome to VBAX, Carpiem.

The good news is that yes, it is possible. The bad news is that to do it in VBA would probably (to the best of my knowledge) require extensive understanding of PostScript. But if you search around on the web, you'll probably be able to find programs for converting PDF to Word, plain text, HTML, and various other formats. Some of these programs are free and maybe they'll give you a good starting point.

Now, if the formats of your PDF documents all follow a similar design, you *might* be able to use VBA to extract the info you need with minimum or even no knowledge of PostScript. With even minor differences between files, things are going to get complicated.

Ivan F Moala
10-03-2005, 01:47 AM
If You use XL2003, then yes and you can do this via automation and MDI
But the process would require taking an image snap shot of the PDF file.

sheeeng
10-03-2005, 03:05 AM
If You use XL2003, then yes and you can do this via automation and MDI
But the process would require taking an image snap shot of the PDF file.

Can you explain it step by step if you have time? I'm looking forward to try it.
Thanks in advence.

Carpiem
10-05-2005, 01:47 PM
BlueCactus,

Thank you for the precis. I actually use an excellent program (Read Iris) to convert PDF to Excel. As with most things though there is cleaning up to do after the conversion.

It seems I might better ask a few questions to do with cleaning up the file rather than approaching the exercise from an admittedly intriguing and new process to me.

Carpiem

Welcome to VBAX, Carpiem.

The good news is that yes, it is possible. The bad news is that to do it in VBA would probably (to the best of my knowledge) require extensive understanding of PostScript. But if you search around on the web, you'll probably be able to find programs for converting PDF to Word, plain text, HTML, and various other formats. Some of these programs are free and maybe they'll give you a good starting point.

Now, if the formats of your PDF documents all follow a similar design, you *might* be able to use VBA to extract the info you need with minimum or even no knowledge of PostScript. With even minor differences between files, things are going to get complicated.

Carpiem
10-05-2005, 01:59 PM
Mr. Moala,

Never thought I would have the pleasure of responding to you. Have learned a good deal from you in the past and daresay if I was up to it would learn much more.

Confess to knowing zero about Multiple Document Interface... the program I dowloaded needed at least minimal understanding of the process to be invoked.

I shall plod along and try to refine my needs in a more subdued manner.

Thank you,

Carpiem

If You use XL2003, then yes and you can do this via automation and MDI
But the process would require taking an image snap shot of the PDF file.

mdmackillop
10-05-2005, 02:12 PM
Hi Carpiem,
Can you attach a sample of an imported spreadsheet, which might give an idea of the "clean up" required. You can attach a file by zipping it and using the Manage Attachments button found in the Go Advanced option.
Regards
MD

Carpiem
10-05-2005, 04:23 PM
Good Evening MD,

Have attached the much shortened file for your scrutiny. My explanation may not be to clear... have to wait and see.

I do really appreciate you taking time out to look at this.

Regards,

Carpiem

Hi Carpiem,
Can you attach a sample of an imported spreadsheet, which might give an idea of the "clean up" required. You can attach a file by zipping it and using the Manage Attachments button found in the Go Advanced option.
Regards
MD

mdmackillop
10-05-2005, 07:04 PM
Hi Carpiem,
I'm not too clear on obtaining the results you show in G38. Can you explain why this should be 120. If this is correct, the code can be modified as required. The split is done in the Sample columns by two simple UDF (user defined formulae) in Module 1, which can be manipulated to suit.
Regards
MD

Carpiem
10-06-2005, 11:09 AM
Hello MD,

The UDF's you created are just what "The Doctor Ordered" and much appreciated.

I can't explain what I was thinking when I inserted column G.... hence no idea what to say. In short order though I will be using your functions and adding to them.

On the subject of UDF's how does one incorporate them into vba. In this instance I am referring to the possibility of having the UDF's automatically do their job when the "Quoted Cost" (E38) was entered or pasted in?

I do not wish to waste your time and am only exploring the possibility of further automating the order entry process.

Regards,

Carpiem

mdmackillop
10-06-2005, 01:07 PM
Hi Carpiem,
There's no question of "wasting time", that is just what we do here.
One method as follows. Paste the code into the worksheet module, edit the intersect range and adjust the offset values etc. as appropriate. If you don't need the formulae, and only the result, then this could be extracted and pasted in as values. It all depends on your purposes.
Regards
MD


Private Sub Worksheet_Change(ByVal Target As Range)
On Error GoTo LastLine
If Not Intersect(Target, Range("E38:E45")) Is Nothing Then
Target.Offset(0, 3).Formula = "=UOM(E" & Target.Row() & ")"
Target.Offset(0, 4).Formula = "=UNIT(E" & Target.Row() & ")"
End If
LastLine:
End Sub

Carpiem
10-06-2005, 02:25 PM
Hello MD,

Seems everyone is exceedingly helpful here, which is great given the alternative. So once again 'Thank you".

Looking at your code and given that the range could be anything from a few rows to hundreds, would it be sufficient for me to use a named range to cater for additions or deletions to the rows and still incorporate your code as is?

Regards,

Carpiem

mdmackillop
10-06-2005, 02:52 PM
Hi Carpiem,
A named range would work fine. The only reason for the range is to exclude cells where you would not wish the code to run. eg, titles, totals or whatever. You could also get the code to run only if the text contains the / character. It's all depends on your conversions.
Regards
MD

brettdj
10-06-2005, 06:03 PM
FWIW, I once did something similar from outlook whereby I sent the pdf to an adobe email address that converted the pdf to text, then I scanned for a reply email from this adress, then used a RegExp to extract the portion of data from the text.

Carpiem
10-07-2005, 07:31 AM
Hello MD,

Well that is clear enough. Thank you again.

Regards,

Carpiem

Hi Carpiem,
A named range would work fine. The only reason for the range is to exclude cells where you would not wish the code to run. eg, titles, totals or whatever. You could also get the code to run only if the text contains the / character. It's all depends on your conversions.
Regards
MD

Carpiem
10-07-2005, 07:37 AM
Hello BrettDJ,

I understand what you are saying.... the problem is that I have never heard of an Adobe E-Mail address before and searching on the web has not revealed anything relevant.

Could you please elaborate on the procedure you used a little more as it sounds like something I should explore.

Regards,

Carpiem


FWIW, I once did something similar from outlook whereby I sent the pdf to an adobe email address that converted the pdf to text, then I scanned for a reply email from this adress, then used a RegExp to extract the portion of data from the text.

Bob Phillips
10-07-2005, 10:20 AM
Hello BrettDJ,

I understand what you are saying.... the problem is that I have never heard of an Adobe E-Mail address before and searching on the web has not revealed anything relevant.

Could you please elaborate on the procedure you used a little more as it sounds like something I should explore.

Perhaps this is what Brett was referring to http://www.adobe.com/products/acrobat/access_onlinetools.html

Carpiem
10-07-2005, 02:35 PM
Hola XLD,

Gracias por dirigirme a ese sitio.

?Eso es un Saxophone muy fresco tambi?n!

Respeto,

Carpiem

Carpiem
10-07-2005, 03:29 PM
Good Evening MD,

I have more or less got my head around the way you constructed the UDF's and the vba to automate them.

Would you be willing to walk me through the next stage i.e. "You could also get the code to run only if the text contains the / character." .... ?

Regards,

Carpiem


Hi Carpiem,
A named range would work fine. The only reason for the range is to exclude cells where you would not wish the code to run. eg, titles, totals or whatever. You could also get the code to run only if the text contains the / character. It's all depends on your conversions.
Regards
MD

mdmackillop
10-07-2005, 03:43 PM
Good evening Carpiem,
A simple modification of the WorkSheet sub can test for / in a string, if is found, the split takes place, if not, you can exit the sub, or as in this example, clear the UOM and Unit cells.

Private Sub Worksheet_Change(ByVal Target As Range)
On Error GoTo LastLine
If Not Intersect(Target, Range("E38:E45")) Is Nothing Then
If InStr(1, Target, "/") = 0 Then
Target.Offset(0, 3).ClearContents
Target.Offset(0, 4).ClearContents
Else
Target.Offset(0, 3).Formula = "=UOM(E" & Target.Row() & ")"
Target.Offset(0, 4).Formula = "=UNIT(E" & Target.Row() & ")"
End If
End If
LastLine:
End Sub

Regards
MD

mdmackillop
10-07-2005, 05:36 PM
A further thought.
If you are running other code which changes a sheet containing a WorkSheet Change type macro, it is best to add "Application.EnableEvents = False" at the head of the code, resetting it to True at the end (very important - it's not reset by default). This prevents the WorkSheet code from running each time the sheeet is changed by your routines.

brettdj
10-08-2005, 11:04 PM
Perhaps this is what Brett was referring to http://www.adobe.com/products/acrobat/access_onlinetools.html
Thanks Bob

Carpiem
10-10-2005, 11:09 AM
Good Afternoon MD,

Thank you,

Carpiem

A further thought.
If you are running other code which changes a sheet containing a WorkSheet Change type macro, it is best to add "Application.EnableEvents = False" at the head of the code, resetting it to True at the end (very important - it's not reset by default). This prevents the WorkSheet code from running each time the sheeet is changed by your routines.