Extract data from PDF to Excel



kevvukeka
06-23-2013, 11:39 PM
Hi All,

I need some direction on solving the problem below. I know it's not going to be easy, so I'd like to know what different approaches I could try.

Problem: I have thousands of PDF files. Each file has a specific format, as below:

Name:XXXXX DOB:XXXXXXX
ID:XXXXXXX

Dependent Name:XXXXXX
DOB:XXXXXXX


I have to extract this specific data to Excel before entering it into a software package.

Is there a way to extract specific data from PDFs into Excel sheets that is better than going through each PDF by hand? I don't know how OCR works, but would it help here? I just came across it on Google.

Kindly provide your suggestions.

Thanks for your help.

Sock
06-24-2013, 10:48 AM
I'm not sure what the pros here might say, but when I have to pull from PDF reports and such, I've used the features of a PDF reader to extract the data. I used Nitro PDF once, which was able to pull data out into Excel surprisingly well. One useful feature was that I could do it from the File menu, so I could select a lot of different files and extract from all of them at once, I believe. The big downside is that you have to pay for the program.

I'd be interested in hearing what the others have to say, though. Good question! :)

Kenneth Hobs
06-24-2013, 11:03 AM
Depends on the type of PDF file. Attach an example. Obfuscate the data if needed.

SamT
06-24-2013, 03:30 PM
You might try this freebie:

http://www.generalfreeware.com/freeware/convert-pdf-to-text-file-17264.htm

They say it will do batch conversion of many PDFs into one text file.

Once you have the PDFs converted to one or more text files, VBA can import them into Excel.

When you've tried it and decided, show us a few lines of the output text.
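
For anyone going the text-file route, here is a minimal sketch of the parsing step, written in Python rather than VBA purely for illustration. It assumes the converted text keeps the labels from the sample in the first post (Name, DOB, ID, Dependent Name) and that each text file holds one record. The function names, the "Dependent DOB" column name, and the folder and file paths are made up for the example; real converter output may be laid out differently.

```python
import csv
import re
from pathlib import Path

# Labels copied from the sample layout in the first post.
_LABELS = "Dependent Name|Name|DOB|ID"

# A value runs from the colon up to the next label or the end of the line.
FIELD = re.compile(
    rf"\b({_LABELS})[ \t]*:[ \t]*(.*?)(?=\s+(?:{_LABELS})[ \t]*:|\s*$)",
    re.MULTILINE,
)

# "Dependent DOB" is a made-up column name for the second DOB field.
COLUMNS = ["Name", "DOB", "ID", "Dependent Name", "Dependent DOB"]


def parse_record(text):
    """Pull the five fields out of the text of one converted PDF."""
    record = dict.fromkeys(COLUMNS, "")
    seen_dependent = False
    for label, value in FIELD.findall(text):
        if label == "Dependent Name":
            seen_dependent = True
            record["Dependent Name"] = value
        elif label == "DOB" and seen_dependent:
            # A DOB that follows the dependent's name belongs to the dependent.
            record["Dependent DOB"] = value
        else:
            record[label] = value
    return record


def write_csv(records, csv_path):
    """Write the records to a CSV file, which Excel can open directly."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records)


def convert_folder(text_dir, csv_path):
    """Parse every .txt file in text_dir into one CSV; return the row count."""
    records = [
        parse_record(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(text_dir).glob("*.txt"))
    ]
    write_csv(records, csv_path)
    return len(records)
```

Calling `convert_folder("converted_text", "people.csv")` (both names hypothetical) would produce one row per text file. If the converter merges all PDFs into a single text file, the text would first need to be split into one chunk per record.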

Kenneth Hobs
06-24-2013, 06:27 PM
You can probably do it if you have Adobe Acrobat, not Adobe Reader. To reference the object, see this example. http://www.vbaexpress.com/forum/showthread.php?t=40734

For another 3rd party converter, I found this one: http://www.sejda.org/shell-interface/tutorial/

Obviously, you can Shell() to a 3rd party console program. Here is an example that I posted for pdfsam which is similar to sejda. http://vbaexpress.com/forum/showthread.php?p=180767

I could probably do it with iTextSharp in vb.net but that might be too involved for you.
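
To illustrate the idea of shelling out to a console converter once per PDF, here is a rough sketch in Python (not VBA, and not the pdfsam or sejda command line). The argument order `<converter> <input.pdf> <output.txt>` is an assumption made for the example; check the documentation of whichever tool you pick for its real syntax. The function names are also just for illustration.

```python
import subprocess
from pathlib import Path


def build_command(converter, pdf_path, txt_path):
    """Build the argument list for one conversion.

    Assumed (hypothetical) calling convention:
    <converter...> <input.pdf> <output.txt>
    """
    return [*converter, str(pdf_path), str(txt_path)]


def convert_all(converter, pdf_dir, out_dir):
    """Run the converter once per PDF; return the names of files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        result = subprocess.run(
            build_command(converter, pdf, target),
            capture_output=True,
            text=True,
        )
        # A non-zero exit code is treated as a failed conversion.
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

The converter is passed as a list (for example `["some-converter.exe"]`, a placeholder name) so that paths containing spaces do not need manual quoting. Collecting the failures instead of stopping at the first one matters when there are thousands of files.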

kevvukeka
06-24-2013, 10:59 PM
Thanks for your suggestions. Will check these options and post my feedback... Thanks again...

evanpan
02-04-2016, 04:14 AM
Are there any 3rd-party toolkits that make this kind of processing simple and fast?

Tommy
02-04-2016, 03:38 PM
You can save the PDF file as an Excel file. Use axAcroPDF.dll (http://stackoverflow.com/questions/23687564/documentation-for-adobe-pdf-reader-control-axacropdf) to access the PDF and save the file via VBA. The execute command is what you are looking for; the item to execute would be the menu item for saving as an Excel workbook. You will need to do some research to find the internal name of that menu item.