Allow PowerQuery to Extract Data from PDF tables
Allow PowerQuery to import tables of data from PDF files.
Hi all! I am happy to announce that recently we have released From PDF data connector in Excel which allows you to import and work with data from your PDF documents. Refer to this blog post for more information:
- Excel Team
Darnie Christopher commented
This does not work when the PDF source file is password protected . Please suggest a
This is already there is Excel - via PDF.Files - we just need a button for this on the UI
David Uyarra commented
PDF is a very common file type that we have to deal with, although we know it is not the best suited for analyzing data.
Power BI has a connector that works with its query editor. Could it be ported to Excel's get and transform (query editor) ? It would be very useful for excel users to be able to extract data from these files without a third party app. Other data analyzing tools (eg. IDEA) support this type of connector and it's proven to be very useful.
Henn Sarv commented
PDF Connector for PowerQuery exist and works fine in PowerBI - request (not only mine) is to include same Connector into Excel PowerQuery. And not only PDF Connector. IDeal when all new connectors and connector updates and POwerQuery functions appear same way in Excel too
Problem is that with excel I can do aditional analysis with extracted data. PowerBI is mainly only for visualizing.
[Deleted User] commented
The problem with getting data OUT of a PDF is that a PDF is more like a program than a simple document.
Worse, anyone creating a PDF can take a bajillion different paths to creating it than someone else does. All that matters is that the PDF displayed result looks like the original source would have.
So Excel, or any other program MAKING a PDF is simplicity itself (for thems whats knows how to), as they have easily accessed data and full control over how to handle it. But going the other way with the huge variations in how their programmers chose to create them is a WHOLE other story.
Wish I could, but I can't ding Excel for this one.
By the way, if the need is to just sometimes take a chunk of data out as it appears on the screen, rahter than the bizarre way it sometime is connected and split up, try pressing Ctrl, holding it and marking a rectangular block with the mouse. Leave some extra white space, as you are able. Then copy it to the clipboard with formatting.
It will try to OCR it and will make it a pastable block, like it appears on screen. I find it... 85-90% effective. The loss is never in the block chosen, just the OCR result. So say you want a table and highlighting it you get some of it highlighted but find yourself getting text from five pages later too and it's all over the map as well, this is very effective at solving things for you.
Henn Sarv commented
Connectors available for PowerBI POwerQuery might be available for Excel PowerQuery too
One example with missing support is PDF Connector
Abiola David commented
I will suggest that Excel product team adds From PDF files in the Get & Transform group of the Data tab. This will make life much easier to extract data stored in PDF for further analysis in Excel.
Any update on this one? It has already been implemented in Power BI (beta)
Jaq Dyer commented
As a student, there are so many files that I get in PDF format that I need to find information in and excel is the best way to do it! As an example, for an assignment recently I had to find different phone numbers and there were 17 pages! Additionally, it would also be handy to managers ect who get forms in PDF format to be able to convert them to excel.
Ken Puls commented
FYI, this is also in the Power BI community site. I would suggest also logging votes there: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/6973371-tables-in-pdf-files
The ability to identify and automatically parse the values of multi page tabular data without non columnar data (headings, page breaks and numbers, page headings, page column headings) on the second and subsequent pages corrupting the excel columns of the tabled data regardless of how formatted in the PDF.
PDFs include tabular data in various formats:
• Multiple pages/single page
• As a PDF table with headings on every page and typically page headers, page numbers, and other data that destroy any parsing I know how to use
• Tabulated data formatted as simple text that one can only capture as raw text
• Null values (spaces) that shift every subsequent cell in the table to the right