Import text from pdf files in Power BI

While Power BI will soon provide functions to import tables from pdf-files, there might be occasions when you actually need to import text from pdf files (in unstructured form). With a little help from R in Power BI you can do exactly that. (And don’t worry: No need to learn R here: The necessary R-code is already included in my function below. All you need is to have R installed your machine). Please also note that at the time of writing the refresh of these queries in the service is only supported with the personal gateway and not with the enterprise version.

Prerequisites

You can use the function below just like a normal M-function, just pass the (URL- or file-) path to it. All you have to take care of is that a instance of R is running on your machine. If this is new to you, check out Ruth Pozuelo’s video showing all the necessary steps: How to install R for Power BI

There is one package required: pdftools. The video above also shows how to install it.

Function

Import text from PDF files:

	let func =
	(pdf as text, optional ByPage as number, optional ownerPW, optional userPW) as table =>
	let
	opw = if List.Contains({null, ""}, ownerPW) then """" else """"&ownerPW&"",
	upw = if List.Contains({null, ""}, userPW) then """" else """"&userPW&"",
	turnSlashes = Text.Replace(pdf, "\", "/"),
	queryString = """"& turnSlashes & """, opw = "&opw&""", upw = "&upw&"""",
	RunRScript = R.Execute("# 'dataset' holds the input data for this script#(lf)output <- data.frame(pdftools::pdf_text("&queryString&"))"),
	output = RunRScript{[Name="output"]}[Value],
	DemoteHeaders = Table.DemoteHeaders(output),
	RemoveOldHeader = Table.Skip(DemoteHeaders,1),
	AddPageIndex = Table.AddIndexColumn(RemoveOldHeader, "PageIndex", 1, 1),
	TransformTextToRows = Table.TransformColumns(AddPageIndex,{{"Column1", each List.Transform(Text.Split(_, "#(lf)"), Text.Clean)}}),
	ExpandRows = Table.ExpandListColumn(TransformTextToRows, "Column1"),
	AddRowIndex = Table.AddIndexColumn(ExpandRows, "RowIndex", 1, 1),
	Result = if List.Contains({null, "",0}, ByPage) then AddPageIndex else AddRowIndex
	in
	Result
	, documentation = [
	Documentation.Name = " ImportPdfText_R
	", Documentation.Description = " R script to import text from a pdf file. Returns a table with one row per page.
	" , Documentation.LongDescription = " R script to import text from a pdf file. Returns a table with one row per page by default. Using 1 in the second optional parameter will return one row per line of text instead with matching indices.
	", Documentation.Category = " Accessing data functions
	", Documentation.Source = " http://wp.me/p6lgsG-M3 .
	", Documentation.Author = " Imke Feldmann: www.TheBIccountant.com .
	", Documentation.Examples = {[Description = "
	" , Code = "
	", Result = "
	"]}]
	in
	Value.ReplaceType(func, Value.ReplaceMetadata(Value.Type(func), documentation))

view raw

ImportPdfText.pq

hosted with ❤ by GitHub

You can try calling this function for a pdf-file from the internet like the M formula language specification like this:

ImportPdfText("http://download.microsoft.com/download/8/1/A/81A62C9B-04D5-4B6D-B162-D28E4D848552/Power%20Query%20Formula%20Language%20Specification%20(October%202016).pdf")

If you want to import local files from your computer, just paste the full file-path instead of the URL. You don’t have to care about the direction of the slashes, both versions (forward and backwards) are accepted.

How to use

The script will return a table with one row for each page in the pdf-file by default. But it has an optional 2nd parameter that will return one row per pdf-text-line instead, if you put 1 into it. A page index and a row index will help navigating the result.

The 3rd parameter is an optional owner password for the pdf and the 4th the optional user password. If you’re using them, you have to enter null for the previous optional parameters. The following example shows how to use a user password while leaving the others “empty”:

ImportPdfText("MyPdfPath",null, null, "MyPassword")

Also check out Ruth Pozuelos video where she shows how to use this: https://www.youtube.com/watch?v=z15dF-jDXIo

Enjoy & stay queryious 🙂

Comments (19) Write a comment

Pingback: Reutilizar funciones en Power Query – Power BI y Business Intelligence
- sam
- 18.01.18 @ 8:44 AM
@Imke – Thanks – this would be a big help.
Can a Password be passed as a parameter if the pdf file is password protected

- - Admin
  - 18.01.18 @ 11:16 AM
  Thanks sam!
  Yes, passwords are supported by the R-function: https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
  I’ve included them as the 3rd and 4th parameter in the function, which is now updated accordingly.
  Please let me know if that works, because I couldn’t test it.
  Thanks and cheers, Imke
  
- Wes Baird
- 24.01.18 @ 5:16 AM
This works really well. Thank you, Imke.
And the youtube video by @Curbal is a nice icing on the cake.

- - Admin
  - 24.01.18 @ 6:11 AM
  Thanks Wes!
  
Hi, i tried the function but couldn’t make it run correctly.
R scripts are enabled and pdftools are installed.
When the function is invoked, a null reference error is generated.
Can you please provide guidance?
Thanks

- - Admin
  - 01.02.18 @ 6:45 PM
  Did you use one of the optional parameters and forgot to enter null for the preceeding ones that you didn’t use?
  Otherwise I need more details: Please post your exact syntax and the full text of the error message please.
  Thx, Imke
  
Pingback: PDF Tabellen in Power Query einlesen - PowerBI Pro
- Luis
- 25.04.18 @ 12:43 AM
Imke: Great solution. I wonder if that query can be used in excel in order to extract a bank statement from a PDF. I tried powerquery but it brings all the information in one column. Anyway you can help me? Thanks

- - Admin
  - 25.04.18 @ 11:05 PM
  Yes, my version will return just one column and is meant for pdfs who don’t come in table-format. If you want to retrieve real pdf-tables, then this method might work better for you: https://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/ .
  Otherwise you have to split up the column with Power Query as you would have to do with other unstructured sources.
  
- Vlad Samoilov
- 06.05.18 @ 11:33 PM
That’s a great solution, thank you very much for sharing. Is it possible to remove passwords, as in my case I have to hit “Enter” every time Power BI reads new pdf (and I have tens and hundreds of them).
And I am really interested if I can run this in Excel.

- - Admin
  - 07.05.18 @ 7:18 AM
  Thanks Vlad!
  I cannot think of a way to remove the passwords.
  With regards to Excel: R-scripts don’t run in PQ in Excel (yet). So you have to use them in PBI and export your results to csv (with an R-script) to consume the results in Excel.
  
- Admin
- 22.06.18 @ 5:59 PM
I haven’t come across this error message so far, so sorry, no solution from my side here unfortunately.

- After two days I was able to figure out what was the problem. The platfrom version of Power BI Desktop and RStudio were not the same, Power BI Desktop was an a 32 bits and RStudio on 64 bits. To solve the problem I installed the 64 bits version of Power BI Desktop to have the two apps on the same platform.
  Tahnk you again for sharing
  
  - Admin
    
    22.06.18 @ 8:01 PM
    
    Great & thanks for sharing the solution, as it might help others!
    
- ogureisu
- 25.07.19 @ 3:05 PM
Hello! This is sooo amazing! Is there a way to get a file from a sharepoint folder. Like ask you to login first? I tried the code however i got this error:

An error occurred in the ‘’ query. DataSource.Error: ADO.NET: R script error.
Error in open.connection(con, “rb”) : cannot open the connection
Calls: data.frame … poppler_pdf_text -> loadfile -> open -> open.connection
In addition: Warning message:
In open.connection(con, “rb”) :

cannot open URL ____: HTTP status was ‘403 FORBIDDEN’
Execution halted

Details:
DataSourceKind=R

- - Admin
  - 25.07.19 @ 9:38 PM
  So far, I haven’t been able to load from SharePoint, unfortunately. But haven’t spent too much time on it. So maybe there is a way.
  
- Siddiqui, Umer
- 16.09.21 @ 5:55 AM
Hi I am exploring Power BI for my personal use. I referred to the video,! Awesome. Could you pls help me on following
(1) How to club the rows together to make it look like a source file
(2) Apple to Apple comparison of each word in each para under each heading
(3) pull the images and compare them for both visuals and Text (annotations/legends/figures/numbers)

Thanks.

very nice. thank you so much for this tool.

New Generation Finance – Accounting – Controlling using Microsoft BI stack

Import text from pdf files in Power BI

Prerequisites

Function

How to use

Like this:

Comments (19) Write a comment

Leave a ReplyCancel reply

Prerequisites

Function

How to use

Share this:

Like this:

Comments (19) Write a comment

Leave a ReplyCancel reply