Using Power Query to combine files from Hadoop into a single table (part 1)
Hadoop | HDInsight | Power Query Preview
For today’s post, I’m showing a scenario where I want to load some text files from my Hadoop cluster into a single Excel Power Pivot table. For this purpose I created some files with meaningless content and three columns (date, product, sales). We will see, that Power Pivot offers builtin support for this use case from it’s GUI (ability to combine multiple files), but we will use a custom import function for each of the files. This gives us the maximum of flexibility to control the import process.
I stored those files on hdfs in a directory /users/demo/csv. The purpose is to load all the files from this directory into a single table.
One more thing before we get started. This post if more intended to demonstrate the functionality of Power Query functions. Of course, if you have many small files on your hadoop cluster, the better solution would be to assemble the files on the cluster itself. In my case I could launch the following pig latin script to create a joined file:
- a = LOAD '/user/demo/csv/*.csv' using PigStorage(',') AS (Date:chararray, Product:int, Sales:float);
- STORE a INTO '/user/demo/result.csv';
Here are the first output rows of this script (when using dump a):
But let’s try to achieve the same result using Power Query. This post will be separated in two parts:
part 1: Combine/import the files using Power BI basic functionality
part 2: Write a custom function to combine/import the csv files
For part 1 I will first connect to the Hadoop cluster. So, I open a blank Excel file, go to the Power Query ribbon and choose “From Other Sources/From Hadoop File (HDFS)”:
In the following dialog, you need to enter the address of your WebHDFS service:
Since Power Query gets a recursive list back with a list of all files, I used a specific directory (/user/demo/csv) here in order to reduce the amount of files that are returned.
Please not the arrows-icon in the first column. This is used to get the content of all listed files (binary concatenation of the files). Here is the result:
As you can see, there are some lines with errors. The reason for that is, that we did a binary concatenation of the files here. As the last line of each of my csv-files does not end with a line break, the last line and the first line of each file are combined into a single line causing the error. You can see the binary combine operation, when checking the query steps. Therefore, expand the pane on the right and choose the “CombinedBinaries” step.
So, a binary combination might not be the best idea here. We will need to use another approach. First, let’s remove all the steps starting with the binary combine:
The ability to delete steps, or to move them within the query position is one of the great functionalities in Power Query: You can interactively design your query by trying certain operations, going back to every single step if necessary and try a different approach. And this process is fully visual: Clicking on each step shows the query result in that specific step.
So after deleting our first approach for the combined binaries, we’re back at the file list. For the next approach I’m only using the binary column, so I can remove all other columns here:
Then I insert a Custom column:
In the formula editor, we use this formula:
Be aware that the function names are case sensitive. The full list of available functions is available as the Power Query Formula library specification. This and other resources for Power Query can be found here:
Quite impressive, the formula library specification already has 257 pages with lots of useful functions. And remember, we’re still in the preview time of this product.
The result shows our new column, having “Table” as content:
Now, we don’t need the binary data anymore, so we can right-click / remove this column.
By clicking the arrow symbol in the column header, we can now expand the tables that resulted from our function call. Let’s include all columns:
After clicking “Ok”, we get the file contents, this time without the error line.
The only task left is to rename the columns properly. In my case, I name them Date, Product and Sales (right click/rename). You should also check the data type of each column. For example, we could convert the first column to data type “Date”:
The last step here is to rename the query (choose some comprehensive name) and to finish the query operation by clicking Ok in the query windows.
Back in Excel you may adjust the query behavior using the query settings pane:
For example, you could disable “Load to worksheet” and then click “Load to data model” in order to load the data directly into a Power Pivot model (bypassing Excel and therefore also bypassing Excel’s limitation of 1 million rows).
Querying big data sources like Hadoop might not be a common use case, but there may still some cases where this might be useful. And of course we could filter down the file list (for example to read only files starting with a given date) or the file content (with the reader function from above).
So, Power Query and the embedded M formula language offer very powerful tools for retrieving and transforming (reshaping) data. For most cases none or only little coding is required and the step by step engine allows to get back to a specific point in the transformation at any time. The use cases for Power Query are literally endless and this tool is definitely a big step ahead for bringing Business Intelligence Power to the end users.