29.05.2015

Polybase in SQL Server 2016 CTP 2

Technical Value

One of the exciting new features in SQL Server 2016 that is already available in the public preview version (CTP 2) is Polybase, the transparent access bridge from SQL to Hadoop. This functionality has been available in the Microsoft Analytics Platform System (APS) since version 2012 and has now made it’s way into SQL Server. You can find a lot more information in this blog post for example: http://blogs.technet.com/b/dataplatforminsider/archive/2014/06/02/polybase-in-aps-yet-another-sql-over-hadoop-solution.aspx

This blog post covers the first steps for using Polybase in SQL Server 2016 community preview 2 (CTP 2).

First of all, you need to install Polybase functionality during the installation process by checking the appropriate option:

image

This installs two services for Polybase as shown below:

image

Also, when opening a database in SQL Server Management Studio you’ll see some new objects:

image

External tables are your gateway to hadoop. Later in this post we will create an external table to a Hadoop file which can then be used mostly like a normal SQL Server table.

External data sources define the connection to the Hadoop cluster (more precisely to the Hadoop file system, HDFS), while external file formats are used to define the structure of the file on the Hadoop cluster.

Before we can start, we need to configure SQL Server for Hadoop connectivity. The following code is an example:

  1. EXEC sp_configure 'hadoop connectivity', 5;      
  2. GO      
  3. RECONFIGURE;      
  4. GO      

You can use the following configuration values:

0 no Hadoop connectivity (default)
3 Enable connectivity to Cloudera CDH for Linux
4 Enable connectivity to Hortonworks Data Platform for Windows Server (HDP), HDInsight on Analytics Platform System, or HDInsight’s Microsoft Azure blob storage
5 Enable connectivity to Hortonworks Data Platform (HDP)for Linux

In my case, I’m using a Hortonworks Sandbox on Linux, so I’m using the configuration value 5.

Also, the user that is used to access Hadoop has to be configured in the Polybase configuration file which can be found in the Binn\Polybase\Settings sub directory of the instance, in my case

  1. C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

The default user is pdw_user which I changed to ‘hue’ here:

image

To make all these changes active, I had to restart SQL Server and the two Polybase services.

Now, let’s create an external data source. You can create a code snippet from the context menu of the external data sources folder. Here is the code I’m using for my sandbox:

  1. CREATE EXTERNAL DATA SOURCE HDP2 WITH      
  2. (      
  3.     TYPE = HADOOP,      
  4.     LOCATION = 'hdfs://sandbox.hortonworks.com:8020'      
  5. )

Next, we need to create an external file format. I’d like to access the tab delimited file sample_07 from Hive which is shipped with the sandbox. Therefore the external file format looks like this:

  1. CREATE EXTERNAL FILE FORMAT TSV      
  2. WITH (      
  3.     FORMAT_TYPE = DELIMITEDTEXT,      
  4.     FORMAT_OPTIONS (      
  5.         FIELD_TERMINATOR = '\t',      
  6.         DATE_FORMAT = 'MM/dd/yyyy'      
  7.     )      
  8. )

The date format is not needed for my case (there is no date column in this file) but just there to show the functionality.

The two new server objects now appear in the “External Resources” section of the database.

image

Next, we can create the external table itself using the defined data source and file format.

  1. create external table sample_07      
  2. (      
  3.     code nvarchar(255),      
  4.     description nvarchar(255),      
  5.     total_emp int,      
  6.     salary nvarchar(255)      
  7. )      
  8. WITH      
  9. (      
  10.     LOCATION = '/apps/hive/warehouse/sample_07',      
  11.     DATA_SOURCE = HDP2,      
  12.     FILE_FORMAT = TSV,      
  13.     REJECT_TYPE = value,      
  14.     REJECT_VALUE=0      
  15. )

 

The new external table appears in the “External Tables” folder of the database.

 

image

We can now query the table like any other table in the database, for example:

  1. select * from Sample_07

image

Here is another example

  1. select left(code,2) code2, sum(total_emp) total_emp, count(*) Count      
  2. from Sample_07      
  3. group by left(code,2)

And of course you can also write queries with joins between Polybase external tables and local SQL Server tables which gives a transparant SQL query experience to Hadoop data.

 

Final notes

  • The functionality and syntax may change in the final product, so this only applies to the technical preview 2 (CTP2)
  • Currently, writing to a hadoop file is not supported (there should be something like the “create external table as select” syntax in the APS)
  • This post is just a brief overview. There are many more options (for example more file formats, compression etc.) that can be used. A good orientation can be found in the Microsoft APS support tools download which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=45294

Kommentare

Hilmar Buchta
Hilmar
Buchta
Mi, 28.10.2015 - 09:55

I guess the configuration is a little bit tricky during the CTP process and also the error messages might not be self-explaining. However, I just tested this configuration successfully: SQL Server 2016 CTP 3, HortonWorks HDP 2.3 Sandbox on Linux (CentOS), configuration for polybase: 5, port for hdfs: 8020 (for the location argument in the create external data source Statement), port for Job tracker: 8050 (optional, for the lob_tracker_location argument in the create external data source Statement).
Also, if you experience problems during the configuration, make sure that the hdfs connection works (for example: ping to andbox address, connect with hdfs explorer tool).
It's also important that the two polybase services (under machine Management/Services) are restarted after reconfiguring hadoop connectivity and/or modifying the hadoop.config file.
Hope this helps.

Dennes Torres
Mi, 25.11.2015 - 02:12

Hi,

I'm using CTP 3 but sill receiving the message 'could not obtain block'.

Did anyone solved this error message ?

Thank you !

Dennes Torres
Mi, 25.11.2015 - 03:34

Did you tried to do a pushdown ?

yadu
Mi, 28.10.2015 - 09:13

I am also facing the same issue as mike and couldn't able to resolve it. the error message is - "Could not obtain block: BP-1487918654-10.0.2.15-1418756667447:blk_1073742154_1334 file=/user/hue/NYSE_dividends_A.csv"
any have solution for this problem. I am using Hortonworks 2.2 sandbox and configured my polybase with 7

Manoj Pandey
Fr, 05.06.2015 - 06:11

Very nice write up @Hilmar and explaining in setting up Polybase in CTP2. I don't have any idea on Hadoop, can you please help/guide me in setting up the sandbox, so that I can use it here.

Hilmar Buchta
Hilmar
Buchta
Fr, 05.06.2015 - 10:32

Thanks for the Feedback. As for the documentation you best have a look at this site: http://hortonworks.com/products/hortonworks-sandbox/

Hortonworks has many tutorials and instructions available on their website. Basically, to get running with Hadoop, you just need to download the virtual machine ("Sandbox") for VirtualBox or VMWare.

Mike
Di, 16.06.2015 - 17:19

Hello Hilmar, Any chance you know what this error is? I followed your example but when I run select * from sample_07 I get the following error:

Could not obtain block: BP-111963099-172.16.144.128-1419863445313:blk_1073742103_1283 file=/apps/hive/warehouse/sample_07/sample_07

Hilmar Buchta
Hilmar
Buchta
Di, 16.06.2015 - 17:29

Sounds like a generic error message. If the message contains something like "closing tag does not match opening tag" you're probably using the wrong port (in the create external data source). Otherwise you should check the general availibility of the hadoop system (for example with a command line ping to the address you used in the create external data source statement). Also, if you're using a different distribution than the Hortonworks sandbox, the port and the connectivity settings may be differenz compared to this post. Finally you could check if you can access the file from the machine running SQL Server 2016 CTP2 via webbrowser or another explorer tool for the Hadoop filesystem.

Mike
Di, 16.06.2015 - 19:48

Thanks, I'm wondering if it's an issue with Sandbox 2.2. Based on your demo you look to be using 2.0 (config value = 5). I've verified I'm connected and I can tell from Hadoop log files that I was granted access to open the file.

Vidya
Mo, 29.06.2015 - 11:52

Great Article Hilmar.
I too got the same issue which Mike faced.
Any resolution found Mike ?

Mike
Di, 07.07.2015 - 01:16

@Vidya Hello, No I have not found an answer yet.

Mike
Fr, 10.07.2015 - 18:46

I solved my issue. I was using the Sandbox 2.2 in Azure. You have to have your SQL Server VM and the Linux Sandbox on the same VNET.

Neuen Kommentar schreiben

Der Inhalt dieses Feldes wird nicht öffentlich zugänglich angezeigt.

Klartext

  • Keine HTML-Tags erlaubt.
  • HTML - Zeilenumbrüche und Absätze werden automatisch erzeugt.
  • Web page addresses and email addresses turn into links automatically.
Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden

Unsere Website benutzt Cookies, um bestmögliche Funktionalitäten bieten zu können. Durch die Nutzung unserer Website, erklären Sie sich mit der Verwendung von Cookies einverstanden. In unserer Datenschutzerklärung finden Sie unter §5 Informationen und Links, wie Sie diesen Maßnahmen jederzeit widersprechen können.