How to use SSIS to do ETL with Hadoop
Recently Microsoft released the new SQL Server Data Tools for Visual Studio 2015 (still in preview), which include new SQL Server Integration Services (SSIS) tasks to extract data from and load data into an Apache Hadoop cluster. This is nice - but it is possible with older versions of SSIS as well.
Let me first give you a short description of the Apache Hadoop services involved: The KNOX Gateway is used to secure Hadoop clusters, as it provides a single point of authentication and authorization. Users access Apache Hadoop services and their APIs, like WebHDFS or WebHCat, through KNOX's own REST API, which delegates the requests to the corresponding services. The KNOX service comes with an integrated “Demo-LDAP” for authentication - so we use the preconfigured guest user to work with the service. To use the LDAP, you have to start it via “Service Actions”:
Start LDAP Service
WebHDFS is a REST API that provides a complete file system interface for HDFS.
WebHCat is a REST API for Hadoop's table and storage management layer, HCatalog.
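Before wiring anything into SSIS, it helps to smoke-test the gateway from a shell. The sketch below is a dry run that only prints the cURL command; host, port and the guest credentials are the sandbox defaults mentioned above, so adjust them for your cluster:

```shell
# Base URL of the KNOX gateway (sandbox default - an assumption, adjust as needed)
KNOX_BASE="https://127.0.0.1:8443/gateway/default"

# -k: accept KNOX's self-signed certificate
# -u: Demo-LDAP guest credentials
# op=LISTSTATUS: list the contents of the guest home directory
CMD="curl -k -u guest:guest-password \"$KNOX_BASE/webhdfs/v1/user/guest?op=LISTSTATUS\""

# Dry run: print the command; drop the echo (or pipe it to sh) to execute it.
echo "$CMD"
```

If the gateway and Demo-LDAP are running, executing the printed command returns a JSON listing of the directory.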
To communicate with these REST APIs from SSIS, we just need a tool that provides this functionality.
I used cURL for Windows in an “Execute Process Task” (see screenshot).
SSIS Execute Task
This is an example of uploading a file to HDFS using the WebHDFS API. The file is compressed with GZIP, a format Hive supports for table compression. That means the file can be uploaded directly into an external table folder and queried immediately.
- Executable: path to curl.exe
- Arguments: -iku guest:guest-password -L -T sample_table.gz -X PUT https://127.0.0.1:8443/gateway/default/webhdfs/v1/user/guest/sample_table/sample_table.gz?op=CREATE
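If you want a follow-up “Execute Process Task” that verifies the upload, a GETFILESTATUS request against the same path does the job. Again a dry-run sketch that only prints the command (path and credentials as above):

```shell
KNOX_BASE="https://127.0.0.1:8443/gateway/default"
FILE_PATH="/user/guest/sample_table/sample_table.gz"

# op=GETFILESTATUS returns a JSON FileStatus object (length, owner, ...)
# for the uploaded file, or an error if the upload failed.
CHECK_CMD="curl -iku guest:guest-password \"$KNOX_BASE/webhdfs/v1$FILE_PATH?op=GETFILESTATUS\""

# Dry run: print the command instead of executing it.
echo "$CHECK_CMD"
```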
After the file is uploaded, we can use another “Execute Process Task” with cURL to create a new external table. The location of the table will be the sample_table directory we just created. This time we use the HCatalog API, WebHCat:
- Executable: path to curl.exe
- Arguments: -v -i -k -u guest:guest-password -X POST -d execute="CREATE+EXTERNAL+TABLE+sample_test+(id+string,firstname+string,lastname+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+'\073'+LOCATION+'/user/guest/sample_table/';" -d statusdir="tmp" "https://127.0.0.1:8443/gateway/default/templeton/v1/hive"
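The call above submits the statement asynchronously: WebHCat answers with a job id and writes the job's output into the statusdir (“tmp”, relative to the user's home directory) once the job finishes. A sketch for reading the stderr file back through WebHDFS - again only printing the command; the file names stdout/stderr in the statusdir are what WebHCat writes by default:

```shell
KNOX_BASE="https://127.0.0.1:8443/gateway/default"

# WebHCat wrote the job output below /user/guest/tmp; op=OPEN streams a file back.
# -L follows the redirect that WebHDFS issues for read operations.
STATUS_CMD="curl -kLu guest:guest-password \"$KNOX_BASE/webhdfs/v1/user/guest/tmp/stderr?op=OPEN\""

# Dry run: print the command instead of executing it.
echo "$STATUS_CMD"
```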
After execution, the new table will show up in the database explorer of the Ambari Hive view.
Hive View Database Explorer
Now you can simply query the table or do some more transformations.
Hive View Query Editor
If you want WebHCat to run a Hive script file stored in HDFS instead, you can replace “execute” with “file” and pass the file path instead of the HIVE statement.
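Such a call could look like the sketch below (dry run that only prints the command; the script path /user/guest/scripts/create_sample_table.hql is a made-up example):

```shell
KNOX_BASE="https://127.0.0.1:8443/gateway/default"

# "file" points at a Hive script in HDFS instead of an inline statement.
# The script path here is a hypothetical example.
FILE_CMD="curl -k -u guest:guest-password -X POST -d file=/user/guest/scripts/create_sample_table.hql -d statusdir=tmp \"$KNOX_BASE/templeton/v1/hive\""

# Dry run: print the command instead of executing it.
echo "$FILE_CMD"
```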
You see - this is a very simple way to use the powerful SQL Server Integration Services (SSIS) to do ETL with Hadoop.