03.02.2015

Geospatial data support for Hive

Apache Hive doesn’t support geospatial functions out of the box. However, it’s easy to extend its functionality with external libraries that can be called via user-defined functions (UDFs). This post shows how to use the ESRI Hadoop libraries in Hive. I’m using the Hortonworks HDP sandbox here, so the examples below are easy to reproduce.

Preparation
In order to use the libraries, they first need to be installed. I’m using the following script to download, build, and deploy the libraries to the Hadoop file system. Please note that you may have to adjust the version number of the jar files. You will also need Maven to build the libraries, so install it before running the following commands:

  # create working directories
  cd /tmp
  mkdir gis
  cd gis
  mkdir framework
  mkdir api

  # download and build framework
  cd framework
  wget https://github.com/Esri/spatial-framework-for-hadoop/archive/master.zip
  unzip master
  cd spatial-framework-for-hadoop-master
  mvn clean package
  hadoop fs -put hive/target/spatial-sdk-hive-1.0.3-SNAPSHOT.jar /user/esri
  hadoop fs -put json/target/spatial-sdk-json-1.0.3-SNAPSHOT.jar /user/esri
  hadoop fs -put hive/function-ddl.sql /user/esri

  # download and build api
  cd /tmp/gis/api
  wget https://github.com/Esri/geometry-api-java/archive/master.zip
  unzip master
  cd geometry-api-java-master
  mvn clean package
  hadoop fs -put target/esri-geometry-api-1.2.jar /user/esri

  # clean up
  cd /tmp
  rm -rf gis

If successful, you should see the following files in your Hadoop file system under /user/esri:

[Screenshot: contents of /user/esri in HDFS — spatial-sdk-hive-1.0.3-SNAPSHOT.jar, spatial-sdk-json-1.0.3-SNAPSHOT.jar, function-ddl.sql, esri-geometry-api-1.2.jar]

Making the ESRI functions available in Hive
After building the ESRI libraries, we still need to declare the new user-defined functions within Hive. First, we start with the jar files:

  add jar hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar;
  add jar hdfs:///user/esri/spatial-sdk-json-1.0.3-SNAPSHOT.jar;
  add jar hdfs:///user/esri/esri-geometry-api-1.2.jar;

Running a “list jars” in Hive shows the libraries being loaded successfully:

[Screenshot: output of “list jars” in Hive showing the three libraries]

Please note that you can add the add jar commands to the .hiverc file. You can find more details on that file in this blog post.
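As a minimal sketch, a .hiverc file containing the three add jar statements (assuming the jar locations used by the build script above) would make the libraries available at the start of every Hive CLI session:

```sql
-- .hiverc: executed automatically when the Hive CLI starts
add jar hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/spatial-sdk-json-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/esri-geometry-api-1.2.jar;
```

The create temporary function statements below could be placed there as well, so a fresh session is immediately ready for spatial queries.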

Next, we have to declare the spatial functions. You will find the full declaration of all functions in the function-ddl.sql file that the build script also placed in HDFS under /user/esri (as shown in the screenshot above). However, for this post I’m only using a few functions with the following declarations:

  create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
  create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
  create temporary function ST_Length as 'com.esri.hadoop.hive.ST_Length';
  create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
  create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID';
  create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
  create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';

Starting with Hive 0.13, you can also declare these functions permanently by omitting the “temporary” keyword from the declarations above. This has the advantage that you do not need to re-declare the functions for every session. You can also include the jar file in the create function statement, which makes creating a permanent declaration easier.

For example, for the definition of the ST_Point function you would write the following SQL statement:

  create function ST_Point as 'com.esri.hadoop.hive.ST_Point' using jar 'hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar';

Using the ESRI functions in Hive
In order to demonstrate the ESRI geospatial functions, I’m using the Uber data set with 25,000 Uber trips from San Francisco.

The first rows of the data look like this:

[Screenshot: first rows of the Uber data set — id, timestamp, latitude, longitude]

Each trip has an id (1 to 25,000) along with the latitude and longitude at each point in time. For example, using window functions we can easily calculate the total duration of each trip (using first_value to get the starting time and restricting the result set to the last entry per trip):

  -- get trip duration
  select id, unix_timestamp(dt)-unix_timestamp(dt_first) trip_duration
  from (
    select
      id, dt,
      first_value(dt) over (partition by id order by dt) dt_first,
      row_number() over (partition by id order by dt desc) lnr
    from uber
  ) Sub
  where lnr=1;

[Screenshot: trip duration in seconds per trip id]

Ok, so far we have not used any of the geospatial functions declared above. Let’s say we’d like to calculate the distance and average speed of each trip. The idea is to calculate the distance between consecutive GPS points along the trip. Therefore, I’m using window functions again to retrieve the preceding GPS coordinates along the way, as shown below:

  select
    id,
    dt,
    latitude,
    longitude,
    lag(latitude,1) over (partition by id order by dt) prev_latitude,
    lag(longitude,1) over (partition by id order by dt) prev_longitude
  from uber;

Looking at a very short trip shows how we now have the preceding coordinates in the same row.

[Screenshot: waypoints of a short trip with the preceding coordinates in the same row]

In order to calculate the distance between the coordinates, we first create a line between two consecutive coordinates using the following query (the inner subquery is exactly the query from above):

  select
    id, dt,
    unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
    latitude, longitude,
    ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
  from
    (select id, dt, latitude, longitude,
      lag(latitude,1) over (partition by id order by dt) prev_latitude,
      lag(longitude,1) over (partition by id order by dt) prev_longitude
    from uber) Sub;

[Screenshot: query result with the line geometry shown in its binary representation]

The line is not visualized in the query editor, so we only see its binary representation. The next step is a little tricky because we need to calculate the length of the line. Here, the function ST_GeodesicLengthWGS84 can be used as shown below (again, the inner subquery is the previous query). This function returns the distance in meters on a spheroid; however, the correct SRID has to be set first:

  select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
  from
  (
    select id, dt,
      unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
      latitude, longitude,
      ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
    from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
  ) Sub1;

[Screenshot: query result with time passed and distance in meters per segment]

The only thing left is to aggregate the distances and time differences. Since the time difference is in seconds and the distance is in meters, we can calculate the speed in km/h as distance/1000 * (3600 / time_difference). This results in the following query (again only shown for trip no. 44):

  select id, min(dt) start_time, max(dt) end_time, sum(Distance) distance,
    sum(Distance)/1000 * (3600.0/sum(time_passed)) avg_speed
  from
    (select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
    from
      (select id, dt,
        unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
        latitude, longitude,
        ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
      from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
      ) Sub1
    ) Sub2
  group by id;

[Screenshot: start time, end time, total distance, and average speed for the trip]

Ok, this looks like a rather short trip, but it still illustrates the idea. Here are the waypoints of this short trip (PowerView):

[Screenshot: waypoints of the short trip visualized in PowerView]

Now let’s assume we’re interested in all trips that go to the airport. In order to run this query, we need to define the area of the airport. I’m using Google Earth for this purpose and simply draw a polygon around my target area:

[Screenshot: polygon for the airport area drawn in Google Earth]

You can then export the polygon to a .kml file. Since I have a very simple polygon here, I transferred the coordinates from the .kml file manually into the geospatial polygon type:

kml file:

  …
  <LinearRing>
  <coordinates>
  -122.3939778660149,37.61286698251174,0
  -122.387217599775,37.61405599476924,0
  -122.3886445577052,37.61737414129696,0
  -122.3948268153237,37.61592559803017,0
  -122.3939778660149,37.61286698251174,0
  </coordinates>
  </LinearRing>

ESRI polygon:

  ST_Polygon(
  -122.3939778660149,37.61286698251174,
  -122.387217599775,37.61405599476924,
  -122.3886445577052,37.61737414129696,
  -122.3948268153237,37.61592559803017,
  -122.3939778660149,37.61286698251174
  )

In order to find all trips that end at the airport, we just need to get the end point of each trip and use ST_Intersects to check whether this end point lies within the polygon. Here is the resulting query:

  select count(*) CntTripsToAirport
  from (
    select id, start_coords, ST_Point(longitude, latitude) end_coords
    from (
      select
        id, dt, longitude, latitude,
        first_value(ST_Point(longitude, latitude)) over (partition by id order by dt) start_coords,
        row_number() over (partition by id order by dt desc) lnr
      from uber
    ) Sub
    where lnr=1
  ) trips
  where
    ST_Intersects(
      ST_Polygon(
        -122.392291,37.616543,
        -122.389115,37.616458,
        -122.389115,37.613552,
        -122.392119,37.613297,
        -122.392291,37.616543
      ),
      trips.end_coords);
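Note that the inner query already computes start_coords, even though the final result only uses end_coords. As an untested sketch, the same pattern would count the trips that start at the airport by testing start_coords against the polygon instead:

```sql
-- sketch: count trips that *start* inside the airport polygon
-- (assumes the same uber table and function declarations as above)
select count(*) CntTripsFromAirport
from (
  select
    id,
    first_value(ST_Point(longitude, latitude)) over (partition by id order by dt) start_coords,
    row_number() over (partition by id order by dt desc) lnr
  from uber
) trips
where lnr=1
  and ST_Intersects(
        ST_Polygon(
          -122.392291,37.616543,
          -122.389115,37.616458,
          -122.389115,37.613552,
          -122.392119,37.613297,
          -122.392291,37.616543
        ),
        trips.start_coords);
```

Because start_coords is constant within a trip, filtering on lnr=1 just picks one row per trip before the intersection test.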

This results in 26 trips:

[Screenshot: query result showing 26 trips to the airport]

Of course, you can do a lot more with this data set and the geospatial functions in Hive; this post should just give you an idea. You can also use visualization tools like PowerView. For example, the following map shows the locations where most of the trips started (red means many trips start there):

[Screenshot: PowerView heat map of trip start locations]
