06.11.2015

Using R in SQL Server 2016 CTP 3 to train and query a predictive model

Technical Value

One of the exciting features of SQL Server 2016 is the R integration based on Revolution R Enterprise. The online documentation contains detailed information about the installation steps but only a few examples of how to use this new feature. The installation process is explained here: https://msdn.microsoft.com/en-us/library/mt604883.aspx

  • Install Revolution R Open for R Enterprise
  • Install Revolution R Enterprise
  • Register components in SQL Server and configure SQL Server to allow external scripts

The online documentation also contains two examples of how to call an R function from within SQL Server. The first example just returns a data frame from R (in this case the pre-defined iris data set):

  CREATE PROC get_iris_dataset
  AS
  BEGIN
  EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'iris_data <- iris;'
    , @input_data_1 = N''
    , @output_data_1_name = N'iris_data'
    WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100)));
  END;

You can see the call to the external script here; the R statement is embedded in the @script parameter within the SQL code. We have no input data in this simple case but return a table using the WITH RESULT SETS clause of the stored procedure. Executing the procedure get_iris_dataset simply returns the data from the R script in SQL Server. I want to use this data as my training data, so I load it into a SQL Server table. Usually your training data would already be in a table, so you wouldn't need the procedure above.

  create table iris_data
  (
    "Sepal.Length" float not null,
    "Sepal.Width" float not null,
    "Petal.Length" float not null,
    "Petal.Width" float not null,
    "Species" varchar(100)
  )
  GO
  INSERT INTO iris_data EXEC get_iris_dataset
  GO

So, this copies the data from R into a SQL Server table. Next, I’d like to train an R model based on this data. Again, this code can be found in the documentation:

  CREATE PROC generate_iris_model
  AS
  BEGIN
  EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'
  library(e1071);
  irismodel <- naiveBayes(iris_data[,1:4], iris_data[,5]);
  trained_model <- data.frame(payload = as.raw(serialize(irismodel, connection=NULL)));'
    , @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
    , @input_data_1_name = N'iris_data'
    , @output_data_1_name = N'trained_model'
    WITH RESULT SETS ((model varbinary(max)));
  END;

Again, the R code is embedded in the @script parameter. If you haven't installed the package e1071 in R, you have to do so before running this procedure: open the R console and run install.packages('e1071') as described here. The interesting thing about this procedure is that we actually return the trained model as a varbinary object. In R the object class is "naiveBayes", but here it is serialized to a raw data frame that is returned from the function. You could also save the model in R (using the 'save' command), but it's still interesting to see how we can store the model in SQL Server. To do so, I create a simple table and persist the trained model there:

  CREATE TABLE [dbo].[model](
    [model] [varbinary](max) NULL
  )
  GO
  insert into model exec generate_iris_model
  GO
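
As a side note, the serialize/unserialize round trip that this pattern relies on can be tried in plain R, outside SQL Server. Here is a minimal sketch (assuming the e1071 package is installed; the accuracy check at the end is just for illustration):

```r
library(e1071)

# Train the same naive Bayes model on the built-in iris data
irismodel <- naiveBayes(iris[, 1:4], iris[, 5])

# Serialize the model to a raw vector; this is what ends up
# in the varbinary(max) column on the SQL Server side
raw_model <- serialize(irismodel, connection = NULL)

# Restore the model and use it for prediction as usual
restored <- unserialize(raw_model)
pred <- predict(restored, iris[, 1:4])

# Fraction of correctly classified training rows
accuracy <- mean(pred == iris$Species)
```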

So now we're getting to the more interesting part, which took me some time to figure out. If we want to query the existing model, we have to deserialize it and use a prediction function. For simplicity I'm using the same data set (iris) again to see how well the model performs on known data. First we load the model into a variable (again varbinary) and then we pass the model to the external script together with the data on which we'd like to make our prediction. Here is the final code:

  declare @model varbinary(MAX)
  select @model = model from model

  EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'
  library(e1071)
  model <- unserialize(as.raw(iris_model))
  pred <- predict(model, iris_data)
  result <- cbind(iris_data, pred)'
    , @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
    , @input_data_1_name = N'iris_data'
    , @output_data_1_name = N'result'
    , @params = N'@iris_model varbinary(MAX)'
    , @iris_model = @model
    WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100), "SpeciesPredicted" varchar(100)))

As you can see, I'm passing the data to the script using the @input_data_1 and @input_data_1_name parameters, and the model using the @params parameter. The model predicted the correct species for most of the cases, although there are still some cases that haven't been predicted correctly. Actually, the R script we executed here was quite simple:

  library(e1071);
  iris_data <- iris;
  irismodel <- naiveBayes(iris_data[,1:4], iris_data[,5]);
  result <- predict(irismodel, iris_data[,1:4])
  cbind(iris_data, result)

However, the examples above show how you can train a model, persist the trained model in SQL Server, and then use this model with whatever data you like. This makes it much easier to deploy R scripts into production than calling rscript.exe and using the RODBC interface to communicate with SQL Server. Just a warning though: the examples above were run on SQL Server 2016 CTP 3, which is not the final product, so some of this functionality might still change before SQL Server 2016 is released.

Comments

Jatin
Tue, 02.08.2016 - 13:41

Hello Hilmar,

Is there a way to pass two SQL tables (without a JOIN) into R for processing using MS SQL Server 2016?

For example, with @input_data_1 = 'TABLE', is there a way to somehow access other tables in R?

BR,
Jatin

Hilmar Buchta
Tue, 02.08.2016 - 18:44

Hi Jatin, Good question. As far as I know, this isn't implemented yet. However, looking at the naming convention (input_data_1 and input_data_1_name) it may suggest that Microsoft will be implementing support for more than one table in the future.
BR, Hilmar

Hilmar Buchta
Fri, 13.05.2016 - 14:04

Hi Rick, good question. I think it's not an easy task. Of course you can export certain aspects of your model as a data.frame object, for example,

data.frame(MyLinearRegModel$coefficients)
or
data.frame(MyLinearRegModel$residuals)

However, since you can only return one data frame this wouldn't be of much help.

Another option could be to serialize the model to a non-binary form, for example as JSON and return this information to SQL Server (as output variable or wrapped in a varchar(max) data set):

library(jsonlite)
json<-serializeJSON(MyLinearRegModel, digits=8, pretty = F)

This allows you to use SQL Server 2016 JSON support functions to parse the model content.
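
A minimal sketch of such a JSON round trip (using the built-in cars data set and a plain lm model as a stand-in for MyLinearRegModel):

```r
library(jsonlite)

# A small linear model as a stand-in for MyLinearRegModel
fit <- lm(dist ~ speed, data = cars)

# Serialize the model to JSON text and restore it again
json <- serializeJSON(fit, digits = 8, pretty = FALSE)
restored <- unserializeJSON(json)

# The restored model carries the same coefficients
coef(restored)
```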

Hope this helps.

Rick B.
Tue, 10.05.2016 - 23:19

Hilmar - Thank you for putting together these great examples of R integration with SQL Server 2016. I've followed most of this, and the examples that are starting to be written about this great new capability. I have a question I haven't seen addressed yet. Is it possible to send the output of the linear regression formula and its associated t-statistics, R-squared, etc., such that it can be captured in SQL? What I'm talking about is this. I have a trained model, and if I say:
print(MyLinearRegModel);
it will output the intercept and coefficients to the SSMS message box - but I can't seem to send it as output_data from the sp_execute_external_script proc. Information about my model, like this:

Coefficients:
(Intercept) Age BodyMassIndex Diabetes Smoker HighBP
284.936 3.014 2.658 399.036 180.224 138.344

And: print(summary(MyLinearRegModel));

Residuals:
Min 1Q Median 3Q Max
-195.85 -31.99 -0.20 31.81 532.46

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 284.93596 2.26329 125.89 <2e-16 ***
Age 3.01370 0.01874 160.82 <2e-16 ***
BodyMassIndex 2.65815 0.06675 39.82 <2e-16 ***
Diabetes 399.03566 1.58657 251.51 <2e-16 ***
Smoker 180.22435 0.72264 249.40 <2e-16 ***
HighBP 138.34384 1.16915 118.33 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.53 on 24400 degrees of freedom
Multiple R-squared: 0.894, Adjusted R-squared: 0.894
F-statistic: 4.117e+04 on 5 and 24400 DF, p-value: < 2.2e-16

...is what I'm trying to get back (in SQL) to be parsed and stored in a table - along with that varbinary field that holds the trained model itself.

Is there a technique you're aware of to do this?

Rick B.
Wed, 11.05.2016 - 02:14

Hilmar - I did a little more research on this - and someone at StackOverflow pointed me in this direction:

declare @rx_model varbinary(max) = (select model from dbo.My_model)
exec dbo.sp_execute_external_script
@language = N'R',
@script = N'require("RevoScaleR");
Mymodel <- unserialize(rx_model);
OutputDataSet <- data.frame( paste(names(Mymodel$coefficients), Mymodel$coefficients, sep=''='', collapse = " "));',
@input_data_1 = N'',
@input_data_1_name = N'',
@output_data_1_name = N'OutputDataSet',
@params = N'@rx_model varbinary(max)',
@rx_model = @rx_model
with result sets (( OutputDataSet nvarchar(max) ))

The critical line is at:
(Mymodel$coefficients), Mymodel$coefficients, sep=''='', collapse = " "));',

where apparently we can access fields within the model object. We were pulling the coefficients out, but I'm now going to find out whether we can get all the other parameters of a regression model output from sp_execute_external_script.

I hadn't seen this done in any of the samples, so I wanted to share in the spirit of the help you've provided. I think this is what ram_a was looking for too.

ram
Thu, 25.02.2016 - 19:12

Hi, Hilmar!
How are you? How is your OLAPping going?

Here is a little question for you:

if in SQL+R practice I want to use temporary tables, their names must start with #, and I have the following situation:

declare @inquery ..... as SELECT something from #TABLE
... then
exec sp_execute_external_script
@language = N'R'
, @script = N' ...
where I have to point to the data source - and you see I cannot do that:

linModel <- lm(y ~ x + z, data = #TABLE(????????));

.....and so on and on. R doesn't allow me that pleasure.

What do you think I should do when working with temporary tables as a source for R procedures?

Waiting for your response with hope....
Ram

ram_a
Fri, 22.01.2016 - 17:48

Dear Hilmar,

I would be very glad to get your suggestion about what happened with my RTools case. You know that SQL Server 2016 only works with R 3.2.2, and that is the version I installed (although, from my experience, some things for 3.2.3 can work too).
Or you don't want to be my friend anymore??
O, mon dieu, what should I do...

ram_a
Thu, 28.01.2016 - 18:36

Thank you, Hilmar, for your response. I will try to follow your recommendations about RTools.

ram_a
Wed, 13.01.2016 - 23:41

By the way, doing 'search()' in R I saw all packages there (ggplot2 is there too, and, for instance, "lattice" is among the others), and if I use the 'library(lattice)' command then everything is OK; SQL doesn't scold me.

ram_a
Wed, 13.01.2016 - 23:49

Yes, I executed 'library(ggplot2)' in RRO first thinking that it can help -
it doesn't! donnerwetter...

Hilmar Buchta
Thu, 14.01.2016 - 12:35

This sounds strange. First, you don't have to load the library in the RRO GUI before you can call it from SQL Server. This is only necessary to install the library and to check that it is available. SQL Server runs its own R session, separate from the GUI, and you have to load packages for each session.
One possible reason would be, that you have two separate installations of R and RRO GUI uses one while SQL Server uses the other.
Also, when installing e1071 in RRO I had to install Rtools (from https://cran.r-project.org/bin/windows/Rtools/) first. Maybe you can check this too.

ram_a
Thu, 14.01.2016 - 17:45

I did everything seriously - running R as administrator, and the command has the lib specification (without it, ggplot2 would be installed in my local place, not in the R library):
install.packages("gggplot2",lib="C:/Program Files/RRO/RRO-3.2.2-for-RRE-7.5.0/R-3.2.2/library") but if I try to load the package through the external script (having the line library(ggplot2) there) then I get:

....
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called 'gtable'
In addition: Warning message:
package 'ggplot2' was built under R version 3.2.3
Error: package or namespace load failed for 'ggplot2' ....

O, mein gott, what I can do here???

Hilmar Buchta
Thu, 14.01.2016 - 17:53

Sounds tricky, please try this:
Install Rtools, then run
install.packages('ggplot2', dep = TRUE)
from the R shell. Hope it helps.

ram_a
Tue, 19.01.2016 - 16:01

Thanks for your suggestions,
here is the case:

> install.packages("RTools")
Installing package into ‘C:/Users/MyName/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
Warning message:
package ‘RTools’ is not available (for R version 3.2.2)
> library(RTools)
Error in library(RTools) : there is no package called ‘RTools’

Hilmar Buchta
Wed, 06.01.2016 - 10:31

Hi,
Did you try to change the line

lonMod <- lm(y ~ pretest + presdays, data = InputDataSet);

to

lonMod <- lm(y ~ pretest + presdays, data = model1);

after removing the --?

ram_a
Wed, 06.01.2016 - 16:43

Thank you very much, Hilmar, that was a good point.

May I ask you the following: I continue to implement your approach, and I have a table of the data mentioned before (it is an analog of your iris_data):

SELECT
[presentDays] as presdays
,[fcat_Science_SS_FY15] as y
,[PreTest Score] as pretest
INTO model1
FROM [SOMETHING...]
WHERE [fcat_Science_SS_FY15] IS NOT NULL

So, I use a stored proc

CREATE PROC generator_model
AS
BEGIN

DECLARE @inquery nvarchar(max) = N'SELECT
presdays,y,pretest
FROM model1
'
print @inquery
exec sp_execute_external_script
@language = N'R'
, @script = N'
lonMod <- lm(y ~ pretest + presdays, data = model1);
summary(lonMod);

fcat <- data.frame(model=as.raw(serialize(lonMod, NULL)));
'
,@input_data_1 = @inquery
,@input_data_1_name = N'model1'
, @output_data_1_name = N'fcat'
WITH RESULT SETS ((model varbinary(max)));
END;
then I make a table for the serialized model:

CREATE TABLE [dbo].[model2](
[model] [varbinary](max) null
)
GO
insert into model2 exec generator_model

and, at last, make the final step: prediction! (Of course, I installed the e1071 package before doing this step, so I skipped the library(e1071) command - by the way, if I leave this command in the script then SQL 2016 quarrels badly)

declare @model varbinary(MAX)
select @model=model from model2

EXEC sp_execute_external_script
@language = N'R'
, @script = N'

model <- unserialize(as.raw(reg_model))
pred <- predict(model, model1)
result <- cbind(model1, pred)'

, @input_data_1 = N'SELECT
presdays, y , pretest
from model1'
, @input_data_1_name = N'model1'
, @output_data_1_name = N'result'
, @params = N'@reg_model varbinary(MAX)'
, @reg_model = @model
WITH RESULT SETS ((y int , presdays int, pretest int, Y_Predicted float))

I cried, I never saw anything like this before.

But the prediction looks very suspicious - I cannot insert a picture, but it is obvious that there is a systematic bias in predictions, here is a little part of the results:
y presdays pretest Y_predicted
184 193 36 198.214...
184 207 30 195.126...
184 224 23 191.524...
184 206 50 205.420...
184 204 40 200.273...
184 220 50 205.420...
184 194 50 205.420...
NULL 208 46 NULL
184 215 63 212.111...
182 200 70 215.138...
184 200 70 215.714...

and so on - as you can see, the variable "y" doesn't have big volatility.
The regression must have an intercept (a constant in the regression formula). Do you think this overestimation comes from some other reason?

ram_a
Wed, 06.01.2016 - 16:54

And I am ashamed to ask, but how do I save these results into a SQL table?

Hilmar Buchta
Wed, 06.01.2016 - 17:09

It's difficult to understand the fit without having the data. Maybe a linear regression model doesn't work well in the shown subset of the data. If you run the script from within R (for example RStudio) you can see the parameters for the linear model, calculate accuracy or plot the results.
Also library(e1071) tells the script to use the library, not to install the library. If you get an error with loading the library you should check this too, i.e. make sure that within R install.packages('e1071') results in no errors.
In order to save the values to a SQL table you can use this syntax:
INSERT INTO my_result_table Exec sp_execute_external_script ...
Hope this helps.

ram_a
Wed, 06.01.2016 - 17:15

I apologize for the last question. Please, disregard it, kindly. I was too affected.

ram_a
Wed, 06.01.2016 - 17:17

Dear Hilmar,
Thank you very much for your kind responses!

ram_a
Wed, 06.01.2016 - 17:39

Philosophical afterword, if you don't mind. I have every (!), every "y" overestimated. But my two-dimensional regression model just means placing a plane through the three-dimensional cloud of points; by the way, the plane must pass through the gravity center of the cloud, i.e., some points must be under and some above the plane. But all I have are above the plane. Yes, it is my problem.
But, again, thank you!

ram_a
Fri, 08.01.2016 - 20:26

Hi, Hilmar!
Now I know what happened - and this is good to be aware of - I (wrongly!) changed the order of the results because of my senile marasmatic brain dysfunction. Thank you again, it was nice to learn from you.

ram_a
Mon, 11.01.2016 - 17:33

HI, Hilmar!
You wrote: "...If you run the script from within R (for example RStudio) you can see the parameters for the linear model, calculate accuracy or plot the results..."
Does it mean that there is no hope to extract this info as tables to sqlserver2016 after executing Rscript?

Hilmar Buchta
Mon, 11.01.2016 - 18:32

Actually, you can pass most of the information as a data frame to SQL Server. You can even serialize and return a plot from your script call. However, in the R GUI or RStudio the process is more interactive: you can easily try different parameters, try transformations, use plots, etc. to build up your "final" script, which you can then call from within SQL Server (more static). In other words, you can return this information to SQL Server, but the process would more likely be to build the script in an interactive environment and then deploy it to SQL Server for production.

ram_a
Wed, 13.01.2016 - 23:10

Thank you, Hilmar,

here is an illustration for your words:

DECLARE @inquery nvarchar(max) = N' SELECT
X, Y
FROM DATTA
'
print @inquery
exec sp_execute_external_script
@language = N'R'
, @script = N'

image_file = tempfile();
jpeg(filename = image_file, width=500, height = 700);
print(plot(DATTA$X, DATTA$Y, data = DATTA, color = I("red"), size =I(4),
xlab = "X", ylab = "Y",
main = "X vs. Y"));
dev.off();
OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6));
'
, @input_data_1 = @inquery
,@input_data_1_name = N'DATTA'
with result sets ((plot varbinary(max)));

This returns the plot as a varbinary result set (of course, we should choose X, Y and DATTA appropriately). As I understand it, by putting this code in a stored procedure we can write the results to a FileTable.

But I have one strange predicament: I already mentioned that putting 'library(e1071)' in the code caused a bad response from SQL - it says that there is no e1071. The same happens here: I wished to use qplot and for that purpose installed the 'ggplot2' package in my revolutionary RRO (64-bit architecture) and loaded it - but when I write the line library(ggplot2), SQL 2016 says:

An external script error occurred:
Error in library(ggplot2) : there is no package called 'ggplot2' and on and on...
I am so frustrated... Do you know what it can be?

ram_a
Tue, 05.01.2016 - 23:45

Dear Hilmar,
trying your recipes I got the following.
The proc below must make a model (using linear regression); there are three variables (one dependent, two independent), which are the columns of the table model1. But I don't understand the following: if I run this:

CREATE PROC generator_model
AS
BEGIN

DECLARE @inquery nvarchar(max) = N'SELECT
presdays,y,pretest
FROM model1
'
print @inquery
exec sp_execute_external_script
@language = N'R'
, @script = N'
lonMod <- lm(y ~ pretest + presdays, data = InputDataSet);
summary(lonMod);

fcat <- data.frame(model=as.raw(serialize(lonMod, NULL)));
'
,@input_data_1 = @inquery
---,@input_data_1_name = N'model1'
, @output_data_1_name = N'fcat'
WITH RESULT SETS ((model varbinary(max)));
END;

then the commands complete successfully, and I can make a table for the serialized model:
CREATE TABLE [dbo].[model2](
[model] [varbinary](max) null
)
GO
insert into model2 exec generator_model

But if I remove the comment sign "---" before @input_data_1_name, SQL starts quarreling:
SELECT
presdays,y,pretest
FROM model1

Msg 39004, Level 16, State 20, Line 5
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 1, Line 5
An external script error occurred:
Error in is.data.frame(data) : object 'InputDataSet' not found
Calls: source ... -> model.frame.default -> is.data.frame

Error in ScaleR. Check the output for more information.
Error in sqlSatelliteCall() :
Error in executing R code: transform function
Calls: sqlSatelliteCall -&gt; .Call
Execution halted

(0 row(s) affected)

I don't understand that. But I believe you can. Please, explain what is happening in this situation...
