Innovation Lab: Data Scientist & Predictive BI
The ORAYLIS Innovation Lab is an Institution within our company that focuses on new technologies and methods. Each Lab is centered one new topic and together with interested colleagues we try to evaluate the Business Intelligence value and to establish use cases for our customers.
2) The language “R” together with R Studio and the Tableau integration
3) The new Azure ML cloud service
As a test scenario we predicted the survival chances for passengers of the Titanic disaster following the Kaggle competition “Titanic: Machine Learning from Disaster” , which is good starting point even for people who may not have a lot experience in data science and machine learning. The provided test- and training- data for the case is highly structured and there are tutorials available like Excel (Pivot) or several for R. What is also very helpful is that Kaggle offers within its competition an evaluation part, which is nice if you want to compare several algorithms and their related predictions – you just have to upload a csv file and Kaggle tells you if you have improved your latest submission and on which rank you are on the overall competition.
At first Lab participants did go with the Excel (Pivot) tutorial to get a quick understanding of the data and the process of competing within kaggle.
The provided train.csv already has the column to predict “survived” – with it you can find your approach, because you could directly evaluate the effect of a hypothesis of influences of the provided passenger attributes and their potential clusters.
In Excel this train.csv looks like this:
Working on this data with the Excel pivot technique quickly shows a major pattern:
You could have guessed it: almost 75% of the females survived and only 19% of males. From this you already can make your first prediction based on this pattern.
So now it’s time for feature exploration and feature engineering. It’s all about finding the key influencers in your data relevant for the prediction. Among the sex of the passenger (male/female), which attributes have impact and in what attribute cluster the impact is the largest? For example is the age of the passenger important or the name? The full name, the last name or just the title?
This is the part of data science which has the most impact on the prediction result. You could use the most modern algorithm with not good prepared data versus an old algorithm with excellent base data – guess how wins.
And then comes the part with the necessary warning: be careful when using data mining in real life – mining algorithms will ALWAYS give an output, even if you provide case non-relevant data! For a good prediction of a case it takes all three disciplines for you to master: Hacking Skills, Math & Statistics Knowledge and Substantive Expertise.
Substantive Expertise for the Titanic case can be found online easily. For example the route the ship did take to understand the sequence of embarking.
Or the position of the decks within the ship.
Yes – this is still an article about predictive analytics, but one can’t stress enough the fact that it is NOT enough sufficient to just use a state of the art machine learning tool. Substantial Prediction is focused around the understanding of the case and equivalent preparation of data.
All 3 evaluated Prediction enviroments supply some help for feature- exploration and engineering.
First let’s have a look at the good old Excel with the Data Mining Plug-in, and a first look at the dataset without feature engineering:
Several original attributes seem to be relevant: Sex, Passenger-Class, the Fare the passenger paid, the number of siblings and even the harbour where they embarked. The passenger name is not listed.
Then comes the feature exploration and –engineering. E.g. not all attributes where filled for every passenger, e.g. the age. Is this hole a relevant info or just missing data and can it be filled derived from the other passengers?
What other attributes can be derived from the existing ones? Let’s see what we can do with the passenger name:
You can split the Fullname in Lastname, Title and Prenames and even count the number of Prenames.
From the Title you could derive if the passenger is a mother and so on. We now have several new attributes, which ones should we keep – again the Key Influencer Analysis can help find the dataset most relevant for the prediction of survival (survived=1). Excel is very handy for this kind of tasks as most of us are very used to working with Excel and its Formulas.
So let’s have a look at “R”. Of course this kind of tasks are supported there too, here is a influencer analysis based on R.
- # Survival Rate based on Passenger Classes
- Pclass_survival <- table(trainData$Survived, trainData$Pclass)
- barplot(Pclass_survival, xlab = "Cabin Class", ylab = "Number of People",
- main = "survived and deceased between male and female")
- Pclass_survival / (Pclass_survival + Pclass_survival)
- Pclass_survival / (Pclass_survival + Pclass_survival)
- Pclass_survival / (Pclass_survival + Pclass_survival)
R is very adaptable with lots of plugins and therefore is very feature-rich. But the scripting approach of R is not as easy as using tools like Excel or Azure ML. On the other hand you will find the function you would like to have somewhere. Here and complete overview of the impact of one attribute to the others:
scatterplotMatrix(train, plot.points=F, diagonal="histogram")
This view gives you also hints how you could potentially cluster your attributes – e.g. take a look at the bar chart for the Age. Often clustered attributes helps the mining algorithms to not take irrelevant details into account.
There is also an R integration with the BI Tool Tableau. While Tableau is already very useful for (visual) data exploration it can also be connected with R, but the interface is up to now quite limited.
Tableau can only integrate R through calculated fields using the SCRIPT_* functions which can address function calls in R. You can’t exchange data in any other way, so if you use R for your enhancements of your dataset, you would have to reimport the dataset again traditionally.
Azure ML is different than the already mentioned tools for being completely cloud-based, which means you do not have to install anything locally, so called experiments can easily be shared with colleagues or can be published with one click e.g. to the public available azure market place, and just set a price per usage. In comparison to the on premise solution with Analysis Services and the Mining Algorithms in Azure ML are more modern, than the classic ones in SQL Server. The algorithm names are quite as fancy as product names of Starbucks coffees – while in Analysis Services you talk about a “Decision Tree” in Azure ML you have for example a “Two-Class Boosted Decision Tree”. Of course you also have R integration where you could execute R Scripts within your Azure ML “Experiment”.
What does Azure ML provide for feature- exploration and engineering? For example a good overview of the data with standard evaluations:
There is also a task to evaluate the key influencer attributes and of course several possibilities to engineer or derive attributes - “Custom Columns” - or to fill gaps in the data “Missing Values Scrubber”. The whole handling is quite equivalent to working with Microsoft SQL Server Integration Services:
Back to the Titanic case: in the end it turns out that the two major attributes influencing the outcome of survival are: the title and the passage fee. You may wonder why the sex is left behind, but indeed the sex is in the title as well as if a female is a mother and so on.
Now that you have your prepared dataset for the case it comes to the prediction part. Here all 3 environments offer to several mining algorithm with some differences. In Analysis Services and the potentially connected remote control in form of the Excel Data Mining Plug-In you are bound to 9 algorithms – the classic ones like: neuronal net, association rules, decision tree, naïve bayes and so on. Up to date there is no possibility to enrich with other algorithms e.g. with an R implementation, like Azure ML can do. In R itself you are very flexible and can import lots of different libraries, but still you bound to scripting. Let’s compare the algorithm model viewer for the decision tree:
In Excel / Analysis Services it looks like this:
And in R it can look like this:
And in Azure ML…. up to now there is no model viewer available, but Microsoft announced to deliver them.
In the Prediction process you would now try different algorithms and then compare them to find the best fit for your prediction problem – this is usually done in a Lift Chart. Because all three tools provide this kind of evaluation I only show the Analysis Service / Excel Plug-In version:
When you have found the best fitting algorithm then you want to publish your model to production. For transactional request of your model, e.g. fraud detection of a transaction it is nowhere easier than in Azure ML, it is just one button to publish your “Experiment” as a Web Service.
And another switch to push it to production, from where it can be made generally available in the Azure Market Place and set a price per usage just as easy.
Of course this means the usage is via the cloud. For usage On Premise the SQL Server Analysis Services would be preferable while with R you are very flexible but still this means more complexity in finding the right stack of libraries and infrastructure to meet your demands.
To summarize the findings of the comparison, here is an overview in form of a Matrix:
The bottom line is: Azure ML is a huge step forward for Microsoft in the field of Data Mining, Predictive Analysis or Machine Learning. A web based visual modern graphic user interface, strong Algorithms and the possibility to go to production via a one click web service with optional tight integration with the Azure Market Place to commercialize the solution on a pay per use construct levels this environment as a major enrichment for this sector. With the possibilities to integrate (existing) R solutions and to operationalize R as a web service also existing players in the market should evaluate Azure ML. Of course Azure ML is still in preview mode – not fully feature complete – but already a very strong. Take this use case as an example: Sentiment Analysis In Excel With Azure Machine Learning And Power Query
If you would like to know more about Prediction Use Cases, Machine Learning and Azure ML here is a good entry point from Microsoft: Machine Learning for Analysts: The Basics
Try Azure ML for free – no time limits, subscriptions, or credit cards needed – on www.azure.com/ml .
To help you get started, there are several pre-baked sample models that can help you explore some common business scenarios for machine learning:
- Credit risk : predict whether an applicant is a good credit risk
- Customer churn, upsell, and purchase propensity : use customer data to predict customer behavior
- Network intrusion detection : use network feature data to detect which network activities are part of an intrusion or attack
- Customer sentiment analysis : predict how customers will rate a product using plan text review data
If you’re interested in stepping through more detailed machine learning tutorials, explore these videos:
- Getting Started with Azure Machine Learning – Part 1
- Getting Started with Azure Machine Learning – Part 2