Part 1: Predictive Modeling using R and SQL Server Machine Learning Services

To R or not to R?

A few months ago, I asked myself an important question – which language to learn first – R or Python? From my research, I found that R is regarded as more old-school and difficult to learn, but Python is more popular. It made perfect sense to learn R first 🙂

For the uninitiated, R is a programming language that makes statistical and mathematical computation easy, and is useful for machine learning/predictive analytics/statistics work.

Along the way, I found the following courses useful.
https://www.edx.org/course/introduction-to-r-for-data-science
https://www.edx.org/course/programming-in-r-for-data-science

The goal has always been to explore Machine Learning Services in SQL Server, and dive deeper thereafter. This 3-part series is a walk through/review of Microsoft’s tutorial on Predictive Modeling using R and SQL Server. The first part deals with preparing data, training a model and using it for prediction.

What is Predictive Modeling?

Predictive Modeling uses statistics to predict outcomes.

In our example, we will use Machine Learning Services for SQL Server 2017 to predict number of rentals for a future date in a ski rental business. This brings up an important question – why do we want to predict? In this case, prediction will help the business be prepared from a stock, staff and facilities perspective. Prediction has numerous, life-changing applications. Read about application of AI in predicting cardiovascular disease by Microsoft for Apollo Hospitals, India.

For the ski rental prediction, we will use test data provided by MS, SQL Server 2017 with Machine Learning Services, and R Studio IDE. Please check the following URL and follow the simple instructions to set up your environment. If you have any question, please use Comments section to drop me a note.
https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction

PredictiveModeling-Steps1

Steps involved in building the predictive model in SQL Server

  1. Getting data
  2. Preparing data
  3. Training models
  4. Comparing results and choosing a model
  5. Deploying the Machine Learning script to SQL Server

1. Getting Data: Okay, let’s get started. In the first step, we will restore sample database – TutorialDB – using the database backup file provided by MS. After restoring the database using SSMS, we will take a look at rental data we will use for training the model.

All scripts used will be provided in the Resources section at bottom of the page.

002_RestoreSampleDatabase-TutorialDB

003_ExamineTrainingData

You will see that we have 453 rows of rental stats. Data is in place, so we are good to move on to Step 2.

2. Preparing Data: Now that database is restored and data available in SQL Server, we will load data to R and transform it. Open R Studio and execute the rental data load script. After loading data, we will examine a few rows/observations and inspect data types. We will then proceed to change types of a few columns to factor.

004_loadrentaldatafromsqlserver.jpg

005_DataPreparation

3. Training Models: Now that data is prepared, we will chose a model that best describes dependency between variables in our dataset. During training, we provide the variables along with the outcome so that our model can train to predict the outcome. Here, we will compare predictions by two different models and choose the more accurate one as our predictive model.
For clarity, let me state that we are trying to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.

The challenge in Machine Learning is in knowing what various models mean, and when a particular model might be more suitable. MS recommends this cheat sheet as a guide.

006_SplitDatasetToTrainingAndTest

007_TrainingUsingLinearRegressionAndDecisionTreeModels

008_RentalDataPrediction

Comparing results: Here, we compare results to figure out which model predicted more accurately. Decision Tree performed better in this case and we will use the model to deploy our Machine Learning Solution to SQL Server in Part 2

009_RentalDataPlotDifferencePredictedAndActual

Resources:

Scripts for Part 1 (zip file): https://drive.google.com/file/d/1Tzs4qzFXXL-NgxFlKQcuVHKwx90Imr8u/view?usp=sharing
SQL Server R tutorials: https://docs.microsoft.com/en-us/sql/advanced-analytics/tutorials/sql-server-r-tutorials?view=sql-server-2017
Gitghub repo for rental prediction: https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction/
Machine Learning cheat sheet, use with caution 🙂 https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice#the-machine-learning-algorithm-cheat-sheet

6 thoughts on “Part 1: Predictive Modeling using R and SQL Server Machine Learning Services

  1. Great post, Arjun, very clear explanations. In the section ‘2. Preparing Data’ you mention ‘We will then proceed to change types of a few columns to factor’. But in the screenshot I don’t see the R commands that do this. What were those commands ?

    Thanks again.

    • Thanks for your comment LondonDBA. Well spotted indeed, I missed that one! I have now added the screenshot, and included the corresponding script in the zip file in Resources.

      The command is factor(). You would use it like factor(rentaldata$WeekDay). One of the reasons I like R is that it adds another dimension to how we look at data. R is full of such handy little methods 🙂

  2. […] Part 1 of Predictive Modeling using R and SQL Server Machine Learning Services covered an overview of Predictive Modeling and the steps involved in building a Predictive Model. Using our sample dataset – Ski Resort rental data – we wanted to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow. […]

Leave a comment