Predicting Daily Demand via a Random Forest Regression

Picture a scenario where the head of marketing asks whether they should launch a promo campaign next week in a specific city, to increase demand and hit the monthly target.

If you have built a predictive model that can forecast demand accurately, you can reply with an informed answer.

You have the following data points to help give this informed answer:

1. Weekday [Monday to Sunday]

2. Day period [Early morning to Overnight]

3. Holiday [1/0]

4. Continuous weather variables [Temperature, humidity, wind speed and precipitation]

Demand is affected by an array of factors, some of which are quite hard to find and/or model for. The more influential independent variables we can find, the better; a simple linear regression or an ARIMA time series alone won't achieve an accurate prediction of demand.

Algorithms: fitting candidate regressions in R to find the best model.
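First, set up the training and test sets. The sketch below is illustrative only: the simulated sales data frame, its column values and the 80/20 split are assumptions, not the actual dataset.

# Illustrative setup: simulate a sales data frame and split it 80/20
set.seed(42) # reproducible simulation and split
n <- 1000
sales <- data.frame(
  sale_id     = rpois(n, lambda = 50), # simulated daily demand counts
  weekday     = factor(sample(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"), n, replace = TRUE)),
  day_period  = factor(sample(c("Early morning","Morning","Afternoon","Evening","Overnight"), n, replace = TRUE)),
  holiday     = sample(0:1, n, replace = TRUE),
  temperature = rnorm(n, mean = 15, sd = 8),
  humidity    = runif(n, min = 30, max = 100),
  wind_speed  = runif(n, min = 0, max = 40)
)
train_idx <- sample(nrow(sales), 0.8 * nrow(sales)) # indices of the training rows
trainset  <- sales[train_idx, ]
testset   <- sales[-train_idx, ]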

Multiple Linear Regression:

lm.outbound <- lm(sale_id ~ weekday + day_period + holiday + temperature + humidity + wind_speed, data = trainset) # fitting the linear regression

Random Forest:

library(randomForest) # load the randomForest package
rf.outbound <- randomForest(sale_id ~ weekday + day_period + holiday + temperature + humidity + wind_speed, data = trainset) # fitting the random forest regression
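With both models fitted, we can score the hold-out set and compare errors. This sketch assumes the testset from the setup above and the rmse() and mae() helper functions defined later in this post.

# Compare the two models on the hold-out set
pred.lm <- predict(lm.outbound, newdata = testset) # linear model predictions
pred.rf <- predict(rf.outbound, newdata = testset) # random forest predictions
rmse(testset$sale_id - pred.lm) # linear model RMSE
mae(testset$sale_id - pred.lm)  # linear model MAE
rmse(testset$sale_id - pred.rf) # random forest RMSE
mae(testset$sale_id - pred.rf)  # random forest MAE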

Through testing the algorithms, we might find that the Random Forest regression is the most accurate, with the lowest MAE and RMSE. Duke University offers the following guidance for choosing the correct regression model:

If there is any one statistic that normally takes precedence over the others, it is the root mean squared error (RMSE), which is the square root of the mean squared error. When it is adjusted for the degrees of freedom for error (sample size minus number of model coefficients), it is known as the standard error of the regression or standard error of the estimate in regression analysis or as the estimated white noise standard deviation in ARIMA analysis. This is the statistic whose value is minimized during the parameter estimation process, and it is the statistic that determines the width of the confidence intervals for predictions. It is a lower bound on the standard deviation of the forecast error (a tight lower bound if the sample is large and values of the independent variables are not extreme), so a 95% confidence interval for a forecast is approximately equal to the point forecast “plus or minus 2 standard errors”–i.e., plus or minus 2 times the standard error of the regression.
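To make the "plus or minus 2 standard errors" rule concrete, here is a small R illustration on R's built-in mtcars data; the model is a placeholder, not the demand model above.

# The "plus or minus 2 standard errors" rule on a placeholder model
fit   <- lm(mpg ~ wt, data = mtcars)
s     <- summary(fit)$sigma # standard error of the regression
new   <- data.frame(wt = 3.0)
point <- as.numeric(predict(fit, newdata = new)) # point forecast
c(lower = point - 2 * s, upper = point + 2 * s) # approximate 95% interval
# The exact interval, which also accounts for parameter uncertainty:
predict(fit, newdata = new, interval = "prediction", level = 0.95)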

However, there are a number of other error measures by which to compare the performance of models in absolute or relative terms:

  • The mean absolute error (MAE) is also measured in the same units as the data, and is usually similar in magnitude to, but slightly smaller than, the root mean squared error. It is less sensitive to the occasional very large error because it does not square the errors in the calculation.

The regression output will not usually calculate these for you, so you will need to compute them yourself.

Here is code to calculate RMSE and MAE in R.

RMSE (root mean squared error), also called RMSD (root mean squared deviation), and MAE (mean absolute error) are both used to evaluate models by summarizing the differences between the actual (observed) and predicted values. MAE gives equal weight to all errors, while RMSE gives extra weight to large errors.

First, define two helper functions:

# Function that returns the Root Mean Squared Error
rmse <- function(error) {
  sqrt(mean(error^2))
}

# Function that returns the Mean Absolute Error
mae <- function(error) {
  mean(abs(error))
}
# Example data
actual <- c(4, 6, 9, 10, 4, 6, 4, 7, 8, 7)
predicted <- c(5, 6, 8, 10, 4, 8, 4, 9, 8, 9)
# Calculate error
error <- actual - predicted
# Example of invocation of functions
rmse(error)
mae(error)
# Example in a linear model
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
rmse(lm.D9$residuals) # root mean squared error
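The companion MAE call, using the mae() function defined above, works the same way:

mae(lm.D9$residuals) # mean absolute error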


Other recommended ways to compare models:

After fitting a number of different regression or time series forecasting models to a given data set, there are many criteria by which they can be compared:

  • Error measures in the estimation period: root mean squared error, mean absolute error, mean absolute percentage error, mean absolute scaled error, mean error, mean percentage error (two of these are sketched in code after this list)
  • Error measures in the validation period (if you have done out-of-sample testing): Ditto
  • Residual diagnostics and goodness-of-fit tests: plots of actual and predicted values; plots of residuals versus time, versus predicted values, and versus other variables; residual autocorrelation plots, cross-correlation plots, and tests for normally distributed errors; measures of extreme or influential observations; tests for excessive runs, changes in mean, or changes in variance (lots of things that can be “OK” or “not OK”)
  • Qualitative considerations: intuitive reasonableness of the model, simplicity of the model, and above all, usefulness for decision making!
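Two of the scale-free measures above can be written as one-liners in the same style as rmse() and mae(). This sketch reuses the actual and predicted vectors from the earlier example; the MASE here scales by a naive one-step-ahead forecast on the actuals.

# Mean absolute percentage error (actual values must be non-zero)
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Mean absolute scaled error: MAE scaled by the MAE of a naive
# one-step-ahead forecast on the actuals
mase <- function(actual, predicted) {
  mean(abs(actual - predicted)) / mean(abs(diff(actual)))
}

mape(actual, predicted)
mase(actual, predicted)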

With so many plots and statistics and considerations to worry about, it’s sometimes hard to know which comparisons are most important. What’s the real bottom line?

Why Python?

Knowing Python is a hot data science skill on all the job boards. It sits brightly beside SQL and R as a need-to-have, but why is it so popular? And as a Marketing Data Scientist, how will it help me crunch statistics and other maths questions?

Python.org gives us an executive summary, but it is vague, more of a quick marketing promo. Searching elsewhere, one of the top results is from DataCamp: a blog post that gives a good starting point, listing resources for 40+ topics, including my personal favourite, Statistics.

I’ll let you know how I progress!


How to become a Marketing Scientist

Need-to-know skills and subject matter, from Viviane Meeren (PhD student).

Data science is a large field covering everything from data collection and cleansing to standardization, analysis, visualization and reporting. Marketing and CRM (customer relationship management) science is part of this larger field, employed by companies to deliver a higher return on marketing investment (ROI).

Be it for online or offline marketing campaigns, keeping existing customers engaged or acquiring new customers, this field could save companies millions on their marketing spend by employing simple models like segmenting their customer base by recency, frequency and value (RFV), as sketched below. More complex models could save much more and give high-spend companies an edge over their competitors.
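To show how simple an RFV segmentation can be, here is a sketch in R; the transactions table and its columns (customer_id, order_date, order_value) are hypothetical.

# RFV scoring sketch with dplyr (hypothetical transactions table)
library(dplyr)
snapshot_date <- as.Date("2018-12-31") # scoring date (assumed)
rfv <- transactions %>%
  group_by(customer_id) %>%
  summarise(
    recency   = as.numeric(snapshot_date - max(order_date)), # days since last order
    frequency = n(),                                         # number of orders
    value     = sum(order_value)                             # total spend
  ) %>%
  mutate(
    r_score = ntile(-recency, 5),  # quintile scores: 5 = most recent
    f_score = ntile(frequency, 5), # 5 = most frequent
    v_score = ntile(value, 5)      # 5 = highest spend
  )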

Popular models for customer acquisition and retention include:

  • Lifetime value (LTV)
  • Marketing Mix Modelling and Attribution
  • Retention (likelihood of a customer to churn and time to next purchase)
  • Next Best Action, recommendation engines

Whilst there are many statistical methods that could be employed for each model, some more accurate than others, econometric regression modelling, such as logistic regression, linear regression and survival analysis, stands out for its simplicity and ease of application. With these models, a marketing data scientist can score their company's customer database and quickly action the results across numerous campaigns.
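For instance, scoring a customer database for churn with a logistic regression takes only a few lines; the customers data frame and its columns are assumptions for illustration.

# Churn scoring sketch with logistic regression (hypothetical columns)
churn.model <- glm(churned ~ recency + frequency + value,
                   family = binomial, data = customers)
customers$churn_score <- predict(churn.model, type = "response") # predicted churn probability
head(customers[order(-customers$churn_score), ]) # most at-risk customers first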

In 2018, marketing data scientists are not only experts in modelling but also in extracting and shaping complex datasets, as well as visualising the data. The core skills include SQL (e.g. for Hadoop and Teradata), R, Python or SAS for modelling and statistics, and Tableau and Salesforce for visualisation and reporting.

In my blog, I will explore different areas related to marketing data science, as well as discussing new marketing technologies. Stay tuned for more.