BUSINESS

Introduction

This is a report about which model to use when trying to predict the number of social media shares of an article with a topic of business based on a myriad of predictors.

The data to build this report originally comes from Mashable, and includes a variety of data about articles on their site during a two year period. Within this dataset, we have 50 predictors. Three of these predictors are character variables - url of the article (url), the day the article was published (day), and the topic of the article (topic). We won’t include url as a predictor since that seems to be more of an identifier variables rather than an attribute of the article. We also are analyzing each of the topics separately from each other. The rest of the predictors are numeric and include attributes such as how long the article is, the polarity of the article, and how long ago it was published as well as a variety of other things.

To read more about all of the data included in the dataset read here.

In this report we make models that include some of the predictor variables and others that include all of the predictor variables. Some variables we found to be especially helpful in predictions are shown below with their definitions:

LDA_02 - closeness to LDA Topic 2.
LDA_03 - closeness to LDA Topic 3.
num_imgs - Number of Images in the article.
kw_avg_avg - Avg. keyword (avg. shares).
n_tokens_content - Number of words in the article.
average_token_length - Average length of words in the article.
rate_negative_words - Rate of positive words in the article.
rate_negative_words - Rate of negative words in the article.
day - A variable we create to signify what day of the week an article was published.
kw_max_avg - Avg. keyword (max. shares).
title_subjectivity - The subjectivity of the title of the article.

In this report we create the following models:

1 Linear Model with all predictor variables.
2 Linear Models with some selected predictor variables.
1 Logistic Linear Model with all predictor variables.
1 Random Forest Model with some selected predictor variables.
1 Boosted Tree Model with all predictor variables.
1 Boosted Tree Model with some selected predictor variables.

At the end of this report we decide on which model was the best for the business using root mean square error (RMSE). The RMSE is not the only possible measure of how good a model is, but it is a pretty good choice if we want to measure how good our model is at predicting the number of shares. The RMSE is found by finding the root of the sum of the squared Euclidean distance between the predicted value of shares given some values of the predictors and the actual value of shares given those values.

The packages used in this report are:

library(tidyverse)
library(caret)
library(rmarkdown)

# used for randomForests before I run on CV
library(randomForest)

# use this for parallel computing
library(parallel)
# also for parallel computing. I will look to see if it's the same later
library(doParallel)

# To use Kable for good looking data print out
library(knitr)

# For boosted tree
library(gbm)

Reading in the data

Data Processing

Below we read in the data, add new variables, remove old ones, filter specifically for the business topic, and then we go ahead and split the data into the test and train data sets for model training and testing.

# read in raw data
data <- read_csv("OnlineNewsPopularity.csv")

# add in day of week variable called "day"
# check every indicator variable for a 1, once it hits a 1
# the new variable day is defined as whatever day it is
# if no 1 are present saves the day variable as "error".

# This same process takes place with topic.
data <- data %>%
  mutate(
    day = if_else(weekday_is_monday == 1, "monday",
                  if_else(weekday_is_tuesday == 1, "tuesday",
                          if_else(weekday_is_wednesday == 1, "wednesday",
                                  if_else(weekday_is_thursday == 1, "thursday",
                                          if_else(weekday_is_friday == 1, "friday",
                                                  if_else(weekday_is_saturday == 1, "saturday",
                                                          if_else(weekday_is_sunday == 1, "sunday","error"))))))),
    topic = if_else(data_channel_is_lifestyle == 1, "lifestyle", 
                  if_else(data_channel_is_entertainment == 1, "entertainment", 
                          if_else(data_channel_is_bus == 1, "business", 
                                  if_else(data_channel_is_socmed == 1, "socialMedia", 
                                          if_else(data_channel_is_tech == 1, "tech",
                                                  if_else(data_channel_is_world == 1, "world", "other"))))))
    )

# remove old day and topic indicators
data <- data %>%
  select(!starts_with("weekday_is_"), -starts_with("data_channel_is_"))

# change to factor
data$day <- as.factor(data$day)
data$topic <- as.factor(data$topic)


#check structure to make sure is successful 
str(data)

#double check for "errors"
table(data$day)

# these actually don't have a topic!
data %>% filter(topic == "other")

Subsetting Data

Here we are subsetting the data by selecting for those observations that have business as their topic.

filtered_data <- data %>% filter(topic == params$topics) %>% select(!topic, -url) %>%
  mutate(day = factor(day, levels = c("sunday", "monday", "tuesday", "wednesday", 
                                        "thursday", "friday", "saturday")))

set.seed(214)

# Splitting Data set into a 70% training set and 30% for a testing set 
train <- sample(1:nrow(filtered_data), size = nrow(filtered_data)*0.7)
test <- setdiff(1:nrow(filtered_data), train)

# Subsetting the data based on our split
train_set <- filtered_data[train, 1:48]
test_set  <- filtered_data[test, ]

# Log transformation used for one of the linear regression models:
train_log <- train_set %>% mutate(shares = log(shares))
test_log <- test_set %>% mutate(shares = log(shares))

Exploratory Data Analysis

We can start by looking at the general qualities of the response variable, number of shares, for this data set. It’s pretty clear that we have heavily right-skewed data. For example, the boxplot of business’s shares shows that we can’t interpret a majority of the data due to the number of high-value outliers. To better visualize the population, we applied a log transformation to the response variable. We will also use the log-transformed data once more in one of the linear regression models. The log transformation works well with this data because number of shares is always positive and it is heavily right skewed.

# I think this boxplot says a lot, plenty of outliers that just dwarf the
# majority of the shares
  ggplot(data = train_set, aes(shares)) +
  geom_boxplot() +
  labs(x = "Number of Shares", title = "Boxplot of Shares")

# transform the shares variable by log
# makes a huge difference in readability of these graphs
ggplot(data = train_log, aes(shares)) +
  geom_boxplot() +
  labs(x = "Number of ln(Shares)", title = "Boxplot of ln(Shares)")

The boxplot in the top left is a boxplot of the number of shares, and the boxplot on the right represents the distribution of log(shares). Even with the log transformations, we can still see some right skew-ness in number of shares. I have also included a histogram of number of log(shares) faceted by the day the article was published and a scatter plot of number of log(shares) by how long ago they were published. What we must remember when we look at these graphs is how the natural log affects data. For example \(ln(2) \approx 0.69\), \(ln(20) \approx 3.00\), and \(ln(200) \approx 5.29\). The natural log increases by a smaller amount as the input increases.

ggplot(data = train_log, aes(timedelta, shares)) +
  geom_point(aes(color = day)) +
  labs(x = "Days since Article was Published", y = "Number of ln(Shares)",
       title = "Time Since Article Published vs Number of Shares")

There is also a scatter plot showing the the number of shares on the y-axis and the time in days between when the article was published and when the data was collected. I did expect to see a stronger negative correlation, but that doesn’t seem to be the case. We can see a cluster of points between 550 and 600 days, that all have a very small number of shares. There are also plenty of highly shared articles pretty evenly horizontally across time. That seems to indicate that there are just always highly viral articles being shared.

Next, we will take a look at the correlation of all variables for the business topic with shares. We want to know what predictors are most correlated with our response, shares.

# get names of all columns except day because that isn't numeric, and shares as that is the target variable
col_names <- names(filtered_data)[1:46]

# Build a nice looking tibble for checking correlation
for(i in 1:length(col_names)){
  
  #check correlation between shares and other columns
  correlate <- cor(filtered_data$shares,filtered_data[,i])
  
  # save name of other column
  cor_name <- col_names[i]
  
  # create row of a tibble
  correlation_row <- tibble_row(cor_name,correlate)
  
  # if first iteration of loop, correlation_tibble is the correlation_row
  # if not first iteration, correlation_tibble is binded with the new row info
  if(i == 1){
    correlation_tibble <- correlation_row
  }else{
    correlation_tibble <- rbind(correlation_tibble,correlation_row)
  } # end if else statement
  
  } # end for loop
correlation_tibble  

# Add new column for absolute value of correlation and sort by this column in descending order
# To see which variables have the highest impact on 
correlation_tibble <- correlation_tibble %>%
  mutate(abs_cor = abs(correlate)) %>%
  arrange(-abs_cor)


# change names for better printing
names(correlation_tibble) <- c("Variable","Correlation","Absoulte Correlation")

# use kable for good looking print out
kable(correlation_tibble)

Variable	Correlation	Absoulte Correlation
self_reference_min_shares	1.108296e-01	1.108296e-01
self_reference_avg_sharess	1.049782e-01	1.049782e-01
kw_avg_avg	8.721093e-02	8.721093e-02
self_reference_max_shares	7.571042e-02	7.571042e-02
LDA_03	6.443598e-02	6.443598e-02
avg_negative_polarity	-6.208934e-02	6.208934e-02
min_negative_polarity	-6.060169e-02	6.060169e-02
kw_max_avg	5.458650e-02	5.458650e-02
num_videos	5.049253e-02	5.049253e-02
global_subjectivity	4.781618e-02	4.781618e-02
num_hrefs	4.055156e-02	4.055156e-02
kw_avg_min	3.842540e-02	3.842540e-02
kw_min_avg	3.677652e-02	3.677652e-02
kw_max_min	3.625572e-02	3.625572e-02
global_rate_negative_words	3.099482e-02	3.099482e-02
kw_avg_max	2.872173e-02	2.872173e-02
n_tokens_content	2.794650e-02	2.794650e-02
num_imgs	2.754498e-02	2.754498e-02
average_token_length	-2.672607e-02	2.672607e-02
max_negative_polarity	-2.662182e-02	2.662182e-02
num_keywords	2.567395e-02	2.567395e-02
LDA_04	-2.304772e-02	2.304772e-02
LDA_02	-1.826237e-02	1.826237e-02
is_weekend	1.809472e-02	1.809472e-02
abs_title_subjectivity	1.790837e-02	1.790837e-02
num_self_hrefs	1.771858e-02	1.771858e-02
max_positive_polarity	1.747359e-02	1.747359e-02
min_positive_polarity	-1.615307e-02	1.615307e-02
global_rate_positive_words	1.556270e-02	1.556270e-02
abs_title_sentiment_polarity	1.488785e-02	1.488785e-02
rate_negative_words	1.436160e-02	1.436160e-02
title_subjectivity	1.308270e-02	1.308270e-02
rate_positive_words	-1.265746e-02	1.265746e-02
global_sentiment_polarity	-1.265111e-02	1.265111e-02
n_tokens_title	1.067347e-02	1.067347e-02
kw_max_max	9.766739e-03	9.766739e-03
avg_positive_polarity	8.483034e-03	8.483034e-03
timedelta	4.931198e-03	4.931198e-03
n_non_stop_unique_tokens	4.519792e-03	4.519792e-03
title_sentiment_polarity	4.084580e-03	4.084580e-03
kw_min_min	-3.768910e-03	3.768910e-03
kw_min_max	-3.656645e-03	3.656645e-03
LDA_01	-2.460413e-03	2.460413e-03
n_non_stop_words	1.796671e-03	1.796671e-03
LDA_00	-2.872553e-04	2.872553e-04
n_unique_tokens	3.322602e-05	3.322602e-05

From checking the correlation of all variables it appears that the following five variables have the highest correlation with shares are:

self_reference_min_shares
self_reference_avg_sharess
kw_avg_avg
self_reference_max_shares
LDA_03

Lets build some scatter plots to visualize this better! We’ll want to build a function to make plotting easier.

# Function to build scatter plot with a lm smooth line. 
# cor_num corresponds to the place the place in the correlation tibble created earlier,
# so if cor_num = 1, that is the highest absolute value correlation etc.
plot_func <- function(cor_num){
  
# Which column in the filtered_data tibble is this correlation
col_num <- which(names(filtered_data) %in% as.character(correlation_tibble[cor_num,1]))

# Create new tibble with shares and only the column needed for this plot
data_for_plot <- filtered_data %>%
  select(shares,col_num)

# Rename the second column for easy plotting
names(data_for_plot)[2] <- "need"

# Create plot, this will create a scatter plot between the two variables, with a lm geom smooth line included.
# this will also define the labels and titles correctly.
scat <- data_for_plot %>%
  ggplot(aes(x = shares, y = need))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Number of Shares",
       y = str_to_upper(as.character(correlation_tibble[cor_num,1])),
       title = paste0("Visualizing Correlation Between Shares and ",str_to_upper(as.character(correlation_tibble[cor_num,1]))))

# Return the scatter plot
return(scat)
} # end function

Now that we have built the function lets check out these plots!

# Build a plot with the highest absolute value correlation
plot_func(1)

Looking at this plot between the number of shares of an article and self_reference_min_shares we can see that as self_reference_min_shares goes up, the number of times an article is shared generally increases.

# Build a plot with the 2nd highest absolute value correlation
plot_func(2)

Looking at this plot between the number of shares of an article and self_reference_avg_sharess we can see that as self_reference_avg_sharess goes up, the number of times an article is shared generally increases.

# Build a plot with the 3rd highest absolute value correlation
plot_func(3)

Looking at this plot between the number of shares of an article and kw_avg_avg we can see that as kw_avg_avg goes up, the number of times an article is shared generally increases.

# Build a plot with the 4th highest absolute value correlation
plot_func(4)

Looking at this plot between the number of shares of an article and self_reference_max_shares we can see that as self_reference_max_shares goes up, the number of times an article is shared generally increases.

# Build a plot with the 5th highest absolute value correlation
plot_func(5)

Looking at this plot between the number of shares of an article and LDA_03 we can see that as LDA_03 goes up, the number of times an article is shared generally increases.

Now that we have looked at the numeric variables, we want to check out the day variable to see if the day of the week has any effect on the number of times an article was shared. To do this we preformed a Chi-Squared test! However, before we can preform that test however, we are going to create a categorical version of the shares variable called shares_quart that assigns the number of shares to its quartile value, i.e. if an articles shares fall into the first quartile it will be assigned “Q1” etc.

Then we build a contingency table between what day of the week an article was published and what quartile it ends up in. Finally, we preform the Chi-Squared test. We’ve also included a histogram of the number of ln(shares) facetted by which day of the week it was. To be very clear though, the chi-square test is calculated with shares.

ggplot(data = train_log, aes(shares)) +
  geom_histogram(aes(fill = day)) + 
  facet_wrap(vars(day)) +
  labs(x = "Number of ln(Shares)", title = "Histogram of ln(Shares)", 
       fill = "Day")

# Keep just the day and number of shares
data_for_chi <- filtered_data %>%
  select(day, shares)

# split data into quartiles
quarts <- summary(data_for_chi$shares)

# add column for which quartile the shares fall into.
data_for_chi <- data_for_chi %>%
  mutate(shares_quart = if_else(shares <= quarts[2],"Q1",
                                if_else(shares <= quarts[3],"Q2",
                                        if_else(shares <= quarts[5],"Q3",
                                                if_else(shares <= quarts[6],"Q4","error")))))

# Show contingency table between the day and the shares quartile
kable(table(data_for_chi$shares_quart,data_for_chi$day))

	sunday	monday	tuesday	wednesday	thursday	friday	saturday
Q1	11	290	338	372	351	196	7
Q2	78	317	328	349	333	230	29
Q3	106	252	259	287	297	214	84
Q4	148	294	257	263	253	192	123

# preform chi squared test
chi_tst <- chisq.test(x = data_for_chi$day, y = data_for_chi$shares_quart)

# Show results
print(chi_tst)

## 
##  Pearson's Chi-squared test
## 
## data:  data_for_chi$day and data_for_chi$shares_quart
## X-squared = 319.51, df = 18, p-value < 2.2e-16

The Chi-Squared test resulted in a P-Value of 4.60433880751948e-57. Using a signifigance level of 0.05 it appears that the day of the week that an article is published does have a statistically significant impact on the amount of times an article is shared.

Model Analysis

Linear Modelling

Linear regression is a simple supervised learning method for quantitative response variables. These models are well-liked because they are easy to interpret. The linear model assumes that the true relationship between the response and predictors is linear, although we can perform non-linear transformations on the model’s predictors. For example, we can include higher power predictors, such as quadratics, we can include interaction terms, and we can take logarithmic transformations on the predictors. The ‘fit’ of a linear model is done by minimizing the sum of the squared residuals, which the model . We hope to use this linear model to predict the number of shares an article can achieve based on some predictors.

# Build Full Model
lm_full_fit <- train(shares ~ . , data = train_set,
              preProcess = c("center","scale"),
              method = "lm")


lm_full_pred <- predict(lm_full_fit,newdata = test_set)
# check RMSE of prediction
lm_full_rmse <- sqrt(mean((lm_full_pred-test_set$shares)^2))
Adj_R_lm <- summary(lm_full_fit)[["r.squared"]]

# creating residual plots from the full linear regression model
lm_full_plot <- lm(shares ~ . , data = train_set)
plot(lm_full_plot)

# best model found
lm_log_fit <- lm(shares~., data = train_log)
Adj_R_log<-summary(lm_log_fit)[["r.squared"]]
# still not a great linear model plot, but it is better than before
plot(lm_log_fit)

lm_log_pred <- predict(lm_log_fit, test_log)
lm_log_rmse <- sqrt(mean((lm_log_pred-test_log$shares)^2))

When we look at the residual plots from the linear regression we do have a goal in mind. For the residual plot (top left), we want to see no trends. The normal QQ plot should have points that generally follow the prescribed line on the plot. Many QQ plots do have points that stray at the beginning of the QQ Plots line. In the scale-location plots, the residuals should be evenly spread. Finally, our Cook’s plot should have all of its points within Cook’s distance.

The plots from our linear model did seem that a logistic regression would improve them. The linear regression from log(shares) is below and we can see some improvement visually and a major improvement of \(Adjusted R^2\). The \(Adjusted R^2\) from the shares linear regression model is 0.0411607, and the \(Adjusted R^2\) from the shares linear regression model is 0.1653928. Each of these topic models show a larger \(Adjusted R^2\) from the log(Shares) linear regression model.

Because we eventually want to compare how each of the models that were trained on the training data set performs on the data set, we need to preserve the same scale. We will be going through several more models and comparing the Root Mean Square Error (RMSE), which is found using Euclidean distance. We will not use the log(shares) response in any of the rest of the models produced but the rest of the models created used center and scaled data. We also found two more linear regression models to compare to the model with all the predictors, lm_full_fit. lm_full_fit’s RMSE is 9828.753562`.

The next linear model regression we created uses some of the significant predictors from the full linear model. It includes nine predictors. I’m not too concerned about what the predictors stand for because we are not concerned about the interpretability of the model. We only care about prediction. We also use these same predictors in the random forest model. This model will be called “Selected Linear Model 1”.

lm_9_fit <- train(shares ~ n_tokens_title + n_non_stop_words + num_hrefs + num_imgs +
                average_token_length + kw_min_avg + kw_max_avg +
                kw_avg_avg + max_negative_polarity,
                data = train_set,
                preProcess = c("center","scale"),
                method = "lm")


lm_9_pred <- predict(lm_9_fit,newdata = test_set)

lm_9_rmse <- sqrt(mean((lm_full_pred-test_set$shares)^2))

For this next linear model, we started with the five of the highest correlation variables, and then added a few more that seemed to increase the models effectiveness. Including some interactions between them. This model will be called “Selected Linear Model 2”.

# Fit model
lm2_fit <- train(shares ~ LDA_02 + LDA_03 + num_imgs*n_tokens_content  + average_token_length+
                   I(average_token_length^2) + kw_avg_avg + rate_negative_words*rate_positive_words 
                 + kw_max_avg + day + title_subjectivity ,
              data = train_set,
              preProcess = c("center","scale"),
              method = "lm")

# Use model to predict
lm2_pred <- predict(lm2_fit,newdata = test_set)

# check RMSE of prediction
lm2_rmse <- sqrt(mean((lm2_pred-test_set$shares)^2))

Random Forests

Random forests is an ensemble learning method that works by averaging across many trees. Random forests select a set of m predictors, which typically will be \(\sqrt{p}\) or \(p/3\), where \(p\) is the number of predictors that we have. Random forests are generally less correlated to each other than some other methods, and this helps reduce variance. In the plot printed below, we want to find the number of predictors that minimizes the root mean square error. The resulting model chosen is from trees created using 5-fold cross validation and are created with a range of values that \(m\) can take on. Called mtry in the R program, I chose a range of predictors that may minimize RMSE. Some of the number of predictors were chosen on general best practices. \(\sqrt{49} = 7\) so I chose to include that as the number of predictors. \(49/3 \approx 17\). Finally, I included the low numbers of 1 and 3 as values for \(m\) just for the variety. The previous runs that selected 1 as the best, and the suggested

Setting up parallel processing to reduce the time it takes to run random forests.

# these mtry's were chosen based on previous runs that selected 1 as the best, and the suggested 
# sqrt(p) -7 and 8, depending on how we count non-numeric vars and p/3 - mtry = 16, 17, and 18, 
# that are usually some of the best tuning parameters 
rf_SigPred <- train(shares ~ n_tokens_title + n_non_stop_words + num_hrefs 
                    + num_imgs + average_token_length + kw_avg_avg + max_negative_polarity,
             data = train_set, method = "rf", 
             trControl = trainControl(method = "cv", number = 10), 
             preProcess = c("center", "scale"), 
             tuneGrid = data.frame(mtry = c(1, 3, 7, 16, 17)))
rf_SigPred

## Random Forest 
## 
## 4380 samples
##    7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3943, 3941, 3942, 3943, 3942, 3943, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    1    11489.59  0.01665201  2931.697
##    3    12274.63  0.01780650  3088.248
##    7    12743.00  0.01819932  3088.419
##   16    12924.40  0.01804274  3117.137
##   17    12765.31  0.01959277  3096.239
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 1.

# Plot representing the 
plot(rf_SigPred, xlab = "Number of Predictors in RF Model", ylab = "Root Mean Square Error (Found through CV)", 
     main = "Minimizing RMSE")

# Smallest model for this one is mtry = 1
rf_SigPred_Pred <- predict(rf_SigPred, newdata = select(train_set, -shares))
rf_MSE <- sqrt(mean((rf_SigPred_Pred-train_set$shares)^2))

Above, we also have a plot of the RMSE by the number of randomly selected predictors that we tried. We want to look for the smallest RMSE for the best number of predictors (called mtry within the train() function).

Boosted Tree

A boosted tree is another type of ensemble based supervised learning method. Much like with random forests, this model creates a bunch of different trees and then averages the predictions. The difference comes in with the way trees are grown, in this boosted tree model the next tree grown is a result of the last tree. This results in the model learning as it goes. The boosted tree continually tries to improve on the previous model created, so in the end the final or “best” model is a modified version of the originally created model.

Below we create two boosted tree models, the first is a model with every predictor variable included, the second is with some selected variables to try and improve on the model with all predictors. For both models we used 10 fold cross validation, number of trees ranging from 100 to 500, an iteration depth from 1 to 10, shrinkage of 0.1, and min obs in node of 10.

# Define all tuning parameters
n.trees <- c(100,200,300,400,500)
interaction.depth <- c(1:10)
shrinkage <- 0.1
n.minobsinnode <- 10

# Select train control methods
trctrl <- trainControl(method = "cv", number = 10)

# build a data set with the tuning parameters.
tune_df <- expand.grid(n.trees,interaction.depth,shrinkage,n.minobsinnode)

# Give the tuning parameters the right names
names(tune_df) <- c("n.trees","interaction.depth","shrinkage","n.minobsinnode")

# Build full model

boost_tree_full_fit <- train(shares ~ . ,
                       data = train_set,
                       method = "gbm",
                       trControl = trctrl,
                       preProcess = c("center","scale"),
                       tuneGrid = tune_df)

# Predict full model
boost_tree_full_pred <- predict(boost_tree_full_fit,test_set)

# Check RMSE for full model
boost_tree_full_rmse <- sqrt(mean((boost_tree_full_pred - test_set$shares)^2))



# Build selected model
boost_tree_fit <- train(shares ~ LDA_02 + LDA_03 + num_imgs + kw_avg_avg + average_token_length +
                          rate_negative_words + day ,
                       data = train_set,
                       method = "gbm",
                       trControl = trctrl,
                       preProcess = c("center","scale"),
                       tuneGrid = tune_df)

# Predict selected mode
boost_tree_pred <- predict(boost_tree_fit,test_set)

# Check RMSE for full model
boost_tree_rmse <- sqrt(mean((boost_tree_pred - test_set$shares)^2))

plot(boost_tree_full_fit, xlab = "Max Tree Depth (by Interaction)", 
     ylab = "Root Mean Square Error (Found through CV)", 
     main = "All Predictors - Minimizing RMSE")

plot(boost_tree_fit, xlab = "Max Tree Depth (by Interaction)",
     ylab = "Root Mean Square Error (Found through CV)", 
     main = "Subsetted Predictors - Minimizing RMSE")

Like we did in the random forests, we have a plot of the RMSE by the number of randomly selected predictors that we tried. We want to look for the smallest RMSE to identify the best tree depth and the best number of iterations to perform. Each different line, represents a different number of trees used in the cross validation. Adding more trees usually reduces the error on the training set, but also may lead to overfitting.

Best Model

To determine the best model to use for predicting the amount of shares of an article about business we have decided to measure them against each other using the Root Mean Square Error (RMSE). To read about RMSE we suggest this link as it provides a general explanation of what it is and also has a video!

Before we can determine the best model, we have to compare them.To do so, we establish the name of each model, and then we add the name and their corresponding RMSEs to a tibble for comparison!

# Names of models
model_names <- c("Full Linear Model","Selected Linear Model 1","Selected Linear Model 2",
                 "Random Forest Model", "Full Boosted Tree Model", "Selected Boosted Tree Model") 

# RMSE values
model_rmses <- c(lm_full_rmse, lm_9_rmse, lm2_rmse, rf_MSE, boost_tree_full_rmse,
                 boost_tree_rmse)
# Combine into tibble
model_comp <- tibble(model_names,model_rmses)

#Arrange by RMSE value in ascending order
model_comp <- model_comp %>% arrange(model_rmses)

# Change names for a good looking print out.
names(model_comp) <- c("Name of Model","RMSE of Model")

# Print table.
kable(model_comp)

Name of Model	RMSE of Model
Selected Linear Model 2	9009.664
Selected Boosted Tree Model	9344.506
Full Boosted Tree Model	9590.969
Full Linear Model	9828.754
Selected Linear Model 1	9828.754
Random Forest Model	11212.605

From this table we can see that for the business topic the best model is the Selected Linear Model 2 with an RMSE value of 9009.6642

Article Share Prediction

Samantha Hunter & Evan Patton

10/17/2021