Linear Regression 2

Linear Regression’s output variable is generally a continuous numerical variable.

Be careful of collinearity! This is when two input variables are highly correlated, which can lead to unstable coefficient estimates and surprising results.
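A quick way to check for this is to look at the correlation between input variables before modeling. A minimal sketch, using a made-up tibble (the `x1` / `x2` names are just for illustration):

library(tidyverse)

# Two made-up inputs where x2 is almost a copy of x1.
d <- tibble(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(1.1, 2.0, 2.9, 4.2, 5.1)
)

# A correlation near 1 or -1 is a warning sign for collinearity.
cor(d$x1, d$x2)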

Note: we will not typically fit and evaluate a model on the same data.

Predicting on new data

We will often use separate training and test data. The training data is used to fit the model, and the model is then evaluated on the test data.

Minimal Example

library(tidyverse)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

# Show a summary of the linear model. Pay attention to the R^2, the % of variation explained by the model.
summary(model)
## 
## Call:
## lm(formula = to ~ from, data = t_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.654 -0.554  0.370  0.403  0.430 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)  11.6667     0.3732    31.3 0.0000000012 ***
## from         -1.0121     0.0601   -16.8 0.0000001576 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.546 on 8 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.969 
## F-statistic:  283 on 1 and 8 DF,  p-value: 0.000000158
# new data
t_new_data <- tibble(
  from = c(3, 5, 10, 100),
  to = c(7, 4, 2, 0)
)


# Predict the results with new data, storing as a vector
predicted_results <- predict(model, newdata = t_new_data)

# Place this back into the original tibble to measure accuracy.
t_new_data <- t_new_data %>% 
  mutate(prediction = predicted_results)

ggplot(data = t_new_data) +
  geom_point(mapping = aes(x = prediction, y = to))

Measuring accuracy

Residual Values

library(tidyverse)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

# Predict the results, storing on the tibble.
t_data <- t_data %>% 
  mutate(prediction = predict(model, newdata = t_data))

# Plot predictions against actual values to visualize the fit. A perfect model would fall on the line x = y.
ggplot(data = t_data) +
  geom_point(mapping = aes(x = prediction, y = to))

Residual Error

library(tidyverse)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

# Predict the results, storing on the tibble.
t_data <- t_data %>% 
  mutate(predict = predict(model, newdata = t_data),
         error = to - predict)

# Visualize the residual error at each prediction. A good model's errors should be centered on 0.
ggplot(data = t_data) +
  geom_pointrange(mapping = aes(x = predict,
                                y = error,
                                ymin = 0, 
                                ymax = error))

RMSE

The RMSE is computed by taking the prediction minus the actual value, squaring it, taking the mean, and then taking the square root of that mean. It should be read as the typical prediction error. Note that it will never be a negative number!

One useful measure to compare it against is the standard deviation of the data.

library(tidyverse)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

# Predict the results, storing on the tibble.
t_data <- t_data %>% 
  mutate(predict = predict(model, newdata = t_data),
         residuals = to - predict)

# Take the mean error squared, and then take the sqrt.
error <- sqrt(mean(t_data$residuals^2))

print(error)
## [1] 0.489
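As a rough check (a sketch, reusing the `t_data` tibble and `error` value from above), compare the RMSE to the standard deviation of the outcome. A useful model should have an RMSE noticeably smaller than the sd:

# Compare the typical prediction error to the spread of the raw data.
sd(t_data$to)
error / sd(t_data$to)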

R-squared

R^2 is a measure of how well the model fits the data. It will be a number between 0 and 1, and it is the proportion of the total variance accounted for by the model.

library(tidyverse)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

# Predict the results, storing on the tibble.
t_data <- t_data %>% 
  mutate(predict = predict(model, newdata = t_data),
         error = to - predict,
         error2 = error ^ 2)

# Take the `sum` of the squared error
residual_sum_of_squares <- sum(t_data$error2)

# Find the difference between the mean and every row,
# i.e., the total variation.
total_variation_in_model <- t_data$to - mean(t_data$to)

# Take the sum of the squared variation
total_sum_of_squares <- sum(total_variation_in_model ^ 2)

# Calculate R^2
r2 <- 1 - (residual_sum_of_squares / total_sum_of_squares)


print(r2)
## [1] 0.973
summary(model)
## 
## Call:
## lm(formula = to ~ from, data = t_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.654 -0.554  0.370  0.403  0.430 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)  11.6667     0.3732    31.3 0.0000000012 ***
## from         -1.0121     0.0601   -16.8 0.0000001576 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.546 on 8 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.969 
## F-statistic:  283 on 1 and 8 DF,  p-value: 0.000000158
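Rather than reading R^2 off the printed summary, it can also be pulled out of the model object directly:

# Extract the multiple R-squared from the model summary object.
summary(model)$r.squared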

Training / Test Split

We generally split our data into training and test sets to make sure that we are not over-fitting to our data.

Split data

We split the data by generating a new column filled with 'test' or 'train' labels.

library(tidyverse)

# Our dataset.
t_data <- tibble(
  id = 1:1000,
  values = rep( c('A', 'B'), 500)
)

# Create a vector of 'test' / 'train' labels
#   x is a vector of the values you want to sample from.
#   size is the number of rows (match your tibble)
#   replace = TRUE samples with replacement, so values are not removed from x.
#   prob sets how likely each of the values should be.
select <- sample( x = c('test', 'train'),
        size = 1000,
        replace = TRUE,
        prob = c(0.1, 0.9))

# Add to your tibble
t_data <- mutate(t_data, select = select)

# Create test  and train datasets
t_test <- filter(t_data, select == 'test')
t_train <- filter(t_data, select == 'train')

# Train your model on t_train
# Use predict with t_test
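The pattern then looks roughly like this. A sketch, assuming the split tibbles hold numeric `from` and `to` columns like the earlier examples (not the `id` / `values` columns used above):

# Fit on the training rows only.
model <- lm(to ~ from, data = t_train)

# Predict on the held-out test rows and measure the error there.
t_test <- mutate(t_test, prediction = predict(model, newdata = t_test))
rmse <- sqrt(mean((t_test$to - t_test$prediction)^2))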

Cross-validation

Cross-validation is a standard technique for splitting our data. It generates multiple small splits and then checks how well we do on each one.

When we are done, we will fit a final model with all of the data. Cross-validation only tells us about the accuracy of the modeling process; it cannot tell us the final model's accuracy. We will only know that once the model is tested on new data it has not seen yet.

Create splits

Splits are created with vtreat by selecting the row numbers for each split.

library(tidyverse)
library(vtreat)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

nRows = nrow(t_data)

# Create 3 splits
splits <- vtreat::kWayCrossValidation(nRows = nRows, nSplits = 3)

# splits is now a list of 3, each with a `train` and an `app` element.
# The values are the row indexes to select for train/test.
str(splits)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:7] 1 2 4 5 6 8 9
##   ..$ app  : int [1:3] 10 7 3
##  $ :List of 2
##   ..$ train: int [1:6] 3 5 6 7 8 10
##   ..$ app  : int [1:4] 9 2 1 4
##  $ :List of 2
##   ..$ train: int [1:7] 1 2 3 4 7 9 10
##   ..$ app  : int [1:3] 8 5 6
##  - attr(*, "splitmethod")= chr "kwaycross"

Apply splits

To apply the splits, we have to put lm inside a loop. We will go into each split, create new train/test datasets, and then run lm on each one.

library(tidyverse)
library(vtreat)

t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 10, 8, 8, 7, 6, 5, 4, 2, 1)
)

nRows <-  nrow(t_data)
nSplits <-  3

# Create 3 splits
# Pass NULL as the 3rd and 4th argument.
splits <- vtreat::kWayCrossValidation(nRows = nRows, 
                                      nSplits = nSplits,
                                      NULL, NULL)

# For each split,
#   Save our current split in i,
#   Start with index of 1, 
#   Go until we hit the max number of splits
for (i in 1:nSplits) {
  # Make it easy to grab the current split.
  # Note the [[ ]] brackets:
  #    splits[[1]] returns the value
  #    splits[1] returns a list with one item.
  split <- splits[[i]]
  
  # Build model
  # Note that we use train.
  # Note that we use [ c(1, 2, 4, 9), ]... to select
  #   the indexes of the rows we want. 
  #   We must have the trailing comma! Otherwise, it returns
  #   the wrong selection.
  model <- lm(to ~ from, data = t_data[split$train, ])
  
  # Make predictions!
  predictions <- predict(model, newdata = t_data[split$app, ])
  print(model)
  
  # Calculate residuals
  residuals <- t_data[split$app, ]$to - predictions
  print(residuals)
  
  # We will probably want to calculate RMSE.
  rmse <- sqrt(mean(residuals^2))
  
  print(rmse)
}
## 
## Call:
## lm(formula = to ~ from, data = t_data[split$train, ])
## 
## Coefficients:
## (Intercept)         from  
##      11.216       -0.966  
## 
##     1     2     3 
## 0.716 0.615 0.649 
## [1] 0.661
## 
## Call:
## lm(formula = to ~ from, data = t_data[split$train, ])
## 
## Coefficients:
## (Intercept)         from  
##       12.58        -1.15  
## 
##      1      2      3      4 
## -1.428  0.486 -1.123  0.638 
## [1] 0.993
## 
## Call:
## lm(formula = to ~ from, data = t_data[split$train, ])
## 
## Coefficients:
## (Intercept)         from  
##      11.217       -0.884  
## 
##      1      2      3 
## -1.377  0.087 -1.261 
## [1] 1.08
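Rather than just printing each split's RMSE, we usually collect them and average. A sketch reusing the `splits` and `t_data` objects from above:

# Compute the out-of-fold RMSE for each split, then average them.
rmses <- sapply(splits, function(split) {
  model <- lm(to ~ from, data = t_data[split$train, ])
  predictions <- predict(model, newdata = t_data[split$app, ])
  sqrt(mean((t_data[split$app, ]$to - predictions)^2))
})

mean(rmses)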

Transforming variables

Categorical variables

We deal with categorical variables by converting them into 0/1 one-hot (dummy) variables. The key is that we convert all but one level into a new column. If we included columns for every level, we would end up with multicollinearity.

library(tidyverse)

# Our dataset.
t_data <- tibble(
  from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  to = c(10, 0, 0, 0, 7, 6, 5, 4, 2, 1),
  group = c('light', 'heavy', 'heavy', 'light', 'light', 'light', 'light', 'heavy', 'light', 'light')
)

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from + group, data = t_data)

# Predict the results, storing on the tibble.
t_data <- mutate(t_data, prediction = predict(model, newdata = t_data))


# Plot predictions against actual values. A perfect model would fall on the line x = y.
ggplot(data = t_data) +
  geom_point(mapping = aes(x = prediction, y = to, color = group))

summary(model)
## 
## Call:
## lm(formula = to ~ from + group, data = t_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.142 -1.954 -0.216  2.054  3.974 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.878      2.549    1.13     0.30
## from          -0.357      0.383   -0.93     0.38
## grouplight     3.689      2.399    1.54     0.17
## 
## Residual standard error: 3.35 on 7 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.0682 
## F-statistic: 1.33 on 2 and 7 DF,  p-value: 0.324
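To see the 0/1 dummy columns that lm builds behind the scenes, the design matrix can be inspected (a sketch using the `t_data` tibble from above):

# `group` becomes a single 0/1 column (grouplight);
# the 'heavy' level is the baseline absorbed into the intercept.
model.matrix(to ~ from + group, data = t_data)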

Log transform (dependent variable)

A log transform will convert highly right-skewed data into something closer to a normal distribution.

library(tidyverse)

# Our dataset.
t_data <- tibble(
  from = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  to = c(1, 2, 1, 10, 11, 9, 100, 110, 90, 1000, 900, 1100) 
)

hist(log(t_data$to))

# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from, data = t_data)

summary(model)
## 
## Call:
## lm(formula = to ~ from, data = t_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -342.1 -166.4   23.4  186.3  359.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -493.7      186.5   -2.65   0.0244 * 
## from           308.6       68.1    4.53   0.0011 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264 on 10 degrees of freedom
## Multiple R-squared:  0.672,  Adjusted R-squared:  0.64 
## F-statistic: 20.5 on 1 and 10 DF,  p-value: 0.00109
# Use a log transform on the non-normal data
model2 <- lm(log(to) ~ from, data = t_data)
summary(model2)
## 
## Call:
## lm(formula = log(to) ~ from, data = t_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1958 -0.1343 -0.0376  0.0565  0.5324 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  -2.0715     0.1467   -14.1 0.0000000622322 ***
## from          2.2323     0.0536    41.7 0.0000000000015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.207 on 10 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 1.74e+03 on 1 and 10 DF,  p-value: 0.00000000000151
# Predicted values, converted back into normal form.
exp( predict(model2, newdata = t_data))
##      1      2      3      4      5      6      7      8      9     10     11     12 
##   1.17   1.17   1.17  10.95  10.95  10.95 102.03 102.03 102.03 951.01 951.01 951.01

Log transform error

Tracking the error of a log-transformed variable is a little different. We generally want to minimize the error as a ratio rather than as an absolute number.

As a result, we want to calculate the relative RMSE.

library(tidyverse)

# Our dataset.
t_data <- tibble(
  from = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  to = c(2, 3, 4, 10, 11, 9, 100, 110, 90, 1000, 900, 1100) 
)
t_data <- mutate(t_data, to_log = log(to))

# Use a log transform on the non-normal data
model <- lm(to_log ~ from, data = t_data)

t_data <- t_data %>% 
  mutate(prediction = predict(model, t_data),
         residual = prediction - to_log,
         prediction_non_log = exp(prediction))

rmse <- sqrt(mean(t_data$residual ^ 2))
rmse_relative <- sqrt(mean((t_data$residual / t_data$to_log)^2))

print(t_data)
print(exp(rmse))
## [1] 1.39
print(exp(rmse_relative))
## [1] 1.21

Exponent transform (independent variable)

We can also transform an input variable. You can use the I() function to do the math inside the formula, but it is often easier to modify the base tibble (an I() sketch follows the output below).

library(tidyverse)

# Our dataset.
t_data <- tibble(
  from = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  to = c(1, 2, 1, 4, 5, 4, 9, 10, 11, 16, 14, 15) 
)

# Create a squared input
t_data <- t_data %>% 
  mutate(from_squared = from * from)


# Create a linear regression model predicting `to` with `from` and using the entire dataset.
model <- lm(to ~ from_squared, data = t_data)

summary(model)
## 
## Call:
## lm(formula = to ~ from_squared, data = t_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.486 -0.536 -0.247  0.524  1.954 
## 
## Coefficients:
##              Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)    0.7674     0.4583    1.67         0.12    
## from_squared   0.9199     0.0487   18.88 0.0000000038 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.958 on 10 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.97 
## F-statistic:  357 on 1 and 10 DF,  p-value: 0.00000000376
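As an alternative to pre-computing the squared column, the same transform can be written directly in the formula with I(). A sketch, reusing the `t_data` tibble from above (`model_i` is just an illustrative name):

# I() tells the formula to treat `from^2` as arithmetic, not formula syntax.
model_i <- lm(to ~ I(from^2), data = t_data)
summary(model_i)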