Predicting Numerical Value

Some models predict a continuous variable. For example, can we predict a student’s test score based on hours studied?

Outcomes:

Correlation

A simple starting point for predicting a numerical value is correlation. Correlation (most commonly Pearson's r) measures the strength and direction of a linear relationship between two continuous variables, on a scale from -1 to 1.
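To make the idea of strength and direction concrete, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The data values and the function name pearson_r are invented for illustration and are not part of the course material.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 60, 61, 70, 78]

r = pearson_r(hours_studied, test_scores)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```

A value near 1 or -1 indicates a strong linear relationship, the sign gives the direction, and a value near 0 indicates little or no linear relationship.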

Interpretation:

Correlation is not causation! Two variables may be correlated, but that does not mean one causes the other. There may be a third variable causing both, or it may be a coincidence.

Every person who confuses correlation with causation eventually dies

Measuring Error

In a classical approach to statistics, we use p-values to judge whether a result, such as a correlation, could plausibly be due to chance alone.

p-value: the probability of observing data at least as extreme as ours if the null hypothesis is true
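One way to see what this definition means for a correlation is a permutation test. This is a swapped-in illustration, not the formula-based test a statistics package would normally report: it shuffles one variable many times (which simulates the null hypothesis of no relationship) and counts how often the shuffled correlation is at least as strong as the observed one. The data, the function names, and the number of shuffles are all assumptions made for this sketch.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Approximate two-sided p-value for a correlation by shuffling ys."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any real link between xs and ys.
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [50, 55, 53, 62, 66, 70, 69, 80]

p = permutation_p_value(hours, scores)
print(p)  # a small value: chance alone rarely produces a correlation this strong
```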

In newer ML approaches, we will measure error by splitting our data into training and test sets. After training our model on the training set, we will evaluate it on the test set, which the model has not seen. This will be covered in more detail in later modules.
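As a rough preview of that split, here is a minimal sketch using only the standard library. The helper name split_train_test, the 25% test fraction, and the data are assumptions for illustration; later modules may use a library routine instead.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out a fraction of them as a test set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (hours studied, test score) pairs, invented for illustration.
data = [(h, 50 + 5 * h) for h in range(1, 9)]

train, test = split_train_test(data)
print(len(train), len(test))  # 6 2
```

The model is fit using only the training rows, and its error is then measured on the test rows.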

A correlation has both:

Problems

There are a variety of problems that can affect correlation:

Example: Predicting Sales

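Imagine we are trying to predict a customer’s total sales based on advertisements viewed. We find a positive relationship and fit a line with a slope of 0.5, meaning that for every 2 advertisements a person sees, the model predicts their total sales will increase by 1. Note that a slope is not the same thing as a correlation coefficient: a correlation of 0.5 describes how closely the points follow a line, not how steep that line is.

We will test this model. We need to calculate error, or the difference between the actual result and our model's prediction.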

We have 3 customers:

RMSE

RMSE (root mean squared error) is the square root of the average of the squared errors. Here is a good reference.

Calculate the squared difference for each point and add them up: (10 - 10)^2 + (12 - 10)^2 + (8 - 10)^2 = 8

Divide by the number of observations, and take the square root: (8 / 3) ^ 0.5 ≈ 1.63
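The same calculation can be sketched in code. This assumes the three terms in the sum are (actual - predicted), with actual values of 10, 12, and 8 and a prediction of 10 for every customer. That reading is an assumption, since the customer table itself is not shown here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)

# Assumed values, read off the sum of squared differences in the text.
actual = [10, 12, 8]
predicted = [10, 10, 10]

print(sum((a - p) ** 2 for a, p in zip(actual, predicted)))  # 8
print(round(rmse(actual, predicted), 2))                     # 1.63
```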

Residuals

We may also want to look at the residuals, the individual differences between our predictions and the actual values: (10 - 10), (12 - 10), (8 - 10) -> (0, 2, -2)

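So, the RMSE is the square root of the mean of the squared residuals. When the residuals average to zero, as they do here, that mean is the variance of the residuals, so the RMSE is their standard deviation. It is a measure of the typical distance between predicted and actual values.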
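A short sketch of the link between residuals and RMSE, under the same assumed values as the example (actual values 10, 12, 8 and a prediction of 10 for each). Because these residuals average to zero, their population standard deviation matches the RMSE. With a nonzero average residual the two numbers would differ.

```python
import statistics

# Assumed values, taken from the residual list in the text.
actual = [10, 12, 8]
predicted = [10, 10, 10]

# Residual convention assumed here: actual minus predicted.
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)                   # [0, 2, -2]
print(statistics.mean(residuals))  # 0

# Population standard deviation: square root of the mean squared deviation.
spread = statistics.pstdev(residuals)
print(round(spread, 2))            # 1.63
```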

R-squared (R^2)

The coefficient of determination tells us the proportion of variance in our dependent variable that can be explained by our independent variables.

It typically ranges from 0 to 1, although it can fall below 0 when a model predicts worse than simply guessing the mean. Generally, the higher the number, the better the prediction. This will be explained more fully in the regression sections.
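As a small sketch of the definition, R^2 can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The actual values below are the assumed ones from the example. The second set of predictions, [10, 11, 9], is hypothetical and included only to show a nonzero result.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10, 12, 8]

# Predicting 10 (the mean) for everyone explains none of the variance.
print(r_squared(actual, [10, 10, 10]))  # 0.0

# Hypothetical better predictions explain most of it.
print(r_squared(actual, [10, 11, 9]))   # 0.75
```

A model that always predicts the mean scores 0, which is why R^2 is often read as the improvement over that baseline.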