Introduction

This content relates to the DataCamp series called Machine Learning for Business.

Data Lifecycle

Types

Error

All models have error. This can come from 3 sources:

As an example, consider asking college students for their favorite artist. We grab 10 students at random, and ask them for a list of their top 3 artists. This has the following errors:

Categorical Prediction

Imagine we are predicting which customers will drop our service.

We have 6 customers:

  • A Dropped, Predicted Dropped (success!)

  • B Dropped, Predicted Dropped (success!)

  • C Dropped, Not predicted to drop

  • D Stayed, Predicted Dropped

  • E Stayed, Predicted Dropped

  • F Stayed, Not predicted to drop (success!)

  • Accuracy

    • Correctly classified / all cases
    • (A, B, F) / 6 = 3/6 = 50% accuracy
  • Precision

    • Customers correctly predicted to drop / all customers predicted to drop
    • (A, B) / (A, B, D, E) = 2/4 = 50% precision
  • Recall

    • Dropped customers correctly predicted / all dropped customers
    • (A, B) / (A, B, C) = 2/3 ≈ 67% recall (these metrics are computed in the sketch after this list)
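
A minimal sketch (not part of the course material) that reproduces these three metrics with scikit-learn, labeling each customer 1 for dropped and 0 for stayed:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Customers A..F: 1 = dropped, 0 = stayed
actual    = [1, 1, 1, 0, 0, 0]   # what really happened
predicted = [1, 1, 0, 1, 1, 0]   # what the model said

print(accuracy_score(actual, predicted))    # 0.5   -> 50% accuracy
print(precision_score(actual, predicted))   # 0.5   -> 50% precision
print(recall_score(actual, predicted))      # 0.667 -> ~67% recall
```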

Continuous Prediction

We need to calculate error, the difference between the actual result and our model's prediction.

We have 3 customers:

  • A $10, predicted $10
  • B $12, predicted $10
  • C $8, predicted $10

RMSE

RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual values. https://www.statology.org/how-to-interpret-rmse/

Calculate the squared difference of each point: (10 - 10)^2 + (12 - 10)^2 + (8 - 10)^2 = 0 + 4 + 4 = 8

Divide by the number of observations and take the square root: (8 / 3) ^ 0.5 ≈ 1.63
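
A minimal sketch, assuming the three customers above, that reproduces this RMSE calculation with NumPy:

```python
import numpy as np

actual    = np.array([10, 12, 8])    # what the customers actually spent
predicted = np.array([10, 10, 10])   # what the model predicted

# Square the differences, average them, then take the square root
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)  # ~1.63
```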

Residuals

We may also want to see the residuals, the differences between the actual values and our predictions: (10 - 10), (12 - 10), (8 - 10) -> (0, 2, -2)

So the RMSE is the square root of the mean of the squared residuals, which is roughly the average distance between the observed values and the predicted values.
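
As a quick check, the same assumed numbers show that squaring the residuals, averaging, and taking the square root reproduces the RMSE above:

```python
import numpy as np

actual    = np.array([10, 12, 8])
predicted = np.array([10, 10, 10])

residuals = actual - predicted             # [0, 2, -2]
print(residuals)
print(np.sqrt(np.mean(residuals ** 2)))    # ~1.63, matching the RMSE above
```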

R-squared

The coefficient of determination tells us the proportion of variance in our dependent variable that can be explained by our independent variables.

It typically ranges from 0 to 1 (a model that fits worse than simply predicting the mean can score below 0). Generally, the higher the number, the better the prediction. This will be more fully explained in the regression sections.
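
A minimal sketch (with made-up numbers, not from the course) computing R-squared both by hand and with scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

actual    = np.array([10, 12, 8, 11, 9])
predicted = np.array([10, 11, 9, 11, 10])

ss_res = np.sum((actual - predicted) ** 2)       # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation
print(1 - ss_res / ss_tot)                       # 0.7, computed by hand
print(r2_score(actual, predicted))               # 0.7, via scikit-learn
```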

Jobs

Inference v. Prediction