
This content relates to the DataCamp series called Machine Learning for Business.

Data Lifecycle



All models have error. This can come from 3 sources:

As an example, consider asking college students for their favorite artist. We grab 10 students at random, and ask them for a list of their top 3 artists. This has the following errors:

Categorical Prediction

Imagine we are predicting which customers will drop our service.

We have 6 customers:

  • A Dropped, Predicted Dropped (success!)
  • B Dropped, Predicted Dropped (success!)
  • C Dropped, Not predicted
  • D Stayed, Predicted Dropped
  • E Stayed, Predicted Dropped
  • F Stayed, Not predicted (success!)

  • Accuracy
    • Correctly classified / all cases
    • (3 including A, B, & F) / 6 = 50% accuracy
  • Precision
    • Correctly predicted drops / all predicted drops
    • (A, B) / (A, B, D, E) = 2 / 4 = 50% precision
  • Recall
    • Correctly predicted drops / all customers who actually dropped
    • (A, B) / (A, B, C) = 2 / 3 = 67% recall
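
The three metrics above can be reproduced with a short Python sketch. The `customers` list and its variable names are illustrative assumptions, not part of the course material; the data is the six-customer example from this section.

```python
# Each tuple: (customer, actually_dropped, predicted_dropped)
customers = [
    ("A", True, True),
    ("B", True, True),
    ("C", True, False),
    ("D", False, True),
    ("E", False, True),
    ("F", False, False),
]

# Count the cases each metric needs
true_positives = sum(1 for _, actual, pred in customers if actual and pred)
true_negatives = sum(1 for _, actual, pred in customers if not actual and not pred)
predicted_dropped = sum(1 for _, _, pred in customers if pred)
actually_dropped = sum(1 for _, actual, _ in customers if actual)

accuracy = (true_positives + true_negatives) / len(customers)  # 3 / 6
precision = true_positives / predicted_dropped                 # 2 / 4
recall = true_positives / actually_dropped                     # 2 / 3

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.50 precision=0.50 recall=0.67
```

Precision and recall pull in different directions here: the model flags four customers to catch two real drops (low precision), and still misses customer C (imperfect recall).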

Continuous Prediction

We need to calculate the error: the difference between the actual result and our model's prediction.

We have 3 customers:

  • A $10, predicted $10
  • B $12, predicted $10
  • C $8, predicted $10


RMSE (root mean squared error) is the square root of the average squared difference between each actual value and its prediction.

Calculate the squared difference of each point and add them up: (10 - 10)^2 + (12 - 10)^2 + (8 - 10)^2 = 0 + 4 + 4 = 8

Divide by the number of observations, and take the square root: (8 / 3) ^ .5 ≈ 1.63


We may also want to see the individual differences between the actual and predicted values (the residuals): (10 - 10), (12 - 10), (8 - 10) -> (0, 2, -2)

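So, the RMSE is the square root of the mean squared error, which equals the variance of the errors when they average to zero, as they do here. It can be read as the typical distance between observed and predicted values, with larger errors weighted more heavily.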
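
As a check on the arithmetic, here is a minimal Python sketch of the RMSE calculation for the three customers listed in this section. The variable names are illustrative assumptions.

```python
import math

actual = [10, 12, 8]       # what customers A, B, C actually spent
predicted = [10, 10, 10]   # what the model predicted for each

# Residual = actual - predicted for each customer
residuals = [a - p for a, p in zip(actual, predicted)]

# Square each residual, average them, then take the square root
squared_errors = [r ** 2 for r in residuals]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(residuals)       # prints: [0, 2, -2]
print(round(rmse, 2))  # prints: 1.63
```

Squaring before averaging keeps positive and negative residuals from cancelling out; a plain average of (0, 2, -2) would be 0 and would hide the error entirely.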


The coefficient of determination tells us the proportion of variance in our dependent variable that can be explained by our independent variables.

It usually ranges from 0 to 1, although it can be negative when a model fits worse than simply predicting the mean. Generally, the higher the number, the better the prediction. This will be more fully explained in the regression sections.
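
A rough sketch of how the coefficient of determination could be computed, assuming the standard definition R^2 = 1 - SS_res / SS_tot (the notes do not spell out the formula, so treat that as an assumption). It reuses the three-customer spending example; the variable names are illustrative.

```python
actual = [10, 12, 8]
predicted = [10, 10, 10]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: variation the model fails to explain
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: variation of the actual values around their mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)  # prints: 0.0
```

The result is 0 because this model predicts $10 for everyone, which is exactly the mean of the actual values, so it explains none of the variation between customers.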


Inference v. Prediction