Introduction

Datasets:

  • Cluster dataset: Board Game Geek data, https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12
  • Error in weather predictions, https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-12-20

k-means

k-means clustering is a technique that automatically assigns individual data points to groups. It's useful for finding natural groups in your data.

Here are some resources:

We will be using the iris dataset extensively.

Wikipedia picture of an iris

Iris Data

We are going to try to use the sepal/petal lengths/widths to automatically identify the 3 different types of iris flowers.

The plot below compares Sepal.Length against Petal.Length, but we have a lot of other possible combinations (the short sketch after this code chunk shows a quick way to view them all at once).

library(tidyverse)

# Load iris data into a tibble
data("iris")
t_iris <- tibble(iris)

# Plot petals
ggplot( data=t_iris, mapping = aes( x=Sepal.Length, y=Petal.Length, color=Species)) +
  geom_point() + 
  labs( title = 'Iris dataset: species variable')
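
To see all of those combinations at once, one option is base R's pairs() function, which also comes up later in the PCA section. This is a small sketch, assuming the t_iris tibble and the tidyverse from the chunk above.

# Scatterplots of every pair of numeric columns, colored by species.
pairs(x = select(t_iris, -Species), col = t_iris$Species)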

k-means algorithm result

We can use k-means to see which combination of attributes gives us the highest accuracy (a short sketch after the full example below shows one way to compare the resulting clusters against the true species).

library(tidyverse)

# Load iris data into a tibble
data("iris")
t_iris <- tibble(iris)

# Run the kmeans algorithm:
#   x is a tibble, which should only have numeric data
#   centers is the number of clusters we want
#   nstart says how many random starting sets of centers to try (we use 10 here; the kmeans default is 1)
kresult <- kmeans(
  x = select(t_iris, Sepal.Length, Petal.Length),
  centers = 3,
  nstart = 10
)

print(kresult)
## K-means clustering with 3 clusters of sizes 41, 51, 58
## 
## Cluster means:
##   Sepal.Length Petal.Length
## 1         6.84         5.68
## 2         5.01         1.49
## 3         5.87         4.39
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 3 3 3 3 3 3 3 3 3 3
##  [64] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1
## [127] 3 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 20.41  9.89 23.51
##  (between_SS / total_SS =  90.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Look at the results of k-means to see how well the algorithm worked.

There are several useful outputs (a short sketch of accessing them appears below):

  • cluster: vector of numbers, one per input row, showing the cluster for each row. Use this to assign each of your original data points to a cluster.
  • centers: matrix of cluster centers. Use this to plot the center of each cluster on a chart.
  • tot.withinss: total within cluster sum of squares, i.e. sum(withinss). We want this to be as small as possible, so each cluster is tightly defined.
  • betweenss: between-cluster sum of squares. We want this to be large, showing that each cluster is far away from the other clusters.
  • size: number of points in each cluster
  • iter: number of times the algorithm ran before stopping

To see a visualization of sums of squares, see https://shiny.rit.albany.edu/stat/visualizess/
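
For example, here is a minimal sketch of pulling these components out of the kresult object created above:

# Each component of the kmeans result can be accessed with $
kresult$size          # number of points in each cluster
kresult$centers       # matrix of cluster centers
kresult$tot.withinss  # total within-cluster sum of squares (smaller = tighter clusters)
kresult$betweenss     # between-cluster sum of squares (larger = better separated)
kresult$iter          # number of iterations before the algorithm stopped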

Minimal example

This is a minimal example, but it can be significantly improved, as shown in the full example below.

library(tidyverse)

# Load iris data into a tibble
data("iris")
t_iris <- tibble(iris)

# Run the kmeans algorithm:
kresult <- kmeans(
  x = select(t_iris, Sepal.Length, Petal.Length),
  centers = 3
)

# Add a new kmeans column, using kresult's cluster vector.
# (The cluster numbers are plain integers here, so ggplot will use a continuous
# color scale; the full example converts them to a factor instead.)
t_iris <- mutate(t_iris, kmeans = kresult$cluster)

# Plot petals
ggplot(data = t_iris) + 
  geom_point(mapping = aes( x = Sepal.Length, y = Petal.Length, color = kmeans) 
  )

Full example

library(tidyverse)

# Load iris data into a tibble
data("iris")
t_iris <- tibble(iris)

# Give a seed value so that we get the same results each time
set.seed(1)

# Run the kmeans algorithm:
#   x is a tibble, which should only have numeric data
#   centers is the number of clusters we want
#   nstart says how many random starting sets of centers to try (we use 10 here; the kmeans default is 1)
kresult <- kmeans(
  x = select(t_iris, Sepal.Length, Petal.Length),
  centers = 3,
  nstart = 10
)

# Add the new kmeans column, using kresult's cluster vector.
# Use factor to turn the cluster from a number to a categorical variable.
t_iris <- mutate(t_iris, kmeans = factor(kresult$cluster))

# Plot petals
ggplot() + 
  # Plot each point
  geom_point(
    data = t_iris, 
    mapping = aes( x = Sepal.Length, y = Petal.Length, color = kmeans) 
  ) +
  # Plot the center of each cluster 
  geom_point( 
    data = as_tibble(kresult$centers), 
    mapping = aes( x = Sepal.Length, y = Petal.Length), 
    size = 6, 
    shape = 3) +
  labs( 
    title = 'Iris dataset: k-means clustering',
    subtitle = paste('Error:', round(kresult$tot.withinss,2))
    )
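
Since the iris data also records the true species, one quick way to judge the clustering (a small sketch, reusing the t_iris tibble from the full example above) is to cross-tabulate the cluster assignments against Species. The cluster numbers are arbitrary, so look for rows and columns that line up rather than reading the table as a direct accuracy score.

# Compare the k-means cluster labels with the actual species.
table(t_iris$kmeans, t_iris$Species)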

Decision Trees

Decision trees automatically create questions to split our data into groups. They are very useful for creating human-understandable models of a problem space.

The r2d3 website is a great resource for visualizing how a tree is built.

Other helpful resources:

Concept

We follow this process:

  • Select an output variable
  • Go through each input variable to find the best split in our output (a small sketch of a single split follows this list)
  • Split the dataset
  • For each split, repeat until we reach a good ending condition
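
As a rough illustration of what a single split looks like, we can divide the iris rows on one question and count the species on each side. This is a sketch only: the threshold 2.5 is a hand-picked example value, not one chosen by rpart.

# Split the rows on one question and count the species on each side.
data(iris)
table(iris$Petal.Length < 2.5, iris$Species)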

Minimal Example

The code below creates a visualization of a decision tree.

Note that the output for each node shows the proportion of items in each class at that node, as well as the percentage of all rows that end up at that node.

library(tidyverse)
library(rpart)
library(rpart.plot)

# Load data into a tibble
data(iris)
t.iris <- tibble(iris)

# Create the model
#   formula: output_variable ~ input_field_a + input_field_b + ...
#   data: your tibble, excluding the output variable.
#   method: what type of problem are we working with?
#     class - predicting a discrete variable
#     anova - regression for a value
m <- rpart(formula = Species ~ Sepal.Length + 
                      Sepal.Width + Petal.Length + Petal.Width, 
           data = t.iris,
           method = "class")

# Plot results of model
rpart.plot(m)

What if we want a single number showing the model's accuracy rather than just the plot? Add the following lines:

# Give the tibble to the model, and generate a vector of predicted
# output class
predicted <- predict(m, t.iris, type = 'class')

# Add the predicted results to each row in the tibble as a field.
predicted_as_str <- paste('Predicted', predicted)
t.iris <- mutate( t.iris, predicted = predicted_as_str )

# Show a confusion matrix
table(t.iris$predicted, t.iris$Species)
##                       
##                        setosa versicolor virginica
##   Predicted setosa         50          0         0
##   Predicted versicolor      0         49         5
##   Predicted virginica       0          1        45
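
If we want accuracy as a single number, one simple way is to compare the predictions directly with the actual species (this reuses the predicted vector created above). For the confusion matrix shown here, that works out to (50 + 49 + 45) / 150 = 0.96.

# Proportion of rows where the predicted class matches the actual species.
mean(predicted == t.iris$Species)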

Full example

Several other options are very helpful for building and evaluating our decision tree: splitting the data into training and test sets, and the minsplit / minbucket controls.

library(tidyverse)
library(rpart)
library(rpart.plot)

# Load data into a tibble
data(iris)
t.iris <- tibble(iris)


## Split the data into training / testing

# Find the number of rows
count_of_rows <- length(t.iris$Species)
# Create a random vector of 1s and 0s the same length
select01 <- rbinom(count_of_rows, 1, .5)


## Create a training / test set

# Add the 01 vector to the original tibble
t.iris <- mutate(t.iris, test01 = select01) 
# Create a test and train tibble with this as a filter
t.iris.test <- filter(t.iris, test01 == 1)
t.iris.train <- filter(t.iris, test01 == 0)


## Create the model with TRAIN tibble
# 
#   minsplit: the minimum number of rows a node must have before a split is attempted
#   minbucket: the minimum number of rows allowed in any terminal node (leaf)
m <- rpart(formula = Species ~ Sepal.Length + 
                      Sepal.Width + Petal.Length + Petal.Width, 
           data = t.iris.train,
           method = "class",
           minsplit = 2,
           minbucket = 2)

# Plot results of model with the TRAINING data
rpart.plot(m)

## Predict on the TEST tibble

# Create a vector of prediction results
predicted <- predict(m, t.iris.test, type = 'class')
# Add the predictions to our TEST tibble.
t.iris.test <- mutate(t.iris.test, predicted = predicted)

# Show results in a confusion matrix / table.
# This shows a comparison between the actual values and what our algorithm
# predicted (shown in upper case for clarity).
table(str_to_upper(predicted), t.iris.test$Species)
##             
##              setosa versicolor virginica
##   SETOSA         14          0         0
##   VERSICOLOR      0         20         2
##   VIRGINICA       0          3        21
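
The same accuracy check from the minimal example works on the held-out TEST rows (a small sketch, reusing predicted and t.iris.test from above). Because the train/test split is random, the exact value will change from run to run unless you set a seed.

# Proportion of test rows classified correctly.
mean(predicted == t.iris.test$Species)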

Principal Component Analysis (PCA)

We use principal component analysis to reduce the number of dimensions in our data. In simpler terms, it lets us describe each observation with fewer variables.

See this visualization before reading the material below.

Some good resources are:

Prior Concept

Before getting into PCA, make sure you’re familiar with the concept of correlations. The code below walks you through an example. The pairs command will show a plot of various fields against each other. You can also use the custom function corrplot2.

# Create a set of pairs of scatterplots showing the relationship
# between all variables.
#   x is a data frame/tibble with only the numeric columns to show in the plot
#   col colors each point by its species
pairs(
  x = select(iris, -Species),
  col = iris$Species)

# View the correlations between each variable.
# Again, we use the iris dataset without the species column (text), giving
#   the chart only the numeric columns
# Positive correlations shown in blue, negative in red.
# Items not meeting statistical significance are hidden.
corrplot2(select( iris, -Species))
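
If corrplot2 is not available, or you just want the raw numbers, base R's cor() returns the underlying correlation matrix (a quick sketch):

# Correlation matrix of the numeric iris columns (base R only).
cor(iris[, -5])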

Concept Walk-through

We will run pca on the iris dataset.

# Give prcomp numeric values in the iris tibble.
iris_pca_results <- prcomp(select(tibble(iris), -Species))

# Show new components
summary(iris_pca_results)
## Importance of components:
##                          PC1    PC2    PC3     PC4
## Standard deviation     2.056 0.4926 0.2797 0.15439
## Proportion of Variance 0.925 0.0531 0.0171 0.00521
## Cumulative Proportion  0.925 0.9777 0.9948 1.00000
# Show how they are generated from our original data
print(iris_pca_results)
## Standard deviations (1, .., p=4):
## [1] 2.056 0.493 0.280 0.154
## 
## Rotation (n x k) = (4 x 4):
##                  PC1     PC2     PC3    PC4
## Sepal.Length  0.3614 -0.6566  0.5820  0.315
## Sepal.Width  -0.0845 -0.7302 -0.5979 -0.320
## Petal.Length  0.8567  0.1734 -0.0762 -0.480
## Petal.Width   0.3583  0.0755 -0.5458  0.754

The summary shows each of our new principal components. Each PC is a new variable, created as a weighted combination of the original variables.

Each new PC (principal component) is made by multiplying the existing values by the new coefficients (prcomp centers each variable first, so the values are measured relative to their column means). For example, PC1 is

\(0.36 * Sepal.Length - 0.08 * Sepal.Width + 0.86 * Petal.Length + 0.36 * Petal.Width\)

Each new PC variable flattens the data across one dimension. The exact weights for each variable are chosen to maximize the amount of variation in the original dataset that is maintained in the new variable.

The second PC then does the same process again. However, now that the first has already captured a large share of the variation in the dataset, the new one only tries to capture the remaining variation in the data. Think of the first PC as removing much of the variation, with each following PC grabbing more of what remains until none is left.

For interpreting the data, we want each PC to have a high proportion of variance (meaning it captures most of the variation in our data). Then, look at the weights for that PC: which variables drive it? That helps you name it.
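
To convince yourself of the PC1 formula above, you can rebuild PC1 by hand and compare it to the first column of the x matrix. This is a small sketch, assuming the iris_pca_results object from the prcomp() call earlier in this section.

# Center the numeric columns (prcomp centers by default, but does not scale).
centered <- scale(select(iris, -Species), center = TRUE, scale = FALSE)

# Multiply by the PC1 weights from the rotation matrix.
manual_pc1 <- centered %*% iris_pca_results$rotation[, 1]

# The two columns below should match (up to rounding).
head(cbind(manual_pc1, iris_pca_results$x[, 1]))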

Concept Illustration

Now that we have a set of PCs, let's visualize the new dimensions on our data. The chart below visualizes our first principal component.

# Run PCA analysis
iris_pca_results <- prcomp(select(tibble(iris), -Species))

# The results include a matrix called x, which gives the PC scores for each row
# of our original dataset. Let's create a new tibble with the PCs, and then plot it.
plot_data <- tibble(
  pc1 = iris_pca_results$x[,1],
  pc2 = iris_pca_results$x[,2],
  pc3 = iris_pca_results$x[,3],
  Species = iris$Species
)

# Plot PC1
ggplot(data = plot_data) +
  geom_point(mapping = aes( x = pc1, y = 1, color = Species),
             alpha = .2,
             size = 2) +
  labs(title = "PC1 versus species")

What about plotting 2 components?

ggplot(data = plot_data) +
  geom_point(mapping = aes( x = pc1, y = pc2, color = Species),
             alpha = .3,
             size = 2) +
  labs(title = "PC1 and PC2 versus species")

We can also examine more closely how the new PCs are formed by looking at the rotation variable, which gives us the degree to which each original variable drives each PC.

# Turn the rotation matrix into a usable tibble and add the item (row) names
pca_tibble <- as_tibble(iris_pca_results$rotation) %>% 
  mutate(items = rownames(iris_pca_results$rotation) )

# Turn into long format
pca_tibble_long <- pivot_longer(pca_tibble, 
                                cols = starts_with('PC'), 
                                names_to = 'PCItem', 
                                values_to = 'PCValue')

# Remove values near zero
pca_tibble_long <- filter( pca_tibble_long, PCValue > .1 | PCValue < -.1 )

# View PCs as a basic dot plot.
ggplot( data = pca_tibble_long ) +
  geom_point( mapping = aes( y = items, x = PCValue, color = PCValue), size = 5) +
  facet_wrap( ~ PCItem, nrow = 1)  +
  scale_color_gradient(low = 'red', high = 'green')

Full example

In the full example, we generally want to scale the data. This is needed because inputs with much larger numeric ranges would otherwise have more influence on the model. Scaling each input to the same mean and standard deviation gives them equal impact on the outcome.
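
As a quick check of how different the iris measurement scales actually are, you can look at each column's standard deviation before standardizing (a small base R sketch):

# Standard deviation of each numeric iris column.
sapply(iris[, -5], sd)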

library(tidyverse)

# Load data into a tibble
data(iris)
t_iris <- tibble(iris)
t_iris_numbers <- select(iris, -Species)

# Give prcomp numeric values in the iris tibble.
# scale: adjust the data so that sd = 1?
# center: should we center data around 0?
iris_pca_results <- prcomp(t_iris_numbers,
                           scale = TRUE,
                           center = TRUE)

# Plot our results
biplot(iris_pca_results)

# Show new components
summary(iris_pca_results)
## Importance of components:
##                         PC1   PC2    PC3     PC4
## Standard deviation     1.71 0.956 0.3831 0.14393
## Proportion of Variance 0.73 0.229 0.0367 0.00518
## Cumulative Proportion  0.73 0.958 0.9948 1.00000
# Show how they are generated from our original data
print(iris_pca_results)
## Standard deviations (1, .., p=4):
## [1] 1.708 0.956 0.383 0.144
## 
## Rotation (n x k) = (4 x 4):
##                 PC1     PC2    PC3    PC4
## Sepal.Length  0.521 -0.3774  0.720  0.261
## Sepal.Width  -0.269 -0.9233 -0.244 -0.124
## Petal.Length  0.580 -0.0245 -0.142 -0.801
## Petal.Width   0.565 -0.0669 -0.634  0.524