
Make a Boxplot in R Using Already Computed Statistics

A boxplot is usually created from a continuous variable in a data frame. This is the preferred approach because it allows us to plot the outlying values separately on the graph. But sometimes the dataset is not available and all we have to work with is the five summary statistics: the minimum, first quartile, median, third quartile, and maximum. The geom_boxplot() function from the ggplot2 package in R allows us to generate a boxplot using only these five numbers.

The data

Consider the following summary statistics from the iris dataset in R. Let's assume that we do not have the individual observations, only the summary statistics. The data frame passed to ggplot() to generate the boxplots should have the following structure: one row per species and one column per statistic.

Species      Min   Q1    Median   Q3    Max
Setosa       0.1   0.2   0.2      0.3   0.6
Versicolor   1.0   1.2   1.3      1.5   1.8
Virginica    1.4   1.8   2.0      2.3   2.5
Summary Statistics of Petal Width from the Iris Dataset
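
As a minimal sketch, a data frame with this structure could be built by hand as follows. The column names minPW, Q1PW, medPW, Q3PW, and maxPW are taken from the plotting code further down. Building the data frame with data.frame() is an assumption for illustration, not necessarily how the original data frame was created.

```r
# Hypothetical reconstruction of statsPL: one row per species,
# one column per summary statistic (values copied from the table above)
statsPL <- data.frame(
  Species = c("Setosa", "Versicolor", "Virginica"),
  minPW   = c(0.1, 1.0, 1.4),
  Q1PW    = c(0.2, 1.2, 1.8),
  medPW   = c(0.2, 1.3, 2.0),
  Q3PW    = c(0.3, 1.5, 2.3),
  maxPW   = c(0.6, 1.8, 2.5)
)

# Sanity check: within each row the statistics must be in non-decreasing order
stopifnot(with(statsPL, all(
  minPW <= Q1PW & Q1PW <= medPW & medPW <= Q3PW & Q3PW <= maxPW
)))
```

The ordering check is optional, but it catches typing mistakes in the statistics before a misleading box is drawn.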

The boxplot

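The data frame that contains the five summary statistics is named statsPL, with the statistics stored in the columns minPW, Q1PW, medPW, Q3PW, and maxPW. The first aes() maps Species to the x-axis and to fill, so that each box is filled with a different color. The second aes(), inside geom_boxplot(), maps the five statistics to ymin, lower, middle, upper, and ymax. stat = "identity" must also be specified so that ggplot2 uses these values as given instead of computing them from raw observations.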

library(ggplot2)

ggplot(statsPL, aes(x = Species, fill = Species)) +
  # stat = "identity": draw the boxes from the supplied statistics as-is
  geom_boxplot(
    aes(ymin = minPW, lower = Q1PW, middle = medPW, upper = Q3PW, ymax = maxPW),
    stat = "identity"
  ) +
  # label the axes
  labs(x = "Iris Species", y = "Petal Width (cm)") +
  # one fill color per species; requires fill = Species in the first aes()
  scale_fill_manual(values = c("#84d8aa", "#a1c3e4", "#be8ca3")) +
  theme_classic()
Boxplot created from already computed statistics

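Note that the whiskers of the above boxplots extend to the minimum and maximum values of the whole dataset, rather than to the minimum and maximum of the non-outlying values only, as they would in a conventional boxplot. Any outliers are therefore not shown as separate points.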
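
To illustrate the difference, the sketch below compares the two kinds of whisker ends for one species. It assumes, for illustration only, that the raw observations are available through R's built-in iris data, and it uses boxplot.stats() with its default 1.5 × IQR rule as a stand-in for a conventional boxplot.

```r
# Petal widths for one species from R's built-in iris data
pw <- iris$Petal.Width[iris$Species == "setosa"]

# boxplot.stats() returns, in $stats: lower whisker, lower hinge, median,
# upper hinge, upper whisker (default 1.5 * IQR rule);
# $out holds the points lying beyond the whiskers
bs <- boxplot.stats(pw)

whiskers_usual   <- bs$stats[c(1, 5)]  # whisker ends in a conventional boxplot
whiskers_summary <- range(pw)          # min and max, as used with summary statistics

print(whiskers_usual)
print(whiskers_summary)
print(bs$out)  # observations a conventional boxplot would draw as separate points
```

Whenever bs$out is not empty, the whiskers drawn from summary statistics reach further than those of a conventional boxplot.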
