How to Create a Cumulative Frequency Graph in R

This page shows how to create cumulative frequency graphs in R using the stat_ecdf() function from the ggplot2 package and the survfit() function from the survival package. A cumulative frequency graph is also called an empirical cumulative distribution function (ECDF) curve.

What is a Cumulative Frequency Graph?

A cumulative frequency graph shows, for each value of the variable being plotted, the proportion of the data points that are at or below that value. A reversed curve, which starts at 1 (or 100%) instead of 0, shows the proportion of the data points that are above that value.
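
To make the definition concrete, here is a minimal base-R sketch that uses ecdf() from the stats package instead of a plotting function. The cut-off of 3.0 cm and the variable names (x, Fn, below, above, below_direct) are illustrative choices, not part of the original examples.

```r
# Sketch: the numbers behind a cumulative frequency graph, using base R only
x <- iris$Sepal.Width            # sepal widths from the built-in iris dataset

Fn <- ecdf(x)                    # step function: Fn(v) = proportion of x <= v

below <- Fn(3.0)                 # proportion of values at or below 3.0 cm
above <- 1 - Fn(3.0)             # proportion of values strictly above 3.0 cm

# The same proportion computed directly, as a sanity check
below_direct <- mean(x <= 3.0)

print(c(below = below, above = above, direct = below_direct))
```

The first value is the height of the standard curve at 3.0 cm, and the second is the height of the reversed curve at the same point.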

Create Cumulative Frequency Graphs with stat_ecdf()

In the first example, we generate a simple cumulative frequency graph of the sepal width variable from the iris dataset in R. In the second example, we reverse the curve so that it starts at 1 instead of 0 (using the aesthetic y = 1 - ..y..). Finally, the third example shows the y-axis in percentages instead of proportions.

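#Load ggplot2 and copy the built-in iris dataset to cdf01, the name used in the examples
library(ggplot2)
cdf01 <- iris

#Graph A. Default output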
ggplot(cdf01, aes(Sepal.Width)) + 
  stat_ecdf(geom = "step", color="purple")

#Graph B. Reverse cumulative distribution curve using 1-y
#(ggplot2 3.4.0 and later warn that the ..y.. notation is deprecated in favor of after_stat())
v <- ggplot(cdf01, aes(Sepal.Width, y = 1 - ..y..)) + 
        stat_ecdf(geom = "step", color="purple", size=1)
v

#Graph C. Reverse cumulative distribution curve using 1-y, with the y-axis shown in percent
v + labs(x="Sepal Width", y="%") +
  scale_y_continuous(breaks=seq(0,1,0.1), labels = seq(0,100,10))

Figure: cumulative distribution curves of sepal width in R. A: default graph. B: reversed curve. C: reversed curve with the y-axis in percent.

The next example generates multiple cumulative frequency graphs, one for each Species in the iris dataset. The first plot shows the standard curves, which start at 0%. The second plot again uses y = 1 - ..y.. to reverse the curves so that they start at 100% instead of 0%.

#Multiple CDF for the different iris species
ggplot(cdf01, aes(Sepal.Width, color=Species)) + 
       stat_ecdf(geom = "step", size=1) + 
       labs(x="Sepal Width", y="%") +
       scale_y_continuous(breaks=seq(0,1,0.1), labels = seq(0,100,10)) + 
       scale_color_manual(values=c("#96ceb4", "#ff6f69", "#ffcc5c")) + 
       theme_minimal()

#Multiple reverse CDF for the different iris species
ggplot(cdf01, aes(Sepal.Width, y = 1 - ..y.., color=Species)) + 
       stat_ecdf(geom = "step", size=1) + 
       labs(x="Sepal Width", y="%") +
       scale_y_continuous(breaks=seq(0,1,0.1), labels = seq(0,100,10)) + 
       scale_color_manual(values=c("#96ceb4", "#ff6f69", "#ffcc5c")) + 
       theme_minimal()

How to Interpret Cumulative Frequency Graphs

We can see from the above figures that sepal widths tend to be larger for the Setosa species than for the other two species, and larger for Virginica than for Versicolor: the Setosa curve lies furthest to the right and the Versicolor curve furthest to the left. The median sepal width, read off where each curve crosses 50%, is about 2.8, 3.0, and 3.4 cm for Versicolor, Virginica, and Setosa, respectively.

The first graph (the one that starts from 0%) shows the percentage of the data points at or below a value of sepal width, say 2.5 cm. Roughly 2% of the Setosa values are at or below 2.5 cm. About 10% and 27% of the Virginica and Versicolor data points are at or below 2.5 cm, respectively.

The second graph (the one that starts from 100%) shows the percentage of the data points above a value of sepal width, say 2.5 cm. Roughly 98% of the Setosa values are above 2.5 cm. About 90% and 73% of the Virginica and Versicolor data points are above 2.5 cm, respectively.
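
The readings above can be checked numerically. The following is a small base-R sketch, assuming the same 2.5 cm cut-off; the names widths, cutoff, and summary_tbl are hypothetical and chosen only for this illustration. Values computed this way may differ slightly from numbers read by eye off the curves.

```r
# Sketch: per-species medians and percentages at a chosen cut-off
widths <- split(iris$Sepal.Width, iris$Species)   # one numeric vector per species

cutoff <- 2.5   # the example value of sepal width used in the text

summary_tbl <- data.frame(
  median_cm       = sapply(widths, median),
  pct_at_or_below = sapply(widths, function(w) 100 * mean(w <= cutoff)),
  pct_above       = sapply(widths, function(w) 100 * mean(w > cutoff))
)

print(summary_tbl)
```

The pct_at_or_below column corresponds to the curves that start from 0%, and pct_above to the reversed curves that start from 100%; for each species the two add up to 100.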

Create Cumulative Frequency Graphs with survfit()

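Cumulative frequency graphs can also be created with the survfit() function in the survival package. Without censoring, the estimated survival function equals 1 minus the empirical cumulative distribution function, so the default survival curve is the reversed cumulative frequency graph (starting from 100%), and plotting with fun="event" gives the standard one (starting from 0%). The estimates are therefore calculated first using the survfit() function and then plotted using the plot() function.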

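#Surv() and survfit() come from survival; %>% and mutate() come from dplyr
library(survival)
library(dplyr)

#The same dataset as above; status = 1 marks every observation as an event (no censoring)
cdf02 <- cdf01 %>% mutate(status = 1)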

#Graph D: calculate the estimates, then plot the survival curve, which starts from 100%
fit2 <- survfit( Surv(Sepal.Width, status) ~ 1, data = cdf02)
plot(fit2, conf.int = FALSE, xlab="Sepal Width", ylab="%", axes=FALSE, 
     col="purple", lwd=2, xlim=c(2,4.5), ylim=c(0,1))

#customize the axes
axis(1, at = seq(2,4.5,0.5))
axis(2, at = seq(0,1,0.1), labels = seq(0,100,10))

#Graph E: Reverse the plot so that it starts from 0% instead of 100%
plot(fit2, conf.int = FALSE, xlab="Sepal Width", ylab="%", axes=FALSE, 
     col="purple", lwd=2, xlim=c(2,4.5), ylim=c(0,1), fun="event")
axis(1, at = seq(2,4.5,0.5))
axis(2, at = seq(0,1,0.1), labels = seq(0,100,10))

#Graph F: Calculate the estimates per Species, to generate separate plot per species
fit3 <- survfit( Surv(Sepal.Width, status) ~ Species, data = cdf02)
#generate separate plot per Species
plot(fit3, conf.int = FALSE, xlab="Sepal Width", ylab="%", axes=FALSE, fun="event",
     lwd=2, xlim=c(2,4.5), ylim=c(0,1), col=c("#96ceb4", "#ff6f69", "#ffcc5c"))
axis(1, at = seq(2,4.5,0.5))
axis(2, at = seq(0,1,0.1), labels = seq(0,100,10))
legend(3.5, 0.5, legend=c("Setosa", "Versicolor", "Virginica"),
       col=c("#96ceb4", "#ff6f69", "#ffcc5c"), lty=1, lwd=2, cex=0.8)

Figure: cumulative frequency graphs using the survfit() and plot() functions. D: default curve, starting from 100%. E: curve drawn with fun="event", starting from 0%. F: one curve per species, drawn with fun="event".
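
As a quick check of the link between the two approaches, the sketch below compares 1 minus the survfit() estimate with the output of the base-R ecdf() function at the same sepal widths. It assumes no censoring (status = 1 for every row); the names d, km, cdf_from_km, and cdf_direct are illustrative only.

```r
# Sketch: without censoring, 1 - survival estimate matches the ECDF
library(survival)

d <- data.frame(w = iris$Sepal.Width,
                status = 1)            # 1 = event observed, i.e. no censoring

km <- survfit(Surv(w, status) ~ 1, data = d)

# km$time holds the observed widths and km$surv the estimated
# proportion of values above each of them
cdf_from_km <- 1 - km$surv
cdf_direct  <- ecdf(d$w)(km$time)

print(all.equal(cdf_from_km, cdf_direct))
```

If the comparison prints TRUE, the curve drawn with fun="event" is the same step function that stat_ecdf() draws, up to plotting details.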
