How to Create a Histogram in R
There are several ways to create a histogram in R. The hist() function that comes with base R does the job, but the geom_histogram() function in the ggplot2 package is a more powerful and more customizable option.
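For comparison, a minimal base R version could look like the sketch below. The title and axis label are illustrative choices, and the bins are left at whatever hist() picks by default.

```r
# Base R alternative: histogram of sepal length from the built-in iris data
h <- hist(iris$Sepal.Length,
          main = "Sepal length",       # illustrative title
          xlab = "Sepal length (cm)")  # illustrative axis label

# hist() also returns the bin breaks and counts (invisibly)
sum(h$counts)  # total number of observations that were binned
```

The rest of this post uses ggplot2 instead, since layering colors, themes, and groups onto the plot is more convenient there.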
The following code will generate a histogram of the sepal length variable in the iris dataset. The iris dataset is available in base R for all to use.
#load ggplot2 (install it first with install.packages("ggplot2") if needed)
library(ggplot2)
#creates a histogram in R
ggplot(data=iris, aes(Sepal.Length)) +
  geom_histogram()
The ggplot() call above gives the default plot, which uses 30 bins. The number of bins can be controlled by specifying the width of each bin with the binwidth argument. We choose a binwidth of 0.25 in the next plots.
To flip the graph to a horizontal histogram, the sepal length variable should be mapped to the y-axis by specifying y=Sepal.Length.
#specifying the bin width
#and flipping the graph to a horizontal histogram
ggplot(data=iris, aes(y=Sepal.Length)) +
  geom_histogram(binwidth = 0.25)
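As a rough sketch of the two ways to control binning: geom_histogram() accepts either binwidth (the width of each bin) or bins (the number of bins). The value 15 below is an arbitrary choice for illustration, and layer_data() is used only to inspect the bins that ggplot2 computed.

```r
library(ggplot2)

# Option 1: fix the width of each bin
p_width <- ggplot(data = iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

# Option 2: fix the number of bins instead (15 is an arbitrary example value)
p_bins <- ggplot(data = iris, aes(Sepal.Length)) +
  geom_histogram(bins = 15)

# layer_data() returns the computed bins, one row per bar
bins_width <- layer_data(p_width)
bins_count <- layer_data(p_bins)

# Either way, every observation lands in exactly one bin
sum(bins_width$count)
sum(bins_count$count)
```

Which argument to use is mostly a matter of taste: binwidth is easier to interpret in the units of the data, while bins is convenient when the range of the data is not known in advance.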
At this stage, adding some color would improve the graph. A theme, theme_minimal(), is also added; it removes the grey background from the plot.
#adding a fill color and using black as line color
ggplot(data=iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.25, color="black", fill="#c396a9") +
  theme_minimal()
Multiple Histograms in One Graph
If we are interested in how the distribution of a variable differs across the levels of a categorical variable, we might want to plot the levels as separate histograms in the same graph. For example, let us compare the distribution of sepal length across the different species of the iris flower.
The parameters color=Species and fill=Species are used to indicate the different levels of Species in the same plot. alpha=0.3 specifies how transparent the bars should be.
There are three main options to plot multiple histograms on the same graph, using the position= parameter in the geom_histogram() function. The first option is to stack the bars of the different levels on top of each other, using position="stack". This is the default option, so it does not need to be specified.
#Bars stacked on each other
ggplot(data=iris, aes(Sepal.Length, color=Species, fill=Species)) +
  geom_histogram(binwidth = 0.15, alpha=0.3) +
  theme_minimal()
To generate multiple histograms overlaid on each other, position="identity" is used in the geom_histogram() function:
#Multiple histograms for the different levels of Species overlaid
ggplot(data=iris, aes(Sepal.Length, color=Species, fill=Species)) +
  geom_histogram(binwidth = 0.15, position="identity", alpha=0.3) +
  theme_minimal()
Finally, position="dodge" can be used to interleave the bars, placing the bars of the different levels side by side within each bin so that they do not overlap:
#position="dodge" is used to generate interleaved bars
ggplot(data=iris, aes(Sepal.Length, color=Species, fill=Species)) +
  geom_histogram(binwidth = 0.15, position="dodge", alpha=0.3) +
  theme_minimal()
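To compare the three position options without retyping the plot, one possible approach is to build them in a loop. This is only a sketch: the names positions and plots are made up for the example, and the plot titles are an addition so the variants can be told apart.

```r
library(ggplot2)

# The three position options discussed above
positions <- c("stack", "identity", "dodge")

# Build one plot per option; only the position argument changes
plots <- lapply(positions, function(pos) {
  ggplot(data = iris, aes(Sepal.Length, color = Species, fill = Species)) +
    geom_histogram(binwidth = 0.15, position = pos, alpha = 0.3) +
    ggtitle(paste0('position = "', pos, '"')) +
    theme_minimal()
})
names(plots) <- positions

# Display a single variant by name
print(plots[["dodge"]])
```

The position setting only changes how the bars are drawn. The underlying bin counts per species are the same in all three plots.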
Choose the Colors
The functions scale_fill_manual() and scale_color_manual() can be used to manually choose the colors of the histograms. In the figure below we plot the Petal Width variable of the different species. We also specify custom labels for the x- and y-axes with labs().
#choosing the fill and line colors manually and labeling the axes
ggplot(data=iris, aes(Petal.Width, fill=Species, color=Species)) +
  geom_histogram(binwidth = 0.15, position="identity", alpha=0.6) +
  scale_fill_manual(values=c("#96ceb4", "#ff6f69", "#ffcc5c")) +
  scale_color_manual(values=c("#96ceb4", "#ff6f69", "#ffcc5c")) +
  labs(x="Petal Width", y="Frequency") +
  theme_minimal()
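With an unnamed vector, the colors are assigned to the species in the order of the factor levels. If you would rather tie each color to a specific species, one option is to pass a named vector, as in the sketch below. The variable name species_colors is made up for this example, and the pairing of colors to species simply follows the order used above.

```r
library(ggplot2)

# Named vector: each color is matched to a species by name,
# not by its position in the vector
species_colors <- c(setosa     = "#96ceb4",
                    versicolor = "#ff6f69",
                    virginica  = "#ffcc5c")

p <- ggplot(data = iris, aes(Petal.Width, fill = Species, color = Species)) +
  geom_histogram(binwidth = 0.15, position = "identity", alpha = 0.6) +
  scale_fill_manual(values = species_colors) +   # same palette for the fill
  scale_color_manual(values = species_colors) +  # and for the outline
  labs(x = "Petal Width", y = "Frequency") +
  theme_minimal()

print(p)
```

Defining the palette once and reusing it also keeps the fill and outline colors in sync if you later change one of the hex codes.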