ggplot()
The ultimate goal of any analysis is to communicate your results and understanding. Plotting your data is the most effective way to communicate a summary of it to an audience. Today will focus on the basics of how to do that in R using ggplot2, the tidyverse plotting package, with all examples using the built-in iris dataset.
You have to specify a data table, at least one column from the table, and a geometry.
library(ggplot2)  # also attached by library(tidyverse)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + geom_point()
The scales of the axes are automatically determined from the data, and the axes are labelled with the column names.
Adding an aesthetic like color modifies all the points and automatically adds a legend.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
The legend title, like the axis labels, is the name of the column given to color =, and the legend labels are whatever is in that column. For a continuous scale, ggplot() draws a color bar showing the range of colors and the values they correspond to.
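As a minimal sketch of the continuous case, the example below maps a numeric column to color. Sepal.Length is an arbitrary choice made here for illustration; any numeric column would behave the same way.

```r
library(ggplot2)

# Sepal.Length is numeric, so ggplot uses a continuous color scale
# and draws a color bar instead of discrete legend keys
p_continuous <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Sepal.Length)) +
  geom_point()
p_continuous
```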
To switch to another geometry (plot type), you just have to change the geom.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_boxplot()
Histogram
ggplot(iris, aes(x = Sepal.Width)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
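The message above is ggplot asking you to choose the bins yourself. A small sketch of doing so with binwidth follows; the value 0.2 is only an illustrative assumption, so try a few values and see which shows the shape of your data best.

```r
library(ggplot2)

# binwidth is in the units of the x axis (here, the units of Sepal.Width);
# 0.2 is an arbitrary illustrative value, not a recommendation
p_hist <- ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(binwidth = 0.2)
p_hist
```

Setting binwidth also silences the stat_bin() message.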
Be careful with histograms! By default, when you add fill, color, etc., geom_histogram() stacks the bars for the different groups on top of each other. For overlapping histograms, which are more effective visually and what people are used to seeing, use position = 'identity' inside geom_histogram().
# default histogram
ggplot(iris, aes(x = Sepal.Width, fill = Species)) + geom_histogram(alpha = 0.8)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# use position = 'identity' for overlapping histograms
ggplot(iris, aes(x = Sepal.Width, fill = Species)) + geom_histogram(position = 'identity', alpha = 0.8)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Density plot
ggplot(iris, aes(x = Sepal.Width, color = Species)) + geom_density()
Bar plot
ggplot(iris, aes(x = Species, fill = Species)) + geom_bar()
Scatter plot
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point()
Line plot
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_line()
Adding a line of best fit
If you want to fit a straight line (or another type of fit) to your scatter plot, add geom_smooth(). Use method = lm inside it to fit a straight line (see the geom_smooth() documentation for other fit methods). It also automatically draws a confidence interval around whatever is fit.
# with line of best fit
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) + geom_point() + geom_smooth(method = lm)
# if you color by something, each subgroup will automatically get its own line
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point() + geom_smooth(method = lm)
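If the confidence bands make the plot too busy, geom_smooth() accepts se = FALSE to leave them out. A minimal sketch, reusing the same plot:

```r
library(ggplot2)

# se = FALSE drops the shaded confidence band and keeps only the fitted lines
p_fit <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
p_fit
```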
Boxplot
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) + geom_boxplot()
Violin plot
A violin plot is a mirrored density plot displayed like a boxplot. It gives you a better sense of the distribution of the underlying data than the five-number summary shown by a boxplot.
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) + geom_violin()
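One way to compare the two summaries directly is to layer a narrow boxplot on top of each violin. This is only a sketch; the width and the white fill are illustrative choices, not part of the original examples.

```r
library(ggplot2)

# draw a narrow white boxplot inside each violin so the five-number summary
# and the full distribution can be read from the same panel
p_violin <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = 'white')
p_violin
```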
Color
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point()
Shape
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, shape = Species)) + geom_point()
Size
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, size = Species)) + geom_point()
## Warning: Using size for a discrete variable is not advised.
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species, shape = Species, size = Species)) + geom_point()
## Warning: Using size for a discrete variable is not advised.
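The warning appears because size implies an order that a discrete label like Species does not have. As a sketch of the case size is better suited to, the example below maps a numeric column instead; Petal.Length is an arbitrary choice for illustration.

```r
library(ggplot2)

# Petal.Length is numeric, so mapping it to size does not trigger the warning
p_size <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, size = Petal.Length)) +
  geom_point()
p_size
```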
aes()
Whether you put color (or shape, size, etc.) inside or outside of aes() has a different outcome. Inside aes(), ggplot maps the aesthetic to a column, so points are drawn differently depending on the data; outside aes(), it applies the same fixed value to all points.
# where we've had it
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
# I just want my points to be another color
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + geom_point(color = 'deepskyblue')
# BUT if I put color = 'deepskyblue' inside aes() it won't work; ggplot thinks it's a data feature
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = 'deepskyblue')) + geom_point()
You can also put aes() inside either ggplot() or whatever geom you pick.
# inside ggplot()
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
# inside geom_point()
ggplot(iris) + geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species))
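Where you put aes() matters once a plot has more than one geom: aesthetics set in ggplot() are inherited by every geom, while aesthetics set inside a geom apply to that geom only. A minimal sketch of the difference:

```r
library(ggplot2)

# x and y are set in ggplot(), so both geoms inherit them;
# color is set only in geom_point(), so geom_smooth() fits one line to all points
p_inherit <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm)
p_inherit
```

Moving color = Species into ggplot() instead would give each species its own fitted line.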
The alpha = parameter sets the transparency of the plot. Alpha ranges from 0 to 1, where 0 is completely transparent and 1 is completely solid. Making your plots partially transparent is helpful when you have overplotting or any overlapping. (I made the points bigger to make the difference in alpha easier to see.)
# without transparency
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point(size = 4)
# with transparency
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point(size = 4, alpha = 0.6)
Sometimes you might want to make multiple plots based on an element in your data, like significant/not significant, sample, phenotype, etc. If it’s a column of labels in your table, you can add a facet to split the plot automatically.
ggplot(iris, aes(x = Sepal.Width, fill = Species)) + geom_density() + facet_wrap( ~ Species)
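The layout of the panels can be adjusted too. As a small sketch, facet_wrap() takes an ncol argument (nrow also exists); stacking the panels in one column is just one illustrative choice.

```r
library(ggplot2)

# ncol = 1 stacks the three panels vertically, which makes the
# distributions easier to compare along the shared x axis
p_facet <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
  geom_density() +
  facet_wrap(~ Species, ncol = 1)
p_facet
```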
The best practice is to have your column names and labels in your data table formatted nicely so you can plot and not think about it. But sometimes that isn’t possible, or you don’t think about it, and you need to rename an axis, legend, etc. The easiest way is with labs().
# plot before labelling
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point()
# with labels added
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point() +
labs(x = 'sepal width (cm)', y = 'sepal length (cm)', color = 'species')
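labs() is not limited to the axes and legend; it also accepts a title (and subtitle and caption). A sketch with a made-up title, which is purely illustrative text:

```r
library(ggplot2)

# the title string is a placeholder invented for this example
p_labelled <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(title = 'Sepal dimensions by species',
       x = 'sepal width (cm)', y = 'sepal length (cm)', color = 'species')
p_labelled
```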
You can also modify the labels (and make more extensive modifications, but only labels are shown here) with scale_*(). The syntax is scale + _ + plot part to modify + _ + type of data (mainly discrete or continuous). These are the scale modifiers you’ll use most often:
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point() +
scale_x_continuous(name = 'sepal width (cm)') +
scale_y_continuous('sepal length (cm)') +
scale_color_discrete('species')
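As a sketch of the more extensive modifications mentioned above, the scale functions also take breaks (where the ticks or legend keys go) and labels (what text they show). The specific break values and legend labels below are illustrative assumptions.

```r
library(ggplot2)

# breaks sets where the x axis ticks go; labels replaces the legend text,
# given in the same order as the levels of Species
p_scaled <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  scale_x_continuous(name = 'sepal width (cm)', breaks = seq(2, 4.5, by = 0.5)) +
  scale_color_discrete(name = 'species',
                       labels = c('I. setosa', 'I. versicolor', 'I. virginica'))
p_scaled
```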
Viridis is a package of color scales designed to address several problems with color scales; see the introduction at https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
If it’s not already installed, install the package.
#install.packages('viridis')
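You have to load the package before you can use its color scales. Note that scale_color_viridis_d() is also built into ggplot2, while scale_color_viridis() comes from the viridis package.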
library(viridis)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + geom_point() + scale_color_viridis_d()
The viridis scale has better contrast than the default ggplot color scale for continuous coloring.
# default ggplot() continuous color scale
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Petal.Length)) + geom_point()
# viridis continuous color scale
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Petal.Length)) + geom_point() + scale_color_viridis()
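ggplot2 also ships its own continuous viridis scale, scale_color_viridis_c(), which works without loading the viridis package. The sketch below uses it with the option argument to pick one of the alternative palettes; 'magma' is an arbitrary illustrative choice.

```r
library(ggplot2)

# scale_color_viridis_c() is the continuous viridis scale built into ggplot2;
# option = 'magma' selects one of the alternative viridis palettes
p_magma <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Petal.Length)) +
  geom_point() +
  scale_color_viridis_c(option = 'magma')
p_magma
```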