Revising my grammar of graphics.
In 2018, whilst still an R newbie, I participated in the RLadies Melbourne community lightning talks and talked about how to visualise volcano plots in R. Volcano plots are probably an obscure concept outside of bioinformatics, but their construction nicely showcases the elegance of ggplot2
.
In the last two years, a number of small and handy functions have been added to dplyr
and ggplot2
, which this post has been updated to reflect. 1
Let’s get started then.
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
tidyverse,
janitor, # Cleaning column names
scales, # Transform axis scales
ggrepel) # Optimise plot label separation
This is a dataset with four columns:
Every row represents a unique gene expression fold change, which fulfills tidy data requirements for creating data visualisations.
Note: The data used originates from Fu et al. Nat Cell Biol. 2015 and a copy of the original dataset can be found here.
# Import and clean dataset -----------------------------------------------------
diseased_vs_healthy <- read.delim(here("data", "luminal-pregnant-vs-lactate.txt"),
header = TRUE,
sep = "\t")
diseased_vs_healthy <- janitor::clean_names(diseased_vs_healthy)
diseased_vs_healthy <- diseased_vs_healthy %>%
mutate(fold_change = 2^log_fc) %>%
select(entrezid,
symbol,
fold_change,
adj_p_val)
entrezid | symbol | fold_change | adj_p_val |
---|---|---|---|
14367 | Fzd5 | 2.0716416 | 0.0542520 |
244144 | Usp35 | 0.4197075 | 0.0598292 |
100216534 | Snord96a | 1.2373480 | 0.9999999 |
66151 | Prr13 | 1.4296409 | 0.9999999 |
229445 | Ctso | 1.0506197 | 0.9999999 |
A basic version of a volcano plot depicts:
log2(fold_change)
-log10(adj_p_val)
Note: The y-axis depicts -log10(adj_p_val), which allows the points on the plot to project upwards as the fold change greatly increases or decreases. This is more intuitive to visualise, the data points at the edges of the ‘volcano spray’ are the most interesting ones.
The versatility of ggplot2
also means that you don’t need to store data transformations as separate variables for plotting. You can apply transformations directly inside ggplot(data, aes(x, y))
or alternatively by using scale_x_continuous(trans = "...")
or coord_trans(x, y)
.
Note: For single layer plots, use %>%
pipes with ggplot2
functions for convenience and readability.
The functions geom_hline()
and geom_vline()
can be used to add extra horizontal and vertical lines on your plot respectively. In this example, I am interested in constructing boundaries for genes which have adj_p_val <= 0.05
and fold_change <= 0.5
or fold_change >= 2
.
Volcano plots should have a symmetrical x-axis. One way you can do this is by manually setting the limits of the x-axis using xlim(min, max)
.
# Identify xlim() values -------------------------------------------------------
diseased_vs_healthy %>%
pull(fold_change) %>%
min() %>%
log2() %>%
floor()
#> [1] -10
diseased_vs_healthy %>%
pull(fold_change) %>%
max() %>%
log2() %>%
ceiling()
#> [1] 8
max(abs(-10), abs(8))
#> [1] 10
# Change xlim() ----------------------------------------------------------------
# Manually specify x-axis limits
vol_plot +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
xlim(-10, 10)
You can also change the limits of the x-axis inside scale_x_continuous
. This method also gives you the flexibility to fine-tune the spacing and labelling of axis tick marks.
# Modify scale_x_continuous() --------------------------------------------------
vol_plot +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
scale_x_continuous(breaks = c(seq(-10, 10, 2)), # Modify x-axis tick intervals
limits = c(-10, 10))
Note: The value specified inside the argument scale_continuous_x(limits = ...)
supersedes the range of values specified inside the argument scale_continuous_x(breaks = ...)
.
To visualise different groups of genes using different colours, point sizes, shapes or transparencies, you need to categorise genes into different groups and store these categories as a new parameter i.e. new column of data.
I am interested in labelling genes into the following groups:
# Create new categorical column ------------------------------------------------
diseased_vs_healthy <- diseased_vs_healthy %>%
mutate(gene_type = case_when(fold_change >= 2 & adj_p_val <= 0.05 ~ "up",
fold_change <= 0.5 & adj_p_val <= 0.05 ~ "down",
TRUE ~ "ns"))
# Obtain gene_type counts ------------------------------------------------------
diseased_vs_healthy %>%
count(gene_type)
gene_type | n |
---|---|
down | 1245 |
ns | 13578 |
up | 981 |
In ggplot2
, you also have the option to visualise different groups by point colour, size, shape and transparency by modifying parameter like scale_color_manual()
etc. A tidy way of doing this is to separately store your manual specifications as vectors.
# Check gene_type categories ---------------------------------------------------
diseased_vs_healthy %>%
distinct(gene_type) %>%
pull()
#> [1] "down" "up" "ns"
# Add colour, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey")
sizes <- c("up" = 2, "down" = 2, "ns" = 1)
alphas <- c("up" = 1, "down" = 1, "ns" = 0.5)
diseased_vs_healthy %>%
ggplot(aes(x = log2(fold_change),
y = -log10(adj_p_val),
fill = gene_type,
size = gene_type,
alpha = gene_type)) +
geom_point(shape = 21, # Specify shape and colour as fixed local parameters
colour = "black") +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point colour
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
scale_x_continuous(breaks = c(seq(-10, 10, 2)),
limits = c(-10, 10))
You can also overlay subplots on top of your main plot. This is useful when you want to highlight a subset of your data using different colours, shapes and etc. When overlaying plots, you should not use %>%
pipes but use global ggplot(
data = “…”)
and local geom_point(data = ...)
arguments instead.
# Add subplot layer to the main volcano plot -----------------------------------
ils <- str_subset(diseased_vs_healthy$symbol, "^[I|i]l[0-9]+$")
il_genes <- diseased_vs_healthy %>%
filter(symbol %in% ils)
ggplot(data = diseased_vs_healthy, # Original data
aes(x = log2(fold_change), y = -log10(adj_p_val))) +
geom_point(colour = "grey", alpha = 0.5) +
geom_point(data = il_genes, # New layer containing data subset il_genes
size = 2,
shape = 21,
fill = "firebrick",
colour = "black")
Note: Unless local aesthetics are specified, secondary geom_point()
functions will inherit global ggplot aesthetics.
You can also label a subset of data using geom_text()
, geom_label()
, geom_text_repel()
or geom_label_repel
and by specifying which column to display as text using the local argument geom_text(aes(label = ...))
.
Note: adjusting the parameters for optimal text separation using geom_text_repel
can be a bit fiddly. I generally start by modifying force and then deciding which region of the plot I want to nudge my text or labels towards. You can read this vignette for more tips on adjusting geom_text_repel
parameters.
# Layer more subplots ----------------------------------------------------------
sig_il_genes <- diseased_vs_healthy %>%
filter(symbol %in% c("Il15", "Il34", "Il24"))
up_il_genes <- diseased_vs_healthy %>%
filter(symbol == "Il24")
down_il_genes <- diseased_vs_healthy %>%
filter(symbol %in% c("Il15", "Il34"))
ggplot(data = diseased_vs_healthy,
aes(x = log2(fold_change),
y = -log10(adj_p_val))) +
geom_point(aes(colour = gene_type),
alpha = 0.2,
shape = 16,
size = 1) +
geom_point(data = up_il_genes,
shape = 21,
size = 2,
fill = "firebrick",
colour = "black") +
geom_point(data = down_il_genes,
shape = 21,
size = 2,
fill = "steelblue",
colour = "black") +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
geom_label_repel(data = sig_il_genes, # Add labels last to appear as the top layer
aes(label = symbol),
force = 2,
nudge_y = 1) +
scale_colour_manual(values = cols) +
scale_x_continuous(breaks = c(seq(-10, 10, 2)),
limits = c(-10, 10))
If you need to change the order of categorical figure legend values, you will need to factor()
and re-level your categorical variable. This can be done using the forcats
package, which allows you to easily modify factor levels.
# Modify legend labels by re-ordering gene_type levels -------------------------
diseased_vs_healthy <- diseased_vs_healthy %>%
mutate(gene_type = fct_relevel(gene_type, "up", "down"))
# Recreate volcano plot --------------------------------------------------------
ggplot(data = diseased_vs_healthy,
aes(x = log2(fold_change),
y = -log10(adj_p_val))) +
geom_point(aes(colour = gene_type),
alpha = 0.2,
shape = 16,
size = 1) +
geom_point(data = up_il_genes,
shape = 21,
size = 2,
fill = "firebrick",
colour = "black") +
geom_point(data = down_il_genes,
shape = 21,
size = 2,
fill = "steelblue",
colour = "black") +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
geom_label_repel(data = sig_il_genes,
aes(label = symbol),
force = 2,
nudge_y = 1) +
scale_colour_manual(values = cols) +
scale_x_continuous(breaks = c(seq(-10, 10, 2)),
limits = c(-10, 10))
The last finishing touches include modifying plot labels and the plot theme.
The function labs()
is a handy way of organising all plot labels inside a single function. You can assign labels as NULL
to prevent them from being displayed.
A plot can be further improved by changing its theme()
and/or by modifying individual theme()
parameters.
# Add plot labels and modify plot theme ----------------------------------------
final_plot <- ggplot(data = diseased_vs_healthy,
aes(x = log2(fold_change),
y = -log10(adj_p_val))) +
geom_point(aes(colour = gene_type),
alpha = 0.2,
shape = 16,
size = 1) +
geom_point(data = up_il_genes,
shape = 21,
size = 2,
fill = "firebrick",
colour = "black") +
geom_point(data = down_il_genes,
shape = 21,
size = 2,
fill = "steelblue",
colour = "black") +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = c(log2(0.5), log2(2)),
linetype = "dashed") +
geom_label_repel(data = sig_il_genes,
aes(label = symbol),
force = 2,
nudge_y = 1) +
scale_colour_manual(values = cols) +
scale_x_continuous(breaks = c(seq(-10, 10, 2)),
limits = c(-10, 10)) +
labs(title = "Gene expression changes in diseased versus healthy samples",
x = "log2(fold change)",
y = "-log10(adjusted P-value)",
colour = "Expression \nchange") +
theme_bw() + # Select theme with a white background
theme(panel.border = element_rect(colour = "black", fill = NA, size= 0.5),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
final_plot
Note: You can specify panel.grid... = element_line(linetype = "dotted")
inside theme()
to create dotted gridlines along the x and/or y axis. Major gridline positions are inherited from the values of axis breaks.
You can add more descriptions to a plot by using the function annotate()
to display text.
# Annotate text inside plot ----------------------------------------------------
final_plot +
annotate("text", x = 7, y = 10,
label = "3 interleukins of interest", color = "firebrick")
ggplot
cheatsheetggplot2
axis transformations
The original coding logic should still be attributed to Chuanxin Liu, my former PhD student. I also recommend the excellent RStudio Cloud ggplot2 tutorials, which have taught me a few new tricks.↩︎