Volcano plots with ggplot2

data visualisation ggplot2 tidyverse R

Revising my grammar of graphics.

Erika Duan
01-02-2021

Introduction

In 2018, whilst still an R newbie, I participated in the RLadies Melbourne community lightning talks and talked about how to visualise volcano plots in R. Volcano plots are probably an obscure concept outside of bioinformatics, but their construction nicely showcases the elegance of ggplot2.

In the last two years, a number of small and handy functions have been added to dplyr and ggplot2, which this post has been updated to reflect. 1

Let’s get started then.

# Load required packages -------------------------------------------------------  
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,  
               tidyverse, 
               janitor, # Cleaning column names  
               scales, # Transform axis scales   
               ggrepel) # Optimise plot label separation  

Import a test dataset

This is a dataset with four columns:

Every row represents a unique gene expression fold change, which fulfills tidy data requirements for creating data visualisations.

Note: The data used originates from Fu et al. Nat Cell Biol. 2015 and a copy of the original dataset can be found here.

# Import and clean dataset ----------------------------------------------------- 
diseased_vs_healthy <- read.delim(here("data", "luminal-pregnant-vs-lactate.txt"),
                                  header = TRUE,
                                  sep = "\t")  

diseased_vs_healthy <- janitor::clean_names(diseased_vs_healthy)  

diseased_vs_healthy <- diseased_vs_healthy %>%
  mutate(fold_change = 2^log_fc) %>%
  select(entrezid,
         symbol,
         fold_change,
         adj_p_val)  
entrezid symbol fold_change adj_p_val
14367 Fzd5 2.0716416 0.0542520
244144 Usp35 0.4197075 0.0598292
100216534 Snord96a 1.2373480 0.9999999
66151 Prr13 1.4296409 0.9999999
229445 Ctso 1.0506197 0.9999999

Create a simple volcano plot

A basic version of a volcano plot depicts:

Note: The y-axis depicts -log10(adj_p_val), which allows the points on the plot to project upwards as the fold change greatly increases or decreases. This is more intuitive to visualise, the data points at the edges of the ‘volcano spray’ are the most interesting ones.

The versatility of ggplot2 also means that you don’t need to store data transformations as separate variables for plotting. You can apply transformations directly inside ggplot(data, aes(x, y)) or alternatively by using scale_x_continuous(trans = "...") or coord_trans(x, y).

# Create a simple volcano plot -------------------------------------------------
vol_plot <- diseased_vs_healthy %>%
  ggplot(aes(x = log2(fold_change),
             y = -log10(adj_p_val))) + 
  geom_point() 

vol_plot # Visualise simple volcano plot  

Note: For single layer plots, use %>% pipes with ggplot2 functions for convenience and readability.

Add horizontal and vertical plot lines

The functions geom_hline() and geom_vline() can be used to add extra horizontal and vertical lines on your plot respectively. In this example, I am interested in constructing boundaries for genes which have adj_p_val <= 0.05 and fold_change <= 0.5 or fold_change >= 2.

# Plot extra quadrants ---------------------------------------------------------
vol_plot + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed")   

Modify the x-axis and y-axis

Volcano plots should have a symmetrical x-axis. One way you can do this is by manually setting the limits of the x-axis using xlim(min, max).

# Identify xlim() values -------------------------------------------------------
diseased_vs_healthy %>%
  pull(fold_change) %>%
  min() %>%
  log2() %>%
  floor() 
#> [1] -10   

diseased_vs_healthy %>%
  pull(fold_change) %>%
  max() %>%
  log2() %>%
  ceiling()
#> [1] 8  

max(abs(-10), abs(8))
#> [1] 10  

# Change xlim() ----------------------------------------------------------------
# Manually specify x-axis limits   
vol_plot + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") + 
  xlim(-10, 10) 

You can also change the limits of the x-axis inside scale_x_continuous. This method also gives you the flexibility to fine-tune the spacing and labelling of axis tick marks.

# Modify scale_x_continuous() --------------------------------------------------
vol_plot + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  scale_x_continuous(breaks = c(seq(-10, 10, 2)), # Modify x-axis tick intervals    
                     limits = c(-10, 10)) 

Note: The value specified inside the argument scale_continuous_x(limits = ...) supersedes the range of values specified inside the argument scale_continuous_x(breaks = ...).

Add colour, size and transparency

To visualise different groups of genes using different colours, point sizes, shapes or transparencies, you need to categorise genes into different groups and store these categories as a new parameter i.e. new column of data.

I am interested in labelling genes into the following groups:

# Create new categorical column ------------------------------------------------ 
diseased_vs_healthy <- diseased_vs_healthy %>%
  mutate(gene_type = case_when(fold_change >= 2 & adj_p_val <= 0.05 ~ "up",
                               fold_change <= 0.5 & adj_p_val <= 0.05 ~ "down",
                               TRUE ~ "ns"))   

# Obtain gene_type counts ------------------------------------------------------           
diseased_vs_healthy %>%
  count(gene_type)
gene_type n
down 1245
ns 13578
up 981

In ggplot2, you also have the option to visualise different groups by point colour, size, shape and transparency by modifying parameter like scale_color_manual() etc. A tidy way of doing this is to separately store your manual specifications as vectors.

# Check gene_type categories ---------------------------------------------------
diseased_vs_healthy %>%
  distinct(gene_type) %>%
  pull()  
#> [1] "down" "up"   "ns"    
# Add colour, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey") 
sizes <- c("up" = 2, "down" = 2, "ns" = 1) 
alphas <- c("up" = 1, "down" = 1, "ns" = 0.5)

diseased_vs_healthy %>%
  ggplot(aes(x = log2(fold_change),
             y = -log10(adj_p_val),
             fill = gene_type,    
             size = gene_type,
             alpha = gene_type)) + 
  geom_point(shape = 21, # Specify shape and colour as fixed local parameters    
             colour = "black") + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  scale_fill_manual(values = cols) + # Modify point colour
  scale_size_manual(values = sizes) + # Modify point size
  scale_alpha_manual(values = alphas) + # Modify point transparency
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),       
                     limits = c(-10, 10))  

Layer subplots

You can also overlay subplots on top of your main plot. This is useful when you want to highlight a subset of your data using different colours, shapes and etc. When overlaying plots, you should not use %>% pipes but use global ggplot(data = “…”) and local geom_point(data = ...) arguments instead.

# Add subplot layer to the main volcano plot -----------------------------------
ils <- str_subset(diseased_vs_healthy$symbol, "^[I|i]l[0-9]+$")  

il_genes <- diseased_vs_healthy %>%
  filter(symbol %in% ils) 

ggplot(data = diseased_vs_healthy, # Original data  
       aes(x = log2(fold_change), y = -log10(adj_p_val))) + 
  geom_point(colour = "grey", alpha = 0.5) +
  geom_point(data = il_genes, # New layer containing data subset il_genes       
             size = 2,
             shape = 21,
             fill = "firebrick",
             colour = "black")     

Note: Unless local aesthetics are specified, secondary geom_point() functions will inherit global ggplot aesthetics.

Label points of interest

You can also label a subset of data using geom_text(), geom_label(), geom_text_repel() or geom_label_repel and by specifying which column to display as text using the local argument geom_text(aes(label = ...)).

Note: adjusting the parameters for optimal text separation using geom_text_repel can be a bit fiddly. I generally start by modifying force and then deciding which region of the plot I want to nudge my text or labels towards. You can read this vignette for more tips on adjusting geom_text_repel parameters.

# Layer more subplots ----------------------------------------------------------
sig_il_genes <- diseased_vs_healthy %>%
  filter(symbol %in% c("Il15", "Il34", "Il24"))

up_il_genes <- diseased_vs_healthy %>%
  filter(symbol == "Il24")

down_il_genes <- diseased_vs_healthy %>%
  filter(symbol %in% c("Il15", "Il34"))

ggplot(data = diseased_vs_healthy,
       aes(x = log2(fold_change),
           y = -log10(adj_p_val))) + 
  geom_point(aes(colour = gene_type), 
             alpha = 0.2, 
             shape = 16,
             size = 1) + 
  geom_point(data = up_il_genes,
             shape = 21,
             size = 2, 
             fill = "firebrick", 
             colour = "black") + 
  geom_point(data = down_il_genes,
             shape = 21,
             size = 2, 
             fill = "steelblue", 
             colour = "black") + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  geom_label_repel(data = sig_il_genes, # Add labels last to appear as the top layer  
                   aes(label = symbol),
                   force = 2,
                   nudge_y = 1) +
  scale_colour_manual(values = cols) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),     
                     limits = c(-10, 10))  

Modify legend label positions

If you need to change the order of categorical figure legend values, you will need to factor() and re-level your categorical variable. This can be done using the forcats package, which allows you to easily modify factor levels.

# Modify legend labels by re-ordering gene_type levels -------------------------
diseased_vs_healthy <- diseased_vs_healthy %>%
  mutate(gene_type = fct_relevel(gene_type, "up", "down")) 

# Recreate volcano plot --------------------------------------------------------
ggplot(data = diseased_vs_healthy,
       aes(x = log2(fold_change),
           y = -log10(adj_p_val))) + 
  geom_point(aes(colour = gene_type), 
             alpha = 0.2, 
             shape = 16,
             size = 1) + 
  geom_point(data = up_il_genes,
             shape = 21,
             size = 2, 
             fill = "firebrick", 
             colour = "black") + 
  geom_point(data = down_il_genes,
             shape = 21,
             size = 2, 
             fill = "steelblue", 
             colour = "black") + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  geom_label_repel(data = sig_il_genes,     
                   aes(label = symbol),
                   force = 2,
                   nudge_y = 1) +
  scale_colour_manual(values = cols) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),     
                     limits = c(-10, 10))   

Modify plot labels and theme

The last finishing touches include modifying plot labels and the plot theme.

The function labs() is a handy way of organising all plot labels inside a single function. You can assign labels as NULL to prevent them from being displayed.

A plot can be further improved by changing its theme() and/or by modifying individual theme() parameters.

# Add plot labels and modify plot theme ----------------------------------------
final_plot <- ggplot(data = diseased_vs_healthy,
       aes(x = log2(fold_change),
           y = -log10(adj_p_val))) + 
  geom_point(aes(colour = gene_type), 
             alpha = 0.2, 
             shape = 16,
             size = 1) + 
  geom_point(data = up_il_genes,
             shape = 21,
             size = 2, 
             fill = "firebrick", 
             colour = "black") + 
  geom_point(data = down_il_genes,
             shape = 21,
             size = 2, 
             fill = "steelblue", 
             colour = "black") + 
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  geom_label_repel(data = sig_il_genes,   
                   aes(label = symbol),
                   force = 2,
                   nudge_y = 1) +
  scale_colour_manual(values = cols) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),     
                     limits = c(-10, 10)) +
  labs(title = "Gene expression changes in diseased versus healthy samples",
       x = "log2(fold change)",
       y = "-log10(adjusted P-value)",
       colour = "Expression \nchange") +
  theme_bw() + # Select theme with a white background  
  theme(panel.border = element_rect(colour = "black", fill = NA, size= 0.5),    
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) 

final_plot 

Note: You can specify panel.grid... = element_line(linetype = "dotted") inside theme() to create dotted gridlines along the x and/or y axis. Major gridline positions are inherited from the values of axis breaks.

Annotate text

You can add more descriptions to a plot by using the function annotate() to display text.

# Annotate text inside plot ----------------------------------------------------
final_plot + 
  annotate("text", x = 7, y = 10,
           label = "3 interleukins of interest", color = "firebrick")

Other resources


  1. The original coding logic should still be attributed to Chuanxin Liu, my former PhD student. I also recommend the excellent RStudio Cloud ggplot2 tutorials, which have taught me a few new tricks.↩︎