How to label points on a scatterplot with R (for lattice)

The famous ggplot2 package for R has numerous packages extending its basic plot functions, including ggrepel, which draws nicely placed text labels for each point of a scatterplot. But I have been using lattice for many years now and have come to like its look and customization options. One thing I missed all along, however, is a simple one-line function to add small text labels to points in a scatter plot (a.k.a. dot plot, a.k.a. xyplot() in lattice).

It’s not that there are no functions for text labels out there, it’s just that lattice plots are not compatible with the ggplot universe, and other more hacky solutions are not really appealing. Finally I got so frustrated that I wrote my own panel function for text labels of points. Here is what I’ve gone through.


The problem

In computational biology (just as in any other data science field) we very often want to compare two biological conditions to each other, such as a control condition versus a treatment (e.g. bacteria grown on substrate A versus substrate B). For each condition, however, we have several thousand measurements obtained in parallel, thanks to the ’Omics revolution. Now the question is: which single protein or transcript is the most interesting (or differentially expressed) between the two conditions? One of the easiest ways to look at the data is to plot condition A versus B.

# attach packages
library(lattice)
library(latticeExtra)
library(directlabels)

# simulate data with a couple of outliers
df <- data.frame(
  gene = paste0(
    sample(letters, 100, replace = TRUE),
    sample(letters, 100, replace = TRUE),
    sample(letters, 100, replace = TRUE)),
  cond_A = c(rnorm(90), rnorm(10, 6)),
  cond_B = rnorm(100),
  pathway = rep(c("transcript", "translat", 
    "carbon", "nitrogen", "unknown"), 20)
)

head(df)
##   gene      cond_A      cond_B    pathway
## 1  ynl  0.81610752  0.50860298 transcript
## 2  vcb  0.77903975  0.07852623   translat
## 3  bik -0.70903387 -0.73387667     carbon
## 4  bym -0.04559753 -0.31385190   nitrogen
## 5  bft  1.04192395  1.35110333    unknown
## 6  qit  0.35243408  1.92016239 transcript

# change default plot symbol
theme <- trellis.par.get()
theme$superpose.symbol$pch = 19

# plot using lattice
xyplot(cond_A ~ cond_B, df,
  groups = pathway, 
  par.settings = theme,
  auto.key = list(columns = 3))

The hacky approach

Now we want to know quickly what these points are that appear to be up-regulated under condition A. The hacky way is to use a panel function that we construct on the fly.

xyplot(cond_A ~ cond_B, df,
  groups = pathway, 
  par.settings = theme,
  auto.key = list(columns = 3),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col.line = grey(0.9))
    panel.xyplot(x, y, ...)
    # function that puts text labels above/below points
    panel.text(x, y, labels = df$gene, 
      col = grey(0.5), cex = 0.7, pos = 3, offset = 1)
  }
)

The problems are obvious. It’s inconvenient to think about proper placement of labels for every new plot (below or above points, and how far away?). Labels overlap each other because of the rigid way we place them. To plot only a few labels, we would have to tediously select points by hand using additional variables. Grouping is ignored for the labels and is not straightforward to implement, so we have to settle for painting them all grey. I often found myself moving labels around in Inkscape and deleting the unwanted ones, which is really not efficient if you do it twice a week.
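To make the manual route concrete, here is a minimal sketch of what hand-rolled label selection and group coloring might look like. The helper label_above and the data frame toy are hypothetical names introduced only for this illustration, and the threshold of 4 is an arbitrary choice for the simulated data.

```r
library(lattice)

# hypothetical helper: keep a label only where y exceeds a threshold
label_above <- function(labels, y, threshold) {
  ifelse(y > threshold, as.character(labels), "")
}

# small simulated data set, analogous to the one used above
set.seed(123)
toy <- data.frame(
  gene = paste0("g", 1:100),
  cond_A = c(rnorm(90), rnorm(10, 6)),
  cond_B = rnorm(100),
  pathway = rep(c("transcript", "translat",
    "carbon", "nitrogen", "unknown"), 20)
)

manual_plot <- xyplot(cond_A ~ cond_B, toy,
  groups = pathway,
  panel = function(x, y, groups, subscripts, ...) {
    panel.xyplot(x, y, groups = groups, subscripts = subscripts, ...)
    # look up the symbol color that lattice assigns to each point's group
    grp <- as.integer(as.factor(groups))[subscripts]
    cols <- trellis.par.get("superpose.symbol")$col
    cols <- cols[(grp - 1) %% length(cols) + 1]
    # draw labels only for points above the threshold
    panel.text(x, y,
      labels = label_above(toy$gene[subscripts], y, threshold = 4),
      col = cols, cex = 0.7, pos = 3)
  }
)
print(manual_plot)
```

This works, but every new plot needs its own threshold and color bookkeeping, and nothing here prevents the remaining labels from overlapping.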

The directlabels package

There is at least one sophisticated package for drawing text labels in lattice and ggplot2, the directlabels package. The idea behind this package is to provide functions for labeling points, lines or other objects in a variety of plots, not only scatterplots. It is a well-made and comprehensive package with many options for customization, but it’s not the right tool for our problem: it builds heavily on the idea of grouping variables and will only place one label per group, not per point.

dotplot <- xyplot(cond_A ~ cond_B, df,
  groups = pathway,
  par.settings = theme,
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col.line = grey(0.9))
    panel.xyplot(x, y, ...)
  }
)

direct.label(dotplot)

That’s not really what I want, although it is quite useful for other purposes. In fact we can also set groups = gene and will get individual gene labels, but they are distributed over the entire plot area, and we lose our grouping by pathway.
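The groups = gene variant mentioned above could look roughly like this. It is only a sketch on a small simulated data set (toy is a hypothetical name, and 30 genes are used to keep the plot readable); it relies on the default placement method that direct.label() picks for an xyplot.

```r
library(lattice)
library(directlabels)

# small simulated data set with unique gene names
set.seed(42)
toy <- data.frame(
  gene = paste0("g", 1:30),
  cond_A = c(rnorm(27), rnorm(3, 6)),
  cond_B = rnorm(30)
)

# every gene becomes its own group, so every point gets a label,
# but the grouping (and coloring) by pathway is gone
by_gene <- xyplot(cond_A ~ cond_B, toy, groups = gene)
labelled <- direct.label(by_gene)
print(labelled)
```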

The dedicated panel function

In the end I made my own panel function, panel.directlabel, which you can find in my lattice-tools package on GitHub (you can also download just the function itself if you prefer).

# install the package from GitHub (requires the devtools package)
devtools::install_github("m-jahn/lattice-tools")

The function has all the options I want for customizing text labels, such as lines connecting labels to their points, boxes around labels, flexible sizing, subsetting based on x and y thresholds, and consistency with grouping. The placement of labels is determined using the method smart.grid from directlabels. And here is the final plot using some of the custom options.

library(latticetools)

xyplot(cond_A ~ cond_B, df,
  groups = pathway,
  par.settings = theme,
  labels = df$gene, cex = 0.75,
  auto.key = list(columns = 3),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col.line = grey(0.9))
    panel.xyplot(x, y, ...)
    panel.directlabel(x, y, y_boundary = c(4, 10), 
      draw_box = TRUE, box_line = TRUE, ...)
  }
)

Now we can quickly see what our points of interest are. Plus, colors from grouping are preserved for lines, boxes, and text of the label. Labels don’t overlap because the underlying method for placement is ‘smart’ enough to push them to free areas. I just selected a subset of interesting points to be labelled using the y_boundary option. And of course, the appearance of boxes, lines, and so on can be easily customized.