0.1 Description

This R notebook is a bioinformatics pipeline to process and analyze MS based peptide/protein abundance data for the chemolithoautotroph Ralstonia eutropha (a.k.a. Cupriavidus necator).

Proteomics data was obtained using the following work flow (to be added…).

0.2 Libraries

# loading libraries
library(lattice)
library(latticeExtra)
library(latticetools)
library(tidyverse)
library(stringi)

0.3 Data import

Define the data source directories. Some of them are external in the sense of not included in the accompanying data folder of this R notebook. The main proteomics data is loaded from the R ShinyProt directory that can also be found on github and interactively browsed and searched in the ShinyProt web app.

Reutropha_proteomics <- "../data/input/Ralstonia_eutropha.Rdata"
load(Reutropha_proteomics)

0.4 Overview on detected peptides and proteins

0.4.1 Total number of quantified proteins and missing proteome

In total, the following number of proteins was quantified, out of theoretical total of 6,614 proteins (Uniprot reference genome, July 31 2020). That represents roughly 81 % by number and much more by mass (see below for estimation)

n_quantified_prot <- Ralstonia_eutropha %>% pull(uniprot) %>% unique %>% length
print(n_quantified_prot)
[1] 5357
print(n_quantified_prot/6614*100)
[1] 80.99486

We can also estimate the coverage in terms of protein mass, by assuming that the 1257 missing proteins are of average mass or lower than average mass than the detected proteins. The following simple calculation simulates missing protein mass with an average abundance of the lower quantile of detected proteins. For this purpose we simply pick a standard condition, such as growth on fructose.

quantified_protein <- Ralstonia_eutropha %>%
  
  # pick a certain condition
  filter(substrate == "fructose", growthrate == 0.25) %>%
  pull(mean_intensity)

# determine quantiles of raw quantification intensity
quantified_protein %>% quantile(na.rm = TRUE)
          0%          25%          50%          75%         100% 
7.244750e+04 8.117648e+07 3.395687e+08 1.480863e+09 4.888941e+11 

Now we just simulate that the 1257 non-detected proteins have an average mass similar to that of the protein with 25% lowest abundance. The missing protein abundance will then sum up to less 1% of the total estimated proteome, meaning we can detect more than 99% of the proteome by mass.

missing_protein <- 1257 * quantile(na.rm = TRUE, quantified_protein)[2] %>% as.numeric()
missing_protein_percent <- missing_protein/(missing_protein + sum(quantified_protein, na.rm = TRUE))
paste("missing protein in % total mass:", round(missing_protein_percent*100, 3))
[1] "missing protein in % total mass: 0.88"

0.4.2 Number of quantified peptides per protein

plot_quant_pep <- xyplot(sort(n_peptides, decreasing = TRUE) ~ 
      1:length(protein),
    filter(Ralstonia_eutropha, substrate == "fructose", growthrate == 0.25),
    xlab = "protein", ylab = "n peptides",
    par.settings = custom.lattice, 
    ylim = c(0, 80), xlim = c(0, 5500),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barplot(x, y, col = NA, fill = grey(0.8), fill_alpha = 1, ewidth = 0.6)
      xhalf = length(unique(x))/2
      panel.lines(x = c(0, xhalf, xhalf), y = c(y[xhalf], y[xhalf], 0), col = 1)
      panel.text(x = 10, y = y[xhalf]*3, pos= 4, cex = 0.8, 
        labels = paste0(round(xhalf), " proteins with >= ", y[xhalf], " peptides"))
    }
  )

print(plot_quant_pep)

0.4.3 Number of quantified peptides per protein, inverted

plot_quant_pep_2 <-  Ralstonia_eutropha %>%
  
  # rearrange n_peptides to have another type of overview
  filter(substrate == "fructose", growthrate == 0.25) %>%
  pull(n_peptides) %>% table %>% as_tibble %>%
      rename(., pep = `.`, prot = n) %>% mutate(pep = as.numeric(pep)) %>%
  
  # plot
  xyplot(prot ~ pep, .,
    xlab = "n peptides", ylab = "n proteins", ylim = c(-5, 1155),
    par.settings = custom.lattice, xlim = c(0, 80),
    panel = function(x, y,...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barchart(x, y, horizontal = FALSE, box.width = 1,
        border = NA, col = grey(0.8), ...)
    }
  )

print(plot_quant_pep_2)

0.4.4 Number of protein quantifications per replicate

plot_quant_prot1 <- Ralstonia_eutropha %>%
  
  # protein quantifications per replicate
  gather(replicate, raw_intensity, R1:R4) %>%
  group_by(substrate, growthrate, replicate) %>%
  summarize(quant_proteins = sum(!is.na(raw_intensity))) %>%
  
  # and plot
  xyplot(quant_proteins ~ 1:length(quant_proteins), .,
    xlab = "sample", ylab = "quantified proteins",
    par.settings = custom.lattice, 
    ylim = c(0, 5500), xlim = c(0, 81),
      panel = function(x, y,...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barplot(x, y, col = NA, fill = grey(0.8),
        fill_alpha = 1, ewidth = 0.5)
    }
  )

print(plot_quant_prot1)

0.4.5 Number of proteins quantified in every run

plot_quant_prot2 <- Ralstonia_eutropha %>%
  
  # protein quantifications per replicate
  gather(replicate, raw_intensity, R1:R4) %>%
  group_by(protein) %>%
  summarize(quant_in_runs = sum(!is.na(raw_intensity))) %>%
  pull(quant_in_runs) %>% table %>% enframe %>%
  
  # and plot
  xyplot(value ~ factor(name), .,
    xlab = "in number of runs", 
    ylab = "quantified proteins",
    par.settings = custom.lattice, 
    ylim = c(0, 3000), xlim = c(0, 79),
      panel = function(x, y,...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barplot(x, y, col = NA, fill = grey(0.8),
        fill_alpha = 1, ewidth = 0.5)
      panel.abline(v = 70, col = 1)
      panel.text(x = 25, y = 1500, pos= 4, cex = 0.8, 
        labels = paste0(sum(y[70:79]), " proteins quantified\nin > 70 out of 80 runs"))
    }
  )

print(plot_quant_prot2)

0.5 Sample overview and quality control

0.5.1 Raw intensity per sample and replicate

Raw intensity here is the dimensionless MS ‘intensity’, that means the quantified area under the curve of MS1 spectra for peptides, summed up per protein. One replicate that was missing for condition Fructose, growth rate 0.1, R2, was temporarily replaced by R1 for this plot, because densityplot was otherwise giving an error message (because of missing values).

densityplot(~ log10(R1) + log10(R2) + log10(R3) + log10(R4) | condition, 
  Ralstonia_eutropha %>% mutate(R2 = case_when(
    condition == "FRC 0.1" ~ R1, TRUE ~ R2)), 
  auto.key = list(columns = 4), layout = c(5, 4), 
  par.settings = custom.colorblind(), xlab = "log10 intensity",
  scales = list(alternating = FALSE), as.table = TRUE,
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.superpose(x, ...)
  },
  panel.groups = function(x, ...) {
    panel.densityplot(x, plot.points = FALSE, ...)
    panel.abline(v = median(x, na.rm = TRUE), lty = 2, col = grey(0.5))
  }
)