1 Description

This R notebook is a bioinformatics pipeline to collect constraints for a genome-scale resource allocation model of the chemolithoautotroph Ralstonia eutropha (a.k.a. Cupriavidus necator).

A resource allocation model can be coarse-grained (few symbolic reactions) or have genome-scale detail (all known biochemical reactions and their associated genes). However, both types of models need to be constrained by a set of parameters to make realistic predictions. Depending on the model framework, constraints can be equality constraints (example: the turnover number of an enzyme E, kcatE = 100 s-1) or inequality constraints (0 s-1 <= kcatE <= 100 s-1).

The purpose of this notebook is to collect constant and growth-rate-dependent constraints as they are used in RBA models. In RBApy, custom constraints can be applied to apparent enzyme efficiencies (kapp), protein abundance, molecular machine abundance (protein/macromolecule complexes), and fluxes.

Different types of data were used to constrain the resource allocation model. The primary data set is protein abundance determined by mass spectrometry for R. eutropha at different growth rates and on different carbon sources. These data are used to estimate and constrain kapp, enzyme abundance, and non-enzyme protein abundance.

2 Libraries

suppressPackageStartupMessages({
  library(lattice)
  library(latticeExtra)
  library(latticetools)
  library(tidyverse)
  library(stringi)
})

3 kapp estimation using proteomics/fluxomics

3.1 Background

One approach to estimate kapp values is to use enzyme abundance data together with estimated or measured fluxes. The kapp is a parameter that links enzyme flux to enzyme abundance, so that flux v = kapp * [E]. The kapp also includes the saturation s of the enzyme, so that kapp ~ kcat * s. The saturation lies between 0 and 1, so that kapp is lower than or equal to kcat.

The idea is that the kapp parameter is kept constant across conditions, while the flux (and with it the enzyme abundance) is allowed to change. We can then distinguish different scenarios for enzyme saturation states:

  • the flux through an enzyme, and therefore its abundance, increases in the model but not in the measured enzyme abundance: kapp is higher than estimated, the enzyme is not saturated
  • the flux through an enzyme, and therefore its abundance, increases both in the model and in the measured enzyme abundance: kapp is correctly estimated, the enzyme is saturated

3.2 Fluxomics and proteomics input

To estimate kapp from the available data, we only need to rearrange the simplified rate equation to kapp = v / [E] (a small numeric example follows the list below). For this purpose, we need two types of information:

  1. enzyme abundance in mmol gDCW-1 for known substrate uptake rates (chemostat cultivations, maximum growth rate 0.25 h-1). The maximum tested growth rate was chosen to obtain realistic enzyme saturation. Protein abundance data was obtained by MS measurements. The relative protein mass fraction was determined by dividing the MS intensity per protein by the sum of all intensities. The mass fraction (g/g total protein) was then converted to protein concentration in mmol/gDCW by multiplying it with the estimated total protein content of 0.65 g protein/gDCW (determined previously) and then dividing by the molar mass of each protein (g/mol).
  2. flux per enzyme, in mmol h-1 gDCW-1, for the substrate uptake rates corresponding to point 1. This information is obtained from flux sampling (FBA) simulations that were constrained to realistic flux distributions using data from Alagesan et al., 2017. Input and output fluxes (exchange reactions) were determined from chemostat cultivations.
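
As a small numeric sketch of the rearranged rate equation (all values below are made up and only serve to illustrate the units):

# toy numbers, not measured values: an enzyme carrying a flux of
# 1 mmol gDCW-1 h-1 at an abundance of 1e-5 mmol gDCW-1
v_toy <- 1        # flux in mmol gDCW-1 h-1
E_toy <- 1e-5     # enzyme concentration in mmol gDCW-1

# apparent catalytic rate kapp = v / [E], in h-1
kapp_toy <- v_toy / E_toy   # 1e5 h-1

# expressed in s-1 (the unit usually quoted for kcat), ~27.8 s-1
kapp_toy / 3600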

Determine enzyme abundance for standard conditions. Import R. eutropha mass fractions for the four tested conditions: fructose, succinate, formate, and ammonium (N) limitation (with fructose as C-source). The growth rate in all conditions was fixed to µ = 0.25 h-1, the maximum growth rate used in the chemostat experiments.

# import proteomics data
load("../data/input/Ralstonia_eutropha.Rdata")

# simplify condition strings
Ralstonia_eutropha <- Ralstonia_eutropha %>% ungroup %>%
  mutate(condition = sapply(condition, function(x){
    unlist(stri_extract_all_words(x))[1]}
  ))

# make new data frame to merge fluxomics and proteomics data
df_flux_protein <- Ralstonia_eutropha %>%
  
  # select only required columns and filter for highest mu
  mutate(condition = recode(condition, `FA` = "formate", `NLIM` = "ammonium", `FRC` = "fructose", `SUC` = "succinate")) %>%
  select(condition, locus_tag, growthrate, mass_g_per_gDCW, MolWeight) %>%
  filter(growthrate == 0.25) %>% 
  
  # calculate protein concentration in mmol/gDCW.
  # MW in kDa must be converted to g/mol, and concentration to mmol
  mutate(
    conc_mmol_gDCW = mass_g_per_gDCW * 1000 / (MolWeight * 1000),
    conc_mmol_gDCW = replace_na(conc_mmol_gDCW, 0)
  )

head(df_flux_protein)
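
As a quick sanity check (not part of the original pipeline), the summed protein mass per condition should come close to the assumed total protein content of 0.65 g/gDCW, with the remaining gap corresponding to proteins that were not quantified:

# sum up protein mass and molar concentration per condition; the mass total
# should approach the assumed 0.65 g protein/gDCW if most of the proteome
# was quantified
df_flux_protein %>%
  group_by(condition) %>%
  summarize(
    total_mass_g_gDCW = sum(mass_g_per_gDCW, na.rm = TRUE),
    total_conc_mmol_gDCW = sum(conc_mmol_gDCW, na.rm = TRUE)
  )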

# export to csv
for (cond in unique(df_flux_protein$condition)) {
  filter(df_flux_protein, condition == cond) %>% 
    select(-condition) %>%
    write_csv(paste0("../data/simulation/kapp_fitting/protein_concentration_", cond, ".csv"))
}

Determine reaction fluxes for the same conditions. Several approaches were tested to obtain minimal and maximal fluxes per enzyme. The best approach turned out to be flux sampling with the additional constraint of reaching at least 95% of the maximum growth rate found by FBA. Flux sampling with 100 iterations was performed using COBRApy. There is a threshold of standard deviation ~1 above which variation becomes very high for some reactions; these are artificial cycles. The exception is N-limitation, which can show higher flux variability due to the surplus carbon supply (fructose is not limiting).

# import flux sampling results for four conditions, max tested growth rate
df_sampling <- lapply(list.files("../data/simulation/kapp_fitting/", pattern = "^FSA", full.names = TRUE), 
  read_csv, col_types = cols()) %>% bind_rows(.id = "condition") %>%
  mutate(condition = recode(condition, `1` = "formate", `2` = "succinate", 
    `3` = "ammonium", `4` = "fructose")) %>%
  select(-2) %>% 
  gather(key = "reaction_id", value = "flux", -condition) %>%
  
  group_by(condition, reaction_id) %>% summarize(
    flux_mmol_gDCW_h = median(flux), 
    sd_flux = sd(flux), 
    min_flux = min(flux), 
    max_flux = max(flux),
    CV = abs(sd_flux/flux_mmol_gDCW_h)) %>%
  filter(!grepl("EX_", reaction_id)) %>%
  arrange(desc(sd_flux)) %>%
  
  # join with single optimal solution from loopless FBA.
  # Will help us to identify free running cycles in FSA
  left_join(
    lapply(list.files("../data/simulation/kapp_fitting/", pattern = "^FBA", full.names = TRUE),
    read_csv, col_types = cols()) %>% bind_rows(.id = "condition") %>%
    mutate(condition = recode(condition, `1` = "formate", `2` = "succinate", 
      `3` = "ammonium", `4` = "fructose")) %>%
    rename(reaction_id = X1, flux_mmol_gDCW_h_FBA = loopless)
  )
  
# plot results; 
plot_sampling <- lapply(list(A = c(-50, 550), B = c(-3, 28)), function(ylim) {
  xyplot(sd_flux ~ seq_along(sd_flux) | condition,
    filter(df_sampling, sd_flux != 0), 
    between = list(x = 0.5, y = 0.5),
    groups = condition, xlab = NULL, ylim = ylim,
    scales = list(alternating = FALSE),
    par.settings = custom.colorblind(), layout = c(4, 1),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.abline(h = 1, lty = 2, col = grey(0.5))
      panel.xyplot(x, y, ...)
      panel.key(..., points = FALSE, corner = c(0.95,0.95))
    }
  )
})

print(plot_sampling[[1]], split = c(1,1,1,2), more = TRUE)
print(plot_sampling[[2]], split = c(1,2,1,2))

# process flux distribution, e.g. by removing extremely high fluxes
df_sampling2 <- df_sampling %>%
  
  # Filter out summary and outdated reactions
  filter(!reaction_id %in% c("Biomass", "Maintenance", "Protein", "Carbohydrate",
    "Phospholipid", "Cofactors_and_vitamins", "CBBCYC", "PYK1", "PYK2", "PYK3",
    "DHFR2", "DHFR3", "DHFR2p", "DHFR3p")) %>%
  
  # replace extreme fluxes with a min and max estimated from loopless FBA
  mutate(
    min_flux = if_else(sd_flux > 1, -abs(flux_mmol_gDCW_h_FBA), min_flux),
    max_flux = if_else(sd_flux > 1,  abs(flux_mmol_gDCW_h_FBA), max_flux)
  ) %>%
  
  # add an error margin to the sampled min and max fluxes, to help the solver
  # find a feasible solution
  mutate(
    min_flux = min_flux-(abs(min_flux)*1), 
    max_flux = max_flux+(abs(max_flux)*1)
  ) %>%
  
  # re-formatting of table
  arrange(desc(abs(max_flux))) %>%
  mutate(reaction_id = paste0("R_", reaction_id))

head(df_sampling2)
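
Before exporting, a small diagnostic (not part of the original pipeline) can report how many reactions per condition had their sampled flux range replaced by the loopless FBA bound, i.e. fell under the sd_flux > 1 rule above:

# count reactions whose min/max flux was clamped to the loopless FBA solution
df_sampling2 %>%
  group_by(condition) %>%
  summarize(
    n_reactions = n(),
    n_clamped = sum(sd_flux > 1, na.rm = TRUE)
  )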

# export to csv
for (cond in unique(df_sampling2$condition)) {
  filter(df_sampling2, condition == cond) %>% ungroup %>%
    select(-condition) %>%
    write_csv(paste0("../data/simulation/kapp_fitting/model_flux_sampling_", cond, ".csv"))
}

3.3 Determine kapp values

The final step is to merge both datasets by computing the enzyme abundance allocated to each reaction. The estimated kapp is then determined by dividing the flux by the enzyme abundance available for that reaction. This step was not performed in this R notebook but using the built-in functions from the RBApy estim folder. Briefly, flux boundaries and protein concentration data were exported and serve as input for the script kapp.py (link). This script performs a series of FBA and RBA simulations constrained by the input fluxes and protein concentrations. It then gives an estimation of kapp values for a particular condition.
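
To illustrate the underlying calculation without RBApy, the following sketch reads back the exported tables for one condition and divides a sampled flux by the concentration of one gene product. Both the reaction ID and the locus tag are hypothetical placeholders, and the sketch ignores enzyme complex stoichiometry and isoenzymes, so it is only a rough approximation of what the estim scripts do.

# read back the exported tables for one condition (fructose)
prot_frc <- read_csv("../data/simulation/kapp_fitting/protein_concentration_fructose.csv",
  col_types = cols())
flux_frc <- read_csv("../data/simulation/kapp_fitting/model_flux_sampling_fructose.csv",
  col_types = cols())

# hypothetical reaction/enzyme pair, for illustration only
example_reaction <- "R_PGK"     # placeholder reaction ID
example_gene <- "H16_A0000"     # placeholder locus tag

v_example <- flux_frc %>% filter(reaction_id == example_reaction) %>% pull(flux_mmol_gDCW_h)
E_example <- prot_frc %>% filter(locus_tag == example_gene) %>% pull(conc_mmol_gDCW)

# apparent catalytic rate in h-1 (divide by 3600 for s-1)
v_example / E_example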

4 Fraction of protein per compartment

The RBA model takes the fraction of protein per compartment as another input parameter (or constraint). This constraint is important because it limits the amount of protein the cell can allocate to, for example, the cytoplasm or membrane compartments. It can be constant or growth-rate dependent, e.g. via a linear relationship.

The first step is to prepare a table with protein abundance and localization. Protein abundance can be in any unit according to the RBApy manual, but it is best to use mol fraction instead of mass fraction, as all other RBApy functions also use mmol. The mol fraction is already available in the processed data set. The built-in estim functions do not seem to be well supported in python3 and raise errors. We therefore do the estimation manually by fitting linear models to the relationship between growth rate and protein abundance. The idea is similar to the RBA estim function but less complicated. We focus on the standard condition (fructose as carbon source, no NH4+ limitation) as the training data set.

df_protein <- Ralstonia_eutropha %>% 
  
  # select only required columns and spread to long format
  select(condition, locus_tag, growthrate, Psortb_localization, mol_fraction) %>%
  set_names(c("condition", "protein", "growthrate", "location", "mol_fraction")) %>%
  mutate(condition = recode(condition, `FA` = "formate", `NLIM` = "ammonium", `FRC` = "fructose", `SUC` = "succinate")) %>%
  
  # match localization names to model, simplify
  mutate(location = recode(location, Unknown = "Cytoplasm", Cytoplasmic = "Cytoplasm")) %>%
  mutate(location = replace(location, location != "Cytoplasm", "Cell_membrane"))

head(df_protein)

Now we can summarize the data by taking the sum of mol fractions over condition and localization. A simple approach to finding linear functions where the proteins of all locations sum to one for a given growth rate would be to fit linear models for all compartments except one (e.g. cytoplasmic proteins, the biggest compartment) and to fit that last compartment from the residual protein fraction. We would expect the error for Cytoplasm to be negligibly small as it is the biggest compartment. However, it turned out that the linear models fitted to the two compartments already sum to one almost perfectly (see the check after the plot below). An estimation from the residual protein fraction is therefore not necessary.

# First retrieve the proteins that are not part of the model, which is needed
# to calculate the non-enzymatic protein fraction later on
model_genes <- c(
  read_csv("../data/input/model_reactions.csv", col_types = cols()) %>% separate_rows(genes, sep = ", ") %>%
    filter(!duplicated(genes)) %>% pull(genes),
  read_tsv("../data/simulation/macro_machines/ribosome.tsv", col_types = cols())[["Gene names"]],
  read_tsv("../data/simulation/macro_machines/chaperones.tsv", col_types = cols())[["Gene names"]],
  read_tsv("../data/simulation/macro_machines/transcription.tsv", col_types = cols())[["Gene names"]],
  read_tsv("../data/simulation/macro_machines/replication.tsv", col_types = cols())[["Gene names"]]
)
# extract locus_tags only
model_genes <- model_genes %>% stri_extract_first_regex(
  pattern = "H16_[A-Z][0-9]{4}|PHG[0-9]{3}")

# add new NE_protein column
df_prot_per_comp <- df_protein %>%
  mutate(NE_protein = !protein %in% model_genes) %>%
  group_by(condition, location, growthrate)

The following plot shows a slight increase in cytoplasmic proteins and decrease in cell membrane proteins with growth rate.

plot_prot_comp <- xyplot(prot_per_compartment ~ growthrate, 
  df_prot_per_comp %>%
    summarize(prot_per_compartment = sum(mol_fraction, na.rm = TRUE)),
  groups = location, ylim = c(-0.1, 1.1),
  par.settings = custom.colorblind(), cex = 0.7,
  scales = list(alternating = FALSE),
  between = list(x = 0.5, y = 0.5),
  xlab = expression("µ [h"^-1*"]"),
  ylab = "protein mass fraction",
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.superpose(x, y, ...)
    panel.key(corner = c(0.1, 0.55), ...)
  }, panel.groups = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    panel.lmlineq(x, y, r.squared = TRUE,
      pos = 3, offset = 1, ...)
  }
)
print(plot_prot_comp)
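
As a rough manual counterpart to the regression lines shown in the plot, intercept and slope per compartment can be extracted with lm(). Restricting the fit to the fructose condition follows the training-set choice described above; this is a sketch under those assumptions, not the RBApy estim procedure.

# summarize mol fraction per compartment and growth rate, then fit a linear
# model (protein fraction vs growth rate) per compartment for the fructose
# condition
df_comp_fit <- df_prot_per_comp %>%
  summarize(prot_per_compartment = sum(mol_fraction, na.rm = TRUE), .groups = "drop") %>%
  filter(condition == "fructose") %>%
  group_by(location) %>%
  summarize(
    intercept = coef(lm(prot_per_compartment ~ growthrate))[[1]],
    slope = coef(lm(prot_per_compartment ~ growthrate))[[2]],
    .groups = "drop"
  )

df_comp_fit

# check that the compartment-specific linear functions sum to ~1 for any µ:
# the intercepts should add up to ~1 and the slopes to ~0
colSums(df_comp_fit[c("intercept", "slope")])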