1 Description

This R notebook is a bioinformatics pipeline to analyze fitness data obtained from a barcoded transposon library in Ralstonia eutropha a.k.a. Cupriavidus necator. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018).

One part of the data used here (growth on fructose, succinate, formate) was already published in a previous research article with focus on C. necator’s central carbon metabolism. The preprint is available on BioRxiv. The second part (autotrophic growth on hydrogen, anoxic growth with nitrate respiration) is new data that complements previous results. The focus of this pipeline is to study the energy metabolism of C. necator in diverse conditions, each of it challenging a different aspect/machinery of metabolism. This investigation tries to answer the following key questions:

  1. Which genes are constituting the essential parts of the hydrogenases and formate dehydrogenases, enabling autotrophic lifestyle? Many things about the structure and function of these enzymes is already known, where do these results show differences?
  2. Which genes are mandatory as accessory proteins, for example cofactor synthesis, insertion, and enzyme maturation?
  3. Why does KO of hydrogenases provides a growth benefit in non-autotrophic conditions?
  4. What are the essential parts of the electron transport chain? What complexes are used specifically for some growth conditions?
  5. Is essentiality/fitness correlated with protein abundance? Is there a dosage effect, such that KO of a high abundant iso-enzyme has stronger fitness effect compared to a low-abundant?

2 Libraries

Optionally install required libraries, particularly custom packages from github.

devtools::install_github("m-jahn/lattice-tools")
suppressPackageStartupMessages({
  library(lattice)
  library(latticeExtra)
  library(latticetools)
  library(tidyverse)
  library(forcats)
  library(dendextend)
  library(colorspace)
  library(stringi)
  library(vegan)
  library(tsne)
  library(directlabels)
  library(limma)
})

3 Data import and processing

Import the main data tables with gene fitness data for all conditions. Tables were obtained by processing sequencing data with a custom BarSeq pipeline. The 32 generation sequencing samples are removed due to low diversity, leading to overall lower average read count per gene.

load("../data/barseq/20201222_fru_fitness.Rdata")
df_fitness_frc <- fitness_gene %>% ungroup %>%
  filter(condition != "long pulse", time != 32) %>%
  mutate(ID = as.numeric(ID), substrate = "fructose",
    condition = str_remove(condition, "short "))

load("../data/barseq/20210407_suc_for_fitness.Rdata")
df_fitness_suc <- fitness_gene %>% ungroup %>%
  separate(condition, sep = "_", into = c("substrate", "condition"))

load("../data/barseq/20210624_H2_NO3_fitness.Rdata")
df_fitness_no3 <- fitness_gene %>% ungroup %>%
  filter(condition %in% c("NO3")) %>%
  mutate(substrate = "nitrate", condition = "continuous")

load("../data/barseq/20210928_H2_fitness.Rdata")
df_fitness_h2 <- fitness_gene %>% ungroup %>%
  mutate(substrate = "hydrogen", condition = "pulse")

load("../data/barseq/20220420_lowCO2_NO3.Rdata")
df_fitness_lowco2 <- fitness_gene %>% ungroup %>%
  filter(condition == "NO3_batch") %>%
  mutate(substrate = "nitrate", condition = "pulse")

# merge fitness tables
df_fitness <- bind_rows(df_fitness_frc, df_fitness_suc, df_fitness_no3,
  df_fitness_h2, df_fitness_lowco2) %>%
  rename(locus_tag = locusId)
rm("df_fitness_frc", "df_fitness_suc", "df_fitness_h2", "df_fitness_no3",
   "df_fitness_lowco2", "fitness_gene")

# import genome annotation
df_ref <- read_csv("../data/ref/Ralstonia_H16_genome_annotation.csv",
  col_types = cols()) %>%
  filter(!duplicated(locus_tag)) %>%
  mutate(eggNOG_name = if_else(is.na(eggNOG_name), gene_name, eggNOG_name))

# define standard colors
stdcol <- custom.colorblind()$superpose.line$col

4 Gene fitness analysis

4.1 Depletion over time (generations)

We can plot log2 FC or normalized gene fitness over generations. For this type of overview it is best to summarize individual replicates (4x) to the mean or median, per time point and condition. We also add genome annotation to the summary table. The plots shows that depletion of some strains is so strong already at 8 generations that fitness/log2 FC could not be quantified for 16 generations due to missing read counts. It is best to focus on 8 generations as because it provides a more complete picture.

nest_cols = c("time", "condition", "substrate")
df_ref_long <- bind_rows(df_ref, distinct(df_fitness %>% select(!!!nest_cols))) %>%
  complete(locus_tag, nesting(!!!lapply(nest_cols, sym))) %>%
  filter(!is.na(time)) %>% select(locus_tag, !!!nest_cols) %>%
  left_join(df_ref, by = "locus_tag")

df_fitness_summary <- df_fitness %>%
  group_by(locus_tag, scaffold, time, condition, substrate, strains_per_gene) %>%
  summarize(
    .groups = "drop",
    norm_gene_fitness_median = median(norm_gene_fitness, na.rm = TRUE),
    log2FC_median = median(log2FC, na.rm = TRUE),
    tstat_median = median(t, na.rm = TRUE)
  ) %>%
  full_join(df_ref_long) %>%
  mutate(cond_short = if_else(condition == "pulse", "P", "C")) %>%
  mutate(cond_short = paste0(substrate, " - ", cond_short) %>%
    factor(levels = paste0(
      rep(c("fructose", "succinate", "formate", "hydrogen", "nitrate"), each = 2),
      rep(c(" - P", " - C"), 5)
    )[-8])
  )
Joining with `by = join_by(locus_tag, time, condition, substrate)`
plot_hist_log2FC <- df_fitness_summary %>%
  group_by(condition, substrate) %>%
  slice(1:1000) %>%
  
  xyplot(log2FC_median ~ time | substrate * condition, .,
    groups = locus_tag, as.table = TRUE,
    col = stdcol[1], alpha = 0.2, layout = c(5, 2),
    xlab = "", ylab = expression("log"[2]*" FC"),
    par.settings = custom.colorblind(), type = c("l"),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

plot_hist_normfg <- df_fitness_summary %>%
  group_by(condition, substrate) %>%
  slice(1:1000) %>%
  
  xyplot(norm_gene_fitness_median ~ time | substrate * condition, .,
    groups = locus_tag, as.table = TRUE,
    col = stdcol[2], alpha = 0.2, layout = c(5, 2),
    xlab = "generations", ylab = "fitness",
    par.settings = custom.colorblind(), type = c("l"),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_hist_log2FC, position = c(0,0.47,1,1), more = TRUE)
print(plot_hist_normfg, position = c(0,0,1,0.53))

null device 
          1 

4.2 Comparing gene fitness between conditions

Similarly to Figure 2 in Wetmore et al, mBIO, 2015, we can investigate condition-dependent gene fitness by comparing conditions and substrates one-on-one. The correlation between substrates and growth regimes (continuous, pulsed) is quite different. The strongest correlation is present in the pulsed vs continuous regime for each substrate (R = 0.64 to 0.87), and within the pulsed conditions the comparison between substrates (R = 0.62 to 0.72).

df_fitness_comp <- df_fitness_summary %>% filter(time == 8) %>%
  group_by(locus_tag) %>% mutate(tstat_median = min(tstat_median)) %>%
  select(locus_tag, cond_short, norm_gene_fitness_median, gene_name, COG_Process, tstat_median) %>%
  pivot_wider(names_from = cond_short, values_from = norm_gene_fitness_median) %>%
  filter(!is.na(locus_tag)) %>% ungroup

custom_splom(df_fitness_comp %>%
  select(matches(" - [CP]")), pch = 19, cex = 0.3, col = grey(0.4, 0.4))

null device 
          1 

5 Unsupervised clustering of genes by fitness

Most genes correlate in fitness value between conditions. That means, only some genes have a condition-specific fitness effect, i.e. increase or decrease fitness of the respective strain in one substrate or growth regime specifically. This section investigates specific genes and their functions that show such differential fitness between conditions. A simple comparison of one condition versus another will not be helpful as there two many possible combinations. To identify interesting sets of genes, cluster analysis can be performed. The first step is to generate a matrix from reformatted fitness data, and cluster genes by similarity using a generic clustering algorithm such as hclust(method = "ward.D").

# generate color palette for heatmap
heat_cols <- diverging_hcl(n = 7, h = c(255, 12), c = c(50, 80), l = c(20, 97), power = c(1, 1.3))

# create a matrix from wide fitness data for plotting heatmap
mat_heatmap <- df_fitness_comp %>% ungroup %>%
  filter(if_any(.cols = matches(" - [CP]"), ~ !between(., -2, 2))) %>%
  select(matches("locus_tag| - [CP]")) %>%
  # filter out NA rows, and replace extreme values
  drop_na %>% mutate(across(matches(" - [CP]"), 
    function(x) {y = replace(x, x > 6, 6); replace(y, y < -6, -6)})) %>%
  column_to_rownames(var = "locus_tag") %>%
  as.matrix

# create cluster for reordering
mat_cluster <- mat_heatmap %>% dist %>% hclust(method = "ward.D2")
mat_heatmap <- mat_heatmap[
  order.dendrogram(as.dendrogram(mat_cluster)),
  levels(df_fitness_summary$cond_short)
]

The second step is to find the optimal number of clusters for subdividing the data set. Silhouette analysis can be used for this purpose. However, the result is not entirely clear, but suggests a number of 5 to 10 clusters would give good separation. A wrapper for more or less fully automated silhouette analysis is available in my Rtools package (random collection of helper functions).

devtools::install_github("m-jahn/r-tools")
silhouette_result <- Rtools::silhouette_analysis(mat = mat_heatmap, n_clusters = 2:20, n_repeats = 10)
silhouette_result$plot_summary$par.settings = custom.colorblind()
silhouette_result$plot_summary
n_clusters = 7

svg(filename = "../figures/figure_silhouette.svg", width = 4, height = 4)
print(silhouette_result$plot_summary)
dev.off()
png 
  2 

# define a custom set of colors for clusters
custom_colors <- colorRampPalette(custom.colorblind()$superpose.line$col[c(2,3,1,5)])(n_clusters)

# plot heatmap
plot_heatmap <- levelplot(mat_heatmap[ ,rev(colnames(mat_heatmap))],
  par.settings = custom.colorblind(),
  col.regions = colorRampPalette(heat_cols)(16),
  at = seq(-6, 6, 1), aspect = "fill",
  xlab = "", ylab = "", scales = list(x = list(draw = FALSE)),
  panel = function(x, y, z, ...) {
    panel.levelplot(x, y, z, ...)
    panel.abline(h = 1:9+0.5, col = "white", lwd = 1.5)
  }
)

plot_cluster_dend <- mat_cluster %>% as.dendrogram %>%
  set("branches_k_col", custom_colors, k = n_clusters) %>%
  set("branches_lwd", 0.5) %>%
  as.ggdend %>%
  ggplot(labels = FALSE)

gridExtra::grid.arrange(
  plot_cluster_dend + 
    theme(plot.margin = unit(c(0.1, 0.05, -0.26, 0.085),"npc")),
  plot_heatmap,
  nrow = 2
)

null device 
          1 

5.1 Gene similarity, by cluster

The heat map clusters genes together that show similar expression over different conditions. Alternatively, we can use dimensionality reduction to identify clusters of genes that are similar, e.g. through PCA, NMDS, or t-SNE.

# set a seed to obtain same pattern for stochastic methods
set.seed(123)
regex_moa <- "mo[aedbg]([A-Z0-9]+)?"
regex_for <- "fd[swho][A-Z0-9]+"
regex_hyd <- "hox[A-Z0-9]+|hyp[A-Z0-9]+"
regex_etc <- "(qcr|cco|cta|cox|cyo|cyd)([A-Z0-9]+)?"
regex_nit <- "na[rsp]([A-Z0-9]+)?|nir([A-Z0-9]+)?|nor([A-Z0-9]+)?|nos([A-Z0-9]+)?"
regex_all <- paste(c(regex_moa, regex_for, regex_hyd, regex_etc, regex_nit), collapse = "|")

# run t-SNE analysis
SNE <- tsne(mat_heatmap %>% dist, max_iter = 500)
sigma summary: Min. : 0.443752156712937 |1st Qu. : 0.631509019280422 |Median : 0.707827611279302 |Mean : 0.768252158651903 |3rd Qu. : 0.880353013632023 |Max. : 1.38975561028632 |
Epoch: Iteration #100 error is: 14.0419952880744
Epoch: Iteration #200 error is: 0.473145722307766
Epoch: Iteration #300 error is: 0.411810046962138
Epoch: Iteration #400 error is: 0.391638658174712
Epoch: Iteration #500 error is: 0.390500309862624
df_tsne <- SNE %>% setNames(c("x", "y")) %>% as_tibble %>%
  mutate(locus = rownames(mat_heatmap)) %>%
  left_join(enframe(cutreeord(mat_cluster, k = n_clusters), "locus", "cluster")) %>%
  rename(locus_tag = locus) %>%
  left_join(select(df_fitness_comp, locus_tag, gene_name)) %>%
  mutate(
    gene_name = stri_extract_first_regex(gene_name, regex_all) %>%
    str_replace("mog", "mogA")
  )
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of
tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Joining with `by = join_by(locus)`
Joining with `by = join_by(locus_tag)`
plot_tsne <- df_tsne %>%
  xyplot(V2 ~ V1, .,
    groups = cluster, col = custom_colors,
    xlab = "tSNE 1", ylab = "tSNE 2",
    par.settings = custom.colorblind(),
    labels = .[["gene_name"]], selected = {!is.na(.[["gene_name"]])},
    panel = function(x, y, selected, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.repellabels(x, y, selected = selected, cex = 0.65,
        draw_box = TRUE, box_fill = grey(0.95, 0.5), ...)
      directlabels::panel.superpose.dl(x, y, ...)
    }
  )

print(plot_tsne)

5.2 Functions enriched in different clusters

Extract cluster number for each gene and perform GO or KEGG enrichment per cluster. The goal is to identify the most prevalent biological functions per cluster.

df_cluster <- cutreeord(mat_cluster, k = n_clusters) %>%
  enframe(name = "locus_tag", value = "cluster") %>%
  arrange(cluster)

We use the functions kegga for KEGG enrichment analysis and goana for GO term enrichment from the limma package. Both functions test for over or under-representation of genes associated with certain pathways or GO terms. The functions don’t take the strength of differential fitness into account (DF; the depletion/enrichment over time).

tryCatch(
  df_kegg_enrichment <- lapply(1:n_clusters, function(clust) {
    filter(df_cluster, cluster == clust) %>% pull(locus_tag) %>%
    kegga(species.KEGG = "reh") %>%
    mutate(cluster = clust)
  }) %>% bind_rows,
  error = function(e) print(e)
)

Visualize the pathways that are most enriched for the different gene clusters.

plot_keggenrich <- df_kegg_enrichment %>%
  mutate(
    kegg_pathway = tolower(Pathway),
    kegg_pathway = str_remove_all(kegg_pathway, ".?biosynthesis| of| metabolism"),
    kegg_pathway = str_sub(kegg_pathway, 1, 32)
  ) %>%
  filter(N >= 5, N <= 200) %>%
  mutate(log10_p_value = log10(P.DE), .keep = "unused") %>%
  filter(log10_p_value < 0) %>%
  group_by(cluster) %>%
  arrange(log10_p_value) %>%
  slice(1:3) %>% rowwise() %>%
  mutate(
    kegg_pathway = str_remove(kegg_pathway, " - Cupriavidus necator H16"),
    kegg_pathway = str_glue("{kegg_pathway} ({cluster})")
  ) %>% ungroup %>%
  
  xyplot(factor(kegg_pathway, rev(unique(kegg_pathway))) ~ log10_p_value, .,
    groups = cluster,
    par.settings = custom.colorblind(),
    col = custom_colors,
    xlab = expression("log"[10]*" p-value"), ylab = "",
    panel = function(x, y, labels, ...) {
      panel.grid(h = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_keggenrich)

null device 
          1 

6 Genes essential in all conditions

6.1 Cluster 1

  • before looking at condition-specific essentiality, can also look at genes essential in all conditions
  • these genes are mostly present in cluster 1
  • we assume that these are amino acid biosynthesis genes
  • they become essential when transferring C. necator from rich to minimal/defined medium
  • in total 33 genes with depletion

Generalized function to plot (small) heatmaps.

heatmap_fitness <- function(data, key = TRUE) {
  
  # make some small adjustments to optical appearance
  data %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%

  # plot heatmap
  levelplot(norm_gene_fitness_median ~ factor(gene_name) * fct_rev(cond_short), .,
    par.settings = custom.colorblind(), colorkey = key,
    col.regions = colorRampPalette(heat_cols)(25),
    at = seq(-6, 6, 0.5), aspect = "iso",
    xlab = "", ylab = "", scales = list(cex = 0.7, x = list(rot = 90)),
    panel = function(x, y, z, ...) {
      panel.levelplot(x, y, z, ...)
      panel.abline(v = seq_along(unique(x))+0.5, 
        h = seq_along(unique(y))+0.5, col = "white", lwd = 2)
    }
  )
}

Generalized function to plot genes for different regions of the genome.

genome_plot <- function(df, xlim = NULL, title = "", rot_labels = 0) {
  df <-  replace_na(df, list(strains_per_gene = 0))
  theme = custom.colorblind(col = c(grey(0.7), grey(0.85)))
  theme$axis.line$col = grey(0.3)
  theme$axis.text$col = grey(0.3)
  theme$axis.text$cex = 0.7
  
  if (!is.null(xlim))
    xscale = list(limits = xlim)
  else 
    xscale = list()
  xyplot(end/1000 ~ start/1000, df,
    groups = strand, cex = 0.7, lwd = 1,
    par.settings = theme, strains = df$strains_per_gene,
    scales = list(y = list(draw = FALSE), x = xscale),
    ylim = c(-3,2), xlab = "", ylab = "",
    gene_strand = df[["strand"]],
    gene_name = df[["gene_name"]],
    panel = function(x, y, strains, ...) {
      panel.geneplot(x, y, arrows = TRUE, 
        tip = 0.1, rot_labels = rot_labels, ...)
      panel.text((x+y)/2, rep(0, length(x)), labels = strains, cex = 0.7)
      panel.text(mean(xlim), 1.5, labels = title, col = 1)
    }
  )
}

Plot heat map of cluster 1 gene fitness.

plot_cl1 <- df_cluster %>%
  filter(cluster == 1) %>%
  left_join(df_fitness_summary, by = "locus_tag") %>%
  filter(time == 8) %>%
  group_by(COG_Process) %>%
  mutate(COG_n = n()) %>%
  ungroup() %>%
  arrange(desc(COG_n), COG_Process, gene_name) %>%
  mutate(gene_name = fct_inorder(gene_name)) %>%
  heatmap_fitness()

print(plot_cl1)

  • Which functional categories (COG) do genes belong to?
plot_cl1_function <- df_cluster %>%
  filter(cluster == 1) %>%
  left_join(df_fitness_comp, by = "locus_tag") %>%
  count(COG_Process) %>%
  arrange(desc(n)) %>%
  mutate(COG_Process = fct_inorder(COG_Process)) %>%
  mutate(vars = factor("COG process")) %>%
  xyplot(vars ~ n, .,
    stack = TRUE, horizontal = TRUE,
    auto.key = list(columns = 2),
    par.settings = custom.colorblind(),
    xlab = "count", ylab = "", lty = 0,
    xlim = c(0, 33),
    groups = COG_Process,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barchart(x, y, ...)
    }
  )

print(plot_cl1_function)

7 Supervised analysis of energy metabolism

7.1 Molybdenum cofactor biosynthesis

  • modAB = molybdenum transporter
  • moaACDE = molybdopterin backbone biosynthesis
  • moeA = molybdopterin molybdenum transferase
  • mobAB = molybdenum cofactor guanylyltransferase,
  • mog = molybdopterin biosynthesis protein (role unclear?)
# plot heatmap
df_moco <- df_fitness_summary %>% filter(time == 8, grepl("mo[aedbg]", gene_name)) %>%
  filter(!str_detect(gene_name, "moaA2")) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "mo[aedbg]([A-Z0-9]+)?") %>%
    str_replace("mog", "mogA")) %>%
  mutate(gene_name = factor(gene_name, c("moaA", "moaC", "moaD", "moaE", "moaF",
    "modA", "modB", "modC", "mogA", "moeA1", "moeA2", "moeA3", "mobA", "mobB", "mobB2", "mobB3")))
plot_moco_fit <- heatmap_fitness(df_moco)

# plot genomic locations
df_moco <- df_moco %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_moco_g1 <- genome_plot(df_moco, xlim = c(2455.9, 2458.6), title = "chr 1")
plot_moco_g2 <- genome_plot(df_moco, xlim = c(2787.8, 2791.2), title = "chr 1")

print(plot_moco_fit, position = c(0,0.21,1,1.05), more = TRUE)
print(plot_moco_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_moco_g2, position = c(0.53,-0.1,0.83,0.4))

Interesting results to double-check:

  • Why is moaA essential for nitrate respiration but not formate? Is there another iso-enzyme for first step of MoCo synthesis?
  • mogA, moaA, moaB, moaE, modB, or modC deletion mutants are not viable in E.coli. Except for moaA, also modA,B,C are not essential for growth on formate.
  • modABC is the 3-subunit membrane transport system; it should be essential, however one explanation for it not being essential is that another low-affinity transporter might exist that compensates for the loss in medium with high Mo-concentration (Xia et al., 2018). The authors write in the discussion: “Molybdate can be taken up by the sulfate-transport system in the absence of a functional high-affinity molybdate-transport system in E. coli and B. japonicum (Rosentel et al., 1995; Delgado et al., 2006).”
  • moeA (molybdenum transferase) is present with 3 isoenzymes. MoeA1 and A2 both seem to be important for fitness as a KO of each one of them reduces fitness partially but not fully on formate/nitrate.

7.2 Formate dehydrogenase

  • fdsABCDG = primary soluble FDH
  • fdwAB = secondary soluble FDH
  • fdhABCD = primary membrane-bound FDH, 2 copies A1B1… and A2B2…
  • fdoGHI = secondary membrane-bound FDH
# plot heatmap
df_fdh <- df_fitness_summary %>% filter(time == 8, grepl("fd[swho]", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "fd[swho][A-Z0-9]+")) %>%
  mutate(gene_name = recode(gene_name, `fdhD` = "fdsC", `fdhC` = "fdhC1", `fdhE` = "fdhE1")) %>%
  mutate(gene_name = factor(gene_name, c(paste0("fds", c("A", "B", "C", "D", "G", "R")),
    "fdwA", "fdwB", paste0("fdh", c("A1", "B1", "C1", "D1", "E1", "A2")), "fdoG")))
plot_fdh_fit <- heatmap_fitness(df_fdh)

# plot genomic locations
df_fdh <- df_fdh %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_fdh_g1 <- genome_plot(df_fdh, xlim = c(677.7, 685.6), title = "chr 1")
plot_fdh_g2 <- genome_plot(df_fdh, xlim = c(3167.9, 3174.9), title = "chr 1")

print(plot_fdh_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_fdh_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_fdh_g2, position = c(0.52,-0.1,0.83,0.4))

7.3 Hydrogenases

  • hoxA = master regulator, 2-component system. HoxA is constantly active in its non- phosphorylated state below a temperature of 33*C. Originally, hoxA activity is regulated by the kinase HoxJ (together with HoxBC, the ‘regulatory’ hydrogenase). This function is inactivated due to a mutation in hoxJ, Friedrich & Friedrich, J Bac, 1983, Lenz et al., JMMB, 2004.
  • hoxKGZ = membrane-bound hydrogenase
  • hoxMLOQRTV = MBH accessory genes (cofactor incorporation, subunit assembly, and twin-arginine-dependent membrane translocation - TAT), Schubert et al., Mol Microbio, 2007, Fritsch et al., J Bac, 2011 Schwartz et al., J Mol Bio, 2003
  • hoxN1, hoxN2 = Nickel/Cobalt transporters. HoxN2 51% identical on AA level but not functional, Schwartz et al., J Mol Bio, 2003
  • hoxBC = regulatory hydrogenase RH, oxygen-resistant non-energy generating hydrogen sensor. HoxC is large active-site-containing subunit, hoxB small subunit HoxB carrying Fe-S clusters
  • HoxJ = RH binding signal transducer, histidine protein kinase Löscher et al., Chem Phys Chem, 2010 Buerstel et al., PNAS, 2016
  • hoxFUYH = soluble hydrogenase
  • hoxW = soluble hydrogenase (HoxH-specific) C-terminal protease (maturation)
  • hoxI = optional subunit of SH (hoxFUYH); specific reaction site for NADPH, Burgdorf et al., J Bac, 2005
  • a 4th hydrogenase exist on pHG1, made from 2 subunits (hofK, hofG, pHG064/065), see Schäfer et al., 2015.
  • these subunits have no effect on fitness (high coverage of 9-11 mutants per gene), and are very lowly expressed (PHG064 n.d., PHG065 low abundance)
# plot heatmap
df_hyd <- df_fitness_summary %>%
  filter(time == 8, grepl("hox", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hox[A-Z0-9]+")) %>%
  ungroup %>% mutate(gene_name = factor(gene_name, paste0("hox", c("K","G","Z","M",
   "L","O","Q","R","T","V","A","B","C","J","N","N2","F","U","Y","H","W","I"))))
plot_hyd_fit <- heatmap_fitness(df_hyd)

# plot genomic locations
df_hyd <- df_hyd %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_hyd_g1 <- genome_plot(df_hyd, xlim = c(0, 4.1), title = "pHG1")
plot_hyd_g2 <- genome_plot(df_hyd, xlim = c(15.1, 22.8), title = "pHG1")
plot_hyd_g3 <- genome_plot(df_hyd, xlim = c(79.2, 85.5), title = "pHG1")

print(plot_hyd_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyd_g1, position = c(0.15,-0.1,0.39,0.4), more = TRUE)
print(plot_hyd_g2, position = c(0.35,-0.1,0.65,0.4), more = TRUE)
print(plot_hyd_g3, position = c(0.61,-0.1,0.90,0.4))

7.4 Hydrogenase pleiotropy genes (maturation)

  • hypA1,B1,F1,C1,D1,E1 = Ni-Fe metallocenter formation proteins for both MBH and SH. “HypC1, hypD1, and hypE1 are essential for SH and MBH maturation, while only one intact copy each of hypA, hypB, and hypF is needed”, Schwartz et al., J Mol Bio, 2003
  • hypA and B
  • hypCDEF all form a complex to assemble the Ni-Fe-CO-CN cofactor
  • hypX = CO cofactor formation for NiFe-hydrogenase. HypX converts formyl-THF into water and CO in aerobic conditions, Schulz et al., JACS, 2020
  • hypA2,B2,F2 = alternative genes for hypA1,B1,F1, not sufficient alone for hydrogenase formation
  • hypF3,hypC2,hypD2,hypE2,hypA3,hypB3. The physiological function of these extra hyp genes is unclear
df_hyp <- df_fitness_summary %>% filter(time == 8, grepl("hyp", gene_name)) %>%
  # correct names of several hyp genes to make it more systematic
  mutate(gene_name = gsub("hypA PHG094", "hypA2 PHG094", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hyp[A-Z0-9]+") %>%
    recode(hypA = "hypA1", hypB = "hypB1", hypC = "hypC1", hypD = "hypD1", hypE = "hypE1")) %>%
  ungroup %>% mutate(gene_name = factor(gene_name, paste0("hyp", c("A1","B1","C1","D1","E1",
    "F1","X","A2","B2","C2","D2","E2","F2","A3","B3","F3"))))
plot_hyp_fit <- heatmap_fitness(df_hyp)

# plot genomic locations
df_hyp <- df_hyp %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_hyp_g1 <- genome_plot(df_hyp, xlim = c(8.3, 15.6), title = "pHG1")
plot_hyp_g2 <- genome_plot(df_hyp, xlim = c(66.3, 73.5), title = "pHG1")
plot_hyp_g3 <- genome_plot(df_hyp, xlim = c(85.0, 89.7), title = "pHG1")

print(plot_hyp_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyp_g1, position = c(0.26,-0.1,0.78,0.4), more = TRUE)

7.5 Other complexes important in energy metabolism

  • succinate dehydrogenase
    • sdhABCD = membrane integral succinate dehydrogenase, reducing UQ to UQH2: all of its genes are strictly essential, no KO mutants available
  • NADH dehydrogenase
    • nuoABCDEFGHIJKLMN = membrane integral NADH dehydrogenase, oxidizes NADH, pumps 4 H+, reduces UQ to UQH2: all of its genes are strictly essential, no KO mutants available

7.6 Electron transport chain

Cytochrome e- transport proteins, cytochrome reductases, and terminal oxidases.

  • quinol oxidases
    • cyoA1B1C1D1, … (3 copies) = cytochrome bo3 complex, quinole oxidase, low O2 affinity
    • cydA1B1, … (2 copies) = cytochrome bd complex, quinole oxidase, high O2 affinity
  • cytochrome reductases
    • qcrABC (aka petABC, H16_A3398, H16_A3397, H16_A3396) = cytochrome bc1 reductase
  • cytochrome C e- carriers
    • cytochrome c553 according to uniprot: H16_A3451, H16_A0830, H16_A0846, H16_B1452, H16_A3576, H16_A1121, H16_A1120. According to String DB, H16_A3576 is in same operon as thiosulphate oxidation pathway (SoxXYZABCD), where thiosulfate is used as electron source Zhang et al., 2020
  • terminal cytochrome C oxidases
    • coxMNOPQ = bb3 complex, affinity unknown, not essential for growth or N-fixation in B. japonycum where it was discovered.
    • ctaABCDEG = aa3 complex, low O2 affinity
    • ccoGNOP = cbb3 complex, high affinity, high similarity to NO3 reductase
    • VP2641 (H16_A1031) = Membrane spanning protein, RDD family. PFAM: unknown function, potential transporter? Interacts with cytochrome c553 and qcrC, cytochrome reductase. It’s depletion implicates this protein in formate-specific e- transport in the membrane.
df_etc <- df_fitness_summary %>% filter(time == 8, locus_tag %in% c("H16_A3451", "H16_A0830",
    "H16_A0846", "16_B1452", "H16_A3576", "H16_A1121", "H16_A1120", "H16_A1031") | 
    grepl("qcr|cco|cta|cox[MNOPQ] |cyo|cyd", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "H16_[AB][0-9]+|(qcr|cco|cta|cox|cyo|cyd)([A-Z0-9]+)?") %>%
    factor(c(
      paste0("cyo", c("A1","A2","A3","B1","B2","B3","C1","C2","D1","D2")),
      paste0("cyd", c("A1","B1","B2")), paste0("qcr", c("A","B","C")),
      paste0("H16_", c("A3576","A0846","A1120","A1121","A1031","A3451")),
      paste0("cox", c("M","N","O","P","Q")), paste0("cta", c("A","B","C","D","E","G")),
      paste0("cco", c("G","N","O","P"))
    ))
  )
plot_etc_fit <- heatmap_fitness(df_etc)

# plot genomic locations
df_etc <- df_etc %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_etc_g1 <- genome_plot(df_etc, xlim = c(3675.2, 3679.0), title = "chr 1")
plot_etc_g2 <- genome_plot(df_etc, xlim = c(361.2, 371.2), title = "chr 1")
plot_etc_g3 <- genome_plot(df_etc, xlim = c(2509.5, 2515.2), title = "chr 1")

print(plot_etc_fit, position = c(0,0.17,1,1.05), more = TRUE)
print(plot_etc_g1, position = c(0.1,-0.1,0.31,0.4), more = TRUE)
print(plot_etc_g2, position = c(0.28,-0.1,0.65,0.4), more = TRUE)
print(plot_etc_g3, position = c(0.62,-0.1,0.9,0.4))

7.7 Nitrate respiration

Nitrate reductase (NAR)

Reaction: HNO3- + QH2 –> NO2- + H2O + Q.

Two copies of the nar operon, one on the megaplasmid and one on chromosome 2. Several other enzymes catalyze similar reactions as NAR. NAS is cytoplasmic and NADH dependent assimilatory nitrate reductase Cramm R, 2008. NAP is periplasmic membrane protein complex accepting QH2, involved in nitrate respiration but probably also dissipation of excess redox power. Here we just focus on the main NAR enzyme.

df_nar <- df_fitness_summary %>%
  filter(time == 8, grepl("na[rsp]", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "na[rsp]([A-Z0-9]+)?") %>%
    recode(narG = "narG1", narH = "narH1", narI = "narI1", narL = "narL1", narX = "narX1"))
plot_nar_fit <- heatmap_fitness(df_nar)

# plot genomic locations
df_nar_genes <- df_nar %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nar_g1 <- genome_plot(df_nar_genes, xlim = c(279.5, 294.5), title = "pHG1")

print(plot_nar_fit, position = c(0,0.26,1,1.05), more = TRUE)
print(plot_nar_g1, position = c(0.2,-0.1,0.85,0.4))

Nitrite reductase (NIR)

Reaction: NO2- + QH2 –> NO + H2O + Q

NIR is a cd1 type copper containing terminal oxidase, homodimer of 2 nirS subunits. The genes for nitrite reductase are assembled in a cluster (nirS, -C, -F, -D, -G, -H, -J, -N, -E) spanning 8.6 kb on chromosome 2. The first gene of this cluster, nirS, is the structural gene encoding cd1 -NIR. The re-maining genes are presumed to be accessory genes essential for maturation of active nitrite reductase (?).

df_nir <- df_fitness_summary %>%
  filter(time == 8, grepl("nir", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nir([A-Z0-9]+)?"))
plot_nir_fit <- heatmap_fitness(df_nir)

# plot genomic locations
df_nir_genes <- df_nir %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nir_g1 <- genome_plot(df_nir_genes, xlim = c(2576.3, 2585.9), title = "chr 2")

print(plot_nir_fit, position = c(0,0.26,1,1.05), more = TRUE)
print(plot_nir_g1, position = c(0.3,-0.1,0.75,0.4))

NO reductase (NOR)

Reaction: NO + QH2 –> N2O + H2O + Q

From Cramm R, 2008: “The catalytic sub-unit (NorB) of all NORs that have been characterized to date contains one high-spin heme b and a binuclear catalytic center of a low-spin heme b and a non-heme iron. The best investigated NORs are heterodimers that contain, in addition to NorB, a heme c-containing subunit NorC which channels lectrons from the physiological electron donor cytochrome c to NorB.”

df_nor <- df_fitness_summary %>%
  filter(time == 8, grepl("nor", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nor([A-Z0-9]+)?"))
plot_nor_fit <- heatmap_fitness(df_nor)

# plot genomic locations
df_nor_genes <- df_nor %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nor_g1 <- genome_plot(df_nor_genes, xlim = c(256.4, 262.1), title = "pHG1")

print(plot_nor_fit, position = c(0,0.25,1,1.05), more = TRUE)
print(plot_nor_g1, position = c(0.3,-0.1,0.75,0.4))

NOS (nitrous oxide) reductase

Reaction: N2O + QH2 –> N2 + H2O + Q

The NOS of R. eutropha H16 has not been investigated on the biochemical level. Genes for NOS are located adjacent to the nor genes on pHG1. NosC (PHG253) might be the cyt c that is the e- donor for NOS. The nosC KO is enriched on formate, i.e. grows better than WT. Is nosC taking e- away from formate-fed ETC?

df_nos <- df_fitness_summary %>%
  filter(time == 8, grepl("nos", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nos([A-Z0-9]+)?"))
plot_nos_fit <- heatmap_fitness(df_nos)

# plot genomic locations
df_nos_genes <- df_nos %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nos_g1 <- genome_plot(df_nos_genes, xlim = c(261.1, 271.9), title = "pHG1")

print(plot_nos_fit, position = c(0,0.25,1,1.05), more = TRUE)
print(plot_nos_g1, position = c(0.22,-0.1,0.85,0.4))

The following section simply plots all nitrate respiration modules on one canvas.

plot_nitrate_all <- bind_rows(df_nar, df_nor, df_nos, df_nir) %>%
  mutate(gene_name = factor(gene_name, c(
    paste0("nar", c("X1", "L1", "K1", "K2", "G1", "H1", "I1", "X2", "L2", "G2", "H2", "I2", "K3", "K5")),
    paste0("nap", c("A", "B", "C", "D")), paste0("nas", c("D", "E")),
    paste0("nir", c("S", "C", "F", "D", "J", "N", "E")),
    paste0("nor", c("A", "B", "R1", "A2", "B2", "R2")),
    paste0("nos", c("L", "Y", "F", "D", "R", "Z", "C", "X"))
  ))) %>%
  heatmap_fitness()

print(plot_nitrate_all, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_nar_g1, position = c(0,-0.1,0.35,0.4), more = TRUE)
print(plot_nir_g1, position = c(0.33,-0.1,0.58,0.4), more = TRUE)
print(plot_nor_g1, position = c(0.56,-0.1,0.72,0.4), more = TRUE)
print(plot_nos_g1, position = c(0.70,-0.1,0.96,0.4), more = TRUE)

7.8 Comparison of nitrate respiration conditions

  • we have two different nitrate respiration conditions, continuous and pulsed
  • low depletion of nitrate/nitrite/nitrous respiration mutants probably due to constant NO3 supply
df_fitness_summary %>%
  filter(substrate == "nitrate", time == 8, !is.na(gene_name)) %>%
  select(gene_name, condition, norm_gene_fitness_median) %>%
  pivot_wider(names_from = condition, values_from = norm_gene_fitness_median) %>%
  mutate(
    dfit = pulse - continuous,
    significant = !between(dfit, quantile(dfit, probs = c(0.003), na.rm = TRUE),
      quantile(dfit, probs = c(0.997), na.rm = TRUE))) %>%
  
  # plot
  xyplot(continuous ~ pulse, data = ., groups = significant,
    labels = .[["gene_name"]],
    par.settings = custom.colorblind(), col = custom_colors,
    xlim = c(-9, 3), ylim = c(-9, 3),
    xlab = "Nitrate respiration (P)", ylab = "Nitrate respiration (C)",
    panel = function(x, y, labels, ...) {
      panel.grid(h = -1, col = grey(0.9))
      panel.abline(a = 0, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.abline(a = 4, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.abline(a = -4, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.xyplot(x, y, ...)
      panel.repellabels(x, y, labels, cex = 0.6,
        selected = .[["significant"]],
        draw_box = TRUE, box_fill = "white", box_line = TRUE, ...)
    }
  )

7.9 Non-hydrogenase genes important for H2 fitness

Different gene clusters appear based on the fitness pattern in different conditions. Clusters specific for lithoautotrophic metabolism are particularly interesting. We can add the cluster IDs to the main data frame, and investigate fitness. Or simply select genes that show depletion only for the hydrogen metabolizing condition.

df_fitness_summary <- df_fitness_summary %>%
  left_join(by = "locus_tag",
    enframe(cutreeord(mat_cluster, k = n_clusters), "locus_tag", "cluster")
  )
df_fitness_summary %>%
  filter(time == 8, !str_detect(gene_name, "hox|hyp")) %>%
  mutate(hyd_essential = case_when(
    substrate == "hydrogen" & norm_gene_fitness_median <= -2 ~ TRUE,
    substrate != "hydrogen" & norm_gene_fitness_median >= -2 ~ TRUE,
    TRUE ~ FALSE
  )) %>% 
  group_by(locus_tag) %>%
  filter(sum(hyd_essential) == 9 |
    locus_tag %in% c("H16_A0325", "H16_A0326")) %>%
  ungroup %>% arrange(locus_tag) %>%
  mutate(gene_name = factor(gene_name, unique(gene_name))) %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%
  heatmap_fitness

  • ptsI + ptsH - according to databases, a clear hit for a PEP-dependent sugar PTS system. This can’t be true for C. necator otherwise it would not show the familiar pattern of depletion on hydrogen and growth benefit on other substrates. This PTS system must be implicated in hydrogen metabolism, probably a transporter? The genes downstream of ptsIH do not seem to be affected (H16_A0327 to H16_A0329).
  • hldD (H16_A0804) - ADP-L-glycero-beta-D-manno-heptose-6-epimerase, involved in LPS sugar phosphate synthesis. No obvious connection to hydrogen metabolism.
  • glnE (H16_A1127) - Bifunctional glutamine synthetase adenylyltransferase. Involved in the regulation of glutamine synthetase GlnA. Transfers or removes adenylyl-residues to and from GlnA, thereby regulating N uptake through GlnA. No obvious connection to hydrogen metabolism. Only 1 mutant in library.
  • H16_A2692 - uncharacterized membrane protein. Is neutral in all other conditions, just as are all surrounding genes (H16_A269[0-9]). Small, 141 AA, potential carbonic anhydrase or transporter?
  • bolA (H16_A3419) - Predicted transcriptional regulator, no known domains or published functions. Only 1 mutant in library.
  • yadG (H16_A3421) Separated by only 1 gene (yadH, H16_A3420) from the previous hit, shows very specific depletion for hydrogen. 7-8 mutants quantified, annotated as ABC transporter, ATPase function. YadH is the permease partner of the 2 subunits, however does not show depletion. COG process ‘defense’ because it supposedly is active as anti-drug efflux pump.
  • H16_B1603 - ABC-type transporter, ATPase coupled transmembrane transporter. Strong depletion on hydrogen only. All surrounding genes are neutral to fitness. Only 1 mutant in library. A candidate for a CO2/HCO3- transporter? Aligns best with a lot of different ABC transporters, with spermidine/putrescine transporters best hits.

7.10 CO2 transporter/carbonic anhydrase homologs

All CAs with quantified fitness are neutral during lithoautotrophic growth (H16_A1192 - cag, H16_B2403 - caa, H16_B2270 - can2). Only H16_A0169 (can) is strictly essential, no mutants are available. This is different to the results from Gai et al., AMB Express, 2014 that found two essential CAs, can and caa:

“All four purified CAs were capable of performing the interconversion of CO2 and HCO3– […] Deletion of can affected the growth of R. eutropha; however the growth defect could be compensated by adding CO2 to the culture. Deletion of caa, encoding an alpha-CA, had the strongest deleterious influence on cell growth.”.

Also Kusian et al., J Bac, 2002 report the essentiality of can for lithoautotrophic growth. Caa fitness was quantified in the library with 11 mutants, yet does not show any phenotype on any substrate. The results in Gai et al. are somewhat contradictory, as the caa completion test in a quadruple-CA mutant had no beneficial effect on growth.

Altogether, no specific bicarbonate transporters are annotated in C. necator. However, important bicarbonate (HCO3+) or CO2 transporters known from e.g. cyanobacteria (Synechocystis sp. PCC 6803) could be used to search for homologous genes in C. necator. The following CO2/HCO3+ transporters are known in Synechocystis:

  • BicA (sll0834): constitutive high-flux, low-affinity Na+/HCO3+ symport
  • SbtA (slr1512): inducible low-flux, high-affinity Na+/HCO3+ symport
  • BCT1 (cmpABCD operon/slr0040-44): inducible high-affinity, ATP-dependent uniport
  • NDH (nuo operon/): all genes of this operon are essential in C. necator, no cond. fitness

The first three of these genes were blasted against the genome of C. necator. The figure shows alignment scores for the top 20 hits.

df_co2_transport_align <- read_tsv("../data/alignments/CO2_transporters.tsv", col_types = cols()) %>%
  group_by(target_name) %>%
  mutate(rank = seq_along(uniprot))
  
  xyplot(score + -log10(e_value) ~ rank | target_name, df_co2_transport_align,
    par.settings = custom.colorblind(), xlim = c(-1, 21),
    xlab = "rank", ylab = expression("-log"[10]*" e value, score"),
    between = list(x = 0.5, y = 0.5), lwd = 0,
    scales = list(alternating = FALSE),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barplot(x, y, ewidth = 0.5, fill_alpha = 0.7, ...)
      panel.key(..., points = FALSE, corner = c(0.97, 0.97))
    }
  )

BicA and cmpB seem to have some very similar homologs in the C. necator genome. However, none of these homologs has a clear phenotype related to hydrogen metabolism.

df_co2_transport_align %>%
  filter(rank <= 20) %>% 
  select(uniprot, target_name, rank) %>%
  inner_join(df_fitness_summary, by = "uniprot") %>% ungroup %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%
  heatmap_fitness
Warning in inner_join(., df_fitness_summary, by = "uniprot") :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 100034 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this
  warning.

7.11 Main conclusions

  • MoCo cofactor is important for growth on formate and nitrate respiration, but not for growth on hydrogen
  • Main FDH is S-FDH both in terms of protein abundance, inducibility, and essentiality
  • Alternative S-FDH operon fdw does not play a roll
  • Both hydrogenases, S-H and m_H have similar fitness penalty on hydrogen when KOed
  • Hydrogenase genes, but even more so master regulator hoxA and Hydrogenase pleiotropy (hyp) genes confer fitness advantage when KOed on all non-loithoautotrophic substrates
  • Cytochrome reductase bc1 complex (qcrABC) is particularly important for growth on formate
  • Bc1 complex plays no role on nitrate, therefore no cross talk (while a bc1 KO mutant was growth-deficient in oxygen-limited growth on nitrate (Garg et al., 2018).
  • Bc1 complex accepts reduced quinols as e- donor, pumps 4 H+ /2 e− via proton-motive Q cycle (quinol:cyt c oxidoreduction), and releases reduced cytochrome C e- carriers. Main cytochrome complex for most substrates including hydrogen
  • Cbb3-type cytochrome oxidase (cco operon) also important for growth on formate, while cta operon (aa3 complex) more important on fructose and succinate

8 Central carbon metabolism

8.1 gene-reaction relationship

To show (conditional) essentiality of genes which are related to central metabolic reactions, we need to obtain a table of gene-reaction relationships. Such a table was used in our eLife paper, Jahn et al., 2021, and was exported from the curated genome scale model of C. necator.

  • download gene-reaction table
  • add some genes manually that are not metabolic reactions but interesting regulators
  • we add:
    • cbbR (H16_B1396) cbb operon transcriptional activator
    • regA (H16_A0202) response regulator, 2-comp. system
    • regB (H16_A0203) kinase, 2-comp. system
    • rpoN (H16_A0387) Sigma 54 RNA-polymerase sigma factor
    • hoxA (PHG019) hoxA master regulator
url_reactions <- url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/raw/main/data/input/model_reactions.csv")
df_reactions <- read_csv(url_reactions, show_col_types = FALSE) %>%
  rename(locus_tag = genes) %>%
  add_row(reaction_id = "Regulators", locus_tag = "H16_B1396, H16_A0202, H16_A0203, H16_A0387, PHG019")
New names:
• `` -> `...1`
df_reactions %>%
  filter(!str_detect(reaction_id, "^EX_")) %>%
  head()

8.2 Plot conditional fitness broken down by CCM pathway

  • use the same set of enzymes as in Jahn et al., 2021
  • using these sets, extract fitness from results table and plot minifigures
  • these can be arranged in/around a map of the central carbon metabolism (CCM)
  • reactions with slightly different substrates/products but that are gene-wise identical have been removed
  • these are SUCDi, ACONT2, TKT2, and PDH3 which is identical to AKGDH
list_ccm_enzymes <- list(
  CBB = c("PGK", "TPI", "GAPD", "FBA", "FBP", "TKT1", "TAh", "RPE", "RPI", "PRUK", "RBPC"),
  ED = c("PGI", "G6PDH2r", "PGL", "EDD", "EDA"),
  PYR = c("PGM", "ENO", "PYK", "PDH1", "PDH2", "PC", "PPC", "ME1", "ME2", "CS"),
  TCA = c("ACONT1", "ICDHx", "ICDHyr", "AKGDH", "SUCOAS", "SUCD1", "FUM", "MDH", "MALS"),
  REG = c("Regulators")
)
# find genes for metabolic reactions
list_reaction <- list_ccm_enzymes %>%
  enframe(name = "pathway", value = "reaction_id") %>%
  tidyr::unnest(reaction_id)

list_genes <- df_reactions %>%
  right_join(list_reaction, by = "reaction_id") %>%
  separate_longer_delim(locus_tag, ", ") %>%
  select(pathway, reaction_id, locus_tag)

# get fitness for these genes
df_fitness_ccm <- df_fitness_summary %>%
  filter(time == 8) %>%
  inner_join(list_genes, by = "locus_tag")

# plot heatmaps
plots_ccm <- df_fitness_ccm %>%
  arrange(pathway, reaction_id, locus_tag) %>%
  mutate(
    gene_name = locus_tag,
    # gene_name = eggNOG_name %>%
    # str_extract("[a-zA-Z0-9_]+") %>%
    # fct_inorder(),
    cond_short = case_match(cond_short,
      "formate - C" ~ "FoC",
      "fructose - C" ~ "FrC",
      "nitrate - C" ~ "NiC",
      "succinate - C"   ~ "SuC",
      "formate - P" ~ "FoP",
      "fructose - P" ~ "FrP",
      "hydrogen - P" ~ "HyP",
      "nitrate - P" ~ "NiP",
      "succinate - P"~ "SuP") %>%
      factor(., unique(.)[c(3,8,7,1,5,4,9,2,6)]),
    norm_gene_fitness_median = norm_gene_fitness_median %>%
           replace(., . < -6, -6) %>% replace(., . > 6, 6)
  ) %>%
  group_by(pathway, reaction_id) %>%
  dplyr::group_split() %>%
  lapply(function(df) {
    levelplot(
      norm_gene_fitness_median ~ fct_rev(cond_short) * fct_rev(gene_name) |
        reaction_id, df,
      par.settings = custom.colorblind(), colorkey = FALSE,
      col.regions = colorRampPalette(heat_cols)(25),
      at = seq(-6, 6, 0.5), aspect = "iso", xlab = "", ylab = "",
      scales = list(alternating = FALSE, cex = 0.7, tck = 0,
        x = list(rot = 90)),
      between = list(x = 0.5, y = 0.5), lwd = 0,
      panel = function(x, y, z, ...) {
        panel.levelplot(x, y, z, ...)
        panel.abline(
          v = seq_along(unique(x)) + 0.5,
          h = seq_along(unique(y)) + 0.5,
          col = "white",
          lwd = 2
        )
      }
    )
  })

cowplot::plot_grid(plotlist = plots_ccm, ncol = 6)

null device 
          1 

9 Growth advantage of Hydrogenase mutants

9.1 Determine mutant growth rate from competition experiments

The main measurement that was obtained from transposon library competition experiments is the fold change, and the fitness score as a summary statistic thereof. The fold change of a mutant sub-population over time can also be used to calculate the relative growth advantage over other strains. For such a growth model, the only required input parameters are the FC, the average population growth rate, and the time in hours.

\[ FC = (1 - (\mu_{pop} - \mu_{mut}))^{t} \] Re-arranged to obtain the mutant growth rate.

\[ \mu_{mut} = -(1 - FC^{1/t}) + \mu_{pop}\]

All input variables are available:

  • µ_pop = 0.1 h^-1, known from chemostats
  • FC for each mutant/gene from library competition experiments (‘fitness’ is the normalized log2 FC)
  • t = n_gen x (ln 2)/0.1 - number of generations (0, 8, 16) multiplied with generation time of 6.93 h

We make a generalized function that calculates growth rate (dis-) advantage.

mutant_mu <- function(fold_change, pop_mu, time, log2_fc = FALSE) {
  if (log2_fc) {fold_change = 2^fold_change}
  result = -(1 - (fold_change^(1/time))) + pop_mu
  return(result)
}

Then select the hydrogenase mutants where an enrichment was observed in heterotrophic conditions.

df_increased_mu <- df_fitness_summary %>%
  filter(
    grepl("hox[KGZMLOQRTVABCJNFUYHWI]|hyp[BCDEFX]1? |hypA PHG012", gene_name), condition == "continuous",
    substrate %in% c("formate", "fructose", "succinate")) %>%
  filter(locus_tag != "PHG063") %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hox[KGZMLOQRTVABCJNFUYHWI]|hyp[ABCDEFX]")) %>%
  mutate(gene_name = fct_inorder(gene_name))

plot_increased_mu <- xyplot(norm_gene_fitness_median ~ time | gene_name, df_increased_mu,
    groups = substrate, as.table = TRUE, layout = c(6, 5),
    col = stdcol[2:4], xlab = "generations", ylab = "fitness",
    par.settings = custom.colorblind(), type = c("l"),
    auto.key = list(columns = 3),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.abline(h = 0, col = grey(0.5), lty = 3, lwd = 2)
      panel.xyplot(x, y, ...)
    }
  )

df_mutant_mu <- df_increased_mu %>%
  filter(time != 0) %>%
  group_by(gene_name, substrate) %>%
  mutate(mutant_mu = mutant_mu(
    fold_change = norm_gene_fitness_median, 
    pop_mu = 0.1, time = time*(log(2)/0.1),
    log2_fc = TRUE)) %>%
  select(locus_tag, time, condition, substrate,
    norm_gene_fitness_median, gene_name, cond_short, mutant_mu)

print(plot_increased_mu)

9.2 Growth advantage versus protein cost

The main question is if the observed growth advantage can be purely explained by the saving of gene expression cost. Hydrogenase and some accessory proteins are highly abundant and therefore cause a significant cost to the cell. It was shown in many other studies that reducing excess/unused proteins can increase growth rate. This is based on the observation that the protein pool is limited and cells can optimize the utilization of this limited resource by tuning gene expression towards highly utilized genes/proteins.

In the next section, global protein abundance data is imported that was obtained from quantitative MS experiments. MS data is available for the three conditions used in the previous section, continuous growth on fructose, succinate, and formate in chemostats at defined growth rates.

Note: The data is imported directly from the respective github repository, requiring internet connection.

load(url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/blob/main/data/input/Ralstonia_eutropha.Rdata?raw=true"))
head(Ralstonia_eutropha[1:10])

To assess real protein cost of a knock-out, it is important to consider that KO by transposon integration will have an effect on the downstream gene expression too, effectively disrupting transcription/translation of the entire operon downstream of the integration site.

If we assume that each KO knocks out the gene expression of all consecutive genes of the operon, we need to sum up protein cost for these affected gene sets; we generate a binary matrix of associations for each KO with a number of affected genes (0 = FALSE, 1 = TRUE), and multiply with protein mass. Here, we need to test different assumptions. The canonical regulation of the hox/hyp operon happens via hoxA transcriptional regulator, hoxBC H2 sensing ‘regulatory hydrogenase’ (RH), and hoxJ, the defect kinase that transmits the output from hoxBC to hoxA?. This regulation controls two promoters:

  • the P_MBH promoter that starts transcription of the huge hoxKGZMLOQRTV + hypABFCDEX + hoxABCJ operon
  • the P_SH promoter that starts transcription of the smaller hoxFUYHWI + hypA2B2F2 operon

Accordingly, hoxA controls its own transcription in a feed-forward autoregulatory fashion. Increased hoxA protein abundance leads to stronger hoxA expression, etc. However, if a transposon integrates into any given site of one of the operons, transcription of following genes should be terminated or otherwise negatively affected. Surprisingly, this effect is visible for hyp mutants that might negatively affect hoxA expression, but not for any preceding genes (hoxKHZMLOQRTV). This suggests that:

  • there is another, secondary promoter that controls hyp + hoxABCJN gene expression
  • in fact this has been shown in …
gene_list_mh <- c("hoxK", "hoxG", "hoxZ", "hoxM", "hoxL", "hoxO", "hoxQ", "hoxR", "hoxT", "hoxV")
gene_list_sh <- c("hoxA", "hoxB", "hoxC", "hoxJ", "hoxN", "hoxF", "hoxU", "hoxY", "hoxH", "hoxW", "hoxI")
gene_list_hyp <- c("hypA", "hypB", "hypF", "hypC", "hypD", "hypE", "hypX")

list_mutant_genomic <- list(
  hoxK = gene_list_mh,
  hoxG = gene_list_mh[2:10],
  hoxZ = gene_list_mh[3:10],
  hoxM = gene_list_mh[4:10],
  hoxL = gene_list_mh[5:10],
  hoxO = gene_list_mh[6:10],
  hoxQ = gene_list_mh[7:10],
  hoxR = gene_list_mh[8:10],
  hoxT = gene_list_mh[9:10],
  hoxV = gene_list_mh[10],
  hypA = c(gene_list_mh, gene_list_hyp, gene_list_sh),
  hypB = c(gene_list_mh, gene_list_hyp[2:7], gene_list_sh),
  hypF = c(gene_list_mh, gene_list_hyp[3:7], gene_list_sh),
  hypC = c(gene_list_mh, gene_list_hyp[4:7], gene_list_sh),
  hypD = c(gene_list_mh, gene_list_hyp[5:7], gene_list_sh),
  hypE = c(gene_list_mh, gene_list_hyp[6:7], gene_list_sh),
  hypX = c(gene_list_mh, gene_list_hyp[7], gene_list_sh),
  hoxA = c(gene_list_mh, gene_list_hyp, gene_list_sh),
  hoxB = gene_list_sh[2:5],
  hoxC = gene_list_sh[3:5],
  hoxJ = gene_list_sh[4:5],
  hoxN = gene_list_sh[5],
  hoxF = gene_list_sh[c(6:11)],
  hoxU = gene_list_sh[c(7:11)],
  hoxY = gene_list_sh[c(8:11)],
  hoxH = gene_list_sh[c(9:11)],
  hoxW = gene_list_sh[c(10:11)],
  hoxI = gene_list_sh[c(11)]
)

mat_mutant_genomic <- lapply(list_mutant_genomic, function(x) {
    as.numeric(names(list_mutant_genomic) %in% x)
  }) %>% unlist %>% matrix(ncol = length(list_mutant_genomic), byrow = TRUE) %>% t
rownames(mat_mutant_genomic) <- names(list_mutant_genomic)
colnames(mat_mutant_genomic) <- names(list_mutant_genomic)
  • Add individual and accumulated protein cost for each KO mutant.
# add individual protein cost from MS data
df_mutant_cost <- df_mutant_mu %>%
  group_by(locus_tag, gene_name, substrate) %>%
  summarize(.groups = "drop",
    mean_mutant_mu = mean(mutant_mu),
    sd_mutant_mu = sd(mutant_mu)) %>%
  left_join(by = c("locus_tag", "substrate"),
    Ralstonia_eutropha %>% ungroup %>%
      filter(growthrate == 0.1) %>%
      select(locus_tag, substrate, mean_mass_fraction_norm, sd_massfraction)) %>%
  arrange(locus_tag) %>%
  tidyr::complete(substrate, nesting(locus_tag, gene_name)) %>%
  
  # add estimated total cost for each knockout
  group_by(substrate) %>%
  mutate(
    corrected_cost = mean_mass_fraction_norm %>% {colSums(na.rm = TRUE, .*100*mat_mutant_genomic)},
    corrected_sd = sd_massfraction %>% {colSums(na.rm = TRUE, .*100*mat_mutant_genomic)}
  )
  • Plot mutant growth rate, individual cost, and total protein cost that is saved in KO mutants.
plot_mutant_mu <- xyplot(mutant_mu ~ gene_name, df_mutant_mu,
    groups = substrate, as.table = TRUE,
    col = stdcol[2:4], xlab = "", ylab = expression("estimated µ [h"^-1*"]"),
    par.settings = custom.colorblind(), lwd = 2,
    scales = list(x = list(rot = 90)),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.abline(h = 0.1, col = grey(0.5), lty = 3, lwd = 2)
      panel.errbars(x, y, beside = TRUE, ...)
    }
  )

plot_mutant_cost <- xyplot(mean_mass_fraction_norm*100 ~ gene_name, df_mutant_cost,
    groups = substrate, par.settings = custom.colorblind(),
    error_margin = df_mutant_cost[["sd_massfraction"]]*100,
    lwd = 2, col = stdcol[2:4], ylim = c(-0.2, 6.5), pch = 1,
    xlab = "", ylab = "reduction protein cost [%]",
    scales = list(x = list(rot = 90)),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.errbars(x, y, beside = TRUE, ...)
      panel.key(corner = c(0.98, 0.95), ...)
      panel.key(corner = c(0.98, 0.60),
        labels = c("individual cost", "total cost (est.)"),
        col = grey(0.5), pch = c(1, 19))
    }
  ) + as.layer(
    xyplot(corrected_cost ~ gene_name, df_mutant_cost,
      groups = substrate,
      error_margin = df_mutant_cost[["corrected_sd"]],
      lwd = 2, col = stdcol[2:4],
      panel = function(x, y, ...) {
        panel.errbars(x, y, beside = TRUE, ...)
      }
    )
  )

plot_mbh_operon <- df_increased_mu %>%
  filter(time == 8) %>%
  group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop") %>%
  genome_plot(xlim = c(0.015, 22.490), title = "pHG1", rot_labels = 45)

plot_sh_operon <- df_increased_mu %>%
  filter(time == 8) %>%
  group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop") %>%
  genome_plot(xlim = c(79.612, 85.438), title = "pHG1", rot_labels = 45)

print(plot_mutant_mu, position = c(-0.017, 0.60, 1, 1), more = TRUE)
print(plot_mutant_cost, position = c(0.02, 0.30, 1, 0.70), more = TRUE)
print(plot_mbh_operon, position = c(0.075, 0.14, 0.97, 0.39), more = TRUE)
print(plot_sh_operon, position = c(0.075, 0, 0.4, 0.25))
grid::grid.text(c("A","B","C"), x = c(0.02, 0.02, 0.02),
  y = c(0.98, 0.66, 0.34), gp = grid::gpar(cex = 1.2))

  • plot correlation of estimated protein cost and growth rate, per mutant
  • correlation is visible when plotting both parameters separately, but will it also hold in direct comparison?
plot_cost_vs_mu <- xyplot(mean_mutant_mu ~ corrected_cost | substrate,
    df_mutant_cost, groups = substrate,
    pointlabels = as.character(df_mutant_cost$gene_name),
    par.settings = custom.colorblind(), col = stdcol[2:4],
    xlab = "reduction protein cost [%]",
    ylab = expression("estimated µ [h"^-1*"]"),
    aspect = 1, scales = list(alternating = FALSE),
    between = list(x = 0.5, y = 0.5),
    panel = function(x, y, pointlabels, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.superpose(x, y, ...)},
    panel.groups = function(x, y, ...) {
      panel.xyplot(x, y, ...)
      panel.lmlineq(x, y, r.squared = TRUE, label = "", ...)
    }
  )

plot_cost_vs_mu

svg("../figures/figure_mutant_cost_vs_growth.svg", width = 6.5, height = 3.5)
print(plot_cost_vs_mu)
dev.off()
png 
  2 

9.3 Prediction of growth advantage using cell economy model

The constrained Resource Balance Analysis (RBA) model described in Jahn et al., 2021 can be used to make predictions on the effect of hydrogenase knockout on cell growth rate. The RBA model is a genome scale metabolic model of R. eutropha that takes protein costs and global protein pool limitations into account. In constrast to simple FBA (flux balance analysis), growth rate is ultimately capped by rate limitation of enzymes and space limitation for enzyme mass within the cell, or the cytoplasmic membrane. The model was constrained using experimentally determined steady-state growth rates and protein abundances.

Here, we can ask the question if, and to which degree, growth rate would increase by expanding the total protein pool size (shrinking protein cost) by the fraction indicated above (e.g. hoxA KO freeing up to 4% protein mass). The first step is to import RBA simulation results.

source(url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/raw/main/pipeline/read_rba_result.R"))
dir_sim <- "../../../Models/Bacterial-RBA-models/Ralstonia-eutropha-H16/simulation/hydrogenase/"

df_flux <- read_rba_result(list.files(dir_sim, pattern = "fluxes_.*.tsv$", full.names = TRUE))
df_prot <- read_rba_result(list.files(dir_sim, pattern = "proteins_.*.tsv", full.names = TRUE))
df_macr <- read_rba_result(list.files(dir_sim, pattern = "macroprocesses_.*.tsv", full.names = TRUE))

The next step is to plot the increase in growth rate as a function of protein resources that have been saved. The prediction shows smaller increases than the growth rates estimated from the library data.

  • roughly 8% increase in µ for 10% reduction of protein cost
  • library data suggest roughly 40% increase for 4-5% reduction of protein cost
  • this is not realistic, most likely an artifact of calculation from NGS data
plot_cost_predict <- df_macr %>%
  mutate(carbon_source = recode(carbon_source,
    `for` = "formate", `succ` = "succinate", `fru` = "fructose")) %>%
  mutate(sim_run = as.numeric(sim_run)) %>%
  arrange(carbon_source, sim_run) %>%
  group_by(carbon_source, key) %>%
  mutate(reduced_protein_cost = 0:10) %>%
  filter(key == "mu") %>%
  
  xyplot(value ~ reduced_protein_cost, .,
    groups = carbon_source, aspect = 1,
    par.settings = custom.colorblind(),
    type = "b", lwd = 2, lty = 2, pch = 1,
    ylim = c(0.095, 0.125), col = stdcol[2:4],
    xlab = "freed up protein resources [% total]",
    ylab = expression("µ [h"^-1*"]"),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
      panel.key(..., cex = 0.7, corner = c(0.96, 0.96))
      panel.key(c("data", "model"), lines = TRUE, lty = c(1,2),
        pch = c(19, 1), col = grey(0.4),
        cex = 0.7, corner = c(0.96, 0.72))
    }
  )

print(plot_cost_predict)

9.4 Growth advantage of clean, in-frame hydrogenase deletion mutants

  • a selection of hydrogenase mutants was grown in bioreactors, batch mode
  • total volume was 50 mL minimal medium supplemented with 2 g/L fructose + 1 g/L NH4Cl
  • aeration: 100 mL/min for all tubes
  • temperature 30*C
  • OD 720nm was constantly measured every 15 min
  • biomass yield was determined by collecting all biomass at the end of the experiment and determining DCW
df_growth_assay <- read_csv("../data/growth_assays/df_mu_hox.csv", show_col_types = FALSE)
df_dcw_assay <- read_csv("../data/growth_assays/df_dcw_hox.csv", show_col_types = FALSE)
plot_order <- c("WT", "hoxG", "hypB", "hypX", "hoxA", "hoxH")

plot_growth_dcw <- df_dcw_assay %>%
  mutate(sample = factor(sample, plot_order)) %>%
  xyplot(yield_gDCW_gFRC ~ sample, .,
    par.settings = custom.colorblind(),
    col = stdcol[3], lwd = 2, ylim = c(-0.0, 0.6),
    xlab = "", ylab = expression("yield [g DCW g Frc"^-1*"]"),
    scales = list(x = list(rot = 45)),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, pch = 19, cex = 0.5, col = grey(0.6))
      panel.barplot(x, y, ...)
      panel.pvalue(x, y, offset = 0.12, ...)
    }
  )

plot_growth_mu <- df_growth_assay %>%
  mutate(sample = factor(sample, plot_order)) %>%
  xyplot(mu_max ~ sample, .,
    par.settings = custom.colorblind(),
    col = stdcol[3], lwd = 2, ylim = c(-0.0, 0.39),
    xlab = "", ylab = expression("µ [h"^-1*"]"),
    scales = list(x = list(rot = 45)),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, pch = 19, cex = 0.5, col = grey(0.6))
      panel.barplot(x, y, ...)
      panel.pvalue(x, y, offset = 0.08, ...)
    }
  )

print(plot_growth_dcw, position = c(0, 0, 0.5, 1), more = TRUE)
print(plot_growth_mu, position = c(0.5, 0, 1, 1), more = TRUE)
grid::grid.text(c("A","B"), x = c(0.03, 0.5), y = c(0.94, 0.94), gp = grid::gpar(cex = 1.2))

9.5 Hydrogen evolution of WT and hydrogenase mutant

  • an alternative hypothesis to protein cost by hydrogenases is energy cost
  • this hypothesis basically states that the enzymatic action of hydrogenases “running reverse” catalyzes the formation of H2 from NADH and protons (as known from cyanobacteria)
  • this reaction would drain the reductant pool (NADH(P)H) and lead to reduced growth
  • this reaction plays a role when the terminal electron acceptor, oxygen or one of the nitrogen oxides, are limiting
  • in order to determine if C. necator spills reductant in heterotrophic conditions through the action of hydrogenases, WT and a mutant defective in all hydrogenases were cultivated with fructose and glycerol to induce hydrogenase expression
  • without strong limitation of oxygen (which was not present in our bioreactor experiments), no hydrogen evolution was detectable
  • this result strengthens the protein cost hypothesis
df_h2_evolution <- read_csv("../data/growth_assays/h2_evolution.csv", show_col_types = FALSE) %>%
  mutate(time = ifelse(time == 20.5, 12.5, time))

plot_h2_evolution <- xyplot(mean ~ time,
    filter(df_h2_evolution, gas == "O2"),
    groups = strain,
    par.settings = custom.colorblind(),
    error_margin = filter(df_h2_evolution, gas == "O2")$sd,
    lwd = 2, pch = 1,
    xlab = "time [h]", ylab = expression("O"[2]*" [%]"),
    ylim = c(-2, 22),
    scales = list(x = list(at = seq(0, 10, 2))),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = 0, col = grey(0.9))
      panel.abline(v = seq(0, 10, 2), col = grey(0.9))
      panel.xyplot(x, y, type = "l", lty = 2, ...)
      panel.errbars(x, y, ewidth = 0, lty = 1, ...)
      panel.key(
        corner = c(0.9, 0.9),
        labels = c("∆hoxGHC,∆hofG", "WT", "O2", "H2"),
        col = c("#E7298A", "#66A61E", grey(0.5), grey(0.5)),
        pch = c(19, 19, 19, 1)
      )
    }
  ) + as.layer(
    xyplot(mean/100 ~ time,
      filter(df_h2_evolution, gas == "H2"),
      groups = strain, pch = 19, lty = 1,
      error_margin = filter(df_h2_evolution, gas == "H2")$sd/100,
      panel = function(x, y, ...) {
        panel.xyplot(x, y, type = "l", ...)
        panel.errbars(x, y, ewidth = 0, ...)
      }
    )
  )

print(plot_h2_evolution)

svg("../figures/figure_mutant_mu.svg", width = 7.5, height = 2.85)
print(plot_growth_dcw, position = c(-0.02, -0.034, 0.38, 1), more = TRUE)
print(plot_growth_mu, position = c(0.29, -0.034, 0.69, 1), more = TRUE)
print(plot_h2_evolution, position = c(0.61, 0, 1, 1), more = TRUE)
grid::grid.text(c("D","E", "F"), x = c(0.03, 0.35, 0.65),
  y = c(0.94, 0.94, 0.94), gp = grid::gpar(cex = 1.2))
dev.off()
null device 
          1 

9.6 Protein cost of secondary hydrogenases

This figure shows protein cost of the hox/hof genes that were not included in the protein cost analysis:

  • the 4th ‘actinobacterial’ hydrogenase, 1 subunit (hofG, PHG065)
Ralstonia_eutropha %>% ungroup %>%
  filter(grepl("PHG065", protein), substrate != "ammonium") %>%
  
  xyplot(mean_mass_fraction_norm*100 ~ factor(growthrate) | protein, .,
    groups = substrate, par.settings = custom.colorblind(),
    error_margin = .[["sd_massfraction"]]*100,
    lwd = 2, col = stdcol[2:4], ylim = c(-0.0005, 0.0045), pch = 19,
    xlab = "", ylab = "protein mass fraction [%]",
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.errbars(x, y, beside = TRUE, ...)
      panel.key(corner = c(0.02, 0.95), ...)
    }
  )

9.7 Protein cost of ETC complexes

  • as a complimentary information, we can determine the mass fraction of ETC complexes using MS data
  • these are membrane proteins and hence likely underestimated in comparison to soluble/cytoplasmic proteins
  • the reason is the lower efficiency when extracting membrane proteins
  • a relative comparison is nevertheless possible
  • should allow us to see the rough order of magnitude of ETC complexes
plots_etc_mass <- Ralstonia_eutropha %>%
  ungroup %>%
  select(locus_tag, mean_mass_fraction_norm, growthrate, protein, substrate, sd_massfraction) %>%
  inner_join(
    select(df_etc, locus_tag, gene_name) %>%
      distinct %>%
      mutate(gene_name = as.character(gene_name)),
    by = join_by(locus_tag)) %>%
  filter(substrate != "ammonium", growthrate == 0.1, !is.na(gene_name), !str_detect(gene_name, "H16_")) %>%
  mutate(operon = str_sub(gene_name, 1, 3)) %>%
  group_by(operon) %>%
  group_split() %>%
  lapply(function(df) {
    xyplot(mean_mass_fraction_norm*100 ~ factor(gene_name) | operon, df,
      groups = substrate, par.settings = custom.colorblind(),
      error_margin = df$sd_massfraction * 100,
      lwd = 2, col = stdcol[2:4], ylim = c(-0.005, 0.085), pch = 19,
      between = list(x = 0.5, y = 0.5), as.table = TRUE,
      scales = list(alternating = FALSE, x = list(rot = 90)),
      xlab = "", ylab = "protein mass fraction [%]",
      panel = function(x, y, ...) {
        panel.grid(h = -1, v = -1, col = grey(0.9))
        panel.errbars(x, y, beside = TRUE, ...)
        panel.key(corner = c(0.02, 0.95), ...)
      }
    )
  })

gridExtra::grid.arrange(
  ncol = 3, nrow = 2,
  plots_etc_mass[[5]],
  plots_etc_mass[[4]],
  plots_etc_mass[[6]],
  plots_etc_mass[[2]],
  plots_etc_mass[[3]],
  plots_etc_mass[[1]]
)

null device 
          1 

10 Export processed data

Cupriavidus_BarSeq_2023 <- df_fitness %>%
  rename(timepoint = time, fitness_score = norm_gene_fitness, t_stat = t) %>%
  mutate(
    induction = "not applicable",
    log2FC = replace(log2FC, is.infinite(log2FC), NA),
    fold_change = 2^log2FC,
    condition = paste0(substrate, " - ", condition)
  ) %>%
  select(-ID) %>%
  arrange(locus_tag, condition, timepoint) %>%
  left_join(by = "locus_tag",
    df_ref %>%
      select(
        locus_tag, new_locus_tag,
        uniprot, protein_name, gene_name,
        System, Process, Pathway
      )
  )

# export to RData
save(Cupriavidus_BarSeq_2023, file = "../data/barseq/Cupriavidus_BarSeq_2023.Rdata")

11 Session Info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] limma_3.60.3           directlabels_2024.1.21 tsne_0.1-3.1           vegan_2.6-6.1         
 [5] permute_0.9-7          stringi_1.8.4          colorspace_2.1-0       dendextend_1.17.1     
 [9] lubridate_1.9.3        forcats_1.0.0          stringr_1.5.1          dplyr_1.1.4           
[13] purrr_1.0.2            readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[17] ggplot2_3.5.1          tidyverse_2.0.0        latticetools_0.1.0     latticeExtra_0.6-30   
[21] lattice_0.22-6        

loaded via a namespace (and not attached):
 [1] gtable_0.3.5       xfun_0.45          tzdb_0.4.0         quadprog_1.5-8     vctrs_0.6.5       
 [6] tools_4.4.1        generics_0.1.3     parallel_4.4.1     fansi_1.0.6        cluster_2.1.6     
[11] pkgconfig_2.0.3    Matrix_1.6-5       RColorBrewer_1.1-3 lifecycle_1.0.4    farver_2.1.2      
[16] compiler_4.4.1     deldir_2.0-4       statmod_1.5.0      munsell_0.5.1      nlstools_2.1-0    
[21] pillar_1.9.0       crayon_1.5.3       MASS_7.3-60        Rtools_0.0.1       viridis_0.6.5     
[26] nlme_3.1-165       gtools_3.9.5       tidyselect_1.2.1   labeling_0.4.3     splines_4.4.1     
[31] cowplot_1.1.3      grid_4.4.1         cli_3.6.3          magrittr_2.0.3     utf8_1.2.4        
[36] withr_3.0.0        scales_1.3.0       bit64_4.0.5        timechange_0.3.0   jpeg_0.1-10       
[41] bit_4.0.5          interp_1.1-6       gridExtra_2.3      png_0.1-8          hms_1.1.3         
[46] knitr_1.47         viridisLite_0.4.2  mgcv_1.9-1         rlang_1.1.4        Rcpp_1.0.12       
[51] glue_1.7.0         rstudioapi_0.16.0  vroom_1.6.5        R6_2.5.1          
---
title: Investigating energy metabolism in *C. necator* using a barcoded transposon
  library
author: "Michael Jahn"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    theme: cosmo
    toc: yes
    number_sections: yes
---

# Description

This R notebook is a bioinformatics pipeline to analyze fitness data obtained from a barcoded transposon library in *Ralstonia eutropha* a.k.a. *Cupriavidus necator*. For background and details regarding the method, see [Wetmore at al., mBio, 2015](https://mbio.asm.org/content/6/3/e00306-15) and [Price et al., Nature, 2018](http://www.nature.com/articles/s41586-018-0124-0)).

One part of the data used here (growth on **fructose**, **succinate**, **formate**) was already published in a previous research article with focus on *C. necator*'s central carbon metabolism. The [preprint is available on BioRxiv](https://www.biorxiv.org/content/10.1101/2021.03.21.436304v2). The second part (autotrophic growth on **hydrogen**, anoxic growth with **nitrate respiration**) is new data that complements previous results. The focus of this pipeline is to study the energy metabolism of *C. necator* in diverse conditions, each of it challenging a different aspect/machinery of metabolism. This investigation tries to answer the following key questions:

1.  Which genes are constituting the essential parts of the **hydrogenases and formate dehydrogenases**, enabling autotrophic lifestyle? Many things about the structure and function of these enzymes is already known, where do these results show differences?
2.  Which genes are **mandatory as accessory proteins**, for example cofactor synthesis, insertion, and enzyme maturation?
3.  Why does **KO of hydrogenases provides a growth benefit** in non-autotrophic conditions?
4.  What are the essential parts of the **electron transport chain**? What complexes are used specifically for some growth conditions?
5.  Is essentiality/fitness correlated with protein abundance? Is there a dosage effect, such that KO of a high abundant iso-enzyme has stronger fitness effect compared to a low-abundant?

# Libraries

Optionally install required libraries, particularly custom packages from github.

```{r, eval = FALSE}
devtools::install_github("m-jahn/lattice-tools")
```

```{r}
suppressPackageStartupMessages({
  library(lattice)
  library(latticeExtra)
  library(latticetools)
  library(tidyverse)
  library(forcats)
  library(dendextend)
  library(colorspace)
  library(stringi)
  library(vegan)
  library(tsne)
  library(directlabels)
  library(limma)
})
```

# Data import and processing

Import the main data tables with gene fitness data for all conditions. Tables were obtained by processing sequencing data with a custom [BarSeq pipeline](https://github.com/m-jahn/rebar). The 32 generation sequencing samples are removed due to low diversity, leading to overall lower average read count per gene.

```{r, message = FALSE}
load("../data/barseq/20201222_fru_fitness.Rdata")
df_fitness_frc <- fitness_gene %>% ungroup %>%
  filter(condition != "long pulse", time != 32) %>%
  mutate(ID = as.numeric(ID), substrate = "fructose",
    condition = str_remove(condition, "short "))

load("../data/barseq/20210407_suc_for_fitness.Rdata")
df_fitness_suc <- fitness_gene %>% ungroup %>%
  separate(condition, sep = "_", into = c("substrate", "condition"))

load("../data/barseq/20210624_H2_NO3_fitness.Rdata")
df_fitness_no3 <- fitness_gene %>% ungroup %>%
  filter(condition %in% c("NO3")) %>%
  mutate(substrate = "nitrate", condition = "continuous")

load("../data/barseq/20210928_H2_fitness.Rdata")
df_fitness_h2 <- fitness_gene %>% ungroup %>%
  mutate(substrate = "hydrogen", condition = "pulse")

load("../data/barseq/20220420_lowCO2_NO3.Rdata")
df_fitness_lowco2 <- fitness_gene %>% ungroup %>%
  filter(condition == "NO3_batch") %>%
  mutate(substrate = "nitrate", condition = "pulse")

# merge fitness tables
df_fitness <- bind_rows(df_fitness_frc, df_fitness_suc, df_fitness_no3,
  df_fitness_h2, df_fitness_lowco2) %>%
  rename(locus_tag = locusId)
rm("df_fitness_frc", "df_fitness_suc", "df_fitness_h2", "df_fitness_no3",
   "df_fitness_lowco2", "fitness_gene")

# import genome annotation
df_ref <- read_csv("../data/ref/Ralstonia_H16_genome_annotation.csv",
  col_types = cols()) %>%
  filter(!duplicated(locus_tag)) %>%
  mutate(eggNOG_name = if_else(is.na(eggNOG_name), gene_name, eggNOG_name))

# define standard colors
stdcol <- custom.colorblind()$superpose.line$col
```

# Gene fitness analysis

## Depletion over time (generations)

We can plot log2 FC or normalized gene fitness over generations. For this type of overview it is best to summarize individual replicates (4x) to the mean or median, per time point and condition. We also add genome annotation to the summary table. The plots shows that depletion of some strains is so strong already at 8 generations that fitness/log2 FC could not be quantified for 16 generations due to missing read counts. It is best to focus on 8 generations as because it provides a more complete picture.

```{r, messages = FALSE}
nest_cols = c("time", "condition", "substrate")
df_ref_long <- bind_rows(df_ref, distinct(df_fitness %>% select(!!!nest_cols))) %>%
  complete(locus_tag, nesting(!!!lapply(nest_cols, sym))) %>%
  filter(!is.na(time)) %>% select(locus_tag, !!!nest_cols) %>%
  left_join(df_ref, by = "locus_tag")

df_fitness_summary <- df_fitness %>%
  group_by(locus_tag, scaffold, time, condition, substrate, strains_per_gene) %>%
  summarize(
    .groups = "drop",
    norm_gene_fitness_median = median(norm_gene_fitness, na.rm = TRUE),
    log2FC_median = median(log2FC, na.rm = TRUE),
    tstat_median = median(t, na.rm = TRUE)
  ) %>%
  full_join(df_ref_long) %>%
  mutate(cond_short = if_else(condition == "pulse", "P", "C")) %>%
  mutate(cond_short = paste0(substrate, " - ", cond_short) %>%
    factor(levels = paste0(
      rep(c("fructose", "succinate", "formate", "hydrogen", "nitrate"), each = 2),
      rep(c(" - P", " - C"), 5)
    )[-8])
  )
```

```{r, fig.width = 6.5, fig.height = 8}
plot_hist_log2FC <- df_fitness_summary %>%
  group_by(condition, substrate) %>%
  slice(1:1000) %>%
  
  xyplot(log2FC_median ~ time | substrate * condition, .,
    groups = locus_tag, as.table = TRUE,
    col = stdcol[1], alpha = 0.2, layout = c(5, 2),
    xlab = "", ylab = expression("log"[2]*" FC"),
    par.settings = custom.colorblind(), type = c("l"),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

plot_hist_normfg <- df_fitness_summary %>%
  group_by(condition, substrate) %>%
  slice(1:1000) %>%
  
  xyplot(norm_gene_fitness_median ~ time | substrate * condition, .,
    groups = locus_tag, as.table = TRUE,
    col = stdcol[2], alpha = 0.2, layout = c(5, 2),
    xlab = "generations", ylab = "fitness",
    par.settings = custom.colorblind(), type = c("l"),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_hist_log2FC, position = c(0,0.47,1,1), more = TRUE)
print(plot_hist_normfg, position = c(0,0,1,0.53))
```

```{r, echo = FALSE}
png("../figures/figure_depletion_over_time.png", width = 900, height = 1200, res = 150)
print(plot_hist_log2FC, position = c(0,0.47,1,1), more = TRUE)
print(plot_hist_normfg, position = c(0,0,1,0.53))
dev.off()
```

## Comparing gene fitness between conditions

Similarly to Figure 2 in [Wetmore et al, mBIO, 2015](https://mbio.asm.org/content/6/3/e00306-15), we can investigate condition-dependent gene fitness by comparing conditions and substrates one-on-one. The correlation between substrates and growth regimes (continuous, pulsed) is quite different. The strongest correlation is present in the pulsed vs continuous regime for each substrate (R = 0.64 to 0.87), and within the pulsed conditions the comparison between substrates (R = 0.62 to 0.72).

```{r, fig.width = 8, fig.height = 8, message = FALSE}
df_fitness_comp <- df_fitness_summary %>% filter(time == 8) %>%
  group_by(locus_tag) %>% mutate(tstat_median = min(tstat_median)) %>%
  select(locus_tag, cond_short, norm_gene_fitness_median, gene_name, COG_Process, tstat_median) %>%
  pivot_wider(names_from = cond_short, values_from = norm_gene_fitness_median) %>%
  filter(!is.na(locus_tag)) %>% ungroup

custom_splom(df_fitness_comp %>%
  select(matches(" - [CP]")), pch = 19, cex = 0.3, col = grey(0.4, 0.4))
```

```{r, echo = FALSE}
png("../figures/figure_splom_all_conditions.png", width = 1200, height = 1200, res = 150)
custom_splom(df_fitness_comp %>%
  select(matches(" - [CP]")), pch = 19, cex = 0.3, col = grey(0.4, 0.4)) %>% print
dev.off()
```

# Unsupervised clustering of genes by fitness

Most genes correlate in fitness value between conditions. That means, only some genes have a **condition-specific fitness effect**, i.e. increase or decrease fitness of the respective strain in one substrate or growth regime specifically. This section investigates specific genes and their functions that show such differential fitness between conditions. A simple comparison of one condition versus another will not be helpful as there two many possible combinations. To identify interesting sets of genes, cluster analysis can be performed. The first step is to generate a matrix from reformatted fitness data, and cluster genes by similarity using a generic clustering algorithm such as `hclust(method = "ward.D")`.

```{r}
# generate color palette for heatmap
heat_cols <- diverging_hcl(n = 7, h = c(255, 12), c = c(50, 80), l = c(20, 97), power = c(1, 1.3))

# create a matrix from wide fitness data for plotting heatmap
mat_heatmap <- df_fitness_comp %>% ungroup %>%
  filter(if_any(.cols = matches(" - [CP]"), ~ !between(., -2, 2))) %>%
  select(matches("locus_tag| - [CP]")) %>%
  # filter out NA rows, and replace extreme values
  drop_na %>% mutate(across(matches(" - [CP]"), 
    function(x) {y = replace(x, x > 6, 6); replace(y, y < -6, -6)})) %>%
  column_to_rownames(var = "locus_tag") %>%
  as.matrix

# create cluster for reordering
mat_cluster <- mat_heatmap %>% dist %>% hclust(method = "ward.D2")
mat_heatmap <- mat_heatmap[
  order.dendrogram(as.dendrogram(mat_cluster)),
  levels(df_fitness_summary$cond_short)
]
```

The second step is to find the optimal number of clusters for subdividing the data set. Silhouette analysis can be used for this purpose. However, the result is not entirely clear, but suggests a number of 5 to 10 clusters would give good separation. A wrapper for more or less fully automated silhouette analysis is available in my `Rtools` package (random collection of helper functions).

```{r, eval = FALSE}
devtools::install_github("m-jahn/r-tools")
```

```{r, fig.width = 4, fig.height = 4}
silhouette_result <- Rtools::silhouette_analysis(mat = mat_heatmap, n_clusters = 2:20, n_repeats = 10)
silhouette_result$plot_summary$par.settings = custom.colorblind()
silhouette_result$plot_summary
n_clusters = 7

svg(filename = "../figures/figure_silhouette.svg", width = 4, height = 4)
print(silhouette_result$plot_summary)
dev.off()
```

```{r, fig.width = 10, fig.height = 4.0, warning = FALSE}
# define a custom set of colors for clusters
custom_colors <- colorRampPalette(custom.colorblind()$superpose.line$col[c(2,3,1,5)])(n_clusters)

# plot heatmap
plot_heatmap <- levelplot(mat_heatmap[ ,rev(colnames(mat_heatmap))],
  par.settings = custom.colorblind(),
  col.regions = colorRampPalette(heat_cols)(16),
  at = seq(-6, 6, 1), aspect = "fill",
  xlab = "", ylab = "", scales = list(x = list(draw = FALSE)),
  panel = function(x, y, z, ...) {
    panel.levelplot(x, y, z, ...)
    panel.abline(h = 1:9+0.5, col = "white", lwd = 1.5)
  }
)

plot_cluster_dend <- mat_cluster %>% as.dendrogram %>%
  set("branches_k_col", custom_colors, k = n_clusters) %>%
  set("branches_lwd", 0.5) %>%
  as.ggdend %>%
  ggplot(labels = FALSE)

gridExtra::grid.arrange(
  plot_cluster_dend + 
    theme(plot.margin = unit(c(0.1, 0.05, -0.26, 0.085),"npc")),
  plot_heatmap,
  nrow = 2
)
```

```{r, warning = FALSE, echo = FALSE, fig.width = 8, fig.height = 3.5}
svg("../figures/figure_heatmap.svg", width = 8, height = 3.5)
gridExtra::grid.arrange(
  plot_cluster_dend + 
    theme(plot.margin = unit(c(0.12, 0.075, -0.29, 0.117),"npc")),
  plot_heatmap,
  nrow = 2
)
dev.off()
```

## Gene similarity, by cluster

The heat map clusters genes together that show similar expression over different conditions. Alternatively, we can use dimensionality reduction to identify clusters of genes that are similar, e.g. through **PCA**, **NMDS**, or **t-SNE**.

```{r, fig.width = 5, fig.height = 5}
# set a seed to obtain same pattern for stochastic methods
set.seed(123)
regex_moa <- "mo[aedbg]([A-Z0-9]+)?"
regex_for <- "fd[swho][A-Z0-9]+"
regex_hyd <- "hox[A-Z0-9]+|hyp[A-Z0-9]+"
regex_etc <- "(qcr|cco|cta|cox|cyo|cyd)([A-Z0-9]+)?"
regex_nit <- "na[rsp]([A-Z0-9]+)?|nir([A-Z0-9]+)?|nor([A-Z0-9]+)?|nos([A-Z0-9]+)?"
regex_all <- paste(c(regex_moa, regex_for, regex_hyd, regex_etc, regex_nit), collapse = "|")

# run t-SNE analysis
SNE <- tsne(mat_heatmap %>% dist, max_iter = 500)
df_tsne <- SNE %>% setNames(c("x", "y")) %>% as_tibble %>%
  mutate(locus = rownames(mat_heatmap)) %>%
  left_join(enframe(cutreeord(mat_cluster, k = n_clusters), "locus", "cluster")) %>%
  rename(locus_tag = locus) %>%
  left_join(select(df_fitness_comp, locus_tag, gene_name)) %>%
  mutate(
    gene_name = stri_extract_first_regex(gene_name, regex_all) %>%
    str_replace("mog", "mogA")
  )

plot_tsne <- df_tsne %>%
  xyplot(V2 ~ V1, .,
    groups = cluster, col = custom_colors,
    xlab = "tSNE 1", ylab = "tSNE 2",
    par.settings = custom.colorblind(),
    labels = .[["gene_name"]], selected = {!is.na(.[["gene_name"]])},
    panel = function(x, y, selected, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.repellabels(x, y, selected = selected, cex = 0.65,
        draw_box = TRUE, box_fill = grey(0.95, 0.5), ...)
      directlabels::panel.superpose.dl(x, y, ...)
    }
  )

print(plot_tsne)
```

## Functions enriched in different clusters

Extract cluster number for each gene and perform GO or KEGG enrichment per cluster. The goal is to identify the most prevalent biological functions per cluster.

```{r}
df_cluster <- cutreeord(mat_cluster, k = n_clusters) %>%
  enframe(name = "locus_tag", value = "cluster") %>%
  arrange(cluster)
```

We use the functions `kegga` for KEGG enrichment analysis and `goana` for GO term enrichment from the `limma` package. Both functions test for over or under-representation of genes associated with certain pathways or GO terms. The functions don't take the strength of differential fitness into account (DF; the depletion/enrichment over time).

```{r}
tryCatch(
  df_kegg_enrichment <- lapply(1:n_clusters, function(clust) {
    filter(df_cluster, cluster == clust) %>% pull(locus_tag) %>%
    kegga(species.KEGG = "reh") %>%
    mutate(cluster = clust)
  }) %>% bind_rows,
  error = function(e) print(e)
)
```

Visualize the pathways that are most enriched for the different gene clusters.

```{r, fig.width = 5, fig.height = 5}
plot_keggenrich <- df_kegg_enrichment %>%
  mutate(
    kegg_pathway = tolower(Pathway),
    kegg_pathway = str_remove_all(kegg_pathway, ".?biosynthesis| of| metabolism"),
    kegg_pathway = str_sub(kegg_pathway, 1, 32)
  ) %>%
  filter(N >= 5, N <= 200) %>%
  mutate(log10_p_value = log10(P.DE), .keep = "unused") %>%
  filter(log10_p_value < 0) %>%
  group_by(cluster) %>%
  arrange(log10_p_value) %>%
  slice(1:3) %>% rowwise() %>%
  mutate(
    kegg_pathway = str_remove(kegg_pathway, " - Cupriavidus necator H16"),
    kegg_pathway = str_glue("{kegg_pathway} ({cluster})")
  ) %>% ungroup %>%
  
  xyplot(factor(kegg_pathway, rev(unique(kegg_pathway))) ~ log10_p_value, .,
    groups = cluster,
    par.settings = custom.colorblind(),
    col = custom_colors,
    xlab = expression("log"[10]*" p-value"), ylab = "",
    panel = function(x, y, labels, ...) {
      panel.grid(h = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_keggenrich)
```

```{r, fig.width = 8, fig.height = 4.5, echo = FALSE}
svg(filename = "../figures/figure_tsne.svg", width = 8, height = 4.5)
print(plot_tsne, position = c(-0.02, 0 , 0.52, 1), more = TRUE)
print(plot_keggenrich, position = c(0.46, 0 , 1, 1), more = TRUE)
dev.off()
```
# Genes essential in all conditions

## Cluster 1

- before looking at condition-specific essentiality, can also look at genes essential in all conditions
- these genes are mostly present in cluster 1
- we assume that these are amino acid biosynthesis genes
- they become essential when transferring *C. necator* from rich to minimal/defined medium
- in total 33 genes with depletion 

Generalized function to plot (small) heatmaps.

```{r}
heatmap_fitness <- function(data, key = TRUE) {
  
  # make some small adjustments to optical appearance
  data %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%

  # plot heatmap
  levelplot(norm_gene_fitness_median ~ factor(gene_name) * fct_rev(cond_short), .,
    par.settings = custom.colorblind(), colorkey = key,
    col.regions = colorRampPalette(heat_cols)(25),
    at = seq(-6, 6, 0.5), aspect = "iso",
    xlab = "", ylab = "", scales = list(cex = 0.7, x = list(rot = 90)),
    panel = function(x, y, z, ...) {
      panel.levelplot(x, y, z, ...)
      panel.abline(v = seq_along(unique(x))+0.5, 
        h = seq_along(unique(y))+0.5, col = "white", lwd = 2)
    }
  )
}
```

Generalized function to plot genes for different regions of the genome.

```{r}
genome_plot <- function(df, xlim = NULL, title = "", rot_labels = 0) {
  df <-  replace_na(df, list(strains_per_gene = 0))
  theme = custom.colorblind(col = c(grey(0.7), grey(0.85)))
  theme$axis.line$col = grey(0.3)
  theme$axis.text$col = grey(0.3)
  theme$axis.text$cex = 0.7
  
  if (!is.null(xlim))
    xscale = list(limits = xlim)
  else 
    xscale = list()
  xyplot(end/1000 ~ start/1000, df,
    groups = strand, cex = 0.7, lwd = 1,
    par.settings = theme, strains = df$strains_per_gene,
    scales = list(y = list(draw = FALSE), x = xscale),
    ylim = c(-3,2), xlab = "", ylab = "",
    gene_strand = df[["strand"]],
    gene_name = df[["gene_name"]],
    panel = function(x, y, strains, ...) {
      panel.geneplot(x, y, arrows = TRUE, 
        tip = 0.1, rot_labels = rot_labels, ...)
      panel.text((x+y)/2, rep(0, length(x)), labels = strains, cex = 0.7)
      panel.text(mean(xlim), 1.5, labels = title, col = 1)
    }
  )
}
```

Plot heat map of cluster 1 gene fitness.

```{r, fig.width = 7.0, fig.height = 4.0}
plot_cl1 <- df_cluster %>%
  filter(cluster == 1) %>%
  left_join(df_fitness_summary, by = "locus_tag") %>%
  filter(time == 8) %>%
  group_by(COG_Process) %>%
  mutate(COG_n = n()) %>%
  ungroup() %>%
  arrange(desc(COG_n), COG_Process, gene_name) %>%
  mutate(gene_name = fct_inorder(gene_name)) %>%
  heatmap_fitness()

print(plot_cl1)
```
- Which functional categories (COG) do genes belong to?

```{r, fig.width = 7.0, fig.height = 2.5}
plot_cl1_function <- df_cluster %>%
  filter(cluster == 1) %>%
  left_join(df_fitness_comp, by = "locus_tag") %>%
  count(COG_Process) %>%
  arrange(desc(n)) %>%
  mutate(COG_Process = fct_inorder(COG_Process)) %>%
  mutate(vars = factor("COG process")) %>%
  xyplot(vars ~ n, .,
    stack = TRUE, horizontal = TRUE,
    auto.key = list(columns = 2),
    par.settings = custom.colorblind(),
    xlab = "count", ylab = "", lty = 0,
    xlim = c(0, 33),
    groups = COG_Process,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barchart(x, y, ...)
    }
  )

print(plot_cl1_function)
```

```{r, echo = FALSE, results = FALSE, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 6}
svg(filename = "../figures/figure_cl1.svg", width = 7, height = 6)
print(plot_cl1, position = c(0, 0, 1, 0.75), more = TRUE)
print(plot_cl1_function, position = c(-0.018, 0.62, 0.93, 1))
dev.off()
```



# Supervised analysis of energy metabolism

## Molybdenum cofactor biosynthesis

-   modAB = molybdenum transporter
-   moaACDE = molybdopterin backbone biosynthesis
-   moeA = molybdopterin molybdenum transferase
-   mobAB = molybdenum cofactor guanylyltransferase,
-   mog = molybdopterin biosynthesis protein (role unclear?)

```{r, fig.width = 7, fig.height = 4}
# plot heatmap
df_moco <- df_fitness_summary %>% filter(time == 8, grepl("mo[aedbg]", gene_name)) %>%
  filter(!str_detect(gene_name, "moaA2")) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "mo[aedbg]([A-Z0-9]+)?") %>%
    str_replace("mog", "mogA")) %>%
  mutate(gene_name = factor(gene_name, c("moaA", "moaC", "moaD", "moaE", "moaF",
    "modA", "modB", "modC", "mogA", "moeA1", "moeA2", "moeA3", "mobA", "mobB", "mobB2", "mobB3")))
plot_moco_fit <- heatmap_fitness(df_moco)

# plot genomic locations
df_moco <- df_moco %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_moco_g1 <- genome_plot(df_moco, xlim = c(2455.9, 2458.6), title = "chr 1")
plot_moco_g2 <- genome_plot(df_moco, xlim = c(2787.8, 2791.2), title = "chr 1")

print(plot_moco_fit, position = c(0,0.21,1,1.05), more = TRUE)
print(plot_moco_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_moco_g2, position = c(0.53,-0.1,0.83,0.4))
```

Interesting results to double-check:

-   Why is moaA essential for nitrate respiration but not formate? Is there another iso-enzyme for first step of MoCo synthesis?
-   mogA, moaA, moaB, moaE, modB, or modC deletion mutants are not viable in E.coli. Except for moaA, also modA,B,C are not essential for growth on formate.
-   modABC is the 3-subunit membrane transport system; it should be essential, however one explanation for it not being essential is that another low-affinity transporter might exist that compensates for the loss in medium with high Mo-concentration ([Xia et al., 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6295455/)). The authors write in the discussion: "Molybdate can be taken up by the sulfate-transport system in the absence of a functional high-affinity molybdate-transport system in E. coli and B. japonicum (Rosentel et al., 1995; Delgado et al., 2006)."
-   moeA (molybdenum transferase) is present with 3 isoenzymes. MoeA1 and A2 both seem to be important for fitness as a KO of each one of them reduces fitness partially but not fully on formate/nitrate.

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_fit_moco.svg", width = 7, height = 4)
print(plot_moco_fit, position = c(0,0.21,1,1.05), more = TRUE)
print(plot_moco_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_moco_g2, position = c(0.53,-0.1,0.83,0.4))
dev.off()
```

## Formate dehydrogenase

-   fdsABCDG = primary soluble FDH
-   fdwAB = secondary soluble FDH
-   fdhABCD = primary membrane-bound FDH, 2 copies A1B1... and A2B2...
-   fdoGHI = secondary membrane-bound FDH

```{r, fig.width = 7, fig.height = 4}
# plot heatmap
df_fdh <- df_fitness_summary %>% filter(time == 8, grepl("fd[swho]", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "fd[swho][A-Z0-9]+")) %>%
  mutate(gene_name = recode(gene_name, `fdhD` = "fdsC", `fdhC` = "fdhC1", `fdhE` = "fdhE1")) %>%
  mutate(gene_name = factor(gene_name, c(paste0("fds", c("A", "B", "C", "D", "G", "R")),
    "fdwA", "fdwB", paste0("fdh", c("A1", "B1", "C1", "D1", "E1", "A2")), "fdoG")))
plot_fdh_fit <- heatmap_fitness(df_fdh)

# plot genomic locations
df_fdh <- df_fdh %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_fdh_g1 <- genome_plot(df_fdh, xlim = c(677.7, 685.6), title = "chr 1")
plot_fdh_g2 <- genome_plot(df_fdh, xlim = c(3167.9, 3174.9), title = "chr 1")

print(plot_fdh_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_fdh_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_fdh_g2, position = c(0.52,-0.1,0.83,0.4))
```

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_fit_fdh.svg", width = 7, height = 4)
print(plot_fdh_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_fdh_g1, position = c(0.17,-0.1,0.57,0.4), more = TRUE)
print(plot_fdh_g2, position = c(0.52,-0.1,0.83,0.4))
dev.off()
```

## Hydrogenases

-   hoxA = master regulator, 2-component system. HoxA is constantly active in its non- phosphorylated state below a temperature of 33\*C. Originally, hoxA activity is regulated by the kinase HoxJ (together with HoxBC, the 'regulatory' hydrogenase). This function is inactivated due to a mutation in hoxJ, [Friedrich & Friedrich, J Bac, 1983](https://journals.asm.org/doi/10.1128/jb.153.1.176-181.1983), [Lenz et al., JMMB, 2004](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.334.5330&rep=rep1&type=pdf).
-   hoxKGZ = membrane-bound hydrogenase
-   hoxMLOQRTV = MBH accessory genes (cofactor incorporation, subunit assembly, and twin-arginine-dependent membrane translocation - TAT), [Schubert et al., Mol Microbio, 2007](https://pubmed.ncbi.nlm.nih.gov/17850259/), [Fritsch et al., J Bac, 2011](https://pubmed.ncbi.nlm.nih.gov/21441514/) [Schwartz et al., J Mol Bio, 2003](https://pubmed.ncbi.nlm.nih.gov/12948488)
-   hoxN1, hoxN2 = Nickel/Cobalt transporters. HoxN2 51% identical on AA level but not functional, [Schwartz et al., J Mol Bio, 2003](https://pubmed.ncbi.nlm.nih.gov/12948488)
-   hoxBC = regulatory hydrogenase RH, oxygen-resistant non-energy generating hydrogen sensor. HoxC is large active-site-containing subunit, hoxB small subunit HoxB carrying Fe-S clusters
-   HoxJ = RH binding signal transducer, histidine protein kinase [Löscher et al., Chem Phys Chem, 2010](https://pubmed.ncbi.nlm.nih.gov/20340124/) [Buerstel et al., PNAS, 2016](https://pubmed.ncbi.nlm.nih.gov/27930319/)
-   hoxFUYH = soluble hydrogenase
-   hoxW = soluble hydrogenase (HoxH-specific) C-terminal protease (maturation)
-   hoxI = optional subunit of SH (hoxFUYH); specific reaction site for NADPH, [Burgdorf et al., J Bac, 2005](https://pubmed.ncbi.nlm.nih.gov/15838039/)
-   a 4th hydrogenase exist on pHG1, made from 2 subunits (*hofK*, *hofG*, pHG064/065), see [Schäfer et al., 2015](https://www.cell.com/structure/pdfExtended/S0969-2126(15)00499-2).
-   these subunits have no effect on fitness (high coverage of 9-11 mutants per gene), and are very lowly expressed (PHG064 n.d., PHG065 low abundance)

```{r, fig.width = 8, fig.height = 4}
# plot heatmap
df_hyd <- df_fitness_summary %>%
  filter(time == 8, grepl("hox", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hox[A-Z0-9]+")) %>%
  ungroup %>% mutate(gene_name = factor(gene_name, paste0("hox", c("K","G","Z","M",
   "L","O","Q","R","T","V","A","B","C","J","N","N2","F","U","Y","H","W","I"))))
plot_hyd_fit <- heatmap_fitness(df_hyd)

# plot genomic locations
df_hyd <- df_hyd %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_hyd_g1 <- genome_plot(df_hyd, xlim = c(0, 4.1), title = "pHG1")
plot_hyd_g2 <- genome_plot(df_hyd, xlim = c(15.1, 22.8), title = "pHG1")
plot_hyd_g3 <- genome_plot(df_hyd, xlim = c(79.2, 85.5), title = "pHG1")

print(plot_hyd_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyd_g1, position = c(0.15,-0.1,0.39,0.4), more = TRUE)
print(plot_hyd_g2, position = c(0.35,-0.1,0.65,0.4), more = TRUE)
print(plot_hyd_g3, position = c(0.61,-0.1,0.90,0.4))
```

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_fit_hyd.svg", width = 8, height = 4)
print(plot_hyd_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyd_g1, position = c(0.15,-0.1,0.39,0.4), more = TRUE)
print(plot_hyd_g2, position = c(0.35,-0.1,0.65,0.4), more = TRUE)
print(plot_hyd_g3, position = c(0.61,-0.1,0.90,0.4))
dev.off()
```

## Hydrogenase pleiotropy genes (maturation)

-   hypA1,B1,F1,C1,D1,E1 = Ni-Fe metallocenter formation proteins for both MBH and SH. "HypC1, hypD1, and hypE1 are essential for SH and MBH maturation, while only one intact copy each of hypA, hypB, and hypF is needed", [Schwartz et al., J Mol Bio, 2003](https://pubmed.ncbi.nlm.nih.gov/12948488)
-   hypA and B
-   hypCDEF all form a complex to assemble the Ni-Fe-CO-CN cofactor
-   hypX = CO cofactor formation for NiFe-hydrogenase. HypX converts formyl-THF into water and CO in aerobic conditions, [Schulz et al., JACS, 2020](https://doi.org/10.1021/jacs.9b11506)
-   hypA2,B2,F2 = alternative genes for hypA1,B1,F1, not sufficient alone for hydrogenase formation
-   hypF3,hypC2,hypD2,hypE2,hypA3,hypB3. The physiological function of these extra hyp genes is unclear

```{r, fig.width = 7, fig.height = 4}
df_hyp <- df_fitness_summary %>% filter(time == 8, grepl("hyp", gene_name)) %>%
  # correct names of several hyp genes to make it more systematic
  mutate(gene_name = gsub("hypA PHG094", "hypA2 PHG094", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hyp[A-Z0-9]+") %>%
    recode(hypA = "hypA1", hypB = "hypB1", hypC = "hypC1", hypD = "hypD1", hypE = "hypE1")) %>%
  ungroup %>% mutate(gene_name = factor(gene_name, paste0("hyp", c("A1","B1","C1","D1","E1",
    "F1","X","A2","B2","C2","D2","E2","F2","A3","B3","F3"))))
plot_hyp_fit <- heatmap_fitness(df_hyp)

# plot genomic locations
df_hyp <- df_hyp %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_hyp_g1 <- genome_plot(df_hyp, xlim = c(8.3, 15.6), title = "pHG1")
plot_hyp_g2 <- genome_plot(df_hyp, xlim = c(66.3, 73.5), title = "pHG1")
plot_hyp_g3 <- genome_plot(df_hyp, xlim = c(85.0, 89.7), title = "pHG1")

print(plot_hyp_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyp_g1, position = c(0.26,-0.1,0.78,0.4), more = TRUE)
```

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_fit_hyp.svg", width = 7, height = 4)
print(plot_hyp_fit, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_hyp_g1, position = c(0.26,-0.1,0.78,0.4), more = TRUE)
dev.off()
```

## Other complexes important in energy metabolism

-   **succinate dehydrogenase**
    -   sdhABCD = membrane integral succinate dehydrogenase, reducing UQ to UQH2: **all of its genes are strictly essential**, no KO mutants available
-   **NADH dehydrogenase**
    -   nuoABCDEFGHIJKLMN = membrane integral NADH dehydrogenase, oxidizes NADH, pumps 4 H+, reduces UQ to UQH2: **all of its genes are strictly essential**, no KO mutants available

## Electron transport chain

Cytochrome e- transport proteins, cytochrome reductases, and terminal oxidases.

-   **quinol oxidases**
    -   cyoA1B1C1D1, ... (3 copies) = cytochrome bo3 complex, quinole oxidase, low O2 affinity
    -   cydA1B1, ... (2 copies) = cytochrome bd complex, quinole oxidase, high O2 affinity
-   **cytochrome reductases**
    -   qcrABC (aka petABC, `H16_A3398`, `H16_A3397`, `H16_A3396`) = cytochrome bc1 reductase
-   **cytochrome C e- carriers**
    -   cytochrome c553 according to uniprot: `H16_A3451`, `H16_A0830`, `H16_A0846`, `H16_B1452`, `H16_A3576`, `H16_A1121`, `H16_A1120`. According to String DB, `H16_A3576` is in same operon as thiosulphate oxidation pathway (SoxXYZABCD), where thiosulfate is used as electron source [Zhang et al., 2020](https://www.nature.com/articles/s41396-020-0684-5)
-   **terminal cytochrome C oxidases**
    -   coxMNOPQ = bb3 complex, affinity unknown, not essential for growth or N-fixation in B. japonycum where it was discovered.
    -   ctaABCDEG = aa3 complex, low O2 affinity
    -   ccoGNOP = cbb3 complex, high affinity, high similarity to NO3 reductase
    -   VP2641 (`H16_A1031`) = Membrane spanning protein, RDD family. PFAM: unknown function, potential transporter? Interacts with cytochrome c553 and qcrC, cytochrome reductase. It's depletion implicates this protein in formate-specific e- transport in the membrane.

```{r, fig.width = 10, fig.height = 4}
df_etc <- df_fitness_summary %>% filter(time == 8, locus_tag %in% c("H16_A3451", "H16_A0830",
    "H16_A0846", "16_B1452", "H16_A3576", "H16_A1121", "H16_A1120", "H16_A1031") | 
    grepl("qcr|cco|cta|cox[MNOPQ] |cyo|cyd", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "H16_[AB][0-9]+|(qcr|cco|cta|cox|cyo|cyd)([A-Z0-9]+)?") %>%
    factor(c(
      paste0("cyo", c("A1","A2","A3","B1","B2","B3","C1","C2","D1","D2")),
      paste0("cyd", c("A1","B1","B2")), paste0("qcr", c("A","B","C")),
      paste0("H16_", c("A3576","A0846","A1120","A1121","A1031","A3451")),
      paste0("cox", c("M","N","O","P","Q")), paste0("cta", c("A","B","C","D","E","G")),
      paste0("cco", c("G","N","O","P"))
    ))
  )
plot_etc_fit <- heatmap_fitness(df_etc)

# plot genomic locations
df_etc <- df_etc %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_etc_g1 <- genome_plot(df_etc, xlim = c(3675.2, 3679.0), title = "chr 1")
plot_etc_g2 <- genome_plot(df_etc, xlim = c(361.2, 371.2), title = "chr 1")
plot_etc_g3 <- genome_plot(df_etc, xlim = c(2509.5, 2515.2), title = "chr 1")

print(plot_etc_fit, position = c(0,0.17,1,1.05), more = TRUE)
print(plot_etc_g1, position = c(0.1,-0.1,0.31,0.4), more = TRUE)
print(plot_etc_g2, position = c(0.28,-0.1,0.65,0.4), more = TRUE)
print(plot_etc_g3, position = c(0.62,-0.1,0.9,0.4))
```

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_fit_etc.svg", width = 10, height = 4)
print(plot_etc_fit, position = c(0,0.17,1,1.05), more = TRUE)
print(plot_etc_g1, position = c(0.1,-0.1,0.31,0.4), more = TRUE)
print(plot_etc_g2, position = c(0.28,-0.1,0.65,0.4), more = TRUE)
print(plot_etc_g3, position = c(0.62,-0.1,0.9,0.4))
dev.off()
```

## Nitrate respiration

**Nitrate reductase (NAR)**

Reaction: HNO3- + QH2 --\> NO2- + H2O + Q.

Two copies of the *nar* operon, one on the megaplasmid and one on chromosome 2. Several other enzymes catalyze similar reactions as NAR. NAS is cytoplasmic and NADH dependent *assimilatory* nitrate reductase [Cramm R, 2008](https://pubmed.ncbi.nlm.nih.gov/18957861/). NAP is *periplasmic* membrane protein complex accepting QH2, involved in nitrate respiration but probably also dissipation of excess redox power. Here we just focus on the main NAR enzyme.

```{r, fig.width = 6, fig.height = 4}
df_nar <- df_fitness_summary %>%
  filter(time == 8, grepl("na[rsp]", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "na[rsp]([A-Z0-9]+)?") %>%
    recode(narG = "narG1", narH = "narH1", narI = "narI1", narL = "narL1", narX = "narX1"))
plot_nar_fit <- heatmap_fitness(df_nar)

# plot genomic locations
df_nar_genes <- df_nar %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nar_g1 <- genome_plot(df_nar_genes, xlim = c(279.5, 294.5), title = "pHG1")

print(plot_nar_fit, position = c(0,0.26,1,1.05), more = TRUE)
print(plot_nar_g1, position = c(0.2,-0.1,0.85,0.4))
```

**Nitrite reductase (NIR)**

Reaction: NO2- + QH2 --\> NO + H2O + Q

NIR is a cd1 type copper containing terminal oxidase, homodimer of 2 nirS subunits. The genes for nitrite reductase are assembled in a cluster (nirS, -C, -F, -D, -G, -H, -J, -N, -E) spanning 8.6 kb on chromosome 2. The first gene of this cluster, nirS, is the structural gene encoding cd1 -NIR. The re-maining genes are presumed to be accessory genes essential for maturation of active nitrite reductase (?).

```{r, fig.width = 6, fig.height = 4}
df_nir <- df_fitness_summary %>%
  filter(time == 8, grepl("nir", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nir([A-Z0-9]+)?"))
plot_nir_fit <- heatmap_fitness(df_nir)

# plot genomic locations
df_nir_genes <- df_nir %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nir_g1 <- genome_plot(df_nir_genes, xlim = c(2576.3, 2585.9), title = "chr 2")

print(plot_nir_fit, position = c(0,0.26,1,1.05), more = TRUE)
print(plot_nir_g1, position = c(0.3,-0.1,0.75,0.4))
```

**NO reductase (NOR)**

Reaction: NO + QH2 --\> N2O + H2O + Q

From [Cramm R, 2008](https://pubmed.ncbi.nlm.nih.gov/18957861/): "The catalytic sub-unit (NorB) of all NORs that have been characterized to date contains one high-spin heme b and a binuclear catalytic center of a low-spin heme b and a non-heme iron. The best investigated NORs are heterodimers that contain, in addition to NorB, a heme c-containing subunit NorC which channels lectrons from the physiological electron donor cytochrome c to NorB."

```{r, fig.width = 6, fig.height = 4}
df_nor <- df_fitness_summary %>%
  filter(time == 8, grepl("nor", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nor([A-Z0-9]+)?"))
plot_nor_fit <- heatmap_fitness(df_nor)

# plot genomic locations
df_nor_genes <- df_nor %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nor_g1 <- genome_plot(df_nor_genes, xlim = c(256.4, 262.1), title = "pHG1")

print(plot_nor_fit, position = c(0,0.25,1,1.05), more = TRUE)
print(plot_nor_g1, position = c(0.3,-0.1,0.75,0.4))
```

**NOS (nitrous oxide) reductase**

Reaction: N2O + QH2 --\> N2 + H2O + Q

The NOS of R. eutropha H16 has not been investigated on the biochemical level. Genes for NOS are located adjacent to the nor genes on pHG1. NosC (PHG253) might be the cyt c that is the e- donor for NOS. The nosC KO is enriched on formate, i.e. grows better than WT. Is nosC taking e- away from formate-fed ETC?

```{r, fig.width = 6, fig.height = 4}
df_nos <- df_fitness_summary %>%
  filter(time == 8, grepl("nos", gene_name)) %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "nos([A-Z0-9]+)?"))
plot_nos_fit <- heatmap_fitness(df_nos)

# plot genomic locations
df_nos_genes <- df_nos %>% group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop")
plot_nos_g1 <- genome_plot(df_nos_genes, xlim = c(261.1, 271.9), title = "pHG1")

print(plot_nos_fit, position = c(0,0.25,1,1.05), more = TRUE)
print(plot_nos_g1, position = c(0.22,-0.1,0.85,0.4))
```

The following section simply plots all nitrate respiration modules on one canvas.

```{r, fig.width = 10, fig.height = 4}
plot_nitrate_all <- bind_rows(df_nar, df_nor, df_nos, df_nir) %>%
  mutate(gene_name = factor(gene_name, c(
    paste0("nar", c("X1", "L1", "K1", "K2", "G1", "H1", "I1", "X2", "L2", "G2", "H2", "I2", "K3", "K5")),
    paste0("nap", c("A", "B", "C", "D")), paste0("nas", c("D", "E")),
    paste0("nir", c("S", "C", "F", "D", "J", "N", "E")),
    paste0("nor", c("A", "B", "R1", "A2", "B2", "R2")),
    paste0("nos", c("L", "Y", "F", "D", "R", "Z", "C", "X"))
  ))) %>%
  heatmap_fitness()

print(plot_nitrate_all, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_nar_g1, position = c(0,-0.1,0.35,0.4), more = TRUE)
print(plot_nir_g1, position = c(0.33,-0.1,0.58,0.4), more = TRUE)
print(plot_nor_g1, position = c(0.56,-0.1,0.72,0.4), more = TRUE)
print(plot_nos_g1, position = c(0.70,-0.1,0.96,0.4), more = TRUE)
```

```{r, echo = FALSE, results = 'hide'}
svg(filename = "../figures/figure_nitrate_resp.svg", width = 10, height = 4)
print(plot_nitrate_all, position = c(0,0.23,1,1.05), more = TRUE)
print(plot_nar_g1, position = c(0,-0.1,0.35,0.4), more = TRUE)
print(plot_nir_g1, position = c(0.33,-0.1,0.58,0.4), more = TRUE)
print(plot_nor_g1, position = c(0.56,-0.1,0.72,0.4), more = TRUE)
print(plot_nos_g1, position = c(0.70,-0.1,0.96,0.4), more = TRUE)
dev.off()
```

## Comparison of nitrate respiration conditions

-   we have two different nitrate respiration conditions, continuous and pulsed
-   low depletion of nitrate/nitrite/nitrous respiration mutants probably due to constant NO3 supply

```{r, fig.width = 6, fig.height = 6}
df_fitness_summary %>%
  filter(substrate == "nitrate", time == 8, !is.na(gene_name)) %>%
  select(gene_name, condition, norm_gene_fitness_median) %>%
  pivot_wider(names_from = condition, values_from = norm_gene_fitness_median) %>%
  mutate(
    dfit = pulse - continuous,
    significant = !between(dfit, quantile(dfit, probs = c(0.003), na.rm = TRUE),
      quantile(dfit, probs = c(0.997), na.rm = TRUE))) %>%
  
  # plot
  xyplot(continuous ~ pulse, data = ., groups = significant,
    labels = .[["gene_name"]],
    par.settings = custom.colorblind(), col = custom_colors,
    xlim = c(-9, 3), ylim = c(-9, 3),
    xlab = "Nitrate respiration (P)", ylab = "Nitrate respiration (C)",
    panel = function(x, y, labels, ...) {
      panel.grid(h = -1, col = grey(0.9))
      panel.abline(a = 0, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.abline(a = 4, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.abline(a = -4, b = 1, col = grey(0.5), lty = 2, size = 0.8)
      panel.xyplot(x, y, ...)
      panel.repellabels(x, y, labels, cex = 0.6,
        selected = .[["significant"]],
        draw_box = TRUE, box_fill = "white", box_line = TRUE, ...)
    }
  )
```

## Non-hydrogenase genes important for H2 fitness

Different gene clusters appear based on the fitness pattern in different conditions. Clusters specific for lithoautotrophic metabolism are particularly interesting. We can add the cluster IDs to the main data frame, and investigate fitness. Or simply select genes that show depletion only for the hydrogen metabolizing condition.

```{r}
df_fitness_summary <- df_fitness_summary %>%
  left_join(by = "locus_tag",
    enframe(cutreeord(mat_cluster, k = n_clusters), "locus_tag", "cluster")
  )
```

```{r, fig.width = 8, fig.height = 4.5}
df_fitness_summary %>%
  filter(time == 8, !str_detect(gene_name, "hox|hyp")) %>%
  mutate(hyd_essential = case_when(
    substrate == "hydrogen" & norm_gene_fitness_median <= -2 ~ TRUE,
    substrate != "hydrogen" & norm_gene_fitness_median >= -2 ~ TRUE,
    TRUE ~ FALSE
  )) %>% 
  group_by(locus_tag) %>%
  filter(sum(hyd_essential) == 9 |
    locus_tag %in% c("H16_A0325", "H16_A0326")) %>%
  ungroup %>% arrange(locus_tag) %>%
  mutate(gene_name = factor(gene_name, unique(gene_name))) %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%
  heatmap_fitness
```

-   **ptsI + ptsH** - according to databases, a clear hit for a PEP-dependent sugar PTS system. This can't be true for *C. necator* otherwise it would not show the familiar pattern of depletion on hydrogen and growth benefit on other substrates. This PTS system must be implicated in hydrogen metabolism, probably a transporter? The genes downstream of ptsIH do not seem to be affected (H16_A0327 to H16_A0329).
-   **hldD** (H16_A0804) - ADP-L-glycero-beta-D-manno-heptose-6-epimerase, involved in LPS sugar phosphate synthesis. No obvious connection to hydrogen metabolism.
-   **glnE** (H16_A1127) - Bifunctional glutamine synthetase adenylyltransferase. Involved in the regulation of glutamine synthetase GlnA. Transfers or removes adenylyl-residues to and from GlnA, thereby regulating N uptake through GlnA. No obvious connection to hydrogen metabolism. Only 1 mutant in library.
-   **H16_A2692** - uncharacterized membrane protein. Is neutral in all other conditions, just as are all surrounding genes (H16_A269[0-9]). Small, 141 AA, potential carbonic anhydrase or transporter?
-   **bolA** (H16_A3419) - Predicted transcriptional regulator, no known domains or published functions. Only 1 mutant in library.
-   **yadG** (H16_A3421) Separated by only 1 gene (yadH, H16_A3420) from the previous hit, shows very specific depletion for hydrogen. 7-8 mutants quantified, annotated as ABC transporter, ATPase function. YadH is the permease partner of the 2 subunits, however does not show depletion. COG process 'defense' because it supposedly is active as anti-drug efflux pump.
-   **H16_B1603** - ABC-type transporter, ATPase coupled transmembrane transporter. Strong depletion on hydrogen only. All surrounding genes are neutral to fitness. Only 1 mutant in library. A candidate for a CO2/HCO3- transporter? Aligns best with a lot of different ABC transporters, with spermidine/putrescine transporters best hits.

## CO2 transporter/carbonic anhydrase homologs

All CAs with quantified fitness are neutral during lithoautotrophic growth (H16_A1192 - *cag*, H16_B2403 - *caa*, H16_B2270 - *can2*). Only H16_A0169 (*can*) is strictly essential, no mutants are available. This is different to the results from [Gai et al., AMB Express, 2014](https://link.springer.com/article/10.1186/2191-0855-4-2) that found two essential CAs, *can* and *caa*:

"All four purified CAs were capable of performing the interconversion of CO2 and HCO3-- [...] Deletion of *can* affected the growth of R. eutropha; however the growth defect could be compensated by adding CO2 to the culture. Deletion of *caa*, encoding an alpha-CA, had the strongest deleterious influence on cell growth.".

Also [Kusian et al., J Bac, 2002](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135314/) report the essentiality of *can* for lithoautotrophic growth. *Caa* fitness was quantified in the library with 11 mutants, yet does not show any phenotype on any substrate. The results in Gai et al. are somewhat contradictory, as the *caa* completion test in a quadruple-CA mutant had no beneficial effect on growth.

Altogether, no specific bicarbonate transporters are annotated in *C. necator*. However, important bicarbonate (HCO3+) or CO2 transporters known from e.g. cyanobacteria (*Synechocystis* sp. PCC 6803) could be used to search for homologous genes in *C. necator*. The following CO2/HCO3+ transporters are known in *Synechocystis*:

-   **BicA** (sll0834): constitutive high-flux, low-affinity Na+/HCO3+ symport
-   **SbtA** (slr1512): inducible low-flux, high-affinity Na+/HCO3+ symport
-   **BCT1** (cmpABCD operon/slr0040-44): inducible high-affinity, ATP-dependent uniport
-   **NDH** (nuo operon/): all genes of this operon are essential in *C. necator*, no cond. fitness

The first three of these genes were blasted against the genome of *C. necator*. The figure shows alignment scores for the top 20 hits.

```{r, fig.width = 6.5, fig.height = 3}
df_co2_transport_align <- read_tsv("../data/alignments/CO2_transporters.tsv", col_types = cols()) %>%
  group_by(target_name) %>%
  mutate(rank = seq_along(uniprot))
  
  xyplot(score + -log10(e_value) ~ rank | target_name, df_co2_transport_align,
    par.settings = custom.colorblind(), xlim = c(-1, 21),
    xlab = "rank", ylab = expression("-log"[10]*" e value, score"),
    between = list(x = 0.5, y = 0.5), lwd = 0,
    scales = list(alternating = FALSE),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.barplot(x, y, ewidth = 0.5, fill_alpha = 0.7, ...)
      panel.key(..., points = FALSE, corner = c(0.97, 0.97))
    }
  )
```

BicA and cmpB seem to have some very similar homologs in the *C. necator* genome. However, none of these homologs has a clear phenotype related to hydrogen metabolism.

```{r, fig.width = 9.5, fig.height = 3.5}
df_co2_transport_align %>%
  filter(rank <= 20) %>% 
  select(uniprot, target_name, rank) %>%
  inner_join(df_fitness_summary, by = "uniprot") %>% ungroup %>%
  mutate(norm_gene_fitness_median = norm_gene_fitness_median %>%
    replace(., . < -6, -6) %>% replace(., . > 6, 6)) %>%
  heatmap_fitness
```

## Main conclusions

-   MoCo cofactor is important for growth on formate *and* nitrate respiration, but not for growth on hydrogen
-   Main FDH is S-FDH both in terms of protein abundance, inducibility, and essentiality
-   Alternative S-FDH operon *fdw* does not play a roll
-   Both hydrogenases, S-H and m_H have similar fitness penalty on hydrogen when KOed
-   Hydrogenase genes, but even more so master regulator hoxA and Hydrogenase pleiotropy (hyp) genes confer fitness advantage when KOed on all non-loithoautotrophic substrates
-   Cytochrome reductase bc1 complex (qcrABC) is particularly important for growth on formate
-   Bc1 complex plays no role on nitrate, therefore no cross talk (while a bc1 KO mutant was growth-deficient in oxygen-limited growth on nitrate (Garg et al., 2018).
-   Bc1 complex accepts reduced quinols as e- donor, pumps 4 H+ /2 e− via proton-motive Q cycle (quinol:cyt c oxidoreduction), and releases reduced cytochrome C e- carriers. Main cytochrome complex for most substrates including hydrogen
-   Cbb3-type cytochrome oxidase (*cco* operon) also important for growth on formate, while *cta* operon (aa3 complex) more important on fructose and succinate

# Central carbon metabolism

## gene-reaction relationship

To show (conditional) essentiality of genes which are related to central metabolic reactions, we need to obtain a table of gene-reaction relationships.
Such a table was used in our eLife paper, [Jahn et al., 2021](https://elifesciences.org/articles/69019), and was exported from the curated genome scale model of *C. necator*.

- download gene-reaction table
- add some genes manually that are not metabolic reactions but interesting regulators
- we add: 
  - `cbbR` (`H16_B1396`) cbb operon transcriptional activator
  - `regA` (`H16_A0202`) response regulator, 2-comp. system
  - `regB` (`H16_A0203`) kinase, 2-comp. system
  - `rpoN` (`H16_A0387`) Sigma 54 RNA-polymerase sigma factor
  - `hoxA` (`PHG019`) hoxA master regulator

```{r}
url_reactions <- url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/raw/main/data/input/model_reactions.csv")
df_reactions <- read_csv(url_reactions, show_col_types = FALSE) %>%
  rename(locus_tag = genes) %>%
  add_row(reaction_id = "Regulators", locus_tag = "H16_B1396, H16_A0202, H16_A0203, H16_A0387, PHG019")

df_reactions %>%
  filter(!str_detect(reaction_id, "^EX_")) %>%
  head()
```

## Plot conditional fitness broken down by CCM pathway

- use the same set of enzymes as in [Jahn et al., 2021](https://elifesciences.org/articles/69019)
- using these sets, extract fitness from results table and plot minifigures
- these can be arranged in/around a map of the central carbon metabolism (CCM)
- reactions with slightly different substrates/products but that are gene-wise identical have been removed
- these are SUCDi, ACONT2, TKT2, and PDH3 which is identical to AKGDH

```{r}
list_ccm_enzymes <- list(
  CBB = c("PGK", "TPI", "GAPD", "FBA", "FBP", "TKT1", "TAh", "RPE", "RPI", "PRUK", "RBPC"),
  ED = c("PGI", "G6PDH2r", "PGL", "EDD", "EDA"),
  PYR = c("PGM", "ENO", "PYK", "PDH1", "PDH2", "PC", "PPC", "ME1", "ME2", "CS"),
  TCA = c("ACONT1", "ICDHx", "ICDHyr", "AKGDH", "SUCOAS", "SUCD1", "FUM", "MDH", "MALS"),
  REG = c("Regulators")
)
```


```{r, fig.height = 12, fig.width = 14, warning = FALSE}
# find genes for metabolic reactions
list_reaction <- list_ccm_enzymes %>%
  enframe(name = "pathway", value = "reaction_id") %>%
  tidyr::unnest(reaction_id)

list_genes <- df_reactions %>%
  right_join(list_reaction, by = "reaction_id") %>%
  separate_longer_delim(locus_tag, ", ") %>%
  select(pathway, reaction_id, locus_tag)

# get fitness for these genes
df_fitness_ccm <- df_fitness_summary %>%
  filter(time == 8) %>%
  inner_join(list_genes, by = "locus_tag")

# plot heatmaps
plots_ccm <- df_fitness_ccm %>%
  arrange(pathway, reaction_id, locus_tag) %>%
  mutate(
    gene_name = locus_tag,
    # gene_name = eggNOG_name %>%
    # str_extract("[a-zA-Z0-9_]+") %>%
    # fct_inorder(),
    cond_short = case_match(cond_short,
      "formate - C"	~ "FoC",
      "fructose - C" ~ "FrC",
      "nitrate - C" ~ "NiC",
      "succinate - C"	~ "SuC",
      "formate - P" ~ "FoP",
      "fructose - P" ~ "FrP",
      "hydrogen - P" ~ "HyP",
      "nitrate - P"	~ "NiP",
      "succinate - P"~ "SuP") %>%
      factor(., unique(.)[c(3,8,7,1,5,4,9,2,6)]),
    norm_gene_fitness_median = norm_gene_fitness_median %>%
           replace(., . < -6, -6) %>% replace(., . > 6, 6)
  ) %>%
  group_by(pathway, reaction_id) %>%
  dplyr::group_split() %>%
  lapply(function(df) {
    levelplot(
      norm_gene_fitness_median ~ fct_rev(cond_short) * fct_rev(gene_name) |
        reaction_id, df,
      par.settings = custom.colorblind(), colorkey = FALSE,
      col.regions = colorRampPalette(heat_cols)(25),
      at = seq(-6, 6, 0.5), aspect = "iso", xlab = "", ylab = "",
      scales = list(alternating = FALSE, cex = 0.7, tck = 0,
        x = list(rot = 90)),
      between = list(x = 0.5, y = 0.5), lwd = 0,
      panel = function(x, y, z, ...) {
        panel.levelplot(x, y, z, ...)
        panel.abline(
          v = seq_along(unique(x)) + 0.5,
          h = seq_along(unique(y)) + 0.5,
          col = "white",
          lwd = 2
        )
      }
    )
  })

cowplot::plot_grid(plotlist = plots_ccm, ncol = 6)
```


```{r, echo = FALSE, warning = FALSE, message = FALSE}
svg("../figures/figure_centralmet.svg", width = 14, height = 12)
cowplot::plot_grid(plotlist = plots_ccm, ncol = 6)
dev.off()
```

# Growth advantage of Hydrogenase mutants

## Determine mutant growth rate from competition experiments

The main measurement that was obtained from transposon library competition experiments is the fold change, and the fitness score as a summary statistic thereof. The fold change of a mutant sub-population over time can also be used to calculate the relative growth advantage over other strains. For such a growth model, the only required input parameters are the FC, the average population growth rate, and the time in hours.

$$ FC = (1 - (\mu_{pop} - \mu_{mut}))^{t} $$ Re-arranged to obtain the mutant growth rate.

$$ \mu_{mut} = -(1 - FC^{1/t}) + \mu_{pop}$$

All input variables are available:

-   µ_pop = 0.1 h\^-1, known from chemostats
-   FC for each mutant/gene from library competition experiments ('fitness' is the normalized log2 FC)
-   t = n_gen x (ln 2)/0.1 - number of generations (0, 8, 16) multiplied with generation time of 6.93 h

We make a generalized function that calculates growth rate (dis-) advantage.

```{r}
mutant_mu <- function(fold_change, pop_mu, time, log2_fc = FALSE) {
  if (log2_fc) {fold_change = 2^fold_change}
  result = -(1 - (fold_change^(1/time))) + pop_mu
  return(result)
}
```

Then select the hydrogenase mutants where an enrichment was observed in heterotrophic conditions.

```{r, fig.width = 6.5, fig.height = 6.0}
df_increased_mu <- df_fitness_summary %>%
  filter(
    grepl("hox[KGZMLOQRTVABCJNFUYHWI]|hyp[BCDEFX]1? |hypA PHG012", gene_name), condition == "continuous",
    substrate %in% c("formate", "fructose", "succinate")) %>%
  filter(locus_tag != "PHG063") %>%
  mutate(gene_name = stri_extract_first_regex(gene_name, "hox[KGZMLOQRTVABCJNFUYHWI]|hyp[ABCDEFX]")) %>%
  mutate(gene_name = fct_inorder(gene_name))

plot_increased_mu <- xyplot(norm_gene_fitness_median ~ time | gene_name, df_increased_mu,
    groups = substrate, as.table = TRUE, layout = c(6, 5),
    col = stdcol[2:4], xlab = "generations", ylab = "fitness",
    par.settings = custom.colorblind(), type = c("l"),
    auto.key = list(columns = 3),
    between = list(x = 0.5, y = 0.5),
    scales = list(alternating = FALSE), lwd = 2,
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.abline(h = 0, col = grey(0.5), lty = 3, lwd = 2)
      panel.xyplot(x, y, ...)
    }
  )

df_mutant_mu <- df_increased_mu %>%
  filter(time != 0) %>%
  group_by(gene_name, substrate) %>%
  mutate(mutant_mu = mutant_mu(
    fold_change = norm_gene_fitness_median, 
    pop_mu = 0.1, time = time*(log(2)/0.1),
    log2_fc = TRUE)) %>%
  select(locus_tag, time, condition, substrate,
    norm_gene_fitness_median, gene_name, cond_short, mutant_mu)

print(plot_increased_mu)
```

## Growth advantage *versus* protein cost

The main question is if the observed growth advantage can be purely explained by the saving of gene expression cost. Hydrogenase and some accessory proteins are highly abundant and therefore cause a significant cost to the cell. It was shown in many other studies that reducing excess/unused proteins can increase growth rate. This is based on the observation that the protein pool is limited and cells can optimize the utilization of this limited resource by tuning gene expression towards highly utilized genes/proteins.

In the next section, global protein abundance data is imported that was obtained from quantitative MS experiments. MS data is available for the three conditions used in the previous section, continuous growth on fructose, succinate, and formate in chemostats at defined growth rates.

**Note: The data is imported directly from the respective github repository, requiring internet connection**.

```{r}
load(url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/blob/main/data/input/Ralstonia_eutropha.Rdata?raw=true"))
head(Ralstonia_eutropha[1:10])
```

To assess real protein cost of a knock-out, it is important to consider that KO by transposon integration will have an effect on the downstream gene expression too, effectively disrupting transcription/translation of the entire operon downstream of the integration site.

If we assume that each KO knocks out the gene expression of all consecutive genes of the operon, we need to sum up protein cost for these affected gene sets; we generate a **binary matrix** of associations for each KO with a number of affected genes (0 = FALSE, 1 = TRUE), and multiply with protein mass. Here, we need to test different assumptions. The canonical regulation of the hox/hyp operon happens via `hoxA` transcriptional regulator, `hoxBC` H2 sensing 'regulatory hydrogenase' (RH), and `hoxJ`, the defect kinase that transmits the output from `hoxBC` to `hoxA?`. This regulation controls two promoters:

-   the P_MBH promoter that starts transcription of the huge `hoxKGZMLOQRTV` + `hypABFCDEX` + `hoxABCJ operon`
-   the P_SH promoter that starts transcription of the smaller `hoxFUYHWI` + `hypA2B2F2` operon

Accordingly, `hoxA` controls its own transcription in a feed-forward autoregulatory fashion. Increased `hoxA` protein abundance leads to stronger `hoxA` expression, etc. However, if a transposon integrates into any given site of one of the operons, transcription of following genes should be terminated or otherwise negatively affected. Surprisingly, this effect is visible for `hyp` mutants that might negatively affect `hoxA` expression, but not for any preceding genes (`hoxKHZMLOQRTV`). This suggests that:

-   there is another, secondary promoter that controls `hyp` + `hoxABCJN` gene expression
-   in fact this has been shown in ...

```{r}
gene_list_mh <- c("hoxK", "hoxG", "hoxZ", "hoxM", "hoxL", "hoxO", "hoxQ", "hoxR", "hoxT", "hoxV")
gene_list_sh <- c("hoxA", "hoxB", "hoxC", "hoxJ", "hoxN", "hoxF", "hoxU", "hoxY", "hoxH", "hoxW", "hoxI")
gene_list_hyp <- c("hypA", "hypB", "hypF", "hypC", "hypD", "hypE", "hypX")

list_mutant_genomic <- list(
  hoxK = gene_list_mh,
  hoxG = gene_list_mh[2:10],
  hoxZ = gene_list_mh[3:10],
  hoxM = gene_list_mh[4:10],
  hoxL = gene_list_mh[5:10],
  hoxO = gene_list_mh[6:10],
  hoxQ = gene_list_mh[7:10],
  hoxR = gene_list_mh[8:10],
  hoxT = gene_list_mh[9:10],
  hoxV = gene_list_mh[10],
  hypA = c(gene_list_mh, gene_list_hyp, gene_list_sh),
  hypB = c(gene_list_mh, gene_list_hyp[2:7], gene_list_sh),
  hypF = c(gene_list_mh, gene_list_hyp[3:7], gene_list_sh),
  hypC = c(gene_list_mh, gene_list_hyp[4:7], gene_list_sh),
  hypD = c(gene_list_mh, gene_list_hyp[5:7], gene_list_sh),
  hypE = c(gene_list_mh, gene_list_hyp[6:7], gene_list_sh),
  hypX = c(gene_list_mh, gene_list_hyp[7], gene_list_sh),
  hoxA = c(gene_list_mh, gene_list_hyp, gene_list_sh),
  hoxB = gene_list_sh[2:5],
  hoxC = gene_list_sh[3:5],
  hoxJ = gene_list_sh[4:5],
  hoxN = gene_list_sh[5],
  hoxF = gene_list_sh[c(6:11)],
  hoxU = gene_list_sh[c(7:11)],
  hoxY = gene_list_sh[c(8:11)],
  hoxH = gene_list_sh[c(9:11)],
  hoxW = gene_list_sh[c(10:11)],
  hoxI = gene_list_sh[c(11)]
)

mat_mutant_genomic <- lapply(list_mutant_genomic, function(x) {
    as.numeric(names(list_mutant_genomic) %in% x)
  }) %>% unlist %>% matrix(ncol = length(list_mutant_genomic), byrow = TRUE) %>% t
rownames(mat_mutant_genomic) <- names(list_mutant_genomic)
colnames(mat_mutant_genomic) <- names(list_mutant_genomic)
```

-   Add individual and accumulated protein cost for each KO mutant.

```{r}
# add individual protein cost from MS data
df_mutant_cost <- df_mutant_mu %>%
  group_by(locus_tag, gene_name, substrate) %>%
  summarize(.groups = "drop",
    mean_mutant_mu = mean(mutant_mu),
    sd_mutant_mu = sd(mutant_mu)) %>%
  left_join(by = c("locus_tag", "substrate"),
    Ralstonia_eutropha %>% ungroup %>%
      filter(growthrate == 0.1) %>%
      select(locus_tag, substrate, mean_mass_fraction_norm, sd_massfraction)) %>%
  arrange(locus_tag) %>%
  tidyr::complete(substrate, nesting(locus_tag, gene_name)) %>%
  
  # add estimated total cost for each knockout
  group_by(substrate) %>%
  mutate(
    corrected_cost = mean_mass_fraction_norm %>% {colSums(na.rm = TRUE, .*100*mat_mutant_genomic)},
    corrected_sd = sd_massfraction %>% {colSums(na.rm = TRUE, .*100*mat_mutant_genomic)}
  )
```

-   Plot mutant growth rate, individual cost, and total protein cost that is saved in KO mutants.

```{r, fig.width = 7.5, fig.height = 7.5}
plot_mutant_mu <- xyplot(mutant_mu ~ gene_name, df_mutant_mu,
    groups = substrate, as.table = TRUE,
    col = stdcol[2:4], xlab = "", ylab = expression("estimated µ [h"^-1*"]"),
    par.settings = custom.colorblind(), lwd = 2,
    scales = list(x = list(rot = 90)),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.abline(h = 0.1, col = grey(0.5), lty = 3, lwd = 2)
      panel.errbars(x, y, beside = TRUE, ...)
    }
  )

plot_mutant_cost <- xyplot(mean_mass_fraction_norm*100 ~ gene_name, df_mutant_cost,
    groups = substrate, par.settings = custom.colorblind(),
    error_margin = df_mutant_cost[["sd_massfraction"]]*100,
    lwd = 2, col = stdcol[2:4], ylim = c(-0.2, 6.5), pch = 1,
    xlab = "", ylab = "reduction protein cost [%]",
    scales = list(x = list(rot = 90)),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.errbars(x, y, beside = TRUE, ...)
      panel.key(corner = c(0.98, 0.95), ...)
      panel.key(corner = c(0.98, 0.60),
        labels = c("individual cost", "total cost (est.)"),
        col = grey(0.5), pch = c(1, 19))
    }
  ) + as.layer(
    xyplot(corrected_cost ~ gene_name, df_mutant_cost,
      groups = substrate,
      error_margin = df_mutant_cost[["corrected_sd"]],
      lwd = 2, col = stdcol[2:4],
      panel = function(x, y, ...) {
        panel.errbars(x, y, beside = TRUE, ...)
      }
    )
  )

plot_mbh_operon <- df_increased_mu %>%
  filter(time == 8) %>%
  group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop") %>%
  genome_plot(xlim = c(0.015, 22.490), title = "pHG1", rot_labels = 45)

plot_sh_operon <- df_increased_mu %>%
  filter(time == 8) %>%
  group_by(locus_tag, gene_name, scaffold, strand, start, end) %>%
  summarize(strains_per_gene = min(unique(strains_per_gene)), .groups = "drop") %>%
  genome_plot(xlim = c(79.612, 85.438), title = "pHG1", rot_labels = 45)

print(plot_mutant_mu, position = c(-0.017, 0.60, 1, 1), more = TRUE)
print(plot_mutant_cost, position = c(0.02, 0.30, 1, 0.70), more = TRUE)
print(plot_mbh_operon, position = c(0.075, 0.14, 0.97, 0.39), more = TRUE)
print(plot_sh_operon, position = c(0.075, 0, 0.4, 0.25))
grid::grid.text(c("A","B","C"), x = c(0.02, 0.02, 0.02),
  y = c(0.98, 0.66, 0.34), gp = grid::gpar(cex = 1.2))
```

```{r, echo = FALSE, results = 'hide'}
svg("../figures/figure_mutant_cost.svg", width = 7.5, height = 7.5)
print(plot_mutant_mu, position = c(-0.017, 0.60, 1, 1), more = TRUE)
print(plot_mutant_cost, position = c(0.02, 0.30, 1, 0.70), more = TRUE)
print(plot_mbh_operon, position = c(0.075, 0.14, 0.97, 0.39), more = TRUE)
print(plot_sh_operon, position = c(0.075, 0, 0.4, 0.25))
grid::grid.text(c("A","B","C"), x = c(0.02, 0.02, 0.02),
  y = c(0.98, 0.66, 0.34), gp = grid::gpar(cex = 1.2))
dev.off()
```

- plot correlation of estimated protein cost and growth rate, per mutant
- correlation is visible when plotting both parameters separately, but will it also hold in direct comparison?

```{r, fig.width = 6.5, fig.height =3.5}
plot_cost_vs_mu <- xyplot(mean_mutant_mu ~ corrected_cost | substrate,
    df_mutant_cost, groups = substrate,
    pointlabels = as.character(df_mutant_cost$gene_name),
    par.settings = custom.colorblind(), col = stdcol[2:4],
    xlab = "reduction protein cost [%]",
    ylab = expression("estimated µ [h"^-1*"]"),
    aspect = 1, scales = list(alternating = FALSE),
    between = list(x = 0.5, y = 0.5),
    panel = function(x, y, pointlabels, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.superpose(x, y, ...)},
    panel.groups = function(x, y, ...) {
      panel.xyplot(x, y, ...)
      panel.lmlineq(x, y, r.squared = TRUE, label = "", ...)
    }
  )

plot_cost_vs_mu

svg("../figures/figure_mutant_cost_vs_growth.svg", width = 6.5, height = 3.5)
print(plot_cost_vs_mu)
dev.off()
```


## Prediction of growth advantage using cell economy model

The constrained Resource Balance Analysis (RBA) model described in [Jahn et al., 2021](https://elifesciences.org/articles/69019) can be used to make predictions on the effect of hydrogenase knockout on cell growth rate. The RBA model is a genome scale metabolic model of *R. eutropha* that takes protein costs and global protein pool limitations into account. In constrast to simple FBA (flux balance analysis), growth rate is ultimately capped by rate limitation of enzymes and space limitation for enzyme mass within the cell, or the cytoplasmic membrane. The model was constrained using experimentally determined steady-state growth rates and protein abundances.

Here, we can ask the question if, and to which degree, growth rate would increase by expanding the total protein pool size (shrinking protein cost) by the fraction indicated above (e.g. *hoxA* KO freeing up to 4% protein mass). The first step is to import RBA simulation results.

```{r, warning = FALSE, results = 'hide'}
source(url("https://github.com/m-jahn/R-notebook-ralstonia-proteome/raw/main/pipeline/read_rba_result.R"))
dir_sim <- "../../../Models/Bacterial-RBA-models/Ralstonia-eutropha-H16/simulation/hydrogenase/"

df_flux <- read_rba_result(list.files(dir_sim, pattern = "fluxes_.*.tsv$", full.names = TRUE))
df_prot <- read_rba_result(list.files(dir_sim, pattern = "proteins_.*.tsv", full.names = TRUE))
df_macr <- read_rba_result(list.files(dir_sim, pattern = "macroprocesses_.*.tsv", full.names = TRUE))
```

The next step is to plot the increase in growth rate as a function of protein resources that have been saved. The prediction shows smaller increases than the growth rates estimated from the library data.

-   roughly 8% increase in µ for 10% reduction of protein cost
-   library data suggest roughly 40% increase for 4-5% reduction of protein cost
-   this is not realistic, most likely an artifact of calculation from NGS data

```{r, fig.width = 3.7, fig.height = 3.7}
plot_cost_predict <- df_macr %>%
  mutate(carbon_source = recode(carbon_source,
    `for` = "formate", `succ` = "succinate", `fru` = "fructose")) %>%
  mutate(sim_run = as.numeric(sim_run)) %>%
  arrange(carbon_source, sim_run) %>%
  group_by(carbon_source, key) %>%
  mutate(reduced_protein_cost = 0:10) %>%
  filter(key == "mu") %>%
  
  xyplot(value ~ reduced_protein_cost, .,
    groups = carbon_source, aspect = 1,
    par.settings = custom.colorblind(),
    type = "b", lwd = 2, lty = 2, pch = 1,
    ylim = c(0.095, 0.125), col = stdcol[2:4],
    xlab = "freed up protein resources [% total]",
    ylab = expression("µ [h"^-1*"]"),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
      panel.key(..., cex = 0.7, corner = c(0.96, 0.96))
      panel.key(c("data", "model"), lines = TRUE, lty = c(1,2),
        pch = c(19, 1), col = grey(0.4),
        cex = 0.7, corner = c(0.96, 0.72))
    }
  )

print(plot_cost_predict)
```

## Growth advantage of clean, in-frame hydrogenase deletion mutants

- a selection of hydrogenase mutants was grown in bioreactors, batch mode
- total volume was 50 mL minimal medium supplemented with 2 g/L fructose + 1 g/L NH4Cl
- aeration: 100 mL/min for all tubes
- temperature 30\*C
- OD 720nm was constantly measured every 15 min
- biomass yield was determined by collecting all biomass at the end of the experiment and determining DCW

```{r, fig.width = 6.5, fig.height = 2.65}
df_growth_assay <- read_csv("../data/growth_assays/df_mu_hox.csv", show_col_types = FALSE)
df_dcw_assay <- read_csv("../data/growth_assays/df_dcw_hox.csv", show_col_types = FALSE)
plot_order <- c("WT", "hoxG", "hypB", "hypX", "hoxA", "hoxH")

plot_growth_dcw <- df_dcw_assay %>%
  mutate(sample = factor(sample, plot_order)) %>%
  xyplot(yield_gDCW_gFRC ~ sample, .,
    par.settings = custom.colorblind(),
    col = stdcol[3], lwd = 2, ylim = c(-0.0, 0.6),
    xlab = "", ylab = expression("yield [g DCW g Frc"^-1*"]"),
    scales = list(x = list(rot = 45)),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, pch = 19, cex = 0.5, col = grey(0.6))
      panel.barplot(x, y, ...)
      panel.pvalue(x, y, offset = 0.12, ...)
    }
  )

plot_growth_mu <- df_growth_assay %>%
  mutate(sample = factor(sample, plot_order)) %>%
  xyplot(mu_max ~ sample, .,
    par.settings = custom.colorblind(),
    col = stdcol[3], lwd = 2, ylim = c(-0.0, 0.39),
    xlab = "", ylab = expression("µ [h"^-1*"]"),
    scales = list(x = list(rot = 45)),
    panel = function(x, y, z, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, pch = 19, cex = 0.5, col = grey(0.6))
      panel.barplot(x, y, ...)
      panel.pvalue(x, y, offset = 0.08, ...)
    }
  )

print(plot_growth_dcw, position = c(0, 0, 0.5, 1), more = TRUE)
print(plot_growth_mu, position = c(0.5, 0, 1, 1), more = TRUE)
grid::grid.text(c("A","B"), x = c(0.03, 0.5), y = c(0.94, 0.94), gp = grid::gpar(cex = 1.2))
```

## Hydrogen evolution of WT and hydrogenase mutant

- an alternative hypothesis to protein cost by hydrogenases is energy cost
- this hypothesis basically states that the enzymatic action of hydrogenases "running reverse" catalyzes the formation of H2 from NADH and protons (as known from cyanobacteria)
- this reaction would drain the reductant pool (NADH(P)H) and lead to reduced growth
- this reaction plays a role when the terminal electron acceptor, oxygen or one of the nitrogen oxides, are limiting
- in order to determine if C. necator spills reductant in heterotrophic conditions through the action of hydrogenases, WT and a mutant defective in all hydrogenases were cultivated with fructose and glycerol to induce hydrogenase expression
- without strong limitation of oxygen (which was not present in our bioreactor experiments), no hydrogen evolution was detectable
- this result strengthens the protein cost hypothesis


```{r, fig.width = 4, fig.height = 3.2}
df_h2_evolution <- read_csv("../data/growth_assays/h2_evolution.csv", show_col_types = FALSE) %>%
  mutate(time = ifelse(time == 20.5, 12.5, time))

plot_h2_evolution <- xyplot(mean ~ time,
    filter(df_h2_evolution, gas == "O2"),
    groups = strain,
    par.settings = custom.colorblind(),
    error_margin = filter(df_h2_evolution, gas == "O2")$sd,
    lwd = 2, pch = 1,
    xlab = "time [h]", ylab = expression("O"[2]*" [%]"),
    ylim = c(-2, 22),
    scales = list(x = list(at = seq(0, 10, 2))),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = 0, col = grey(0.9))
      panel.abline(v = seq(0, 10, 2), col = grey(0.9))
      panel.xyplot(x, y, type = "l", lty = 2, ...)
      panel.errbars(x, y, ewidth = 0, lty = 1, ...)
      panel.key(
        corner = c(0.9, 0.9),
        labels = c("∆hoxGHC,∆hofG", "WT", "O2", "H2"),
        col = c("#E7298A", "#66A61E", grey(0.5), grey(0.5)),
        pch = c(19, 19, 19, 1)
      )
    }
  ) + as.layer(
    xyplot(mean/100 ~ time,
      filter(df_h2_evolution, gas == "H2"),
      groups = strain, pch = 19, lty = 1,
      error_margin = filter(df_h2_evolution, gas == "H2")$sd/100,
      panel = function(x, y, ...) {
        panel.xyplot(x, y, type = "l", ...)
        panel.errbars(x, y, ewidth = 0, ...)
      }
    )
  )

print(plot_h2_evolution)
```


```{r, fig.width = 7.5, fig.height = 2.85}
svg("../figures/figure_mutant_mu.svg", width = 7.5, height = 2.85)
print(plot_growth_dcw, position = c(-0.02, -0.034, 0.38, 1), more = TRUE)
print(plot_growth_mu, position = c(0.29, -0.034, 0.69, 1), more = TRUE)
print(plot_h2_evolution, position = c(0.61, 0, 1, 1), more = TRUE)
grid::grid.text(c("D","E", "F"), x = c(0.03, 0.35, 0.65),
  y = c(0.94, 0.94, 0.94), gp = grid::gpar(cex = 1.2))
dev.off()
```

## Protein cost of secondary hydrogenases

This figure shows protein cost of the hox/hof genes that were not included in the protein cost analysis:

-   the 4th 'actinobacterial' hydrogenase, 1 subunit (hofG, PHG065)

```{r, fig.width = 4, fig.height = 3}
Ralstonia_eutropha %>% ungroup %>%
  filter(grepl("PHG065", protein), substrate != "ammonium") %>%
  
  xyplot(mean_mass_fraction_norm*100 ~ factor(growthrate) | protein, .,
    groups = substrate, par.settings = custom.colorblind(),
    error_margin = .[["sd_massfraction"]]*100,
    lwd = 2, col = stdcol[2:4], ylim = c(-0.0005, 0.0045), pch = 19,
    xlab = "", ylab = "protein mass fraction [%]",
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.errbars(x, y, beside = TRUE, ...)
      panel.key(corner = c(0.02, 0.95), ...)
    }
  )
```

## Protein cost of ETC complexes

- as a complimentary information, we can determine the mass fraction of ETC complexes using MS data
- these are membrane proteins and hence likely underestimated in comparison to soluble/cytoplasmic proteins
- the reason is the lower efficiency when extracting membrane proteins
- a relative comparison is nevertheless possible
- should allow us to see the rough order of magnitude of ETC complexes

```{r, fig.width = 8, fig.height = 6}
plots_etc_mass <- Ralstonia_eutropha %>%
  ungroup %>%
  select(locus_tag, mean_mass_fraction_norm, growthrate, protein, substrate, sd_massfraction) %>%
  inner_join(
    select(df_etc, locus_tag, gene_name) %>%
      distinct %>%
      mutate(gene_name = as.character(gene_name)),
    by = join_by(locus_tag)) %>%
  filter(substrate != "ammonium", growthrate == 0.1, !is.na(gene_name), !str_detect(gene_name, "H16_")) %>%
  mutate(operon = str_sub(gene_name, 1, 3)) %>%
  group_by(operon) %>%
  group_split() %>%
  lapply(function(df) {
    xyplot(mean_mass_fraction_norm*100 ~ factor(gene_name) | operon, df,
      groups = substrate, par.settings = custom.colorblind(),
      error_margin = df$sd_massfraction * 100,
      lwd = 2, col = stdcol[2:4], ylim = c(-0.005, 0.085), pch = 19,
      between = list(x = 0.5, y = 0.5), as.table = TRUE,
      scales = list(alternating = FALSE, x = list(rot = 90)),
      xlab = "", ylab = "protein mass fraction [%]",
      panel = function(x, y, ...) {
        panel.grid(h = -1, v = -1, col = grey(0.9))
        panel.errbars(x, y, beside = TRUE, ...)
        panel.key(corner = c(0.02, 0.95), ...)
      }
    )
  })

gridExtra::grid.arrange(
  ncol = 3, nrow = 2,
  plots_etc_mass[[5]],
  plots_etc_mass[[4]],
  plots_etc_mass[[6]],
  plots_etc_mass[[2]],
  plots_etc_mass[[3]],
  plots_etc_mass[[1]]
)
```


```{r, echo = FALSE}
svg("../figures/figure_etc_mass.svg", width = 8, height = 6)
gridExtra::grid.arrange(
  ncol = 3, nrow = 2,
  plots_etc_mass[[5]],
  plots_etc_mass[[4]],
  plots_etc_mass[[6]],
  plots_etc_mass[[2]],
  plots_etc_mass[[3]],
  plots_etc_mass[[1]]
)
dev.off()
```

# Export processed data

- export fitness data in a format that is suitable for [ShinyLib](https://m-jahn.shinyapps.io/ShinyLib/)
- we export entire data set for best comparison options, even though first part (carbon sources) was published ealier as Jahn et al., eLife, 2021

```{r}
Cupriavidus_BarSeq_2023 <- df_fitness %>%
  rename(timepoint = time, fitness_score = norm_gene_fitness, t_stat = t) %>%
  mutate(
    induction = "not applicable",
    log2FC = replace(log2FC, is.infinite(log2FC), NA),
    fold_change = 2^log2FC,
    condition = paste0(substrate, " - ", condition)
  ) %>%
  select(-ID) %>%
  arrange(locus_tag, condition, timepoint) %>%
  left_join(by = "locus_tag",
    df_ref %>%
      select(
        locus_tag, new_locus_tag,
        uniprot, protein_name, gene_name,
        System, Process, Pathway
      )
  )

# export to RData
save(Cupriavidus_BarSeq_2023, file = "../data/barseq/Cupriavidus_BarSeq_2023.Rdata")
```


# Session Info

```{r}
sessionInfo()
```
