This R notebook is a bioinformatics pipeline to map reads from a barcoded transposon library to the genome of a target organism. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). The initial steps of processing next generation sequencing data was directly adapted from Morgan Price’s Feba repository, see also the TnSeq-pipe github repository for an overview.
Fastq
raw data files were processed as outlined in the documentation for the TnSeq-pipe github repository. This step creates the barcode mappings (data/mapped/*.tsv
) and the summary pool file (data/pool/pool.tsv
). Mapping files from different sequencing runs were combine to one pool file using this standalone perl script from the TnSeq-pipe repo.
perl feba/bin/DesignRandomPool.pl -minN 1 \
-pool ../R_projects/R-notebook-ralstonia-proteome/data/pool/CN_V2_pool.tsv \
-genes ref/GCF_000009285.1_ASM928v2_genomic_trimmed.tsv \
../R_projects/R-notebook-ralstonia-proteome/data/mapped/*.tsv
# optionally install repos from github
# devtools::install_github("m-jahn/lattice-tools")
suppressPackageStartupMessages({
library(lattice)
library(latticeExtra)
library(latticetools)
library(data.table)
library(MASS)
library(stringi)
library(zoo)
library(tidyverse)
})
The next step is to inspect basic statistics of transposon insertions and their distribution over the genome.
First read in data tables of the barcode ‘pool’, a summary of individual sequencing reads. Barcodes without mapping to the genome are removed. Data from three different sequencing runs were combined in one pool file (CN_V2_pool.tsv
). A separate pool file contains a different, earlier iteration of the library and therefore a different set of transposon insertions, prepared by Kyle Kimler (CN_V1_pool.tsv
).
# import seq data from first version of library
df_pool_V1 <- read_tsv("../data/pool/CN_V1_pool.tsv")
Rows: 125070 Columns: 12
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): barcode, rcbarcode, scaffold, strand, scaffold2, strand2
dbl (6): nTot, n, pos, n2, pos2, nPastEnd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# import seq data from second version of library
df_pool_V2 <- read_tsv("../data/pool/CN_V2_pool.tsv")
Rows: 124143 Columns: 12
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): barcode, rcbarcode, scaffold, strand, scaffold2, strand2
dbl (6): nTot, n, pos, n2, pos2, nPastEnd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# combine all in one df, removing duplicate barcodes
df_pool <- bind_rows(df_pool_V1, df_pool_V2, .id = "version") %>%
arrange(desc(version)) %>% filter(!duplicated(barcode)) %>%
arrange(version)
# import reference genome
df_ref <- read_tsv("../data/ref/GCF_000009285.1_ASM928v2_genomic_trimmed.tsv") %>%
filter(!duplicated(old_locus_tag))
Rows: 7051 Columns: 7
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): scaffold, strand, desc, old_locus_tag, new_locus_tag
dbl (2): begin, end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[,1] [,2]
version "1" "2"
N_reads "26329546" " 5786608"
N_unique_bc "124649" "124143"
N_bc_with_2_or_more_reads "72443" "57040"
N_bc_with_10_or_more_reads "68423" "29391"
N_bc_eith_1_read "52206" "67103"
N_bc_alternative_pos "59203" "18073"
N_bc_plus_strand "61189" "61678"
N_bc_minus_strand "63460" "62465"
N_bc_per_kbp "16.80585" "16.73763"
Next we can plot the frequency of reads per barcoded transposons.
plot_reads_per_bc <- histogram(~ log2(nTot) | paste("strand:", strand), df_pool,
par.settings = custom.colorblind(),
between = list(x = 0.5, y = 0.5),
xlab = expression("log"[2]*" reads per barcode"),
scales = list(alternating = FALSE),
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, border = "white", ...)
}
)
print(plot_reads_per_bc)
Read frequency over genome
Each transposon insertion is indexed with a position on the genome. We can now plot insertion frequency over the genome. There are different ways to do that depending on how the data is treated. The most simple case (as done below) plotting the number of reads per transposon versus its insertion site on the genome, broken down by chromosome type (‘scaffold’).
plot_reads_on_genome <- xyplot(nTot ~ pos | scaffold,
df_pool %>% arrange(pos),
par.settings = custom.colorblind(),
between = list(x = 0.5, y = 0.5),
layout = c(1,3), type = "l", lwd = 1.5,
scales = list(alternating = FALSE),
panel = function(x, y, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.xyplot(x, y, ...)
}
)
print(plot_reads_on_genome)
Tn insertion frequency over genome
However this does not really reflect the actual insertion frequency. For the frequency, what matters is the number of different, unique insertions per kb of the genome. We can apply a density function or generate a rolling mean to evaluate frequency. Lattice’s densityplot
scales the frequency per location based on the length of x-axis, so it’s not suitable to compare insertion frequencies between chromosomes of different length.
Instead, we use a defined window of for example 10,000 bp and determine the sum of Tn insertion events per window. Some fo them might be duplicated barcodes because they map to more than one position. However these were still included here since they are often biologically relevant (CBB operon) and are low in number.
plot_Tns_on_genome <- df_pool %>%
arrange(pos) %>%
mutate(region = cut_interval(pos, length = 10000, labels = FALSE)*10000) %>%
group_by(scaffold, region) %>%
summarize(tn_per_region = length(barcode)) %>%
xyplot(tn_per_region ~ region | scaffold, .,
par.settings = custom.colorblind(),
ylab = "Tn insertions / 10 kb",
between = list(x = 0.5, y = 0.5),
layout = c(1,3), type = "l", lwd = 1.5,
scales = list(alternating = FALSE),
panel = function(x, y, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.xyplot(x, y, ...)
}
)
`summarise()` has grouped output by 'scaffold'. You can override using the `.groups` argument.
print(plot_Tns_on_genome)
The basic Feba scripts produce a table of barcodes, their frequencies and genomic position information. What we really want to know is how many transposons/barcodes are mapped to each gene, which position within a gene they have, how many barcodes do not map to a gene (intergenic, low importance), and how many genes were not hit by a transposon (probably essential). For this purpose we can use the function foverlaps()
from package data.table
to map transposon insertion sites to genes (or vice versa). The following part was inspired by previous work of Kyle Kimler (github link).
# prepare input data in form of data tables
dt_pool <- data.table(df_pool)
dt_ref <- data.table(df_ref)
#dummy begin/end columns are created in the pool file to allow foverlap function
dt_pool$begin <- dt_pool$pos
dt_pool$end <- dt_pool$pos
# map Tn insertion sites to genes
setkey(dt_ref, scaffold, begin, end)
df_pool_annotated <- foverlaps(dt_pool, dt_ref,
by.x = c("scaffold", "begin", "end"), type = "within") %>%
as_tibble %>%
select(barcode, rcbarcode, nTot, n, scaffold, i.strand, pos, begin,
end, strand, desc, old_locus_tag, new_locus_tag, version) %>%
rename(gene_strand = strand, strand = i.strand) %>%
# exclude alternative mappings in case of overlapping genes
filter(!duplicated(barcode)) %>%
# include also unhit genes in main table
full_join(df_ref)
Joining, by = c("scaffold", "strand", "begin", "end", "desc", "old_locus_tag", "new_locus_tag")
head(df_pool_annotated)
Now that all transposons are mapped to genes (if possible), we can calculate basic statistics about how many genes were hit, how many transposons inserted in a gene on average, and how many transposons hit intergenic regions.
Barcodes per gene type
We filter ambiguous barcodes out (barcode mapping to more than one position). We can see that almost all transposons inserted into genes/pseudogenes, which is the expected outcome.
df_pool_annotated %>%
group_by(desc) %>%
summarize(n_barcodes = sum(!is.na(barcode))) %>%
mutate(desc = replace_na(desc, "intergenic")) %>%
arrange(desc(n_barcodes))
Insertions per gene
df_pool_annotated <- df_pool_annotated %>%
group_by(old_locus_tag) %>%
mutate(tn_per_gene = sum(!is.na(barcode))) %>%
mutate(tn_per_gene = case_when(
is.na(old_locus_tag) ~ 0,
TRUE ~ as.numeric(tn_per_gene[1]))
)
plot_insertions_per_gene <- df_pool_annotated %>%
summarize(tn_per_gene = tn_per_gene[1]) %>%
filter(tn_per_gene < 150) %>%
histogram( ~ tn_per_gene, .,
par.settings = custom.colorblind(), border = "white",
breaks = 40, xlab = "insertions per gene",
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, ...)
}
)
print(plot_insertions_per_gene)
Top 10 genes by number of Tn insertions
df_pool_annotated %>%
summarize(tn_per_gene = tn_per_gene[1]) %>%
arrange(desc(tn_per_gene)) %>% slice(1:10)
The mapping of a transposon to a gene also reveals its relative position within the gene. We can use this information to tag insertions as more likely to have a fitness effect, or not. We can also filter out transposons that lie outside the central portion of a gene (e.g. 10% margin to each side), or within a fixed flanking region (e.g. first or last 100 bp). The original FEBA protocol from Morgan Price uses a quality filter that requires transposons be located within the central 80% of a gene. We follow this definition and flag transposon outside the central portion of a gene as unreliable.
# apply margin of 10% gene length
df_pool_annotated <- df_pool_annotated %>% ungroup %>%
mutate(
gene_length = end-begin,
pos_relative = (pos-begin)/(end-begin),
central = dplyr::between(pos_relative, 0.1, 0.9)
)
How many transposons that inserted into a gene are central? We can summarize, and find that around 80% are central.
df_pool_annotated %>%
filter(!is.na(central)) %>%
group_by(central) %>%
summarize(frequency = length(pos)) %>%
mutate(percent = frequency/sum(frequency)*100)
How are insertions distributed over each gene, measured in relative position from 0 to 1? There is a trend towards higher insertion frequency at the termini of genes. Otherwise the insertion frequency is homogeneously distributed.
plot_insertion_position <- df_pool_annotated %>%
filter(!is.na(central)) %>%
histogram( ~ pos_relative, .,
par.settings = custom.colorblind(),
breaks = 50,
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, border = "white", ...)
}
)
print(plot_insertion_position)
We can estimate gene essentiality from the average frequency of transposon insertions per region, and the specific number of transposon insertions for a gene. Transposons integrate with a certain probability into the genome, and this probability depends on the distance to the origin of replication on a chromosome. Following the protocol from Rubin et al., PNAS, 2015 for a Tn library in Synechococcus, an insertion index is calculated that is a measure of essentiality. This index is the insertion frequency per gene divided by frequency per region (average of e.g. 100 genes).
I = (tn_gene / length_gene) / (tn_region / length_region)
with tn_gene and length_gene being the number of transposons inserted per gene, normalized by length. This is then compared to the average insertion frequency for a region/window with e.g. 10,000 bp width.
Random transposon insertion into the genome resembles the random drawing of balls from an urn, with replacement. The transposon can integrate at the position multiple times (= recycling of positions). To estimate the probability of the observed number of insertions, we can apply the binomial distribution (without replacement it would be the hypergeometric distribution). The number of total available insertion positions is a window of 10,000 bp around a gene. The single probability of transposon insertion into a gene is the length of the gene divided by length of the window (e.g. 1,000 bp/10,000 bp). The number of samples drawn is the actual number of Tn insertions in this window, e.g. 300. The probability P(x)
of Tn insertion into the gene at a rate x = 0 (exactly no insertion) is then:
P(x=0) = pbinom(x, sum of Tn insertions = 300, single probability = 1000/10000) = 1.873928e-14
These two terms are eqivalent: dbinom()
gives the density function, i.e. probability for the exact event P(X = x)
. pbinom()
gives the distribution function, that means the sum of the probabilities for all events P[X ≤ x]
in case of lower.tail = TRUE
(the default).
sum(dbinom(0:10, 300, 1000/10000))
[1] 1.068376e-05
pbinom(10, 300, 1000/10000)
[1] 1.068376e-05
Now we determine insertion index and insertion probability for the TnSeq data.
len_interval = 50000
# calculate the ratio of insertion frequency per gene and per region
df_pool_annotated <- df_pool_annotated %>%
# Construct intervals spanning 10 kb
mutate(pos = if_else(is.na(pos), round(begin+(end-begin)/2), pos)) %>%
group_by(scaffold) %>% arrange(pos) %>%
mutate(length_interval = cut_interval(pos, length = len_interval)) %>%
group_by(scaffold, length_interval) %>%
mutate(length_interval = length(pos)) %>%
# determine CENTRAL insertions per flanking interval,
# as sum over a rolling window
group_by(scaffold) %>%
mutate(tn_interval = zoo::rollapply(central,
FUN = function(x){sum(x, na.rm = TRUE)},
width = length_interval, fill = NA, partial = TRUE)) %>%
# finally determine insertion index
group_by(old_locus_tag) %>%
mutate(insertion_index = median(na.rm = TRUE,
(sum(central, na.rm = TRUE) / gene_length) / (tn_interval / len_interval)
) %>% replace_na(0) %>% replace(., . > 100, NA)) %>%
# and probability of observed number of insertions or lower P(X <= x)
mutate(insertion_probability =
pbinom(sum(central, na.rm = TRUE), round(mean(tn_interval, na.rm = TRUE)), mean(gene_length)*0.8/len_interval)
)
Now we can plot distribution of insertion indices, and insertion probabilities.
plot_ii_hist <- df_pool_annotated %>% slice(1) %>%
filter(insertion_index < 3) %>%
histogram( ~ insertion_index, .,
par.settings = custom.colorblind(),
breaks = 30,
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, border = "white", ...)
}
)
plot_ip_hist <- df_pool_annotated %>% slice(1) %>%
histogram( ~ insertion_probability, .,
par.settings = custom.colorblind(),
breaks = 30,
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, border = "white", ...)
}
)
plot_ii_vs_ip <- df_pool_annotated %>% slice(1) %>%
xyplot(log10(insertion_probability) ~ log10(insertion_index), .,
par.settings = custom.colorblind(), pch = 19, alpha = 0.3,
panel = function(x, y, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.xyplot(x, y, ...)
}
)
print(plot_ii_hist, split = c(1,1,3,1), more = TRUE)
print(plot_ip_hist, split = c(2,1,3,1), more = TRUE)
print(plot_ii_vs_ip, split = c(3,1,3,1))
We see that there is population of genes with a probability of insertion similar to the average (II around 1). In other words, many genes are hit randomly by transposon insertions at the same rate as the surrounding genomic region (coding or non-coding doesn’t play a roll). And then there is a set of ‘outliers’ that are hit not at all or at much lower frequency. To determine where to set a threshold between ‘still within random insertion freuency’ and ‘significantly lower than random insertion frequency’, the method described in Rubin et al., PNAS, 2016 is used. This method is based on scripts from the Bio-Tradis workflow. The following code is adapted from tradis_essentiality.R
. It fits two gamma distributions to the underlying populations of 1) essential and 2) non-essential genes. The distributions are used to obtain thresholds for high likelihood of a gene/II falling into category 1 or 2.
# function to fit two gamma distributions to insertion index (II) distribution
# and identify thresholds for essential, ambiguous, and non-essential genes
find_essential <- function(ins_index, prob_ratio = 5) {
# identify second maxima
h <- hist(ins_index, breaks = 200, plot = FALSE)
maxindex <- which.max(h$density[10:length(h$density)])
maxval <- h$mids[maxindex+3]
# find inter-mode minimum insertion index with loess
hist_min <- hist(ins_index[ins_index < maxval],
breaks = seq(0, maxval, by = maxval/2000), plot = FALSE)
lo <- loess(hist_min$density ~ c(1:2000))
local_min = hist_min$mids[which.min(predict(lo))]
# fraction of values assigned to each distribution
f1 = (sum(ins_index < local_min) + sum(ins_index == 0))/length(ins_index)
f2 = (sum(ins_index >= local_min))/length(ins_index)
# fit 1) exponential function to II of essential genes
# fit 2) gamma distribution to II of nonessential genes
d1 = fitdistr(ins_index[ins_index < local_min], "exponential")
d2 = fitdistr(ins_index[ins_index >= local_min], "gamma") %>% suppressWarnings()
# plots
ii_range <- seq(0, round(max(ins_index), 1), length.out = 1000)
fit_essential <- f1*dgamma(ii_range, 1, d1$estimate[1])
fit_non_essential <- f2*dgamma(ii_range, d2$estimate[1], d2$estimate[2])
# given the two probability density functions,
# we can determine the probability of an event falling into one or the other category
# for example, we can determine a threshold for the II where the probability of
# falling into one category is much (5x) higher than of falling into the other
p1 <- f1*dgamma(ii_range, 1, d1$estimate[1])
p2 <- f2*dgamma(ii_range, d2$estimate[1], d2$estimate[2])
ambiguous <- ii_range[which((p1 < prob_ratio*p2) & (p2 < prob_ratio*p1))]
# return list of results
list(
lower_t = ambiguous[1],
upper_t = tail(ambiguous, 1),
ii_range = ii_range,
fit_essential = fit_essential,
fit_non_essential = fit_non_essential
)
}
After adapting the function, we run it with the insertion index for all genes as input and obtain the fitted density functions, and the thresholds for ambiguous genes.
# run function with II
essential <- summarize(df_pool_annotated, ii = insertion_index[1]) %>%
filter(ii < 3) %>% pull(ii) %>%
find_essential
# and plot the results on an II histogram with overlaid probability
# density function
df_pool_annotated %>%
summarize(insertion_index = insertion_index[1]) %>%
filter(insertion_index < 3) %>%
histogram( ~ insertion_index, .,
par.settings = custom.colorblind(),
breaks = 30, col = grey(0.7),
type = "density",
panel = function(x, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.histogram(x, border = "white", ...)
panel.key(labels = c("essential", "non-essential"), points = FALSE,
lines = TRUE, lwd = 3, corner = c(0.9, 0.9))
panel.abline(v = c(essential$lower_t, essential$upper_t),
lwd = 2, lty = 2, col = grey(0.5))
}
) + as.layer(
xyplot(essential$fit_essential + essential$fit_non_essential ~ essential$ii_range,
type = "l", lwd = 3
)
)
First and most interesting question: How many genes are essential and non-essential? We can compile a new summary data frame with one gene per row, and add extensive genome annotation to it (compiled from KEGG, eggNOG COG, uniprot).
# summarize information per gene
df_essential <- df_pool_annotated %>%
rename(locus_tag = old_locus_tag) %>%
group_by(locus_tag) %>%
summarize(
n_barcodes = sum(!is.na(barcode)),
n_barcodes_central = sum(central, na.rm = TRUE),
scaffold = scaffold[1],
begin = begin[1],
end = end[1],
gene_strand = gene_strand[1],
desc = desc[1],
insertion_index = insertion_index[1],
insertion_probability = insertion_probability[1],
tn_interval = mean(tn_interval, na.rm = TRUE)
) %>%
# add verdict about essentiality to each locus
mutate(
essentiality = case_when(
insertion_index <= essential$lower_t ~ "essential",
insertion_index <= essential$upper_t & insertion_index > essential$lower_t ~ "ambiguous",
insertion_index > essential$upper_t ~ "non-essential"
)
) %>%
# optional filtering of false-positive 'essential' genes that were probably not hit
# due to low tn_interval insertion frequency
mutate(
essentiality = case_when(
insertion_probability >= 0.1 & essentiality == "essential" ~ "ambiguous",
TRUE ~ essentiality
)
) %>%
# merge with genome annotation for R.e.
left_join(read_csv("../data/ref/Ralstonia_H16_genome_annotation.csv"))
Rows: 6832 Columns: 26
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (21): uniprot, protein, protein_name, gene_name, locus_tag, EC_number, gene_ontology_IDs, System, Process, Pathway, COG, ...
dbl (5): length, start, end, feature_interval_length, Psortb_score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining, by = c("locus_tag", "end")
head(df_essential)
We can now explore the set of essential genes broken down by functional annotation, and so on.
# overview
df_essential %>% pull(essentiality) %>% table
.
ambiguous essential non-essential
260 513 5776
# plot only essential and ambiguous genes sorted by category
df_essential %>% filter(essentiality %in% c("essential", "ambiguous")) %>%
group_by(COG_Process, essentiality) %>%
summarize(n_genes = length(locus_tag)) %>%
mutate(n_genes_tot = sum(n_genes)) %>% ungroup %>%
arrange(desc(n_genes_tot)) %>%
mutate(COG_Process = replace_na(COG_Process, "Not annotated") %>%
substr(1, 16) %>% paste0("..")) %>%
xyplot(n_genes ~ COG_Process %>% factor(., unique(.)), .,
par.settings = custom.colorblind(),
groups = essentiality, ylim = c(0, 120),
between = list(x = 0.5, y = 0.5), lwd = 2,
scales = list(alternating = FALSE, x = list(rot = 25)),
as.table = TRUE, horizontal = FALSE, stack = TRUE,
border = "white", xlab = "", ylab = "N genes",
panel = function(x, y, ...) {
panel.grid(h = -1, v = -1, col = grey(0.9))
panel.barchart(x, y, ...)
panel.key(..., corner = c(0.9, 0.9), pch = 15)
}
)
`summarise()` has grouped output by 'COG_Process'. You can override using the `.groups` argument.
We can also ‘zoom in’ on different genes of interest. Now it’s gonna be really interesting. We can plot three levels of information for the CBB genes: the locus arrangement, i.e. gene map, the barcode positions and reads mapped to each position, and finally the verdict of a gene is essential or not (probably also insertion index).
This plot shows TN insertions for the CBB operon on the megaplasmid.
plots_cbb_mp <- plot_tn_insertions(chromosome = "NC_005241.1",
start_bp = 435000, end_bp = 451000)
print(plots_cbb_mp[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp[[1]], position = c(0,0,1,0.5))
This plot shows the CBB operon on chromosome 2.
plots_cbb_mp <- plot_tn_insertions(chromosome = "NC_008314.1",
start_bp = 1547000, end_bp = 1565000)
print(plots_cbb_mp[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp[[1]], position = c(0,0,1,0.5))
Only one of the genes is labeled as essential, which makes sense, because at least Rubisco was reported before as not being essential. The TnSeq data was obtained with a library grown on LB complete medium, where Calvin cycle genes should not be essential. The central region of chromosome 2 is extremely sparse with Tn insertions. This is a zoomed out version of the previous plot.
plots_cbb_mp_out <- plot_tn_insertions(chromosome = "NC_008314.1",
start_bp = 1525000, end_bp = 1580000)
print(plots_cbb_mp_out[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp_out[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp_out[[1]], position = c(0,0,1,0.5))
Finally we can also examine genes/operons that are labeled as essential, such as DNA replication proteins DnaA, DnaN, gyrB (gyrase).
plots_cbb_mp <- plot_tn_insertions(chromosome = "NC_008313.1",
start_bp = 0, end_bp = 30000)
print(plots_cbb_mp[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp[[1]], position = c(0,0,1,0.5))
Example for several essential subunits of gat operon, the Glutamyl/Aspartyl-tRNA(Gln/Asp) amidotransferase, responsible for ‘charging’ glutaminyl-tRNAs (actually transferring an amino group to a glutamyl-tRNA). Other essential enzymes (subunits) are mreBCD involved in cell cycling and DNA maintenance, mrdAB involved in petidoglycan synthesis, and lipAB involved in lipi biosynthesis.
plots_cbb_mp <- plot_tn_insertions(chromosome = "NC_008313.1",
start_bp = 110000, end_bp = 145000)
print(plots_cbb_mp[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp[[1]], position = c(0,0,1,0.5))
Example for ribosomal proteins on chromosome 1.
plots_cbb_mp <- plot_tn_insertions(chromosome = "NC_008313.1",
start_bp = 3724000, end_bp = 3758000)
print(plots_cbb_mp[[3]], position = c(0,0.6,1,1), more = TRUE)
print(plots_cbb_mp[[2]], position = c(0,0.35,1,0.75), more = TRUE)
print(plots_cbb_mp[[1]], position = c(0,0,1,0.5))
Genes were labeled according to three categories with the help of the TnSeq analysis: 1) essential
for genes with very few or no transposon insertions, 2) non-essential
for genes that were hit according to average insertion frequency, and 3) ambiguous
for genes that could fall in both categories.
THe following analysis compares essentiality from TnSeq results with essentiality for the LB (complete) medium condition as predicted by the genome scale model. We will rughly look at the following four categories:
TRUE NEGATIVE
)FALSE POSITIVE
)FALSE NEGATIVE
)TRUE POSITIVE
)The first task is to import model predictions for gene essentiality.
df_model <- read_csv("../data/input/model_gene_essentiality.csv", col_types = cols()) %>%
select(-1) %>% rename(locus_tag = gene)
New names:
* `` -> ...1
Then we add model essentiality to summary table, group by essentiality and count. 1/3 of the essential genes according to the model were also found to be essential based on Tn insertions.
df_essential <- df_model %>% group_by(locus_tag) %>%
summarize(model_essential = max(as.integer(LB_medium))) %>%
right_join(df_essential)
Joining, by = "locus_tag"
df_essential %>%
filter(!is.na(essentiality), !is.na(model_essential)) %>%
group_by(essentiality) %>%
summarize(
mod_essential = sum(model_essential == 1),
mod_non_essential = sum(model_essential == 0)
)
We can have a closer look at the overlapping set of genes (True positives), and on the set of genes that is essential according to the data, but not according to the model (False negatives). The table is arranged with True positives first, then false negatives.
df_essential %>%
filter(essentiality == "essential", !is.na(model_essential)) %>%
arrange(desc(model_essential), desc(tn_interval))
Export selected tables.
# export essentiality information for all genes
df_essential %>%
select(-c(2, 19:37)) %>%
write_csv("../data/output/essentiality_all.csv")
# export simple table with ternary encoding of essential (2), ambiguous (1),
# non-essential genes (0) for mapping with Escher
df_essential %>%
mutate(essentiality = recode(essentiality, "non-essential" = 0, "ambiguous" = 1, "essential" = 2)) %>%
select(locus_tag, essentiality) %>%
write_csv("../data/output/essentiality_escher.csv")
# export annotated pool file for library V1 and V2
df_pool_annotated %>% filter(!is.na(barcode), version == "1") %>%
select(-tn_per_gene, -length_interval, -tn_interval,
-insertion_index, -insertion_probability) %>%
write_tsv("../data/output/annotated_pool_V1.tsv")
df_pool_annotated %>% filter(!is.na(barcode), version == "2") %>%
select(-tn_per_gene, -length_interval, -tn_interval,
-insertion_index, -insertion_probability) %>%
write_tsv("../data/output/annotated_pool_V2.tsv")