Description

This R notebook is a bioinformatics pipeline to map reads from a barcoded transposon library to the genome of a target organism. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). The initial steps of processing next generation sequencing data were directly adapted from Morgan Price’s Feba repository, see also README.md of this repository.

Bash pipeline

Fastq raw data files can be processed as outlined in the README.md documentation for this repository. This creates the barcode mappings and the summary pool files.

cd /path/to/TnSeq-pipe
source/run_tnseq_mapping.sh --pattern H16.* --ref GCF_000009285.1_ASM928v2_genomic

Libraries

library(lattice)
library(latticeExtra)
library(latticetools)
library(tidyverse)
library(stringi)
library(data.table)

Mapping statistics and distribution on genome

The next step is to inspect basic statistics of transposon insertions and their distribution over the genome.

Basic statistics

First read in a pooled data table and display summary statistics. Remove barcodes without mapped position.

# import data
df_pool <- read_tsv("../data/pool/pool.tsv") %>%
  filter(!is.na(pos))

# import reference genome
df_ref <- read_tsv("../ref/GCF_000009285.1_ASM928v2_genomic.gff") %>%
  filter(!duplicated(old_locus_tag))
[1] "Number of total reads: 6029"
Number of total reads: 6029
[1] "Number of unique barcodes: 2309"
Number of unique barcodes: 2309
[1] "Number of barcodes with >= 2 reads: 680"
Number of barcodes with >= 2 reads: 680
[1] "Number of barcodes with >= 10 reads: 112"
Number of barcodes with >= 10 reads: 112
[1] "Number of barcodes with only 1 read: 1629"
Number of barcodes with only 1 read: 1629
[1] "Number of barcodes with > 1 position: 82"
Number of barcodes with > 1 position: 82
[1] "Number of barcodes on -/+ strand: 1181, 1128"
Number of barcodes on -/+ strand: 1181, 1128

Next we can plot the frequency of reads per barcoded transposons.

plot_reads_per_bc <- histogram(~ log2(nTot) | paste("strand:", strand), df_pool,
  par.settings = custom.colorblind(), breaks = 8,
  between = list(x = 0.5, y = 0.5), xlim = c(-0.5, 7.5),
  xlab = expression("log"[2]*" reads per barcode"),
  scales = list(alternating = FALSE),
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_reads_per_bc)

Distribution over the genome

Read frequency over genome

Each transposon insertion is indexed with a position on the genome. We can now plot insertion frequency over the genome. There are different ways to do that depending on how the data is treated. The most simple case (as done below) is plotting the number of reads per transposon versus its insertion site on the genome, broken down by chromosome type (‘scaffold’).

plot_reads_on_genome <- xyplot(nTot ~ pos | scaffold,
  df_pool %>% arrange(pos),
  par.settings = custom.colorblind(),
  between = list(x = 0.5, y = 0.5),
  layout = c(1,3), type = "l", lwd = 1.5,
  scales = list(alternating = FALSE),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.xyplot(x, y, ...)
  }
)

print(plot_reads_on_genome)

Tn insertion frequency over genome

However this does not really reflect the actual insertion frequency. For the frequency, what matters is the number of different, unique insertions per kb of the genome. We can apply a density function or generate a rolling mean to evaluate frequency. Here, a defined window of 10,000 bp was used, and the sum of Tn insertion events per window was determined.

plot_Tns_on_genome <- df_pool %>% arrange(pos) %>%
  mutate(interval = cut_interval(pos, length = 10000, labels = FALSE)*10000) %>%
  group_by(scaffold, interval) %>%
  summarize(tn_frequency = length(pos)) %>%
  
  xyplot(tn_frequency ~ interval | scaffold, .,
    par.settings = custom.colorblind(),
    ylab = "Tn insertions / 10 kb",
    between = list(x = 0.5, y = 0.5),
    layout = c(1,3), type = "l", lwd = 1.5,
    scales = list(alternating = FALSE),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_Tns_on_genome)

Mapping barcodes to genes

The basic Feba scripts produce a table of barcodes, their frequencies and genomic position information. What we really want to know is how many transposons/barcodes are mapped to each gene, which position within a gene they have, how many barcodes do not map to a gene (intergenic, low importance), and how many genes were not hit by a transposon (probably essential). For this purpose we can use the function foverlaps() from package data.table to map transposon insertion sites to genes (or vice versa). The following part was inspired by previous work of Kyle Kimler (github link).

# prepare input data in form of data tables
dt_pool <- data.table(df_pool)
dt_ref <- data.table(df_ref)

#dummy begin/end columns are created in the pool file to allow foverlap function
dt_pool$begin <- dt_pool$pos
dt_pool$end <- dt_pool$pos

# map Tn insertion sites to genes
setkey(dt_ref, scaffold, begin, end)
df_pool_annotated <- foverlaps(dt_pool, dt_ref, 
    by.x = c("scaffold", "begin", "end"), type = "within") %>%
  as_tibble %>%
  select(!matches("2$|^i\\.[be]")) %>%
  rename(gene_strand = strand, strand = i.strand) %>%
  # exclude alternative mappings in case of overlapping genes
  filter(!duplicated(barcode)) %>%
  # include also unhit genes in main table
  full_join(df_ref) %>%
  # sort by scaffold, then position
  arrange(scaffold, begin)

head(df_pool_annotated)

Gene insertion frequency

Now that all transposons are mapped to genes (if possible), we can calculate basic statistics about how many genes were hit, how many transposons inserted in a gene on average, and how many transposons hit intergenic regions.

Barcodes per gene type

The majority of transposon insertions should take place in gene ORFs. NA are intergenic regions with no annotated function.

df_pool_annotated %>%
  group_by(desc) %>%
  summarize(n_barcodes = sum(!is.na(barcode))) %>%
  mutate(desc = replace_na(desc, "intergenic")) %>%
  arrange(desc(n_barcodes))

Insertions per gene

plot_insertions_per_gene <- df_pool_annotated %>% 
  filter(!is.na(old_locus_tag)) %>%
  group_by(old_locus_tag) %>%
  summarize(n_barcodes = length(unique(barcode))) %>%
  
  histogram( ~ n_barcodes, .,
  par.settings = custom.colorblind(),
  breaks = 10,
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_insertions_per_gene)


Top 10 genes by number of Tn insertions

df_pool_annotated %>% 
  filter(!is.na(old_locus_tag)) %>%
  group_by(old_locus_tag) %>%
  summarize(n_barcodes = length(unique(barcode))) %>%
  arrange(desc(n_barcodes)) %>% slice(1:10)

Position of transposons within a gene

The mapping of a transposon to a gene also reveals its relative position within the gene. We can use this information to tag insertions as more likely to have a fitness effect, or not. We can also filter out transposons that lie outside the central portion of a gene (e.g. 10% margin to each side), or within a fixed flanking region (e.g. first or last 100 bp). The original FEBA protocol from Morgan Price uses a quality filter that requires transposons be located within the central 80% of a gene. We follow this definition and flag transposons outside the central portion of a gene.

# apply margin of 10% gene length
df_pool_annotated <- df_pool_annotated %>%
  mutate(
    gene_length = end-begin,
    pos_relative = (pos-begin)/(end-begin),
    central = between(pos_relative, 0.1, 0.9)
  )

How many transposons that inserted into a gene are central? We can summarize.

df_pool_annotated %>%
  filter(!is.na(central)) %>%
  group_by(central) %>%
  summarize(frequence = length(pos))

How are insertions distributed over the gene, in relative position? There is a trend towards higher insertion frequency at the termini of genes. Otherwise the insertion frequency is homogeneously distributed.

plot_insertion_position <- df_pool_annotated %>% 
  filter(!is.na(central)) %>%
  
  histogram( ~ pos_relative, .,
  par.settings = custom.colorblind(),
  breaks = 50,
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_insertion_position)

Export result tables and figures

Export the annotated pool file containing the final barcodes and their locations in the genome.

df_pool_annotated %>% filter(!is.na(barcode)) %>%
  write_tsv("../data/pool/pool_genes.tsv")

Export figures to *.svg image files.

for (plt in grep("^plot_", ls(), value = TRUE)) {
  svg(filename = paste0("../images/", plt, ".svg"), width = 6, height = 4)
    print(get(plt))
  dev.off()
}
---
title: "TnSeq -- mapping barcoded transposons to genome"
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: "Michael Jahn"
output:
  html_notebook: 
    theme: spacelab
    toc: yes
---

## Description

This R notebook is a bioinformatics pipeline to map reads from a barcoded transposon library to the genome of a target organism. For background and details regarding the method, see [Wetmore at al., mBio, 2015](https://mbio.asm.org/content/6/3/e00306-15) and [Price et al., Nature, 2018](http://www.nature.com/articles/s41586-018-0124-0)). The initial steps of processing next generation sequencing data were directly adapted from [Morgan Price's Feba repository](https://bitbucket.org/berkeleylab/feba/src/master/), see also `README.md` of this repository.

## Bash pipeline

`Fastq` raw data files can be processed as outlined in the `README.md` documentation for this repository. This creates the barcode mappings and the summary pool files.

```{bash, eval = FALSE}
cd /path/to/TnSeq-pipe
source/run_tnseq_mapping.sh --pattern H16.* --ref GCF_000009285.1_ASM928v2_genomic
```

## Libraries

```{r, message = FALSE}
library(lattice)
library(latticeExtra)
library(latticetools)
library(tidyverse)
library(stringi)
library(data.table)
```

## Mapping statistics and distribution on genome

The next step is to inspect basic statistics of transposon insertions and their distribution over the genome.

### Basic statistics

First read in a pooled data table and display summary statistics. Remove barcodes without mapped position.

```{r, message = FALSE}
# import data
df_pool <- read_tsv("../data/pool/pool.tsv") %>%
  filter(!is.na(pos))

# import reference genome
df_ref <- read_tsv("../ref/GCF_000009285.1_ASM928v2_genomic.gff") %>%
  filter(!duplicated(old_locus_tag))
```


```{r, echo = FALSE, message = FALSE}
paste("Number of total reads:", sum(df_pool$nTot))
paste("Number of unique barcodes:", nrow(df_pool))
paste("Number of barcodes with >= 2 reads:", df_pool %>% filter(nTot >= 2) %>% nrow)
paste("Number of barcodes with >= 10 reads:", df_pool %>% filter(nTot >= 10) %>% nrow)
paste("Number of barcodes with only 1 read:", df_pool %>% filter(nTot == 1) %>% nrow)
paste("Number of barcodes with > 1 position:", df_pool %>% filter(n2 > 0) %>% nrow)
paste("Number of barcodes on -/+ strand:", df_pool %>% group_by(strand) %>%
  summarize(n = length(n)) %>% pull(n) %>% paste(collapse = ", "))
```

Next we can plot the frequency of reads per barcoded transposons.

```{r, message = FALSE}
plot_reads_per_bc <- histogram(~ log2(nTot) | paste("strand:", strand), df_pool,
  par.settings = custom.colorblind(), breaks = 8,
  between = list(x = 0.5, y = 0.5), xlim = c(-0.5, 7.5),
  xlab = expression("log"[2]*" reads per barcode"),
  scales = list(alternating = FALSE),
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_reads_per_bc)
```


### Distribution over the genome

**Read frequency over genome**

Each transposon insertion is indexed with a position on the genome. We can now plot insertion frequency over the genome. There are different ways to do that depending on how the data is treated. The most simple case (as done below) is plotting the number of reads per transposon versus its insertion site on the genome, broken down by chromosome type ('scaffold').

```{r}
plot_reads_on_genome <- xyplot(nTot ~ pos | scaffold,
  df_pool %>% arrange(pos),
  par.settings = custom.colorblind(),
  between = list(x = 0.5, y = 0.5),
  layout = c(1,3), type = "l", lwd = 1.5,
  scales = list(alternating = FALSE),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.xyplot(x, y, ...)
  }
)

print(plot_reads_on_genome)
```

**Tn insertion frequency over genome**

However this does not really reflect the actual insertion *frequency*. For the frequency, what matters is the number of different, unique insertions per kb of the genome. We can apply a density function or generate a rolling mean to evaluate frequency. Here, a defined window of 10,000 bp was used, and the sum of Tn insertion events per window was determined.

```{r, message = FALSE}
plot_Tns_on_genome <- df_pool %>% arrange(pos) %>%
  mutate(interval = cut_interval(pos, length = 10000, labels = FALSE)*10000) %>%
  group_by(scaffold, interval) %>%
  summarize(tn_frequency = length(pos)) %>%
  
  xyplot(tn_frequency ~ interval | scaffold, .,
    par.settings = custom.colorblind(),
    ylab = "Tn insertions / 10 kb",
    between = list(x = 0.5, y = 0.5),
    layout = c(1,3), type = "l", lwd = 1.5,
    scales = list(alternating = FALSE),
    panel = function(x, y, ...) {
      panel.grid(h = -1, v = -1, col = grey(0.9))
      panel.xyplot(x, y, ...)
    }
  )

print(plot_Tns_on_genome)
```

### Mapping barcodes to genes


The basic Feba scripts produce a table of barcodes, their frequencies and genomic position information. What we really want to know is **how many transposons/barcodes are mapped to each gene, which position within a gene they have, how many barcodes do not map to a gene (intergenic, low importance), and how many genes were not hit by a transposon (probably essential)**. For this purpose we can use the function `foverlaps()` from package `data.table` to map transposon insertion sites to genes (or vice versa). The following part was inspired by previous work of Kyle Kimler ([github link](https://github.com/kylekimler/)).

```{r, message = FALSE}
# prepare input data in form of data tables
dt_pool <- data.table(df_pool)
dt_ref <- data.table(df_ref)

#dummy begin/end columns are created in the pool file to allow foverlap function
dt_pool$begin <- dt_pool$pos
dt_pool$end <- dt_pool$pos

# map Tn insertion sites to genes
setkey(dt_ref, scaffold, begin, end)
df_pool_annotated <- foverlaps(dt_pool, dt_ref, 
    by.x = c("scaffold", "begin", "end"), type = "within") %>%
  as_tibble %>%
  select(!matches("2$|^i\\.[be]")) %>%
  rename(gene_strand = strand, strand = i.strand) %>%
  # exclude alternative mappings in case of overlapping genes
  filter(!duplicated(barcode)) %>%
  # include also unhit genes in main table
  full_join(df_ref) %>%
  # sort by scaffold, then position
  arrange(scaffold, begin)

head(df_pool_annotated)
```

### Gene insertion frequency

Now that all transposons are mapped to genes (if possible), we can calculate basic statistics about how many genes were hit, how many transposons inserted in a gene on average, and how many transposons hit intergenic regions.

**Barcodes per gene type**

The majority of transposon insertions should take place in gene ORFs. `NA` are intergenic regions with no annotated function.

```{r, message = FALSE}
df_pool_annotated %>%
  group_by(desc) %>%
  summarize(n_barcodes = sum(!is.na(barcode))) %>%
  mutate(desc = replace_na(desc, "intergenic")) %>%
  arrange(desc(n_barcodes))
```

----------

**Insertions per gene**

```{r, message = FALSE}
plot_insertions_per_gene <- df_pool_annotated %>% 
  filter(!is.na(old_locus_tag)) %>%
  group_by(old_locus_tag) %>%
  summarize(n_barcodes = length(unique(barcode))) %>%
  
  histogram( ~ n_barcodes, .,
  par.settings = custom.colorblind(),
  breaks = 10,
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_insertions_per_gene)
```

----------

**Top 10 genes by number of Tn insertions**

```{r, message = FALSE}
df_pool_annotated %>% 
  filter(!is.na(old_locus_tag)) %>%
  group_by(old_locus_tag) %>%
  summarize(n_barcodes = length(unique(barcode))) %>%
  arrange(desc(n_barcodes)) %>% slice(1:10)
```


### Position of transposons within a gene

The mapping of a transposon to a gene also reveals its relative position within the gene. We can use this information to tag insertions as more likely to have a fitness effect, or not. We can also filter out transposons that lie outside the central portion of a gene (e.g. 10% margin to each side), or within a fixed flanking region (e.g. first or last 100 bp). The original FEBA protocol from Morgan Price uses a quality filter that requires transposons be located within the central 80% of a gene. We follow this definition and flag transposons outside the central portion of a gene.

```{r}
# apply margin of 10% gene length
df_pool_annotated <- df_pool_annotated %>%
  mutate(
    gene_length = end-begin,
    pos_relative = (pos-begin)/(end-begin),
    central = between(pos_relative, 0.1, 0.9)
  )
```

----------

How many transposons that inserted into a gene are central? We can summarize.

```{r, message = FALSE}
df_pool_annotated %>%
  filter(!is.na(central)) %>%
  group_by(central) %>%
  summarize(frequence = length(pos))
```

----------

How are insertions distributed over the gene, in relative position? There is a trend towards higher insertion frequency at the termini of genes. Otherwise the insertion frequency is homogeneously distributed.

```{r}
plot_insertion_position <- df_pool_annotated %>% 
  filter(!is.na(central)) %>%
  
  histogram( ~ pos_relative, .,
  par.settings = custom.colorblind(),
  breaks = 50,
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_insertion_position)
```


## Export result tables and figures

Export the annotated pool file containing the final barcodes and their locations in the genome.

```{r}
df_pool_annotated %>% filter(!is.na(barcode)) %>%
  write_tsv("../data/pool/pool_genes.tsv")
```

Export figures to `*.svg` image files.

```{r}
for (plt in grep("^plot_", ls(), value = TRUE)) {
  svg(filename = paste0("../images/", plt, ".svg"), width = 6, height = 4)
    print(get(plt))
  dev.off()
}
```

