Description

This R notebook is a bioinformatics pipeline to map reads from a barcoded transposon library to the genome of a target organism. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). The initial steps of processing next generation sequencing data was directly adapted from Morgan Price’s Feba repository, see also the TnSeq-pipe github repository for an overview.

Bash pipeline

Fastq raw data files were processed as outlined in the documentation for the TnSeq-pipe github repository. This step creates the barcode mappings (data/mapped/*.tsv) and the summary pool file (data/pool/pool.tsv). Mapping files from different sequencing runs were combine to one pool file using this standalone perl script from the TnSeq-pipe repo.

perl feba/bin/DesignRandomPool.pl -minN 1 \
  -pool ../R_projects/R-notebook-ralstonia-proteome/data/pool/CN_V2_pool.tsv \
  -genes ref/GCF_000009285.1_ASM928v2_genomic_trimmed.tsv \
  ../R_projects/R-notebook-ralstonia-proteome/data/mapped/*.tsv

Libraries

# optionally install repos from github
# devtools::install_github("m-jahn/lattice-tools")

suppressPackageStartupMessages({
  library(lattice)
  library(latticeExtra)
  library(latticetools)
  library(data.table)
  library(MASS)
  library(stringi)
  library(zoo)
  library(tidyverse)
})

Mapping statistics and distribution on genome

The next step is to inspect basic statistics of transposon insertions and their distribution over the genome.

Basic statistics

First read in data tables of the barcode ‘pool’, a summary of individual sequencing reads. Barcodes without mapping to the genome are removed. Data from three different sequencing runs were combined in one pool file (CN_V2_pool.tsv). A separate pool file contains a different, earlier iteration of the library and therefore a different set of transposon insertions, prepared by Kyle Kimler (CN_V1_pool.tsv).

# import seq data from first version of library
df_pool_V1 <- read_tsv("../data/pool/CN_V1_pool.tsv")
Rows: 125070 Columns: 12
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): barcode, rcbarcode, scaffold, strand, scaffold2, strand2
dbl (6): nTot, n, pos, n2, pos2, nPastEnd

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# import seq data from second version of library
df_pool_V2 <- read_tsv("../data/pool/CN_V2_pool.tsv")
Rows: 124143 Columns: 12
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): barcode, rcbarcode, scaffold, strand, scaffold2, strand2
dbl (6): nTot, n, pos, n2, pos2, nPastEnd

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# combine all in one df, removing duplicate barcodes
df_pool <- bind_rows(df_pool_V1, df_pool_V2, .id = "version") %>%
  arrange(desc(version)) %>% filter(!duplicated(barcode)) %>%
  arrange(version)

# import reference genome
df_ref <- read_tsv("../data/ref/GCF_000009285.1_ASM928v2_genomic_trimmed.tsv") %>%
  filter(!duplicated(old_locus_tag))
Rows: 7051 Columns: 7
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): scaffold, strand, desc, old_locus_tag, new_locus_tag
dbl (2): begin, end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
                           [,1]       [,2]      
version                    "1"        "2"       
N_reads                    "26329546" " 5786608"
N_unique_bc                "124649"   "124143"  
N_bc_with_2_or_more_reads  "72443"    "57040"   
N_bc_with_10_or_more_reads "68423"    "29391"   
N_bc_eith_1_read           "52206"    "67103"   
N_bc_alternative_pos       "59203"    "18073"   
N_bc_plus_strand           "61189"    "61678"   
N_bc_minus_strand          "63460"    "62465"   
N_bc_per_kbp               "16.80585" "16.73763"

Next we can plot the frequency of reads per barcoded transposons.

plot_reads_per_bc <- histogram(~ log2(nTot) | paste("strand:", strand), df_pool,
  par.settings = custom.colorblind(),
  between = list(x = 0.5, y = 0.5),
  xlab = expression("log"[2]*" reads per barcode"),
  scales = list(alternating = FALSE),
  panel = function(x, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.histogram(x, border = "white", ...)
  }
)

print(plot_reads_per_bc)

Distribution over the genome

Read frequency over genome

Each transposon insertion is indexed with a position on the genome. We can now plot insertion frequency over the genome. There are different ways to do that depending on how the data is treated. The most simple case (as done below) plotting the number of reads per transposon versus its insertion site on the genome, broken down by chromosome type (‘scaffold’).

plot_reads_on_genome <- xyplot(nTot ~ pos | scaffold,
  df_pool %>% arrange(pos),
  par.settings = custom.colorblind(),
  between = list(x = 0.5, y = 0.5),
  layout = c(1,3), type = "l", lwd = 1.5,
  scales = list(alternating = FALSE),
  panel = function(x, y, ...) {
    panel.grid(h = -1, v = -1, col = grey(0.9))
    panel.xyplot(x, y, ...)
  }
)

print(plot_reads_on_genome)