Identify transgene insert loci/copy number in nanopore de novo assembly
13 days ago
ctmfarland ▴ 20

Hello,

I am trying to identify the insert loci and copy number of a transgenic expression cassette in a yeast. The cassette contains my gene of interest with a proprietary promoter and a marker. I have a tried-and-true workflow for Illumina short reads: align reads to reference+cassette, call variants, compare coverages for copy number, and identify SVs for the insert locus. Now I want to use ONT long reads to generate a de novo assembly that actually contains the cassette inserts.

Here's what I know from the short-read analysis: there are two suspected insert loci based on SV analysis, and I believe the cassette inserted several times back-to-back at each locus. That belief comes from comparing coverage at the marker gene with the surrounding coverage along the rest of the chromosome when aligned to the chromosome-level reference genome. However, the exact copy number at each locus remains uncertain.
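
As a minimal sketch of the coverage comparison described above (not the poster's actual workflow), the copy number can be approximated as the ratio of median depth over the marker gene to median depth over the flanking chromosome. The function name and the example depths are hypothetical, and the result is the total across all loci, so it cannot separate the two suspected insert sites.

```python
from statistics import median


def estimate_copy_number(marker_depths, background_depths):
    """Approximate total cassette copy number from per-base read depths.

    marker_depths:     depths over the marker gene (present only in the cassette)
    background_depths: depths over single-copy flanking sequence
    Assumes a haploid genome and roughly uniform coverage (an assumption).
    """
    background = median(background_depths)
    if background <= 0:
        raise ValueError("background depth must be positive")
    return median(marker_depths) / background


# Hypothetical depths: marker at ~300x, flanking chromosome at ~100x
print(estimate_copy_number([300, 310, 290], [100, 98, 102]))
```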

I created a de novo assembly with Flye. Although it does successfully assemble the cassette, the cassette is only present in its own scaffold that spans its length, and it is not integrated into any of the larger scaffolds. I generated several assemblies while experimenting with various Flye parameters and got the same results, with some assemblies omitting the cassette entirely.

My suspicion is that this happens because the average read length in my ONT library is ~3Kbp, while the total length of the cassette is ~4.4Kbp. Because there are multiple true inserts packed next to one another and the average ONT read is shorter than the cassette, I suspect the assembler cannot resolve the reads that span the breakends of the inserts, since they appear to conflict with one another. The assembler then cannot tell that there are multiple copies or in what order they occur, so it just assembles one copy into its own scaffold and calls it a day...
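
One rough way to check this suspicion is to count how many reads are long enough to span one or more cassette copies with some unique sequence on both sides. The sketch below is only an illustration: the 500 bp anchor on each side is an assumed value, and the function names are hypothetical.

```python
def min_spanning_length(cassette_len, copies=1, anchor=500):
    """Shortest read that covers `copies` tandem cassettes plus an anchor on each side."""
    return copies * cassette_len + 2 * anchor


def spanning_fraction(read_lengths, cassette_len, copies=1, anchor=500):
    """Fraction of reads at least as long as min_spanning_length()."""
    if not read_lengths:
        return 0.0
    needed = min_spanning_length(cassette_len, copies, anchor)
    return sum(1 for n in read_lengths if n >= needed) / len(read_lengths)


# Hypothetical read lengths; cassette length of ~4.4 kbp taken from the post
lengths = [3000, 5400, 10000, 2000]
print(spanning_fraction(lengths, 4400))            # one copy
print(spanning_fraction(lengths, 4400, copies=2))  # two tandem copies
```

If only a small fraction of reads clears the threshold for two or more copies, an assembler has little evidence for ordering the tandem copies.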

Does this seem reasonable? Any advice on further steps, e.g. gap closing?

Thanks!

ont nanopore transgene sequencing wgs • 510 views

"the average read length in my ONT library is ~3Kbp"

What is the range of read lengths in this dataset? Was this run on a MinION or a larger flowcell? How many reads are you using for assembly and, based on the genome size, what was the theoretical fold coverage of the dataset?

Since this is a plain genomic library, an average of ~3kb is shorter than one would expect from a well-made library. Perhaps re-making the library with the aim of getting much longer reads would save time/effort and offer a clear answer.


Thanks for the prompt response!

  • min: 67, max: 93,862, avg: 3,626, q25: 2737, q75: 4406
  • MinION
  • 1,150,160 reads total/used and calculated theoretical fold coverage at 443 for genome size ~9.4Mbp
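
For reference, the fold coverage quoted above follows from number of reads times mean read length divided by genome size. A small sketch using the figures from this reply (the function name is hypothetical):

```python
def theoretical_coverage(n_reads, mean_read_len, genome_size):
    """Theoretical fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


# Figures from the reply: 1,150,160 reads, mean 3,626 bp, genome ~9.4 Mbp
coverage = theoretical_coverage(1_150_160, 3_626, 9_400_000)
print(f"{coverage:.1f}x")
```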

The lab that performed the sequencing did note that they saw fragment sizes in the 5kbp range, which did seem somewhat short to them. The shipment was delayed by a day, which resulted in the dry ice sublimating, and there was some degradation, but not enough to significantly hinder the resulting library quality.

12 days ago

Interesting, your analysis might be correct.

One way forward

  • grep the FASTQ reads which contain parts of your insert (left side, right side, center, etc.). You could use https://github.com/fulcrumgenomics/fqgrep, custom scripts, or grep itself
  • align these reads to the genome/contigs you have with minimap2
  • find where these reads hit/partially hit and check the alignments. Are there reads which cross from the insert into the genome?
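
The last step can be sketched in Python under a slightly different setup than the list above: here the reads are assumed to have been aligned with minimap2 to a combined reference (genome plus the cassette as an extra sequence) and written in PAF format. The sequence name "cassette", the MAPQ cutoff and the example records are all hypothetical. The function reports reads that align to both the cassette and a genomic sequence, which are the candidates for crossing an insert junction.

```python
from collections import defaultdict


def junction_reads(paf_lines, cassette_name, min_mapq=20):
    """Return {read_name: [genomic targets]} for reads hitting cassette AND genome.

    PAF columns used: 1 = query name, 6 = target name, 12 = mapping quality.
    """
    targets = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read, target, mapq = fields[0], fields[5], int(fields[11])
        if mapq >= min_mapq:
            targets[read].add(target)
    return {
        read: sorted(hits - {cassette_name})
        for read, hits in targets.items()
        if cassette_name in hits and len(hits) > 1
    }


# Hypothetical PAF records: r1 crosses from chrIV into the cassette
example = [
    "r1\t6000\t0\t2500\t+\tchrIV\t1500000\t100000\t102500\t2400\t2500\t60",
    "r1\t6000\t2500\t6000\t+\tcassette\t4400\t0\t3500\t3400\t3500\t60",
    "r2\t3000\t0\t3000\t+\tcassette\t4400\t500\t3500\t2900\t3000\t60",
    "r3\t4000\t0\t4000\t-\tchrII\t800000\t5000\t9000\t3900\t4000\t60",
]
print(junction_reads(example, "cassette"))
```

If parts of the cassette also occur natively in the genome (for example a native promoter or marker), those alignments may get low mapping quality and be filtered out, so the cutoff may need adjusting.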

You might need to annotate your contigs, e.g. with the Helixer web service, to get an impression of what's going on.

Another way would be to attempt a reassembly, e.g. with Shasta or hifiasm --ont. Flye should be decent for fungi, though.


That was an excellent idea. I used blastn to pull out all ONT reads containing some/all of the gene of interest and mapped them to the reference genome using minimap2 with soft clipping enabled for supplementary alignments. As a result, the genomic side of the insert breakends within the reads maps correctly to the reference, and the remainder of the insert shows up as massive soft-clipped regions in the surrounding area. One of the soft-clipped regions accounts for a disproportionately large share of the overall read density, and it supports the proposed insert site from the short-read analysis. Thank you!
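
A hedged sketch of how such soft-clipped regions could be tallied from SAM records (not necessarily how it was done here): each alignment with a large soft clip contributes a breakpoint at the clipped end, and breakpoints are counted in fixed-size bins. The 500 bp clip threshold, the 100 bp bin size and the example records are assumptions for illustration.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(sam_lines, min_clip=500, bin_size=100):
    """Count large soft clips per (chromosome, binned reference position)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue  # unmapped read
        # Drop hard clips so a soft clip, if present, is the first or last operation
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        ref_end = pos + sum(n for n, op in ops if op in REF_CONSUMING) - 1
        if ops[0][1] == "S" and ops[0][0] >= min_clip:
            counts[(chrom, pos // bin_size * bin_size)] += 1
        if ops[-1][1] == "S" and ops[-1][0] >= min_clip:
            counts[(chrom, ref_end // bin_size * bin_size)] += 1
    return counts


# Hypothetical records: three reads are clipped near chrIV:102000
example = [
    "@SQ\tSN:chrIV\tLN:1500000",
    "r1\t0\tchrIV\t100001\t60\t2000M3000S\t*\t0\t0\t*\t*",
    "r2\t16\tchrIV\t100950\t60\t1080M2500S\t*\t0\t0\t*\t*",
    "r3\t0\tchrIV\t102010\t60\t2800S1500M\t*\t0\t0\t*\t*",
    "r4\t0\tchrII\t5000\t60\t3000M\t*\t0\t0\t*\t*",
]
print(clip_breakpoints(example).most_common(3))
```

Bins with the highest counts are candidate insert sites. The counts reflect how many reads reach a junction, not how many cassette copies sit at that locus.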
