Hello,
I am trying to identify the insertion loci and copy number of a transgenic expression cassette in a yeast. The cassette contains my gene of interest with a proprietary promoter and a marker. I have a tried-and-true workflow for Illumina short reads: align reads to reference + cassette, call variants, compare coverage for copy number, and identify SVs for the insertion locus. Now I want to use ONT long reads to generate a de novo assembly that actually contains the cassette inserts.
Here's what I know from short read analysis: there are two suspected insert loci based on SV analysis, and I believe the cassette inserted itself several times back-to-back at each locus based on comparison of coverage at the marker gene vs. surrounding coverage along the rest of the chromosome when aligned to the chromosome-level reference genome. However, the exact copy number at each locus remains uncertain.
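For anyone following along, the coverage comparison described above boils down to a simple ratio. This is only a minimal sketch: the function name, the `ploidy` parameter, and the depth values in the usage line are made up for illustration, and it assumes the per-base depths have already been extracted from the alignment by some other tool.

```python
from statistics import median

def estimate_copy_number(cassette_depths, background_depths, ploidy=1):
    """Rough copy-number estimate: median depth over the cassette (e.g. the
    marker gene) divided by median depth over the surrounding chromosome.

    Assumes both inputs are lists of per-base read depths. `ploidy` scales
    the ratio (1 for a haploid strain); it is an illustrative parameter.
    """
    if not cassette_depths or not background_depths:
        raise ValueError("both depth lists must be non-empty")
    background = median(background_depths)
    if background == 0:
        raise ValueError("background depth is zero; cannot form a ratio")
    return ploidy * median(cassette_depths) / background

# Hypothetical depths, not real data
print(estimate_copy_number([300, 310, 290], [100, 98, 102]))
```

Note that this gives the total copy number across all loci. It cannot split the copies between the two suspected loci, which is why the per-locus count stays uncertain with short reads alone.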
I created a de novo assembly with Flye. It does successfully assemble the cassette, but the cassette appears only in its own scaffold, which spans just the cassette's length, and is not integrated into any of the larger scaffolds. I generated several more assemblies with different Flye parameters and got the same result, with some assemblies omitting the cassette entirely.
My suspicion is that this happens because the average read length in my ONT library is ~3 kbp, while the cassette is ~4.4 kbp long. With multiple copies inserted back-to-back and most reads shorter than a single copy, few reads can span a whole copy plus the unique sequence on either side. The reads that cross the insert breakends then appear to conflict with one another, so the assembler cannot tell how many copies there are or in what order they sit. It just assembles one copy into its own scaffold and calls it a day...
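One way to sanity-check this suspicion is to ask how many reads are even long enough to cross a tandem array of a given size. The sketch below is a back-of-the-envelope check, not part of any assembler: the function names and the 500 bp flank requirement are assumptions chosen for illustration, and being long enough is a necessary condition only, since the read also has to land in the right place.

```python
def min_spanning_length(cassette_len, copies, flank=500):
    """Shortest read that could cross `copies` back-to-back cassettes while
    anchoring `flank` bp of unique sequence on each side (flank is assumed)."""
    return copies * cassette_len + 2 * flank

def fraction_long_enough(read_lengths, cassette_len, copies, flank=500):
    """Fraction of reads at least as long as the minimum spanning length."""
    if not read_lengths:
        return 0.0
    need = min_spanning_length(cassette_len, copies, flank)
    return sum(1 for n in read_lengths if n >= need) / len(read_lengths)

# Cassette length from the post (~4.4 kbp); read lengths are hypothetical
print(min_spanning_length(4400, 1))   # one copy plus flanks
print(min_spanning_length(4400, 3))   # three tandem copies plus flanks
print(fraction_long_enough([3000, 6000, 15000, 2000], 4400, 1))
```

Under these assumptions a single copy already needs a read of about 5.4 kbp, so a library averaging ~3 kbp will have few reads that can anchor even one copy, let alone a multi-copy array.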
Does this seem reasonable? Any advice on further steps, e.g. gap closing?
Thanks!
What is the range of read lengths in this dataset? Was this run on a minION or larger flowcell? How many reads are you using for assembly and based on the genome size what was the theoretical fold coverage in the dataset?
Since this is a plain genomic library, an average read length of ~3 kb is shorter than what one would expect from a well-made library. Perhaps re-making the library with the aim of getting much longer reads would save time and effort and give a clear answer.
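The read-length range and theoretical fold coverage asked about above can be computed from a list of read lengths and the genome size. This is a minimal sketch: `read_stats` is a made-up helper, the read lengths are assumed to have been pulled from the FASTQ already, and the numbers in the usage line are toy values rather than anything from this dataset.

```python
def read_stats(read_lengths, genome_size):
    """Summarise a read set: count, min/max/mean length, read N50, and
    theoretical fold coverage (total bases / genome size)."""
    if not read_lengths or genome_size <= 0:
        raise ValueError("need at least one read and a positive genome size")
    lengths = sorted(read_lengths, reverse=True)
    total = sum(lengths)
    # N50: length of the read at which the running total, taken from the
    # longest read downwards, first reaches half of all sequenced bases.
    running = 0
    n50 = 0
    for n in lengths:
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return {
        "reads": len(lengths),
        "min": lengths[-1],
        "max": lengths[0],
        "mean": total / len(lengths),
        "n50": n50,
        "coverage": total / genome_size,
    }

# Toy values for illustration only
print(read_stats([1000, 2000, 3000, 4000], genome_size=1000))
```

The mean alone can hide a useful tail of long reads, so the maximum and N50 are worth reporting alongside it.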
Thanks for the prompt response!
The lab that performed the sequencing did note that they saw fragment sizes in the 5 kbp range, which did seem somewhat short to them. The shipment was delayed by a day, which resulted in the dry ice sublimating, and there was some degradation, but not enough to significantly hinder the resulting library quality.