I have been reading this forum for a while now, since I started working with whole-genome sequencing (WGS) a few months ago. Most of my knowledge is self-taught, thanks to sites like this one, or acquired through a course or internship. I have seen several assembly posts but haven't found any that really helps with my problem, so I will try to give as much context as I can.
I am interested in detecting a specific insertion sequence (or something similar to it) in a set of Illumina reads (2x300 bp run on a MiSeq) obtained from a Mycobacterium species. I have already processed the reads and run a de novo assembly with SPAdes. I then built a BLAST database from the resulting contigs and "fished" for my sequence using BLAST+. Finally, I extracted my sequence of interest from the contigs file for further analysis.
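For context, the "fishing" step looks roughly like the sketch below. It assumes the contigs were turned into a nucleotide database with makeblastdb and searched with blastn using -outfmt 6, so the subject start and end are in columns 9 and 10. The contig name, query name, and toy sequences are made up for illustration and are not my real data.

```python
from io import StringIO


def read_fasta(handle):
    """Return {record_id: sequence} from an open FASTA handle."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def reverse_complement(seq):
    """Reverse-complement a plain A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def best_hit(blast_tab_lines):
    """Pick the highest-bitscore row from BLAST tabular output (-outfmt 6)."""
    best = None
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hit = {"contig": f[1], "sstart": int(f[8]),
               "send": int(f[9]), "bitscore": float(f[11])}
        if best is None or hit["bitscore"] > best["bitscore"]:
            best = hit
    return best


def extract_hit(contigs, hit, flank=0):
    """Cut the hit (plus optional flanks) out of its contig.

    BLAST coordinates are 1-based and inclusive; sstart > send means the
    hit lies on the minus strand of the contig.
    """
    seq = contigs[hit["contig"]]
    lo, hi = sorted((hit["sstart"], hit["send"]))
    region = seq[max(lo - 1 - flank, 0):min(hi + flank, len(seq))]
    return reverse_complement(region) if hit["sstart"] > hit["send"] else region


# Toy data standing in for contigs.fasta and the BLAST+ tabular output.
toy_contigs = read_fasta(StringIO(">NODE_1_length_16\nTTTTTACGTACGGGGG\n"))
toy_blast = ["IS_query\tNODE_1_length_16\t100.0\t7\t0\t0\t1\t7\t6\t12\t1e-5\t14.4\n"]
toy_hit = best_hit(toy_blast)
print(extract_hit(toy_contigs, toy_hit))
```

Passing a non-zero flank keeps some of the surrounding contig, which is useful for checking where the hit sits relative to a misassembly breakpoint.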
My issue comes when I assess the quality of my assembly. I ran QUAST against a reference sequence (approx. 5.2 Mb) from the same subspecies of mycobacteria, which also seems to carry this sequence. I think the report looks good until I get to the misassembly section:
misassemblies                 48
misassembled contigs          29
Misassembled contigs length   2421296
local misassemblies           23
scaffold gap ext. mis.        0
scaffold gap loc. mis.        0
unaligned mis. contigs        2
unaligned contigs             18 + 22 part
Unaligned length              492243
Genome fraction (%)           94.955
Duplication ratio             1.002
N's per 100 kbp               0.00
mismatches per 100 kbp        575.74
indels per 100 kbp            15.81
Largest alignment             200998
Total aligned length          4930311
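To put the numbers above in proportion, here is a rough back-of-the-envelope calculation. It uses only the values from the report, but it approximates the total assembly length as aligned plus unaligned length, which is my assumption and not a figure QUAST gave me.

```python
# Values copied from the QUAST report above.
quast = {
    "misassembled_contigs_length": 2_421_296,
    "unaligned_length": 492_243,
    "total_aligned_length": 4_930_311,
    "mismatches_per_100kbp": 575.74,
    "indels_per_100kbp": 15.81,
}

# Assumption: aligned + unaligned bases roughly equal the assembly length.
approx_assembly_length = quast["total_aligned_length"] + quast["unaligned_length"]

# Share of the assembly that sits in contigs flagged as misassembled.
misassembled_fraction = quast["misassembled_contigs_length"] / approx_assembly_length

# Share of the assembly that does not align to the reference at all.
unaligned_fraction = quast["unaligned_length"] / approx_assembly_length

# "per 100 kbp" converted to a percentage of aligned bases.
mismatch_percent = quast["mismatches_per_100kbp"] / 100_000 * 100
indel_percent = quast["indels_per_100kbp"] / 100_000 * 100

print(f"approx. assembly length : {approx_assembly_length} bp")
print(f"in misassembled contigs : {misassembled_fraction:.1%}")
print(f"unaligned               : {unaligned_fraction:.1%}")
print(f"mismatches vs reference : {mismatch_percent:.2f}%")
print(f"indels vs reference     : {indel_percent:.3f}%")
```

By this estimate a bit under half of the assembly sits in contigs that QUAST flags, and the aligned regions differ from the reference at roughly 0.58% of positions, so the reference is clearly not identical to my strain.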
One of these misassemblies appears in my contig of interest. Half of the contig maps to one side of the reference, whereas the other half maps to the other side (a relocation). My sequence of interest falls within the second half and, luckily, is not split by the relocation breakpoint. Since mycobacteria don't usually recombine, I think this contig is an artifact. I am therefore concerned about the other 28 misassembled contigs and about how I can refer to my assembly in a future publication: I don't want to upload a poor-quality assembly, and I want it to be the best possible version.
I have tried increasing the k-mer sizes beyond the SPAdes defaults but got similar results. So I wonder whether this is simply a limitation of de novo assembly from short reads, or whether there is a way to reduce the misassemblies without having to resequence with a long-read technology.
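For reference, this is roughly how a custom k-mer run can be set up. The sketch only builds the command line and does not launch SPAdes. The read file names, the output directory, and the exact k-mer list are placeholders, not necessarily what I ran. As far as I understand the SPAdes manual, the values passed to -k must be odd, below 128, and in ascending order, so the helper checks that first.

```python
def build_spades_command(reads_1, reads_2, outdir, kmers, careful=True):
    """Return a spades.py command (as an argument list) with custom k-mers.

    Only checks the k-mer constraints and assembles the arguments;
    it does not run SPAdes.
    """
    kmers = list(kmers)
    for k in kmers:
        if k <= 0 or k % 2 == 0 or k >= 128:
            raise ValueError(f"k-mer size {k} must be odd and below 128")
    if kmers != sorted(set(kmers)):
        raise ValueError("k-mer sizes must be unique and in ascending order")

    cmd = ["spades.py", "-1", reads_1, "-2", reads_2, "-o", outdir,
           "-k", ",".join(str(k) for k in kmers)]
    if careful:
        # --careful targets mismatches and short indels, not relocations.
        cmd.append("--careful")
    return cmd


# Placeholder file names; the k-mer list is an illustrative choice for long reads.
example_cmd = build_spades_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "spades_k127",
    kmers=[21, 33, 55, 77, 99, 127],
)
print(" ".join(example_cmd))
```

The returned list can be passed to subprocess.run if SPAdes is installed and on the PATH.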
Thank you very much for your help!