Hi everyone,

I'm still stuck on my problem with remaining indels in my assembly (see this post for further information: C: [PacBio assembly] Remaining indels after polishing).

I'm trying to understand my problem. To do so, instead of aligning RNA-seq reads (Illumina), I tried aligning DNA reads (also Illumina, paired-end) against my PacBio assembly, to get more data that is more evenly distributed. With two BAM files (RNA-seq and DNA reads aligned against the PacBio assembly) and my genome, I checked some regions using IGV.

Indels are unevenly distributed along the genome. They seem to be clustered in some very particular regions, sometimes in introns, sometimes in exons, but they most often appear in very polymorphic regions. Interestingly, these regions (many indels, high polymorphism) show small drops in Illumina read coverage. I don't really know how to interpret these drops and I need some enlightenment. My guess is that, because the sequence in my assembly differs so much from the reads, the aligner (hisat2 in my case) couldn't align a lot of reads, so the coverage decreases. I'm not really sure about my conclusions though.
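One rough way to test that guess beyond eyeballing IGV would be to bin the alignments into fixed windows and compare aligned bases against the number of indel events per window. The sketch below is only an illustration, not what I actually ran: `window_stats` is a hypothetical helper, the 1 kb window is arbitrary, and it assumes the alignments are already available as (0-based start position, CIGAR string) pairs for one contig, for example taken from the `reference_start` and `cigarstring` fields that pysam exposes for each read.

```python
import re
from collections import defaultdict

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def window_stats(alignments, window=1000):
    """Count aligned bases and indel events per fixed-size window.

    alignments: iterable of (pos, cigar) pairs for a single contig, where
    pos is the 0-based leftmost reference position of the alignment.
    Returns {window_index: {"aligned_bases": int, "indels": int}}.
    """
    stats = defaultdict(lambda: {"aligned_bases": 0, "indels": 0})
    for pos, cigar in alignments:
        ref = pos
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":
                # Aligned bases may span a window boundary, so split them.
                end = ref + length
                while ref < end:
                    w = ref // window
                    step = min(end, (w + 1) * window) - ref
                    stats[w]["aligned_bases"] += step
                    ref += step
            elif op == "D":
                # Deletion relative to the assembly: one event, consumes reference.
                stats[ref // window]["indels"] += 1
                ref += length
            elif op == "I":
                # Insertion relative to the assembly: one event, no reference consumed.
                stats[ref // window]["indels"] += 1
            elif op == "N":
                # Spliced gap (RNA-seq): consumes reference, not counted as an indel.
                ref += length
            # S, H and P consume no reference bases and are ignored here.
    return dict(stats)


# Toy input, made up for illustration only.
toy = [(0, "10M"), (995, "3M1I7M"), (1000, "5M2D5M")]
result = window_stats(toy, window=1000)
for w in sorted(result):
    mean_cov = result[w]["aligned_bases"] / 1000
    print(w, result[w]["indels"], round(mean_cov, 3))
```

If windows with many indel events consistently show lower mean coverage than their neighbours, that would at least be consistent with the aligner dropping reads there.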

What do you think about these coverage drops? Could they be assembly errors?

Don't hesitate to ask me for further details.



Your interpretation is very likely correct: aligners have a lot of trouble when a region has a number of indels near each other. Can you post such an image from IGV?

Sure! Here are some pictures. One is a zoomed-out view of the regions, so you can properly see the coverage drop, and the other is a zoomed-in view of one of those regions (which I tagged in red).

Wow, yeah, the aligners are going to have a heck of a time with that region. I wouldn't categorize those as assembly errors, but rather as errors in the underlying PacBio reads.

And it is not even the worst region... Some are even dirtier than that. So here is my problem. If they are not assembly errors, but they can't be corrected by polishing, even at high coverage, how do I remove these indels? Maybe it is "biologically correct" and reflects what is in my genome, but is it possible to have a high degree of polymorphism just in indels like that? Within both introns and sometimes exons? This problem is driving me crazy; I lack a bit of knowledge to resolve it.

My overall deletion rate is about 0.00026 and my insertion rate is about 0.00068 per base. This is really high, and I'm not sure I should annotate a sequence with so many indels. But I'm not sure what this high rate means. Could it "simply" reflect a high degree of polymorphism within this species?
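To put those per-base rates on a more intuitive scale, here is a small back-of-the-envelope calculation. `indel_summary` is just a hypothetical helper name; the only inputs are the two rates quoted above, and it assumes indels are counted as events per assembly base.

```python
def indel_summary(ins_rate, del_rate):
    """Convert per-base insertion/deletion rates into easier-to-read figures."""
    total = ins_rate + del_rate
    return {
        "indels_per_kb": total * 1000,            # events per 1,000 bases
        "mean_spacing_bp": 1 / total,             # average distance between events
        "ins_to_del_ratio": ins_rate / del_rate,  # how skewed toward insertions
    }


summary = indel_summary(ins_rate=0.00068, del_rate=0.00026)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

That works out to roughly 0.94 indels per kb, or about one indel every 1,064 bp on average, with about 2.6 insertions for every deletion. Since the indels are clustered rather than evenly spread, the local rate in the affected regions would be higher than this genome-wide average.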

