Question: Interpreting coverage gaps when aligning Illumina reads against PacBio assembly
written 4 weeks ago by Roxane Boyer (310), France/Marseille/IBDM

Hi everyone,

I'm still stuck on my problem with remaining indels in my assembly (see this post for further information: C: [PacBio assembly] Remaining indels after polishing).

I'm trying to understand the problem. To do so, instead of aligning RNA-seq reads (Illumina), I tried aligning DNA reads (also Illumina, paired-end) against my PacBio assembly, to get more data that is more evenly distributed. With the two BAM files (RNA-seq and DNA read alignments against the PacBio assembly) and my genome, I inspected some regions in IGV.

Indels are unevenly distributed along the genome: they seem to be clustered in a few particular regions, sometimes in introns, sometimes in exons, and they are most likely to appear in highly polymorphic regions. Interestingly, these regions (many indels, high polymorphism) show small drops in Illumina read coverage. I don't really know how to interpret these drops and would appreciate some insight. My guess is that, because the sequence in my assembly is very different there, the aligner (hisat2 in my case) couldn't align many of the reads, so the coverage decreases. I'm not really sure about this conclusion, though.
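One way to check this guess would be to tally indels per window from the alignments and compare the windows with many indels against the windows where coverage drops. The following is only a minimal sketch in plain Python: it assumes the alignments have already been exported as (reference start, CIGAR string) pairs, with 0-based starts, and the `toy` reads and the `indels_per_window` helper are hypothetical illustrations, not real data or part of any aligner.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_per_window(alignments, window=1000):
    """Count insertion/deletion events per reference window.

    alignments: iterable of (ref_start, cigar) pairs, ref_start 0-based
                (an assumption of this sketch).
    Returns a Counter mapping window index -> number of indel events.
    """
    counts = Counter()
    for ref_start, cigar in alignments:
        ref_pos = ref_start
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op in "ID":
                # Record the event in the window where it starts.
                counts[ref_pos // window] += 1
            if op in "MDN=X":
                # Only these operations consume reference bases.
                ref_pos += length
    return counts


# Hypothetical toy alignments, for illustration only.
toy = [
    (100, "50M2D48M"),
    (120, "30M1I20M3D47M"),
    (2500, "100M"),
]
print(indels_per_window(toy, window=1000))
```

Note that this counts events per read, so the same indel seen in many reads is counted many times; dividing by the number of reads overlapping each window would give a figure that is easier to compare with the coverage track.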

What do you think about these coverage drops? Could they be assembly errors?

Don't hesitate to ask me for further details,

Cheers,

Roxane


Your interpretation is very likely correct: aligners have a lot of trouble when a region contains several indels close to each other. Can you post such an image from IGV?

written 4 weeks ago by Devon Ryan (70k)

Sure! Here are some pictures. One is a zoomed-out view of the regions, so you can properly see the coverage drops, and the other is a zoomed-in view of one of those regions (which I tagged in red).

https://ibb.co/fNFfZ5 https://ibb.co/fEgnE5

modified 4 weeks ago • written 4 weeks ago by Roxane Boyer (310)

Wow, yeah, the aligners are going to have a heck of a time with that region. I wouldn't categorize those as assembly errors, but rather as errors in the underlying PacBio reads.

written 4 weeks ago by Devon Ryan (70k)

And it is not even the worst region... Some are even messier than that. So here is my problem: if they are not assembly errors, but they can't be corrected by polishing, even at high coverage, how do I correct these indels? Maybe it is "biologically correct" and reflects what is in my genome, but is it possible to have such a high degree of polymorphism in indels alone? In both introns and sometimes exons? This problem is driving me crazy; I lack the knowledge to resolve it.

written 4 weeks ago by Roxane Boyer (310)

My overall deletion rate is about 0.00026 and my insertion rate is about 0.00068 per base. This is really high, and I'm not sure I should annotate a sequence with so many indels. But I'm not sure what this high rate means. Could it "simply" reflect a high degree of polymorphism within this species?
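To put these per-base rates in more intuitive terms, the arithmetic can be sketched in a few lines of Python. The two rates are the ones quoted above; the coding-sequence length `cds_length` is a hypothetical value chosen only for illustration.

```python
deletion_rate = 0.00026   # per base, as quoted above
insertion_rate = 0.00068  # per base, as quoted above

# Combined indel rate and the average spacing between indels.
indel_rate = deletion_rate + insertion_rate
bases_per_indel = 1 / indel_rate

print(f"indels per kb: {indel_rate * 1000:.2f}")                  # 0.94
print(f"about one indel every {bases_per_indel:.0f} bases")       # 1064

# Hypothetical coding-sequence length, for illustration only.
cds_length = 1500
expected_indels = indel_rate * cds_length
print(f"expected indels in a {cds_length} bp CDS: {expected_indels:.2f}")  # 1.41
```

That is roughly one indel per kilobase on average. Because the indels are clustered rather than evenly spread, the local rate in the affected regions would be higher than this average, and lower elsewhere.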

written 4 weeks ago by Roxane Boyer (310)