Question: Single end or paired end for variant calling
gravatar for memory_donk
5.3 years ago by
memory_donk310 wrote:

Hi Biostars,

I want to align reads from a non-model microbat genome to the repeat-masked version of the published microbat (Myotis lucifugus) genome and do variant calling. I have short insert paired-end data generated on an Illumina Nextseq. The Myotis lucifugus genome has fairly gappy scaffolds and is of course a different, albeit closely related species. 

Is more appropriate for me to align my reads as single-end or paired-end reads? BWA and other similar aligners penalize unpaired reads heavily by default. My concern is that I will have reads thrown out because their pair either falls within a masked repetitive region or in a gap of N's in the scaffold.

As a side note, when I run BWA with defaults, I get radically different mapping percentages depending on whether I align the reads as single-ended (~60% mapping) or paired-end (~85% mapping). Is this because BWA is penalizing the single-ended reads for not having a mate pair? Would I fix this by reducing the -U penalty for an unpaired read pair from its default of 17 to 0?

Sorry that this is a few questions bundled together. This is also my first time posting here and I apologize if I've missed a rule.

snp alignment genome • 2.7k views
ADD COMMENTlink modified 5.3 years ago by John12k • written 5.3 years ago by memory_donk310
gravatar for cpad0112
5.3 years ago by
Hyderabad India
cpad011214k wrote:

If you have paired end data, go with paired end alignment, irrespective advantages and/or disadvantages with an aligner. Theoretically and practically, paired end data is better than single end/unpaired data for alignment.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by cpad011214k

Thanks for your response, much appreciated. Is there a reason why paired-end is better in this particular case? The reasoning definitely makes sense when you're doing de novo assembly and other similar applications, but when aligning to a genome which could contain gaps, small rearrangements (which I wont be studying), masked repeats etc it seems like losing reads because their pairs don't align could potentially be a problem. 

ADD REPLYlink written 5.3 years ago by memory_donk310

Yes.  Paired-end reads are essentially "longer" reads, allowing the aligner to better align around repetitive regions, gaps, etc.  

ADD REPLYlink written 5.3 years ago by Sean Davis26k

If there are two spots in the genome where AGCTTTACGCTTTCGACTAGGA maps to, one on chr1 the other on chr17, you wont know where it goes.

But in PE you also have a mate. In this example, lets say the mate can map to either chr14, or, fortunately, chr17 just 700bp downstream of the above - now it is obvious where both of these reads map, you can be confident in their mapping, and you know exactly how long the fragment is even though you did no sample size selection :) 

ADD REPLYlink written 5.3 years ago by John12k

Definitely makes sense, thanks! I've gone with paired end mapping. I'm getting really good coverage of my reference genome too so I think I'm in good shape now :).

ADD REPLYlink written 5.3 years ago by memory_donk310
gravatar for John
5.3 years ago by
John12k wrote:

You could always do both - map first with paired-end, then take the unmapped pairs and make them singles. If you have a tight enough insert size distribution, you can then extend these later by Xbp upstream. It might not be worth doing any of this if you find you get 99% of your reads mapping in PE anyway, but if the paired-end alone is totally unusable, you can do this

ADD COMMENTlink written 5.3 years ago by John12k

Thanks. For perspective, only ~70% of my reads are mapping in proper pairs according to samtools flagstat. I was also working with a museum specimen rather than a fresh sample and did a much less stringent size selection to preserve material. I've used a pair-merging program previously which showed that at least ~30% of my reads are also overlapping (so in principle I could also map these as long single-ended reads rather than pairs). I'll definitely try it a couple of ways.

ADD REPLYlink written 5.3 years ago by memory_donk310

70% isnt too bad you know - particularly when you're mapping to a different species!! I wouldn't bother with my suggestion then, as it will cause headaches down the road. Many programs work with either PE or SE, and cant handle both. I'd keep the 70% and be happy it wasn't 40% ;) Probably thanks to those long reads :)

ADD REPLYlink written 5.3 years ago by John12k

Well that's good to know! I did try mapping them as single ended and paired and also tried mapping the single end reads  with the default -U parameter (unpaired penalty) and setting it to zero, it made no difference so I'm guessing BWA is smart enough to know when to use it and when not. Paired did have a slightly higher mapping percent than single, but I think I know why now. By default, bwa mem does mate rescue, where, if one of a pair maps but the other's map quality is too low, it will try to map it again using the SW algorithm. If it passes SW then it will still get mapped. I believe it only does this in paired end mode.

Thanks again for your suggestions John, much appreciated.

ADD REPLYlink written 5.3 years ago by memory_donk310
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1075 users visited in the last hour