Single end or paired end for variant calling
2
1
Entering edit mode
8.5 years ago
memory_donk ▴ 360

Hi Biostars,

I want to align reads from a non-model microbat genome to the repeat-masked version of the published microbat (Myotis lucifugus) genome and do variant calling. I have short insert paired-end data generated on an Illumina Nextseq. The Myotis lucifugus genome has fairly gappy scaffolds and is of course a different, albeit closely related species.

Is more appropriate for me to align my reads as single-end or paired-end reads? BWA and other similar aligners penalize unpaired reads heavily by default. My concern is that I will have reads thrown out because their pair either falls within a masked repetitive region or in a gap of N's in the scaffold.

As a side note, when I run BWA with defaults, I get radically different mapping percentages depending on whether I align the reads as single-ended (~60% mapping) or paired-end (~85% mapping). Is this because BWA is penalizing the single-ended reads for not having a mate pair? Would I fix this by reducing the -U penalty for an unpaired read pair from its default of 17 to 0?

Sorry that this is a few questions bundled together. This is also my first time posting here and I apologize if I've missed a rule.

SNP genome alignment • 4.3k views
ADD COMMENT
1
Entering edit mode
8.5 years ago
John 13k

You could always do both - map first with paired-end, then take the unmapped pairs and make them singles. If you have a tight enough insert size distribution, you can then extend these later by Xbp upstream. It might not be worth doing any of this if you find you get 99% of your reads mapping in PE anyway, but if the paired-end alone is totally unusable, you can do this

ADD COMMENT
0
Entering edit mode

Thanks. For perspective, only ~70% of my reads are mapping in proper pairs according to samtools flagstat. I was also working with a museum specimen rather than a fresh sample and did a much less stringent size selection to preserve material. I've used a pair-merging program previously which showed that at least ~30% of my reads are also overlapping (so in principle I could also map these as long single-ended reads rather than pairs). I'll definitely try it a couple of ways.

ADD REPLY
1
Entering edit mode

70% isnt too bad you know - particularly when you're mapping to a different species!! I wouldn't bother with my suggestion then, as it will cause headaches down the road. Many programs work with either PE or SE, and cant handle both. I'd keep the 70% and be happy it wasn't 40% ;) Probably thanks to those long reads :)

ADD REPLY
0
Entering edit mode

Well that's good to know! I did try mapping them as single ended and paired and also tried mapping the single end reads with the default -U parameter (unpaired penalty) and setting it to zero, it made no difference so I'm guessing BWA is smart enough to know when to use it and when not. Paired did have a slightly higher mapping percent than single, but I think I know why now. By default, bwa mem does mate rescue, where, if one of a pair maps but the other's map quality is too low, it will try to map it again using the SW algorithm. If it passes SW then it will still get mapped. I believe it only does this in paired end mode.

Thanks again for your suggestions John, much appreciated.

ADD REPLY
0
Entering edit mode
8.5 years ago

If you have paired end data, go with paired end alignment, irrespective advantages and/or disadvantages with an aligner. Theoretically and practically, paired end data is better than single end/unpaired data for alignment.

ADD COMMENT
0
Entering edit mode

Thanks for your response, much appreciated. Is there a reason why paired-end is better in this particular case? The reasoning definitely makes sense when you're doing de novo assembly and other similar applications, but when aligning to a genome which could contain gaps, small rearrangements (which I wont be studying), masked repeats etc it seems like losing reads because their pairs don't align could potentially be a problem.

ADD REPLY
1
Entering edit mode

If there are two spots in the genome where AGCTTTACGCTTTCGACTAGGA maps to, one on chr1 the other on chr17, you wont know where it goes.

But in PE you also have a mate. In this example, lets say the mate can map to either chr14, or, fortunately, chr17 just 700bp downstream of the above - now it is obvious where both of these reads map, you can be confident in their mapping, and you know exactly how long the fragment is even though you did no sample size selection :)

ADD REPLY
0
Entering edit mode

Definitely makes sense, thanks! I've gone with paired end mapping. I'm getting really good coverage of my reference genome too so I think I'm in good shape now :).

ADD REPLY
0
Entering edit mode

Yes. Paired-end reads are essentially "longer" reads, allowing the aligner to better align around repetitive regions, gaps, etc.

ADD REPLY

Login before adding your answer.

Traffic: 3051 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6