I want to align reads from a non-model microbat genome to the repeat-masked version of the published microbat (Myotis lucifugus) genome and do variant calling. I have short-insert paired-end data generated on an Illumina NextSeq. The Myotis lucifugus genome has fairly gappy scaffolds and is of course from a different, albeit closely related, species.
Is it more appropriate for me to align my reads as single-end or paired-end reads? BWA and other similar aligners penalize unpaired reads heavily by default. My concern is that I will have reads thrown out because their mate falls either within a masked repetitive region or in a gap of N's in the scaffold.
As a side note, when I run BWA with defaults, I get radically different mapping percentages depending on whether I align the reads as single-end (~60% mapping) or paired-end (~85% mapping). Is this because BWA is penalizing the single-end reads for not having a mate? Would I fix this by reducing the -U penalty for an unpaired read pair from its default of 17 to 0?
Sorry that this is a few questions bundled together. This is also my first time posting here and I apologize if I've missed a rule.
If you have paired-end data, go with paired-end alignment, irrespective of the advantages or disadvantages of any particular aligner. Both in theory and in practice, paired-end data is better than single-end/unpaired data for alignment.
You could always do both: map first as paired-end, then take the unmapped pairs and realign them as single-end reads. If you have a tight enough insert size distribution, you can then extend these later by X bp upstream. It might not be worth doing any of this if you find you get 99% of your reads mapping in PE anyway, but if the paired-end alignment alone is totally unusable, you can do this.
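As a rough illustration of the "take the unmapped pairs" step, here is a minimal Python sketch that reads SAM text and collects the names of reads to realign as singles. The flag bits come from the SAM specification; the function names (needs_single_end_rescue, rescue_read_names) are made up for this example, and treating a pair as "unmapped" when either the read or its mate is unmapped is an assumption you may want to tighten or loosen.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def needs_single_end_rescue(flag: int) -> bool:
    """Hypothetical helper: True for a primary record of a paired read
    where the read itself or its mate is flagged as unmapped."""
    if not flag & PAIRED:
        return False  # read was not aligned as part of a pair
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False  # only consider primary records
    return bool(flag & (UNMAPPED | MATE_UNMAPPED))


def rescue_read_names(sam_lines):
    """Collect query names (column 1) of pairs to realign as singles."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if needs_single_end_rescue(flag):
            names.add(qname)
    return names


if __name__ == "__main__":
    # Toy records with only the first two SAM columns filled in.
    example = [
        "@HD\tVN:1.6",
        "pairA\t99",   # properly paired, both mapped
        "pairB\t77",   # read and mate both unmapped
        "pairC\t73",   # read mapped, mate unmapped
    ]
    print(sorted(rescue_read_names(example)))
```

The resulting names can then be used to pull those reads out of the original FASTQ files for a second, single-end alignment. In practice the same selection is usually done directly with the flag filters in samtools view (-f / -F) rather than by parsing SAM by hand.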