BWA-MEM is giving me alignments that cause noise in subsequent analysis. How can I filter them?
7.9 years ago
shinken123 ▴ 150

Hi

I have BWA-MEM alignments that are giving me a lot of noise in subsequent analysis. If I use bwa aln instead, I can filter the files by MAPQ and keep only the reads that carry the tag XT:A:U, which solves the noise problem; however, bwa aln is too slow to map all my data. How can I filter the BWA-MEM alignments if this tag is not present?

My first filter for the BWA-MEM alignments was by MAPQ, which gives me uniquely mapped reads (the flag 256 is not present). However, my analyses are still noisy. I also have reads with several mismatches, which I suspect are the problem. I can filter them by the NM:i tag, but I am wondering whether there is a better way to filter my files and obtain more reliable alignments.
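
For reference, the filter described above (MAPQ, flag 256, NM:i) could be sketched roughly as follows on SAM text records. This is only an illustration: the function names and all thresholds are hypothetical. BWA-MEM does not write XT:A:U, but it does report AS (alignment score), XS (suboptimal alignment score) and, when alternative hits exist, XA; using those as a stand-in for "unique" is an assumption here, not something established in this thread.

```python
# Sketch: keep only primary, confidently mapped, low-edit-distance SAM records.
# Function names and thresholds are illustrative assumptions.

UNMAPPED = 0x4          # flag 4: read is unmapped
SECONDARY = 0x100       # flag 256: secondary alignment
SUPPLEMENTARY = 0x800   # flag 2048: supplementary (chimeric) alignment


def parse_tags(fields):
    """Return the optional SAM fields (columns 12+) as a dict, e.g. {'NM': 2}."""
    tags = {}
    for field in fields:
        name, typ, value = field.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return tags


def keep_record(line, min_mapq=30, max_nm=5, min_score_gap=10):
    """Return True if one SAM alignment line passes all filters."""
    cols = line.rstrip("\n").split("\t")
    flag, mapq = int(cols[1]), int(cols[4])
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if mapq < min_mapq:
        return False
    tags = parse_tags(cols[11:])
    # NM is the edit distance: mismatches plus inserted and deleted bases.
    # Records without an NM tag are not rejected by this check.
    if tags.get("NM", 0) > max_nm:
        return False
    # Assumption: treat any reported alternative hit (XA) as "not unique".
    if "XA" in tags:
        return False
    # Assumption: require a gap between best (AS) and second-best (XS) score.
    if "AS" in tags and "XS" in tags and tags["AS"] - tags["XS"] < min_score_gap:
        return False
    return True
```

When applying something like this to a whole file, header lines (those starting with `@`) would need to be passed through unchanged.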

Best wishes

genome next-gen alignment • 2.6k views

What subsequent analyses are you doing? How do you know the alignment is causing "noise"?

I am calculating the D statistic (introgression) for maize. With the aln filtering, the results are similar to those of other individuals (a different maize race) from the same environment; also, if I use a masked reference genome for the alignment, the "noise" disappears. So it is very possible that the repetitive regions of the genome, and the alignments to those regions, are responsible for the "noise". However, this only happens in these newly sequenced individuals; the previously sequenced individuals do not need the filtering and their D statistics are normal. Could it also be a sequencing problem? The new individuals have longer reads (150 bp), while the previous ones have 100 bp reads.

7.9 years ago
DG 7.3k

Don't you normally use SNPs called from your data for this calculation, as opposed to the whole aligned BAM file? If that is the case, the place to apply filtering is in your variant-calling algorithm (by setting its parameters) and on the statistics that describe the called variants at the end. If you are getting too many false-positive calls due to alignment issues or regions of mismatching, that should control for it.

Thank you very much. Yes, I know; for SNPs that is fine, but I am not working with SNPs. I am working with genotype likelihoods (ANGSD). I am testing filters there, so if I solve this I will tell you how.

You are ultimately still dealing with the genotype likelihoods of either some sites or all sites, so similar filtering criteria would apply. I haven't used ANGSD before, but from a quick look at the documentation there are all sorts of filters that can be applied. MapQ would be helpful here, as would trimming the ends of reads, filtering on depth, etc.

Yes, thank you, I am testing them. The main problem here is that MapQ does not seem to be affected by the number of mismatches (I have 150 bp reads with more than 20 mismatches whose MapQ is above 30). These reads also have "good flags": they are not secondary alignments, they are properly paired, etc. However, I am playing with the filter parameters of ANGSD, and I will see what happens and whether I can find the best filters for BWA-MEM alignments.
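
Since the reads in question differ in length (150 bp versus 100 bp), one way to make a mismatch filter comparable across libraries is to express NM per aligned base rather than as an absolute count. The sketch below assumes the CIGAR string and the NM value have already been extracted from each record; the function names and the 5% cutoff are hypothetical. Note that NM is the edit distance (it also counts inserted and deleted bases), so the rate is only an approximation of the mismatch rate.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_length(cigar):
    """Number of read bases inside the alignment (M, =, X and I operations).

    Soft-clipped (S) and hard-clipped (H) bases are not counted.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XI")


def edit_rate(cigar, nm):
    """Edit distance (NM) per aligned read base; 0.0 if nothing is aligned."""
    length = aligned_length(cigar)
    return nm / length if length else 0.0


def passes_edit_rate(cigar, nm, max_rate=0.05):
    """True if the alignment has at most max_rate edits per aligned base."""
    return edit_rate(cigar, nm) <= max_rate
```

With these assumed settings, a 150 bp read aligned end to end with 20 edits has a rate of about 0.13 and would be dropped, regardless of its MapQ.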

The base quality scores are also important. You should be able to include them in your filtering to enrich for high-quality sites. All of these factors should ultimately be included in the genotype likelihoods as well.
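
To make the base-quality and read-end trimming suggestions in this thread concrete, here is a small sketch of what such a per-base filter does to a single read. It assumes the standard Phred+33 encoding of the SAM QUAL column; the function names, the quality cutoff of 20 and the `trim` parameter are hypothetical, and tools such as ANGSD apply their own versions of these filters internally.

```python
def phred_scores(qual, offset=33):
    """Decode a SAM/FASTQ quality string (Phred+33 by default) into integers."""
    return [ord(ch) - offset for ch in qual]


def usable_positions(qual, min_q=20, trim=0):
    """Return 0-based read positions that would be kept.

    A position is kept if its base quality is at least min_q and it is not
    within `trim` bases of either end of the read.
    """
    scores = phred_scores(qual)
    last = len(scores) - trim
    return [i for i, q in enumerate(scores) if q >= min_q and trim <= i < last]
```

For example, `usable_positions("IIII#III", min_q=20, trim=1)` keeps the interior positions except the one encoded as `#` (quality 2).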

Yes, I am including MapQ and base quality, and I am playing with the rest of the filters. Thanks!
