Question: BWA-MEM is giving me alignments that cause noise in subsequent analysis. How can I filter them?
19 months ago by
shinken123 wrote:


I have alignments from BWA-MEM that are giving me a lot of noise in subsequent analyses. If I use bwa aln, I can filter the files by MAPQ and keep the reads that carry the tag XT:A:U, and this solves the noise problem; however, bwa aln is too slow to map all my data. How can I filter the BWA-MEM alignments if this tag is not present?

My first filter for the BWA-MEM alignments was by MAPQ, which gives me uniquely mapped reads (the flag 256 is not present). However, my analyses are still noisy. I also have reads with several mismatches, which I suspect are the problem. I can filter them by the NM:i tag, but I am wondering whether there is a better way to filter my files and obtain more reliable alignments.

Best wishes

alignment next-gen genome • 721 views
modified 19 months ago by Dan Gaston • written 19 months ago by shinken123
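
For readers with the same problem, here is a minimal sketch of one way to approximate the XT:A:U filter on BWA-MEM output. BWA-MEM does not write XT, but it does write AS:i (alignment score) and XS:i (suboptimal alignment score), and by default an XA:Z tag listing alternative hits. The function works on plain SAM text lines; the names `parse_tags` and `keep_alignment` and the thresholds (`min_mapq=30`, `max_nm=5`) are illustrative assumptions, not recommended values.

```python
def parse_tags(fields):
    """Return optional SAM tags (columns 12+) as a dict, e.g. {'NM': 3}."""
    tags = {}
    for item in fields[11:]:
        name, kind, value = item.split(":", 2)
        tags[name] = int(value) if kind == "i" else value
    return tags


def keep_alignment(sam_line, min_mapq=30, max_nm=5):
    """Heuristic filter for one SAM record (thresholds are assumptions)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    tags = parse_tags(fields)

    # Drop unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
    if flag & (0x4 | 0x100 | 0x800):
        return False
    if mapq < min_mapq:
        return False
    # Rough stand-in for XT:A:U: the best score must beat the suboptimal one,
    # and no alternative hits may be reported. Records lacking AS are dropped.
    if "XA" in tags or tags.get("AS", 0) <= tags.get("XS", 0):
        return False
    # NM is the edit distance to the reference (mismatches plus indel bases).
    return tags.get("NM", 0) <= max_nm
```

The AS/XS comparison is only a heuristic for "unique"; it is not equivalent to what bwa aln reports in XT, so it is worth checking the retained fraction of reads on a small sample first.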

What subsequent analyses are you doing? How do you know the alignment is causing "noise"?

written 19 months ago by andrew.j.skelton73

I am calculating the D statistic (introgression) for maize. With the aln filtering, the results are similar to those of other individuals (a different maize race) from the same environment; also, if I use a masked reference genome for the alignment, the "noise" disappears. Thus it is very possible that the repetitive regions of the genome, and alignments to those regions, are responsible for the "noise". But this only happens in these newly sequenced individuals; the previous individuals do not need the filtering and their D statistics are normal. Could it also be a sequencing problem? The new individuals have longer reads (150 bp), and the previous ones 100 bp.

written 19 months ago by shinken123
19 months ago by
Dan Gaston wrote:

Don't you normally use SNPs called from your data for this calculation, as opposed to the whole aligned BAM file? If that is the case, the place to apply filtering is in your variant-calling algorithm (by setting its parameters) and on the statistics that describe the called variants at the end of that. If you are getting too many false-positive calls due to alignment issues or regions of mismatching, that should control for it.

written 19 months ago by Dan Gaston

Thank you very much. Yes, I know; for SNPs that is fine, but I am not working with SNPs but with genotype likelihoods (ANGSD). I am testing filters there, so if I solve this I will tell you how.

written 19 months ago by shinken123

You are ultimately still dealing with the genotype likelihoods of either some sites or all sites, but similar filtering criteria would apply. I haven't used ANGSD before, but from a quick look at the documentation there are all sorts of filters that can be applied. MapQ would be helpful here, as would trimming the ends of reads, depth, etc.

written 19 months ago by Dan Gaston
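
The read-level filters mentioned in this reply correspond roughly to the ANGSD options below. This is a sketch based on the option names in the ANGSD documentation; please check them against the help output of your own version. The helper `angsd_filter_args`, the file names, and every threshold are assumptions for illustration only.

```python
def angsd_filter_args(ref_fasta, min_mapq=30, min_baseq=20, trim=5):
    """Build read/base filter options for an ANGSD run (values are assumptions)."""
    return [
        "-minMapQ", str(min_mapq),      # minimum mapping quality
        "-minQ", str(min_baseq),        # minimum base quality
        "-uniqueOnly", "1",             # discard reads with multiple best hits
        "-remove_bads", "1",            # discard reads flagged as bad
        "-only_proper_pairs", "1",      # keep properly paired reads only
        "-trim", str(trim),             # bases to ignore at both read ends
        "-C", "50",                     # lower MAPQ for excessive mismatches
        "-ref", ref_fasta,              # reference, needed by -C
    ]


# Hypothetical input list and output prefix; analysis options are omitted.
cmd = ["angsd", "-bam", "bam.filelist", "-out", "filtered"] + angsd_filter_args("ref.fa")
print(" ".join(cmd))
```

The `-C` option mirrors the samtools adjustment that lowers the mapping quality of reads with many mismatches, which may be relevant when MAPQ alone does not separate noisy reads.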

Yes, thank you, I am testing them. The main problem here is that MapQ does not seem to be affected by the number of mismatches (I have 150 bp reads with more than 20 mismatches and MapQ above 30). These reads also have "good flags": they are not secondary alignments, they are properly paired, etc. However, I am playing with the filter parameters of ANGSD, and I will see what happens and whether I can find the best filters for BWA-MEM alignments.

written 19 months ago by shinken123
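
To put a number on the observation above: 20 edits in a 150 bp read is an edit rate of about 13%, yet MAPQ largely reflects how much better the best hit is than the next best one, not how closely the read matches the reference. A hedged sketch of a length-normalised filter on the NM tag follows; the function names and the 5% cutoff are assumptions, and NM counts indel bases as well as mismatches.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_query_bases(cigar):
    """Count read bases inside the alignment (M, =, X, I); clips are excluded."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XI")


def edit_rate(nm, cigar):
    """Edit distance (NM tag) per aligned read base."""
    bases = aligned_query_bases(cigar)
    return nm / bases if bases else 1.0


def passes_edit_rate(nm, cigar, max_rate=0.05):
    """True if the read's edit rate is at or below max_rate (an assumed cutoff)."""
    return edit_rate(nm, cigar) <= max_rate


print(round(edit_rate(20, "150M"), 3))  # prints 0.133
```

A relative cutoff treats 100 bp and 150 bp reads consistently, which matters when comparing individuals sequenced at different read lengths.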

The base quality scores are also important. You should be able to include them in your filtering to enrich for high-quality sites. Ultimately, all of these factors should also be reflected in the genotype likelihoods.

written 19 months ago by Dan Gaston
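
Base-quality filtering acts per base rather than per read. A small sketch of what a threshold such as Q20 does to a single read, assuming the standard Phred+33 encoding of the SAM QUAL column; the function name and the cutoff are hypothetical.

```python
def high_quality_positions(qual, min_q=20, offset=33):
    """Return 0-based read positions whose Phred score is at least min_q."""
    return [i for i, ch in enumerate(qual) if ord(ch) - offset >= min_q]


# 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
print(high_quality_positions("II5#I"))  # prints [0, 1, 2, 4]
```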

Yes, I am including MapQ and base quality, and I am playing with the rest of the filters, thanks!!

written 19 months ago by shinken123