Question: BWA-MEM is giving me alignments that cause noise in subsequent analysis. How can I filter them?
2.2 years ago, shinken123 wrote:


My BWA-MEM alignments are producing a lot of noise in subsequent analysis. With BWA aln I can filter the files by MAPQ and keep only the reads that carry the tag XT:A:U, and this solves the noise problem; however, BWA aln is too slow to map all my data. How can I filter the BWA-MEM alignments, given that this tag is not present?

My first filter for the BWA-MEM alignments was by MAPQ, which leaves me with uniquely mapped reads (flag 256 is not present). However, my analyses are still noisy. I also have reads with several mismatches, which I suspect are the problem. I can filter them using the NM:i tag, but I am wondering whether there is a better way to filter my files and obtain more reliable alignments.

Best wishes

alignment next-gen genome • 968 views
modified 2.2 years ago by Dan Gaston • written 2.2 years ago by shinken123
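A minimal sketch of the kind of filter the question describes, written against plain SAM text lines. It assumes that BWA-MEM marks reads with alternative hits using the XA tag and split alignments using the SA tag, and treats the absence of both as a rough stand-in for the XT:A:U tag of BWA aln. The function names and the default thresholds (`min_mapq=30`, `max_nm=5`) are illustrative choices, not values from the thread.

```python
def parse_tags(fields):
    """Return the optional SAM fields as a dict, e.g. {'NM': 3, 'XA': '...'}."""
    tags = {}
    for field in fields:
        name, typ, value = field.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return tags


def keep_record(sam_line, min_mapq=30, max_nm=5):
    """Decide whether one SAM alignment line passes the filters."""
    cols = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(cols[1]), int(cols[4])
    tags = parse_tags(cols[11:])
    # Drop unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
    if flag & (0x4 | 0x100 | 0x800):
        return False
    if mapq < min_mapq:
        return False
    # Assumption: XA (alternative hits) or SA (split alignment) means "not unique".
    if "XA" in tags or "SA" in tags:
        return False
    # NM is the edit distance to the reference; a missing tag counts as 0 here.
    if tags.get("NM", 0) > max_nm:
        return False
    return True


def filter_sam(lines, **thresholds):
    """Yield header lines unchanged and only the alignment lines that pass."""
    for line in lines:
        if line.startswith("@") or keep_record(line, **thresholds):
            yield line
```

The functions take any iterable of SAM lines, so they can be fed from a file handle or from the text output of `samtools view -h`.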

What subsequent analyses are you doing? How do you know the alignment is causing "noise"?

written 2.2 years ago by andrew.j.skelton73

I am calculating the D statistic (introgression) for maize. With the aln filtering, the results are similar to those of other individuals (a different maize race) from the same environment, and if I use a masked reference genome for the alignment the "noise" also disappears. So it is very possible that the repetitive regions of the genome, and alignments to those regions, are responsible for the "noise". But this only happens in these newly sequenced individuals; earlier individuals do not need the filtering and their D statistics are normal. Could it also be a sequencing problem? The new individuals have longer reads (150 bp), while the previous ones have 100 bp reads.

written 2.2 years ago by shinken123
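For readers unfamiliar with the statistic mentioned above, here is a small sketch of the standard ABBA-BABA form of D, computed from counts of the two discordant site patterns. The function name and the example counts are illustrative; ANGSD's own implementation may differ in detail (for instance in how bases are sampled per site).

```python
def d_statistic(n_abba, n_baba):
    """Patterson's D from counts of ABBA and BABA site patterns.

    D is close to 0 when both patterns are equally frequent and moves
    toward +1 or -1 when one pattern is in excess.
    """
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative (ABBA or BABA) sites")
    return (n_abba - n_baba) / total
```

Because D depends only on the balance between the two patterns, spurious alignments in repetitive regions that inflate one pattern more than the other would shift D away from zero, which is consistent with the "noise" disappearing on a masked reference.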
2.2 years ago, Dan Gaston wrote:

Don't you normally use SNPs called from your data for this calculation, as opposed to the whole aligned BAM file? If so, the place to apply filtering is in your variant calling algorithm (by setting its parameters) and on the statistics that describe the called variants at the end of that. If you are getting too many false positive calls due to alignment issues or regions of mismatching, that should control for it.

written 2.2 years ago by Dan Gaston

Thank you very much. Yes, I know; for SNPs that works, but I am working not with SNPs but with genotype likelihoods (ANGSD). I am testing filters there, so if I solve this I will tell you how.

written 2.2 years ago by shinken123

You are ultimately still dealing with the genotype likelihoods of either some sites or all sites, so similar filtering criteria would apply. I haven't used ANGSD before, but from a quick look at the documentation there are all sorts of filters that can be applied. MapQ would be helpful here, as would trimming the ends of reads, depth, etc.

written 2.2 years ago by Dan Gaston
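As a rough illustration of the filters mentioned above, the sketch below assembles an ANGSD argument list in Python without running anything. The flag names (`-minMapQ`, `-minQ`, `-trim`, `-remove_bads`, `-uniqueOnly`, `-only_proper_pairs`, `-C`, `-ref`, `-bam`, `-out`) are given as recalled from the ANGSD filter documentation and should be checked against the help output of the installed version. The helper names, file names, and thresholds are hypothetical.

```python
def angsd_filter_args(min_mapq=30, min_base_q=20, trim=0, ref=None):
    """Build read- and base-level filter arguments for an ANGSD run."""
    args = [
        "-minMapQ", str(min_mapq),    # minimum mapping quality
        "-minQ", str(min_base_q),     # minimum base quality
        "-trim", str(trim),           # bases to trim from both read ends
        "-remove_bads", "1",          # drop reads flagged as failed/duplicate
        "-uniqueOnly", "1",           # drop reads with multiple best hits
        "-only_proper_pairs", "1",    # keep properly paired reads only
    ]
    if ref is not None:
        # -C 50 downweights MAPQ for reads with excessive mismatches
        # (as in SAMtools); it needs the reference sequence.
        args += ["-C", "50", "-ref", ref]
    return args


def angsd_command(bam_list, out_prefix, **filters):
    """Return the full argument vector; the caller decides how to run it."""
    return ["angsd", "-bam", bam_list, "-out", out_prefix] + angsd_filter_args(**filters)
```

For example, `angsd_command("bams.txt", "run1", min_mapq=30, ref="ref.fa")` returns a list that can be inspected, logged, or passed to `subprocess.run` once the analysis options for the D statistic have been appended.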

Yes, thank you, I am testing them. The main problem is that MapQ does not seem to be affected by the number of mismatches (I have 150 bp reads with more than 20 mismatches and MapQ above 30). These reads also have "good flags": they are not secondary alignments, they are properly paired, etc. However, I am playing with the filter parameters of ANGSD, and I will see what happens and whether I can find the best filters for BWA-MEM alignments.

written 2.2 years ago by shinken123
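Since MapQ can stay high for reads with many mismatches, one option is to filter on the edit distance per aligned base rather than on a fixed NM cutoff, which also keeps 100 bp and 150 bp reads comparable. The sketch below is an illustration only: the function names and the 5% threshold are assumptions, and NM is taken as defined in the SAM specification (mismatches plus inserted and deleted bases, excluding clipping).

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def edit_rate(cigar, nm):
    """Edit distance per aligned column (M, =, X, I and D operations)."""
    aligned = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    if aligned == 0:
        raise ValueError("CIGAR has no aligned bases: %r" % cigar)
    return nm / aligned


def passes_edit_rate(cigar, nm, max_rate=0.05):
    """True if the alignment has at most max_rate edits per aligned column."""
    return edit_rate(cigar, nm) <= max_rate
```

With these definitions, a 150 bp read aligned as `150M` with NM of 22 has a rate of about 0.15 and would be dropped at the 5% threshold, whatever its MapQ.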

The base quality scores are also important. You should be able to include them in your filtering to enrich for high-quality sites. All of these factors should ultimately be included in the genotype likelihoods as well.

written 2.2 years ago by Dan Gaston

Yes, I am including MapQ and base quality, and I am playing with the rest of the filters. Thanks!

written 2.2 years ago by shinken123