Question: Is there a correct order in which to filter BAM files?
0
gravatar for James Ashmore
4.9 years ago by
James Ashmore2.8k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.8k wrote:

Say I have a BAM file which I want to filter, I want to apply the following processing steps: Remove reads aligned to blacklist regions, remove mitochondrial mapping reads, remove unmapped reads, remove pcr duplicates, remove reads with MAPQ score lower than 30.

Is there a 'correct' order to do these in, or does it not matter? I think problems could arise where certain filtering steps create orphans, resulting in incorrect SAM flags, which later software require to be accurate.

filtering bam • 2.0k views
ADD COMMENTlink modified 4.9 years ago by Brian Bushnell17k • written 4.9 years ago by James Ashmore2.8k

Most of the variant callers don't consider reads that are flagged as pcr duplicates, secondary alignments, unmapped reads (no brainer). So you need not to worry about their presence in the BAM file provided that these reads have been correctly flagged. MAPQ > 30 can be chosen if you really want to be stringent but as Brian suggested that it may incur some bias. There will be lots of reads with MAPQ < 30 but still align uniquely to the reference genome. So I would not use MAPQ to filter reads as variant caller will be smart enough to only use uniquely aligned reads. But these decisions are subjective and differ between people. 

ADD REPLYlink written 4.9 years ago by Ashutosh Pandey12k
2
gravatar for geek_y
4.9 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

I think removing PCR duplicates first would be good as this step takes paired-end information into account. Then perform rest of the steps in any order ( can be a single command with pipes). 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by geek_y10k
0
gravatar for Brian Bushnell
4.9 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Where did the fad of removing reads with mapq below 30 come from?  If you are doing variant-calling, that can incur a serious reference bias.

ADD COMMENTlink written 4.9 years ago by Brian Bushnell17k

This is interesting. Do you mean it's too stringent ? Or mapq based filtering is not at all a good idea ?

ADD REPLYlink written 4.9 years ago by geek_y10k

mapq-based filtering is useful when you set a threshold of "above 3", meaning reads that the mapper think are uniquely/unambiguously aligned.  But the more a read differs from the reference, the lower the mapq will be.  So filtering with a high threshold will exclude reads that indicate variants.  It's much better to use all the reads (that were aligned unambiguously) the generate a consensus and figure out things from there, rather than throwing away all the data that did not align near-perfectly on a per-read basis.

mapq is useful for a variant-caller to weight the values of different reads covering a given location, but I don't suggest using it for generic filtering in most cases (other than to screen ambiguously-mapped reads).

ADD REPLYlink written 4.9 years ago by Brian Bushnell17k

I'm not doing any variant calling, and the MAPQ value helped remove regions with extremely large coverage in my dataset which I did not want for my specific analysis.

ADD REPLYlink written 4.9 years ago by James Ashmore2.8k

Well IMHO just removing the extra reads genome wide to make the bam file smaller doesn't sound a systematic approach. You should some other approach if you want to do a focused analysis on certain genomic regions. 

ADD REPLYlink written 4.9 years ago by Ashutosh Pandey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1184 users visited in the last hour