Question: What Type Of Filtering Is Typically Applied To A Bam File?
gravatar for Ashutosh Pandey
6.6 years ago by
Ashutosh Pandey11k wrote:

Hi ,

I am new to NGS analysis. Recently, I aligned illumina paired-end short reads against mouse reference genome using BWA. I only used the trimming option (-q 15) for BWA and left other parameters unchanged. Now, I have SAM files with all different kinds of FLAG in the second column of the SAM file. I know these FLAG give us an idea about the orientation of the paired reads and also whether both reads belonging to a pair mapped or not. Also, the fifth column (MAPQ) gives the mapping quality score. I have the following questions:

1) Can I directly use these bam (sam) files with unmapped/low mapping quality reads or improperly placed paired-reads for variant calling (SNPs, Indels) using different tools including samtools, GATK etc. Will these programs automatically discard or ignore all the irrelevant mapped reads such as two reads belonging to same pair but mapped to different chromosomes (it is possible in few cancer samples but not in the one that I am analyzing) OR one has to write a script to automatically extract the paired reads that are confidently and properly mapped and use them for the further analysis.

2) In case the second case discussed above is true, then:

a) Should I only consider the paired reads where both of them are mapped and follow the insert size distribution ? b) It could be a case where one of the read from the pair is confidently mapped (assume MAPQ >=20) while the other read falls into repetitive genome and is not mapped or not mapped because of other reason ? I assume I should use the mapped read in this case otherwise it may make me loose lot of reads.

format sam • 4.9k views
ADD COMMENTlink modified 6.6 years ago by Ketil3.9k • written 6.6 years ago by Ashutosh Pandey11k

In general you should try to verify the optimal requirements for each of the tools. The rule of the thumb is that there are no default filtering options, one man's trash is the other's treasure. For example incorrectly paired reads could also indicate inversions or other structural variations.

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by Istvan Albert ♦♦ 80k
gravatar for JC
6.6 years ago by
JC7.7k wrote:

Hi, welcome to the NGS analysis world.

1) You can use your SAM->BAM files directly to Samtools/GATK, as you said, unmapped reads are discarded. In general you discard those just to save space in your hard-drive.

2) For SNP detection I don't consider necessary to exclude unpaired reads, as you discuss above, one pair can fail for several reason (including technical).

ADD COMMENTlink written 6.6 years ago by JC7.7k

Thanks JC,

I think you have already answered my questions but I just want to confirm.

a) You said unmapped reads are discarded by GATK/samtools from the analysis so it doesnt matter if you have those reads in your input bam files. I think that one can also control the MAPQ of the reads to be used using samtools -M capMapQ parameter. But what about reads that are improperly paired including reads reads belonging to a pair but mapped on different chromosomes. If I include such reads in bam files, I dont want GATK/samtools to use these reads.Do you think I need to discard such read pairs or samtools or GATK will be able to take care of this thing.

In case we need to select the reads manually.

b) I went to and came up with these FLAG values for the reads I think that will be relevant for my analysis (paired end illumina reads) :

99 - read paired, read mapped in proper pair, mate reverse strand, first in pair 163 - read paired, read mapped in proper pair, mate reverse strand, second in pair 73 - read paired, mate unmapped, first in pair 137 - read paired, mate unmapped, second in pair

c) I assume, I will have to still keep the original bam file (with all the reads) with me as some tools like Dindel ask you to give both mapped and unmapped reads belonging to a pair as input so that it can try split read method to find indels.



ADD REPLYlink written 6.6 years ago by Ashutosh Pandey11k

You also want 83 and 147, (properly paired reads that map in the reverse) and if you want 73 and 137, you also want 89 and 153, which also map in the reverse.

ADD REPLYlink written 6.6 years ago by swbarnes25.3k

sorry for the late response, but I think other people already extend the answer. Good luck!

ADD REPLYlink written 6.6 years ago by JC7.7k
gravatar for Ketil
6.6 years ago by
Ketil3.9k wrote:

The first thing I'd do after mapping would be to run some quick statistics, to check that insert sizes and mapping look good (I use my own program, but I'm sure there are plenty of tools for this) Unless you are doing something wrong, the vast majority of reads should be properly mapped. If (or where) you fail to map pairs, that could be sign of structural variations or perhaps dense SNPs, discarding these reads might mean you miss out some important stuff. Pairs may map to different chromosomes if regions are sufficiently similar, and you are a bit unlucky with SNPs and read errors - but the number of reads where this happens should be miniscule, and filtering it out might introduce bias in your sample.

You don't specify what your analysis is for, but I can't help but feel that if your analysis is changed dramatically by such sequencing "noise", something is Wrong.

ADD COMMENTlink written 6.6 years ago by Ketil3.9k

Thanks Ketil. You are right that most of the reads are mapped correctly with only few of them are improperly mapped. My analysis is primarily focused to look for SNP and small indel variants. I will also look for large indels and inversion later on. I think i should use properly placed paired-reads and also singleton reads (pair unmapped but the read itself is mapped confidently) to look for SNPs and small indels variants using samtools (My files are small so I can extract these reads overnight). For large indels and inversion which may require improperly placed reads, I can always go back to the original bam files (containing all the reads).


ADD REPLYlink written 6.6 years ago by Ashutosh Pandey11k

Note that most(?) variant callers will output confidence values for SNPs, so it will take into account mapping quality and other factors. By filtering out reads, you likely only reduce sensitivity (e.g. close to a region with no/low mapping, you risk halving the coverage) for dubious gains. I'd just go with the raw BAM file.

Edit: gratuitous link to bamstats: Alternative could be

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by Ketil3.9k

Thanks Ketil. You are right that I should not care much about the small percentage of improper placed reads. I can always do some SNP and indel confidence checking in the end.

ADD REPLYlink written 6.6 years ago by Ashutosh Pandey11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1775 users visited in the last hour