Question: Difference between marking duplicates and filtering BAM on phred score
0
9 months ago by
Nalini10
India
Nalini10 wrote:

Hey!

I am working with whole-genome sequencing data from bacterial samples, sequenced on an Illumina platform. After receiving the paired-end .fastq files, I aligned them to my reference genome with bowtie2 to get a SAM file, converted it to BAM, and sorted the BAM file. I then filtered the sorted BAM file to obtain one where the phred score is >30. This filtered BAM file is what I have used for my downstream analysis.

My question is whether the final output file would differ if I used Picard's MarkDuplicates. I read about how MarkDuplicates works: it recognizes optical artifacts and PCR duplicates, and, as I understand it, only pairs with Q>15 are considered. I hope I have got that right! So does that mean that by filtering at Q>30 I have already taken care of duplicates, or are these two totally different quality checks?

Please advise whether I can go ahead with the Q>30 filtered files or whether I also need to use the MarkDuplicates tool. It would be great if you could give me a simple explanation.

Thanks in advance!! :)

2
9 months ago by
d-cameron2.0k
Australia
d-cameron2.0k wrote:

are these two totally different quality checks?

These are two totally different quality checks.

MarkDuplicates flags (and can optionally remove) fragments that have been sequenced multiple times due to PCR amplification.

MAPQ filtering removes reads that are ambiguously placed by the aligner. When a read aligner places a read, it also reports a MAPQ (mapping quality), which is a phred-scaled quality score. Since many genomes contain repetitive sequence, many reads cannot be placed unambiguously because they align equally well to two or more locations in the genome (multi-mapping reads).
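To make "phred-scaled" concrete, here is a minimal Python sketch of the standard conversion Q = -10 * log10(P), where P is the probability that the call (here, the placement of the read) is wrong. The function names are made up for illustration, and aligners differ in how they actually derive MAPQ, so treat this as the nominal meaning of the scale rather than what any particular aligner computes.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given a phred-scaled score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Nominal error probabilities for a few thresholds mentioned in this thread.
for q in (15, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4g}")
```

On this scale a threshold of 30 corresponds to a nominal 1-in-1000 chance of error, whichever kind of quality score (mapping or base) it is applied to.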

Note that there is also a phred-scaled base quality score for each base. When you say "phred score >30", it is not obvious whether you are talking about filtering out multi-mapping reads, or trimming reads with runs of low base quality scores (which you should also do).
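To show where the two scores live, here is a small Python sketch that pulls both out of a single SAM alignment line: MAPQ is one integer per read (column 5), while the base qualities are one value per base, encoded as ASCII with an offset of 33 (column 11). The record below is invented for illustration, and `parse_sam_record` is a hypothetical helper, not part of any tool.

```python
def parse_sam_record(line):
    """Return (read name, MAPQ, list of per-base qualities) from one SAM line."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    mapq = int(fields[4])                            # column 5: one score per read
    base_quals = [ord(c) - 33 for c in fields[10]]   # column 11: Phred+33, one score per base
    return name, mapq, base_quals

# Made-up record: well-placed read (MAPQ 42) containing two low-quality bases.
record = "read1\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIII##II"
name, mapq, quals = parse_sam_record(record)
print(name, "MAPQ:", mapq, "base qualities:", quals)
```

A read can therefore pass a MAPQ threshold while still containing poor bases, and vice versa, which is why it matters which "phred score" the filter was applied to.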

Edit: because it is difficult or impossible to determine the true source location of multi-mapping reads, these reads are also quite difficult to deduplicate correctly with MarkDuplicates, since MarkDuplicates relies on the alignments of duplicate reads matching each other.

1
9 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello,

MarkDuplicates finds duplicates based on the mapping information. The quality values are only taken into account to determine which of the duplicates should stay as the "original" read.

So filtering your bam file by phred scores doesn't remove duplicates.
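As a rough illustration of that idea, here is a toy Python sketch. It is not Picard's implementation: the real tool works on read pairs, unclipped 5' positions, libraries and more. The read dictionaries, the `mark_duplicates` function and the cutoff of 15 used for scoring are all assumptions made for this example. Reads are grouped by mapping position and strand, and within each group the read with the highest sum of base qualities is kept as the "original".

```python
from collections import defaultdict

def mark_duplicates(reads, min_base_qual=15):
    """Return the names of reads flagged as duplicates (toy model)."""
    groups = defaultdict(list)
    for read in reads:
        # Duplicates are identified by WHERE the read maps, not by its quality.
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    def score(read):
        # Quality is used only to choose which copy to keep.
        return sum(q for q in read["quals"] if q >= min_base_qual)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=score)
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates

reads = [
    {"name": "a", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [40, 40, 40]},
    {"name": "b", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [40, 10, 40]},
    {"name": "c", "chrom": "chr1", "pos": 250, "strand": "-", "quals": [12, 12, 12]},
]
dups = mark_duplicates(reads)
print(sorted(dups))
```

In this toy data, read "b" is flagged because it shares a position and strand with "a", while the low-quality read "c" is not a duplicate at all. A quality threshold alone would not separate these cases.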

Also, I guess it's better to run MarkDuplicates first and then do any filtering steps.

fin swimmer
