Question: Difference in duplicate number with samtools flagstat and multiqc/fastqc
3.1 years ago by
ZheFrench300 wrote:

samtools flagstat gives me 0 + 0 duplicates for a BAM from a ChIP-seq sample. The ChIPQC package in R also gives me 0 duplicates.

But when I run a FastQC or MultiQC analysis on it, the BAM has 40% duplicates.

So I'm wondering how each tool works and whether they are reporting different values. It happens for all my samples...

modified 19 months ago by Phil Ewels560 • written 3.1 years ago by ZheFrench300
3.1 years ago by
United States
genomax85k wrote:

If you want to reliably identify sequence duplicates (of all kinds), then use clumpify from the BBMap suite. You do not need to align the data for clumpify; see the thread "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."

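As for FastQC: its modules sample different amounts of data. The duplication module and the overrepresented sequences module only track sequences that first appear near the start of the file (the first 100,000 sequences, according to the FastQC documentation), but then keep counting those tracked sequences through to the end of the file.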
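To make the sampling idea concrete, here is a minimal sketch of that kind of estimate. It is not FastQC's actual code: the function name and the `track_limit` parameter are hypothetical, and details such as FastQC's read truncation and its final correction of the estimate are left out.

```python
def estimate_duplication(sequences, track_limit):
    """Rough sequence-level duplication estimate (illustrative only).

    Only sequences first seen within the first `track_limit` reads are
    tracked, but every later occurrence of a tracked sequence is counted.
    Returns the percentage of tracked reads that are duplicates.
    """
    counts = {}
    reads_seen = 0
    for seq in sequences:
        reads_seen += 1
        if seq in counts:
            # Already tracked: keep counting to the end of the input.
            counts[seq] += 1
        elif reads_seen <= track_limit:
            # New sequence inside the tracking window: start tracking it.
            counts[seq] = 1
        # New sequences after the window are ignored entirely.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    distinct = len(counts)
    return 100.0 * (total - distinct) / total


reads = ["AAAA", "AAAA", "CCCC", "GGGG", "AAAA"]
percent_dup = estimate_duplication(reads, track_limit=3)
```

Note that nothing here looks at alignment flags or positions, so a value like this can be high even when the BAM has no reads marked as duplicates.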

modified 3.1 years ago • written 3.1 years ago by genomax85k
19 months ago by
Phil Ewels560
Sweden / Stockholm / SciLifeLab
Phil Ewels560 wrote:

I think that samtools flagstat only reports reads that are already flagged as duplicates. You need to mark the duplicate reads first, for example by running Picard MarkDuplicates. If you then run samtools flagstat again it should give you a more realistic figure :)
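As a rough illustration of why an unmarked BAM shows 0 duplicates: in the SAM specification the duplicate status is bit 0x400 of the FLAG field, and a flag-based count only sees reads where some tool has set that bit. The sketch below works on SAM-formatted text lines (for example the output of `samtools view -h`); the function name and the toy records are made up for this example, and it is not how flagstat itself is implemented.

```python
DUPLICATE_FLAG = 0x400  # SAM spec: read is a PCR or optical duplicate


def count_flagged_duplicates(sam_lines):
    """Count alignment records whose FLAG has the duplicate bit set.

    Returns (duplicates, total_records). Header lines starting with '@'
    are skipped.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return duplicates, total


# Toy records: only the first columns matter for this example.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t1040\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
dups, total = count_flagged_duplicates(example)
```

Before a duplicate-marking step has been run, no record carries the 0x400 bit, so a count like this stays at zero no matter how redundant the reads are.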

FastQC compares the read sequences themselves instead of alignment positions, so it doesn't require any preprocessing steps.
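The two definitions can disagree on the same data. Here is a small sketch with invented records of the form (chromosome, position, strand, sequence). Grouping by position is only a stand-in for what duplicate-marking tools do (real tools use more careful criteria), and grouping by sequence is only a stand-in for a sequence-based check.

```python
from collections import Counter


def duplicates_by_key(records, key):
    """Count records beyond the first in each group defined by `key`."""
    counts = Counter(key(r) for r in records)
    return sum(n - 1 for n in counts.values())


# Hypothetical reads: (chromosome, position, strand, sequence)
records = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGA"),  # same position, one base differs
    ("chr2", 500, "-", "ACGT"),  # same sequence, different locus
    ("chr4", 42, "+", "ACGT"),   # same sequence again, another locus
    ("chr3", 900, "+", "TTGA"),
]

by_position = duplicates_by_key(records, key=lambda r: r[:3])
by_sequence = duplicates_by_key(records, key=lambda r: r[3])
```

In this toy set the position-based count is 1 and the sequence-based count is 2, and they pick out different reads, which is one reason the two kinds of report need not agree.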

written 19 months ago by Phil Ewels560