Question: Difference in duplicate number with samtools flagstat and multiqc/fastqc
0
2.2 years ago by
ZheFrench250
France
ZheFrench250 wrote:

samtools flagstat reports 0 + 0 duplicates for a BAM file from a ChIP-seq sample. The ChIPQC package in R also reports 0 duplicates.

But when I run FastQC or MultiQC on the same BAM, it reports 40% duplicates.

So I'm wondering how each tool works and whether these numbers measure different things. It happens for all my samples.

modified 8 months ago by Phil Ewels430 • written 2.2 years ago by ZheFrench250
3
2.2 years ago by
genomax70k
United States
genomax70k wrote:

If you want to reliably identify sequence duplicates (of all kinds), use clumpify from the BBMap suite. You do not need to align the data for clumpify; see the post "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."

As for FastQC: its modules sample different amounts of data. The duplication module and the overrepresented sequences module only track sequences that first appear within the first 100,000 reads of the file, but they keep counting those tracked sequences through to the end of the file.
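To make the sequence-based idea concrete, here is a minimal sketch in Python of how a duplication percentage can be computed from read sequences alone, with no alignment or flags involved. It is not FastQC's actual code, which does more than this; the function name `sequence_duplication_percent` and the `track_limit` parameter are hypothetical and only illustrate the "track what appears early, count to the end" behaviour.

```python
from collections import Counter


def sequence_duplication_percent(sequences, track_limit):
    """Percent of counted reads that repeat an already-seen sequence.

    Hypothetical helper, not FastQC's implementation. Only sequences that
    first appear within the first `track_limit` reads are tracked; tracked
    sequences keep being counted through to the end of the input.
    """
    counts = Counter()
    for index, seq in enumerate(sequences):
        # Start tracking new sequences only early on; keep counting known ones.
        if index < track_limit or seq in counts:
            counts[seq] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    return 100.0 * (total - len(counts)) / total


# Made-up reads: "ACGT" occurs three times, the other two once each.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
print(sequence_duplication_percent(reads, track_limit=len(reads)))  # 40.0
```

Because this only compares sequences, it gives a non-zero figure on a BAM where no read has been flagged as a duplicate.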

modified 2.2 years ago • written 2.2 years ago by genomax70k
1
8 months ago by
Phil Ewels430
Sweden / Stockholm / SciLifeLab
Phil Ewels430 wrote:

I think that samtools flagstat only counts reads that are already flagged as duplicates (the 0x400 bit in the SAM FLAG field). You need to mark these duplicate reads first, for example by running Picard MarkDuplicates. If you then run samtools flagstat again, it should give you a more realistic figure :)
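As a rough illustration of why the count is 0 before marking, here is a small Python sketch that counts records with the duplicate bit set, reading plain SAM text lines. It is not samtools' implementation; the function name `count_flagged_duplicates` and the example records are made up, and the "marked" list simply assumes a duplicate-marking step has set 0x400 on the second read.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate
QC_FAIL_FLAG = 0x200    # SAM FLAG bit: not passing quality controls


def count_flagged_duplicates(sam_lines):
    """Return (qc_passed, qc_failed) counts of records with the duplicate bit set."""
    passed = failed = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and header lines
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & DUPLICATE_FLAG:
            if flag & QC_FAIL_FLAG:
                failed += 1
            else:
                passed += 1
    return passed, failed


# Two made-up records with the same position and sequence, nothing flagged yet.
unmarked = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
# Assumed result of a marking step: FLAG of r2 goes from 0 to 1024 (0x400).
marked = [unmarked[0], unmarked[1], unmarked[2].replace("r2\t0\t", "r2\t1024\t", 1)]

print("%d + %d duplicates" % count_flagged_duplicates(unmarked))  # 0 + 0 duplicates
print("%d + %d duplicates" % count_flagged_duplicates(marked))    # 1 + 0 duplicates
```

The two reads are identical in both lists; only the flag changes, and that flag is all this kind of count looks at.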

FastQC looks at the raw read sequences rather than alignment positions or flags, so it doesn't require any preprocessing steps.

written 8 months ago by Phil Ewels430