Question: Divergent number of replicates between FastQC and samtools
1
gravatar for user31888
17 months ago by
user3188890
United States
user3188890 wrote:

I have a total of 300 million reads in my R1 and R2 raw fastq files.

How come fastQC count about 50% duplicates in each .fastq files, but after alignment with STAR (default parameters), none of the reads have the 1024 flag (PCR or optical duplicates) when doing samtools view -c -f 1024 mydata.bam?

fastqc star samtools • 404 views
ADD COMMENTlink modified 17 months ago by i.sudbery9.4k • written 17 months ago by user3188890
3
gravatar for h.mon
17 months ago by
h.mon31k
Brazil
h.mon31k wrote:

Because STAR with default settings doesn't mark duplicates, you have to use --bamRemoveDuplicatesType UniqueIdentical to have duplicates marked (or --bamRemoveDuplicatesType UniqueIdenticalNotMulti).

edit: also, to elaborate on the point genomax makes:

STAR (if you choose to mark duplicates with the correct parameter, which apparently you didn't) and FastQC use very different methods of detecting duplicates, so it shouldn't be surprising if the number os estimated duplicates is different. The documentation for FastQC states:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed

And also:

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis.

Whereas STAR uses mapping position:

UniqueIdentical

mark all multimappers, and duplicate unique mappers.

The coordinates, FLAG, CIGAR must be identical

ADD COMMENTlink modified 17 months ago • written 17 months ago by h.mon31k
1

Thanks for your very complete answer ! I actually did use --bamRemoveDuplicatesType UniqueIdentical, but my version of STAR (2.5.1b) does not tag duplicates in the same time as aligning fastq reads. I needed to (1) align the .fastq reads; (2) tag duplicates using star --runMode inputAlignmentsFromBAM --inputBAMfile input.bam --bamRemoveDuplicatesType UniqueIdentical.

ADD REPLYlink written 17 months ago by user3188890

You are correct, and current version (2.7.1a) works the same as the version you are using (2.5.1b). The pdf manual is incomplete and doesn't explain this, but the command-line help does:

### BAM processing
bamRemoveDuplicatesType  -
    string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only
                        -                       ... no duplicate removal/marking
                        UniqueIdentical         ... mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
                        UniqueIdenticalNotMulti  ... mark duplicate unique mappers but not multimappers.
ADD REPLYlink modified 17 months ago • written 17 months ago by h.mon31k
3
gravatar for i.sudbery
17 months ago by
i.sudbery9.4k
Sheffield, UK
i.sudbery9.4k wrote:

In addition to the points made by @h.mon and @genomax, another point is that if I remember correctly, FastQC only gives you the single-ended duplication rate - that is from a given fastq file, how many reads have exactly the same sequence.

However, this is not how tools that function on alignments (like STAR's --bamRemoveDuplicatesType UniqueIdentical, or Picard's MarkDuplicates or samtools's markdup) function. They look at a pair of reads. If they have the same outer alignment co-ordinates, they are marked as duplicates. Thus in the following pairs:

    R1                    R2
---------->          <----------

    R3                           R4
---------->                  <----------

          R5               R6
      --------->      <---------     

          R7                        R8
      --------->             <----------

fastqc would mark everything as a duplicate because (R1=R3, R2=R6, R5=R7 and R4=R8). This assumes no sequencing errors. samtools/Picard/STAR would mark nothing as a duplicate because no pair has identical coordinates at both ends.

ADD COMMENTlink modified 17 months ago • written 17 months ago by i.sudbery9.4k

Thanks for your explanation. It's a bit out of the topic here, but I am just curious. According to you, if one have single-end reads, fastqc and samtools / picard / STAR should output the same number of duplicates then?

ADD REPLYlink written 17 months ago by user3188890

Well almost. FastQC tags on having the same sequence. samtools/picard/STAR tags on having the same alignment coordinates. The two might not be equivalent in cases where you have a sequencing error in a read that leads to it having a different sequence, but still aligning to the same location.

ADD REPLYlink written 17 months ago by i.sudbery9.4k
1
gravatar for genomax
17 months ago by
genomax91k
United States
genomax91k wrote:

FastQC does not look at your entire dataset for some of the parameters it is testing. That would be memory/time consuming. It samples your data as it goes along.

So that 50% number is an estimate. It may vary depending how reads are distributed in your data.

ADD COMMENTlink modified 17 months ago • written 17 months ago by genomax91k

Correct. After counting them all "manually", I found a different number as well.

ADD REPLYlink written 17 months ago by user3188890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1710 users visited in the last hour