Are not all duplicated reads removed when applying removal duplicated algorithms?
2
0
Entering edit mode
6 months ago
ManuelDB ▴ 30

I am trying to understand a bit deeper how duplications occur and how to deal with that in NGS analysis. First, of all, I wanted to understand the FastQC read duplication report for which the tutorial of Istvan Albert is really good (Revisiting the FastQC read duplication report).

My FASTQ file has shown this report

The title shows the proportion of duplicated read what is (as far I can undertand) so high. I have run Rmdup and MarkDuplicate in this file and the proportion of duplicated reads detected and removed/marked is around 15%.

So my question is, are not all duplicated reads removed when applying removal duplicated algorithms?

My second question is, for the simple simulation that Istvan Albert does in his post, I can understand what the red and blue lines is telling me. However, what my red and blue lines are telling me when working in a more realistic scenario like this (e.g. why is there a pick between 9 and >10)?

ngs FASTQC • 888 views
1
Entering edit mode

• Keep in mind that removing duplicates is not always needed/allowed. So it's not because you see a fraction of duplicated reads one needs to remover duplicates. It all depends on your data, ore specifically if you run RNAseq, then there should be duplication !!! (or your counts will be pretty low)
• There is also a binning factor in play here. You can see the plot is on a per 1 up to 9 , the next bin is 10-49 so you will always see a bump there which might have nothing to do with duplicated reads but just because it's a very large bin compared to the ones on the left.
0
Entering edit mode

Thanks for your answer. This is for clinical analysis and GATK best practices recommend removing duplicates.

I am still wondering if having 50% of duplicates reads is normal, and why FASTQC says I have 48%, but with RmDup and MarkDuplicates, only the 15% are removed???

1
Entering edit mode

indeed for such an analysis it's advised to remove duplicates.

duplication rate is also dependent on biological complexity of your sample(s) combined with the sequencing depth.

Why one removes or detects more duplication than another tool I don't really know but it is very well possible they all have different definitions of duplication and/or different levels of sensitivity to pick them up

1
Entering edit mode

What type of libraries do you have? Is this RNaseq, shotgun whole genome, exome? For RNAseq and exome, it would be normal do have a high duplication rate.

1
Entering edit mode

To do a true estimation of duplicates you can use clumpify.sh from BBMap suite which does purely sequence based analysis, no alignments needed like other tools you mention. It can do perfect matches and can work on paired-end reads at the same time.

2
Entering edit mode
6 months ago
h.mon 34k

The issue here is a somewhat hidden (in plain sight, but hidden anyways) feature of FastQC and Istvan's code: duplicates are estimated over the first 50 bp of a read, the rest is ignored - actually, FastQC and Istvan's implementations probably are not identical, as we will see bellow.

The FastQC docs have a verbal description of the details of the duplication module. Two important parts are:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis.

Istvan's code says the fastq file has been processed with:

cat data.fq \
| bioawk -c fastx '{ print substr(\$seq,1, 50) } ' \
| sort \
| uniq -c \
| sort -k1,1 -rn \
> data.uniq.txt


This states every read of the fastq file has been cut down to 50 bp to perform the analysis - in opposition to FastQC's, which processes the first 100K reads, with reads longer than 75 bp being cut down to 50 bp.

Anyways, and to reinforce the point, FastQC and Istvan's duplicate estimation do not take into account the whole read, so may show duplicates even if the original fastq does not contain duplicates.

0
Entering edit mode

interesting point, the duplication as a concept always has another twist to it

1
Entering edit mode
6 months ago
d-cameron ★ 2.7k

So my question is, are not all duplicated reads removed when applying removal duplicated algorithms?

They are not. samtools rmdup and picard MarkDuplicates both use read alignments to determine duplicate reads. If two reads with identical sequence are mapped to different locations in the genome (e.g. they come from a L1HS LINE element, or an alpha satellite repeat) then the duplicate removal tools will fail to remove them. Last time I checked, they also had problems correctly reduplicating split read alignments.

This doesn't explain a 15% vs 50% difference though. Did you do any fastq compression? Tools like BBMap clumpify.sh will reorder the fastq so similar sequences appear near each other - an approach that doesn't play nicely with a duplication rate estimation tool that expects the reads to be random.

why is there a pick between 9 and >10

Because >10 contains 11,12,13,14,15,...,50. That peaks is what you would expect to see based on the duplicate profile we observe for the 1-9 bins in that plot.

0
Entering edit mode

are you sure what you post here?

So if no genome available those tools can't do anything? Moreover, if two reads have the exact same sequence they should align to the same region unless they assigned random because no unique mapping exists.

1
Entering edit mode

So if no genome available those tools can't do anything?

Correct. Duplicate marking tools such as samtools & picard use a sorted BAM file as input. If there is no genome then those tool will indeed fail to run because the input file isn't going to be in a valid format.

if two reads have the exact same sequence they should align to the same region unless they assigned random because no unique mapping exists.

No popular aligner uses this approach. They will either a) not map them (if the entire read consists of very highly abundant sequences above the aligner-defined seeding threshold (bwa mem default is 500)), or b) randomly assign to one of the homologs. The reason aligners don't systematically assign to the same location is because it stuffs up CNV calling. If they did what you are suggesting, the coverage of a diploid genome would go haywire at every repeat. The 'random' assignment is usually a determinist pseudo-random one so rerunning on the same input fastq will always give you the same alignments.