My paired-end RNA-seq data shows nearly 60% sequence duplication. I am running the analysis in Galaxy and using fastp to remove duplicate reads. When I pass the fastp JSON output to MultiQC, it shows that all duplication has been removed. But when I run FastQC on the processed reads and view the result in MultiQC, the same level of sequence duplication is still there. Can anyone help me deal with this?
Sequence duplication % in raw reads
Sequence duplication level (%) for fastp-processed reads -> FastQC -> MultiQC
fastp JSON reports combined directly by MultiQC
What is happening here? And can I proceed with the fastp-processed reads?
I would not put a lot of confidence in "duplication detection" by FastQC. FastQC only looks at the first 100K reads when checking this (LINK), so if you are really interested in finding sequence duplication you will want to use a program like clumpify.sh, which works at the sequence level (see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.).

As for proceeding with the fastp-processed reads: yes. In some experiments you expect some duplication to be present because that is normal (e.g. RNA-seq, where there can be multiple copies of a transcript).
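
For a rough independent check, a minimal Python sketch like the one below counts exact duplicates over every read (or read pair) in a FASTQ file instead of relying on a sample. It is only an illustration of sequence-level counting, not how FastQC, fastp, or clumpify.sh compute their numbers. The function names, the `limit` parameter, and the file names in the usage note are hypothetical, and the code assumes standard 4-line FASTQ records with R1 and R2 in the same order.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path, limit=None):
    """Yield the sequence line of each FASTQ record (plain or gzipped).

    Assumes 4-line records; `limit` (hypothetical) keeps only the first N reads.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        sequences = islice(handle, 1, None, 4)
        for seq in islice(sequences, limit):
            yield seq.strip()


def duplication_percent(r1_path, r2_path=None, limit=None):
    """Percent of reads (or read pairs) that are exact copies of an earlier one."""
    if r2_path is None:
        keys = read_sequences(r1_path, limit)
    else:
        # For paired-end data, a duplicate means both mates are identical.
        keys = zip(read_sequences(r1_path, limit),
                   read_sequences(r2_path, limit))
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of a given key counts as a duplicate.
    return 100.0 * (total - len(counts)) / total
```

Calling `duplication_percent("R1.fastq.gz", "R2.fastq.gz")` on the raw and on the processed files gives two numbers computed the same way, which makes them directly comparable. Holding every sequence in memory is fine for a quick check but may be too heavy for very large libraries.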