Question

fastp json -> multiqc and fastp processed reads -> fastqc -> multiqc show different levels of sequence duplication in my RNAseq data

2.8 years ago

Rakesh Tiwari • 0

My paired-end RNA-seq data has nearly 60% sequence duplication. I am using fastp to remove duplicate reads, and I am running my RNA-seq analysis in Galaxy. When I run MultiQC on the fastp JSON output, it shows that all duplication has been removed. But when I analyze the fastp-processed reads with FastQC and view the results in MultiQC, the same level of sequence duplication is still there. Can anyone help me deal with this?

[Figure: sequence duplication % in raw reads]

[Figure: sequence duplication level (%) for fastp-processed reads -> FastQC -> MultiQC]

[Figure: MultiQC report built directly from the fastp JSON reports]

What is happening here? Can anyone explain? And can I proceed with the fastp-processed reads?

fastp duplication sequence fastqc • 2.3k views

updated 2.8 years ago by GenoMax 154k • written 2.8 years ago by Rakesh Tiwari • 0

I would not put a lot of confidence in the "duplication detection" done by FastQC. FastQC looks at only the first 100K reads when checking this (LINK), so if you are really interested in finding sequence duplication, you will want to use a program like clumpify.sh that works at the sequence level (see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.)
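To see what an exact, whole-file count looks like (as opposed to an estimate from a subset of reads), here is a minimal Python sketch. It is not how FastQC, fastp, or clumpify.sh work internally. The function names are made up for illustration, and it assumes a simple definition of duplication: the percentage of reads (or read pairs) whose sequence is an exact copy of one already seen. It keeps every distinct sequence in memory, so it is only practical for small or subsampled files.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # In a 4-line FASTQ record the sequence is the 2nd line,
        # so take lines 1, 5, 9, ... (0-based indexing).
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def duplication_percent(r1_path, r2_path=None):
    """Percent of reads (or read pairs) that exactly repeat an earlier one.

    Assumed definition (illustrative only): 100 * (total - distinct) / total.
    For paired-end data, a pair counts as a duplicate only if BOTH mates match.
    """
    if r2_path is None:
        keys = read_sequences(r1_path)
    else:
        keys = zip(read_sequences(r1_path), read_sequences(r2_path))
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total
```

Calling `duplication_percent("R1.fastq.gz")` and `duplication_percent("R1.fastq.gz", "R2.fastq.gz")` (hypothetical file names) on the same data can give quite different numbers, because a per-file count treats two reads as duplicates even when their mates differ. That is one plausible reason a per-file report and a pair-aware report can disagree.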

can I proceed with fastp processed reads

Yes. In some experiments you expect some duplication to be present because that is normal (e.g. RNAseq where there can be multiple copies of transcripts).

2.8 years ago by GenoMax 154k