Question: Removing Duplicates From RNA-Seq Data
7.7 years ago by
hpapoli90 wrote:


I have RNA-seq data sequenced on an Illumina platform.

I ran quality control with FastQC and it did detect duplicates. As I am going to use these sequence data for SNP calling, I must remove the duplicates.

Does anyone have any experience with this? What is the best way to remove the duplicates: before mapping, or when I start the SNP calling with GATK?

Also, which software is suggested for this purpose?

Thanks a lot in advance.

rna-seq • 8.8k views
modified 7.7 years ago by Ashutosh Pandey12k • written 7.7 years ago by hpapoli90
7.7 years ago by
Ashutosh Pandey12k wrote:

As a rule of thumb, you don't remove duplicates from RNA-seq data for quantification purposes. But if you want to detect variants, you should certainly remove or mark duplicates. I don't think GATK has a module that lets you remove or mark duplicates. You will have to use Picard to remove or mark them. You can then give the processed BAM file from Picard to GATK for calling variants.

written 7.7 years ago by Ashutosh Pandey12k

The rationale for removing duplicates does not change because the purpose of the experiment changes (expression analysis vs. SNP finding). You would need to remove PCR duplicates in either case. The complication is supplied by the fact that these are RNA-Seq reads, not shotgun genomic. With RNA-Seq data you expect some transcripts to be very deeply sampled (oversampled really) thus making it difficult to determine with certainty whether an observed duplicate is due to PCR or really from two independent transcripts. A duplicate caused by the former should be removed while a duplicate due to the latter should not. When sequencing genomic DNA it is more likely that a duplicate is observed due to PCR; for RNA-Seq the likelihood is that a duplicate is the result of oversampling a transcript. This is why duplicate removal is typically not performed on RNA-Seq data, whatever the type of analysis.
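To put rough numbers on the oversampling argument, here is a minimal sketch. It assumes single-end reads whose start positions fall uniformly at random on a region, with no PCR duplication at all; the function name and the read counts and region sizes are hypothetical, chosen only for illustration.

```python
def expected_duplicate_fraction(n_reads: int, n_start_sites: int) -> float:
    """Expected fraction of reads that share a start site with another read,
    assuming uniform random start positions and NO PCR duplication."""
    if n_reads <= 0:
        return 0.0
    # Expected number of distinct start sites hit by at least one read.
    p_site_missed = (1.0 - 1.0 / n_start_sites) ** n_reads
    expected_distinct = n_start_sites * (1.0 - p_site_missed)
    # Every read beyond the first at a site looks like a "duplicate".
    return 1.0 - expected_distinct / n_reads


# Hypothetical numbers: the same 50,000 reads landing on a short, highly
# expressed transcript versus being spread over a large genomic region.
transcript = expected_duplicate_fraction(n_reads=50_000, n_start_sites=2_000)
genome = expected_duplicate_fraction(n_reads=50_000, n_start_sites=3_000_000)
print(f"2 kb transcript:     {transcript:.1%} apparent duplicates")
print(f"3 Mb genomic region: {genome:.1%} apparent duplicates")
```

Under these assumptions almost every read on the deeply sampled transcript looks like a duplicate even though none came from PCR, while the genomic case shows very few coincidental duplicates.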

modified 7.7 years ago • written 7.7 years ago by kmcarr00280

Thanks a lot for your comment. I am going to map my RNA-seq data against a de novo assembly using BWA. So does this mean that before mapping I must first use Picard to remove the duplicates, and then map the reads? Sorry for my simple question, I am quite new to the field.

written 7.7 years ago by hpapoli90

You first map the reads and then remove the duplicates. Duplicates are reads that align to the same location, so you need to align them first. One more thing: duplicates are marked at the library level, so if you have two different libraries and they both contain sequences that align at the same location, they won't be considered PCR duplicates.
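The idea can be sketched on toy data as follows. This is only an illustration of "same library, same aligned location", not Picard's actual implementation; the `Read` record, its field names, and the rule of keeping the read with the highest base-quality sum are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    library: str   # library the read came from (hypothetical field)
    chrom: str     # reference sequence / contig the read aligned to
    start: int     # 5' alignment position
    strand: str    # "+" or "-"
    qual_sum: int  # sum of base qualities, used to pick which read to keep


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads are grouped by (library, chrom, start, strand), so this can only
    run after alignment. Within each group the read with the highest
    quality sum is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.library, read.chrom, read.start, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "libA", "contig1", 100, "+", 900),
    Read("r2", "libA", "contig1", 100, "+", 950),  # same library and location as r1
    Read("r3", "libB", "contig1", 100, "+", 800),  # same location, different library
    Read("r4", "libA", "contig1", 100, "-", 870),  # same position, opposite strand
]
dups = mark_duplicates(reads)
print(sorted(dups))
```

Here only one of `r1`/`r2` is flagged; `r3` is left alone because it comes from a different library, which is the library-level behaviour described above.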

written 7.7 years ago by Ashutosh Pandey12k

Take a look at these Biostar threads, they could be helpful: "Workflow Or Tutorial For Snp Calling?" and "What Is The Best Pipeline For Human Whole Exome Sequencing?"

written 7.7 years ago by Sudeep1.6k