Removing Duplicates From RNA-Seq Data
11.1 years ago
hpapoli ▴ 140

Hello,

I have RNA-seq data sequenced on an Illumina platform.

I ran quality control with FastQC and it did indeed detect duplicates. As I am going to use these sequence data for SNP calling, I must remove the duplicates.

Does anyone have any experience with this? What is the best way to remove the duplicates: before mapping, or when I start the SNP calling with GATK?

Also, which software is recommended for this purpose?

Thanks a lot in advance.

rna-seq • 11k views
11.1 years ago

As a rule of thumb, you don't remove duplicates from RNA-seq data for quantification purposes. But if you want to detect variants, you should certainly remove or mark duplicates. I don't think GATK has a module that lets you remove or mark duplicates, so you will have to use Picard (http://picard.sourceforge.net/) for that. You can then give the processed BAM file from Picard to GATK for variant calling.


The rationale for removing duplicates does not change because the purpose of the experiment changes (expression analysis vs. SNP finding). You would need to remove PCR duplicates in either case. The complication arises from the fact that these are RNA-Seq reads, not shotgun genomic reads. With RNA-Seq data you expect some transcripts to be very deeply sampled (oversampled, really), which makes it difficult to determine with certainty whether an observed duplicate is due to PCR or comes from two independent transcripts. A duplicate caused by the former should be removed, while a duplicate due to the latter should not. When sequencing genomic DNA, an observed duplicate is more likely to be due to PCR; for RNA-Seq, a duplicate is more likely to be the result of oversampling a transcript. This is why duplicate removal is typically not performed on RNA-Seq data, whatever the type of analysis.
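To put rough numbers on the oversampling argument, here is a minimal sketch. It assumes single-end reads whose start positions are drawn independently and uniformly, with no PCR at all, and asks what fraction of reads would still share a start position with another read. The function name and the read counts and lengths in the example are illustrative assumptions, not values from this thread.

```python
def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that look like positional duplicates.

    Assumes n_reads independent fragments, each starting uniformly at one
    of n_positions sites, and NO PCR duplication. Every read beyond the
    first at a given start site is counted as a "duplicate".
    """
    if n_reads == 0:
        return 0.0
    # Expected number of distinct start sites hit by at least one read.
    expected_distinct = n_positions * (1 - (1 - 1 / n_positions) ** n_reads)
    return 1 - expected_distinct / n_reads


# Hypothetical highly expressed transcript: 2,000 bp covered by 50,000 reads.
rnaseq_like = expected_duplicate_fraction(50_000, 2_000)

# Hypothetical genomic region: 1,000,000 bp covered by 10,000 reads.
genomic_like = expected_duplicate_fraction(10_000, 1_000_000)

print(f"deeply sampled transcript: {rnaseq_like:.1%} apparent duplicates")
print(f"shallow genomic region:    {genomic_like:.1%} apparent duplicates")
```

Under these assumptions almost every read on the deeply sampled transcript shares a start position with another read even though none is a PCR copy, whereas in the genomic case coincidental duplicates are rare. Paired-end data reduces coincidental collisions because both ends must match, but the same effect applies to very highly expressed transcripts.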


Thanks a lot for your comment. I am going to map my RNA-seq data against a de novo assembly using BWA. So does this mean that I must first use Picard to remove the duplicates and only then map the reads? Sorry for the simple question, I am quite new to the field.


You first map the reads and then remove the duplicates. Duplicates are reads that align to the same location, so you need to align them first. One more thing: duplicates are marked at the library level, so if you have two different libraries and they both contain sequences that align to the same location, they won't be considered PCR duplicates.
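The idea of "same location, same library" can be sketched in a few lines. This is a simplified illustration, not Picard's actual algorithm: the `Read` class, its fields, and the choice to keep the read with the highest summed base quality are assumptions made for the example, and real tools also take mate positions and clipping into account.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    library: str   # library the read came from (hypothetical field)
    ref: str       # reference sequence / contig name
    pos: int       # alignment start position
    strand: str    # "+" or "-"
    qual_sum: int  # sum of base qualities, used to pick which read to keep


def mark_duplicates(reads: list[Read]) -> set[str]:
    """Return the names of reads that would be flagged as duplicates.

    Reads are grouped by (library, ref, pos, strand). Within each group
    the read with the highest qual_sum is kept and the rest are flagged.
    Reads from different libraries never fall into the same group.
    """
    best: dict[tuple[str, str, int, str], Read] = {}
    for read in reads:
        key = (read.library, read.ref, read.pos, read.strand)
        if key not in best or read.qual_sum > best[key].qual_sum:
            best[key] = read
    kept = {id(read) for read in best.values()}
    return {read.name for read in reads if id(read) not in kept}


reads = [
    Read("r1", "libA", "contig1", 100, "+", 900),
    Read("r2", "libA", "contig1", 100, "+", 850),  # same library and site as r1
    Read("r3", "libB", "contig1", 100, "+", 700),  # same site, different library
    Read("r4", "libA", "contig1", 100, "-", 800),  # same site, opposite strand
]
print(sorted(mark_duplicates(reads)))
```

Only `r2` is flagged here: `r3` aligns to the same position but belongs to a different library, and `r4` is on the opposite strand. Because the grouping key depends on alignment coordinates, this step can only run after mapping.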
