Question

Removing Duplicates From RNA-Seq Data

1

11.1 years ago

hpapoli ▴ 140

Hello,

I have RNA-seq data sequenced on the Illumina platform.

I have run quality control with FastQC, and it detected duplicates. As I am going to use these sequence data for SNP calling, I believe I must remove the duplicates.

Does anyone have experience with this? What is the best way to remove the duplicates: before mapping, or when I start the SNP calling with GATK?

Also, which software is suggested for this purpose?

Thanks a lot in advance.

rna-seq • 11k views

ADD COMMENT • link updated 11.1 years ago by Ashutosh Pandey 12k • written 11.1 years ago by hpapoli ▴ 140

score 6 · Answer 1 · 2013-03-19

6

11.1 years ago

Ashutosh Pandey 12k

As a rule of thumb, you don't remove duplicates from RNA-seq data for quantification purposes. But if you want to detect variants, you should certainly remove or mark duplicates. I don't think GATK has a module that lets you remove or mark duplicates, so you will have to use Picard (http://picard.sourceforge.net/) for that. You can then give the processed BAM file from Picard to GATK for variant calling.
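A minimal sketch of the Picard step, written as a small Python helper that only builds the command line. It assumes a Picard distribution packaged as a single `picard.jar` and uses the legacy `KEY=VALUE` argument style; the file names and the helper name `mark_duplicates_cmd` are hypothetical. Picard's MarkDuplicates expects a coordinate-sorted BAM as input.

```python
from typing import List


def mark_duplicates_cmd(in_bam: str, out_bam: str, metrics: str,
                        remove: bool = False) -> List[str]:
    """Build a Picard MarkDuplicates command (legacy KEY=VALUE syntax).

    Assumes picard.jar is in the working directory (hypothetical path).
    """
    cmd = [
        "java", "-jar", "picard.jar", "MarkDuplicates",
        f"I={in_bam}",    # coordinate-sorted input BAM
        f"O={out_bam}",   # output BAM with duplicates flagged
        f"M={metrics}",   # duplication metrics report
    ]
    if remove:
        # Drop duplicates from the output instead of only flagging them.
        cmd.append("REMOVE_DUPLICATES=true")
    return cmd


# Hypothetical file names, for illustration only.
cmd = mark_duplicates_cmd("sample.sorted.bam", "sample.dedup.bam",
                          "sample.dup_metrics.txt")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Marking rather than removing keeps every read in the BAM, and downstream tools can then decide whether to skip reads carrying the duplicate flag.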

ADD COMMENT • link 11.1 years ago by Ashutosh Pandey 12k

10

The rationale for removing duplicates does not change because the purpose of the experiment changes (expression analysis vs. SNP finding). You would need to remove PCR duplicates in either case. The complication arises from the fact that these are RNA-Seq reads, not shotgun genomic reads. With RNA-Seq data you expect some transcripts to be very deeply sampled (oversampled, really), which makes it difficult to determine with certainty whether an observed duplicate is due to PCR or comes from two independent transcripts. A duplicate caused by the former should be removed, while a duplicate due to the latter should not. When sequencing genomic DNA, an observed duplicate is more likely to be due to PCR; for RNA-Seq, a duplicate is more likely to be the result of oversampling a transcript. This is why duplicate removal is typically not performed on RNA-Seq data, whatever the type of analysis.
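The oversampling argument can be made concrete with a rough calculation. The sketch below assumes single-end reads whose start positions are drawn uniformly and independently from the possible start sites on one transcript, with no PCR duplication at all; the function name and the numbers are illustrative assumptions, not measurements.

```python
def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that share a start position with another
    read already counted, when n_reads independent fragments are sampled
    uniformly from n_positions start sites (no PCR involved)."""
    if n_reads <= 0 or n_positions <= 0:
        raise ValueError("n_reads and n_positions must be positive")
    # Expected number of distinct start sites hit at least once.
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    # Every read beyond the first at a site looks like a duplicate.
    return 1.0 - expected_distinct / n_reads


# A hypothetical transcript with 1,000 possible start sites at rising depth.
for n in (100, 1_000, 5_000, 50_000):
    print(n, round(expected_duplicate_fraction(n, 1_000), 3))
```

Under these assumptions the apparent duplication rate climbs steeply once the number of reads from a transcript approaches the number of available start sites, even though every fragment is independent. A position-based duplicate caller cannot tell these reads apart from true PCR copies.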

ADD REPLY • link 11.1 years ago by kmcarr00 ▴ 290

0

Thanks a lot for your comment. I am going to map my RNA-seq data against a de novo assembly using BWA. Does this mean that I must first use Picard to remove the duplicates and then map the reads? Sorry for the simple question, I am quite new to the field.

ADD REPLY • link 11.1 years ago by hpapoli ▴ 140

0

You map the reads first and then remove the duplicates. Duplicates are reads that align to the same location, so you need to align them first. One more thing: duplicates are marked at the library level, so if you have two different libraries and both contain sequences that align to the same location, those reads won't be considered PCR duplicates.
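A toy sketch of the idea, assuming each aligned read is reduced to its library, contig, start position and strand. This is a simplification for illustration and not Picard's actual algorithm, which works from read-group library tags, 5' unclipped positions and both mates of a pair. The `Read` type, the read names and the quality sums are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    library: str
    chrom: str
    pos: int        # alignment start, so reads must be mapped first
    strand: str
    qual_sum: int   # sum of base qualities, used to pick the read to keep


def mark_duplicates(reads: List[Read]) -> Dict[str, bool]:
    """Return {read name: is_duplicate}, grouping reads within each library."""
    groups: Dict[Tuple[str, str, int, str], List[Read]] = defaultdict(list)
    for r in reads:
        # The library is part of the key: same position in another library
        # lands in a different group and is never compared.
        groups[(r.library, r.chrom, r.pos, r.strand)].append(r)

    flags: Dict[str, bool] = {}
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)  # keep the best read
        for r in group:
            flags[r.name] = r is not best
    return flags


reads = [
    Read("r1", "libA", "contig1", 100, "+", 900),
    Read("r2", "libA", "contig1", 100, "+", 850),  # same library and position as r1
    Read("r3", "libB", "contig1", 100, "+", 800),  # same position, other library
    Read("r4", "libA", "contig1", 250, "+", 870),
]
flags = mark_duplicates(reads)
print(flags)
```

Here only `r2` is flagged: it shares library, contig, position and strand with `r1` and has the lower quality sum, while `r3` sits at the same position but in a different library.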

ADD REPLY • link 11.1 years ago by Ashutosh Pandey 12k

0

Take a look at these Biostar threads, which could be helpful: "Workflow Or Tutorial For Snp Calling?" and "What Is The Best Pipeline For Human Whole Exome Sequencing?"

ADD REPLY • link 11.1 years ago by Sudeep ★ 1.7k