Hello,
I have RNA-seq data sequenced on an Illumina platform.
I ran quality control with FastQC and it did indeed detect duplicates. As I am going to use these sequence data for SNP calling, I must remove the duplicates.
Does anyone have experience with this? What is the best way to remove the duplicates: before mapping, or when I start the SNP calling with GATK?
Also, which software is recommended for this purpose?
Thanks a lot in advance.
The rationale for removing duplicates does not change because the purpose of the experiment changes (expression analysis vs. SNP finding). You would need to remove PCR duplicates in either case. The complication arises from the fact that these are RNA-Seq reads, not shotgun genomic reads. With RNA-Seq data you expect some transcripts to be very deeply sampled (oversampled, really), which makes it difficult to determine with certainty whether an observed duplicate is due to PCR or comes from two independent transcripts. A duplicate caused by the former should be removed, while a duplicate due to the latter should not. When sequencing genomic DNA, it is more likely that an observed duplicate is due to PCR; for RNA-Seq, the likelihood is that a duplicate is the result of oversampling a transcript. This is why duplicate removal is typically not performed on RNA-Seq data, whatever the type of analysis.
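To put rough numbers on the oversampling argument, here is a minimal Python sketch. It assumes reads start uniformly at random over the available start positions, are single-end, and count as duplicates whenever they share a start position; real libraries are not uniform, so treat the output as an illustration only. The function name and the example read counts are made up for this post.

```python
import math

def chance_duplicate_fraction(n_reads, n_start_positions):
    """Expected fraction of reads that share a start position with an
    earlier read purely by chance (no PCR involved), assuming uniform
    sampling of start positions. This is a simplifying assumption."""
    # Probability that a given start position is hit by at least one read:
    # 1 - (1 - 1/L)^n, computed in a numerically stable way.
    p_hit = -math.expm1(n_reads * math.log1p(-1.0 / n_start_positions))
    expected_distinct = n_start_positions * p_hit
    return 1.0 - expected_distinct / n_reads

# Hypothetical highly expressed transcript: 2,000 start positions, 10,000 reads.
rna_fraction = chance_duplicate_fraction(10_000, 2_000)

# Hypothetical genomic shotgun run: 3e9 start positions, 1e8 reads.
dna_fraction = chance_duplicate_fraction(100_000_000, 3_000_000_000)

print(f"RNA-Seq transcript: {rna_fraction:.1%} duplicates expected by chance")
print(f"Genomic shotgun:    {dna_fraction:.1%} duplicates expected by chance")
```

Under these assumptions most reads on the deeply sampled transcript look like duplicates without any PCR artifact, whereas chance duplicates in the genomic case are rare.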
Thanks a lot for your comment. I am going to map my RNA-seq data against a de novo assembly using BWA. Does this mean that I must first use Picard to remove the duplicates and then map the reads? Sorry for my simple question, I am quite new to the field.
You map the reads first and then remove the duplicates. Duplicates are reads that align to the same location, so you need to align them first. One more thing: duplicates are marked at the library level, so if you have two different libraries and both contain sequences that align at the same location, they won't be considered PCR duplicates.
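The idea of position-based, per-library duplicate marking can be sketched in a few lines of Python. This is a toy illustration, not how Picard is implemented: the `Read` tuple and `mark_duplicates` function are hypothetical, reads are treated as single-end, and the first read seen at a position is kept, whereas real tools use more careful rules to choose which read to keep.

```python
from collections import namedtuple

# Hypothetical minimal record of an aligned read (not a real BAM record).
Read = namedtuple("Read", "name library ref pos strand")

def mark_duplicates(reads):
    """Return (read name, is_duplicate) pairs. A read is flagged when an
    earlier read from the SAME library aligned to the same reference,
    position and strand."""
    seen = set()
    flagged = []
    for read in reads:
        key = (read.library, read.ref, read.pos, read.strand)
        flagged.append((read.name, key in seen))
        seen.add(key)
    return flagged

reads = [
    Read("r1", "libA", "contig1", 100, "+"),
    Read("r2", "libA", "contig1", 100, "+"),  # same library, same location
    Read("r3", "libB", "contig1", 100, "+"),  # same location, other library
    Read("r4", "libA", "contig1", 100, "-"),  # same position, other strand
]

result = mark_duplicates(reads)
for name, is_dup in result:
    print(name, "duplicate" if is_dup else "kept")
```

Only `r2` is flagged here: `r3` comes from a different library and `r4` aligns to the opposite strand, so neither matches an earlier read's key.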
Take a look at these Biostar threads, which could be helpful: "Workflow Or Tutorial For Snp Calling?" and "What Is The Best Pipeline For Human Whole Exome Sequencing?"