Question: RNA-seq dedupe PCR contamination before or after mapping
4.1 years ago by
umn_bist370 wrote:

Will deduping (marking duplicates) with Picard before mapping affect my variant calling? Is it common practice to dedupe after mapping?

I ask because my FastQC reports still flag a lot of k-mer content, overrepresented sequences, and bad GC content. I figured these could be corrected by removing PCR contamination. This is after trimming adapters and low-quality (10) bases using BBDuk.


It depends on what data you have, but a slightly bimodal GC content distribution in whole-exome data seems to be the norm (I haven't figured out why, but it appears to be commonplace).

written 4.1 years ago by andrew.j.skelton735.9k

I'm working with tumor/normal paired-end RNA-seq samples from TCGA. The GC distribution varies across the board: in some samples the deviation is slight, in others drastic. I fear that mapping my reads without correcting the GC and k-mer bias may muddle my variant calling downstream.

written 4.1 years ago by umn_bist370

I highly recommend you look at the GATK Best Practices, which include caveats for using RNA-seq data (provided the samples have suitable depth).

written 4.1 years ago by andrew.j.skelton735.9k
4.1 years ago by
Carlo Yague4.9k wrote:

Overrepresented sequences and skewed GC content are expected in RNA-seq data. They usually come from the most highly expressed transcripts (such as rRNA). However, they can also come from PCR duplicates, and those can completely skew variant calling. For this reason, while people usually don't dedupe RNA-seq data for differential expression analysis, it is still recommended to do so for variant calling.
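
On the before/after question: Picard MarkDuplicates takes aligned reads (SAM/BAM) as input, so duplicate marking normally happens after mapping. A rough sketch of the idea is below. It is not Picard's actual implementation, which also handles clipping, orientation, and optical duplicates; the `AlignedPair` record, its fields, and the example reads are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """Hypothetical, simplified record for one mapped read pair."""
    name: str
    chrom: str
    strand: str            # '+' or '-' for read 1
    start: int             # 5' mapped position of read 1
    mate_start: int        # 5' mapped position of read 2
    base_quality_sum: int  # summed base qualities of the pair


def mark_duplicates(pairs):
    """Return the names of read pairs flagged as duplicates.

    Pairs sharing chromosome, strand, and both 5' positions are treated
    as copies of one original fragment. The copy with the highest summed
    base quality is kept; the others are flagged.
    """
    groups = defaultdict(list)
    for p in pairs:
        groups[(p.chrom, p.strand, p.start, p.mate_start)].append(p)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.base_quality_sum)
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


# Made-up example data, not real reads.
pairs = [
    AlignedPair("r1", "chr1", "+", 1000, 1250, 7200),
    AlignedPair("r2", "chr1", "+", 1000, 1250, 6900),  # same coordinates as r1
    AlignedPair("r3", "chr1", "+", 1000, 1300, 7000),  # different mate position
    AlignedPair("r4", "chr2", "-", 5000, 4800, 7100),
]
print(sorted(mark_duplicates(pairs)))
```

Because the grouping key is built from mapped coordinates, this kind of check cannot be run on raw FASTQ reads. It also shows why highly expressed transcripts are a caveat in RNA-seq: independent fragments from the same abundant transcript can legitimately share coordinates and be flagged as well.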

Some reference :

