Question: Duplicate Reads In Rnaseq
6.3 years ago by
Ashutosh Pandey (11k) wrote:

In genomic analyses that include variant discovery, it is advisable to ignore or remove duplicate reads, because a replication (amplification) error can easily be mistaken for a SNP. So we usually keep only reads with unique start positions.
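As a rough illustration of "reads with unique start positions", here is a minimal Python sketch. The `Read` record and the `keep_unique_starts` helper are hypothetical names made up for this example, and including the strand in the key is my own assumption; real duplicate-marking tools work on BAM files and use more information than this.

```python
from collections import namedtuple

# Hypothetical minimal read record; real pipelines take these fields from a BAM file.
Read = namedtuple("Read", ["name", "chrom", "start", "strand"])

def keep_unique_starts(reads):
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same start and strand as r1: treated as a duplicate
    Read("r3", "chr1", 100, "-"),  # same start, opposite strand: kept (assumption)
    Read("r4", "chr1", 250, "+"),
]
unique_reads = keep_unique_starts(reads)
print([r.name for r in unique_reads])
```

Note that this kind of filter cannot tell a PCR duplicate from two independent fragments that happen to start at the same position.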

The library generation protocol for RNA-seq involves amplification at some point. I know amplification is required for sequencing, and I am also fairly sure that amplification is carried out uniformly across the transcriptome (at least theoretically). For example, suppose Gene A has 4 RNA copies in the sample and Gene B has 10 copies. After amplification, if Gene A has 16 copies, then Gene B should have 40 copies. I also know that with a very deep library, a lot of duplicate reads are expected for reasons other than amplification.
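The arithmetic in that example can be checked with a few lines of Python. This is only a sketch of the idealized case; the uniform 4x factor is the assumption being illustrated, not something measured.

```python
# Copy numbers from the example above.
copies_before = {"gene_A": 4, "gene_B": 10}

# Assumed uniform amplification factor (4x turns 4 copies into 16).
fold = 4
copies_after = {gene: n * fold for gene, n in copies_before.items()}

# Under uniform amplification the B-to-A ratio is unchanged.
ratio_before = copies_before["gene_B"] / copies_before["gene_A"]
ratio_after = copies_after["gene_B"] / copies_after["gene_A"]

print(copies_after)
print(ratio_before, ratio_after)
```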

a) Does anyone have an idea of what percentage (range) of duplicate reads among total reads is considered normal for RNA-seq data?

Also, do we need to discard duplicate reads in RNA-seq experiments where we compare gene expression between two samples and the number of duplicate reads differs significantly between them (one sample was amplified more than the other)? I know that dividing the read counts by the total number of mapped reads in each sample removes the bias from unequal read numbers across samples, but will this normalization step also account for duplicates, so that we get a true representation of the transcriptome after normalization?
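To make the normalization question concrete, here is a toy Python sketch with made-up counts. The `counts_per_million` helper is a hypothetical name for plain library-size scaling. It only shows the arithmetic: scaling by total mapped reads cancels a difference that affects every gene by the same factor, but not one that affects a single gene more than the others. Whether real amplification behaves like the first case or the second is exactly the open question.

```python
def counts_per_million(counts):
    """Scale raw read counts by the total number of mapped reads in the sample."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

# Made-up read counts for a reference sample.
base = {"gene_A": 400, "gene_B": 1000, "gene_C": 600}

# Every gene amplified 3x (the idealized, uniform case).
uniform = {gene: 3 * n for gene, n in base.items()}

# Only gene_B amplified 3x (a non-uniform case, chosen for illustration).
skewed = dict(base)
skewed["gene_B"] = 3 * base["gene_B"]

cpm_base = counts_per_million(base)
cpm_uniform = counts_per_million(uniform)
cpm_skewed = counts_per_million(skewed)

for gene in base:
    print(gene, cpm_base[gene], cpm_uniform[gene], cpm_skewed[gene])
```

In the skewed case the scaled values of gene_A and gene_C drop even though their raw counts did not change, because gene_B inflates the total.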

Some protocols that start from a very small quantity of RNA require a lot of amplification before sequencing. I have RNA-seq data for two such samples: in the first sample only 30% of the reads are unique (non-duplicate), and in the second around 50% are. Can I still carry out RPKM normalization and use tools like DEGseq or edgeR to get the list of differentially expressed genes?
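For reference, RPKM (reads per kilobase of transcript per million mapped reads) can be written as a one-line function. This is a minimal sketch; the function name, argument names, and example numbers are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Made-up example: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example_rpkm = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(example_rpkm)
```

Both `read_count` and `total_mapped_reads` include duplicates unless they were removed before counting, so the value depends on whether deduplication was applied.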



I think we DON'T need to discard duplicate reads from RNA-seq experiments.

written 6.3 years ago by Rm (7.8k)

I agree with that too

written 6.3 years ago by JC (7.0k)
6.3 years ago by
Danville, PA
Rm (7.8k) wrote:

For some insights on PCR duplicates in RNA-seq data, follow this thread on SEQanswers.

6.2 years ago by
Ketil (3.9k) wrote:

I think the number of duplicates depends on many factors, so it is hard to give any general and useful rule of thumb. Usually, duplicates are correlated with too little sample material and/or difficulties in the lab. I expect more complex procedures may cause more duplicates, but I don't have any hard numbers on that. In my experience, duplication rates seem to be higher and less evenly distributed with 454 than with Illumina, but that could be a bias from the types of data I've seen.

I'm curious if others' experiences agree with mine.
