Question: How detrimental are duplicate reads in RNAseq experiments?
9
14134125465346445 (United Kingdom) wrote 5.7 years ago:

How much of an issue is the fact that reads can be duplicates in RNA seq experiments?

For both single-fragment and paired-end (PE) sequencing, does read duplication affect a large proportion of the data? What are typical duplicate rates for an average human sample experiment? Are they on the order of 1% duplicates? Or 10% duplicates?

I have seen this mentioned in another question here on Biostars, and I would like to get a feeling for how important it is in the field:

Duplicated Reads In Rna-Seq Experiment

rna-seq duplicate-reads • 6.4k views
modified 3.7 years ago by Eichan • written 5.7 years ago by 14134125465346445
5
igor (United States) wrote 5.1 years ago:

This is a long-running debate. There was recently a paper that finally attempted to answer this question: http://biorxiv.org/content/early/2016/01/15/035493

Update: Now in Scientific Reports: http://www.nature.com/articles/srep25533

modified 4.1 years ago • written 5.1 years ago by igor
1

Very interesting paper, thanks for pointing this out!

written 5.1 years ago by Chris Fields
4
Brian Bushnell (Walnut Creek, USA) wrote 5.1 years ago:

The "average duplication rate" is not useful in this context (or probably any context). It varies, depending on your amount of genetic material, amplification protocol, and sequencing methodology; furthermore, even for a supposedly fixed protocol, it still varies wildly and can easily exceed 1000% in some experiments. Duplicates should never be removed in any quantitative experiment, such as RNA-seq. Also, as much as possible, amplification should be avoided in quantitative experiments. If there is no amplification, duplicate removal should not be performed.
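To illustrate why the rate depends so strongly on the amount of starting material, here is a minimal simulation sketch. It assumes a library is just a pool of equally likely unique molecules and that sequencing samples from that pool with replacement; the function name, the pool sizes, and the read depth are all hypothetical values chosen for illustration, not measurements.

```python
import random


def simulated_duplicate_fraction(n_molecules, n_reads, seed=0):
    """Fraction of reads that repeat a molecule already sequenced.

    Assumption: every unique molecule in the library is equally likely
    to be sampled, and reads are drawn independently with replacement.
    """
    rng = random.Random(seed)
    sampled = [rng.randrange(n_molecules) for _ in range(n_reads)]
    # Reads beyond the first copy of each molecule count as duplicates.
    return 1.0 - len(set(sampled)) / n_reads


# Same sequencing depth, three hypothetical library complexities.
for n_molecules in (1_000_000, 100_000, 10_000):
    frac = simulated_duplicate_fraction(n_molecules, n_reads=100_000)
    print(f"{n_molecules:>9} unique molecules: {frac:.1%} duplicate reads")
```

With depth held fixed, the duplicate fraction climbs steeply as the pool of unique molecules shrinks, which is why a single "average" figure says little about any particular experiment.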

modified 14 months ago by Ram • written 5.1 years ago by Brian Bushnell
2
Chris Fields (University of Illinois Urbana-Champaign) wrote 5.1 years ago:

In general we mark duplicates (i.e. we do not remove them), and only for data from WGS/exome experiments or from analyses where amplification artifacts might be a problem (ChIP-Seq, for example). I believe some folks also do this for some single-cell analyses where amplification may be used.

modified 14 months ago by Ram • written 5.1 years ago by Chris Fields
2

Keep in mind that the methods used to detect duplicates (such as Picard) are actually assessing potential PCR duplicates based on sequence alignment position and CIGAR string, so your chances of having false-positive 'duplicates' go up quite dramatically for high-coverage data (as seen with some regions in RNA-Seq). I recall reading elsewhere (possibly seqanswers) that true PCR duplicates, assessed using random barcoding, are actually quite a bit rarer than predicted using these methods, particularly if amplification is kept to a minimum.
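A toy sketch of the difference between position-based and barcode-aware duplicate calling. This is not Picard's actual algorithm (real tools take more alignment detail into account); the read records, the `umi` field, and the function name are hypothetical and only serve to show how an independent molecule can be flagged when position alone is used.

```python
def mark_position_duplicates(reads, use_umi=False):
    """Return the names of reads flagged as duplicates.

    reads: iterable of dicts with keys name, chrom, pos, strand, umi.
    The first read seen for each key is kept; later ones are flagged.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if use_umi:
            # Reads must also share the random barcode to count as copies.
            key += (read["umi"],)
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},  # PCR copy of r1
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGA"},  # independent molecule
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "-", "umi": "AACG"},
]

by_position = mark_position_duplicates(reads)
by_umi = mark_position_duplicates(reads, use_umi=True)
print(sorted(by_position))  # ['r2', 'r3']
print(sorted(by_umi))       # ['r2']
```

Here r3 comes from a different molecule that happens to start at the same position, so position-only marking counts it as a false-positive duplicate, while the barcode-aware version keeps it.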

Also, IIRC optical duplicates are now noted and removed during the run for newer versions of Illumina's pipeline, so these aren't as commonly detected as they had been with older versions of the CASAVA pipeline.

modified 14 months ago by Ram • written 5.1 years ago by Chris Fields
2
dariober (WCIP | Glasgow | UK) wrote 5.1 years ago:

"Duplicates should never be removed in any quantitative experiment, such as RNA-seq." Mmm... I think it's a debatable issue. In my experience, removing duplicates from most pull-down or enrichment experiments (e.g. ChIP-Seq, FAIRE-Seq, etc.) gives better signal-to-noise. On the other hand, enrichment experiments are expected to generate duplicates. But RNA-Seq definitely should not have duplicates removed.

EDIT: My apologies, this should have been a comment to Brian's answer. Clicked the wrong button!

modified 5.1 years ago • written 5.1 years ago by dariober
1

Most ChIP-Seq and related methods (including the ENCODE SOPs) highly recommend removal of potential duplicates, as do variant calling procedures. I noticed the GATK RNA-Seq protocol recommends duplicate removal as well. So it truly depends on the procedure, but I agree that in these cases removal/marking helps give better signal-to-noise.

modified 14 months ago by Ram • written 5.1 years ago by Chris Fields
1

Interesting. I disagree on theoretical grounds with removing duplicates in anything quantitative, meaning any experiment where the number of reads mapped to a locus is the ultimate output (or a linear function of the ultimate output), as in RNA-seq. Variant-calling is not quantitative. I'm only somewhat familiar with ChIP-Seq, so I don't know whether the size or shape of the peaks is more important. But, with high enough coverage, duplicate removal will destroy both the size and shape of your peaks, so it should not be done. With low coverage... it shouldn't be necessary, but might be useful if you used very high amplification.

That said - if you amplify to the point that duplicate-removal improves your experimental results in a quantitative experiment, I would say that your entire experiment has already been compromised.
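A back-of-the-envelope sketch of the saturation argument, under simplifying assumptions that are mine and not from any tool: single-end reads, fragments starting uniformly at random over a fixed number of possible start sites, and deduplication that keeps one read per start site. The expected number of reads surviving deduplication is then n_positions * (1 - (1 - 1/n_positions)^n_fragments). The function name and the numbers are hypothetical.

```python
def expected_reads_after_dedup(n_fragments, n_positions):
    """Expected number of distinct start positions when n_fragments
    independent fragments each start uniformly at one of n_positions sites."""
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_fragments)


n_positions = 1000  # hypothetical number of possible start sites in a transcript

# In every pair the true expression difference is exactly 2-fold.
for low, high in [(100, 200), (1_000, 2_000), (10_000, 20_000)]:
    a = expected_reads_after_dedup(low, n_positions)
    b = expected_reads_after_dedup(high, n_positions)
    print(f"true counts {low} vs {high}: "
          f"after dedup {a:.0f} vs {b:.0f}, fold change {b / a:.2f}")
```

At low coverage the 2-fold difference is almost preserved, but once the read count approaches the number of available start sites, both conditions saturate near the same ceiling and the apparent fold change shrinks toward 1.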

modified 14 months ago by Ram • written 5.1 years ago by Brian Bushnell

Yep, agree completely.

written 5.1 years ago by Chris Fields
2
Eichan (Japan/Tokyo) wrote 3.7 years ago:

Maybe it's very late to revive this topic, but I was searching for answers and, after reading this thread, I found a very recent article in Nature Methods about PCR duplicates. The authors say duplicates are artefacts that should be removed, and suggest using software such as Picard's MarkDuplicates or samtools rmdup to do this easily. Hope this information is still useful... ;)

modified 3.7 years ago • written 3.7 years ago by Eichan
1

Good find. However, the paper mostly discusses single-cell RNA-seq.

Regarding normal bulk RNA-seq, they say:

In the Illumina labs, the team also experimented with purposefully generating a large number of PCR duplicates. The team compared data from unique reads and duplicates—'good' and 'bad' data. “Essentially—I was even a little surprised by this—you couldn't really tell the difference; the good and the bad data were identical,” says Schroth. This experimental outcome reinforced his notion that under certain conditions, such as typical RNA-seq assays, PCR duplicates are not problematic.

modified 3.7 years ago • written 3.7 years ago by igor