How detrimental are duplicate reads in RNAseq experiments?
4
11
10.3 years ago

How much of an issue are duplicate reads in RNA-seq experiments?

For both single-fragment and paired-end (PE) sequencing, does read duplication affect a large proportion of the data? What are duplicate rates like for an average human sample? Are they on the order of 1%? Or 10%?

I have seen this mentioned in another question here on Biostars, and I would like to get a feeling for how important it is in the field:

Duplicated Reads In Rna-Seq Experiment

RNA-Seq duplicate-reads • 14k views
6
9.8 years ago
igor 13k

This is a long-running debate. There was recently a paper that finally attempted to answer this question: http://biorxiv.org/content/early/2016/01/15/035493

Update: Now in Scientific Reports: http://www.nature.com/articles/srep25533

1

Very interesting paper, thanks for pointing this out!

4
9.8 years ago

The "average duplication rate" is not useful in this context (or probably any context). It varies, depending on your amount of genetic material, amplification protocol, and sequencing methodology; furthermore, even for a supposedly fixed protocol, it still varies wildly and can easily exceed 1000% in some experiments. Duplicates should never be removed in any quantitative experiment, such as RNA-seq. Also, as much as possible, amplification should be avoided in quantitative experiments. If there is no amplification, duplicate removal should not be performed.

2

"Duplicates should never be removed in any quantitative experiment, such as RNA-seq."

Mmm... I think it's a debatable issue. In my experience, removing duplicates from most pull-down or enrichment experiments (e.g. ChIP-Seq, FAIRE-Seq, etc.) gives a better signal-to-noise ratio. On the other hand, enrichment experiments are expected to generate duplicates. But RNA-Seq should definitely not have duplicates removed.

EDIT: My apologies, this should have been a comment to Brian's answer. Clicked the wrong button!

1

Most ChIP-Seq and related protocols (including the ENCODE SOPs) strongly recommend removing potential duplicates, as do variant-calling procedures. I noticed the GATK RNA-Seq protocol recommends duplicate removal as well. So it truly depends on the procedure, but I agree that in these cases removal/marking helps improve signal to noise.

1

Interesting. I disagree on theoretical grounds with removing duplicates in anything quantitative - meaning the number of reads mapped to a locus is the ultimate output (or a linear function of the ultimate output), as in RNA-seq. Variant-calling is not quantitative. I'm only somewhat familiar with ChIP-Seq, so I don't know whether the size or shape of the peaks is more important. But, with high enough coverage, duplicate removal will destroy both the size and shape of your peaks, so it should not be done. With low coverage... it shouldn't be necessary, but might be useful if you used very high amplification.

That said - if you amplify to the point that duplicate-removal improves your experimental results in a quantitative experiment, I would say that your entire experiment has already been compromised.
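
To put a rough number on the high-coverage argument, here is a minimal sketch. It assumes single-end reads whose start sites fall uniformly at random along a transcript, with no PCR at all; the function name and the 1,000-site transcript are made up for illustration. Under that assumption, the expected number of distinct start positions is all that position-based duplicate removal can keep.

```python
def expected_unique_starts(n_reads: int, n_positions: int) -> float:
    """Expected number of distinct start positions hit when n_reads
    independent fragments start uniformly at random among n_positions sites."""
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)


# Hypothetical transcript with 1,000 possible start sites and no PCR copies:
# every read is a genuinely independent fragment.
positions = 1000
for true_reads in (100, 1_000, 10_000):
    kept = expected_unique_starts(true_reads, positions)
    print(f"{true_reads:>6} real fragments -> ~{kept:.0f} kept after position-based dedup")
```

In this toy model a 100-fold difference in real fragments shrinks to roughly a 10-fold difference after deduplication, because the count can never exceed the number of start sites. Real start-site distributions are not uniform, so treat this only as an illustration of the saturation effect.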

0

Yep, agree completely.

2
9.8 years ago
Chris Fields ★ 2.2k

In general we mark duplicates (i.e. we do not remove them), and only for data from WGS/exome experiments or from analyses where amplification artifacts might be a problem (ChIP-Seq, for example). I believe some folks also do this for some single-cell analyses where amplification may be used.
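
The difference between marking and removing can be sketched in a few lines of plain Python. This is a simplified illustration, not how any particular tool works: the tuple layout and function names are hypothetical, and it keeps the first read seen at each position, whereas real tools usually choose a representative by quality. The one fixed fact used is that SAM flag bit 0x400 means "PCR or optical duplicate".

```python
DUP_FLAG = 0x400  # SAM flag bit: read is a PCR or optical duplicate


def mark_duplicates(reads):
    """Set the duplicate bit on every read after the first one seen at the
    same (chrom, pos, strand). No read is dropped."""
    seen = set()
    marked = []
    for name, chrom, pos, strand, flag in reads:
        key = (chrom, pos, strand)
        if key in seen:
            flag |= DUP_FLAG
        seen.add(key)
        marked.append((name, chrom, pos, strand, flag))
    return marked


def remove_duplicates(marked_reads):
    """Drop reads whose duplicate bit is set (the irreversible step)."""
    return [r for r in marked_reads if not r[4] & DUP_FLAG]


# Toy reads: (name, chrom, pos, strand, flag); flag 16 = reverse strand.
reads = [
    ("r1", "chr1", 100, "+", 0),
    ("r2", "chr1", 100, "+", 0),
    ("r3", "chr1", 100, "-", 16),
    ("r4", "chr1", 250, "+", 0),
]
marked = mark_duplicates(reads)
print(len(marked), len(remove_duplicates(marked)))
```

Marking keeps all four reads in the file, so a downstream tool can decide whether to honour the flag; removing throws the information away.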

2

Keep in mind that the methods used to detect duplicates (such as Picard) are actually assessing potential PCR duplicates based on sequence alignment position and CIGAR string, so your chances of getting false-positive 'duplicates' go up quite dramatically for high-coverage data (as seen in some regions with RNA-Seq). I recall reading elsewhere (possibly SEQanswers) that true PCR duplicates, assessed using random barcoding, are actually quite a bit rarer than predicted by these methods, particularly if amplification is kept to a minimum.

Also, IIRC optical duplicates are now noted and removed during the run for newer versions of Illumina's pipeline, so these aren't as commonly detected as they had been with older versions of the CASAVA pipeline.
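
As a toy illustration of why random barcodes give a lower duplicate estimate than position alone, the sketch below counts duplicates both ways. The read tuples, the barcode sequences, and the function name are all made up, and real barcode-aware tools also tolerate sequencing errors in the barcode, which this ignores.

```python
from collections import Counter


def count_duplicates(reads, use_umi):
    """Count reads that would be called duplicates.

    reads: iterable of (chrom, pos, strand, umi) tuples.
    With use_umi=False, reads sharing a position are duplicates;
    with use_umi=True, they must also share the random barcode.
    """
    groups = Counter(
        (chrom, pos, strand, umi if use_umi else None)
        for chrom, pos, strand, umi in reads
    )
    return sum(n - 1 for n in groups.values())


reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # same barcode: likely a PCR copy
    ("chr1", 100, "+", "GTTA"),  # different barcode: independent fragment
    ("chr1", 100, "+", "CCAT"),  # different barcode: independent fragment
    ("chr1", 300, "+", "AACG"),
]
by_position = count_duplicates(reads, use_umi=False)
by_umi = count_duplicates(reads, use_umi=True)
print(by_position, by_umi)  # prints "3 1"
```

Here position alone calls three of the five reads duplicates, while the barcodes suggest only one is a true PCR copy.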

2
8.3 years ago
Eijy Nagai ▴ 90

It may be very late to revive this topic, but I was searching for answers and, after reading this thread, I found a very recent article in Nature Methods about PCR duplicates. It says duplicates are artefacts that should be removed and suggests using software such as Picard's MarkDuplicates or samtools rmdup to do this easily. Hope this information is still useful... ;)

1

Good find. However, the paper mostly discusses single-cell RNA-seq.

Regarding normal bulk RNA-seq, they say:

In the Illumina labs, the team also experimented with purposefully generating a large number of PCR duplicates. The team compared data from unique reads and duplicates—'good' and 'bad' data. “Essentially—I was even a little surprised by this—you couldn't really tell the difference; the good and the bad data were identical,” says Schroth. This experimental outcome reinforced his notion that under certain conditions, such as typical RNA-seq assays, PCR duplicates are not problematic.
