Question: Should We Remove Duplicated Reads In RNA-Seq?
siddharth.sethi5 (Medical Research Council, Oxford, United Kingdom) wrote, 6.4 years ago:

Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?


How do you define a duplicated read? If you delete duplicate reads, how will you get information on expression levels?

written 6.4 years ago by Whetting

A duplicated read is not the same as a duplicated transcript. The level of natural read duplication is a function of coverage and expression level, and since the latter is usually not known, we can only estimate the natural duplication rate.

written 6.4 years ago by Istvan Albert

Reads which are exactly the same and on the same strand. They get aligned to the same place and look like multiple copies of one read. So I assume the answer is NO?

written 6.4 years ago by siddharth.sethi5
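
For anyone who wants to quantify this definition, below is a minimal single-end sketch using pysam (the BAM file name and the read filters are illustrative assumptions): it groups mapped reads by reference, 5' start position and strand, and counts every copy beyond the first as a duplicate.

```python
# Illustrative sketch only: counts "duplicates" in the sense described above,
# i.e. single-end reads whose alignments share the same reference, 5' position
# and strand. Assumes a BAM file (hypothetical name) and the pysam package.
from collections import Counter

import pysam

def duplication_rate(bam_path):
    """Return (total primary mapped reads, fraction that are extra copies)."""
    signatures = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # 5' end: reference_end for reverse-strand reads, reference_start otherwise
            five_prime = read.reference_end if read.is_reverse else read.reference_start
            signatures[(read.reference_name, five_prime, read.is_reverse)] += 1
    total = sum(signatures.values())
    duplicates = total - len(signatures)  # every copy beyond the first at a position
    return total, (duplicates / total if total else 0.0)

# Example (hypothetical file): total, rate = duplication_rate("sample.bam")
```

Note that this counts natural and artificial duplicates alike, which is exactly the ambiguity discussed in the answers below.
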
Obi Griffith (Washington University, St Louis, USA) wrote, 6.4 years ago:

The general consensus seems to be to NOT remove duplicates from RNA-seq. Read these Biostars discussions:

and this SEQanswers thread and the other threads it links to.


The general consensus I see from those discussions, rather than "no", is "it depends".

First link, Malachi Griffith writes: "Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes."

Second link, Istvan Albert writes: "My personal opinion is to investigate the duplication rates and remove them if there is indication that these are artificial ones."

Third link, Ketil writes: "I think the number of duplicates depend on many factors, so it is hard to give any general and useful rules of thumb. Usually, duplicates are correlated with too little sample material, and/or difficulties in the lab."

And the SEQanswers thread goes both ways.

written 6.4 years ago by Mikael Huss

Yes. I should have said that the general consensus seems to be to not blindly remove duplicates.

written 6.4 years ago by Obi Griffith

But are the type I/II errors introduced by duplicates dependent on expression level/quartile?

The coefficient of variation is often lowest in the most abundant genes, which are also the genes with the highest probability of duplicates (optical, amplification, or true duplicates) simply because they consume most of the data. Variance for these genes is often artificially inflated anyway by most regularization procedures used in differential expression, so the conclusions from DE analyses should be largely the same. In contrast, genes with low expression stand to be the most affected by duplicates for the same reason: their variance is artificially reduced by regularization AND duplicates are a larger proportion of their count totals to begin with, thus increasing type I error. The odds of these arguments affecting opinions are low, but I think it's a conversation worth having.

I would recommend re-running the differential expression analysis with and without deduplication and then comparing the resulting gene sets for each quartile of expression. My educated guess is that conclusions for the top quartile would be fairly consistent (due to the low CoV), while results could be quite different for genes with low expression, where a larger percentage of the counts may be optical or technical duplicates (in comparison to genes in the top quartile).

Edit: rephrasing and grammar

written 23 months ago by mrals89
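
For anyone who wants to try the comparison suggested in the reply above, here is a minimal pandas sketch. It assumes you have already produced two differential-expression result tables for the same contrast, one from the original alignments and one from deduplicated alignments; the file names and column names ("gene", "mean_expr", "padj") are hypothetical placeholders.

```python
# Sketch: compare DE calls with vs. without deduplication, per expression quartile.
# Input tables are hypothetical; adapt column names to your DE tool's output.
import pandas as pd

raw = pd.read_csv("de_results_raw.csv")      # DE run on original alignments
dedup = pd.read_csv("de_results_dedup.csv")  # DE run on deduplicated alignments

merged = raw.merge(dedup, on="gene", suffixes=("_raw", "_dedup"))
merged["quartile"] = pd.qcut(merged["mean_expr_raw"].rank(method="first"), 4,
                             labels=["Q1 (low)", "Q2", "Q3", "Q4 (high)"])

for quartile, group in merged.groupby("quartile", observed=True):
    sig_raw = set(group.loc[group["padj_raw"] < 0.05, "gene"])
    sig_dedup = set(group.loc[group["padj_dedup"] < 0.05, "gene"])
    union = sig_raw | sig_dedup
    jaccard = len(sig_raw & sig_dedup) / len(union) if union else float("nan")
    print(f"{quartile}: {len(sig_raw)} vs {len(sig_dedup)} significant genes, "
          f"Jaccard overlap {jaccard:.2f}")
```

If the commenter's guess holds, the Jaccard overlap should stay high in the top quartile and drop in the lower ones.
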

This is an interesting comment and I'd like to understand it better. Are you using the word "regularization" in the statistical sense, for example as in "Tikhonov regularization"? If so, can you describe the statistical methods you have in mind? Also, how do you know that there are more duplicates per read for low-expressed genes?

written 19 months ago by eric.kern13
Biomonika (Noolean) (State College, PA, USA) wrote, 6.4 years ago:

It really depends. I have seen alignments with so many identical reads at a certain position (5' and 3' ends exactly matching) that they were for sure PCR duplicates (other positions in the alignment had no such coverage). However, you might lose information about strongly expressed genes this way. I would check for un/equal coverage for every gene/contig.

Mikael Huss (Stockholm) wrote, 6.4 years ago:

If you have paired-end reads, I definitely think you should remove duplicates (alignments that start at the same locations at both read 1 and read 2). These are very unlikely to occur by chance because of the variation in fragment size.

If you have a small amount of RNA going into the experiment, you will have run a lot of PCR cycles before sequencing and the representation of some fragments will have become very biased. Duplicate removal is a way to mitigate this effect although it will not solve it.

It continues to baffle me why people keep saying that you have to keep duplicates because you will lose information - but apparently it's perfectly fine to get grossly distorted read counts because of amplification artifacts! There is no way to avoid bias completely.

written 6.4 years ago by Mikael Huss
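
To make the fragment-signature idea above concrete, here is a sketch for paired-end data (hypothetical BAM name, pysam assumed): each properly paired fragment is keyed on chromosome, leftmost start, fragment length and strand, and any pair sharing a key with another is counted as a candidate PCR duplicate.

```python
# Sketch: estimate the fraction of paired-end fragments that share both start
# positions (and hence fragment length), i.e. candidate PCR duplicates.
from collections import Counter

import pysam

fragments = Counter()
with pysam.AlignmentFile("sample_pe.bam", "rb") as bam:  # hypothetical file
    for read in bam:
        # Count each fragment once, via read 1 of a primary, properly mapped pair
        if (read.is_unmapped or read.is_secondary or read.is_supplementary
                or not read.is_proper_pair or not read.is_read1):
            continue
        signature = (read.reference_name,
                     min(read.reference_start, read.next_reference_start),
                     abs(read.template_length),  # fragment length separates chance collisions
                     read.is_reverse)
        fragments[signature] += 1

pairs = sum(fragments.values())
if pairs:
    dup_pairs = pairs - len(fragments)
    print(f"{pairs} fragments, {dup_pairs} ({dup_pairs / pairs:.1%}) share a full signature")
```

As the next reply points out, this logic breaks down for very short, highly expressed transcripts, where identical fragments can arise legitimately.
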

The problem is that your first premise, that "These are very unlikely to occur by chance because of the variation in fragment size", is true for DNA but NOT necessarily true for RNA. If you have a short RNA (i.e., small transcript, miRNA, etc.) that is very highly expressed, there might be many, many legitimate duplicate copies of that RNA with exactly the same fragment size/position. The size of that transcript might even be such that effectively no fragmentation occurs. Also, fragmentation may not occur in a totally random fashion across the transcript. By removing those duplicates you may be introducing a new source of "grossly distorted read counts". As you say, there is no way to avoid bias completely. But, depending on the complexity and quality of your library, you may be better off living with a small amount of amplification artifacts.

written 6.4 years ago by Obi Griffith

OK, but for miRNA etc you wouldn't do paired-end sequencing anyway; what I had in mind was the typical mRNA case. Anyway, your point is noted, and again underlines that one has to think about the biases, and not just blindly follow a rule, such as "always remove duplicates" or "never remove duplicates".

written 6.4 years ago by Mikael Huss
earonesty (United States) wrote, 5.3 years ago:

The eXpress pipeline (Berkeley) does this for you. It calculates a coverage distribution and then removes the "spikes" from that distribution. Because duplicates are expected in RNA-seq, removing them "blindly" is a bad idea. You should be smoothing, not removing.

written 5.3 years ago by earonesty
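
This is not a description of eXpress's internals; purely as a toy illustration of "smoothing, not removing", one could cap per-position coverage at a multiple of a local rolling median, so isolated duplicate pile-ups are dampened without discarding reads elsewhere. A minimal numpy sketch:

```python
# Toy illustration of smoothing coverage spikes instead of deleting duplicate reads.
# Not the eXpress algorithm; window size and cap are arbitrary example values.
import numpy as np

def smooth_coverage_spikes(coverage, window=101, max_fold=5.0):
    """coverage: 1-D array of per-base read depth along one transcript."""
    coverage = np.asarray(coverage, dtype=float)
    half = window // 2
    padded = np.pad(coverage, half, mode="edge")
    rolling_median = np.array([np.median(padded[i:i + window])
                               for i in range(len(coverage))])
    ceiling = max_fold * np.maximum(rolling_median, 1.0)  # avoid a zero ceiling
    return np.minimum(coverage, ceiling)

# Example: smoothed = smooth_coverage_spikes(depth_along_transcript)
```
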

How much of an issue are read duplicates in RNA-seq experiments, both single-fragment and PE sequenced? Do they affect a large proportion of the data? What are the duplicate rates like for an average human sample? On the order of 1% duplicates? Or 10%?

written 3.7 years ago by 141341254653464453.4k

The rate of natural duplication of the reads coming from a transcript will be a function of the length of the transcript, the expression level of the transcript, and the overall coverage of the sample. Hence there is no rule that you could use a priori to separate natural duplicates (that should be kept) from artificial duplicates (that should be removed).

written 3.7 years ago by Istvan Albert
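
As a back-of-the-envelope illustration of this point: if read starts were uniform and independent along a transcript, the expected fraction of "natural" duplicates could be computed directly, and it rises with read count (expression times depth) and falls with transcript length. The numbers below are purely illustrative.

```python
# Expected natural duplication under uniform, independent read starts:
# with n reads drawn from L possible start positions,
# E[unique starts] = L * (1 - (1 - 1/L)^n), so the duplicate fraction is 1 - E/n.
def expected_natural_dup_fraction(n_reads, transcript_length):
    expected_unique = transcript_length * (1 - (1 - 1 / transcript_length) ** n_reads)
    return 1 - expected_unique / n_reads

for n_reads in (100, 1_000, 10_000, 100_000):
    frac = expected_natural_dup_fraction(n_reads, 2_000)
    print(f"{n_reads:>7} reads on a 2 kb transcript: {frac:.1%} natural duplicates")
```

Even this idealized model gives duplication rates well above 50% for highly covered transcripts, in the same ballpark as the genome-wide rates mentioned in the next comment.
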
I have seen genome-wide duplication rates up to 50% in our own data (50bp SE) and think this is nothing unusual.
written 3.7 years ago by Christian
VS (USA) wrote, 6.4 years ago:

Short answer - NO!

You are checking for differential expression of genes, and the number of reads mapping to a gene is a measure of its expression. By removing reads, you will bias the true expression measurements.

written 6.4 years ago by VS

I don't agree at all. By not removing duplicates, you bias the expression measurements towards PCR artifacts, especially from paired-end data.

written 6.4 years ago by Mikael Huss

Well, I have not worked with paired-end data so far, so I can't say much about them. But I produced 'tag' files to track the number of copies of 'duplicate' reads for single-end sequencing data. The very highly expressed genes tend to have more of such 'duplicate' reads. We also did validations with qPCR that confirm the higher expression. So I don't think you can say duplicate reads result purely from PCR amplification artifacts.

written 6.4 years ago by VS

OK, that's an interesting finding actually. Sure, duplicate reads are not always PCR amplification artifacts, but sometimes they are! Ideally you should probably go case by case.

written 6.4 years ago by Mikael Huss

But highly expressed genes will have a higher number of reads in general - I would be very interested in the ratios (duplicates/all reads) for highly/medium/low expressed genes, if you could provide this information please.

written 5.5 years ago by Biomonika (Noolean)
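
One way to get the ratios asked for above, sketched under the assumption that you have two per-gene count tables for the same sample, one from the original BAM and one after duplicate removal (hypothetical file and column names):

```python
# Sketch: per-gene duplicate fraction = 1 - dedup_count / raw_count,
# summarised by expression level. Input tables are hypothetical placeholders.
import pandas as pd

raw = pd.read_csv("gene_counts_raw.csv", index_col="gene")      # column: "count"
dedup = pd.read_csv("gene_counts_dedup.csv", index_col="gene")  # column: "count"

df = raw.join(dedup, lsuffix="_raw", rsuffix="_dedup").dropna()
df = df[df["count_raw"] > 0]
df["dup_fraction"] = 1 - df["count_dedup"] / df["count_raw"]
df["expression"] = pd.qcut(df["count_raw"].rank(method="first"), 3,
                           labels=["low", "medium", "high"])

print(df.groupby("expression", observed=True)["dup_fraction"].describe())
```
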