Technical replicates in RNAseq
3
14
Entering edit mode
5.4 years ago

Hi Biostars,

Is it legitimate to sum up raw read counts from technical replicates in RNA-seq, and to use these summed counts for DE analysis? I would appreciate detailed and justified answers.

Thanks

RNA-Seq technical_replicates • 9.1k views
2
Entering edit mode
11 months ago
hermidalc ▴ 20

There is a theoretical justification for summing rather than averaging. Read counts follow a Poisson distribution, so averaging them produces data that is no longer Poisson, whereas a sum of Poisson counts is still Poisson. See Mike Love's explanation here
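A quick simulation makes Mike Love's point concrete. This is my own sketch, not his code - the Poisson sampler and the parameter values are arbitrary choices for illustration: the sum of two Poisson replicates is still Poisson (variance tracks the mean), while the average is under-dispersed.

```python
import math
import random
import statistics

random.seed(0)

def rpois(lam):
    """One Poisson(lam) draw (Knuth's algorithm, fine for small lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

lam, n = 20, 50_000                        # mean count per replicate, "genes"
rep1 = [rpois(lam) for _ in range(n)]      # technical replicate 1
rep2 = [rpois(lam) for _ in range(n)]      # technical replicate 2

summed = [a + b for a, b in zip(rep1, rep2)]
averaged = [(a + b) / 2 for a, b in zip(rep1, rep2)]

# Sum ~ Poisson(2*lam): variance roughly equals the mean (~40 here).
print(statistics.fmean(summed), statistics.pvariance(summed))
# Average has mean lam but variance ~lam/2 (~10): not Poisson any more.
print(statistics.fmean(averaged), statistics.pvariance(averaged))
```

The under-dispersed average no longer matches the count distributions that tools like DESeq2/edgeR model, which is why raw counts should be summed, not averaged.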

0
Entering edit mode

It has been moved

0
Entering edit mode

great, cheers!

5
Entering edit mode
5.4 years ago

Technical reproducibility in RNA-seq is considered to be excellent (provided that the same kit/lab/... is used). So yes, technical replicates can be combined. I think the best stage to do this is at the fastq or BAM level, although I can't think of any problems with just adding the read counts.

2
Entering edit mode

So who is closer to the truth?:)

2
Entering edit mode

You can concatenate the fastq files or add the counts, provided you check first for batch effects. As Wouter pointed out, technical reproducibility is generally pretty high for NGS datasets - until it isn't.

0
Entering edit mode

Wouter has more biostars points, I am not going against moderators :-P

1
Entering edit mode

You'd better not, before I suspend your account ;-)

But more seriously, I might very well be wrong as well! Let's wait for someone to break the tie.

2
Entering edit mode

Glad to see a constructive scientific discussion here :)

1
Entering edit mode
5.4 years ago
Macspider ★ 3.6k

My 2 cents: no, it's very bad practice to do that.

Detailed explanation:

Technical replicates are different RNA-seq runs performed on the same sample, and they have to be averaged. You are taking a snapshot of the transcriptional profile of your sample, and you're sequencing it twice to avoid biases and to reduce the chance of being misled by sequencing errors.

The same can be said for biological replicates, but in that case you are sequencing more than one sample, to prevent batch effects and sample-preparation errors from biasing your downstream analysis.

1
Entering edit mode

I asked the question because Google searching turned up two different opinions, and it seems to be the same here :)

0
Entering edit mode

There are two equally respectable opinions, the ones that we just brought up. I gave you a detailed explanation for my opinion, and I am truly convinced of what I say. But I am eager to hear the reasons for which you would add them up; perhaps I can change my mind if provided with enough evidence (ain't that what science is about?)

1
Entering edit mode

I guess it also depends on when you make technical replicates:

Are those the same cells from which you isolate twice, or the same RNA from which you created two libraries, or the same library sequenced in two runs, or even the same library sequenced in multiple lanes on the same sequencer?

Wouldn't adding up and averaging give the same result, provided that you normalize for total read count before doing the rest of your analysis? You just get bigger numbers - as if you had sequenced deeper.

1
Entering edit mode

But the amount of sequenced reads has to be connected to the amount of transcripts in your sample, which is ultimately what defines the concept of expression. To me, adding technical replicates logically means taking the transcripts of the same sample twice. I totally get your point; I'm more on a philosophical / theoretical dimension now.

0
Entering edit mode

Agreed, and I think what's important is not the sequencing depth itself or the numbers we get, but rather the proportionality between them, since in the end what we compare are relative expression values, which are basically proportions. I think it would not be wrong to sum up read counts that come from more or less similar library sizes (you catch lowly expressed genes - which are most of the genes - in the same proportions), but you would violate proportionality if you summed up very different library sizes.

0
Entering edit mode

But calculating the relative expression values of the average of two replicates is the same as calculating the relative expression values of the sum of two replicates...

In the hypothetical situation that one replicate is sequenced twice as deep as the other, we have the following read counts:

raw counts
geneA 2 4
geneB 3 6
geneC 5 10
total 10 20

Averaging means:
geneA 3
geneB 4.5
geneC 7.5
total 15

Proportional average means:
geneA 0.2
geneB 0.3
geneC 0.5

Sum means:
geneA 6
geneB 9
geneC 15
total 30

Proportional sum means:
geneA 0.2
geneB 0.3
geneC 0.5
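For what it's worth, the arithmetic above can be checked in a few lines (my own toy script, using the example counts):

```python
# Toy counts from the example: replicate 2 sequenced twice as deep.
rep1 = {"geneA": 2, "geneB": 3, "geneC": 5}
rep2 = {"geneA": 4, "geneB": 6, "geneC": 10}

def proportions(counts):
    """Counts divided by the library total."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

avg = {g: (rep1[g] + rep2[g]) / 2 for g in rep1}   # averaging
tot = {g: rep1[g] + rep2[g] for g in rep1}         # summing

print(proportions(avg))  # {'geneA': 0.2, 'geneB': 0.3, 'geneC': 0.5}
print(proportions(tot))  # identical: per-library scaling cancels out
```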

1
Entering edit mode

In your example, the gene counts and totals are exactly proportional to each other in both replicates. In a real experiment, that's not always the case - imagine you did a DE analysis with 10 mln reads and got 100 DE genes. If the proportionality were always the same, then in theory doing the same analysis with 20 mln reads would give you exactly the same 100 genes, and it wouldn't make sense to generate more read depth :) This is of course specific to genome size; for human, for example, you can have a look at this paper https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt688 - after a certain depth we don't see many changes, but with lower depth you can skew the proportions because of lowly expressed genes, and I think this is where the problem comes from. Please correct me if I am wrong; I would be glad to discuss.

1
Entering edit mode

You are very right - what my example didn't include is the variability due to sampling (a Poisson distribution).
Note that sequencing deeper will give you a better estimate of the true proportion of a transcript in the total mixture (since the variance of a Poisson distribution depends on the mean).

Indeed, sequencing deeper will influence the differential expression analysis - precisely because your abundance estimate gets more accurate with increased depth.
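A back-of-the-envelope check of that last point (my own sketch; the transcript fraction p is an assumed value): the standard error of a proportion estimated from n reads is sqrt(p(1-p)/n), so doubling the depth shrinks the relative error by a factor of sqrt(2).

```python
import math

p = 1e-5  # assumed true fraction of a rare transcript in the library

for n in (10_000_000, 20_000_000):
    sd = math.sqrt(p * (1 - p) / n)  # binomial sd of the estimated fraction
    print(n, sd / p)                 # relative error: ~0.10, then ~0.071
```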

0
Entering edit mode

I couldn't agree more. Basically, on the theoretical level I agree with @WouterDeCoster, but on the practical level I don't, because I saw what @grant.hovhannisyan said happen.

0
Entering edit mode

The technical replicates in my case would be the same library sequenced on the same machine but on different days. For our experiment we need to reach a depth of 20 mln reads for a given sample (biological replicate). For one of the samples we reached only 10 mln. Now we need to decide whether to make a new run and sequence 10 mln reads (this would be a technical replicate) and add this to the existing replicate, or to make a run with 20 mln reads and use only those data.

0
Entering edit mode

For our experiment we need to reach 20mln reads depth for a given sample

I think, if the funding allows you to do it, making a 20 mln reads run is better. When you send out the paper in a year, the reviewers could easily ask you 1) why you summed the read counts and 2) why you sequenced 10 mln twice instead of 20 mln once. I think it's better to be safe than sorry :)

0
Entering edit mode

This is exactly what I told to my boss:)

1
Entering edit mode

The new comment in this discussion just made me realize that we had this thread ~10 months ago, and now @grant and I are sitting next to each other in the same lab. XD

Science: it's a small world!

Apart from that, maybe it would be interesting to tell the readers what, in the end, you did with this experiment (I also don't know!) :)

1
Entering edit mode

Hehehe, indeed very small :) That particular experiment was quite complex, and in the first round of sequencing we tried to sequence as deep as possible. But eventually we still had to resequence some samples using the same libraries, so we didn't avoid technical replicates. However, in the end, instead of having 20 mln reads (as planned), we got around 30-35 mln in total and decided to use all those technical replicates :) From the data-analysis side, I first made sure that the technical replicates were similar by calculating correlations and making PCA plots (and they actually were very similar). Then I summed up the counts and performed the differential expression analysis with DESeq2 :)

1
Entering edit mode

Haha, that's awesome! What lab? Maybe we'll see each other there in 10 months. ;)

0
Entering edit mode

Comparative Genomics @ CRG, Barcelona

http://www.crg.eu/en/toni_gabaldon

0
Entering edit mode

Or at least I hope that in 10 months the project will be published and then I'll tell you if any reviewer has complained about the technical replicates :)

0
Entering edit mode

they have to be averaged.

If you average over the technical replicates you are losing sequencing depth, however. And sequencing depth is also important for the reliability of the analysis.

1
Entering edit mode

How are you losing on it?

• if you are sequencing the same sample twice in the same run, the two replicates should have the same sequencing depth
• if you are sequencing the same sample once again in a different run, you might have different sequencing depths for different runs according to your setup

Sequencing the same sample twice is not the same as sequencing it once but twice as deep. The rare transcripts are gonna pop out only in the second one, in my experience.

2
Entering edit mode

Sequencing the same sample twice is not the same as sequencing it once but twice as deep. The rare transcripts are gonna pop out only in the second one, in my experience.

That doesn't make sense to me, because you are sampling from a distribution of molecules, and the likelihood of "catching" low-abundance transcripts is directly proportional to the "amount of sampling" you perform, whether once with 20M molecules or twice with 10M molecules.

1
Entering edit mode

I think the problem lies in the NGS technology itself. Imagine I sequence 10 mln reads and for a given gene I get 1 read of depth. Now, if you sequence 5 mln reads, will you get 0 reads for that gene, and if you sequence 20 mln, will you get 2 reads?

0
Entering edit mode

It's all probabilistic: let's say every single read has a 1/10 mln chance of being a read of the gene we're talking about. Then the likelihood of catching at least one read of the gene is 1 - (9 999 999 / 10 000 000)^5 000 000 = 0.3934694 for 5M reads, 1 - (9 999 999 / 10 000 000)^10 000 000 = 0.6321206 for 10M reads, and 1 - (9 999 999 / 10 000 000)^20 000 000 = 0.8646647 for 20M. I think Wouter was thinking this way, and at first I tended to agree with him.

But, this kind of probabilistic maths assumes that the probabilities don't change as a function of time during the sequencing, which might be false depending on the technology or the library construction method. I'm not sure how this might be modeled or conceived on the level of cells, molecules and ultimately the technology, but there might be some Monty Hall-problemesque weirdness going on. Maybe an exhaustion phenomenon with the sample in question. Anyway, I'm just a beginning bioinformatician so this was an interesting discussion to follow.