Question: Technical replicates in RNAseq
3
14 months ago by grant.hovhannisyan1.2k wrote:

Hi Biostars,

Is it legitimate to sum up raw read counts from technical replicates of an RNA-seq experiment and use these summed counts for differential expression (DE) analysis? I would appreciate detailed and justified answers.

Thanks

rna-seq technical_replicates • 1.7k views
modified 14 months ago by Macspider2.6k • written 14 months ago by grant.hovhannisyan1.2k
2
14 months ago by WouterDeCoster34k (Belgium) wrote:

Technical reproducibility in RNA-seq is considered to be excellent (provided that the same kit/lab/... is used). So yes, technical replicates can be combined. I think the best stage to do this is at the fastq or bam level, although I can't think of problems with just adding the read counts.
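For the count-level route, summing is just a per-gene addition across replicates. A minimal sketch with made-up gene names and counts (everything here is hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical per-gene raw counts from two technical replicates of one sample
rep1 = {"geneA": 2, "geneB": 3, "geneC": 5}
rep2 = {"geneA": 4, "geneB": 6, "geneC": 10}

# Sum the raw counts gene by gene; the combined column is what
# you would feed into a count-based DE tool
combined = Counter(rep1) + Counter(rep2)
print(dict(combined))  # {'geneA': 6, 'geneB': 9, 'geneC': 15}
```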

written 14 months ago by WouterDeCoster34k
2

So who is closer to the truth? :)

written 14 months ago by grant.hovhannisyan1.2k
2

You can concatenate the fastq files or add the counts, provided you check first for batch effects. As Wouter pointed out, technical reproducibility is generally pretty high for NGS datasets - until it isn't.

written 14 months ago by h.mon21k

Wouter has more biostars points, I am not going against moderators :-P

written 14 months ago by Macspider2.6k
1

You'd better not, before I suspend your account ;-)

But more seriously, I might very well be wrong as well! Let's wait for someone to break the tie.

written 14 months ago by WouterDeCoster34k
2

Glad to see a constructive scientific discussion here :)

written 14 months ago by grant.hovhannisyan1.2k
0
14 months ago by Macspider2.6k (Vienna - BOKU) wrote:

My 2 cents: no, it's very bad practice to do that.

Detailed explanation:

Technical replicates are different RNA-seq runs performed on the same sample, so they have to be averaged. You are taking a snapshot of the transcriptional profile of your sample, and you're sequencing it twice to avoid biases and to reduce the chance of being misled by sequencing errors.

The same can be said for biological replicates, but in that case you are sequencing more than one sample, to prevent batch effects and sample preparation errors from biasing your downstream analysis.

modified 14 months ago • written 14 months ago by Macspider2.6k
1

I asked the question because Google searching turned up two different opinions, and it seems to be the same here :)

written 14 months ago by grant.hovhannisyan1.2k

There are two equally respectable opinions, the ones we just brought up. I gave you a detailed explanation for my opinion, and I am truly convinced of what I say. But I am eager to hear the reasons why you would add them up; perhaps I can change my mind if provided with enough evidence (ain't that what science is about?)

written 14 months ago by Macspider2.6k
1

I guess it also depends on when you make technical replicates:

Are those the same cells from which you isolate twice, or the same RNA from which you created two libraries, or the same library sequenced in two runs, or even the same library sequenced in multiple lanes on the same sequencer?

Wouldn't adding up and averaging give the same result, provided that you normalize for total read count before doing the rest of your analysis? You just get bigger numbers - as if you had sequenced deeper.

written 14 months ago by WouterDeCoster34k
1

But the amount of sequenced reads has to be connected to the amount of transcripts in your sample, which is ultimately what defines the concept of expression. To me, adding technical replicates means logically taking the transcripts of the same sample twice. I totally get your point; I'm more on a philosophical / theoretical dimension now.

written 14 months ago by Macspider2.6k

Agreed, and I think what matters is not the sequencing depth itself or the numbers we get, but rather the proportionality between them, since in the end what we compare are relative expression values, which are basically proportions. I think it would not be wrong to sum up read counts that come from more or less similar library sizes (you catch the lowly expressed genes, which are most genes, in the same proportions), but you would violate proportionality if you summed up very different library sizes.

modified 14 months ago • written 14 months ago by grant.hovhannisyan1.2k

But calculating the relative expression values of the average of two replicates is the same as calculating the relative expression values of the sum of two replicates...

In the hypothetical situation that one replicate is sequenced twice as deep as the other, we have the following read counts:

raw counts (replicate 1, replicate 2):
geneA 2 4
geneB 3 6
geneC 5 10
total 10 20

Averaging means:
geneA 3
geneB 4.5
geneC 7.5
total 15

Proportions of the average:
geneA 0.2
geneB 0.3
geneC 0.5

Sum means:
geneA 6
geneB 9
geneC 15
total 30

Proportions of the sum:
geneA 0.2
geneB 0.3
geneC 0.5
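The arithmetic in this example can be checked with a quick sketch: once you divide by the library total, the sum and the average of the two replicates give identical proportions (numbers taken from the example above):

```python
rep1 = {"geneA": 2, "geneB": 3, "geneC": 5}
rep2 = {"geneA": 4, "geneB": 6, "geneC": 10}

summed = {g: rep1[g] + rep2[g] for g in rep1}          # geneA 6, geneB 9, geneC 15
averaged = {g: (rep1[g] + rep2[g]) / 2 for g in rep1}  # geneA 3, geneB 4.5, geneC 7.5

total_sum = sum(summed.values())    # 30
total_avg = sum(averaged.values())  # 15

prop_sum = {g: summed[g] / total_sum for g in summed}
prop_avg = {g: averaged[g] / total_avg for g in averaged}

print(prop_sum)              # {'geneA': 0.2, 'geneB': 0.3, 'geneC': 0.5}
print(prop_sum == prop_avg)  # True
```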
written 14 months ago by WouterDeCoster34k
1

In your example, the gene counts and totals are exactly proportional to each other in both replicates. In a real experiment that's not always the case: imagine you did a DE analysis with 10 mln reads and got 100 DE genes. If the proportionality were always the same, then in theory the same analysis with 20 mln reads would give exactly the same 100 genes, and it wouldn't make sense to generate more read depth :) This is of course specific to genome size; for human, for example, you can have a look at this paper https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt688 - after a certain depth we don't see many changes, but with lower depth you can skew the proportions because of lowly expressed genes, and I think this is where the problem comes from. Please correct me if I am wrong, I would be glad to discuss.

written 14 months ago by grant.hovhannisyan1.2k
1

You are very right - what my example didn't include is variability due to sampling (a Poisson distribution).
Note that sequencing deeper will give you a better estimate of the true proportions of a transcript in the total mixture (since the variance of a Poisson distribution depends on the mean).

Indeed, sequencing deeper will influence the differential expression analysis - exactly because your abundance estimate is getting more accurate with increased depth.
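The Poisson point can be illustrated numerically: for a transcript with a fixed true proportion, the expected count grows with depth, while the relative noise (the coefficient of variation, which is 1/sqrt(mean) for a Poisson count) shrinks. A sketch with an assumed proportion of 1e-5:

```python
import math

p = 1e-5  # assumed true proportion of the transcript in the library

for depth in (5_000_000, 10_000_000, 20_000_000):
    mean = depth * p          # expected count; Poisson variance equals the mean
    cv = 1 / math.sqrt(mean)  # relative uncertainty of the count estimate
    print(f"{depth:>10} reads: expected count {mean:6.0f}, CV {cv:.3f}")
```

Doubling the depth halves the variance of the count, so the abundance estimate keeps getting more accurate.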

written 14 months ago by WouterDeCoster34k

I couldn't agree more. Basically, on the theoretical level I agree with @WouterDeCoster, but on the practical level I don't, because I have seen what @grant.hovhannisyan described happen.

written 14 months ago by Macspider2.6k

In my case the technical replicates would be the same library sequenced on the same machine but on different days. For our experiment we need to reach a depth of 20 mln reads for a given sample (biological replicate). For one of the samples we reached only 10 mln, and now we need to decide whether to make a new run of 10 mln reads (this would be a technical replicate) and add it to the existing data, or make a run of 20 mln reads and use only those data.

written 14 months ago by grant.hovhannisyan1.2k

For our experiment we need to reach 20mln reads depth for a given sample

I think, if the funding allows it, making a 20 million read run is better. When you send off the paper in a year, the reviewers could easily ask 1) why you summed the read counts, and 2) why you sequenced 10 mln twice instead of 20 mln once. I think it's better to be safe than sorry :)

written 14 months ago by Macspider2.6k

This is exactly what I told my boss :)

modified 14 months ago • written 14 months ago by grant.hovhannisyan1.2k
1

The new comment in this discussion just made me realize that we had this thread ~10 months ago, and now @grant and I are sitting next to each other in the same lab. XD

Science: it's a small world!

Apart from that, maybe it would be interesting to tell the readers what you did with this experiment in the end (I also don't know!) :)

written 3 months ago by Macspider2.6k
1

Hehehe, indeed very small :) That particular experiment was quite complex, and in the first round of sequencing we tried to sequence as deep as possible. But eventually we still had to resequence some samples using the same libraries, so we didn't avoid technical replicates. In the end, instead of having 20 mln reads (as planned), we got around 30-35 mln in total and decided to use all those technical replicates :) From the data analysis point of view, I first made sure that the technical replicates were similar by calculating correlations and making PCA plots (and they actually were very similar). Then I summed up the counts and performed the differential expression analysis with DESeq2 :)
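That check-then-sum workflow can be sketched in a few lines; the counts and the correlation cutoff below are made up for illustration (a real analysis would check the full count matrix, plus PCA, and use DESeq2 downstream):

```python
import math

# Toy raw counts over six genes for two technical replicates of one library
rep1 = [120, 35, 0, 560, 88, 12]
rep2 = [180, 50, 1, 840, 130, 20]  # same library, roughly 1.5x deeper

def pearson(x, y):
    """Plain Pearson correlation, stdlib only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(rep1, rep2)
print(f"replicate correlation: {r:.4f}")

# Only sum the replicates if they really look like the same sample
if r > 0.99:  # arbitrary cutoff for this sketch
    combined = [a + b for a, b in zip(rep1, rep2)]
    print(combined)  # the column that would go into the count matrix
```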

written 3 months ago by grant.hovhannisyan1.2k
1

Haha, that's awesome! What lab? Maybe we'll see each other there in 10 months. ;)

written 3 months ago by anstonius10

Comparative Genomics @ CRG, Barcelona

http://www.crg.eu/en/toni_gabaldon

modified 3 months ago • written 3 months ago by Macspider2.6k

Or at least I hope that in 10 months the project will be published and then I'll tell you if any reviewer has complained about the technical replicates :)

written 3 months ago by grant.hovhannisyan1.2k

they have to be averaged.

If you average over the technical replicates you are losing sequencing depth, however. And sequencing depth is also important for the reliability of the analysis.

modified 14 months ago • written 14 months ago by h.mon21k
1

How are you losing on it?

  • if you are sequencing the same sample twice in the same run, the two replicates should have the same sequencing depth
  • if you are sequencing the same sample once again in a different run, you might have different sequencing depths for different runs according to your setup

Sequencing the same sample twice is not the same as sequencing it once at twice the depth. In my experience, the rare transcripts are going to pop out only in the second case.

written 14 months ago by Macspider2.6k
1

Sequencing the same sample twice is not the same as sequencing it once at twice the depth. In my experience, the rare transcripts are going to pop out only in the second case.

That doesn't make sense to me, because you are sampling from a distribution of molecules and the likelihood of "catching" low-abundance transcripts is directly proportional to the amount of sampling you perform, whether 20M molecules once or 10M molecules twice.

written 14 months ago by WouterDeCoster34k
1

I think the problem lies in the NGS technology itself. Imagine I sequence 10 mln reads and get 1 read of depth for a given gene. If you sequence 5 mln reads, will you get 0 reads for that gene, and if you sequence 20 mln, will you get 2 reads?

modified 14 months ago • written 14 months ago by grant.hovhannisyan1.2k

It's all probabilistic. Let's say that, for every single read, you have a 1/10 mln chance of it coming from the gene we're talking about. Then the likelihood of catching at least one read of the gene is 1 - (9 999 999 / 10 000 000)^5 000 000 = 0.3934694 for 5M reads, 1 - (9 999 999 / 10 000 000)^10 000 000 = 0.6321206 for 10M reads, and 1 - (9 999 999 / 10 000 000)^20 000 000 = 0.8646647 for 20M. I think Wouter was thinking this way, and at first I tended to agree with him.
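These numbers are easy to verify with a quick sketch (same assumption of a 1-in-10-mln chance per read, reads treated as independent draws):

```python
p = 1 / 10_000_000  # assumed chance that any single read comes from the gene

for n_reads in (5_000_000, 10_000_000, 20_000_000):
    # Probability of seeing the gene at least once in n_reads draws
    p_detect = 1 - (1 - p) ** n_reads
    print(f"{n_reads:>10} reads -> detection probability {p_detect:.7f}")
```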

But this kind of probabilistic math assumes that the probabilities don't change as a function of time during the sequencing, which might be false depending on the technology or the library construction method. I'm not sure how this might be modeled or conceived at the level of cells, molecules and ultimately the technology, but there might be some Monty Hall-esque weirdness going on - maybe an exhaustion phenomenon in the sample in question. Anyway, I'm just a beginning bioinformatician, so this was an interesting discussion to follow.

modified 3 months ago • written 3 months ago by anstonius10
Powered by Biostar version 2.3.0