Question: Technical replicates in RNAseq
grant.hovhannisyan300 wrote, 12 weeks ago:

Hi Biostars,

Is it legitimate to sum up raw read counts from technical replicates of an RNA-seq experiment and use these summed counts for differential expression (DE) analysis? I would appreciate detailed and justified answers.

Thanks

modified 12 weeks ago by Macspider1.7k • written 12 weeks ago by grant.hovhannisyan300

WouterDeCoster24k (Belgium) wrote, 12 weeks ago:

Technical reproducibility in RNA-seq is considered to be excellent (provided that the same kit/lab/... is used). So yes, technical replicates can be combined. I think the best stage to do this is at the fastq or bam level, although I can't think of any problems with simply adding the read counts.
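
If one does combine at the count level, the operation is just a per-gene sum. A minimal sketch (the gene names and count values are made up for illustration, not from any real dataset):

```python
# Sketch: collapsing technical replicates of one sample by summing raw per-gene counts.
# Gene names and count values are hypothetical.
from collections import Counter

def sum_technical_replicates(*replicate_counts):
    """Sum per-gene raw read counts across technical replicates of a single sample."""
    total = Counter()
    for counts in replicate_counts:
        total.update(counts)  # Counter.update adds counts gene by gene
    return dict(total)

run1 = {"geneA": 10, "geneB": 25, "geneC": 0}
run2 = {"geneA": 12, "geneB": 30, "geneC": 1}

combined = sum_technical_replicates(run1, run2)
print(combined)  # {'geneA': 22, 'geneB': 55, 'geneC': 1}
```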

written 12 weeks ago by WouterDeCoster24k

So who is closer to the truth? :)

written 12 weeks ago by grant.hovhannisyan300

You can concatenate the fastq files or add the counts, provided you check first for batch effects. As Wouter pointed out, technical reproducibility is generally pretty high for NGS datasets - until it isn't.

written 12 weeks ago by h.mon9.8k

Wouter has more Biostars points; I am not going against moderators :-P

written 12 weeks ago by Macspider1.7k

You'd better not, before I suspend your account ;-)

But more seriously, I might very well be wrong as well! Let's wait for someone to break the tie.

written 12 weeks ago by WouterDeCoster24k

Glad to see a constructive scientific discussion here :)

written 12 weeks ago by grant.hovhannisyan300
Macspider1.7k (Vienna - BOKU) wrote, 12 weeks ago:

My 2 cents: no, it's very bad practice to do that.

Detailed explanation:

Technical replicates are different RNA-seq runs performed on the same sample, and they have to be averaged. You are taking a snapshot of the transcriptional profile of your sample, and you're sequencing it twice to avoid biases and to reduce the chance of being misled by sequencing errors.

The same can be said for biological replicates, but in that case you are sequencing more than one sample, so that batch effects and sample preparation errors don't bias your downstream analysis.

modified 12 weeks ago • written 12 weeks ago by Macspider1.7k

I asked the question because a Google search turned up two different opinions, and it seems to be the same here :)

written 12 weeks ago by grant.hovhannisyan300

There are two equally respectable opinions, the ones that we just brought up. I gave you a detailed explanation for my opinion, and I am truly convinced of what I say. But I am eager to hear the reasons why you would add them; perhaps I can change my mind if provided with enough evidence (ain't that what science is about?).

written 12 weeks ago by Macspider1.7k

I guess it also depends on when you make technical replicates:

Are those the same cells from which you isolate RNA twice, or the same RNA from which you created two libraries, or the same library sequenced in two runs, or even the same library sequenced in multiple lanes on the same sequencer?

Wouldn't adding up and averaging result in the same thing, provided that you normalize for total read count before doing the rest of your analysis? You just get bigger numbers - as if you had sequenced deeper.

written 12 weeks ago by WouterDeCoster24k

But the amount of sequenced reads has to be connected to the amount of transcripts in your sample, which is ultimately what defines the concept of expression. To me, adding technical replicates logically means counting the transcripts of the same sample twice. I totally get your point; I'm more on a philosophical / theoretical dimension now.

written 12 weeks ago by Macspider1.7k

Agreed, and I think what is important is not the sequencing depth itself and the numbers we get, but rather the proportionality between them, since in the end what we compare are relative expression values, which are basically proportions. I think it would not be wrong to sum up read counts that come from more or less similar library sizes (you catch lowly expressed genes, which make up most of the genes, in the same proportions), but you would violate proportionality if you sum up very different library sizes.

modified 12 weeks ago • written 12 weeks ago by grant.hovhannisyan300

But calculating the relative expression values of the average of two replicates is the same as calculating the relative expression values of the sum of two replicates...

In the hypothetical situation that one replicate is sequenced twice as deep as the other, we have the following read counts:

Raw counts:
geneA  2   4
geneB  3   6
geneC  5  10
total 10  20

Averaging gives:
geneA  3
geneB  4.5
geneC  7.5
total 15

Proportions of the average:
geneA 0.2
geneB 0.3
geneC 0.5

Summing gives:
geneA  6
geneB  9
geneC 15
total 30

Proportions of the sum:
geneA 0.2
geneB 0.3
geneC 0.5
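
The arithmetic above can be checked with a short script (same toy counts as in the example):

```python
# Verify: proportions of the average of two replicates equal proportions of their sum.
rep1 = {"geneA": 2, "geneB": 3, "geneC": 5}
rep2 = {"geneA": 4, "geneB": 6, "geneC": 10}

average = {g: (rep1[g] + rep2[g]) / 2 for g in rep1}
summed = {g: rep1[g] + rep2[g] for g in rep1}

# Normalise each to proportions of the library total.
prop_avg = {g: v / sum(average.values()) for g, v in average.items()}
prop_sum = {g: v / sum(summed.values()) for g, v in summed.items()}

print(prop_avg)  # {'geneA': 0.2, 'geneB': 0.3, 'geneC': 0.5}
print(prop_sum)  # {'geneA': 0.2, 'geneB': 0.3, 'geneC': 0.5}
```
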
written 12 weeks ago by WouterDeCoster24k

In your example, gene counts and totals are exactly proportional to each other in both replicates. In a real experiment that's not always the case - imagine you did a DE analysis with 10 million reads and got 100 DE genes. If the proportionality were always exactly the same, then in theory doing the same analysis with 20 million reads would give you exactly the same 100 genes, and it wouldn't make sense to generate more read depth :) This is of course specific to genome size; for human, for example, you can have a look at this paper https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt688 - after a certain depth we don't see many changes, but with lower depth you can skew proportions because of lowly expressed genes, and I think this is where the problem comes from. Please correct me if I am wrong; I would be glad to discuss.

written 12 weeks ago by grant.hovhannisyan300

You are very right - what my example didn't include is variability due to sampling (a Poisson distribution).
Note that sequencing deeper will give you a better estimate of the true proportions of a transcript in the total mixture (since the variance of a Poisson distribution depends on the mean).

Indeed, sequencing deeper will influence the differential expression analysis - exactly because your abundance estimate is getting more accurate with increased depth.
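
That point can be illustrated with a small simulation (the true proportion and the depths are arbitrary toy values, scaled down so it runs quickly; not real library sizes):

```python
# Sketch: the spread of the estimated proportion of a rare transcript shrinks as
# sequencing depth grows, because the sampling noise of counts scales with the mean.
import random

random.seed(1)

def spread_of_estimate(true_prop, depth, trials=400):
    """Std dev of the estimated transcript proportion over repeated simulated runs."""
    estimates = []
    for _ in range(trials):
        # reads hitting the transcript ~ Binomial(depth, true_prop), ≈ Poisson here
        hits = sum(1 for _ in range(depth) if random.random() < true_prop)
        estimates.append(hits / depth)
    mean = sum(estimates) / trials
    return (sum((e - mean) ** 2 for e in estimates) / trials) ** 0.5

shallow = spread_of_estimate(true_prop=0.01, depth=500)
deep = spread_of_estimate(true_prop=0.01, depth=2000)
print(shallow > deep)  # the deeper run gives a tighter estimate of the proportion
```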

written 12 weeks ago by WouterDeCoster24k

I couldn't agree more. Basically, on the theoretical level I agree with @WouterDeCoster, but on the practical level I don't, because I have seen what @grant.hovhannisyan described happen.

written 12 weeks ago by Macspider1.7k

Technical replicates in my case would be the same library sequenced on the same machine but on different days. For our experiment we need to reach a depth of 20 million reads for a given sample (biological replicate). For one of the samples we reached only 10 million. Now we need to decide whether to make a new run, sequence 10 million more reads (this would be a technical replicate) and add them to the existing run, or to make a new run with 20 million reads and use only those data.

written 12 weeks ago by grant.hovhannisyan300

For our experiment we need to reach a depth of 20 million reads for a given sample

I think, if the funding allows it, making a 20 million read run is better. When you submit the paper in a year, the reviewers could easily ask you 1) why did you sum the read counts, and 2) why did you sequence 10 million twice instead of 20 million once. I think it's better to be safe than sorry :)

written 12 weeks ago by Macspider1.7k

This is exactly what I told my boss :)

modified 12 weeks ago • written 12 weeks ago by grant.hovhannisyan300

...indeed, as is always the case in bioinformatics and research! :)

The reason why there are different opinions, I believe, is this: in the past I have both merged technical replicates (both averaging counts and summing counts) and analysed them separately, and saw no difference in the main results.

Technical replicates are useful to analyse separately if you want to see how well they align in a PCA, but I see no issue with merging them together and averaging counts.

written 12 weeks ago by Kevin Blighe9.3k

they have to be averaged.

If you average over the technical replicates you are losing sequencing depth, however. And sequencing depth is also important for the reliability of the analysis.

modified 12 weeks ago • written 12 weeks ago by h.mon9.8k

How are you losing it?

  • if you are sequencing the same sample twice in the same run, the two replicates should have the same sequencing depth
  • if you are sequencing the same sample once again in a different run, you might have different sequencing depths for different runs according to your setup

Sequencing the same sample twice is not the same as sequencing it once but twice as deep. The rare transcripts are going to pop up only in the latter, in my experience.

written 12 weeks ago by Macspider1.7k

Sequencing the same sample twice is not the same as sequencing it once but twice as deep. The rare transcripts are going to pop up only in the latter, in my experience.

That doesn't make sense to me, because you are sampling from a distribution of molecules, and the likelihood of "catching" low-abundance transcripts is directly proportional to the amount of sampling you perform, whether 20M molecules once or 10M molecules twice.

written 12 weeks ago by WouterDeCoster24k

I think the problem lies in the NGS technology itself. Imagine I sequence 10 million reads and for a given gene I get 1 read of coverage. Now, if you sequence 5 million reads, will you get 0 reads for that gene, and if you sequence 20 million, will you get 2 reads?
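
Under a simple Poisson model (an idealisation, since real libraries have extra technical variance), the answer is "only on average": a gene expected to give 1 read at 10 million reads has an expected count of 0.5 at 5 million and 2 at 20 million, but any single run can still return 0 reads:

```python
# Probability of seeing zero reads for a rare transcript, modelled as Poisson(mean),
# where the mean count scales linearly with sequencing depth.
import math

def prob_zero_reads(expected_count):
    """P(X = 0) for X ~ Poisson(expected_count)."""
    return math.exp(-expected_count)

for depth, mean in [("5M", 0.5), ("10M", 1.0), ("20M", 2.0)]:
    print(f"{depth} reads, expected count {mean}: P(0 reads) = {prob_zero_reads(mean):.2f}")
# 5M: 0.61, 10M: 0.37, 20M: 0.14 - detection of rare transcripts is probabilistic.
```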

modified 12 weeks ago • written 12 weeks ago by grant.hovhannisyan300
Powered by Biostar version 2.3.0