How to show convincing replication in RNAseq
6.9 years ago
wonseongsik ▴ 50

Hi all,

I have what may be replicates for three samples, from two RNAseq runs: I performed one RNAseq run first and a second one about 3 months later. I'm not sure whether I can call these biological replicates, but they probably are.

The RNA was prepared from the same cell line in both runs, so the source of the RNA is the same. Each sample was kept under a different condition, as shown below.

RNAseq run       Condition 1     Condition 2     Condition 3
First RNAseq     RNA-Sample1     RNA-Sample2     RNA-Sample3
Second RNAseq    RNA-Sample1     RNA-Sample2     RNA-Sample3

The Pearson correlation coefficients between the two runs are 0.92 (0.80), 0.98 (0.96), and 0.98 (0.96) for samples 1, 2, and 3, respectively, computed on expected counts (values in parentheses are computed on TPM).

In theory, two samples under the same condition are identical, so their expression profiles should correlate with a coefficient of almost 1. I do understand that there is technical variance, though.
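As a minimal sketch of how such a coefficient is computed, and of why the scale of the values matters, the snippet below calculates Pearson's r on raw values and on log2(value + 1). The six counts per run are hypothetical numbers chosen for illustration, not data from this experiment. On the raw scale a single highly expressed gene dominates the coefficient, which is one reason coefficients computed on counts and on TPM can differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts for six genes in two runs of the same condition.
run1 = [5, 40, 12, 300, 80, 100000]
run2 = [30, 8, 55, 90, 400, 98000]

# Raw scale: the one very highly expressed gene dominates the result.
raw_r = pearson(run1, run2)

# Log scale: low and moderately expressed genes get comparable weight.
log_r = pearson([math.log2(v + 1) for v in run1],
                [math.log2(v + 1) for v in run2])

print(f"raw counts:      r = {raw_r:.4f}")
print(f"log2(count + 1): r = {log_r:.4f}")
```

With these made-up numbers the raw coefficient is close to 1 even though the five lower-expressed genes disagree substantially between runs; the log-scale coefficient is noticeably lower.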

Based on this, I think I can use the replicate data for samples 2 and 3, but I'm not sure about sample 1.

Given the coefficients for samples 2 and 3, I think technical variance had little effect between the two RNAseq runs. If my reasoning is wrong, please point it out.

Is it OK to treat sample 1 as replicated and use the data for further analysis with that coefficient, or do I have to discard it?

Also, is there any other reliable or relevant method to check replication?

I'm not entirely new to this, but I don't have much knowledge or experience in this field. Looking forward to your comments and advice.

Thanks, SS

RNA-Seq • 2.7k views

You need to find out the source of the RNA. Whether these are biological or technical replicates is a very important question, and the answer depends on the source material. Where did each of the six RNA samples come from? Your collaborators will know whether they came from the same cells or from different cells. Pearson correlation is irrelevant to this question.


Hi karl.stamn

Thank you for the reply and comments.

As you can see in the table above, these are biological replicates.

RNA was prepared from samples 1, 2, and 3, and each sample was treated under a different condition: sample 1 under condition 1, sample 2 under condition 2, and sample 3 under condition 3. The prepared RNA was used for the first RNAseq.

Three months later, I did exactly the same thing. So, in theory, the data from sample 1 of the first RNAseq should be the same as the data from sample 1 of the second RNAseq.

Could you let me know why Pearson correlation is irrelevant here?

Is it because some kind of variation might cause significantly different read counts between replicates?

If so, would Spearman correlation describe the relationship between replicates better?

Do I have to normalize the expected counts before computing the Spearman correlation? Would standardization alone work instead of normalization?

Thanks, SS


I computed the Spearman correlation and got 0.95 (0.95) for sample 1, 0.97 (0.96) for sample 2, and 0.98 (0.97) for sample 3 (expected counts, with TPM in parentheses), similar to the Pearson results.

I expected the Spearman correlation to give higher coefficients because it removes the variance caused by differences in read counts.
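On the normalization question, a small sketch may help: Spearman's coefficient is Pearson's coefficient computed on ranks, so rescaling a whole sample by one factor, or log-transforming it, does not change the result. The counts and the 1.7 scaling factor below are hypothetical. Note that TPM also divides each gene by its length, which can reorder genes within a sample, so coefficients from expected counts and from TPM need not match exactly.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical counts for six genes in two runs of the same condition.
run1 = [5, 40, 12, 300, 80, 100000]
run2 = [30, 8, 55, 90, 400, 98000]

rho_raw = spearman(run1, run2)
# Hypothetical sequencing-depth factor applied to every gene in run2.
rho_scaled = spearman(run1, [v * 1.7 for v in run2])
rho_log = spearman([math.log2(v + 1) for v in run1],
                   [math.log2(v + 1) for v in run2])

print(rho_raw, rho_scaled, rho_log)  # all three agree up to rounding
```

Because only the ordering of genes within each sample enters the calculation, a per-sample depth correction is not needed for Spearman; it neither raises nor lowers the coefficient.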

The unanswered question is whether a coefficient of 0.95 is enough to show convincing replication, meaning that the second RNAseq counts match the first closely enough that minor differences in the overall gene expression profile can be ignored.

Please comment and point out anything I have missed or misunderstood.

Thanks, SS

6.9 years ago

First of all, it is not really clear what you are aiming for with the correlation measures. In general, all samples from the same cell line will come out as highly correlated even when gene expression changes. This is because the vast majority of genes are not expected to change their expression levels, so these unchanging values dominate the correlation coefficient.
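This point can be illustrated with a small simulation. Everything below is an assumption made for the sketch: 2000 genes, log2 expression drawn from a normal distribution, a measurement noise of 0.3 log2 units, and 5% of genes changing 4-fold between two conditions. Even though the two samples come from different conditions, the correlation between them stays high.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_genes = 2000                  # hypothetical number of genes

# 5% of genes truly change between the two conditions.
changed = set(rng.sample(range(n_genes), n_genes // 20))

# Hypothetical baseline log2 expression per gene.
base = [rng.gauss(7.0, 3.0) for _ in range(n_genes)]

# Condition A: baseline plus measurement noise.
cond_a = [b + rng.gauss(0.0, 0.3) for b in base]

# Condition B: same, but changed genes move 4-fold (2 log2 units) up or down.
cond_b = [
    b + rng.gauss(0.0, 0.3) + (rng.choice([-2.0, 2.0]) if g in changed else 0.0)
    for g, b in enumerate(base)
]

r_between = pearson(cond_a, cond_b)
print(f"{len(changed)} of {n_genes} genes changed 4-fold")
print(f"correlation between conditions: r = {r_between:.3f}")
```

A high coefficient therefore says little about whether two samples are replicates: samples from different conditions can reach similar values.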

When it comes to defining "replication" it all depends on what you are trying to prove.

You may group your samples any way you want - replication simply means that you are expecting the samples to behave identically. You don't need to "prove" beforehand that these do behave identically. The goal of replication is to allow you to find those changes that characterize the difference between the groups and ignore the changes within the replicates.

It is the job of the statistical method to infer which variations are valid. If your samples are not actually replicated then (hopefully) the methods will state that no changes can be detected.


Hi Istvan,

Thank you for your comments and kind explanation.

I was going to say the replicates are faithful if the coefficient is high, like 0.97.

Now that I have read your comments, I see that correlation analysis is not useful for this purpose.

However, I'm still a little concerned about technical variation.

The first RNAseq was performed on a slightly different platform (not totally different; as far as I know, a different version of Illumina, probably HiSeq 2500 for the first RNAseq and HiSeq 300 for the second), so some kind of technical variation could strongly affect read counts during sequencing, even assuming there is no biological variation.

This is why I thought I needed to do something to prove that the replicates are identical.

I happened to find a paper about SERE (simple error ratio estimate). If you know this paper, do you think SERE would be a good way to show faithful replication?
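For readers unfamiliar with SERE, here is a rough sketch of the statistic as I understand it; it is not taken from the paper's code and should be checked against the publication before use. The idea is to compare the observed spread of each gene's raw counts across samples with the spread expected from Poisson sampling alone, given each sample's total read count. On this reading, a value of 0 means the samples are exact duplicates, a value near 1 means the variation is consistent with Poisson sampling only, and values above 1 indicate extra variation. The function name `sere` and the example counts are hypothetical.

```python
import math

def sere(counts):
    """Sketch of a SERE-style ratio.

    counts: list of per-gene rows, each row holding one raw count per sample.
    """
    n_samples = len(counts[0])
    sample_totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    grand_total = sum(sample_totals)

    accumulated = 0.0
    n_genes = 0
    for row in counts:
        gene_total = sum(row)
        if gene_total == 0:
            continue  # genes with no reads in any sample carry no information
        n_genes += 1
        for j, observed in enumerate(row):
            # Count expected if reads were split in proportion to library size.
            expected = gene_total * sample_totals[j] / grand_total
            accumulated += (observed - expected) ** 2 / expected

    return math.sqrt(accumulated / (n_genes * (n_samples - 1)))

# Hypothetical raw counts: three genes, two samples.
duplicates = [[10, 10], [200, 200], [35, 35]]
different = [[10, 30], [200, 100], [35, 60]]

print(sere(duplicates))  # identical samples give 0
print(sere(different))
```

The statistic works on raw counts rather than TPM, since the Poisson expectation depends on the actual number of reads.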

Thank you again for your comments and advice. I learned a lot. SS


As I mentioned before, you never need to prove that the replicates are "similar" beforehand.

Replication is used to identify which genes are unrelated to the condition that you are testing for. There could be many genes that vary quite a bit even within replicates. The purpose of replication is to find which genes vary only for the tested condition.

If your replicates are not actual replicates then you won't find differentially expressed genes. That's all.
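A toy example of this logic, using made-up log2 expression values for a single gene with two replicates per condition. Welch's t statistic stands in here for the statistical test; real RNAseq tools typically fit count-based models instead, so treat this only as an illustration of how within-group spread controls what can be detected.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_b) - statistics.mean(group_a)) / standard_error

# Hypothetical log2 expression of one gene, two replicates per condition.
tight_a, tight_b = [10.0, 10.2], [12.0, 12.1]   # replicates agree with each other
noisy_a, noisy_b = [10.0, 12.5], [12.0, 9.8]    # replicates disagree

t_tight = welch_t(tight_a, tight_b)
t_noisy = welch_t(noisy_a, noisy_b)

print(f"consistent replicates:   t = {t_tight:.2f}")
print(f"inconsistent replicates: t = {t_noisy:.2f}")
```

When the replicates agree, the difference between conditions stands out clearly. When they disagree, the within-group spread is as large as the between-group difference and the statistic collapses toward zero, so the gene would not be called differentially expressed.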

Of course, in a research paper you will need to argue why your choice of replicates makes sense when it is not of the commonly used type. Unfortunately, since most reviewers don't fully understand the subject matter, you have to write this in a way that will be acceptable to them, which means computing and stating the correlations.


Understood... Thank you so much for your clear comments... I've got a lot of clear answers...

One simple thing, speaking of biological replication.

I understand that the negative binomial distribution (NBD) is better than the Poisson distribution for modeling count data such as RNAseq data when it comes to biological replication.

Here's a naive question: I'm just wondering whether I can use the binomial distribution instead of the NBD.
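For reference, the three distributions differ in how the variance relates to the mean. The symbols are introduced here for illustration: $\mu$ is the mean count of a gene, $\alpha \ge 0$ is a dispersion parameter, and $n$, $p$ are the binomial number of trials and success probability.

```latex
\begin{align*}
\text{Poisson:}            \quad & \operatorname{Var}(Y) = \mu \\
\text{Negative binomial:}  \quad & \operatorname{Var}(Y) = \mu + \alpha\,\mu^{2} \;\ge\; \mu \\
\text{Binomial } (\mu = np): \quad & \operatorname{Var}(Y) = np(1-p) = \mu\,(1-p) \;\le\; \mu
\end{align*}
```

Biological replicates usually vary more than Poisson sampling predicts, which the extra $\alpha\mu^{2}$ term of the negative binomial can absorb. The binomial variance can never exceed the mean, so it cannot represent that extra variation.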

Thanks, SS


The way this site works is that unrelated questions need to be asked as a new, separate question. This helps other users navigate the site and helps to bring in new expertise.


Oh~ sorry... I'll make a separate question... Thanks,
