Question: Analysis Without Replicates
0
gravatar for stevenmarkgalway
5 weeks ago by
stevenmarkgalway10 wrote:

I am carrying out an RNA-seq analysis on mesenchymal stem cells retrieved from disease and control state. However, the exp is without replicates (bad idea I know). The analysis is exploratory, to identify potential factors that can be validated in the lab. So far, I performed transcript quantification using salmon and summarised output to gene level using tximport. So now I have a counts matrix. Just to clarify my only real approach at this point is to rank genes based on a transformed or regularized LogFC?

rna-seq without replicates • 306 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by stevenmarkgalway10
2
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe35k
Republic of Ireland
Kevin Blighe35k wrote:

EdgeR claims to be able to derive p-values for single-replicate experiments. however, you would never be able to publish such results in any reputable journal (only one of those bogus journals that publish anything about anything). Obviously, 3 replicates is ideal; better with 4 or 5.

Things to try:

  • look at max and min expressed transcripts by normalised counts
  • look at fold-change ratios between the replicates (as you mentioned)

I trust that at least these have been processed in the same batch and in exactly the same way. The last thing that you want is a batch effect with a n=2!

Kevin

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe35k
1

Thanks for your feedback guys, much appreciated. Yeah, I have absolutely no replicates and yes they have been processed in the same batch. I did try duplicating the data for DESeq2 analysis, but no joy. I think going of estimated p-values for experiment without replicates is prob not a great idea anyway. So I am looking at the fold change for indication of significant difference. I found that when I rank based on log2fold change I have quite large log2fold change values up to as far as +/- 8 and 9, with many highly ranked genes having lower counts compared to lower ranked genes. The ranking seems look a lot better when I apply DESeq's rlog for a regularised logfold change. From here I am applying a 0.6 cut off to obtain genes that have a normal log fold change of 1.5 or greater. From this I have ~200 up-regulated genes. I plan to use signalP to pull out secreted proteins that may be validated in the lab. Also ill be using IPA for pathway analysis, causal network analysis and building context around genes of interest! I am guessing this is the best I can do with what I have.

ADD REPLYlink written 5 weeks ago by stevenmarkgalway10
2

You choose 0.6 presumably because log2(1.5) = 0.5849625 ? Did you try to apply the fold change shrinkage method of DESeq2, too? In any case, it seems like you know what you are doing.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe35k

That's exactly right Kevin, it's the Log2(1.5). After reading around I gathered some ideas for what I might be able to do in this situation. I guess being new to this kind of analysis there is a degree of uncertainty on my behalf. I just wanted to see what people thought. Thanks a lot for your input. I'll be sure to check out DESeq2s LFC shrinkage method.

ADD REPLYlink written 5 weeks ago by stevenmarkgalway10
1

In case the shrinkage does not give satisfying results and you have to use unshrunken FCs, I would suggest to use a decent read cutoff to avoid high fold changes due to small counts ("mean-variance relationship").

ADD REPLYlink written 5 weeks ago by ATpoint12k

It seems I can't run the LFC shrinkage without first running the DESeq function, which in-turn requires replicates - "Treating samples as replicates was deprecated in v1.20" ~ output from DESeq2. So I will stick with the regularised log fold change. I was considering a high read count cutoff, but I absolutely have no idea how high or low I should draw the cutoff. I really hate asking for advice on choosing arbitrary cutoffs. However, I was thinking of summing up the rows across the two samples and applying a cutoff of 50. So genes that have a summed read count greater than 50 across the two samples would be retained for downstream analysis. I seen this approach in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934518/ . Would a summed read count cutoff of 50 be "decent"? I really am in the dark with this one.

ADD REPLYlink written 5 weeks ago by stevenmarkgalway10
1

There is no standard cut-off at all. All that you need to ensure is that you remain transparent about it in your methods. Could be 20, 30, 50, ..., 5, ..., 100, ...

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe35k
1

Sounds like a good plan, especially with the log2FC shrinkage - just remember log2FC != pvalues

ADD REPLYlink written 5 weeks ago by kristoffer.vittingseerup1.2k

Older DESeq2 version were able to do it. Unfortunately they removed that ability from recent versions. Still as Kevin said those aren't recommended. You could try duplicating the data and then get some results, but still it will be bogus.

ADD REPLYlink written 5 weeks ago by piyushjo100
3

Don’t ever duplicate results to get pseudoreplicates. See miRNA-seq DE analysis with duplicate replicates. Am i correct?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint12k
1

Didn't mean to suggest that as the correct method, but rather a method to get top genes to validate using qPCR. That data is not publishable with pseudoreplicates.

ADD REPLYlink written 5 weeks ago by piyushjo100
2

Still, I do not think that creating pseudoreplicates gives you anything meaningful because you artificially create data without variance and therefore excessive false-positives and probably also false fold-changes. For qPCR candidates, I would use those that have reasonable fold log2FCs and proper mean expression as discussed above.

ADD REPLYlink written 5 weeks ago by ATpoint12k

OK something that I am trying to get my head around. Given this scenario: I have a read count for GeneA for both control and case samples (no replicates). Gene A(control) has a read count =2, GeneA(case) has a read count=1000. That is approximately a 9 fold increase for GeneA(case) at the log2scale (which is massive). So when you say "I would suggest to use a decent read cutoff to avoid high fold changes due to small counts" you are not referring to the above scenario. However, your are referring to a scenario where Gene B(control) has a read count =2, GeneB(case) has a read count=30, which returns a relatively high Log2fold change ~4, based on low counts?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by stevenmarkgalway10
1

Exactly. The situation with 2 vs 1000 is probably/maybe a true biological effect, but 2 vs. 30 is rather the usual noise that comes from low abundances.

ADD REPLYlink written 5 weeks ago by ATpoint12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1192 users visited in the last hour