Question: Rnaseq Processed Data To Genes Up/Down
0
gravatar for daniel.soronellas
5.3 years ago by
Spain
daniel.soronellas330 wrote:

Hi to all community!!

First of all thank everyone that helped me!

Have a question: I have analyzed RNAseq data using TopHat, then CuffLinks and finally parsing results to R and analyzing via Limma Package. I obtained following the tutorial the TopTable with the: Name of the Gene, log2FC, Average Expression, adjusted p value, among others (t, B,...); and also computed the FPKM and FC and added to the table.

Here you can find a preview of my data analyzed:

Here you can find a preview:

My objective at this point is to retrieve the full list of genes (of nearly 23,000) that appear to be UP or DOWN according to fold change.

The filtering that usually I do is to set:

  • adj.P.Val < 0.05 (FDR)
  • FC > 1.4 (UP) or < -1.4 (DOWN)

But only applying this criteria I reach too many genes UP or DOWN (3000 up and 4000 down). Considering previous experiments in our lab I expect to find close to 2000 up/down genes.

My question is about the FPKM and if I should use it to be more stringent in my filters: do you have experience with FPKM filtering ? which threshold do you use in your experiments? What can you recommend me?

I read some literature about this and found some papers which use an FPKM > 1 together with FDR and FC, and others not use the same FPKM or even use the RPKM > 0.1 and I don't know exactly how to proceed to set my filters with more statistically significant limits for genes UP/DOWN.

Thanks to all community! Any help will be kindly appreciated! Dani.

fpkm • 5.1k views
ADD COMMENTlink modified 5.3 years ago by biorepine1.3k • written 5.3 years ago by daniel.soronellas330
2

There is no gold standard for thresholding expression. The best you can do is to try to threshold based on biological observations. One method might be to look at the expression of introns/intergenic regions that you know probably should not be expressing anything and use that as a baseline. However, even that method can quickly get complicated.

ADD REPLYlink written 5.3 years ago by Damian Kao14k
1

I just want to throw a caution here about your analysis: it's not clear to me how you have used the output of cufflinks to send data to R and limma. If you are using limma, hopefully you are using the voom function, which would require a table of read counts per gene (per expt) as input, and not FPKM values.

Is that what you are using?

ADD REPLYlink written 5.3 years ago by Steve Lianoglou4.9k

Hi Steve,

Yes you are right :) But My question is what to do to be stringent having FC, FPKM and Ajusted P value (FDR) to filter genes UP-regulated and DOWN-regulated.

Thanks for your comment!

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by daniel.soronellas330

To continue Steve's point, what exactly do you mean when you say then Cufflinks and finally parse.... How or what is Cufflinks' output doing in limma? Could you clarify?

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Arun2.2k

That is,

I have two timepoints: 0h and 6h of treatment.

So to continue analysis after CuffLinks the genes.fpkm.tracking files from both treatments that the program outputs was taken by the FPKM, FPKM.high and FPKM.low together with gene symbol always where FPKM status was OK.

Then joined files by Gene Symbol to obtain a file with the following columns:

Gene_symbol, FPKM(0h), FPKM.high(0h), FPKM.low(0h),FPKM(6h), FPKM.high(6h), FPKM.low(6h).

This file was used to be analyzed with Limma and the result is the Table with computations of log2FC, Average Expressions, FDR, etc...

Sorry for the bad explanation, I hope this will be clear enough :)

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by daniel.soronellas330

Thanks, I think I get it. So input to limma was FPKM. Right?

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Arun2.2k

Yes,

The input was all FPKMs in both conditions

ADD REPLYlink written 5.3 years ago by daniel.soronellas330
1

Okay, before continuing with the original question you'd asked, the method of providing input as FPKM to limma, a package that assumes count values (and the statistical tests are designed for count data), is wrong. This is what Steve was trying to explain. Since your statistical analysis is invalid, it is difficult to tell anything about the 3000-4000 genes you have obtained. So, I suggest you re-run limma with read counts to find the differentially expressed genes correctly.

ADD REPLYlink written 5.3 years ago by Arun2.2k
3

To nitpick a bit, limma does not actually assume count values - it was developed for microarray analysis, after all. What voom() is doing is preparing count data for limma by stabilizing the variance (making sure that the variance does not depend to much on the mean). If the FPKM data have the proper distribution limma should work fine, but they probably don't, so it may be safer to use counts -> voom().

ADD REPLYlink written 5.3 years ago by Mikael Huss4.6k

Yes, thanks for pointing it out Mikael. I've been using this and edgeR exchangeably that much that I forgot the difference! :)

ADD REPLYlink written 5.3 years ago by Arun2.2k

That's what I thought :-) edgeR and limma are very similar in some ways

ADD REPLYlink written 5.3 years ago by Mikael Huss4.6k

Ok I understand the problem,

Since I'm new in RNA-seq, how I could obtain read counts from FPKM?, I was searching in the web and the solution is not trivial and I couldn't find any clear solution

Thanks

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by daniel.soronellas330
1

You can not. You'll have to obtain read counts from your sam/bam mapped file directly. As for the thresholding, people also increase the cut-off of the log2FC to >=2. Note: This is to be done after you get your analysis right.

ADD REPLYlink written 5.3 years ago by Arun2.2k
2
gravatar for biorepine
5.3 years ago by
biorepine1.3k
Spain
biorepine1.3k wrote:

Forget about these microarray R packages, when you using cufflinks output. I strongly suggest you to use cuffdiff to calculate differential expression of two time points. If you want to use R for further downstream analysis, apply cummerbund.

ADD COMMENTlink written 5.3 years ago by biorepine1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1179 users visited in the last hour