Question: FPKM values / Data analysis
6.8 years ago, Biogeek wrote:

Hey,

Still getting the hang of the whole RNA-seq and gene annotation process. After reading a publication, there is one thing I have been thinking about lately that I would like to ask about:

1. When filtering FPKM values in a spreadsheet, what value determines whether a transcript is present/expressed, silenced, or up-regulated? I had assumed that any value greater than 0 means the transcript is expressed, but I have read otherwise. Can anyone shed light on this? One paper, using RPKM rather than FPKM, assumed that a transcript with RPKM >= 3 was over-expressed. Can anyone explain?

2. I also have 3 different conditions: control, low and medium. When making comparisons I am looking at the top 1000 up-regulated and down-regulated transcripts and comparing these between Medium/Control vs Low/Control (taking FPKM values, working out fold changes and log2 values) using Blast2GO software. I have also thought about just comparing the top 1000 expressed transcripts of control vs medium vs low based on FPKM values, without working out the top up-/down-regulated transcripts via fold changes. Ultimately, what would be the best way to go about the comparisons? Does my method make sense? What logical approaches have you taken to analysing an experiment with 3 treatments?

So far I have created some Venn diagrams to show transcripts present in each condition, shared between conditions, and present in all conditions, plus heat maps using edgeR, and I am currently using Blast2GO for GO and annotation comparisons. I just need to be 100% confident in my methods.

Apologies for the very noobish questions but I guess I have to learn somewhere :D

Kind regards.


In principle you cannot use FPKM to define cutoffs for non-transcribed/transcribed genes; see "Does FPKM scale incorrectly in case of unequal mapping rates?" and "what FPKM means in one sample". It is not clear to me whether there is any sensible way of determining such a cutoff from a single sample alone, other than arbitrarily.
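To make the point concrete, here is a minimal sketch of the usual FPKM formula, fragments × 10^9 / (transcript length in bp × total mapped fragments). The gene names, lengths and counts are invented purely for illustration. geneA has the same length and the same fragment count in both toy samples, yet its FPKM differs tenfold because another gene dominates the second library, which is one reason a fixed cutoff applied to a single sample is hard to interpret.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def fpkm_table(sample):
    """Compute FPKM for every gene in a {gene: (length_bp, fragments)} dict."""
    total = sum(count for _, count in sample.values())
    return {
        gene: fpkm(count, length, total)
        for gene, (length, count) in sample.items()
    }


# Hypothetical toy data: gene -> (length in bp, fragment count).
# geneA is identical in both samples; only geneB changes.
sample_1 = {"geneA": (2000, 100), "geneB": (1000, 900)}
sample_2 = {"geneA": (2000, 100), "geneB": (1000, 9900)}

t1 = fpkm_table(sample_1)
t2 = fpkm_table(sample_2)

print("geneA FPKM, sample 1:", t1["geneA"])
print("geneA FPKM, sample 2:", t2["geneA"])
```

The drop in geneA's value here reflects the change in library composition, not any change in how many fragments geneA itself produced.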

written 6.8 years ago by Michael Dondrup

So in essence, am I safe going ahead with FPKM values? I used Trinity followed by RSEM and edgeR; the data was normalised before edgeR was applied. I took for granted that by loading the .FPKM file into Excel I could use custom filters to sort the data into what I needed to interpret. Is this wrong, and is calculating fold change and then log2 in Excel wrong too? Any help greatly appreciated. My idea is to see the change in gene representation over the different conditions as GO terms.

Can I also ask: we get a gene and also a transcript FPKM file. Am I best going with the gene FPKM file for analysis rather than the transcript one?

Again, thank you.

modified 14 months ago by Ram • written 6.8 years ago by Biogeek

I can't recommend using edgeR with normalized estimated counts. Perhaps you get vaguely correct results, perhaps not, it's tough to know since the counts kind of violate the statistical model used by edgeR.

If you're interested in looking at GO enrichment, just use the gene-level metrics. While one could theoretically hope to find different GO annotation per-transcript, this never occurs (practically, at least).
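As a rough illustration of what "gene-level" means here, the sketch below collapses transcript-level values onto genes by summing the isoforms of each gene. This is only an assumption about how one might do it by hand; the pipeline described above already provides a gene-level file, which is the thing to use in practice. The transcript IDs, the transcript-to-gene map and the values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transcript -> gene mapping and transcript-level FPKM values.
transcript_to_gene = {
    "gene1_i1": "gene1",
    "gene1_i2": "gene1",
    "gene2_i1": "gene2",
}
transcript_fpkm = {"gene1_i1": 4.0, "gene1_i2": 1.5, "gene2_i1": 0.25}


def gene_level(transcript_values, mapping):
    """Sum the values of all transcripts that belong to the same gene."""
    totals = defaultdict(float)
    for transcript, value in transcript_values.items():
        totals[mapping[transcript]] += value
    return dict(totals)


gene_fpkm = gene_level(transcript_fpkm, transcript_to_gene)
print(gene_fpkm)
```

GO terms would then be attached once per gene rather than once per isoform.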

written 6.8 years ago by Devon Ryan
6.8 years ago, Devon Ryan (Freiburg, Germany) wrote:
  1. There is no predefined threshold for denoting a transcript as expressed (some papers will claim this, but unless they've done the spike-ins and validation then just ignore their values). Realistically speaking, a value somewhere between 0.5 and 5 is probably correct, but will vary with dataset (and possibly sample). See this blog post from Lior Pachter where he goes into more detail (I recommend his blog in general).
  2. Are you actually computing the fold-changes yourself and going off of that rather than using one of the analysis packages (EBSeq, cuffdiff, etc.)? I wouldn't base much purely on the fold-changes, since a big difference in a highly variable measurement isn't likely to be significant (thus, use the adjusted p-value and then prioritize based on fold-change, expression level, etc.). Regarding looking at GO changes, you typically take the differentially regulated genes/transcripts as a unit. There was a recent post here on Biostars (somewhere, I can't find the post at the moment) with a link to a paper where they split things by up/down-regulation prior to looking at GO enrichment changes. I've never done that and have issues with it due to how biology works, but I guess if you did that you wouldn't be completely alone.
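The "adjusted p-value first, then fold-change" ordering can be sketched roughly as follows. This is an illustration under stated assumptions and not what any particular package does: the raw p-values would normally come from the analysis package itself, Benjamini-Hochberg is assumed as the adjustment method, the pseudocount of 1 and the 0.05 cutoff are arbitrary choices, and the genes and numbers are made up.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio with a pseudocount (an assumption) to avoid dividing by 0."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical rows: (gene, control expression, treated expression, raw p).
rows = [
    ("g1", 7.0, 31.0, 0.001),
    ("g2", 3.0, 63.0, 0.40),  # largest fold change, but too variable
    ("g3", 31.0, 3.0, 0.01),
    ("g4", 9.0, 9.0, 0.90),
]

padj = benjamini_hochberg([p for _, _, _, p in rows])

# Step 1: keep only genes that pass the adjusted p-value cutoff.
hits = [
    (gene, log2_fold_change(treated, control), q)
    for (gene, control, treated, _), q in zip(rows, padj)
    if q < 0.05
]
# Step 2: prioritize the survivors by absolute log2 fold change.
hits.sort(key=lambda h: abs(h[1]), reverse=True)

for gene, lfc, q in hits:
    print(gene, round(lfc, 2), round(q, 4))
```

In this toy data g2 has the biggest fold change but never reaches the ranking step, because its adjusted p-value is far above the cutoff.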