Hi all,
I want to identify tissue-specific genes.
I have 2 data sets:
Set 1) RPKM values for all genes in the tissue of interest
Set 2) RPKM values for all genes in many other tissue types
Should I choose a low RPKM cutoff for the genes in the tissue of interest? If so, how do I choose the cutoff?
I have 2 concerns:
A) If I don't choose a cutoff, then any expression (no matter how small) in the tissue that is not found in other tissues indicates tissue specificity. For example: the RPKM value for gene X in set 1 is 1.0 and in set 2 is 0.0. Would it be a mistake to say gene X is tissue specific?
B) If I do choose a cutoff, am I incorrectly excluding rarely expressed genes? For example: a cutoff RPKM value of 2.0 would exclude gene X above.
Thanks in advance. Kenneth.
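To make concerns A and B concrete, here is a minimal Python sketch of how the cutoff changes the call for gene X. Everything except gene X's values (1.0 in set 1, 0.0 in set 2) is an assumption made for illustration: `geneY`, the `specific_genes` helper, and the `background` tolerance for the other tissues are hypothetical and not part of any real pipeline.

```python
# Hypothetical data: geneX mirrors the example in the question,
# geneY is invented to show the effect of a background tolerance.
tissue_of_interest = {"geneX": 1.0, "geneY": 50.0}
other_tissues = {
    "geneX": [0.0, 0.0, 0.0],
    "geneY": [0.0, 0.2, 0.0],
}


def specific_genes(target, others, cutoff, background=0.0):
    """Genes above `cutoff` in the target tissue and at or below
    `background` in every other tissue (an assumed, simple definition)."""
    return sorted(
        gene
        for gene, value in target.items()
        if value > cutoff and all(v <= background for v in others[gene])
    )


# Concern A: with no cutoff, geneX (RPKM 1.0) is called tissue specific.
no_cutoff = specific_genes(tissue_of_interest, other_tissues, cutoff=0.0)

# Concern B: a cutoff of 2.0 drops geneX.
with_cutoff = specific_genes(tissue_of_interest, other_tissues, cutoff=2.0)

# Allowing a little background expression elsewhere rescues geneY.
with_background = specific_genes(
    tissue_of_interest, other_tissues, cutoff=2.0, background=0.5
)

print(no_cutoff, with_cutoff, with_background)
```

The sketch only shows that the answer depends on both thresholds. It does not say which values are appropriate for a given data set.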
Excellent resources, thank you very much.
All data is from mice.
I have been given Dataset 1 which has a single RPKM value for each gene. The value is the mean of the 3 replicate RPKM values. I've got a bad feeling about combining the data like that. Is it worth raising this issue? Or is it a waste of time considering I only want to know if the gene is specific for that tissue, not how much of the gene is expressed by the tissue.
Regarding log-transformation: I'm assuming this is necessary to create a normal distribution only if the RPKM range is too large to plot. I raised doing this with my supervisor and was told to just use the raw RPKM values. Is this a mistake?
Thanks for your help. Kenneth.
Hi, RPKM values from RNA-seq are not normally distributed. See the image: as I said, a large number of genes sit at 0 or minimal expression, so you get a massive peak and then a long tail extending to very high values (but with very low density there). That's the characteristic distribution of RNA-seq data. Log-transforming (bottom panel of the image) only helps you "see" the peak, which is otherwise not possible with raw RPKM (top panel).
Important:
1) Don't collapse replicates into a single value. Let whatever statistic you choose see the mean and, more importantly, the standard deviation, and make inferences from both. If you do a PCA where each tissue is represented by only one (averaged) sample, I don't know if anything meaningful would come out.
2) If you do run a PCA, working with log2-transformed values would be helpful, as the raw values span several orders of magnitude (i.e. from ~10 to >10k).
EDIT - The plots are made from TPM values (a more comparable version of RPKM), but the picture would be the same with RPKM/FPKM.
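As a rough illustration of the two points above, the following Python sketch keeps the replicates separate, reports the mean and standard deviation per gene, and log2-transforms the mean. The gene names and RPKM values are invented, and the pseudocount of 1 (used so that log2 is defined at 0) is an assumption, not something prescribed in this thread.

```python
import math
import statistics

# Hypothetical RPKM values: gene -> three replicate measurements.
replicates = {
    "geneA": [0.0, 0.0, 0.1],
    "geneB": [1.2, 0.8, 1.0],
    "geneC": [950.0, 1100.0, 12000.0],  # one replicate is an outlier
}

PSEUDOCOUNT = 1.0  # assumption: avoids log2(0) for unexpressed genes


def summarize(values):
    """Return (mean, sample standard deviation, log2(mean + pseudocount))."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, sd, math.log2(mean + PSEUDOCOUNT)


summary = {gene: summarize(vals) for gene, vals in replicates.items()}

for gene, (mean, sd, log_mean) in summary.items():
    print(f"{gene}: mean={mean:.2f} sd={sd:.2f} log2(mean+1)={log_mean:.2f}")
```

For `geneC` the standard deviation is larger than the mean, which is exactly the kind of information that disappears once the replicates have been averaged into one number.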
Hi, I have selected an RPKM cutoff of 2 for my data set. How should I apply the cutoff? Should I keep genes with an average RPKM > 2, or genes with RPKM > 2 in all replicates?
Hi, I completed the pipeline successfully. In my case I chose a cutoff of 1 RPKM.
Whether you choose to 1) average first and cut off after or 2) cut off first and average after is only a question of practicality (i.e. whichever is easiest in your script), because in my data there was very little difference in the result.
In practice, for me it was easier to get the mean RPKM first, then take the log2 of the mean, and later in the pipeline remove any genes where log2(meanrpkm) is at or below 0, because my cutoff is 1 and log2(1)=0.
If you were to follow my approach, you would remove genes where log2(meanrpkm) is at or below 1, because your cutoff is 2 and log2(2)=1.
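The options discussed above can be sketched in Python as follows. This is only an illustration: the gene names and RPKM values are made up, the helper names are hypothetical, and "passing" is taken to mean strictly greater than the cutoff, as in the question.

```python
import math

CUTOFF = 2.0  # RPKM cutoff from the question above

# Hypothetical replicate RPKM values per gene.
rpkm = {
    "geneA": [3.0, 2.5, 4.0],  # above the cutoff in every replicate
    "geneB": [5.0, 1.5, 1.0],  # mean 2.5 passes, two replicates do not
    "geneC": [0.0, 0.0, 0.0],  # not expressed
    "geneD": [2.0, 2.0, 2.0],  # exactly at the cutoff
}


def passes_mean(values, cutoff=CUTOFF):
    """Average first, then apply the cutoff."""
    return sum(values) / len(values) > cutoff


def passes_all(values, cutoff=CUTOFF):
    """Require every replicate to exceed the cutoff."""
    return all(v > cutoff for v in values)


def passes_log2_mean(values, cutoff=CUTOFF):
    """Same as passes_mean, but compared on the log2 scale."""
    mean = sum(values) / len(values)
    # log2 is undefined at 0, so unexpressed genes are treated as failing.
    return mean > 0 and math.log2(mean) > math.log2(cutoff)


by_mean = sorted(g for g, v in rpkm.items() if passes_mean(v))
by_all = sorted(g for g, v in rpkm.items() if passes_all(v))
by_log2 = sorted(g for g, v in rpkm.items() if passes_log2_mean(v))

print(by_mean, by_all, by_log2)
```

Filtering on the mean and filtering on log2 of the mean give the same gene list, since log2 preserves order. The toy data is deliberately built so that `geneB` separates the "mean" rule from the "all replicates" rule. On real data the two may agree for most genes, as reported above, but genes with variable replicates are where they can differ.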