Question: RNA-Seq: Ranking Genes Based On Principal Component Analysis
2
6.8 years ago by
Sudeep1.6k
Sudeep1.6k wrote:

Hi all
Has anybody tried PCA-based gene ranking on read count data, or do you know of any papers on it? I have been searching for a while, and almost all the papers I came across used PCA only for plotting sample separation. What should be taken into consideration when doing PCA-based gene ranking on read count data (i.e. starting from scratch)?


EDIT: I actually meant prioritizing genes based on read counts (expression values) between case and control samples using PCA.

Thank you

pca rna-seq • 7.8k views
ADD COMMENTlink modified 6.8 years ago by Michael Dondrup46k • written 6.8 years ago by Sudeep1.6k
1

What do you mean by "gene ranking"? What's the criteria for ranking?

ADD REPLYlink written 6.8 years ago by Arun2.3k

Well, what I actually meant was "gene prioritization" based on expression values, not "ranking" as such. I have edited my post.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Sudeep1.6k
1

Have you done a more traditional differential expression analysis using DESeq or edgeR, for example? This will rank genes based on expression value differences between cases and controls.

ADD REPLYlink written 6.8 years ago by Sean Davis25k

Yes, I already have the DEGs from DESeq. I was just curious whether somebody has tried any of the PCA-based approaches, and what the caveats of such an analysis are.

ADD REPLYlink written 6.8 years ago by Sudeep1.6k

So you want to use PCA for differential expression ranking? I am interested in how this works, can you link any papers of this approach? Are they just using PCA as some kind of a smoothing function?

ADD REPLYlink written 6.8 years ago by Damian Kao15k
1

Here's an old but nice one on time-course analysis using PCA.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Arun2.3k
1

So, given two datasets A and B, they perform PCA on dataset A, project B onto A, and use the newly projected coordinates to get differential expression. I am not sure what test they are using for the differential expression, though. Some kind of ANOVA? I guess the advantages are: 1) it takes the time-course relationship into account; 2) using only the dominant components acts as a kind of smoothing function, since it de-noises the dataset.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Damian Kao15k
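
A minimal sketch of the "fit PCA on A, project B onto A" step described in the comment above. It is only an illustration with NumPy on made-up data; the helper names `fit_pca` and `project`, the toy matrices, and the per-gene `shift` measure are assumptions, and it does not reproduce whatever test the linked paper uses.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (column means, principal axes) for a rows-as-observations matrix."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Project X onto axes that were fitted on another dataset."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 genes (rows) x 6 time points (columns)
A = rng.normal(size=(50, 6))
B = A + rng.normal(scale=0.1, size=(50, 6))

mean_a, comps_a = fit_pca(A, n_components=2)   # keep only the dominant components
scores_a = project(A, mean_a, comps_a)
scores_b = project(B, mean_a, comps_a)         # B expressed in A's coordinate system

# Illustrative per-gene distance between the two projections (not a statistical test)
shift = np.linalg.norm(scores_b - scores_a, axis=1)
```

Keeping only the first few components is what gives the smoothing effect mentioned above: variation along the discarded axes no longer contributes to the comparison.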

You can have a look at this paper for other PCA based applications, I found sparse PCA and supervised PCA to be quite interesting

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Sudeep1.6k
2

Thanks for the papers. I am actually working with time-course RNA-seq data right now, so this is of interest to me. BTW, I posted some brief code on how to do PCA and visualize it in Python with matplotlib a couple of days ago: http://blog.nextgenetics.net/?e=42

ADD REPLYlink written 6.8 years ago by Damian Kao15k

Dk, nice post. However, I think it would be nice to explain the actual concept behind PCA and its purpose (why in time series?) in addition to just the code. I love theory! :)

ADD REPLYlink written 6.8 years ago by Arun2.3k

I've actually been working on a post to explain PCA, I just haven't gotten around to finishing it. It's a surprisingly simple concept if you ignore all the crazy maths, which I suck at anyway. :) It is essentially just changing the coordinate system's axes (x, y, z, ...) into a series of orthogonal (perpendicular) best-fit lines.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Damian Kao15k
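
The "orthogonal best-fit lines" picture in the comment above can be checked numerically. This is a small sketch on a made-up 2-D point cloud, using the eigendecomposition of the covariance matrix; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D cloud stretched along the diagonal
x = rng.normal(size=200)
data = np.column_stack([x, x + 0.2 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put the
# direction of largest variance (the first "best-fit line") first
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, axes = eigvals[order], eigvecs[:, order]

# Re-express every point in the new, rotated coordinate system
rotated = centered @ axes
```

The columns of `axes` are perpendicular to each other, and in the rotated coordinates the variance along each axis equals the corresponding eigenvalue, with no covariance left between axes.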

Sudeep, I get "content not found"

ADD REPLYlink written 6.8 years ago by Arun2.3k
1

Sorry for that, I was logged in from my institute account with direct access to the manuscript, now edited that, please try again.

ADD REPLYlink written 6.8 years ago by Sudeep1.6k

Unfortunately, I couldn't find any interesting papers for read count data. As I said in the post, all the papers I saw used PCA just to cluster samples. For microarrays I found a couple of papers, like the one posted by Arun in his reply, but I am not sure how the statistics work out for read count data.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Sudeep1.6k
1

It makes sense because PCA is a tool for either clustering or dimensionality reduction, as far as I've understood. So, it doesn't make much sense to me in comparing replicates of a gene over two conditions using PCA.

ADD REPLYlink written 6.8 years ago by Arun2.3k
4
6.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

To resolve the unanswered state of this question: in agreement with most of the comments already given, the answer is that, at least in this case, it doesn't make much sense to use PCA for gene ranking. This is because you have a case vs. control setting, which means you have a "2-dimensional" problem, whereas the applications of PCA described in the paper are directed towards time series or other higher-dimensional measurements. You will therefore get at most 2 principal components, and if you wanted to remove one, e.g. for dimension reduction or noise reduction, you would have only one left. That is not good for a statistical test in which you wish to compare two conditions.

Of course, one could rank the genes by their factor loadings (projection of the data on the first principal axis), but that doesn't seem to have any advantage in a case-control setting. A statistical test has the advantage of providing an estimate of significance (i.e. p-values) and allows you to estimate power, etc. PCA is a totally different technique and doesn't provide these estimates. Unless you can better define the use case and explain why a non-standard method should be applied, I would stick with an established method.
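
An illustrative sketch (not part of the original answer) of what ranking genes by their first-component coefficients could look like. The data are made up, with samples as rows and one gene deliberately shifted in the cases; all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes = 20
# Hypothetical log-scale expression: rows = samples, columns = genes
expr = rng.normal(scale=0.1, size=(6, n_genes))
is_case = np.array([False, False, False, True, True, True])
expr[is_case, 5] += 5.0  # gene 5 is strongly up in the cases

# PCA via SVD of the column-centered matrix
centered = expr - expr.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

pc1_loadings = Vt[0]                           # one coefficient per gene on PC1
ranking = np.argsort(-np.abs(pc1_loadings))    # genes ordered by |loading|
```

The shifted gene comes out on top here, but the ranking carries no p-value or error rate, and PC1 only tracks the case-control contrast because that contrast dominates the variance in this toy data.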

You didn't say whether you have replicates, but I guess you do; so if you wanted to use PCA, you would need to decide at which point in your analysis to summarize the replicates. At that point, however, you are going to lose the information about within-group variance. A statistical test such as ANOVA needs the within-group variance and compares it to the between-group variance. It is therefore important to keep the within-group variance until the statistical test.
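
To make the point about within-group variance concrete, here is a small sketch (not part of the original answer) of a one-way ANOVA F statistic computed by hand for a single gene. The function name `one_way_f` and the numbers are made up; for real read counts a dedicated method such as DESeq or edgeR is the appropriate tool.

```python
import numpy as np

def one_way_f(groups):
    """F statistic for a list of 1-D arrays, one array per group."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group: how far each group mean lies from the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group: spread of the replicates around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Tight replicates: group means 2 and 5
f_tight = one_way_f([np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])])

# Same group means (2 and 5), but much noisier replicates
f_noisy = one_way_f([np.array([-2.0, 2.0, 6.0]), np.array([1.0, 5.0, 9.0])])
```

Both genes have identical group means, so they become indistinguishable once the replicates are averaged, yet `f_tight` is 13.5 and `f_noisy` is about 0.84. That difference is exactly the information lost by summarizing replicates too early.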

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Michael Dondrup46k

Thank you for this long explanation. I am following the traditional methods for the analysis, but as I said in one of the comments, I posted this question just out of curiosity, to see whether any work has been done on PCA-based methods.

ADD REPLYlink written 6.8 years ago by Sudeep1.6k
1

Generally speaking, I'd say that the application of PCA is the same for gene expression data whether they come from microarrays or RNA-seq. For RNA-seq, PCA should be applicable to various gene-level normalized read counts, e.g. RPKM, FPKM, etc.

ADD REPLYlink written 6.8 years ago by Michael Dondrup46k
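
A rough sketch of running PCA on normalized rather than raw counts. Counts per million (CPM) with a log transform is used here as a simple stand-in, since RPKM/FPKM would additionally need gene lengths; the simulated counts and the pseudocount of 1 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical raw counts: rows = genes, columns = samples
counts = rng.poisson(lam=50, size=(100, 4)).astype(float)
counts[:, 3] *= 10  # the last sample was sequenced ten times deeper

# Counts per million removes the library-size difference between samples
cpm = counts / counts.sum(axis=0) * 1e6
log_cpm = np.log2(cpm + 1.0)  # pseudocount of 1 avoids log(0)

# PCA with samples as observations (rows) and genes as variables (columns)
samples = log_cpm.T
centered = samples - samples.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of variance per component
```

Without the normalization step, the first component would mostly reflect sequencing depth rather than biology.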

What if you have biological replicates of different parts of the same tissue and would like to use PCA to exclude a biological replicate in which one part of the tissue is contaminated by cells from the other part, due to improper handling, microdissection or surgery? Is it not easier to detect such a contaminated sample by using PCA?

ADD REPLYlink written 6.6 years ago by psola0

Easier than what? I have the impression that this is a possible application; another method is to cluster the samples. Still, there is no big difference between RNA-seq and microarray data here.

ADD REPLYlink written 6.6 years ago by Michael Dondrup46k
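
A toy sketch of the contamination scenario discussed above: two tissue parts, with one replicate of part A simulated as a 50/50 mixture of both parts. The profiles, noise level, mixing fraction, and the median-based deviation measure are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes = 200
profile_a = rng.normal(size=n_genes)
profile_b = profile_a.copy()
profile_b[:50] += 2.0  # 50 genes differ between the two tissue parts

def replicate(profile):
    """Simulate one replicate as the profile plus small measurement noise."""
    return profile + rng.normal(scale=0.1, size=n_genes)

part_a = [replicate(profile_a) for _ in range(3)]
# Hypothetical contaminated replicate: half of its signal comes from part B
part_a.append(replicate(0.5 * profile_a + 0.5 * profile_b))
part_b = [replicate(profile_b) for _ in range(4)]
samples = np.array(part_a + part_b)  # 8 samples x 200 genes

# PCA with samples as observations
centered = samples - samples.mean(axis=0)
U, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1 = U[:, 0] * s[0]  # PC1 score of each sample

# Within part A, flag the replicate furthest from the group median on PC1
a_scores = pc1[:4]
deviation = np.abs(a_scores - np.median(a_scores))
suspect = int(np.argmax(deviation))
```

In this simulation the mixed replicate (index 3) sits roughly halfway between the two groups on PC1 and is the one flagged. Hierarchical clustering of the samples would be an alternative way to spot it.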