Question: Selecting genes that best separate between two groups
0
gravatar for lirongrossmann
5 weeks ago by
lirongrossmann20 wrote:

Hi everyone! I'm working on a gene expression file from an RNA seq experiment and try to compare between two groups of samples. I'm pretty new to this world of gene expressions and need some help. I have a strong reason to believe that there should be a biological difference between the two groups (from the biological aspect). When I do different gene expression I get many differentially expressed genes and when I cluster the samples according to the genses, I don't see two clear separation between the groups. Is there a way to filter the genes I get from my differential expression algorithm so I can see a better clustering effect? I am interested in selecting genes from the list of differentially expressed genes that will separate the two groups in the best way, however, there are so many genes that are differentially expressed that I don't know how to effectively do it. Also, I have been trying two different differential gene expression methods (ttest and limma) and the genes I get from each limma don't appear in the gene list I get from the ttest. Which one should I use? Thanks a lot!

ADD COMMENTlink modified 4 weeks ago • written 5 weeks ago by lirongrossmann20
2

How did you normalize the samples in the ttest method? I usually use DESeq2 with great results. You can use PCA to see if the samples can be clustered.

ADD REPLYlink written 4 weeks ago by Asaf4.5k

Yes, I have also used PCA in the past.

For the OP: What you do is first transform the data into eigenvectors (principal components) and eigenvalues, and then visually inspect the pairwise bi-plots. If you see a pair of principal components that segregate your groups of interest, you can take a look at the eigenvalues corresponding to these principal components. The genes most responsible for variation along the principal components will have the highest absolute eigenvalues. It is then possible to use these eigenvalues in later modeling - this has been done in the past.

In my figure here (fro this thread: https://www.biostars.org/p/271694/), the third component, in its extremities, segregates the South-Asian group (mostly Indians) from the other groups. So, the variables with high absolute eigenvalues on this principal component would be indicative of differences between these. Screen

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe3.3k

Thanks! I work with DESeq2 too. Do you run the pca on the rlog output of DESeq2?

ADD REPLYlink written 4 weeks ago by lirongrossmann20
1

How are you clustering the genes? Have you tried any sort of pathway or functional enrichment analyses (GSEA, DAVID, etc) of your differentially expressed genes? What is the ttest method, literally just doing t-tests between your two groups?

Personally, I've had good success with limma and was able to make biologically interesting (and reasonable) conclusions from its output.

ADD REPLYlink written 5 weeks ago by jared.andrews07140
3
gravatar for Jean-Karim Heriche
4 weeks ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche13k wrote:

If you already know the two groups and want to find the genes that discriminate between them, try linear discriminant analysis or any other suitable supervised method.

ADD COMMENTlink written 4 weeks ago by Jean-Karim Heriche13k

I gave an upvote for this as it is quite a powerful method, i.e., linear discriminant analysis (LDA).

ADD REPLYlink written 4 weeks ago by Kevin Blighe3.3k

Thanks! have you used LDA on a high throughput data? I was looking for an R code to run it, but could not find one suitable for large scale data.

ADD REPLYlink written 29 days ago by lirongrossmann20

It depends on what you call large scale data. In R, one typically uses the lda() function in the MASS package.
Consider that you probably don't need to use all your data to build the model. Presumably a subset is enough to use as training set. If for some reason the lda() function can't handle your data, you could try to do the computation yourself. The most intensive operation is the eigenvector-eigenvalue computation and you could try using solvers for large eigenvalue problems (e.g. the RSpectra package)

ADD REPLYlink written 29 days ago by Jean-Karim Heriche13k

Sorry about the confusion. By large scale data I mean the gene expression matrix. It has more than 30,000 genes. Will lda() handle that large amount of features? Also, should I give lda() the raw count as an input? I guess not, because of the difference in library depth and variance between the samples, so should I give it the normalized expression count generated by Deseq2? Thanks

ADD REPLYlink written 29 days ago by lirongrossmann20

Did you use the LDA on the log transformed expression matrix?

ADD REPLYlink written 28 days ago by lirongrossmann20
1
gravatar for Kevin Blighe
4 weeks ago by
Kevin Blighe3.3k
Republic of Ireland (Éire)
Kevin Blighe3.3k wrote:

Just some things to look out for from my own perspective:

  • is your data too flat? - low variance data will never be capable of being separated. How does a simple boxplot of your data look?
  • What P value cut-offs are you using? Always aim to use adjusted P of 0.05 (i.e. 5% FDR) and absolute log base 2 fold change >2
  • When you cluster, which distance metric and linkage function are you using? - you may see greatest separation using Euclidean distance and Ward's or Complete linkage. Mess around with different combinations of these. Take a look at the thread here for other options including Pearson correlation distance: A: How to plot a heatmap with two different distance matrices for X and Y LDA and PCA mentioned by the others are good suggestions that you can also try.

Finally, I'd just like to add that clustering and heatmaps should not be the deciding factors on how good your segregation has worked. I would ask: How well does your panel of genes predict your outcome of interest or group assignment through logistic regression modelling? Through that coupled with ROC analysis, you can derive AUC, sensitivity, specificity, precision, and accuracy.

Hope that this helps! Kevin

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe3.3k
0
gravatar for lirongrossmann
4 weeks ago by
lirongrossmann20 wrote:

Thank you all very much! Interestingly, when I do pca with the DE genes I can see a pretty nice separation between my two groups. For some reason though, in the heatmap the samples don't cluster as nicely. I used Euclidean and weighted for building the heatmap. My data is actually genetically heterogeneous. I have few driving translocation that each have a different gene expression signature and I am trying to find a difference between two groups that each group contains samples with different types of translocations. That is, I am trying to find the difference between groups where there is a lot of genetic variability within each group to begin with.

ADD COMMENTlink written 4 weeks ago by lirongrossmann20

You appear to have removed your previous comment and duplicated this answer?

PCA and heatmaps / hierarchical clustering are very different. PCA involves a single value decompositon transformation whereas a clustering algorithm will normally just be performed on the data as you provide it (it ma also scale to Z-scores for the actual heatmap colour shading).

What is the percent variance explained by your principal components 1 and 2 (PC1 and PC2)? Low variance on PC1 especially will infer that there is in fact not much difference between your groups, even if they are visually separated.

ADD REPLYlink written 29 days ago by Kevin Blighe3.3k

I initially wrote my response as a comment but was asked to write it as a reply to your comment. Sorry for the confusion. When I look at my data using PCA the variance explained by PC1 is 14% and 10% by PC2. I know the samples are biologically different regardless of my grouping as some of them have known mutations while other don't (it may not directly affect the gene expression, but previous studies have shown that within these samples are biologically different from each other. Also, each sample is from different patient, so there is another level of biological variability). Looking at these samples I divided them into two groups that I think should have distinct gene expression signature and colored the pca plot by the these two groups, where it showed that they do separate. The PCA plot was obtained using the expression matrix for all of the samples (i.e. the samples in the two groups) and more than 30,000 genes. I then did differential analysis between the two groups using Deseq2 and used the genes I got to draw a heatmap, but in the heatmap the separation was not that obvious.

ADD REPLYlink written 29 days ago by lirongrossmann20

Okay, and you did find many genes that were statistically differentially expressed? In the heatmap, do the groups separate by the dendrogram?; or are you saying that they don't separate based on the shading in the heatmap itself?

You should do the heatmap on logged counts, and you try the following tricks in order to make visualisation better:

#Set colour scheme and breaks
require("RColorBrewer")
myCol <- colorRampPalette(c("limegreen", "black", "purple"))(100)
myBreaks <- seq(-3, 3, length.out=101)

require("gplots")
heat <- t(scale(t(MyDataMatrix)))

heatmap.2(heat, col=myCol, breaks=myBreaks, main="MyHeatmap", key=T, keysize=1.0, scale="none", ColSideColors=MySampleCols, density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.8, labCol=NA, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))
legend("topright", bty="n", cex=0.8, title="MySampleGroups", c("Group1", "Group2"), fill=c("red2", "royalblue"))

You can also experiment by changing the breaks to -2 +2, and the distance function to distfun=function(x) as.dist(1-cor(t(x)))

ADD REPLYlink written 28 days ago by Kevin Blighe3.3k

Thank you so much for your help! They don't cluster so nicely in the dendogram. I'll try to describe in more detail what I did: I created a Deseq2 object using my raw count matrix and metadata. I then applied variance stabilization transformation on the data and plot PCA. I see two distinct groups in my plot. The groups on the pca correlate with the groups in the design matrix. Here is the code I was using:

dds <-DESeqDataSetFromMatrix(countData = ep,colData = cp,design = ~Risk)

dds <- estimateSizeFactors(dds)

vsd <- varianceStabilizingTransformation(dds)

plotPCA(vsd, intgroup="Risk")

Now I ran Deseq on dds and got 150 genes which are differentially expressed between the two groups in my design matrix. I used those genes to create the heatmap and the clustering wasn't as clear.

Is there a way to know which genes separate the two groups I see in the pca plot? If so, I can try giving those genes to the clustering algorithm and see I get a better separation in the heatmap.

ADD REPLYlink written 28 days ago by lirongrossmann20
1

Hey, I see, that you're using variance-stabilised counts, which normally wouldn't be that much of a problem. Have you also tried the regularised log transformation (rlogTransformation())? It can be use in place of vst().

In any case, it's quite easy to see which genes are driving the separation of your samples via PCA but not many people know about (or even look) at this type of thing, so, I'm really glad that you asked. PCA is an underappreciated type of data transformation. I actually published on it years ago but I had no funding: https://benthamopen.com/contents/pdf/TOBIOIJ/TOBIOIJ-7-19.pdf

To do it, you can do PCA using the base R code as follows:

project.pca <- prcomp(t(vsd))
summary(project.pca)

#Determine the proportion of variance of each component
#Proportion of variance equals (PC stdev^2) / (sum all PCs stdev^2)
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

#Check how much variance is explained by each PC
barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

#Produce bi-plots
par(cex.axis=0.6, cex.main=0.6)
pairs(project.pca$x[,1:10], col=MySampleColorVector, cex=0.2, main="Principal components analysis bi-plot\nPCs 1-10", pch=16)

#Plot PC1 vs 2
par(mar=c(5,5,5,5), cex=0.8)
plot(project.pca$x, type="n", cex.main=0.8, cex.axis=0.8, main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"))
points(project.pca$x, col=MySampleColorVector, pch=16, cex=1)
legend("topleft", bty="n", cex=0.8, title="Group", c("Group1", "Group2"), fill=c("red2", "royalblue"))

Now the key part is to access the eigenvalues for each gene to each eigenvector (principal component). These are contained in the data-frame: project.pca$x

I'm also going to let you in on a little-known fact: when you do PCA using DESeq2's in-built function, it actually filters out a proportion of your genes based on low variance (I asked the developer of DESeq2 and he confirmed this operation). In this way, it magnifies the differences between your groups. The result that you get with my code above will, therefore, be slightly different than what you already got.

ADD REPLYlink modified 28 days ago • written 28 days ago by Kevin Blighe3.3k
1

Thank you for the code! Is it supposed to be project.pca <- prcomp(assay(vsd))? The column in my vsd are the samples and the row are the genes. If I go by t(assay(vsd)) I got PCA for the samples rather than the genes.

ADD REPLYlink written 28 days ago by lirongrossmann20

It depends on your data. In the end, what you should see in project.pca$x is an eigenvalue for each of your genes to each principal component. If instead you see your samples with eigenvalues, then you know that you'll have to use transpose (t())

It could be that my data was arranged differently from yours.

ADD REPLYlink written 28 days ago by Kevin Blighe3.3k

Thanks! I'm now stuck with another related problem on a different dataset. I run Deseq2 on the dataset, I get a handful of DE genes between two groups of interests in my dataset. I run pca on the expression matrix with only the DE genes selected, and I don't see the two groups separating. How can it be that when I run pca on the dataset using the DE genes (with adj p value less than 0.05 and log fold on 2) the two groups don't separate? Did you run into a similar issue in the past? Which visualization method is best to show two groups clearly separated by the DE genes? Thanks for your help again!

ADD REPLYlink modified 25 days ago • written 25 days ago by lirongrossmann20
0
gravatar for lirongrossmann
4 weeks ago by
lirongrossmann20 wrote:

Thank you all very much! Interestingly, when I do pca with the DE genes I can see a pretty nice separation between my two groups. For some reason though, in the heatmap the samples don't cluster as nicely. I used Euclidean and weighted for building the heatmap. My data is actually genetically heterogeneous. I have few driving translocation that each have a different gene expression signature and I am trying to find a difference between two groups that each group contains samples with different types of translocations. That is, I am trying to find the difference between groups where there is a lot of genetic variability within each group to begin with.

ADD COMMENTlink written 4 weeks ago by lirongrossmann20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 955 users visited in the last hour