Selecting genes that best separate between two groups
3.6 years ago

Hi everyone! I'm working on a gene expression file from an RNA-seq experiment and trying to compare two groups of samples. I'm pretty new to the world of gene expression and need some help. I have a strong biological reason to believe that the two groups should differ. When I run differential gene expression analysis I get many differentially expressed genes, but when I cluster the samples according to those genes, I don't see a clear separation between the two groups. Is there a way to filter the genes I get from my differential expression algorithm so that I see a better clustering effect? I am interested in selecting, from the list of differentially expressed genes, the ones that separate the two groups best; however, so many genes are differentially expressed that I don't know how to do this effectively. Also, I have been trying two different differential expression methods (ttest and limma), and the genes I get from limma don't appear in the gene list I get from the ttest. Which one should I use? Thanks a lot!
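One simple way to act on the idea of picking the most discriminating genes is to rank them by a per-gene statistic and keep the top few. Below is a minimal Python sketch, assuming log-scale expression values; the function names (`welch_t`, `top_genes`) and the toy data are hypothetical. It only illustrates the ranking idea and is not a replacement for limma or DESeq2.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values.

    Assumes each group has at least two samples and non-zero variance.
    """
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def top_genes(expr, group_a, group_b, k):
    """Return the k gene names with the largest |t| between the two groups.

    expr maps gene -> {sample name -> log-scale expression value}.
    """
    scores = {}
    for gene, values in expr.items():
        a = [values[s] for s in group_a]
        b = [values[s] for s in group_b]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: three genes, three samples per group (made-up values).
expr = {
    "geneA": {"s1": 5.0, "s2": 5.2, "s3": 4.8, "s4": 9.0, "s5": 9.1, "s6": 8.9},
    "geneB": {"s1": 5.0, "s2": 7.0, "s3": 6.0, "s4": 6.5, "s5": 5.5, "s6": 7.5},
    "geneC": {"s1": 3.0, "s2": 3.5, "s3": 2.5, "s4": 4.5, "s5": 5.0, "s6": 4.0},
}
selected = top_genes(expr, ["s1", "s2", "s3"], ["s4", "s5", "s6"], k=2)
print(selected)  # ['geneA', 'geneC']
```

Keep in mind that genes chosen this way will tend to separate the very samples they were selected on, so a clean clustering afterwards is not independent evidence of a group difference.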

RNA-Seq filtering feature selection ttest limma

How did you normalize the samples in the ttest method? I usually use DESeq2 with great results. You can use PCA to see if the samples can be clustered.
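To make the PCA suggestion concrete, here is a small Python sketch that finds the first principal component by power iteration and projects the samples onto it. It assumes a samples-by-genes matrix of log-scale values; the function names and toy numbers are hypothetical, and in practice a library routine would normally be used instead.

```python
import math


def center_columns(rows):
    """Subtract each gene's mean across samples (rows = samples, columns = genes)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_pc(rows, n_iter=500):
    """Return (loading vector, per-sample scores) for the first principal component.

    Power iteration on X^T X, where X is the column-centred data matrix.
    """
    x = center_columns(rows)
    p = len(x[0])
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]            # X v
        w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


# Toy data: six samples (first three = group 1, last three = group 2), three genes.
samples = [
    [5.0, 5.0, 3.0],
    [5.2, 7.0, 3.5],
    [4.8, 6.0, 2.5],
    [9.0, 6.5, 4.5],
    [9.1, 5.5, 5.0],
    [8.9, 7.5, 4.0],
]
loadings, pc1 = first_pc(samples)
for name, score in zip(["s1", "s2", "s3", "s4", "s5", "s6"], pc1):
    print(name, round(score, 2))
```

If the two groups occupy different ranges of the PC1 scores, the samples separate along the main axis of variation; if they overlap, the dominant variation in the data is driven by something other than the grouping.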

Thanks! I work with DESeq2 too. Do you run the PCA on the rlog output of DESeq2?

How are you clustering the genes? Have you tried any sort of pathway or functional enrichment analysis (GSEA, DAVID, etc.) of your differentially expressed genes? What is the ttest method? Is it literally just running t-tests between your two groups?

Personally, I've had good success with limma and was able to make biologically interesting (and reasonable) conclusions from its output.

Thank you all very much! Interestingly, when I do PCA with the DE genes I can see a pretty nice separation between my two groups. For some reason, though, the samples don't cluster as nicely in the heatmap. I used the Euclidean and weighted settings to build the heatmap. My data is actually genetically heterogeneous. I have a few driving translocations, each with a different gene expression signature, and each of my two groups contains samples with different types of translocations. That is, I am trying to find the difference between groups that already have a lot of genetic variability within them.

3.6 years ago

If you already know the two groups and want to find the genes that discriminate between them, try linear discriminant analysis or any other suitable supervised method.

I gave an upvote for this as it is quite a powerful method, i.e., linear discriminant analysis (LDA).

Thanks! Have you used LDA on high-throughput data? I was looking for R code to run it, but could not find any suitable for large-scale data.

It depends on what you call large-scale data. In R, one typically uses the lda() function in the MASS package.
Consider that you probably don't need to use all your data to build the model. Presumably a subset is enough to use as a training set. If for some reason the lda() function can't handle your data, you could try to do the computation yourself. The most intensive operation is the eigenvector-eigenvalue computation, and you could try using solvers for large eigenvalue problems (e.g. the RSpectra package).
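As an illustration of doing the computation yourself, here is a minimal Python sketch of two-class Fisher linear discriminant analysis on a handful of pre-selected genes. It is not the MASS lda() implementation; the function names and toy data are hypothetical. For two classes the discriminant direction comes from a linear solve rather than a full eigendecomposition, and the within-class scatter matrix becomes singular once genes outnumber samples, which is one reason to reduce the gene set first.

```python
def mean_vector(rows):
    """Column means of a list of samples (rows = samples, columns = genes)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_scatter(groups):
    """Pooled within-class scatter matrix: sum of outer products of centred rows."""
    p = len(groups[0][0])
    s = [[0.0] * p for _ in range(p)]
    for rows in groups:
        mu = mean_vector(rows)
        for row in rows:
            d = [x - m for x, m in zip(row, mu)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular scatter matrix; use fewer genes or more samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_fisher_lda(group0, group1):
    """Return (w, threshold): direction w = Sw^-1 (mu1 - mu0) and a midpoint cut-off."""
    mu0, mu1 = mean_vector(group0), mean_vector(group1)
    sw = within_scatter([group0, group1])
    w = solve(sw, [b - a for a, b in zip(mu0, mu1)])
    midpoint = [(a + b) / 2 for a, b in zip(mu0, mu1)]
    threshold = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, threshold


def predict(w, threshold, sample):
    """Assign class 1 if the projected sample lies above the threshold, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, sample))
    return 1 if score > threshold else 0


# Toy data: four samples per group, two pre-selected genes (made-up log-scale values).
group0 = [[5.0, 3.0], [5.2, 3.5], [4.8, 2.5], [5.1, 3.2]]
group1 = [[9.0, 4.5], [9.1, 5.0], [8.9, 4.0], [9.2, 4.4]]
w, threshold = fit_fisher_lda(group0, group1)
print(predict(w, threshold, [5.5, 3.0]), predict(w, threshold, [8.5, 4.2]))
```

The midpoint threshold assumes equal class priors. The size of each entry of `w` hints at how much the corresponding gene contributes to the separation, but only when the genes are on comparable scales.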

Sorry about the confusion. By large-scale data I mean the gene expression matrix. It has more than 30,000 genes. Will lda() handle that many features? Also, should I give lda() the raw counts as input? I guess not, because of the differences in library depth and variance between the samples, so should I give it the normalized expression counts generated by DESeq2? Thanks
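To show why raw counts are a poor input, here is a small Python sketch of depth normalization followed by a log transform. It assumes a median-of-ratios style size factor in the spirit of DESeq2's normalization, written from scratch with hypothetical names and toy counts; it is not DESeq2's own code, and the log2(x + 1) step is only a simple stand-in for transforms such as rlog.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios style size factor per sample.

    counts maps gene -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) == 0:
            continue
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(statistics.median(r)) for r in log_ratios]


def log_normalize(counts):
    """Divide each sample by its size factor, then apply log2(x + 1)."""
    sf = size_factors(counts)
    return {
        gene: [math.log2(v / s + 1.0) for v, s in zip(values, sf)]
        for gene, values in counts.items()
    }


# Toy counts: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = {
    "g1": [10, 20],
    "g2": [100, 200],
    "g3": [50, 100],
    "g4": [0, 0],
}
sf = size_factors(counts)
norm = log_normalize(counts)
print([round(s, 3) for s in sf])
print({g: [round(v, 3) for v in vals] for g, vals in norm.items()})
```

In this toy case the raw counts differ two-fold between the samples purely because of depth, while the normalized log values are identical, which is the behaviour a classifier such as LDA needs from its input.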

Did you use the LDA on the log transformed expression matrix?
