Post Deseq2 Analysis
5.8 years ago
David_emir ▴ 490

Hi All,

I have differential gene expression (DGE) results comparing TCGA-LUAD and TCGA-LUSC samples (LUAD vs. LUSC). What are the ways to take this analysis beyond differential gene expression? How can I apply regression analysis, Random Forest classification, or SVM to the genes? Should I use raw counts, normalised counts, or DESeq2 results for post-DGE analysis? I am interested in looking at how genes behave between the two groups in the same cancer cells. Your suggestions will help me a lot. Thanks for your help,

Sincerely,

Dav.

deseq2 machine learning • 1.8k views

Dear David,

Firstly, for RNA-seq, I'll say that you should not use raw counts; the normalised counts are not recommended either, unless your downstream functions can handle negative binomially distributed data. You should obtain logged counts, be it log CPM, regularised log, or something else, i.e., data that approximately follows a normal distribution. Log FPKM or RPKM is not appropriate.
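As a rough illustration of the log-transformation step, here is a minimal base-R sketch that turns raw counts into log2 CPM. The count matrix is simulated and the pseudocount of 1 is an assumption made for the example; the commented lines mention DESeq2's `vst()` and `rlog()` as alternatives, where `dds` is a hypothetical name for your own DESeqDataSet.

```r
# Toy raw count matrix: genes in rows, samples in columns (simulated, not TCGA data).
set.seed(1)
counts <- matrix(rnbinom(40, mu = 100, size = 2), nrow = 8,
                 dimnames = list(paste0("gene", 1:8), paste0("sample", 1:5)))

# Library size = total counts per sample.
lib_size <- colSums(counts)

# Counts per million, then log2 with a pseudocount of 1 to avoid log2(0).
cpm <- t(t(counts) / lib_size) * 1e6
log_cpm <- log2(cpm + 1)

# With a real DESeqDataSet (hypothetically named dds), DESeq2 provides
# variance-stabilised or regularised-log values instead:
#   vsd <- DESeq2::vst(dds, blind = FALSE)
#   rld <- DESeq2::rlog(dds, blind = FALSE)
#   expr <- SummarizedExperiment::assay(vsd)
```

Whichever transformation is used, the resulting matrix (not the raw counts) is what would be passed on to regression or classification functions.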

With regard to ideas, please see previous answers that I have given:

A choice you should make is whether to use all genes or just the differentially expressed genes (and, if the latter, differentially expressed between which groups / conditions?). From my own perspective and experience, things like 'Random Forest' and other 'machine learning' strategies don't create improved models when compared to well-curated regression analysis. I'm referring to model AUCs through ROC analysis here.

Kevin


Hi Kevin,

Thanks a lot for your help. I am a biologist with limited coding experience. I have the files on my desk and am willing to analyse them myself (cost cutting, a funding issue :)). After DGE I have a set of differentially expressed genes (log2 fold change of +/- 2 and padj <= 0.05 cut-offs). I am planning to qualify this set of genes as biomarkers, and for that I need solid groundwork; I hope ML will help me in this case. I am going through all your posts, and they are very informative. Thanks for the hand-holding, it's indeed a great help. It would be great if you could share any advice with me on this.

Have a great weekend,
Dave

P.S. I am a non-native English speaker. Pardon my bad English!


Hey Dave, your English seems very good - do not worry.

  • How many genes are statistically significant?
  • Do you know how to implement the Random Forest algorithm on your data? - was the use of Random Forest just a suggestion from somebody else?
  • What have you compared so far: Tumour versus Normal or within Tumour comparisons (like, EGFR-positive versus EGFR-negative)?

Hi Kevin, Thanks a lot for your prompt reply,

  1. How many genes are statistically significant? I used the standard filter of padj <= 0.05: 1438 genes show increased expression (log2 fold change >= +2) and 2304 genes show decreased expression (log2 fold change <= -2).

  2. Do you know how to implement the Random Forest algorithm on your data? - was the use of Random Forest just a suggestion from somebody else? I have never used any ML techniques before; at most I have run DESeq2. One of my colleagues suggested trimming down the data using Random Forest or another ML technique.

  3. What have you compared so far: Tumour versus Normal or within Tumour comparisons (like, EGFR-positive versus EGFR-negative)? I have compared between two Lung cancer subtypes - LUSC Vs LUAD.
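For readers following along, the filter described in point 1 can be sketched in base R as below. The data frame is simulated and only mimics the `log2FoldChange` and `padj` columns of a DESeq2 results table; the name `res` and the simulated values are assumptions for illustration, not the actual LUAD vs. LUSC results.

```r
# Simulated stand-in for as.data.frame(results(dds)); only the two columns used here.
set.seed(1)
res <- data.frame(
  log2FoldChange = rnorm(1000, mean = 0, sd = 2),
  padj = runif(1000),
  row.names = paste0("gene", 1:1000)
)
res$padj[sample(1000, 50)] <- NA  # DESeq2 can report NA for padj, so handle it

# Significant genes: adjusted p-value at or below 0.05, ignoring NA values.
sig <- !is.na(res$padj) & res$padj <= 0.05

# Split by direction using the log2 fold change cut-offs of +2 and -2.
up <- rownames(res)[sig & res$log2FoldChange >= 2]
down <- rownames(res)[sig & res$log2FoldChange <= -2]

length(up)
length(down)
```

With a coefficient named `status_luad_vs_lusc`, a positive log2 fold change would normally mean higher expression in LUAD relative to LUSC.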

    resultsNames(DESeq.ds_lusc_luad)
    [1] "Intercept" "status_luad_vs_lusc"

    DESeq.ds_lusc_luad
    class: DESeqDataSet
    dim: 55237 518
    metadata(1): version
    assays(1): counts
    rownames(55237): TSPAN6 TNMD ... LINC01144 ENSG00000281920
    rowData names(0):
    colnames(518): TCGA-50-5946-01A TCGA-55-8089-01A ... TCGA-39-5019-01A TCGA-39-5030-01A
    colData names(3): sample status sizeFactor

Thanks a lot, Kevin, it means a lot to me. Sincerely, Dave.


Hey, no problem. Sounds interesting. My recent work tells me that adenocarcinoma and squamous cell carcinoma are quite different in terms of how they fare with immunotherapy. Regarding the Random Forest, do you actually know where to start with it?

Your list of statistically significant genes is actually large. So, you do need a way to reduce its size.

What I would try are the following:

  • Lasso-penalized regression, with LUAD/LUSC as y variable (end-point)
  • logistic regression independently for each gene, with LUAD/LUSC as y variable (end-point)
  • Random Forest

Regarding lasso-penalised regression, I have put some code here, which you could probably replicate (it even includes ROC analysis): A: How to exclude some of breast cancer subtypes just by looking at gene expressio
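This is not the code from the linked answer, just a minimal sketch of what a lasso-penalised logistic regression might look like with the `glmnet` package (assumed to be installed from CRAN). The expression matrix and the LUAD/LUSC labels are simulated, and all object names are hypothetical; in practice `x` would hold transformed expression values with samples in rows and genes in columns.

```r
library(glmnet)  # CRAN package, assumed installed

set.seed(1)
n <- 100
p <- 50
# Simulated expression: samples in rows, genes in columns.
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))
# Simulated labels: only gene1 and gene2 carry signal.
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(n, sd = 0.5) > 0, "LUAD", "LUSC"))

# alpha = 1 selects the lasso penalty; lambda is chosen by 10-fold cross-validation.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

# Genes with non-zero coefficients at the lambda that minimises CV deviance.
co <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")
selected
```

The genes left in `selected` form the reduced panel; using `s = "lambda.1se"` instead usually gives a smaller, more conservative set.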

For the logistic regression independently for each gene, you'd have to set it up as a loop that tests each gene. I have code for this, too: Question about generalized linear model fitting
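Again, this is not the linked code, only a minimal sketch of the per-gene loop using base R's `glm()`. The expression values and group labels are simulated (with a planted difference in `gene1`), and the object names are hypothetical.

```r
set.seed(1)
n <- 60
# Simulated expression: samples in rows, genes in columns.
expr <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("gene", 1:5)))
# LUSC is the reference level, so the model estimates the log odds of LUAD.
status <- factor(rep(c("LUSC", "LUAD"), each = n / 2), levels = c("LUSC", "LUAD"))
# Planted signal: gene1 is shifted upwards in LUAD samples.
expr[status == "LUAD", "gene1"] <- expr[status == "LUAD", "gene1"] + 1

# Fit one logistic regression per gene and collect the coefficient and p-value.
per_gene <- do.call(rbind, lapply(colnames(expr), function(g) {
  fit <- glm(status ~ expr[, g], family = binomial)
  est <- coef(summary(fit))[2, ]
  data.frame(gene = g, log_odds = est[["Estimate"]], p = est[["Pr(>|z|)"]])
}))

# Correct for testing many genes (Benjamini-Hochberg).
per_gene$padj <- p.adjust(per_gene$p, method = "BH")
per_gene[order(per_gene$p), ]
```

Genes that pass the adjusted p-value threshold could then be carried forward into a combined model.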

Obviously there's a lot more than just running these scripts, but it's good that you are at least wondering what to do after a simple differential expression analysis. Most researchers don't know what to do, and a study typically stops with the fold-changes and p-values. However, what happens after these is where the real 'translational' science comes into play.


Hi Kevin, Thanks a lot for your kind help; it's of great help to me. I was looking at your code. Is the input file the DESeq2 output or the count data? (Sorry if this is a stupid question.) I have never used Random Forest before; this will be my first time, and I am excited to see the results. Thanks a lot, Kevin. Sincerely, Dave
