randomForestSRC multivariate regression analysis for differential gene expression (RNA-seq) and physiological measurements
2.2 years ago
gmchaput ▴ 10

I have a data set of 1028 significantly differentially expressed genes from my DESeq2 analysis. I also have 5 physiological measurements for my organism of interest. I have a total of 35 samples.

I ran a random forest analysis using rfsrc() from the randomForestSRC package. My y/response variables are the physiological measurements (3 numeric, 2 categorical), and my x-variables are the genes (1028 numeric). I have output, but I am struggling to interpret the results for my training and test datasets, and I don't know how to visualize a tree from the forest.
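
For anyone who wants to reproduce the setup, here is a minimal sketch of the kind of call described above. It uses simulated data, not my real data: the 50 fake genes, the response names other than Biomass.z, and the object names train/test are all hypothetical stand-ins, and the settings are package defaults rather than a record of my exact call.

```r
library(randomForestSRC)

set.seed(1)
n <- 35   # samples, as in the post
p <- 50   # simulated "genes" standing in for the 1028 real ones

# Simulated expression matrix (hypothetical data)
genes <- as.data.frame(matrix(rnorm(n * p), n, p))
names(genes) <- paste0("gene", seq_len(p))

# Hypothetical responses: 3 numeric, 2 categorical (factors)
phys <- data.frame(
  Biomass.z = genes$gene1 + rnorm(n),
  phys2     = rnorm(n),
  phys3     = rnorm(n),
  cat1      = factor(ifelse(genes$gene2 > 0, "high", "low")),
  cat2      = factor(sample(c("a", "b"), n, replace = TRUE))
)
dat <- cbind(phys, genes)

# 80/20 split: 28 training samples, 7 test samples
train_idx <- sample(n, 28)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Multivariate forest with mixed (numeric + factor) responses
RFmodel <- rfsrc(Multivar(Biomass.z, phys2, phys3, cat1, cat2) ~ .,
                 data = train, ntree = 1000, importance = TRUE)

# Predict on the held-out samples; test error is reported because
# the responses are present in the test data
RFpred <- predict(RFmodel, newdata = test)

print(RFmodel)
print(RFpred)
```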

I tried ggRandomForests, but it does not appear to support the multivariate families (regr+) of randomForestSRC.

Basically, I want to know:

1) How to know if my model is correct?

2) How to determine which genes (x-variables) were the best predictors of the response variables.

3) How to visualize the decision tree of the forest in order to see how the terminal nodes were decided.
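
As far as I can tell from the package documentation, these three questions map roughly onto get.mv.error(), get.mv.vimp() and get.tree(). Below is a rough sketch on the built-in mtcars data rather than my own, so the choice of responses (mpg, hp, cyl) is only an illustrative assumption. Plotting a tree appears to need the data.tree and DiagrammeR packages, so that part is guarded.

```r
library(randomForestSRC)

set.seed(1)
d <- mtcars
d$cyl <- factor(d$cyl)   # make one response categorical (mixed outcomes)

fit <- rfsrc(Multivar(mpg, hp, cyl) ~ ., data = d,
             ntree = 500, importance = TRUE)

# 1) Out-of-bag error for each response (standardized so that
#    responses on different scales can be compared)
err <- get.mv.error(fit, standardize = TRUE)
print(err)

# 2) Variable importance (VIMP): one row per predictor,
#    one column per response
vi <- get.mv.vimp(fit, standardize = TRUE)
# Predictors ranked by importance for the response "mpg"
print(vi[order(vi[, "mpg"], decreasing = TRUE), , drop = FALSE])

# 3) Extract a single tree from the forest and plot it
if (requireNamespace("data.tree", quietly = TRUE) &&
    requireNamespace("DiagrammeR", quietly = TRUE)) {
  tr <- get.tree(fit, tree.id = 1, m.target = "mpg")
  plot(tr)
}
```

What I still can't judge is whether these numbers mean much with only 28 training samples.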

I've reviewed Udaya Kogalur & Hemant Ishwaran's webpage (https://kogalur.github.io/randomForestSRC/theory.html) as well as other websites/forums but am still having trouble understanding how to proceed.

My summaries for the training set (80% of dataset) and test set (20% of dataset) are below:

> print(RFmodel)
Sample size: 28
Number of trees: 1000
Forest terminal node size: 3
Average no. of terminal nodes: 5.68
No. of variables tried at each split: 33
Total no. of variables: 1028
Total no. of responses: 5
User has requested response: Biomass.z
Resampling used to grow trees: swor
Resample size used to grow trees: 18
Analysis: mRF-RC
Family: mix+
Splitting rule: mv.mix *random*
Number of random split points: 10
% variance explained: -0.23
Error rate: 0.71

> print(RFpred)
Sample size of test (predict) data: 7
Number of grow trees: 1000
Average no. of grow terminal nodes: 5.68
Total no. of grow variables: 1028
Total no. of grow responses: 5
User has requested response: Biomass.z
Resampling used to grow trees: swor
Resample size used to grow trees: 4
Analysis: mRF-RC
Family: mix+
% variance explained: 15.77
Test set error rate: 2.84
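
One thing I tried, to make sense of the two error rates: assuming the printed "% variance explained" is computed as 100 * (1 - error / var(y)) for the requested response (this is my assumption about how the summary is calculated, not something I have confirmed), the variance of Biomass.z implied by each summary can be backed out. The helper name implied_var is made up for this sketch.

```r
# Back out var(y) from the printed error rate and % variance explained,
# ASSUMING pct = 100 * (1 - error / var(y))
implied_var <- function(error, pct_explained) {
  error / (1 - pct_explained / 100)
}

train_var <- implied_var(0.71, -0.23)   # training (OOB) summary
test_var  <- implied_var(2.84, 15.77)   # test-set summary

round(c(train = train_var, test = test_var), 3)
```

If that assumption holds, the 7 test samples have a much larger spread in Biomass.z than the 28 training samples, so the two error rates are not directly comparable.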

random forest gene expression multivariate • 869 views