Question: Question about generalized linear model fitting
0
12 months ago by
kanwarjag910
United States
kanwarjag910 wrote:

I am analyzing a scRNAseq data trying to follow a protocol suggested in this paper- https://www.nature.com/articles/nature24489 mentioned under Computational analysis The statement reads- " Selection of variable genes was performed by fitting a generalized linear model to the relationship between the squared coefficient of variation and the mean expression level in logarithmic space, and selecting genes that deviated significantly (P < 0.05) from the fitted curve"

I have two naive questions: 1. In general GLM can be done in two data columns. However in my case there are several columns of data corresponding to single cells. How can I use the matrix for GLM. I looked into JMP and past3 documentation.

1. What is the best GUI based software or an R package that can fit the model for multiple columns data and out put p value for each row (gene).

Apologies, if it looks like a very naive question. However, I tried but could not overcome this bump.

Thanks

rna-seq • 625 views
modified 12 months ago by Kevin Blighe33k • written 12 months ago by kanwarjag910
2
12 months ago by
Kevin Blighe33k
Republic of Ireland
Kevin Blighe33k wrote:

The data that you have to supply to the GLM function in R will have genes as he columns and sample values as rows, with the first column as the factor / categorical variable of interest, i.e., the one that you are modelling your expression to, like this:

``````modelling

CaseControlStatus   MYC    EGFR   KIF2C   CDC20
Sample 1   Case               11.39  10.62   9.75   10.34
Sample 2   Case               10.16   8.63   8.68    9.08
Sample 3   Case                9.29  10.24   9.89   10.11
Sample 4   Control            11.53   9.22   9.35    9.13
Sample 5   Control             8.35  10.62  10.25   10.01
Sample 6   Control            11.71  10.43   8.87    9.44
...
``````

You should conduct the modelling independently over each gene and note the model coefficient and P value. Your final list of genes will then be those that pass significance at α=5%. Here is a simple loop to do this:

``````formula.start <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(modeling))
{
formula <- paste(formula.start, colnames(modeling)[i], sep="")
listModels[[i]] <- model
names(listModels)[i]) <- colnames(modeling)[i]
print(summary(model))
}
``````

This will loop over each gene (column) in your data-set and perform the regression independently. In this simple example, each model summary will be printed to screen and each model will be stored in list object.

For further information, take a look at my other threads on creating gene signatures and generally performing regression analyses:

i followed it from the other thread so its irrepective of the data like it can work for both microarray and rna-seq data ? The GLM which you have used to here ..

1

Yes, it can be used for both microarray and RNA-seq data. In both cases, the data should preferably be normally-distributed. Otherwise, you will have to modify the `family` parameter to `glm()`

okay so another question so if i use rna seq data can i used the count normalised data set which the rlog normalised or i will have to plot the data and see if its normally distributed first then proceed?

1

The normalised count data from RNA-seq will not follow a normal distribution - it follows a negative binomial distribution, which resembles a type of Poisson. Technically you can use these and specify a different family (DESeq2 itself just uses a generalised linear model with a different family for the purposes of deriving test statistics via the Wald test).

In your case, I think that it may be better to use the rlog counts from RNA-seq (which should follow a normal distribution but may have 'humps' here and there), and the log2 values from microarray obviously.

``````Error in eval(predvars, data, env) : object 'ARID1ATAF12' not found
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
``````

Im running into this error "ARID1ATAF12" this thing corresponds to my first two column containing genes "ARID1A " and " TAF12 " im not sure what is going wrong i have made my data frame as you suggested

Must be some formatting error... check the contents of your model formula for the `for` loop indices 2 and 3.

On that note, note that my `for` loop goes from 2 to `ncol(modelling)`. This is because my first column is the categorical variable that I am predicting (e.g. case/control). This may not be the case for your data.

``````head(MyData)
CaseControlStatus   ARID1A     TAF12      PHC2     SCMH1     DMAP1   RAD54L   PRKAA2     PRMT6     VPS72
Sample1                WT 2367.372 150.69990 495.92550 841.92010 230.66300 34.59946 33.83058 180.68610 264.49360
Sample2                WT 2522.512  82.90774 199.33136 635.03797 137.59156 35.27989 58.21181 132.29958 170.22546
Sample3                WT 1507.287 114.27604 134.52749 520.75159 112.82951 67.98701 18.80492  82.45234  95.47112
Sample4                WT 2888.628 116.45255 229.88036 609.48542 182.99686 62.00720 18.14845 238.95458 149.72471
Sample5              TEST 1127.155 222.20356  26.06857

formula <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(MyData))
{
formula <- paste(formula, colnames(MyData)[i], sep="")
listModels[[i]] <- model
names(listModels)[i] <-colnames(MyData)[i]
print(summary(model))
}
``````

Error in eval(predvars, data, env) : object 'ARID1ATAF12' not found In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred

have a look

1

Oh, I now see what is happening. You are concatenating more and more text to the formula variable with each loop (most likely my fault as it wasn't specified in the previous code that I pasted).

You could try this:

``````formula.start <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(MyData))
{
formula <- paste(formula.start, colnames(MyData)[i], sep="")
listModels[[i]] <- model
names(listModels)[i] <-colnames(MyData)[i]
print(summary(model))
}
``````

@Kevin thanks a lot these are pretty small things but since I don't have that proper grasp of R so i struggle for all these, now its looping over the whole data set ,how do i filter now based on those observed output lets say i get something even if doesn't make any real biological sense which im not sure would i still consider it or any proper way of filtering to go to the next step of analysis

1

Hi krushnach, yes, this simple code will just output the model. You will have to decide which values from the model that you want to retain. You should look at the information under the 'Value' header here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/glm.html

You can also check what an R object contains using the `str()` function applied to the object. These are then usually access via the use of the dollar (\$), e.g., `model\$coefficients`. In this situation, the summary() function already provides useful information that can then be extracted via array subsetting ( `[ ... ]` ).