Question: Finding the best model for expression data (RNA-Seq counts)
3
User 7754250 wrote:

Hi,

I am analyzing RNA seq data (counts) and trying to study the impact of conservation on gene expression. I have a data frame with one row per gene, and one column for gene expression called "data\$counts" (a vector of length 10,000, with a value for each gene) and another column for gene conservation called "data\$var1" (another vector of the same length, 10,000). I would like to identify the variance explained in gene expression by var1, using regression models, but I can't tell which model fits the data better.

I found many links on here suggesting to use un-transformed count data (not FPKM) and to use a negative binomial distribution (model1).

I also saw suggestions to transform the data to make it normal, and run linear regression. So I tried that too (model2 and model3). However when I plot the histogram none of the distributions look normal:

``````hist(data\$counts) # very skewed

hist(log(data\$counts)) # very skewed

hist(log(data\$counts+1)) # creates two curves/distributions
``````

Since I can't tell with visual inspection (nothing seems to fit a normal distribution), is there a way to tell which model fits the data better using e.g. AIC or likelihoods from the regressions fitted below, or in any other ways?

Model 1:

``````summary(glm.nb(data\$counts ~ data\$var1))
Call:
glm.nb(formula = data\$counts ~ data\$var1, init.theta = 0.6950337825,

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.1858  -1.0122  -0.5362  -0.0004  13.6390

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)        6.90055    0.01359 507.839   <2e-16 ***
data\$var1  0.01474    0.01359   1.085    0.278
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.695) family taken to be 1)

Null deviance: 9431.6  on 7797  degrees of freedom
Residual deviance: 9430.4  on 7796  degrees of freedom
AIC: 122463

Number of Fisher Scoring iterations: 1
Theta:  0.69503
Std. Err.:  0.00959

2 x log-likelihood:  -122457.34300
``````

Model 2:

``````Call:
glm(formula = log(data\$counts + 1) ~ data\$var1, family = gaussian)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-6.3919  -0.6436   0.1400   0.8445   5.6291

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.04901    0.01604  377.12   <2e-16 ***
data\$var1 -0.21405    0.01604  -13.34   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2.006264)

Null deviance: 15998  on 7797  degrees of freedom
Residual deviance: 15641  on 7796  degrees of freedom
AIC: 27563

Number of Fisher Scoring iterations: 2
``````

Model 3:

``````Call:
glm.nb(formula = data\$counts ~ data\$var1, init.theta = 0.7290633088,

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.3386  -1.0042  -0.5105   0.0336  12.7096

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)        6.85552    0.01327  516.70   <2e-16 ***
data\$var1 -0.29469    0.01327  -22.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.7291) family taken to be 1)

Null deviance: 9886.8  on 7797  degrees of freedom
Residual deviance: 9374.1  on 7796  degrees of freedom
AIC: 121964

Number of Fisher Scoring iterations: 1

Theta:  0.7291
Std. Err.:  0.0101
2 x log-likelihood:  -121957.8830
``````
regression rna-seq glm model fit • 1.8k views
modified 22 months ago by i.sudbery10k • written 22 months ago by User 7754250
3
i.sudbery10k wrote:

If I understand your question correctly you are trying to test whether genes that are more highly conserved tend to be more highly (or lowly) expressed. Thus would want to do what is conceptually similar to a linear correlation with conservation on one access and expression in another?

Comparing the expression of two different genes within the same sample is difficult because all sorts of things affect the number of counts a gene receives other than its expression level. This is generally not a problem in differential analysis because we assume these factors are similar in the two conditions (of course this might not be the case, but its seems okay in most cases).

The biggest such factor, is the length of the gene - longer genes gather more counts! It is for this reason that people use length normalised, transformed values, such as TPM or FPKM when comparing genes within samples.

It is generally believed that the distribution of gene expression in a cell is a mixture of two log normals, see Hebenstreit et al 2011. This is of course confused by the zeros. One solution to this would be to use some sort of variance stabilising transform such as VST or rLog (both form the DEseq2 pacakge) and then to calculate TPMs, and take the log. Generally log(x+1) is not a good transform.

Another alternative would be to use negative binomial regression, but including the gene length as an "offset". An "offset" is an explanatory variable whose coefficient is forced to be 1, and is generally used as a normaliser. In poisson-like models (including nb), it acts like a divisor to normalise for gene length (this is how both DESeq2 and edgeR deal with library depth in their statistical models). I believe this is what is happening in Hafemeister et al.

As for judging the models, you might want to look at somewhere with a more statistical focus, like Cross Validated, but I would check the AICs and check weather the residuals are of roughly constant size across the range of the independent variable. You could plot observed against predicited counts/TPM and look how well the model looks like its fitting by eye.

Thank you i.sudbery! That is very clear. Yes, I was actually including gene length as covariate to control for that, so doing: expression ~ conservation + gene_length. I don't know how that would be different but will read the paper you linked. I thought I needed samples to run DESeq2 or edgeR (while I don't have any samples data, but just gene-level variables, so I would not have any "coldata: a table with information about the samples"). Can I use DESeq2 without a case/control set up?Maybe you are suggesting to just use the normalization step in this package. From your answer, it sounds like I should first transform count data using DESeq2 (the histogram of vsd still looks bimodal), then get a TPM from this (how? just dividing by gene length? Would it be something like: TPM = 10^6 * (data\$counts / data\$gene_length) * sum(data\$counts / data\$gene_length) ?), then log transform. I am reading the manual for DESeq2 now in case I am misunderstanding this completely.

I would either fit a negative binomial, using gene length as an offset (not a co-variate, you need to force the coefficient to 1). Unfortunately getting the gene length is not entirely trival as you need to know which transcripts are expressed. If you can run the original data through salmon, that would give you an effective length.

Or use the vst/rlog transform from DESeq2. You need samples to do a differential expression analysis, but I don't think you need samples to apply VST/rlog, which can be applied to a raw matrix without creating a DESeqDataSet. Once you've done the transformation, then you can convert to TPM.

To convert to TPM, first divide by gene length to get an FPK. Now sum all the FPKs and divide each FPK by the sum of FPKs and multiply by 1,000,000.

Ok that makes sense! If I plot the log (FPKM), it finally looks normal! So the final answer is to use this in a linear regression model. I do think I need to do log(FPKM +1), for the log to work. I just have one more question. I imagine the extra step of "dividing each FPK by the sum of FPKs and multiply by 1,000,000" is for normalizing for sequencing depth. What is the logic of using 1,000,000? I am asking because I have also seen other values used instead of the 1million (granted this was on ATAC-Seq data and not on expression), for example: normalization = log2( counts/sum(counts) x 100,000,000 +1). Just wondering what that multiplication is there for, and whether it is really necessary in this scenario. Thank you!

1

Depending on which normalisation you've used (VST or rlog), I was under the impression that there should be no zeros, and thus no need for +1. In fact rlog should is already adding be a pseudocount and taking the log.

1,000,000 is chosen because it it the value that puts the numbers and a "human scale". You could choose another scale, but TPM is what people understand.

1

Yes - I believe the rlog is mostly similar to log2(FPKM+0.1) values.

So, it is more like the log2-transformed values (rather than the linear values that need a rounding factor prior to a log-transformation). There are some examples of the comparisons between rlog and log2(counts/CPM + X) values here: http://cdwscience.blogspot.com/2019/02/variance-stabilization-and-pseudocounts.html

2
Charles Warden8.0k wrote:

Since it seems like there is some confusion about the question. So, I don't know if this will actually help, but I thought I should also bring up another point where I was confused:

I would typically expect a data frame or matrix of counts and a vector as a dependent variable.

In the following example, gene1 would not be differentially expressed and gene2 would be differentially expressed:

``````cat = c("Trt","Trt","Trt","Ctrl","Ctrl","Ctrl")
var.mat = data.frame(gene1=c(1,2,3,1,2,3),
gene2=c(1,1,1,3,3,3))
``````

My original interpretation was that you were trying to provide a representative example, but you wanted to ask about selecting a model for all ~20,000 genes (and I was saying the "best" model may not be the same for all genes). While I don't want to emphasize that too much, I also want to see if maybe I can bring up another discussion point that may help make things more clear between you and Kevin.

Namely, if I were to try and apply the function for Model #2, I would get an error with the example above:

``````> glm(formula = log(var.mat + 1) ~ cat, family = gaussian)
Error in model.frame.default(formula = log(var.mat + 1) ~ cat, drop.unused.levels = TRUE) :
invalid type (list) for variable 'log(var.mat + 1)'

> glm(formula = log(as.matrix(var.mat) + 1) ~ cat, family = gaussian)
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
``````

If `data\$counts` was a data frame or matrix within a list, I think you would need to use some sort of `apply` function for the variable that you are trying to compare (unless I am missing something obvious).

So, here is the question that I hope can clear up some confusion: What are the dimensions of the variables that you are working with? Is `data\$counts` one gene, or a table with multiple genes (where you may be interested in the distribution of genes for one sample versus another sample)? I thought your reply indicated that you were interested in the later (distribution of all genes in different samples), but I think that makes the code that you provided a little confusing.

Thanks Charles. data\$counts is a vector of 20,000 genes. I think the replies are completely misunderstanding my question. I should have clarified this from the beginning. Should I start a new question?

1

You already have two likes for your question, which is more than the last topic that I created on Biostars. You can also be penalized for cross-posting. So, I think we should focus on editing your question to make it as clear as possible - the most relevant answers should have the most votes, causing you to have to scroll down to see the less relevant answers (ideally).

Also, I think this is progress. Now that I know that this is a vector of counts for a sample (one entry per gene or transcript), what is is `data\$var1`?

I thought it was the categorical variable for differential expression, but it must be something that relates to genes. There was somebody in the Biostars moderator discussion group that thought you might have been trying to predict low/medium/high gene expression.

Is this true? If so, I think there is at least one person who understood your question.

1

Ok... so all the variables are per gene, including data\$var1, which is a vector of size 20,000. I have many of these variables (vectors of 20,000 size), and my biological question is which of these have an influence on counts RNA-Seq data, and if they do (p-value is significant), what is the % variance explained of count in RNA-Seq data from each of these variables. My statistical question is how to best choose which model the RNA-Seq are following. Should counts directly be used, or a transformation, or FPKM, and which model should be used, linear regression or negative binomial or Poisson or others. Nothing seems to fit a normal distribution.

1

Can you please provide some examples of the variable `data\$var1` you are using for comparison? I am assuming there are actually other variables named var2, var3, etc., but you were using var1 as an example. Otherwise, can you describe what var1 represents?

In addition to the other person's guess of expression status (low/medium/high, or perhaps expressed / not-expressed, or expressed / background), you could have variables like GO category, conservation, etc. However, I apologize that I am still not completely clear about your goals.

I think it would be a good idea to have some sense of what you could do biologically to assess the statistical result. So, even if you have a way to define a good fit (in the statistical sense) and you get a result with a low p-value, there is still the question of how you would apply that information. For example, if you were talking about gene tests, I'd say identifying the candidate (s) for characterization / validation is what is most important (even if there isn't actually a perfect statistical test - just possible strategies may or may not work for your particular project).

1

Yes exactly, var1 is conservation score for each gene. I have other variables like var2 that is for example coverage. Again, I do not want to find particular genes, rather I just want to know whether the var1, var2, etc. features are important or not.

2
Charles Warden8.0k wrote:

I have a couple thoughts:

1) Make sure you visualize the association between gene expression and your variable.

For example, I had a paper where I did something similar, but I provided some re-analysis ~10 years later:

https://github.com/cwarden45/Coding-fRNA-Comment/blob/master/comment.pdf

https://github.com/cwarden45/Coding-fRNA-Comment/blob/master/Figures.pdf

There are two things that I think could have improved the original paper that may be relevant to your question:

a) Try to think of creative questions that do not fit exactly what was done in previous papers (which I would argue is what I did with the re-analysis, such as comment Figure C1).

b) Try to identify some functional candidates that you can characterize experimentally (which I didn't do, and can't really currently do, but that should have improved the impact and confidence of the paper).

2) Perhaps you should mention you are trying to study the impact of conservation on gene expression in the question?

I'll respond to the Biostars moderator discussion group, and maybe this will help get other responses.

NOTE: Also, I apologize for the confusion, but this was a comment that I moved to an answer (but the person asking the question thought this part was most helpful).

1

Charles, thank you, the paper and links helped so much. Please modify (or let me know how best to modify) the question, so that it is clearer and can help others (can also remove this and put it as an answer?). Thank you for all the help!

You are welcome.

While I technically have the ability to edit your post, you really should be doing that instead of me.

At the bottom of the grey box (near "add reply"), there is an `edit` button. You can also add a note at the end if you wanted to make clear you changed it due to a response from me (so, people don't get confused when I start out not knowing what the var1 represents at the beginning of my answer).

Also, as requested, I will move this to an answer (and edit the beginning to be a little less confusing).

0
Charles Warden8.0k wrote:

I expect the best modeling strategy probably varies by gene, but you can't really visually inspect all genes (and, if you most frequently have duplicates or triplicates, I expect there are limits to the parameter estimation). I would also guess that if you picked a strategy and a measure of fitness for the model, the "optimal" model would probably vary between genes (and the appropriateness of the overall model selection would probably depend upon which genes are most biologically relevant to a given experiment).

I think you are trying to bring up a different point, so I won't emphasize this too much: however, I typically do some testing with different methods for each project. While hard to define precisely, I would do things like i) compare the size of gene lists, ii) visually inspect heatmap with differentially expressed genes (to check things like clustering of replicates), iii) see if functional enrichment can inform the use of upstream strategies (such as the p-value method, differential comparison strategy or strategies), and iv) (if available) check the status of genes that known to change between the conditions that you are comparing.

Hi Charles, thanks but I think there is a misunderstanding: I am not trying to find a model per gene, or particular genes, I am looking at the gene expression across genes to find what influences gene expression overall....

That is different than what I was talking about. I apologize for not being able to provide better feedback for your specific question.

to find what influences gene expression overall.

...can you elaborate? If you want to find out what is driving the expression, you can check how certain genes, e.g., differentially expressed ones, cluster your samples and how this aligns to metadata, like clinical data, for example. These can be easily done in heatmaps with `row` or `col` colour bars. You can also run regression models against the metadata that you have.

PCA is another method that you could use to determine what drives variation in the expression profiles of your samples.

In your examples, you are using a negative binomial regression and then a 'standard' regression function. Why not just run your data through DESeq2, which will properly normalise your counts and fit its own negative binomial model while adjusting for dispersion. You do not do this in your models, above.

Hi Kevin,

Thanks for your reply, but I am not looking to do a differential analysis like in DESeq2 to find differentially expressed genes, neither I am looking to find particular genes or clusters, but I would like to know the determinants of gene expression overall, fitting a model where the outcome is gene expression of all genes with counts in RNA-Seq data.

What is biological question that you are asking? Maybe you are referring to a type of network analysis (?)

Hi... my biological question is "what is the variance explained in gene expression by other variables" (see original posting...). My statistical question is "how to choose between models/how to tell what is the best fitting model for an outcome?". Thank you.

..but are you hoping to fit a single model that includes all genes in the same model? Most (or all?) analysis tools test the genes independently. If you want to look at the effect of each gene against all other genes, then you may want to do a lasso penalised regression, or Bayesian regression with input priors.

There are various ways to test models, but they depend on the type of model. For example, bootstrapping and cross-validation are routinely employed.

Are you hoping to find variance explained or genes that are highly, moderately, or lowly expressed?

If it is variance and variation, then PCA is the technique.

Thank you Kevin, I am hoping to look at all genes in the same model, not testing each gene independently (so not one gene against all others, and I am not looking to contrast any one gene's expression, but all genes).

That is a big model... The Bayesian regression approach may work, but I am not sure. I have done it before with a few hundred variables. If not, lasso penalised regression already does it.

There are likely some other 'machine learning' approaches, but I have no fully explored those yet.

Why are you suggesting lasso penalised regression? How would I test this model in comparison to the other three I have in the question? I still don't understand how I would choose between those models.

In your models, you are just regressing the variance to the counts. This has not already been done by Michael Love and various others?

Can you please provide a reference so I can follow what they did and their model maybe?

Are you a PhD student? - take a look here: https://www.ncbi.nlm.nih.gov/pubmed/25516281