**230**wrote:

Hi,

I am analyzing RNA seq data (counts) and trying to study the impact of conservation on gene expression. I have a data frame with one row per gene, and one column for gene expression called "data$counts" (a vector of length 10,000, with a value for each gene) and another column for gene conservation called "data$var1" (another vector of the same length, 10,000). I would like to identify the variance explained in gene expression by var1, using regression models, but I can't tell which model fits the data better.

I found many links on here suggesting to use un-transformed count data (not FPKM) and to use a negative binomial distribution (model1).

I also saw suggestions to transform the data to make it normal, and run linear regression. So I tried that too (model2 and model3). However when I plot the histogram none of the distributions look normal:

```
hist(data$counts) # very skewed
hist(log(data$counts)) # very skewed
hist(log(data$counts+1)) # creates two curves/distributions
```

Since I can't tell with visual inspection (nothing seems to fit a normal distribution), is there a way to tell which model fits the data better using e.g. AIC or likelihoods from the regressions fitted below, or in any other ways?

Model 1:

```
summary(glm.nb(data$counts ~ data$var1))
Call:
glm.nb(formula = data$counts ~ data$var1, init.theta = 0.6950337825,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1858 -1.0122 -0.5362 -0.0004 13.6390
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.90055 0.01359 507.839 <2e-16 ***
data$var1 0.01474 0.01359 1.085 0.278
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.695) family taken to be 1)
Null deviance: 9431.6 on 7797 degrees of freedom
Residual deviance: 9430.4 on 7796 degrees of freedom
AIC: 122463
Number of Fisher Scoring iterations: 1
Theta: 0.69503
Std. Err.: 0.00959
2 x log-likelihood: -122457.34300
```

Model 2:

```
Call:
glm(formula = log(data$counts + 1) ~ data$var1, family = gaussian)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.3919 -0.6436 0.1400 0.8445 5.6291
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.04901 0.01604 377.12 <2e-16 ***
data$var1 -0.21405 0.01604 -13.34 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 2.006264)
Null deviance: 15998 on 7797 degrees of freedom
Residual deviance: 15641 on 7796 degrees of freedom
AIC: 27563
Number of Fisher Scoring iterations: 2
```

Model 3:

```
Call:
glm.nb(formula = data$counts ~ data$var1, init.theta = 0.7290633088,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3386 -1.0042 -0.5105 0.0336 12.7096
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.85552 0.01327 516.70 <2e-16 ***
data$var1 -0.29469 0.01327 -22.21 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7291) family taken to be 1)
Null deviance: 9886.8 on 7797 degrees of freedom
Residual deviance: 9374.1 on 7796 degrees of freedom
AIC: 121964
Number of Fisher Scoring iterations: 1
Theta: 0.7291
Std. Err.: 0.0101
2 x log-likelihood: -121957.8830
```