What Statistical Test should I use to compare count data (mutation gene count) using two grouping variables?
Entering edit mode
23 months ago

My data represents the count of mutations by gene in two groups cases and ctrl. I would like to understand if there is a significant difference between the number of mutations for the same gene between the two groups (gene-wise comparison). In short, I have to compare means grouping by two variables (gene and Type), but in my case I don’t think anova is appropriate since count data should follow a Poisson distribution.\ Here a small subset of my data, as you can see the samples size (cases and ctrls is different).

gene   Samples_ID value  Type
AB       1    28 Cases
AB     100    22 Cases
AB     101    36 Cases
AB     102    57 Cases
AB     105    29 Cases
AB     106    25 Cases
AB     108    23 Cases
AB 4928    18  Ctrls
AB 4929    18  Ctrls
AB 4930    24  Ctrls
AB 4931    20  Ctrls
AB 4932    25  Ctrls
AB 4933    15  Ctrls
AB 4934    25  Ctrls
AB 4935    22  Ctrls
AB 4936    30  Ctrls
AB 4937    15  Ctrls
AB 4938    18  Ctrls
AB 4939    21  Ctrls

Where "gene" is the gene name, "Samples_ID" represent the patient, "value" is the number of mutation for the given gene, "type" is the group.\

I have tried a generalized linear model as follows using R because i found many similar questions:

fit <- glm(value ~ gene + Type, data = file, family = poisson())

but I'm not convinced it's the right way and I don't know how to proceed. My desired output would be a table with gene name as row, the p-value, adjusted p-value in the columns, in this way I can understand if for a given gene, the mean number of mutations differs in a significant way between groups (cases and ctrls).

PS Please, give me a simple explanation, I'm not a statistician and my background about statistic is ~ 0.
Thank you in advance.

count-data SNV mutation-count mean-comparison regression • 1.1k views
Entering edit mode
23 months ago

I'm not an expert in identifying significantly mutated genes, nor what common bias' might affect it, but A poisson generalized linear model seems a reasonable approach (a slightly more flexible alternative might be a negative binomial), but with the code you have there you are fitting a single model to all genes simultaneously, and you want to do a separate model for each gene if you want a separate p-value for each gene.

The broom package was designed to do this sort of thing.

What you basically do is someting like:


per_gene_models <- file %>%
   group_by(gene) %>%
   do(tidy(glm(value~Type, data=., family=poisson()) %>%

This will give you a dataframe that has two rows per gene, one for each of the terms in the model (effectively one for the number of mutations in the reference Type (I think that will be Cases as you have it set up now) and one for the difference between cases and controls.

As its the differences you are interested in, you need to subset to those rows. As you are doing many tests (one per gene), you also need to correct for multiple testing, by calaculating an adjusted p.value. I'd normally use BH FDRs for this, but others might use bonferroni I guess:

type_diff <- per_gene_models %>%
   filter(term == "Type") %>%
   mutate(padj = P.adjust(p.value, method="BH")

A quick word of warning. In downstream analyses you need to take account of the fact that longer genes are way more likely to be mutated than short genes. If you are looking at GO analysis or something like that, this is likely to be a major bias in your analysis.

Entering edit mode

Thank you very much for the helpful answer. That was exactly what I needed.
Regards the biases that could influence these results. This subset of data come from RNA seq, the mutations were filtered; they (or at least most of them) represent RNA editing events (this to specify they are not random mutational events but it's even true that as longer the gene is higher could be the number of editable sites).
So I agree with you about the two main source of bias.
But in this specific case, the coverage could be not represent a problem since the considered mutations came from the same regions covered at least by 50 reads across the samples. Regarding the gene length, it could be not a problem if I would compare the same gene between the two conditions. Vice versa, it could bias my results in the case I would like to compare genes between each other finding for example, the most mutated gene.
In that case, it could be possible to bypass this issue by dividing the number of mutations per transcript over the length of the corresponding transcript (e.g retrieving the transcript length from sequencing data) and then using another the statistical approach, maybe using beta regression model. I don't know if the last approach is right, it's the first idea off the top of my head.

Entering edit mode

Vice versa, it could bias my results in the case I would like to compare genes between each other finding, for example, the most mutated gene.

or if you do an enrichment analysis: longer genes are more likely to be found to be significantly mutated because there is more power to detect a significant difference in this case. Because of this, there dividing by the length of the gene won't help. You'll still have more power (or conversely, the estimated standard deviation will be lower for the longer genes). The correct way to deal with this is to account for the gene length bias explicitly in the downstream tests - so if you do an enrichment, use a tool that accounts for gene length, such as GOseq.


Login before adding your answer.

Traffic: 1813 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6