4 months ago

RK

I have a contingency table that counts mutations in the foreground and background parts of a genomic region, and I am trying to find out whether the result is significant. I randomly shuffled the regions using bedtools shuffle to generate 100 new tables. However, I am not sure how to compare the observed data with the null distribution. Should I compare the observed test statistic with the simulated ones?

These are the steps I followed:

- Compute the chi-squared test statistic for the real-data contingency table.
- Build the null distribution by computing the same test statistic for each of the simulated contingency tables.
- Count the simulated tables whose test statistic is as extreme as or more extreme than the real one, i.e. the number of simulated test statistics that are greater than or equal to the test statistic for the real-data table.
- Calculate the p-value by dividing that count by the total number of simulated contingency tables.
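The steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the poster's actual pipeline: the counts, the four "shuffled" tables, and the function names `chi2_statistic` and `empirical_p_value` are all made up for the example. The optional `add_one` switch is an assumption that goes beyond the steps as listed; it gives the commonly recommended (count + 1) / (n + 1) estimate, which never returns a p-value of exactly zero.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table.

    Assumes every expected count is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def empirical_p_value(observed_stat, null_stats, add_one=False):
    """Fraction of null statistics that are >= the observed statistic."""
    extreme = sum(1 for s in null_stats if s >= observed_stat)
    if add_one:
        # Variant that counts the observed table as one more draw from the null.
        return (extreme + 1) / (len(null_stats) + 1)
    return extreme / len(null_stats)


# Hypothetical counts: rows = foreground/background,
# columns = mutated/unmutated trinucleotides.
observed_table = [[30, 970], [10, 990]]

# Hypothetical stand-ins for the tables obtained after shuffling the regions
# (the question uses 100 of them; four are enough to show the mechanics).
shuffled_tables = [
    [[20, 980], [20, 980]],
    [[22, 978], [18, 982]],
    [[31, 969], [9, 991]],
    [[19, 981], [21, 979]],
]

observed_stat = chi2_statistic(observed_table)
null_stats = [chi2_statistic(t) for t in shuffled_tables]

p_plain = empirical_p_value(observed_stat, null_stats)
p_add_one = empirical_p_value(observed_stat, null_stats, add_one=True)
print(observed_stat, p_plain, p_add_one)
```

Here one of the four made-up shuffled tables is at least as extreme as the observed one, so the plain estimate is 1/4 and the add-one estimate is 2/5.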

Thanks for your help in advance!

What is the structure of your contingency table? If it has foreground and background as one dimension and e.g. samples as the other dimension, and you want to know whether the mutation counts per sample differ between foreground and background (i.e. the null hypothesis is that the rows and columns are independent), then just do a chi-squared test. In R, this is just

`chisq.test(contingency.table)`

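What you're doing is estimating the p-value empirically by resampling (strictly speaking a permutation or Monte Carlo test rather than a bootstrap), which, for a sufficiently large number of samples under the null hypothesis, should give you approximately the same result. Note that with only 100 shuffled tables the smallest non-zero p-value you can obtain this way is 0.01.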

It's a 2×2 contingency table. The rows are foreground and background, and the columns are counts of mutated trinucleotides and unmutated trinucleotides.

Thank you

So just do a regular chi-squared test as I explained above. The null hypothesis is that the proportion of mutated trinucleotides is the same in the foreground as in the background. Rejecting the null with a low p-value means you're convinced that there is an association between mutated/unmutated status and foreground/background, and you'll know the direction of this association by looking at the table of residuals (i.e. subtracting the expected count from the observed count in each cell).
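To make the expected counts and residuals concrete, here is a small sketch in plain Python. The counts and the function name `chi2_test_2x2` are hypothetical, and the statistic is computed without a continuity correction. R's `chisq.test` applies Yates' correction to 2×2 tables by default, so its numbers would only match with `correct = FALSE`. Also note that the `residuals` element R returns holds Pearson residuals, (observed - expected) / sqrt(expected), whereas the sketch reports the raw differences described above.

```python
import math


def chi2_test_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    Returns (statistic, p_value, expected, residuals), where residuals are
    the raw differences observed - expected. No continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    residuals = [
        [table[i][j] - expected[i][j] for j in range(2)] for i in range(2)
    ]
    stat = sum(
        residuals[i][j] ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )

    # A 2x2 table has 1 degree of freedom, and for 1 df the upper tail of the
    # chi-squared distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected, residuals


# Hypothetical counts: rows = foreground/background,
# columns = mutated/unmutated trinucleotides.
table = [[30, 970], [10, 990]]
stat, p_value, expected, residuals = chi2_test_2x2(table)

print("statistic:", stat)
print("p-value:", p_value)
print("expected:", expected)
print("residuals:", residuals)
```

With these made-up counts the foreground/mutated cell has a positive residual (30 observed against 20 expected), which would indicate an excess of mutated trinucleotides in the foreground.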

Thank you so much