Hi,

Let's say I have two mutation matrices, G and S, representing the number of germline and somatic mutations, respectively, in x samples. My goal is to detect genes harboring co-occurring germline-somatic mutations in the same samples.

--

**Edit 20191129 (answer to @Jean-Karim Heriche comment) :**

Samples are human paired normal-cancer samples. In total there are ~450 samples (350 WES and 100 WGS). Variant calling and filtering were performed separately for WES and WGS. We defined germline pathogenic mutations using VEP impact (HIGH), CADD score (min 20) and a gnomAD AF < 0.01. Somatic mutations were called using Mutect2 and annotated using Funcotator (only non-synonymous mutations were kept).

--

How would you assess the significance of co-occurrence in such a case?

My first idea was to "simply" perform a Fisher's exact test for each possible pair of genes.

Example with 5 samples having both a germline mutation in gene A and a somatic mutation in gene B:

```
                              Gene A:
                              # samples with germline mutation
                              ---------------------
                              |    Y    |    N    |
---------------------------------------------------
Gene B            |     Y     |    5    |   20    |
# samples with    ---------------------------------
somatic mutation  |     N     |   12    |   300   |
```

In R

```
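# matrix() fills column-wise, so this is the transpose of the table above;
# Fisher's exact test gives the same result either way.
fisher.test(matrix(c(5,20,12,300),nrow=2))
# Output:
#   Fisher's Exact Test for Count Data
#   data:  matrix(c(5, 20, 12, 300), nrow = 2)
#   p-value = 0.005023
#   alternative hypothesis: true odds ratio is not equal to 1
#   95 percent confidence interval:
#    1.552992 21.327962
#   sample estimates:
#   odds ratio
#     6.185251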
```
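For reference, a minimal sketch of how such a 2x2 table could be built directly from the two matrices. It assumes G and S are 0/1 matrices with samples in rows and genes in columns; that orientation, the toy data and the helper name `cooc_table` are assumptions made for illustration, not the real data set.

```r
# Toy data (hypothetical): rows = samples, columns = genes, entries 0/1.
set.seed(1)
n_samples <- 50
genes <- c("A", "B", "C")
G <- matrix(rbinom(n_samples * length(genes), 1, 0.2), nrow = n_samples,
            dimnames = list(NULL, genes))  # germline
S <- matrix(rbinom(n_samples * length(genes), 1, 0.3), nrow = n_samples,
            dimnames = list(NULL, genes))  # somatic

# 2x2 table: somatic status of one gene vs germline status of another.
cooc_table <- function(G, S, germ_gene, som_gene) {
  # Fixed levels keep the table 2x2 even if one category is empty.
  yn <- function(x) factor(ifelse(x > 0, "Y", "N"), levels = c("Y", "N"))
  table(somatic = yn(S[, som_gene]), germline = yn(G[, germ_gene]))
}

tab <- cooc_table(G, S, germ_gene = "A", som_gene = "B")
res <- fisher.test(tab)
res$p.value
```

Looping this over every (germline gene, somatic gene) pair gives one p-value per pair, so some multiple-testing correction (for example `p.adjust(p, method = "BH")`) would probably be needed on top.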

However, this approach does not take into account gene length (the longer the gene, the more mutations by chance, and hence the more co-occurrences by chance) or the mutation rate per sample (the higher the mutation rate in a sample, the more mutations by chance, and hence the more co-occurrences).

--

So I was thinking of another approach based on permuting both matrices.

Let's say `C(A,B)` = number of samples with co-occurring mutations in gene A (germline mutation) and gene B (somatic mutation).
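As a small illustration of this count, assuming both matrices are 0/1 with samples in rows and genes in columns (toy values, made up), `C` can be obtained for all gene pairs at once with a matrix product:

```r
# Toy 0/1 matrices (hypothetical): 4 samples x 2 genes, filled column-wise.
G <- matrix(c(1, 0, 1, 1,    # germline, gene A
              0, 1, 0, 0),   # germline, gene B
            nrow = 4, dimnames = list(NULL, c("A", "B")))
S <- matrix(c(0, 0, 1, 1,    # somatic, gene A
              1, 1, 1, 0),   # somatic, gene B
            nrow = 4, dimnames = list(NULL, c("A", "B")))

# C[i, j] = number of samples with a germline mutation in gene i
#           and a somatic mutation in gene j (t(G) %*% S).
C <- crossprod(G, S)
C["A", "B"]  # 2
```

If the matrices hold mutation counts rather than 0/1 flags, they would first need to be binarised (e.g. `G > 0`) for this product to count samples.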

At each iteration I shuffle the germline and somatic matrices separately, keeping the number of mutations per gene and per sample the same as in the original matrix.

Keeping the total number of mutations per gene the same allows me to take gene length into account. Keeping the total number of mutations per sample the same allows me to take the mutation rate per sample into account.
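One possible way to do such a shuffle is swap-based randomisation: repeatedly pick two rows and two columns and, if the 2x2 submatrix is a "checkerboard" (`1 0 / 0 1` or `0 1 / 1 0`), flip it. Each flip leaves every row sum and column sum unchanged. The sketch below assumes a 0/1 matrix; the function name `swap_shuffle` and the default number of attempts are arbitrary choices for illustration, and how many swaps are needed for good mixing would have to be checked on the real data.

```r
# Shuffle a 0/1 matrix while keeping every row sum and column sum fixed.
# n_tries is the number of attempted swaps (arbitrary default, an assumption).
swap_shuffle <- function(M, n_tries = 20 * sum(M)) {
  for (i in seq_len(n_tries)) {
    rows <- sample(nrow(M), 2)
    cols <- sample(ncol(M), 2)
    sub <- M[rows, cols]
    # A swap is possible only for the patterns [1 0; 0 1] or [0 1; 1 0].
    if (sub[1, 1] == sub[2, 2] && sub[1, 2] == sub[2, 1] &&
        sub[1, 1] != sub[1, 2]) {
      M[rows, cols] <- 1 - sub  # flip the checkerboard
    }
  }
  M
}

set.seed(42)
G <- matrix(rbinom(20 * 6, 1, 0.3), nrow = 20)  # hypothetical germline matrix
G_shuffled <- swap_shuffle(G)
all(rowSums(G_shuffled) == rowSums(G))  # TRUE
all(colSums(G_shuffled) == colSums(G))  # TRUE
```

The same function would be applied independently to the somatic matrix at each iteration.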

At each iteration I compute `C'(A,B)` = number of samples with co-occurring mutations in gene A (germline mutation) and gene B (somatic mutation), using the shuffled germline and somatic matrices.

I repeat this N=10000 times, which gives 10000 `C'(A,B)` values. The distribution of `C'(A,B)` thus represents the background number of samples with co-occurring mutations in gene A and B.

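I can then compute an empirical `p-value = sum(C(A,B) <= C'(A,B)) / N`, i.e. the fraction of permutations whose co-occurrence count is at least as large as the observed one.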
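A minimal sketch of that last step, given the observed count and a vector of permuted counts (the values below are made up, with N = 10 to keep it readable). The `add_one` option is not part of the formula above: it is a commonly used variant that avoids reporting a p-value of exactly 0 when no permutation reaches the observed count.

```r
# Empirical p-value from an observed count and a vector of permuted counts.
empirical_p <- function(observed, permuted, add_one = FALSE) {
  n_extreme <- sum(permuted >= observed)
  if (add_one) {
    (n_extreme + 1) / (length(permuted) + 1)
  } else {
    n_extreme / length(permuted)
  }
}

permuted <- c(0, 1, 1, 2, 3, 0, 1, 2, 1, 0)  # made-up C'(A,B) values, N = 10
empirical_p(observed = 2, permuted)                  # 0.3
empirical_p(observed = 2, permuted, add_one = TRUE)  # 4/11, about 0.364
```

With N = 10000 the smallest non-zero p-value is 1e-4, which matters once many gene pairs are tested and corrected for multiple testing.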

Does this method make sense?

**Subsidiary question**: Do you know of any known/published methods to compute germline/somatic co-occurrence?

Thanks.

To me, there's not enough information on the data. What are the samples? Do they always have gene A and B or do they cover a random collection of genes? How is a mutation defined? Is it always a single base change or can it be anything?

@jean-karim : I just edited my post ;)

Are you interested in co-occurrences within the same gene or within the same sample (i.e. possibly in different genes)? Also have you looked at the R package DISCOVER?

EDIT: Fixed link to package (CRAN already has a package called discoveR)

Within the same sample. Thank you Jean-Karim, I had already checked the DISCOVER package (FYI your link seems to redirect to the wrong package; I guess this is the tool you are suggesting: https://ccb.nki.nl/software/discover/ ). However, DISCOVER only takes one matrix as input. In my case the subtlety is that I have two matrices (germline and somatic).

Indeed that was the wrong package. Fixed it now. Can't you combine the two matrices? It seems to me that germline or somatic could be treated as an additional attribute. Another approach would be to go with something along the lines of mutual information.