I have identified a number of structural variant breakpoints across multiple tumour-normal comparisons.
I want to ask the question "How enriched for a particular genomic feature is my set of breakpoints?"
For example, across all samples we find a total of 200 breakpoints, and 15 of these fall in the same class of genomic feature (e.g. exons).
If the genome is 137547960 bp long, and the fraction of the genome that is exonic is about 21.9% (30095000/137547960), then I would expect to find about 43.8 of the 200 breakpoints in exons ( 200*(total_exon_length/total_genome) ) across all my samples. That we find only 15 suggests that this feature class is underrepresented in our breakpoint set.
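To make the arithmetic concrete, here is a minimal Python sketch of the calculation, using only the numbers quoted in this post. The variable names and the `binom_cdf` helper are my own, for illustration. The lower-tail binomial probability is just one candidate way to put a number on "fewer than expected"; it assumes each breakpoint lands independently and uniformly at random along the genome, which may well not hold for real structural variants.

```python
from math import comb

# Numbers taken from the question
genome_length = 137_547_960   # total genome length in bp
exon_length = 30_095_000      # total exonic length in bp
n_breakpoints = 200           # breakpoints across all samples
n_in_exons = 15               # breakpoints observed in exons

# Fraction of the genome that is exonic, and the expected exonic count
p_exon = exon_length / genome_length
expected = n_breakpoints * p_exon


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) (hypothetical helper)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# ASSUMPTION: breakpoints are independent and uniform over the genome
p_lower = binom_cdf(n_in_exons, n_breakpoints, p_exon)

print(f"exonic fraction:       {p_exon:.4f}")
print(f"expected in exons:     {expected:.1f}")
print(f"P(X <= {n_in_exons}) under H0: {p_lower:.3g}")
```

If SciPy is available, `scipy.stats.binomtest(15, 200, p_exon, alternative="less")` should give the same one-sided probability without the hand-rolled helper.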
Is this the right way of going about this sort of test? What is an appropriate statistic to use here: chi-squared or Fisher's exact test?