Question

Making use of relative density of genomic features between gene sets

10 months ago

Charles Murtaugh ▴ 50

I am looking at a particular genomic feature in two sets of genes: set A is a positive control set, where I know this mark is overall enriched in the genomic DNA of these genes, and set B is a (much larger) negative control set, where it occurs at a lower, background frequency. I wanted to see if the distribution of this feature within each genomic region differs between set A and set B, and below I have plotted the density of this feature along the length of all genes in each set, from 0-100% of the transcribed gene length, along with 4 kb upstream and downstream (gray shading). Set A is red, set B is blue.

[Figure: density of the feature along gene length, set A (red) vs. set B (blue)]

The distributions are different (and highly significantly so, by Kolmogorov-Smirnov test): the feature is uniformly distributed in set B, but shows a 3' (rightward) bias in set A, with almost no marks upstream of the transcribed region and an increased density of marks at and beyond the end of the transcribed region.

There are many genes in "set C," not plotted, for which we aren't able to assign membership in set A or B based solely on the number of marks per gene -- I would like to be able to use the information shown here as part of a classifier approach, weighting the value of marks in a new gene based on where they fall in its length. Clearly, a mark upstream of the TSS should be discounted for membership in set A, while a mark near the 3' end should be given greater weight. What I'd like advice on is the best way to extract quantitative information from the density comparison I have performed. Would it make sense to calculate the relative density at a given position here, and then apply that directly as a weight to marks found in other genes? Below is a plot of the relative density, calculated (in R) as density(setA$pos)$y / density(setB$pos)$y, after calling density with identical parameters for the two sets.

[Figure: relative density (set A / set B) along gene length]
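For anyone reproducing the relative-density calculation, here is a minimal sketch in R. It assumes positions are stored as percent of gene length with the flanks rescaled onto -20 to 0 and 100 to 120; that coordinate system, the simulated data, the bandwidth of 5 and the helper name mark_weight are all illustrative assumptions, not the setup described in the question. The one point it is meant to show is that density() picks a data-dependent bandwidth and evaluation range by default, so the ratio is only meaningful position by position if bw, from, to and n are passed explicitly to both calls.

```r
# Simulated stand-ins for the real data (assumption): mark positions as
# percent of gene length, flanks mapped to [-20, 0) and (100, 120].
set.seed(1)
setA <- data.frame(pos = pmin(pmax(rnorm(500, mean = 80, sd = 30), -20), 120))
setB <- data.frame(pos = runif(5000, min = -20, max = 120))

# Share the evaluation grid and a numeric bandwidth across both calls;
# the defaults depend on each data set and would give mismatched grids.
grid_from <- -20
grid_to   <- 120
n_grid    <- 512
bw        <- 5   # hypothetical bandwidth, in percent of gene length

dA <- density(setA$pos, bw = bw, from = grid_from, to = grid_to, n = n_grid)
dB <- density(setB$pos, bw = bw, from = grid_from, to = grid_to, n = n_grid)
stopifnot(isTRUE(all.equal(dA$x, dB$x)))  # same x grid for both sets

# Floor the denominator so sparse regions of set B cannot blow up the ratio.
eps   <- 1e-6
ratio <- dA$y / pmax(dB$y, eps)

# Interpolating lookup: position -> weight (rule = 2 clamps outside the grid).
mark_weight <- approxfun(dA$x, ratio, rule = 2)

# Weights for marks at three example positions in a new gene
mark_weight(c(-10, 50, 95))
```

The ratio is noisy wherever either density is estimated from few marks, so it may be worth checking how much the weights change with the bandwidth before relying on them.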

Another approach I've considered is to break the distribution into bins, like the histogram above, identify individual bins where the relative frequency significantly differs between set A and B, and for those bins specifically use the ratio as a weighting factor. In any event, this would be one of several factors I would use for weighting -- essentially, I'm looking for features beyond the total number of marks that make it possible to distinguish additional members of set A. I should note that I have held out a large number of independent samples, to validate whatever weighting procedure I come up with.
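A rough sketch of the binned version, again in R and on simulated positions (the coordinate system, the 10%-wide bins, the pseudocount of 0.5 and the 0.05 cutoff are all assumptions chosen for illustration). Each bin gets a Fisher's exact test comparing "marks in this bin vs. elsewhere" between the two sets, the p-values are adjusted for the number of bins, and the frequency ratio is kept as a weight only where the adjusted p-value is small.

```r
# Simulated mark positions (assumption), percent of gene length with flanks
set.seed(2)
posA <- pmin(pmax(rnorm(500, mean = 80, sd = 30), -20), 120)
posB <- runif(5000, min = -20, max = 120)

# Identical bins for both sets
breaks <- seq(-20, 120, by = 10)
cntA <- table(cut(posA, breaks, include.lowest = TRUE))
cntB <- table(cut(posB, breaks, include.lowest = TRUE))

# Per-bin 2x2 test: (in bin, not in bin) x (set A, set B)
pvals <- mapply(function(a, b) {
  tab <- matrix(c(a, sum(cntA) - a, b, sum(cntB) - b), nrow = 2)
  fisher.test(tab)$p.value
}, as.integer(cntA), as.integer(cntB))

# Benjamini-Hochberg adjustment across bins
padj <- p.adjust(pvals, method = "BH")

# Ratio of relative frequencies, with a pseudocount to avoid division by zero
relA <- (cntA + 0.5) / sum(cntA + 0.5)
relB <- (cntB + 0.5) / sum(cntB + 0.5)
bin_weight <- as.numeric(relA / relB)

# Neutral weight of 1 where the bins do not differ significantly
bin_weight[padj >= 0.05] <- 1
names(bin_weight) <- names(cntA)
bin_weight
```

Note that this treats every mark as an independent observation; if marks cluster within genes, the per-bin p-values will be too optimistic.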

I would appreciate any suggestions for statistically-valid methods to leverage the different distributions that I've identified. I guess this is a two-part question, the first of which is probably easiest to answer: (1) what is the best way to get useful weights from this comparison of feature density, and (2) what is the best way to use such marks? The simplest, and certainly dumbest, thing to do would be to scale my raw counts by a weighting factor, add them up for each gene and round to the nearest integer, and then perform a chi-squared or similar analysis (which is what I originally did, with raw counts, to identify the set A genes). I assume any statistician would consider this a basis for justifiable homicide, though.

distributions weighting genomics • 465 views

updated 10 months ago by i.sudbery 19k • written 10 months ago by Charles Murtaugh ▴ 50

score 0 · Answer 1 · 2023-06-14

You'll probably get better answers on Cross Validated than you will here, although you'd have to explain a bit more about what these distributions represent. But I'll give you what thoughts I have.

Firstly, if the biggest difference between set A and set B is the overall frequency of the mark in question, rather than the normalised density, then why not use that to do the classification? It seems like it would be a stronger, clearer signal.

Secondly, I'd be careful about referring to those densities as "distributions". I'm not 100% sure, but my feeling is that just because they are densities does not mean they fulfil the other criteria to be a distribution.

Finally, if what you are interested in is classification, rather than inference, then you should probably not focus on the statistical properties, but instead just build a classifier. Take a look at the caret package if you are working in R. I'd start with something simple, like logistic regression, and then perhaps try linear discriminant analysis or a random forest.
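To make the "start simple" suggestion concrete, here is a minimal logistic regression sketch using base R's glm() rather than caret, so it runs without extra packages. The data are simulated and the two per-gene features (n_marks, the total mark count, and frac_3prime, the fraction of marks near the 3' end) are hypothetical stand-ins for whatever features you actually compute.

```r
# Simulated per-gene table (assumption): label plus two illustrative features
set.seed(3)
n    <- 400
is_A <- rbinom(n, size = 1, prob = 0.3)          # 1 = set A, 0 = set B
genes <- data.frame(
  is_A        = is_A,
  n_marks     = rpois(n, lambda = ifelse(is_A == 1, 6, 3)),
  frac_3prime = rbeta(n, shape1 = ifelse(is_A == 1, 4, 2), shape2 = 2)
)

# Logistic regression: the log-odds of set A membership are modelled as a
# linear function of the features
fit <- glm(is_A ~ n_marks + frac_3prime, data = genes, family = binomial)
summary(fit)$coefficients

# Predicted probability of set A membership for each gene
p_hat <- predict(fit, newdata = genes, type = "response")
head(p_hat)
```

The fitted coefficients play the role of the weights asked about in the question: the model estimates how much each feature should count, instead of the weights being fixed by hand beforehand.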

Make sure you do things properly: separate off sets of training and validation genes, use cross-validation, etc.
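As an illustration of what that could look like, here is a sketch of 5-fold cross-validation in base R on simulated data (the features, the fold count and the 0.5 probability cutoff are assumptions for the example). Each gene is predicted by a model that never saw it during fitting.

```r
# Simulated per-gene table (assumption): label plus two illustrative features
set.seed(4)
n    <- 400
is_A <- rbinom(n, size = 1, prob = 0.3)
genes <- data.frame(
  is_A        = is_A,
  n_marks     = rpois(n, lambda = ifelse(is_A == 1, 6, 3)),
  frac_3prime = rbeta(n, shape1 = ifelse(is_A == 1, 4, 2), shape2 = 2)
)

# Randomly assign each gene to one of k folds of (nearly) equal size
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, predict the held-out fold, repeat for every fold
cv_pred <- numeric(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  fit <- glm(is_A ~ n_marks + frac_3prime,
             data = genes[!held_out, ], family = binomial)
  cv_pred[held_out] <- predict(fit, newdata = genes[held_out, ],
                               type = "response")
}

# Cross-validated accuracy at a 0.5 probability cutoff
cv_accuracy <- mean((cv_pred > 0.5) == (genes$is_A == 1))
cv_accuracy
```

If position weights are estimated from the data first, that estimation belongs inside the loop too, using only the training folds; otherwise information from the held-out genes leaks into the model.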