I have some transposons identified by their terminal DNA sequences, let's say 10 nt each.
These were identified by sequence similarity to a set of pre-defined terminal DNA sequences, so discovery is based on a match to a profile / HMM / PSSM, if you will.
Note that the terminals are not always exactly 10 nt long; their lengths vary a little. Also, the length of the element, i.e. the separation between the two terminals, is quite variable. I mention these details to give a more complete picture of the problem, which is:
I want to calculate a statistic that conveys how non-random the combination of these two terminal DNA sequences is, given the whole genome sequence, the terminals' lengths and composition, the distance separating them, etc.
For example, if transposon #1 is flanked by terminal DNA sequences that occur far more often across the genome than those flanking transposon #2, I would expect this statistic to convey higher confidence in transposon #2 than in #1.
The question is: what should this statistic be, and how would you advise me to go about calculating it?
Some authors have approached this by randomly shuffling the genome and reporting a false discovery rate based on the ratio of loci discovered in the intact versus the shuffled genome. I am not entirely convinced by this approach. I am not against it, however; I just want advice on a different statistic to describe how likely these elements are to arise by chance or, conversely, how non-random they are.
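To make the shuffling approach concrete, here is a minimal Python sketch of how I understand it (Python only for illustration; it should port readily to Perl). Several things here are my own assumptions rather than anything taken from a particular paper: exact string motifs stand in for the profile / HMM / PSSM match, the shuffle is a simple mononucleotide shuffle that preserves base composition only, and the function names and the `min_sep` / `max_sep` separation window are hypothetical.

```python
import random
from bisect import bisect_left


def find_all(genome, motif):
    """Return start positions of every (possibly overlapping) exact match."""
    starts = []
    pos = genome.find(motif)
    while pos != -1:
        starts.append(pos)
        pos = genome.find(motif, pos + 1)
    return starts


def count_pairs(genome, left, right, min_sep, max_sep):
    """Count left terminals followed by a right terminal whose start lies
    min_sep..max_sep bases after the end of the left terminal."""
    rights = find_all(genome, right)  # already in ascending order
    count = 0
    for start in find_all(genome, left):
        lo = start + len(left) + min_sep
        hi = start + len(left) + max_sep
        i = bisect_left(rights, lo)  # first right terminal at or beyond lo
        if i < len(rights) and rights[i] <= hi:
            count += 1
    return count


def shuffle_genome(genome, rng):
    """Mononucleotide shuffle: keeps base composition, destroys everything else."""
    bases = list(genome)
    rng.shuffle(bases)
    return "".join(bases)


def shuffle_fdr(genome, left, right, min_sep, max_sep, n_shuffles=100, seed=0):
    """Estimate FDR as (mean loci in shuffled genomes) / (loci in intact genome)."""
    rng = random.Random(seed)
    observed = count_pairs(genome, left, right, min_sep, max_sep)
    null_counts = [
        count_pairs(shuffle_genome(genome, rng), left, right, min_sep, max_sep)
        for _ in range(n_shuffles)
    ]
    mean_null = sum(null_counts) / n_shuffles
    # FDR is undefined when nothing was found in the intact genome.
    fdr = min(1.0, mean_null / observed) if observed else None
    return observed, mean_null, fdr
```

Calling `shuffle_fdr(genome, left, right, min_sep, max_sep)` returns the observed count, the mean count over the shuffled genomes, and their ratio. Note that this gives one genome-wide number, not a per-element score, which is part of why I am looking for something else.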
I am comfortable coding in Perl and have some experience with R. If there are multiple options for calculating and reporting a suitable statistic, please keep my skill set in mind. Thanks, folks!