We ran a random comparison of 50 transcripts for each of two methods: (1) random selection of each letter from A, G, C, T, giving roughly a 25% distribution per letter; (2) reordering (scrambling) the letters of the sequence. Bottom line: random selection is significantly different, by a factor of 2x, while the reordered letters are close enough that we see no distinction. We did some further analysis of the symmetry for a men1 transcript, T702. We found long interconnected symmetries, which we identify below as Group1-9. Between adjacent groups there is a gap of 1 or 2 non-symmetrical letters. We are investigating further.
Group1 [firstSequenceStart=0, firstSequenceEnd=83, secondSequenceStart=597, secondSequenceEnd=680]
[[84, 85], [595, 596]]
Group2 [firstSequenceStart=79, firstSequenceEnd=111, secondSequenceStart=569, secondSequenceEnd=601]
[[112], [568]]
Group3 [firstSequenceStart=106, firstSequenceEnd=143, secondSequenceStart=537, secondSequenceEnd=574]
[[144], [536]]
Group4 [firstSequenceStart=138, firstSequenceEnd=231, secondSequenceStart=449, secondSequenceEnd=542]
[[232], [448]]
Group5 [firstSequenceStart=226, firstSequenceEnd=247, secondSequenceStart=433, secondSequenceEnd=454]
[[248], [432]]
Group6 [firstSequenceStart=242, firstSequenceEnd=260, secondSequenceStart=420, secondSequenceEnd=438]
[[261], [419]]
Group7 [firstSequenceStart=255, firstSequenceEnd=269, secondSequenceStart=411, secondSequenceEnd=425]
[[270], [410]]
Group8 [firstSequenceStart=264, firstSequenceEnd=273, secondSequenceStart=407, secondSequenceEnd=416]
[[274], [406]]
Group9 [firstSequenceStart=268, firstSequenceEnd=411, secondSequenceStart=269, secondSequenceEnd=412]
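For readers who want to reproduce the two randomization methods mentioned at the top of the post, here is a minimal Python sketch. It is not the code behind the results above; the function names and the placeholder sequence are hypothetical. The point it illustrates is that method 1 draws every base independently, while method 2 keeps the exact letter counts of the original sequence.

```python
import random

def random_sequence(length, rng):
    """Method 1: draw each base independently and uniformly from A, G, C, T."""
    return "".join(rng.choice("AGCT") for _ in range(length))

def scrambled_sequence(seq, rng):
    """Method 2: reorder the letters of seq, so base composition is preserved."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(0)  # fixed seed so the example is reproducible
example = "ATGGGGCTGAAGGCCGCCCAGAAG"  # hypothetical placeholder, not a real transcript
uniform = random_sequence(len(example), rng)
scrambled = scrambled_sequence(example, rng)
```

The scrambled sequence always has the same number of each letter as the input; the uniformly drawn one generally does not.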
Umm, OK, so this seems to have something to do with kmers of various sizes or something...but it's difficult to parse out exactly what. Mostly, this looks like gibberish.
These are like kmers, but of variable length, starting from 7-mers. Assuming a sequence of length L=400, we compute the kmers of length 399, 398, 397, ..., 7. Then, for every kmer, we query it in all kmers and count recurrence.
I tried, before and after drinking coffee, but I can't really see what the message is behind your blog post. Can you explain?
Coffee? You'll need scotch for this! These are like kmers, but of variable length, starting from 7-mers. Assuming a sequence of length L=400, we compute the kmers of length 399, 398, 397, ..., 7. Then, for every kmer, we query it in all kmers and count recurrence. We found that around 20% of kmer instances have a recurrence count identical to that of one other kmer. You can fiddle with this symmetry here - http://www.codondex.com/analysis/iscore/demo - Also, we count recurrence in bigger length and ignore length kmers, and developed this ranking algorithm, 'iScore' - [(ignore length kmer recurrence - bigger length kmer recurrence) / Length]. Despite length normalization, kmer recurrence retains length ordering for over 95% of the kmers of the sequence. So far we have found that the ranking pinpoints specifics in existing research: miRNAs, LINE/SINE elements, splicing junctions or other...
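A rough Python sketch of the procedure as described above, for anyone trying to follow along. It is an interpretation, not the codondex implementation. Assumptions: "recurrence" is read as the number of enumerated kmers that contain a given kmer as a substring; "bigger length" restricts that to strictly longer kmers; "Length" in the iScore formula is taken to be the kmer's own length. All function names are hypothetical.

```python
from collections import defaultdict

def enumerate_kmers(seq, min_k=7):
    """Every substring of length min_k .. len(seq) - 1, one entry per start position."""
    n = len(seq)
    return [seq[i:i + k] for k in range(min_k, n) for i in range(n - k + 1)]

def recurrence(kmer, kmers, longer_only=False):
    """Count enumerated kmers containing `kmer`; optionally only strictly longer ones."""
    return sum(
        1
        for other in kmers
        if (len(other) > len(kmer) or not longer_only) and kmer in other
    )

def iscore(kmer, kmers):
    """ASSUMPTION: 'Length' in the formula is the kmer's own length."""
    ignore_length = recurrence(kmer, kmers)
    bigger_length = recurrence(kmer, kmers, longer_only=True)
    return (ignore_length - bigger_length) / len(kmer)

def identical_recurrence_groups(kmers):
    """Group distinct kmers that share the same bigger-length recurrence count."""
    groups = defaultdict(set)
    for kmer in set(kmers):
        groups[recurrence(kmer, kmers, longer_only=True)].add(kmer)
    return {count: members for count, members in groups.items() if len(members) > 1}

toy = "ACGTACGTTGCA"  # hypothetical toy sequence, not real data
toy_kmers = enumerate_kmers(toy, min_k=3)  # min_k lowered so the toy case is non-trivial
shared = identical_recurrence_groups(toy_kmers)  # recurrence count -> kmers sharing it
```

This brute-force version compares every kmer against every other one, so it is only practical for short toy sequences; a real transcript would need an index or suffix structure.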
It sounds rather like you've observed one of the many many fingerprints of evolution.
I take this to mean that you find kmers matching repetitive sequences? Besides those, does this 'recurrence' show statistical significance (I'm not sure how this is related to symmetry)? In other words, is the recurrence of some non-repetitive 7-mers more than expected by chance?
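To make "expected by chance" concrete, here is one simple way to frame it in Python. This is only a sketch under stated assumptions: the analytic expectation assumes independent, uniformly distributed bases, and the empirical p-value uses letter shuffling as the null model. The function names are hypothetical and nothing here comes from the original analysis.

```python
import random

def count_occurrences(seq, kmer):
    """Count (possibly overlapping) occurrences of kmer in seq."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)

def expected_count_uniform(seq_len, k):
    """Expected occurrences of one fixed k-mer under i.i.d. uniform bases."""
    return (seq_len - k + 1) / 4 ** k

def shuffle_p_value(seq, kmer, n_shuffles=1000, seed=0):
    """Add-one estimate of P(count in a shuffled sequence >= observed count)."""
    rng = random.Random(seed)
    observed = count_occurrences(seq, kmer)
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if count_occurrences("".join(letters), kmer) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

For a 400-letter sequence, `expected_count_uniform(400, 7)` is 394 / 16384, so any single 7-mer is expected to be absent most of the time under the uniform model.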
kmers are repetitive only by recurrence, not by sequence text. We will run some random assessments and get back to you, but I doubt this will be chance. Look at col B (these are k-mer IDs) and col L (their recurrence count). These L=10 sequences each recur 913185 times in bigger length kmers of this transcript.
I think I still don't get it. The transcript ENST00000576024 is 2724 bp long, so how do you fit 913185x CTGGAGTGCA in there? To be more precise, the kmer 'CTGGAGTGCA' doesn't occur a single time in the sequence of that transcript.
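The arithmetic behind this objection can be checked in a few lines of Python. The helper names are hypothetical; the numbers 2724 and 10 are the transcript length and kmer length quoted above.

```python
def max_occurrences(seq_len, k):
    """Number of start positions available to a k-mer in a sequence of seq_len letters."""
    return max(seq_len - k + 1, 0)

def find_positions(seq, kmer):
    """0-based start positions of every (possibly overlapping) occurrence of kmer."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions

print(max_occurrences(2724, 10))  # 2715
```

A 10-mer therefore has at most 2715 possible start positions in a 2724-letter sequence, and `find_positions(sequence, "CTGGAGTGCA")` on the actual sequence would show whether it occurs at all.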
Presumably this is the recurrence in all transcripts/features/whatever that they looked at. But yeah, the presentation here isn't terribly clear.
Our focus has been on the intron; you can access it here http://www.codondex.com/sequences/ENST00000576024 - you will see these two occurrences of CTGGAGTGCA and CTGCCTCAGC in the base sequence. Our pattern/amplification produced 1,408,680 k-mers (over 150 million letters) from the base sequence. Notwithstanding the letter differences, the two k-mers in col B, #97022 and #127262, each recurred 913185x in the query of all bigger length k-mers. As in this example, identical recurrence for different k-mer pairs occurs in ~15-20% of k-mers for multiple gene transcripts. Hope this helps to clarify.